A common misconception has emerged alongside the rise of AI-powered search: that Large Language Models (LLMs) actively seek out or prefer web pages containing structured data, or schema. This belief often stems from a misunderstanding of how these complex systems operate. In reality, LLMs are the final step in a process that begins with a search engine’s established retrieval methods. They do not have a “preference” for schema because the systems that feed them information do not use it as a primary ranking signal. The mechanism at play is one of query fan-out and synthesis, where the LLM works with the results it is given, with or without schema.
The perplexing lack of schema documentation on Perplexity
If schema mattered to AI search, you would expect the most prominent AI search engine to document it. Yet there is no content anywhere on Perplexity’s entire site about schema in the documents it retrieves, which says a great deal about how this myth has taken hold.
The core of how models like Google’s Gemini generate AI Overviews lies in a technique called “query fan-out.” When a user enters a complex query, the system does not perform a single search. Instead, it intelligently breaks the user’s intent down into multiple, more specific sub-queries. For example, a search for “best family-friendly vacation spots in Florida with a pool” might be fanned out into separate, simultaneous searches for “top family resorts in Florida,” “Florida hotels with water parks,” and “reviews of kid-friendly activities in Orlando.” The system then gathers the top-ranking pages for each of these distinct queries. This collection of pages becomes the source material for the LLM.
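To make the mechanics concrete, here is a minimal sketch of query fan-out. The decomposition is hard-coded for the article’s example, and the `search_index` callable is a stand-in for the search engine’s ordinary ranking system, not Google’s actual API:

```python
# Illustrative sketch of query fan-out; names and structure are assumptions.
from concurrent.futures import ThreadPoolExecutor

def fan_out(user_query: str) -> list[str]:
    """Decompose a broad query into narrower sub-queries.
    Hard-coded here; a real system would use an intent model."""
    if user_query == "best family-friendly vacation spots in Florida with a pool":
        return [
            "top family resorts in Florida",
            "Florida hotels with water parks",
            "reviews of kid-friendly activities in Orlando",
        ]
    return [user_query]

def retrieve_sources(user_query: str, search_index, top_k: int = 5) -> list[dict]:
    """Run every sub-query through the ordinary ranking system in
    parallel and pool the top results. Only after this step does an
    LLM see anything."""
    sub_queries = fan_out(user_query)
    with ThreadPoolExecutor() as pool:
        result_lists = pool.map(lambda q: search_index(q)[:top_k], sub_queries)
    # The pooled set of top-ranking pages becomes the LLM's source material.
    return [page for results in result_lists for page in results]
```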
Google says no to schema as a ranking signal
This is where Google’s long-standing guidance on structured data becomes critical. For years, Google has been clear and consistent: schema markup does not directly improve your search ranking. While it can make a page eligible for rich snippets and enhance its appearance in search results, it does not give the page an inherent boost in the ranking algorithm. Since the query fan-out process relies on this very ranking system to select which pages to send to the LLM, there is no systemic preference for pages with schema. The search engine retrieves the most relevant and authoritative pages for the sub-queries based on hundreds of other signals; whether those pages contain JSON-LD or Microdata is not a determining factor in their selection for the fan-out.
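A toy scorer makes the point visible. This is not Google’s algorithm, and the two signals below stand in for the hundreds of real ones; the detail to notice is that the page carries a `has_schema` flag that the scoring function never reads:

```python
# Illustrative toy ranking, not Google's algorithm.
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    relevance: float   # stand-in for hundreds of real relevance signals
    authority: float
    has_schema: bool   # present on the page, never consulted below

def score(page: Page) -> float:
    # Schema never enters the calculation: a page with JSON-LD and a
    # page without it compete on exactly the same terms.
    return 0.7 * page.relevance + 0.3 * page.authority

pages = [
    Page("https://example.com/a", relevance=0.92, authority=0.80, has_schema=False),
    Page("https://example.com/b", relevance=0.88, authority=0.85, has_schema=True),
]
ranked = sorted(pages, key=score, reverse=True)  # markup plays no part
```

Because the fan-out draws from rankings like this, a page’s markup cannot earn it a seat in the LLM’s source material.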
The LLM’s role begins only after this retrieval is complete. It is a synthesis engine, not a selection engine. It is handed a package of content from the top-ranking pages and tasked with a singular goal: construct a comprehensive, coherent answer to the original user query using only the information provided. The model processes the raw text, images, and other content from these pages, identifying key facts, summarizing arguments, and weaving them together. Its remarkable ability is in understanding and manipulating natural language, not in parsing the underlying HTML structure of its source material. A well-written article on a page without any schema is just as valuable to the LLM as a page with perfectly implemented structured data, because the model is processing the final, rendered content.
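A short sketch shows why markup is invisible at the synthesis stage. JSON-LD lives in `<script type="application/ld+json">` tags, and extracting a page’s visible text discards scripts entirely, so the model is handed the same prose whether or not schema was present. The prompt wording here is invented for illustration:

```python
# Sketch of the synthesis hand-off; prompt format is an assumption.
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def visible_text(html: str) -> str:
    """Return the rendered, human-readable text of a page."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):  # JSON-LD is a <script> tag
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)

def build_synthesis_prompt(user_query: str, source_pages: list[str]) -> str:
    """Package the retrieved pages' rendered text for the LLM.
    Schema markup never survives into this payload."""
    sources = "\n\n".join(visible_text(html) for html in source_pages)
    return (
        "Answer the question using only the sources below.\n"
        f"Question: {user_query}\n\nSources:\n{sources}"
    )
```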
Ultimately, the idea that LLMs hunt for schema is a misattribution of intelligence. The intelligence lies in the entire system: the search engine’s ability to deconstruct a query and rank pages, and the LLM’s ability to synthesize language. Because the ranking component does not prioritize schema, the LLM is never in a position to show a preference. It simply, and powerfully, makes sense of the information it is given.