A sitemap is just a document that tells you what percentage of the pages you THINK you published are actually indexed, that’s all. The sitemap does very little. The confusion comes from building Google up into more than what it is – as Gary Illyes keeps pointing out, it’s incredibly basic.
SEO Myth One: Sitemaps “Tell” Google what to do
They absolutely do not. What they do is tell any bot where pages exist. If a bot has accessed a sitemap before, it can use that date to see if a new page has been added. It can choose to trust the lastmod value or not – this is literally a binary decision (yes, you can Google it). It may or may not index the page: whether it indexes is based on whether you have authority. How often it reads the sitemap: based on authority. Whether it trusts lastmod: based on how often the sitemap has lied – in other words, how often the date has changed while the character difference between the two versions barely changed at all.
It’s that BASIC!
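To make that binary decision concrete, here is a minimal sketch (my own illustration under assumed thresholds, not Google’s actual logic) of how a crawler might decide whether to keep trusting a sitemap’s lastmod values: if the date keeps moving while the content barely does, trust flips to false.

```python
from difflib import SequenceMatcher

def lastmod_still_trusted(history, min_change_ratio=0.02, max_lies=1):
    """Hypothetical heuristic: distrust lastmod once a sitemap has "lied"
    too often, i.e. bumped the date while the page barely changed.

    history: list of (lastmod_changed: bool, old_html: str, new_html: str)
    The thresholds are invented for illustration.
    """
    lies = 0
    for lastmod_changed, old_html, new_html in history:
        if not lastmod_changed:
            continue
        # How different are the two versions, character-wise?
        similarity = SequenceMatcher(None, old_html, new_html).ratio()
        if (1.0 - similarity) < min_change_ratio:
            lies += 1          # date moved, content effectively did not
    return lies <= max_lies    # binary outcome: trust or don't

# Example: the date was bumped twice but the HTML is nearly identical.
page_v1 = "<html><body>Original article text.</body></html>"
page_v2 = "<html><body>Original article text..</body></html>"
print(lastmod_still_trusted([(True, page_v1, page_v2), (True, page_v1, page_v2)]))  # False
```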
SEO Myth Two: Google follows your sitemap, crawl budget, crawler engineering
So Google does have a crawl budget – based on site authority (frequency, size) and how many pages you have. But it doesn’t “spider” pages – this is an anthropomorphism that people have created. Crawler budgets are another myth – one pushed especially by European SEOs, where budgets are treated as a bigger priority but are not grounded in any truth.
What does happen is partial crawls – Google grabs 10%, 25%, 67% of your HTML document, via different bots with different operations, or a general bot that updates certain systems (snippets, Discover, news) with whatever partial or complete segments it retrieved. This is why your page title updates but your meta description doesn’t.
Googlebots actually operate from long crawl lists – vast sheets of URLs with thousands of entries constantly being added from different components/operations and removed as they are crawled. These lists are updated by bots, often in partial states: in other words, bots have so little time that they frequently grab just a percentage of a document and update the different systems with whatever they had time to grab.
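A rough mental model of those crawl lists, as a Python sketch (entirely illustrative – the real pipeline is nothing like this simple): URLs get appended by different components, popped by bots, and each bot may only fetch and store part of the document.

```python
import random
from collections import deque

# Illustrative crawl list: URLs queued by different components, consumed by bots.
crawl_list = deque([
    ("https://example.com/new-post", "sitemap-listener"),
    ("https://example.com/old-page", "refresh"),
    ("https://example.com/found-link", "link-discovery"),
])

index_store = {}  # whatever each bot managed to grab, keyed by URL

def fetch(url):
    # Stand-in for an HTTP fetch; returns a fake HTML document.
    return f"<html><head><title>{url}</title></head><body>{'x' * 1000}</body></html>"

while crawl_list:
    url, source = crawl_list.popleft()
    html = fetch(url)
    # The bot only has time for a partial grab: 10%, 25%, 67% or all of the document.
    fraction = random.choice([0.10, 0.25, 0.67, 1.0])
    partial = html[: int(len(html) * fraction)]
    # Systems get updated with whatever was retrieved, even if incomplete.
    # Note the <title> near the top survives a partial crawl far more often
    # than a meta description or body text further down.
    index_store[url] = {"source": source, "fraction": fraction, "content": partial}

for url, record in index_store.items():
    print(url, record["source"], f"{record['fraction']:.0%} of document stored")
```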
Another SEO myth – the page “spider” and the whole-document render myth
Judging by how people characterize or talk about crawling, a large faction of SEOs – new and experienced alike – believes that Google downloads a whole document, parses it in one go, reads it like a person in a “bot browser” and then “clicks around the page like a virtual person.” This is absolute nonsense and just a common human trait called anthropomorphism. Google doesn’t. Google might render parts of a site that require scripts in order to grab text, but it will start with whatever text and links it finds in a page and then update its crawl lists.
Some Googlebots are XML listeners that dispatch a crawler when a new page is added – this is how Caffeine works (think CNN, news sites, etc.). Pages with organic traffic are auto-updated.
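A Caffeine-style listener can be pictured as a loop that re-reads a sitemap and dispatches crawls only for URLs it has not seen before. This is a hedged sketch of the concept, not Google’s implementation; the sitemap URL and polling interval are made up.

```python
import time
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://example.com/sitemap.xml"   # hypothetical sitemap
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def read_sitemap(url):
    """Return the set of <loc> URLs currently listed in the sitemap."""
    with urllib.request.urlopen(url) as resp:
        tree = ET.parse(resp)
    return {loc.text.strip() for loc in tree.iter(f"{NS}loc")}

def listen(poll_seconds=300):
    """Poll the sitemap and 'send a crawler' for any URL not seen before."""
    seen = read_sitemap(SITEMAP_URL)
    while True:
        time.sleep(poll_seconds)
        current = read_sitemap(SITEMAP_URL)
        for new_url in current - seen:
            # In the article's terms: the listener dispatches a crawl
            # as soon as a new page shows up in the feed.
            print("dispatch crawl for", new_url)
        seen = current

# listen()  # would poll the (hypothetical) sitemap every 5 minutes
```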
How Google Crawling works
Google keeps lists of URLs to crawl – live pages it already serves, URLs from listening agents, or newly discovered URLs – and that’s where most crawling comes from.
Google doesn’t sit there and go sitemap to sitemap and crawl all of the pages – for the most part, I’d say 90% of sitemaps don’t have a listener because they don’t get traffic.
Instead, URLs are crawled in triage – the top 1% of the web is crawled every hour (source: Matt Cutts), then the middle of the web, the next 9%, and then the base of the pyramid, the remaining 90%, on an increasingly rare, almost-never basis. This 90% is page-level, not just domain-level.
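The 1% / 9% / 90% triage can be expressed as a simple schedule keyed on a page-level authority score. The thresholds and intervals below are illustrative numbers only, not published Google values.

```python
from datetime import timedelta

def recrawl_interval(authority_percentile):
    """Illustrative triage schedule based on a page's authority percentile.

    The cut-offs and intervals are assumptions for the sake of the example.
    """
    if authority_percentile >= 99:        # top 1% of the web
        return timedelta(hours=1)
    if authority_percentile >= 90:        # the next 9%
        return timedelta(days=1)
    # the remaining 90%: recrawled on an "almost never" basis
    return timedelta(days=60)

for page, pct in [("cnn.com/politics", 99.5), ("mid-blog.com/post", 93), ("tiny-site.com/about", 40)]:
    print(page, "->", recrawl_interval(pct))
```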
Crawlers and Indexing
Crawlers aren’t that separate from indexing, because crawlers also record where a link was found and what kind of link it is – 301, external, internal, etc. Indexing and crawling happen at the same time, and together this is “search” – because Google doesn’t search anything by the time the user gets to it. It just dumps an index, which is a list of URLs it found, indexed and sorted WHEN each was found. The position is based on how much authority the page has for that index, and this is updated with every incoming link found as the bots and indexers parse whole or partial documents.
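One way to picture “indexing and crawling happening together”: the index is just a per-term list of URLs whose positions get re-sorted as authority flows in from newly found links. The structure and scoring below are a toy model of that idea, not how Google actually stores or scores its index.

```python
from collections import defaultdict

# Toy model: term -> {url: authority}, re-sorted whenever a new link is found.
index = defaultdict(dict)
authority = defaultdict(float)

def record_crawl(url, terms, outgoing_links):
    """Called as a document (or partial document) is parsed."""
    for term in terms:
        index[term][url] = authority[url]
    for target, link_kind in outgoing_links:
        # Every incoming link found during crawling bumps the target's
        # authority; internal links count for less in this toy model.
        authority[target] += 1.0 if link_kind == "external" else 0.2
        for term, postings in index.items():
            if target in postings:
                postings[target] = authority[target]

def serve(term):
    """'Search' at query time is just dumping the pre-sorted list."""
    return sorted(index[term], key=index[term].get, reverse=True)

record_crawl("blog.example/post", ["sitemaps"], [])
record_crawl("cnn.com/article", ["news"], [("blog.example/post", "external")])
print(serve("sitemaps"))
```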
How to test this in GSC
Inside your Google Search Console (GSC), go to Pages and look at your page index history – you can see the oldest pages and how long it’s been since they were crawled.
Then go to your Performance Overview and look at the top 5-10 pages with the most clicks in the last week. Inspect each page and see when they were indexed. Pages with more than 10 clicks a week are likely to be crawled more frequently and automatically.
Everything else is just ignored.
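If you export the two GSC reports (page coverage and Performance) as CSVs, a few lines of Python can cross-reference them and flag which of your top-clicked pages was crawled recently. The file names and column headers below are assumptions about how you export and tidy the data, not fixed GSC export formats.

```python
import csv
from datetime import datetime, timedelta

# Assumed columns after you export and tidy the two GSC reports:
#   performance.csv: url, clicks_last_week
#   coverage.csv:    url, last_crawled (YYYY-MM-DD)
clicks = {}
with open("performance.csv", newline="") as f:
    for row in csv.DictReader(f):
        clicks[row["url"]] = int(row["clicks_last_week"])

with open("coverage.csv", newline="") as f:
    for row in csv.DictReader(f):
        url = row["url"]
        if clicks.get(url, 0) > 10:  # the >10 clicks/week threshold from the article
            last_crawled = datetime.strptime(row["last_crawled"], "%Y-%m-%d")
            fresh = datetime.now() - last_crawled < timedelta(days=7)
            print(url, "crawled", row["last_crawled"], "- fresh" if fresh else "- stale")
```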
What XML Sitemaps really do
Sitemaps are just a way for your CMS to issue you a receipt of what you published and for Google to let you know precisely which of those pages it indexed. They are not a control list – Google’s list of URLs comes from:
- Chrome URLs accessed (not from user behavior, just from users visiting new URLs)
- Discovery via indexing other pages
  - This is why, if your blog is crawled via CNN, you will go to “page 1”
- Discovery via internal links
- Via site refresh
- Via a listener or repeat crawl (like Caffeine) if you are on Google News or Discover
- Via XML sitemap triage
Sitemaps don’t actually account for the major part of where Google spends its crawl time, which is new pages.
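Putting the discovery sources above into one picture: each newly seen URL arrives tagged with where it was found, and that tag decides how quickly it gets crawled. The priority numbers below are invented purely for illustration; only the ordering reflects the article’s point that sitemap triage sits at the back of the queue.

```python
import heapq

# Lower number = crawled sooner; values are invented for illustration only.
DISCOVERY_PRIORITY = {
    "news-listener": 0,       # Caffeine-style listener (Google News / Discover)
    "external-link": 1,       # found while indexing another site (e.g. CNN)
    "chrome-visit": 2,        # a URL someone opened in Chrome
    "internal-link": 3,
    "site-refresh": 4,
    "xml-sitemap": 5,         # sitemap triage comes last, not first
}

frontier = []  # (priority, url) min-heap of URLs waiting to be crawled

def discover(url, source):
    heapq.heappush(frontier, (DISCOVERY_PRIORITY[source], url))

discover("https://blog.example/new-post", "xml-sitemap")
discover("https://blog.example/new-post-2", "external-link")
discover("https://news.example/breaking", "news-listener")

while frontier:
    _, url = heapq.heappop(frontier)
    print("crawl next:", url)
```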
Google Crawler Triage
- New content from highly authoritative sites
- Function-specific crawlers
  - Caffeine
  - News, Discover, Products
- Refresh
  - Hourly – highly authoritative domains, LastMod trusted = true
  - Daily – Tier 2
  - Monthly – Tier 3
  - Daily and monthly mean every other day or month, not every day