Send a canonical link header instead of a noindex header.
Sending a canonical header likely has the same crawl-budget advantage as sending a noindex header - without the SEO downside of noindex, which excludes URLs that might have backlinks.
> If you can configure your server, you can use a rel="canonical" HTTP header (rather than an HTML tag) to indicate the canonical URL for a document supported by Search, including non-HTML documents such as PDF files.
We can configure our server.
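For concreteness, here is a minimal sketch of what sending that header could look like, using Python's stdlib http.server; the host, port, and page body are placeholders, and a real deployment would set this in the web server or app instead:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

CANONICAL_BASE = "https://example.com"  # hypothetical canonical host

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        # The same signal as <link rel="canonical" href="...">, delivered
        # as an HTTP header instead of an HTML tag:
        self.send_header(
            "Link", f'<{CANONICAL_BASE}{self.path}>; rel="canonical"'
        )
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html><body>page body</body></html>")

if __name__ == "__main__":
    HTTPServer(("", 8000), Handler).serve_forever()
```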
Does *use a rel="canonical" HTTP header (rather than an HTML tag)* emphasise a preference for the HTTP header solution?
Googlebot handles noindex headers very elegantly. Google's advice is to leave as many routes as possible open and to use headers for high-fidelity rules about what gets indexed.
Maybe Google handles canonical link headers just as elegantly as noindex headers.
This is rolling back the recently introduced site setting into what's basically a noop (moving what we send as a head tag into a header; no semantic change). After the change we send:

- header canonical
- html link-tag canonical (might be ignored - but anyway the same as the header)
The whole idea behind this:
Set canonical as an HTTP header to get the same benefit as the noindex HTTP header - namely faster crawling.
This might make noindex, with its uncertain implications, obsolete (a before/after sketch is below).
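As a hedged before/after sketch - the helper names and the topic URL are made up, and the real change would live in the server configuration:

```python
def headers_before() -> dict[str, str]:
    # Current approach: keep the page out of the index entirely.
    return {"X-Robots-Tag": "noindex"}

def headers_after(canonical_url: str) -> dict[str, str]:
    # Proposal: drop noindex and point crawlers at the canonical URL instead.
    return {"Link": f'<{canonical_url}>; rel="canonical"'}

print(headers_before())
print(headers_after("https://example.com/t/some-topic/123"))
```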
Another point on noindex vs. canonical:
noindex is more than a very strong signal not to put the page into the search index.
But with noindex, the page content is still processed by Googlebot to extract links (there is the extra option nofollow to disable this).
canonical is a strong signal that the content to be crawled lives at some other, canonical URL.
If Googlebot decides to accept this signal for a page, there is a good chance it does not process the page content at all and only processes the canonical URL (a sketch of this behavior follows).
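Here is a hedged sketch of that crawler-side behavior; it encodes my reading of the claims above, not Google's actual pipeline, and every function in it is hypothetical:

```python
import re

def extract_links(body: str) -> list[str]:
    """Stub: pull hrefs out of the HTML (a real crawler uses a proper parser)."""
    return re.findall(r'href="([^"]+)"', body)

def index_page(url: str, body: str) -> None:
    """Stub standing in for 'put this page into the search index'."""
    print(f"indexing {url} ({len(body)} bytes)")

def crawl(url: str, headers: dict[str, str], body: str) -> list[str]:
    """Return the URLs to crawl next, per the signals discussed above."""
    link = headers.get("Link", "")
    if 'rel="canonical"' in link:
        canonical = link.split(">")[0].lstrip("<")
        if canonical != url:
            # Strong signal: the content lives elsewhere; the body may be
            # skipped entirely and only the canonical URL processed.
            return [canonical]
    robots = headers.get("X-Robots-Tag", "")
    if "noindex" in robots:
        if "nofollow" in robots:
            return []  # not indexed, and links are not extracted either
        return extract_links(body)  # not indexed, but links still followed
    index_page(url, body)  # normal case: index the page and follow its links
    return extract_links(body)

print(crawl(
    "https://example.com/a",
    {"Link": '<https://example.com/canonical>; rel="canonical"'},
    '<a href="https://example.com/b">b</a>',
))
```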
This is a ‘thought experiment’. It is implemented nowhere - and I would never recommend implementing it (a concrete sketch follows the list):
- header noindex
- html meta-tag noindex (instead of: html link-tag canonical)
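Spelled out as a concrete response, purely as a sketch of the thought experiment (placeholder body; again, not a recommendation):

```python
HEADERS = {
    "X-Robots-Tag": "noindex",  # header noindex
    "Content-Type": "text/html",
}
BODY = """<html>
  <head>
    <!-- html meta-tag noindex, in place of the canonical link-tag -->
    <meta name="robots" content="noindex">
  </head>
  <body>page body</body>
</html>"""

for name, value in HEADERS.items():
    print(f"{name}: {value}")
print()
print(BODY)
```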
This change is not a ‘noop’:
Google might handle headers and HTML content at different stages of its processing queues. By sending headers we might skip later queues (e.g. the render queue) and thereby free up crawl budget for more important pages.
Not strongly against this, but it feels so minor. Google is always downloading content these days; I doubt saving an HTML parse is really going to make any material difference.
Lots of other areas need focus first; the microdata is probably the first place that needs TLC.