Big websites operate at a different scale – millions of pages, little unique content, and complex structures. Unlike editorial sites, these pages lack EEAT, backlinks, and stable indexation due to dynamic factors like stock or filters.
For SEOs, it’s a constant juggling act: rigid platforms, slow teams, and limited control – but the same pressure to grow traffic.
When I ran a job aggregator with millions of pages, I’d wake up thinking only of three things: crawl budget, indexation, and traffic.
At that scale, SEO isn’t just a task – it’s a constant pressure.
During the last 7 years, we have analyzed hundreds of billions of log lines for many big websites and found three key components that influence Googlebot behavior. The first of them is internal linking.
Internal linking builds the foundation of any website. It creates the site structure and helps Googlebot navigate all pages, crawl them, index them, and, in the end, rank them in the SERP.
The core metric is how many internal links a page has from the whole website.
Think about it: a website has 100k pages, and in ecommerce every page could carry about 250 links. In total, that's 25 million internal links on one website.
If page A has only one internal link from the whole website, is that page important to you? What if a page has no internal links at all and can be found only in the sitemap? Why should Googlebot crawl it and put it into the index?
Technically, it's a very hard job for Googlebot to understand how valuable such pages are. No unique content, no authors, and no backlinks to those millions of pages.
Referring internal links are one of the key factors here. By managing internal linking, you redistribute 'importance' across the pages of a website.
1/ The first problem I have seen over all these years is a lack of understanding of the actual state of internal linking on a website.
It's a massive amount of data that is very hard to analyze. For example, the table with all the links on a medium-sized website with fewer than a million pages takes about 160GB. If an SEO does not have experience working with databases, the mission is impossible. (We could discuss partial crawls here, but let's leave that for another article.)
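One way to get a picture of the current state without loading everything into memory is to aggregate inbound link counts directly over the crawler's export. Here is a minimal sketch, assuming the crawl edges are stored as Parquet files with hypothetical columns source_url, target_url, and anchor_text, queried with DuckDB:

```python
# Count how many internal links each page receives across the whole site,
# streaming the crawler's edge export instead of loading ~160GB into memory.
# File layout and column names are assumptions of this sketch.
import duckdb

con = duckdb.connect()  # in-memory DuckDB; the Parquet files are scanned lazily

inbound = con.execute("""
    SELECT target_url,
           COUNT(*)                   AS inbound_links,
           COUNT(DISTINCT source_url) AS linking_pages
    FROM read_parquet('crawl/edges/*.parquet')
    GROUP BY target_url
    ORDER BY inbound_links ASC
""").df()

# Pages at the top of this list are under-linked; URLs that appear in the
# sitemap but not in this table at all are orphans.
print(inbound.head(20))
```

The same aggregation works in whatever database or warehouse the crawl data already lives in.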
2/ The second is the crawl budget. If Googlebot does not crawl a page, it cannot be in the Google Index, and obviously, there is no organic traffic.
Usually, less than half of a website has enough internal links. The rest are either orphaned (they can be found only in sitemaps) or have only a few internal links from the whole website. Googlebot simply ignores those pages.
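A rough way to see which URLs actually get crawl budget is to count Googlebot requests in the access logs. Below is a minimal sketch for the Apache/Nginx combined log format; the file path and the simple user-agent match are assumptions, and a production pipeline should also verify Googlebot by IP range or reverse DNS:

```python
# Count Googlebot requests per URL from a combined-format access log.
import re
from collections import Counter

LINE = re.compile(
    r'"(?:GET|POST) (?P<path>\S+) HTTP/[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

hits = Counter()
with open("access.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        m = LINE.search(line)
        if m and "Googlebot" in m.group("ua"):
            hits[m.group("path")] += 1

print(f"{len(hits)} distinct URLs crawled by Googlebot")
print(hits.most_common(10))  # where the crawl budget actually goes
```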
3/ Product pages (PDPs) are especially affected by underlinking. Except for some top products, which are well interlinked, the only way to reach the rest is through pagination. But those pagination pages are often closed from indexation by a robots meta tag or a canonical tag pointing elsewhere.
PDPs may not be your priority. But Google understands the quality of a website as a whole. Product pages bring a significant amount of value to the picture.
Big websites are usually old, with huge tech debt, and many devs working on them. Implementing even simple SEO tasks can easily take months, let alone changes to the existing internal linking.
The key principle of the practical approach is not to touch what already works.
Patch it!
Add an additional interlinking block instead of changing the logic of an existing one; changing existing logic may affect hundreds of thousands of pages.
The second is to move in small steps: implementing new logic for the whole website at once sounds scary. Better to choose a set of 10k pages and start with them. Get the results and move on to more pages.
The third is the data-driven approach. Technical SEO has tons of valuable data: a full crawl of a website brings the site structure, the current internal linking, and content factors. Logs give us the crawl budget: what Googlebot crawls, how often, and what is left on the side. The Google Search Console API shows how pages perform in the SERP: the number of impressions and the positions.
By combining all the available data, we can distinguish strong pages from weak ones and treat them differently.
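As an illustration, here is a minimal sketch of that combination, assuming three hypothetical per-URL exports (crawl, logs, and Search Console) with made-up file and column names:

```python
# Join crawl, log, and Search Console data into one per-URL table and
# split pages into strong and weak candidates. Thresholds are arbitrary
# examples, not recommendations.
import pandas as pd

crawl = pd.read_csv("crawl_pages.csv")    # url, inbound_links
logs = pd.read_csv("googlebot_hits.csv")  # url, googlebot_hits
gsc = pd.read_csv("gsc_performance.csv")  # url, impressions, position

pages = (
    crawl.merge(logs, on="url", how="left")
         .merge(gsc, on="url", how="left")
         .fillna({"googlebot_hits": 0, "impressions": 0})
)

pages["is_strong"] = (pages["googlebot_hits"] > 0) & (pages["impressions"] > 100)
pages["is_weak"] = (pages["inbound_links"] < 5) & (pages["googlebot_hits"] == 0)

print(pages["is_strong"].sum(), "strong pages,", pages["is_weak"].sum(), "weak pages")
```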
Data also opens a new way to measure the impact of tech SEO tasks. On one test set of pages, only 40% were crawled by Googlebot; after implementing a new internal linking scheme, that rose to 70%, which was reflected in indexation changes and organic traffic. Every step can be measured, and you get the chance to make more or less realistic predictions of an SEO task's results.
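Measuring that is straightforward once the log data is in place. A sketch, assuming a hypothetical per-URL hit export and a plain-text list of the test set's URLs:

```python
# Crawl coverage of the test set: what share of its URLs did Googlebot
# request during the measured period? Run before and after the change.
import pandas as pd

hits = pd.read_csv("googlebot_hits.csv")  # url, googlebot_hits
test_urls = {line.strip() for line in open("test_urls.txt") if line.strip()}

crawled = set(hits.loc[hits["googlebot_hits"] > 0, "url"]) & test_urls
print(f"Crawl coverage of the test set: {len(crawled) / len(test_urls):.0%}")
```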
Basically, donor pages are strong pages. They have crawl budget and impressions in Google Search.
Acceptors are weak pages: they may have little or no crawl budget and indexation. But they're valuable, and you want them to be indexed and bring organic traffic from long-tail keywords.
By putting links from donor pages to acceptor pages, we transfer authority (PageRank, link juice, you name it) that boosts them.
The opposite is also possible: linking to a handful of top-performing pages from tens of thousands of low-performing pages to give the top an additional boost.
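Here is a minimal sketch of how such a linking plan could be generated, assuming page dicts with hypothetical 'url' and 'category' fields and a capped number of extra links per donor block:

```python
# Distribute acceptor pages across donor pages of the same category,
# a fixed number of new links per donor block.
from itertools import cycle

LINKS_PER_DONOR_BLOCK = 10  # arbitrary example cap

def build_link_plan(donors, acceptors):
    """donors/acceptors: lists of dicts with 'url' and 'category' keys."""
    plan = []  # (donor_url, acceptor_url) pairs
    by_category = {}
    for acceptor in acceptors:
        by_category.setdefault(acceptor["category"], []).append(acceptor)

    for category, acc_list in by_category.items():
        cat_donors = [d for d in donors if d["category"] == category]
        if not cat_donors:
            continue  # no strong pages to link from in this category
        capacity = len(cat_donors) * LINKS_PER_DONOR_BLOCK
        donor_cycle = cycle(cat_donors)
        # Round-robin keeps every donor block within its cap; leftover
        # acceptors simply wait for the next iteration of the process.
        for acceptor in acc_list[:capacity]:
            plan.append((next(donor_cycle)["url"], acceptor["url"]))
    return plan
```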
Google has understood and taken into account link placement within a page for 20 years, and the same goes for page semantics. It makes no sense, for a user or for Googlebot, to interlink mobile phones with cooking equipment.
Even a few years ago, analyzing page semantics was not an easy task. There have long been technologies for comparing how similar two pages are, like simhash or Levenshtein distance, but their quality was not good enough.
Thanks to modern LLMs, we now have text embeddings, which have brought a huge improvement in understanding page content.
However, big websites again have their specifics: even text embeddings usually struggle to 'understand' a page with a lot of chrome and only a few lines of product description, let alone category pages.
The solution is proper page categorisation before applying text embeddings. It can be done in various ways, including using Schema.org data, breadcrumbs, and custom extraction of additional fields.
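Once pages are categorised, embeddings only compare pages of the same type. A minimal sketch, assuming the sentence-transformers package and page dicts with hypothetical 'url', 'category', and extracted 'text' fields (title, breadcrumbs, description, not the boilerplate chrome):

```python
# For each page, find its most semantically similar neighbours within the
# same category - candidates for the interlinking block.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works

def related_pages(pages, top_k=5):
    """pages: list of dicts with 'url', 'category', 'text'."""
    results = {}
    for category in {p["category"] for p in pages}:
        group = [p for p in pages if p["category"] == category]
        vecs = model.encode([p["text"] for p in group], normalize_embeddings=True)
        sims = vecs @ vecs.T  # cosine similarity, since vectors are normalized
        for i, page in enumerate(group):
            order = np.argsort(sims[i])[::-1]
            best = [j for j in order if j != i][:top_k]
            results[page["url"]] = [group[j]["url"] for j in best]
    return results
```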
Usually, interlinking blocks on websites work dynamically: the list of links is built by some logic during page render. It's the easiest way to develop, but not the most efficient one.
Creating the list of links may require many database queries, which takes time and hurts overall page performance.
But the biggest drawback is that it leads to disproportionate linking. Some top products and categories may receive a significant number of internal links, while the rest of the pages get almost nothing.
Fixed internal linking means saving the linking scheme in the database. Basically, it's a table with three columns:
Donor URL (source page) | Anchor Text | Acceptor URL (target page)
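As an illustration, here is a minimal sketch of that table, using SQLite only as a stand-in for whatever database the site already runs on (table and column names are assumptions):

```python
# A precomputed internal-linking table: one row per link to render.
import sqlite3

con = sqlite3.connect("internal_links.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS internal_links (
        donor_url    TEXT NOT NULL,  -- source page that renders the link block
        anchor_text  TEXT NOT NULL,  -- anchor shown to users and Googlebot
        acceptor_url TEXT NOT NULL,  -- target page receiving the link
        PRIMARY KEY (donor_url, acceptor_url)
    )
""")

# At render time the page only reads its precomputed links - one indexed
# lookup instead of rebuilding the block with heavy queries on every request.
links = con.execute(
    "SELECT anchor_text, acceptor_url FROM internal_links WHERE donor_url = ?",
    ("https://example.com/category/phones",),
).fetchall()
```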
This approach has a lot of benefits.
A table with millions of rows may look scary, but from a developer’s perspective, it’s not even big data.
Working with internal linking gives particularly good results on long-tail, low-demand pages. Each page may bring tens or hundreds of visits per year, but when a website has hundreds of thousands or millions of such pages, it adds up to a lot of traffic.
From our experience, getting 30% more pages crawled by Googlebot is achievable.
It is worth mentioning that internal linking is not a silver bullet. When a website is just a catalog of low-quality pages, even a world-class linking algorithm does not change anything. The website just becomes a well-interlinked but still low-quality website.
Tech SEO optimization is not a one-time task; it's a non-stop process of improving the website. Internal linking is no exception. Products go out of stock, pages get deindexed and removed. All those changes should be taken into account and the internal linking updated accordingly.
There is a huge field for more sophisticated algorithms, like a weighted approach: the importance of a page is determined by the frequency of the queries it targets, and the number of internal links it needs is calculated from that.
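A minimal sketch of that idea, with arbitrary thresholds and a hypothetical per-URL export of query demand (for example, keyword research volumes or GSC impressions):

```python
# Map query demand to a target number of inbound internal links and find
# the pages with the biggest gap. Thresholds are illustrative only.
import pandas as pd

def target_link_count(demand: float) -> int:
    if demand >= 10_000:
        return 100
    if demand >= 1_000:
        return 30
    if demand >= 100:
        return 10
    return 3  # the long tail still gets a minimum

pages = pd.read_csv("pages_with_demand.csv")  # url, demand, inbound_links
pages["target_links"] = pages["demand"].apply(target_link_count)
pages["links_missing"] = (pages["target_links"] - pages["inbound_links"]).clip(lower=0)

# Pages with the biggest shortfall get linked first.
print(pages.sort_values("links_missing", ascending=False).head(10))
```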
Another technique is tracking page/query position changes. When a group of pages shows growth, those pages are marked as more important and get a boost with an additional number of internal links.
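One way to spot such groups is to compare two Search Console snapshots, for example the last 28 days against the previous 28. A sketch with hypothetical file and column names:

```python
# Find page groups whose average position improved between two periods;
# those groups get extra internal links.
import pandas as pd

prev = pd.read_csv("gsc_prev_period.csv")  # url, group, position
curr = pd.read_csv("gsc_curr_period.csv")  # url, group, position

merged = curr.merge(prev, on=["url", "group"], suffixes=("_curr", "_prev"))
merged["position_gain"] = merged["position_prev"] - merged["position_curr"]  # positive = moved up

group_gain = merged.groupby("group")["position_gain"].mean().sort_values(ascending=False)
print(group_gain.head(10))
```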
Taking the keywords a page already ranks for in the SERP and using them in the anchor text of its incoming links is another way to gain a few positions in search.
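A minimal sketch of picking such anchors from a Search Console query export (hypothetical columns url, query, impressions, position): queries that already rank just below the top have real demand and a few positions of headroom.

```python
# Choose one 'near-miss' query per page as the anchor text for its new
# incoming internal links.
import pandas as pd

gsc = pd.read_csv("gsc_queries.csv")  # url, query, impressions, position

candidates = gsc[(gsc["position"] > 3) & (gsc["position"] <= 15)]

anchors = (
    candidates.sort_values("impressions", ascending=False)
              .groupby("url")["query"]
              .first()                 # strongest near-miss query per page
              .rename("anchor_text")
)
print(anchors.head(10))
```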
In conclusion, I believe technical SEO on big websites is one of the most interesting fields to work in.
Everything is about scale: even barely visible changes can bring a significant traffic increase.
With all the modern tools, data, and AI available, it has become easier and more achievable than ever.