I’m sure you’ve repeatedly seen headlines such as “40% of websites use WordPress”, “10% of websites sit on Cloudflare”, or “PHP X.X. is the most widely used software”.
However, there is usually no indication of what type of websites were analyzed or of their size.
So is it really true that almost half of the Internet runs on WordPress?
Around three years ago, I published an article about how we analyzed the home pages of more than 250 million available domains.
At the beginning of 2021, we ran a new data collection and added detection of technologies and tracking pixels, along with improved content and link analysis.
This article is an overview of the current state of the core metrics: how many sites are online in general, what percentage use HTTPS, which technologies are the most popular, whether everyone is actually using redirects for promotion, and more.
Parsing and scraping data is generally not that difficult a task. There are a lot of “we-collected-data-from-the-top-X-million(thousand)-websites-and-we-found-this” studies.
Yet would a selection of the top websites be representative of the entire Internet? Can we pick the top 1 million richest people in the world and use them to assess the welfare of the rest of the world’s population?
That is why I wanted to scour and analyze the main pages of absolutely all the domains on the Internet. Three years ago we did it for the first time and we published the result here. This project has since served as a testbed for JetOctopus, on which we tested new technologies for the crawler.
The very first question you would ask me is where we got the list of all domains on the Internet. There are quite a few providers from whom such lists can be purchased. Subdomains are not included in the sample.
The second important point is the domain-host bond. One domain gives us 4 hosts: http://site.com, http://www.site.com, https://site.com, and https://www.site.com.
It seems that all of these should serve the same site or at least redirect to the canonical version. In practice, there may be completely different websites on HTTP and HTTPS. This happens frequently because of web server configuration errors.
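As a minimal sketch of the domain-host bond described above, the four hosts for a domain can be generated like this. The function name `host_variants` is hypothetical and only serves as an illustration; it is not taken from our crawler.

```python
def host_variants(domain):
    """Return the four scheme/www combinations for one registered domain."""
    bare = domain.lower()
    if bare.startswith("www."):
        bare = bare[4:]  # normalise so "www.site.com" and "site.com" give the same set
    return [
        f"{scheme}://{prefix}{bare}"
        for scheme in ("http", "https")
        for prefix in ("", "www.")
    ]


for url in host_variants("site.com"):
    print(url)
```

Each of these four URLs has to be requested separately, because each one can return its own status code, redirect, or content.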
There are 252 million domains in the database, of which 200 million respond on ports 80 and 443, and 148 million return a 200 status code.
The latter is the number of domains that are actually working.
Since the last analysis there has been almost no change in the number of IP addresses: in 2018 there were 13.2 million unique IP addresses listed in A records, and in 2021 there are 14.3 million.
Consequently, an average of 15 domains is tied to each IP address.
It’s worth noting again that the hosts site.com and www.site.com, as well as https://site.com, can serve completely different websites. Therefore, from here on we will work with two separate concepts: domain and host.
The total sum is more than the overall number of domains, because each domain yields 4 hosts (combinations of www/non-www and HTTP/HTTPS), and each host can return its own status code.
What surprised me is that there hasn’t been much growth in the number of websites (hosts) that run on HTTPS. I expected that after three years everyone would have switched to HTTPS by now; however, the growth is only 106/86 – 1 = 23%.
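The growth figure above is a plain relative change. A minimal sketch of the arithmetic, assuming that 86 and 106 are the numbers of HTTPS hosts (in millions) in 2018 and 2021:

```python
def relative_growth(old, new):
    """Relative change from old to new, e.g. 0.25 means +25%."""
    return new / old - 1


# Assumption: 86 and 106 are millions of HTTPS hosts in 2018 and 2021.
print(f"{relative_growth(86, 106):.0%}")
```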
Over more than 10 years I’ve probably registered about a hundred domains, and none of them have used www as the main mirror. It seems, however, that www is so familiar to everyone that I’d say it has triumphed.
50 million hosts redirect from non-www to www, while only 37 million redirect from www to non-www.
We used Wappalyzer to identify technologies. This is a popular Chrome extension that has a publicly available technology library. The detection algorithm is rather simple: the HTML code, script URLs, and CSS are checked for substrings characteristic of a given technology.
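The substring check described above can be sketched roughly as follows. This is a simplified illustration, not Wappalyzer's actual code: the `SIGNATURES` table and the `detect` function are hypothetical, and the substrings are examples I picked for the sketch rather than entries from its library.

```python
# Hypothetical signature table: technology name -> characteristic substrings.
# These entries are illustrative only and are not taken from Wappalyzer.
SIGNATURES = {
    "WordPress": ("/wp-content/", "/wp-includes/"),
    "Cloudflare": ("/cdn-cgi/", "cdnjs.cloudflare.com"),
    "jQuery": ("jquery.min.js", "jquery.js"),
}


def detect(html, asset_urls=()):
    """Return the sorted names of technologies whose substrings appear in the page."""
    haystack = "\n".join([html, *asset_urls]).lower()
    return sorted(
        name
        for name, needles in SIGNATURES.items()
        if any(needle in haystack for needle in needles)
    )


page = '<link rel="stylesheet" href="/wp-content/themes/demo/style.css">'
print(detect(page, ["https://example.com/js/jquery.min.js"]))
```

Plain substring matching is fast enough to run over hundreds of millions of pages, at the cost of occasional false positives when a marker string appears in unrelated content.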
Thus, it turns out that 23 million of the 148 million domains that return a 200 status code run on WordPress, which equals 15% of the domains on the Internet.
Counted by hosts, it is 55 million of the 295 million hosts that return a 200 status code, which equals 18% of the hosts on the Internet.
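The two WordPress shares can be rechecked with a few lines of arithmetic. The inputs are the figures quoted above (in millions); the helper name `share` is just for illustration. With one decimal place the results come out slightly higher than the whole-number percentages in the text, which appear to be rounded down.

```python
def share(part, total):
    """Fraction of total that part represents."""
    return part / total


# Figures from the text, in millions, for domains/hosts returning status 200.
print(f"WordPress share of domains: {share(23, 148):.1%}")
print(f"WordPress share of hosts:   {share(55, 295):.1%}")
```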
The Cloudflare figure is very surprising.
At the same time, we see about 10 million hosts with Cloudflare. Perhaps they also consider subdomains in their statistics, which we, unfortunately, do not have in the database.
Redirects are considered the holy grail of SEO. Grab a dropped domain, redirect it to the main domain, and profit!
There are a total of 7.5 million domains that have more than 1 inbound redirect from a third-party host (i.e. www/non-www redirects are excluded).
That is 5% of the total number of working domains (7.5 of 148 million).
If we look at the top 20 domains by the number of redirects, we see what one would expect: hosting providers, domain registrars, and the odd host on amazonaws.com.
The median number of inbound redirects is 4, while 10% of the group have 8 or more inbound redirects.
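The redirect statistics above can be sketched as follows, assuming the crawl produces a list of (source host, target host) pairs. All function names are hypothetical, and stripping a leading "www." is a simplification of how same-domain redirects would be recognised in a real pipeline.

```python
import math
from collections import defaultdict
from statistics import median


def strip_www(host):
    """Simplification: treat www.site.com and site.com as the same domain."""
    host = host.lower()
    return host[4:] if host.startswith("www.") else host


def inbound_redirect_counts(redirects):
    """Count distinct third-party source domains redirecting to each target domain."""
    sources = defaultdict(set)
    for src, dst in redirects:
        s, d = strip_www(src), strip_www(dst)
        if s != d:  # skip www/non-www redirects within one domain
            sources[d].add(s)
    return {domain: len(srcs) for domain, srcs in sources.items()}


def percentile_nearest_rank(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


sample = [
    ("www.shop.example", "shop.example"),   # same domain, ignored
    ("old-a.example", "shop.example"),
    ("old-b.example", "www.shop.example"),
    ("old-c.example", "blog.example"),
]
counts = inbound_redirect_counts(sample)
group = [c for c in counts.values() if c > 1]  # domains with more than 1 inbound redirect
print(counts, median(group), percentile_nearest_rank(group, 90))
```

The nearest-rank percentile is used here only because it is easy to verify by hand; other percentile definitions interpolate and can give slightly different values on small samples.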
On the whole, I’m inclined to believe that redirects are in use, but are nowhere near as ubiquitous as the talk about them suggests. The exceptions are the highly competitive and grey niches.
The Internet is not a static environment. Every day thousands of websites are launched and shut down. But when comparing the current figures with the data from three years ago, there are actually not that many changes. The total number of working websites has remained at the same level. Apparently, the world population is growing faster than the number of websites on the Internet.
I would be happy to hear your comments, thoughts, and observations on the topic.
If you want to get more interesting data about the internet, connect with me on Twitter.
Additional Resources:
If you found the study interesting, check out additional links below: