When we needed to crawl 5 mln pages at our first business (a job aggregator) and calculated the budget for it (all the money in the world), we decided to write our own crawler. What happened next is the most interesting part.
Julia and I have spent almost six years of our lives working on different sorts of aggregators: job, price, auto, real estate. In 2013 we started our own business and launched a job aggregator. Within 3 years, without any investment, we achieved quite a success and launched over 30 websites in different countries, including satellites and other auxiliary projects.
Do you know what aggregators are in the context of SEO? Lots and lots of pages - about 60 mln in total. And you’ve got to work with them somehow.
Moreover, we didn’t have any investment. That meant we couldn’t work with the best (i.e. very expensive) cloud providers such as Amazon or Google. Every second we received around 20 queries, which makes 1.7 mln queries a day. Every day we processed gigabytes of XML, updated a 10 GB search index and sent 100-170 thousand emails, all while paying $600 per month for DigitalOcean and other hosting servers.
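As a quick sanity check on these numbers, here is how a sustained rate of about 20 queries per second turns into a daily total. This is only illustrative arithmetic: the function and constant names are made up for the example, and the rate is the rounded figure quoted above.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day


def queries_per_day(queries_per_second: float) -> float:
    """Convert a sustained per-second rate into a daily total."""
    return queries_per_second * SECONDS_PER_DAY


daily = queries_per_day(20)  # 20 q/s is the approximate rate from the text
print(f"{daily / 1e6:.1f} mln queries a day")  # prints "1.7 mln queries a day"
```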
Thanks to that we gained broad experience in data processing and database optimization. But when it came to optimizing the internal linking of our sites, I studied the prices of the existing SaaS crawlers. After some calculations it became clear that it would be much cheaper to buy a jet plane or a three-deck yacht.
A month later I came up with the first prototype of a crawler developed just for our needs. With its help we could also analyze our competitors and watch how they do SEO.
Within less than a year we processed around 100 mln pages.
After a while I started thinking of creating a standalone product – a crawler for big sites.
And so it happened. At the end of 2016 I was still working on website data processing, testing various databases, crawling technologies, link graph processing technologies and so on, when it hit me: I don’t have to reinvent the wheel or use Java.
All I need is a distributed homogeneous system, the same approach Google and Yandex take when building their architecture.
Thanks to this architecture our crawler starts working at once: you don’t have to wait several hours for a service to start a virtual machine in the cloud. It also lets us confidently claim that we can crawl up to 200 pages a second.
In fact, we could set the figure at 100k-150k pages a second in the index, but unfortunately not many sites can handle a load of even 200.
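To put the 200 pages a second in perspective, here is a rough estimate of how long a crawl would take at that rate for sites of 1 mln and 5 mln pages. It is a back-of-the-envelope sketch, not a measurement: it assumes the peak rate is sustained for the whole crawl, which depends on how much load the target site can take, and the function name is hypothetical.

```python
def crawl_hours(pages: int, pages_per_second: float) -> float:
    """Estimated crawl duration in hours at a constant crawl rate."""
    return pages / pages_per_second / 3600


# Assumption: the crawl holds the peak rate of 200 pages/s from start to finish.
for pages in (1_000_000, 5_000_000):
    hours = crawl_hours(pages, 200)
    print(f"{pages:,} pages at 200 pages/s: about {hours:.1f} hours")
```

At that rate 1 mln pages take roughly an hour and a half, and 5 mln pages take about seven hours.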
For a year we were busy developing the system backend and cooperating with really big sites from different countries – providing them with reports, exports and optimizations.
The system was constantly being tested on sites with 1-5 mln pages. So when I hear some services say “a big site is a site with 100k pages”, I smile. For JetOctopus, crawling 1 mln pages is like starting a car and warming up the engine without actually driving anywhere.
Websites with 30-40 mln pages and tens of billions of links: that’s what I call really big sites. They are interesting to work with, and they help you get a clearer picture of how Googlebot works.
Our Plans for the Future
We intend to provide big sites, SEO agencies and SEO consultants with all the power and opportunities of a SaaS crawler at a reasonable price.
That’s why our motto is “Crawling without limits”. JetOctopus allows you to create an unlimited number of projects, domains, crawls and segments. Just choose a package with the required number of pages and forget about the limits.
Because we can process and analyze huge amounts of data, we research the entire Internet (about 210 mln domains) as well as individual big sites in various ways. We are convinced that sharing information, even with our competitors, is the right way to develop the industry. Stay tuned.
ABOUT THE AUTHOR
Serge Bezborodov is the CTO of JetOctopus. He is a professional programmer with 9+ years of experience. He has worked with aggregators (vacancies, real estate, cars) for 5 years. He also has experience in database architecture design and query optimization. Serge has crawled more than 160 mln pages and 40 TB of data with JetOctopus.