Our Story


Jun 12, 2018


When we needed to crawl 5 mln pages for our first business (a job aggregator) and calculated the budget for it (all the money in the world), we decided to write our own crawler. What happened next is the most interesting part.

Our projects

Julia and I have spent almost six years of our lives working on different sorts of aggregators – job, price, auto, real estate. In 2013 we started our own business and launched a job aggregator. Within three years, without any investments, we achieved considerable success and launched over 30 websites in different countries, including satellites and other auxiliary projects.

Do you know what aggregators are in the context of SEO? Lots and lots of pages - about 60 mln in total. And you’ve got to work with them somehow.

Moreover, we didn’t have any investments, which meant we couldn’t afford the best (i.e. very expensive) cloud providers such as Amazon or Google. We received around 20 queries per second – that makes 1.7 mln queries a day. Every day we processed gigabytes of XML, updated a 10 GB search index and sent 100-170 thousand emails, all while paying $600 per month for DigitalOcean and other hosting providers.

Thanks to that we gained broad experience in data processing and database optimization. But when it came to optimizing our sites’ internal linking, I looked into the prices of the existing SaaS crawlers. After some calculations it became clear that it would be much cheaper to buy a jet plane or a three-deck yacht.

A month later I had the first prototype of a crawler built purely for our own needs – with its help we could also analyze our competitors and watch how they did their SEO. In less than a year we processed around 100 mln pages with it.
After a while I started thinking about turning it into a standalone product – a crawler for big websites.

JetOctopus begins

So it happened. At the end of 2016 I was still working on website data processing, testing various databases, crawling technologies, link graph processing and so on, when it hit me – I don’t have to reinvent the wheel or use Java. All I need is a distributed homogeneous system – the same approach Google and Yandex take when building their architecture.

Thanks to this, our crawler starts working at once – you don’t have to wait several hours for another service to spin up virtual machines in the cloud. And thanks to this, we can confidently claim that we crawl up to 200 pages per second.

In fact, we can push 100k-150k pages per second into the index, but unfortunately not many sites can handle even a load of 200 pages per second.

For a year we were busy developing the system’s backend and working with really big sites from different countries – providing them with reports, exports and optimization advice.

The system was constantly tested on sites with 1-5 mln pages. So when I hear some services say that “a big site is a site with 100k pages”, I smile. For JetOctopus, crawling 1 mln pages is like starting a car and warming up the engine without actually driving anywhere.

Websites with 30-40 mln pages and tens of billions of links – those are what I call really big sites: interesting to work with and helpful for getting a clearer picture of how Googlebot works.

Our Plans for the Future

We intend to give big sites, SEO agencies and SEO consultants all the power and capabilities of a SaaS crawler at a reasonable price.

That’s why our motto is “Crawling without limits”. JetOctopus allows you to create an unlimited number of projects, domains, crawls and segments. Just choose a package with the required number of pages and forget about the limits.

Being able to process and analyze huge amounts of data, we research both the entire Internet – about 210 mln domains – and individual big sites in various ways. We are convinced that sharing information, even with our competitors, is the right way to develop the industry.

Stay tuned

Read more: a top SEO influencer wrote about JetOctopus.

ABOUT THE AUTHOR

Serge Bezborodov is the CTO of JetOctopus. He is a professional programmer with 9+ years of experience, five of them spent working on aggregators – jobs, real estate, cars. He also has experience in database architecture design and query optimization. With JetOctopus, Serge has crawled more than 160 mln pages and 40 TB of data.

Case study: auto.ria.com
Auto classifieds, 20 mln pages crawled – duplications.
What problem was your SEO department working on when you decided to try our crawler?
We needed to detect all possible errors as quickly as possible, because Google Search Console shows only 1,000 results per day.

What problems did you find?
That’s quite a broad question. We managed to detect old, unsupported pages and the errors related to them. We also found a large number of duplicated pages and pages with a 404 response code.

How quickly did you implement the required changes after the crawl?
We are still implementing them, because the website is large and there are lots of errors on it. There are currently four teams working on the website, so we have to assign each error to the right team and draw up individual statements of work.

And what about the results?
It’s quite difficult to measure results right now because we are constantly working on the website and making changes. But a higher scan frequency by bots would mean the changes are productive. However, around a month and a half ago we opened all the paginated pages for indexing, and this has already affected our statistics.

Having seen the crawl report, what was the most surprising thing you found? (Were there errors you’d never thought you’d find?)
I was surprised to find so many old, unsupported pages that are outside the website’s structure. There was also a large number of 404 pages. We are really glad we managed to get a breakdown by website subdirectory – it helped us decide which team to start working with first.

You have worked with different crawlers. Can you compare JetOctopus with the others and assess it?
Every crawler looks for errors and finds them. The main point is the balance between the number of scanned pages and the price. JetOctopus is one of the most affordable crawlers.

Would you recommend JetOctopus to your friends?
We’re going to use it within our company from time to time. I would recommend the crawler to my friends if they were SEO specialists.

Your suggestions for JetOctopus?
To refine the web version as soon as possible – there were a few things we were badly missing.
“Thank you very much for such a detailed analysis. We are currently thinking over a redirects problem.”
– ultra.by
“The amount of information provided is quite enough, and you can see at once where the problems are and what they are.”
– auto.ria.com