July 14, 2022
by Sofia

Custom Extraction Guide with multiple use-cases

Custom extraction lets you find specific elements in the HTML code of a page. Our crawler searches for the selected elements while crawling your website. With custom extraction, you can use JetOctopus more efficiently and collect non-standard information about your website.

With JetOctopus, you can configure up to 50 custom extraction rules per crawl. For each rule, you choose the extraction type: CSS or regex. Note that JetOctopus extracts data from the regular (non-rendered) HTML. If you need data from the JavaScript-rendered page, activate the “Execute JavaScript” checkbox in the basic settings when configuring the crawl.

More information: How to set up a crawl for a JavaScript website.

How to set up custom extraction with JetOctopus

To set up custom extraction rules, start a new crawl and configure it. You can crawl the whole website or a URL list. After completing the basic settings, go to the “Custom Extraction” menu and click “+ Add Rule”. Here you can set up basic rules for extraction.


Basic settings will be enough to find and extract an element on the page.

In the “Title” field, enter a title for the rule. Clear titles make the results of custom extraction easier to analyze, especially if you use a lot of rules. In the “CSS Rule” field below it, enter the rule itself.


You can extract an HTML element by its id or class.

To do this, select the desired element on the page with the mouse. Right-click on it and select “Inspect the element”. DevTools will open. Copy the id or class of the element.


Alternatively, right-click on the element's code and select Copy – Copy Selector. This gives you a selector for that exact element at its exact location, rather than for all the elements of the class.


Paste the copied value into the “CSS Rule” field. Put # before an id or . (dot) before a class. If you copied the selector from DevTools, it already contains the dot or #.
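
To illustrate what the # and . prefixes select, here is a minimal sketch using only Python's standard library. It is not JetOctopus code: the SimpleSelectorExtractor class, the extract helper and the sample HTML are hypothetical, and the sketch understands only a single "#id" or ".class" selector, not full CSS.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class SimpleSelectorExtractor(HTMLParser):
    """Collects the text of every element matching '#some-id' or '.some-class'."""

    def __init__(self, selector):
        super().__init__()
        self.kind, self.name = selector[0], selector[1:]
        self.depth = 0  # greater than 0 while inside a matching element
        self.results = []

    def _matches(self, attrs):
        attrs = dict(attrs)
        if self.kind == "#":
            return attrs.get("id") == self.name
        if self.kind == ".":
            # An element may carry several space-separated classes.
            return self.name in (attrs.get("class") or "").split()
        return False

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self._matches(attrs):
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def extract(html, selector):
    parser = SimpleSelectorExtractor(selector)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


SAMPLE_HTML = """
<div id="product-name">Blue Mug</div>
<span class="price sale">12.50</span>
<span class="price">8.00</span>
"""

print(extract(SAMPLE_HTML, "#product-name"))  # one element: ids are unique
print(extract(SAMPLE_HTML, ".price"))         # every element with that class
```

The id selector returns a single element, while the class selector returns every element carrying that class, which is why a selector copied from DevTools is the safer choice when you need one exact element.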


You can also insert a regex rule here. You can try to create a regex rule with regex101.com.


Advanced custom extraction

The extended settings give you access to all the possibilities of custom extraction. Open them by clicking on the corresponding button.


In the “Extract Type” field, you can select the exact type: CSS Rule or regex. Regex is meant for special cases, for example when you need to extract one specific element that has no id or class from among all the available ones.

If you choose CSS rule:
Css Extract Mode – allows you to choose what you want to extract: the value of a tag or a specific attribute of this tag.
If you choose an attribute, enter its name in the “Css Extract Attribute” field.
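
The difference between extracting a tag's value and extracting one of its attributes can be sketched with Python's standard library. This is only an illustration: the AttributeExtractor class, the extract_attribute helper and the sample HTML are hypothetical and do not reflect how JetOctopus is implemented.

```python
from html.parser import HTMLParser


class AttributeExtractor(HTMLParser):
    """Collects one named attribute from every <tag> that carries it."""

    def __init__(self, tag, attribute):
        super().__init__()
        self.tag, self.attribute = tag, attribute
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            value = dict(attrs).get(self.attribute)
            if value is not None:
                self.values.append(value)


def extract_attribute(html, tag, attribute):
    parser = AttributeExtractor(tag, attribute)
    parser.feed(html)
    parser.close()
    return parser.values


SAMPLE_HTML = """
<img src="/img/mug.jpg" alt="Blue ceramic mug">
<img src="/img/banner.jpg">
<a href="/reviews">Read reviews</a>
"""

# The value of the <a> tag is its text ("Read reviews");
# its href attribute is a different piece of data ("/reviews").
print(extract_attribute(SAMPLE_HTML, "a", "href"))
print(extract_attribute(SAMPLE_HTML, "img", "alt"))
```

Attribute mode is the natural fit for data that never appears as visible text, such as image alt texts or link targets.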


If you choose Regex:

  • enter a regular expression without escaping forward slashes;
  • Regex Is Ignore Case – if this checkbox is activated, matching is case-insensitive: the crawler extracts data that matches the rule regardless of whether it is written in lower or upper case.
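
You can prototype both regex options locally before pasting a rule into the crawler. The sketch below uses Python's re module as a stand-in, assuming the crawler's regex flavor behaves similarly for simple patterns like this one; the sample HTML and the RULE pattern are hypothetical.

```python
import re

SAMPLE_HTML = '<a href="https://example.com/Blog/seo-tips">SEO tips</a>'

# Forward slashes are written as-is: there is no /.../ delimiter around
# the pattern, so they do not need a backslash in front of them.
RULE = r'href="https://example\.com/blog/([^"]+)"'

# Case-sensitive: "/blog/" in the rule does not match "/Blog/" in the page.
case_sensitive = re.search(RULE, SAMPLE_HTML)

# Case-insensitive, comparable to activating "Regex Is Ignore Case".
case_insensitive = re.search(RULE, SAMPLE_HTML, flags=re.IGNORECASE)

print(case_sensitive)             # no match
print(case_insensitive.group(1))  # the captured part of the URL
```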

What data can JetOctopus get when crawling with custom extraction

In the “Action” field, select what exactly you want to do with the elements that match the rule.

  • Get First Element – the results will contain the first element that matches the given rule.
  • Get First 2 Elements, Get First 3 Elements, Get First 4 Elements, Get First 5 Elements, and Get All – JetOctopus will extract the first 2 (3,4,5) or all elements that match the given rule.
  • Count Elements – will count how many elements there are in the HTML code that match the rule.
  • Check is Exists – will check the page for the presence of a certain element. For example, you can check if there are reviews on a page.
  • Count Symbols, Count Words, Count Unique Words – will count symbols, words and unique words respectively. You can use this option as an additional way to analyze content, for example to check whether there is enough content on the page or how unique it is.
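
The element-level actions above can be thought of as simple operations on the list of matches. The following sketch models them in Python; the apply_action function and its action names are hypothetical labels chosen for this example, not JetOctopus identifiers.

```python
def apply_action(matches, action, limit=None):
    """Apply an action to a list of extracted strings in document order."""
    if action == "get_first":
        return matches[:1]
    if action == "get_first_n":      # covers "Get First 2..5 Elements"
        return matches[:limit]
    if action == "get_all":
        return list(matches)
    if action == "count_elements":
        return len(matches)
    if action == "check_exists":     # 1 if found, 0 if not
        return 1 if matches else 0
    raise ValueError(f"unknown action: {action}")


reviews = ["Great mug", "Good value", "Arrived broken"]

print(apply_action(reviews, "get_first"))
print(apply_action(reviews, "get_first_n", limit=2))
print(apply_action(reviews, "count_elements"))
print(apply_action([], "check_exists"))
```

Choosing the narrowest action that answers your question keeps the results easy to filter: a count or an existence check is simpler to analyze than the full list of matched elements.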

Select the data type you want to see in the results.

  • Text – all text inside the tag/attribute.
  • Integer – an integer (whole number) without text, if there is one in the element. Note that the number is not rounded: the separator between the integer and fractional parts is simply dropped, so a value such as 19.99 comes out as 1999 rather than 20. If getting an exact price is important to you, for example, choose “Decimal”.
  • Decimal – JetOctopus will get floating-point (decimal) numbers if they are in the element.
  • Boolean – yes or no (checks for an element that matches the rule).
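
The difference between “Integer” and “Decimal” is easiest to see on a price. The sketch below is one possible reading of the behavior described above, not the actual JetOctopus implementation: as_integer and as_decimal are hypothetical helpers, and as_decimal deliberately ignores thousands separators.

```python
import re
from decimal import Decimal


def as_integer(text):
    """Keep the digits only: the separator is dropped, nothing is rounded."""
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None


def as_decimal(text):
    """Take the first number, accepting '.' or ',' as the decimal separator."""
    match = re.search(r"\d+(?:[.,]\d+)?", text)
    return Decimal(match.group().replace(",", ".")) if match else None


print(as_integer("128 reviews"))  # a clean whole number
print(as_integer("19.99 USD"))    # digits run together, not rounded
print(as_decimal("19.99 USD"))    # the exact price
```

For review counts and similar whole numbers, an integer is enough; for prices, the decimal type avoids the digits running together.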

Important points to remember

Important: check each rule before running the crawl. We made a special button for this. Click “Test This Rule”, enter the URL for the test and run it. If you see the desired result below, then the custom extraction rule is correct.


The CSS rule or regex you need may differ between the page code with rendered JavaScript and the regular HTML. To copy a CSS rule for a page with regular HTML, disable JavaScript in your browser: open the Command Menu in DevTools (Ctrl + Shift + P) and select “Disable JavaScript”.


And one more important thing. Some websites may display different code for different web browsers. So always check your rule before starting a crawl.

The scraper will extract information only if the page returns a 200 response code.

If you want to run custom extraction for a URL list, select the “URL list” crawl mode in “Basic settings”.

Where you can find results of custom extraction

When the crawl is finished, you can find the data in several reports. To see if there is duplicate content in the crawled items, go to the “Duplication” – “Custom Extraction” report.


General information is available in the “Crawler” – “Custom Extraction” report. Here you can filter the results by rule (“Field” is the title of your rule) and see the number of pages that match the rule.


To conduct a detailed analysis of the pages with extracted elements, go to the “Data Tables”. In the “Custom Extraction” filter, select the required rule, adjust the value and apply the filter.


In the data tables, you will see a column that will contain the extracted element.


You can export the results in the format most convenient for you: CSV, Excel or Google Sheets.


Custom extraction examples

Product name

JetOctopus collects titles and headings, but sometimes product names may not be headings and may not be contained in the title. Inspect the element on the page and select the desired id or class. Enter a # before the id and . (dot) before the class.


You can extract the product stock status in the same way.

Number of ratings

Select the desired element on the page, find its class or id, and enter the rule in the “CSS Rule” field. If you want to see a number in the results, go to “Extended Settings” and select “Integer” for “Data Type”. In the results, you will get only the number of reviews, without additional words.


Price extraction

Select the desired element on the page and enter its class or id in the “CSS Rule” field. If the price contains additional symbols, such as a currency sign, go to “Extended Settings”. In “Data Type”, select “Decimal” to extract floating-point numbers.


How to count the number of elements

With the custom extraction, you can count the number of required elements on the page. This can be the number of products on a category page, the number of brands, the number of filters, headings, specific elements, etc.

Select the desired class/id and go to “Extended Settings”. In “Action”, select “Count Elements”. You will get an integer in the results.

How to check if there is a price

This way you can check if the desired element is on the pages. It can be a product, review, title, heading, price, etc. Select the desired element on the page where it is displayed. In “Action”, select “Check is Exists”. In the results, you will get 1 (means that the element is on the page) or 0 (the element was not found on the page).


How to count words inside the needed element

Pages with rich content tend to rank better in search engines. Finding the pages with the fewest or the most words will help you improve your content. Most content checks are already available in the crawl results, and with custom extraction you can additionally count the number of words in a certain element. For example, you can count how many words are in the product title or product characteristics.

To do this, select the desired id/class, go to extended settings and select “Count Words” in “Action”. You can also count the unique words (to exclude repeated words) and the number of characters. If a page has no content at all but returns a 200 response code, search engines may treat it as a soft 404.
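
The three counts can be sketched in a few lines of Python. How JetOctopus tokenizes words and whether it counts whitespace as symbols is not stated here, so the rules below (words are runs of letters or digits, compared case-insensitively, and every character counts as a symbol) are assumptions. The content_stats and looks_thin helpers and the min_words threshold are hypothetical.

```python
import re


def content_stats(text):
    """Count symbols, words and unique words in the text of one element."""
    words = re.findall(r"\w+", text.lower())
    return {
        "symbols": len(text),
        "words": len(words),
        "unique_words": len(set(words)),
    }


def looks_thin(text, min_words=50):
    """Flag elements with too little text; the threshold is arbitrary."""
    return content_stats(text)["words"] < min_words


description = "Blue mug. Blue ceramic mug."
print(content_stats(description))
print(looks_thin(description))
```

A large gap between the word count and the unique word count points to repetitive text, and a very low word count on a page that returns 200 is a candidate for a soft 404 review.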


What additional data can be found using custom extraction

Custom extraction is a flexible tool that allows you to extract all the data you need, including:

  • number of products in the category;
  • number of words per page;
  • the author of the article;
  • date of publication of the page/article;
  • article categories;
  • the number of headings on the page (we analyze H1 and H2, but with the custom extraction you can count the number of headings of any level);
  • number of breadcrumbs, etc.
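
As an example of the heading count from the list above, the sketch below tallies headings of every level with Python's standard library. In the crawler, the equivalent would be a rule with a tag selector such as h3 and the “Count Elements” action. The HeadingCounter class, the count_headings helper and the sample HTML are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCounter(HTMLParser):
    """Counts how many headings of each level appear in a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.counts[tag] += 1


def count_headings(html):
    parser = HeadingCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


SAMPLE_HTML = "<h1>Mugs</h1><h2>Ceramic</h2><h3>Blue</h3><h3>Red</h3>"

counts = count_headings(SAMPLE_HTML)
print(counts["h3"])  # headings of a level beyond H1 and H2
```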

About Sofia
Technical SEO specialist. Sofia has almost 10 years of experience, the last 5 of them in JavaScript SEO. She is convinced that SEO is a very technical part of digital marketing, and that you can't do effective SEO without logs and in-depth data analysis.
