Custom extraction lets you find specific elements in the HTML code of a page. Our web scraper looks for the selected elements while crawling your website. With custom extraction, you can use JetOctopus more efficiently and collect non-standard information about your website.
With JetOctopus, you can configure up to 50 custom extraction rules per crawl. You can choose the extract type: CSS or regex. Note that JetOctopus extracts data from regular HTML. If you need data from rendered JavaScript, activate the “Execute JavaScript” checkbox in the basic settings during the crawl configuration.
More information: How to set up a crawl for a JavaScript website.
To set up custom extraction rules, start a new crawl and configure it. You can crawl the whole website or a URL list. After the basic settings, go to the “Custom Extraction” menu and click “+ Add Rule”. Here you can set up basic rules for extraction.
Basic settings will be enough to find and extract an element on the page.
In the “Title” field, enter a title. Clear titles make the results of custom extraction easier to analyze, especially if you use a lot of rules. Below it, fill in the “CSS Rule” field.
You can use custom extraction by id or class of HTML element.
To do this, select the desired element on the page with the mouse. Right-click on it and select “Inspect the element”. DevTools will open. Copy the id or class of the element.
Alternatively, right-click on the element's code and select Copy – Copy Selector. This selects the exact element at its exact location, rather than all the elements of the class.
Paste the value into the “CSS Rule” field. Enter # before an id or . (dot) before a class. If you copied the selector from DevTools, it will already contain the dot or #.
You can also insert a regex rule here. You can try to create a regex rule with regex101.com.
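To illustrate what a regex rule does, here is a minimal Python sketch that runs a pattern against raw HTML and returns the first captured value. The HTML snippet, the `data-sku` attribute and the `extract_first` helper are hypothetical, and the regex flavor used by JetOctopus may differ from Python's `re` module, so treat this only as a way to reason about a pattern before testing it in the tool.

```python
import re

# Hypothetical raw HTML, as received without JavaScript rendering.
html = '<div class="product"><span data-sku="AB-1042">In stock</span></div>'


def extract_first(pattern: str, source: str):
    """Return the first capture group of the first match, or None if nothing matches."""
    match = re.search(pattern, source)
    return match.group(1) if match else None


# Capture everything between the quotes of the (assumed) data-sku attribute.
sku = extract_first(r'data-sku="([^"]+)"', html)
print(sku)
```

The capture group in parentheses marks the part of the match that should end up in the results; the rest of the pattern only anchors it to the right place in the code.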
Explore the full capabilities of custom extraction in the extended settings. To open them, click the corresponding button.
In the “Extract Type” field, you can select the exact type: CSS Rule or regex. The latter is used for specific cases, for example, when you need to extract a specific element without an id or class from all the available ones.
If you choose CSS rule:
Css Extract Mode – allows you to choose what you want to extract: the value of a tag or a specific attribute of this tag.
If you choose an attribute, add its name in the “Css Extract Attribute”.
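The difference between extracting the value of a tag and extracting one of its attributes can be sketched in Python with the standard library. This is only an illustration of the idea, not how JetOctopus implements it: the `ClassExtractor` class, the sample link and the `product-link` class name are hypothetical, and the parser is simplified (it does not handle void tags such as `<br>` nested inside the matched element).

```python
from html.parser import HTMLParser


class ClassExtractor(HTMLParser):
    """Collect the text and attributes of the first element with a given class."""

    def __init__(self, css_class: str):
        super().__init__()
        self.css_class = css_class
        self.attrs = None  # attributes of the matched tag, once found
        self.text = ""     # text found inside the matched tag
        self._depth = 0    # greater than 0 while inside the matched tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._depth:
            self._depth += 1  # nested tag inside the matched element
        elif self.attrs is None and self.css_class in (attrs.get("class") or "").split():
            self.attrs = attrs
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.text += data


# Hypothetical element; the rule would be ".product-link".
html = '<a class="product-link" href="/shoes/red-sneakers">Red Sneakers</a>'
parser = ClassExtractor("product-link")
parser.feed(html)

print(parser.text)           # value of the tag
print(parser.attrs["href"])  # value of the "href" attribute
```

For the same rule, the first mode yields the visible text of the link, while the attribute mode with `href` yields the link target.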
If you choose Regex:
Select in the “Action” field what exactly you want to do.
Select the data type you want to see in the results.
Important: check each rule before running the crawl. We made a special button for this. Click “Test This Rule”, enter the URL for the test and run it. If you see the desired result below, then the custom extraction rule is correct.
The page code with rendered JavaScript may differ from the regular HTML, so a CSS rule or regex that works for one may not work for the other. To copy a CSS rule for a page with regular HTML, disable JavaScript in your browser: open the Command Menu in DevTools (Ctrl + Shift + P) and select “Disable JavaScript”.
One more important thing: some websites serve different code to different web browsers, so always check your rule before starting a crawl.
The scraper will extract information only if the page returns a 200 response code.
If you want to run custom extraction on a URL list, select the “URL list” crawl mode in “Basic settings”.
When the crawl is finished, you can find the data in several reports. To see if there is duplicate content in the crawled items, go to the “Duplication” – “Custom Extraction” report.
General information is available in the “Crawler” – “Custom Extraction” report. Here you can filter the results by a rule (“Field” is the title of your rule) and see the number of pages that match the rule.
To conduct a detailed analysis of the pages with extracted elements, go to the “Data Tables”. In the “Custom Extraction” filter, select the required rule, adjust the value and apply the filter.
In the data tables, you will see a column that contains the extracted element.
You can export the results in the format that suits you: CSV, Excel or Google Sheets.
Product name
JetOctopus collects titles and headings, but sometimes a product name is neither a heading nor part of the title. Inspect the element on the page and select the desired id or class. Enter # before an id and . (dot) before a class.
You can extract the product stock status in the same way.
Number of ratings
Select the desired element on the page, find its class or id, and enter the rule in the “CSS Rule” field. If you want to see a number in the results, go to “Extended Settings” and select “Integer” for “Data Type”. In the results, you will get only the number of reviews, without additional words.
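The effect of the “Integer” data type can be pictured with a short Python sketch. The exact conversion rules JetOctopus applies are not described here, so the logic below is an assumption: it takes the first run of digits in the extracted text and treats commas as thousands separators. The `to_integer` helper and the sample strings are hypothetical.

```python
import re


def to_integer(text: str):
    """Return the first run of digits in the text as an int, or None if there is none."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    # Assumption: commas are thousands separators, as in "1,204".
    return int(match.group().replace(",", ""))


print(to_integer("128 ratings"))
print(to_integer("1,204 ratings"))
print(to_integer("No ratings yet"))
```

With a numeric result you can sort and filter pages by the value instead of comparing text strings.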
Price extraction
Select the desired element on the page and enter it in the “CSS Rule” field. If the price contains additional symbols, such as a currency sign, go to “Extended Settings”. In “Data Type”, select “Decimal” to extract floating-point numbers.
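As a rough illustration of what a decimal conversion involves, the Python sketch below strips the currency symbol and other text from a price string. It is a sketch under stated assumptions, not the JetOctopus implementation: it assumes a dot as the decimal separator and commas as thousands separators, and the `to_decimal` helper and sample prices are hypothetical.

```python
import re
from decimal import Decimal


def to_decimal(text: str):
    """Return the first number in the text as a Decimal, or None if there is none."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    # Assumption: "1,299.99" style formatting (comma = thousands, dot = decimal).
    return Decimal(match.group().replace(",", ""))


print(to_decimal("$1,299.99"))
print(to_decimal("Price: 49 USD"))
print(to_decimal("Price on request"))
```

Prices formatted differently (for example "1.299,99") would need a different rule, which is one more reason to test each rule on a real page before the crawl.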
How to count the number of elements
With the custom extraction, you can count the number of required elements on the page. This can be the number of products on a category page, the number of brands, the number of filters, headings, specific elements, etc.
Select the desired class/id and go to “Extended Settings”. In “Action”, select “Count elements”. You will get an integer in the results.
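Counting elements by class can be sketched in Python with the standard library parser. The `ClassCounter` class, the `count_elements` helper, the sample category page and the `product-card` class name are all hypothetical and only mirror the idea of the “Count elements” action.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count the elements whose class attribute contains a given class name."""

    def __init__(self, css_class: str):
        super().__init__()
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # An element may carry several classes separated by whitespace.
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.count += 1


def count_elements(html: str, css_class: str) -> int:
    counter = ClassCounter(css_class)
    counter.feed(html)
    return counter.count


# Hypothetical category page with three product cards.
html = """
<div class="product-card">A</div>
<div class="product-card featured">B</div>
<div class="product-card">C</div>
<div class="product-card-title">Not a card</div>
"""

print(count_elements(html, "product-card"))
```

Note that the class name has to match exactly: `product-card-title` is a different class and is not counted.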
How to check if there is a price
In a similar way, you can check whether the desired element is present on your pages. It can be a product, review, title, heading, price, etc. Select the desired element on a page where it is displayed. In “Action”, select “Check is Exists”. In the results, you will get 1 (the element is on the page) or 0 (the element was not found on the page).
How to count words inside the needed element
Pages with rich content tend to rank better in search engines. Finding the pages with the fewest or the most words will help you improve your content. Remember that you can find most of the content checks in the crawl results. With custom extraction, you can additionally count the number of words in a certain element. For example, you can count how many words are in the product title or product characteristics.
To do this, select the desired id/class, go to extended settings and select “Count Words” in “Action”. You can also count the unique words (to avoid duplicates) and the number of characters. If there is no content on the page at all, but the page responds with a 200 status code, it may be treated as a soft 404.
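The three counts mentioned above (words, unique words and characters) can be sketched in a few lines of Python. How JetOctopus splits text into words is not specified here, so the tokenization below (lowercased runs of letters, digits and underscores) is an assumption, and the `text_stats` helper and sample text are hypothetical.

```python
import re


def text_stats(text: str) -> dict:
    """Count words, unique words (case-insensitive) and characters in a text."""
    words = re.findall(r"\w+", text.lower())
    return {
        "words": len(words),
        "unique_words": len(set(words)),
        "characters": len(text),
    }


# Hypothetical text extracted from a product description element.
stats = text_stats("Red sneakers for men, red laces")
print(stats)
```

In this sample, "Red" and "red" are counted as the same unique word, so the unique word count is lower than the total word count.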
Custom extraction is a flexible tool that allows you to extract all the data you need, including: