May 25, 2023
by Sofia

How to use Regex to filter data?

You may have noticed that all data tables let you filter data with regex rules, also known as regular expressions. In this article, we will discuss the basic rules for using regex and its applications.

What can you use regular expressions for? 

In JetOctopus, you can utilize regex rules to filter data in various reports, including logs, crawl results, the Google Analytics section, and the Google Search Console. Regular expressions can be used to filter a wide range of data, create page segments, and more. Here are some examples of the data that can be filtered using regular expressions.

  • Canonicals
  • Hreflangs
  • Titles, meta descriptions, and headings
  • URLs
  • Anchors
  • Redirect locations
  • HTML Lang, and more.

For instance, you can use the regular expression [0-9]+$ to find all URLs that end with one or more digits.
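To get a feel for this pattern outside JetOctopus, it can be tried with Python's re module. This is only a sketch: the URLs are made up for the example, and re.search is used on the assumption that the filter looks for a match anywhere in the value.

```python
import re

# Hypothetical URLs, used only to illustrate the pattern.
urls = [
    "https://example.com/product/12345",
    "https://example.com/category/shoes",
    "https://example.com/page/2",
    "https://example.com/2023/news",
]

# [0-9]+$ : one or more digits immediately before the end of the string.
pattern = re.compile(r"[0-9]+$")

ends_with_digits = [url for url in urls if pattern.search(url)]
print(ends_with_digits)
```

The last URL contains digits but does not end with them, so it is not selected.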

Regular expressions are powerful tools that allow you to filter data more efficiently and easily. Now, let’s explore how to use regex to filter data in JetOctopus.

How to use regex to filter data in JetOctopus?

To use regex, follow these steps in the required data table.

1. Select the desired filter, such as “Page URL.”

2. Choose whether you want to obtain a list of all URLs that contain a part matching the regular expression (Contain by REGEXP) or exclude pages matching the rule from the results (NOT Contain by REGEXP).

To illustrate, let’s consider an example. Suppose you want to filter all URLs that contain the word “example” followed by any combination of letters or numbers. You can use the regex rule “example[a-zA-Z0-9]+” to filter this.
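The two filter modes can be approximated in Python as follows. This is a sketch under assumptions, not a description of how JetOctopus works internally: the URLs are hypothetical, and "contains" is modeled with re.search.

```python
import re

# Hypothetical URLs for illustration.
urls = [
    "https://site.com/example123",
    "https://site.com/exampleABC/page",
    "https://site.com/example",
    "https://site.com/sample42",
]

# "example" followed by at least one letter or digit.
pattern = re.compile(r"example[a-zA-Z0-9]+")

# Rough equivalent of "Contain by REGEXP": keep values where the pattern is found.
contain = [u for u in urls if pattern.search(u)]

# Rough equivalent of "NOT Contain by REGEXP": keep values where it is not found.
not_contain = [u for u in urls if not pattern.search(u)]

print(contain)
print(not_contain)
```

Note that "https://site.com/example" lands in the second list, because the pattern requires at least one letter or digit after the word "example".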


Starting with simpler expressions and gradually progressing to more complex ones is recommended. Here are some basic regex rules.

Dot (.): The dot is a special character that matches any single character. For example, “exam.le” matches “example,” “exam3le,” and “examble.”

Asterisk (*): Represents zero or more occurrences of the preceding character or expression. For instance, “[0-9]*” means that this part of the URL can contain any number of digits, including none.

Plus (+): Indicates one or more occurrences of the preceding character or expression. For example, “[0-9]+” means that the URL must contain at least one digit.

Question mark (?): Denotes zero or one occurrence of the preceding character or expression. “[0-9]?” means that this part of the URL can contain either zero or one digit.
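The dot and the three quantifiers can be compared side by side in Python. In this sketch, re.fullmatch checks the whole string so the difference between the quantifiers is visible; the "page" prefix and the test strings are hypothetical.

```python
import re

def full(pattern: str, text: str) -> bool:
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

samples = ("page", "page7", "page2023")

results = {
    # Dot: exactly one character of any kind between "exam" and "le".
    "dot": [full(r"exam.le", s) for s in ("example", "exam3le", "examle")],
    # Asterisk: zero or more digits after "page".
    "star": [full(r"page[0-9]*", s) for s in samples],
    # Plus: at least one digit after "page".
    "plus": [full(r"page[0-9]+", s) for s in samples],
    # Question mark: no digit or exactly one digit after "page".
    "question": [full(r"page[0-9]?", s) for s in samples],
}

for name, row in results.items():
    print(name, row)
```

Only the asterisk accepts all three samples: the plus rejects "page" because it has no digit, and the question mark rejects "page2023" because it has more than one.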

Character classes: Placed within square brackets, they define a set or range of characters, any one of which can be matched. “[0-9]” matches any single digit from 0 to 9. “[a-z]” matches any single lowercase letter from a to z.

Pipe (|): Separates alternative characters or groups of characters, functioning as an “or” operator. “2023|2022” means that the string must contain either “2023” or “2022.” One of the two is enough for a match, and a string that contains both also matches.
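A short Python sketch of character classes and the pipe, using made-up strings and paths. re.findall is used for the classes to show that each bracket expression consumes one character at a time.

```python
import re

# Character classes: each bracket expression matches one character from the set.
digits = re.findall(r"[0-9]", "a1b22")
lowercase = re.findall(r"[a-z]", "Ab-Cd")

# Pipe: either alternative is enough for a match.
year = re.compile(r"2023|2022")
paths = ["/blog/2023/post", "/blog/2022/post", "/blog/2021/post"]
matched = [p for p in paths if year.search(p)]

print(digits)
print(lowercase)
print(matched)
```

Uppercase letters and the hyphen are skipped by “[a-z]” because they are outside the range.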

\d: Matches a decimal digit character. “d\d” matches “d1” and “d2” but not “dd.”

$: Indicates the end of a line. If you know how your URLs end, for example with two digits or a slash, you can use $ to anchor the pattern to the end. For example, “example\.[a-z]{2}$” matches “example.ua” and “example.uk” but not “example.com.”

{2}: Specifies that the preceding character or expression must occur exactly twice. “[0-9]{2}” matches “22,” “39,” “00,” etc.
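The three rules above can be checked in Python with the same sample values. This is a sketch: the extra strings “7” and “123” are added here for contrast and are not from the article.

```python
import re

# \d: the letter "d" followed by a decimal digit.
d_digit = [s for s in ("d1", "d2", "dd") if re.search(r"d\d", s)]

# $ combined with {2}: exactly two lowercase letters after the dot, then the end.
tld = re.compile(r"example\.[a-z]{2}$")
two_letter = [s for s in ("example.ua", "example.uk", "example.com") if tld.search(s)]

# {2} checked against the whole string: exactly two digits and nothing else.
two_digits = [s for s in ("22", "39", "00", "7", "123") if re.fullmatch(r"[0-9]{2}", s)]

print(d_digit)
print(two_letter)
print(two_digits)
```

One thing to keep in mind: when a pattern is only required to appear somewhere in the value, “[0-9]{2}” is also found inside longer runs such as “123”. Adding $ or surrounding characters to the pattern is what limits the match to exactly two digits.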

It is important to escape special characters when you want to match them literally. For instance, if your URL contains a dot that you want to filter, use “\.” to specifically match a dot character. Special characters such as {}^$.|*+?/ should be escaped with a backslash, because otherwise they are interpreted according to their special meanings in regex syntax.
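The effect of a missing escape is easy to see in Python. The sketch below uses made-up strings. It also shows re.escape, a Python helper that adds the backslashes for you; whether a similar helper exists in other tools is not assumed here.

```python
import re

candidates = ["example.com", "example-com", "examplexcom"]

# Unescaped dot: it stands for any character, so all three candidates match.
loose = [s for s in candidates if re.search(r"example.com", s)]

# Escaped dot: only a literal "." matches.
strict = [s for s in candidates if re.search(r"example\.com", s)]

# re.escape turns a literal string into a pattern that matches it exactly.
literal = "price?id=1+2"
escaped = re.escape(literal)

print(loose)
print(strict)
print(escaped)
```

Without escaping, “price?id=1+2” would be read as a pattern with an optional “e” and a repeated “1”, and it would not match the literal text at all.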

Examples of using regex for SEO

Example 1: Finding all URLs on a subdomain

“[a-z]+\.exampledomain\.(com|net)” matches “ua.exampledomain.com,” “usa.exampledomain.net,” and “staging.exampledomain.net,” but not “exampledomain.com.”
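The same check can be reproduced in Python with the hostnames from the example. As before, re.search is used on the assumption that the filter looks for the pattern anywhere in the value.

```python
import re

# One or more lowercase letters, a literal dot, then the domain and a .com or .net ending.
subdomain = re.compile(r"[a-z]+\.exampledomain\.(com|net)")

hosts = [
    "ua.exampledomain.com",
    "usa.exampledomain.net",
    "staging.exampledomain.net",
    "exampledomain.com",
]

on_subdomain = [h for h in hosts if subdomain.search(h)]
print(on_subdomain)
```

Two details are worth noting. “www.exampledomain.com” also matches, because “www” is a subdomain made of letters. A subdomain that ends in a digit or hyphen, such as the hypothetical “shop-1.exampledomain.com”, does not match “[a-z]+”; widening the class to “[a-z0-9-]+” would cover it.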

Example 2: Finding nested URLs of the first level

If your site uses nested URLs and you know which characters appear in them, you can use regex to filter URLs by nesting level. For example, if your URLs consist of lowercase letters and hyphens, you can use the following regex to filter first-level URLs:

“\.com\/[a-z-]+$” matches “exampledomain.com/black-top,” “exampledomain.com/black,” and “exampledomain.com/a-top.”
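A Python sketch of this filter shows which URLs fall outside the pattern. The first three URLs come from the example; the last three are hypothetical and added for contrast.

```python
import re

# ".com/", then one path segment of lowercase letters and hyphens, then the end.
first_level = re.compile(r"\.com\/[a-z-]+$")

urls = [
    "exampledomain.com/black-top",
    "exampledomain.com/black",
    "exampledomain.com/a-top",
    "exampledomain.com/black-top/size",  # second level
    "exampledomain.com/black-top/",      # trailing slash
    "exampledomain.com/top-10",          # contains digits
]

matches = [u for u in urls if first_level.search(u)]
print(matches)
```

If your first-level URLs end with a slash or contain digits, the character class and the ending of the pattern need to be adjusted accordingly.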

Example 3: Finding nested second-level URLs

If your URLs contain both letters and numbers, you can use the following regex to filter nested second-level URLs:

“\.com\/[0-9a-z-]+\/[0-9a-z-]+$” matches “exampledomain.com/black-top/size-1,” “exampledomain.com/black-1-size/xxl,” and “exampledomain.com/a-top/2345.”
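Since each nesting level repeats the same segment, the pattern for any depth can be generated. The helper below, level_pattern, is a hypothetical function written for this illustration; for a depth of 2 it produces the same regex as above.

```python
import re

def level_pattern(depth: int, segment: str = r"[0-9a-z-]+") -> str:
    """Build a regex for URLs with exactly `depth` path segments after .com."""
    return r"\.com" + (r"\/" + segment) * depth + "$"

second_level = level_pattern(2)
print(second_level)

urls = [
    "exampledomain.com/black-top/size-1",
    "exampledomain.com/black-1-size/xxl",
    "exampledomain.com/a-top/2345",
    "exampledomain.com/black-top",  # first level only
    "exampledomain.com/a/b/c",      # third level
]

matches = [u for u in urls if re.search(second_level, u)]
print(matches)
```

The generated string can be pasted into a regex filter as is, which saves retyping the segment for deeper levels.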

Example 4: Finding all titles with months of the year

If you want to highlight titles, URLs, or H1 tags that contain the month of the year (or cities, states and so on), use the following regex:

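“.*(january|february|march|april|may|june|july|august|september|october|november|december)” matches titles such as “Best offers in January”, “2 bestsellers in March”, and so on. If matching turns out to be case-sensitive in your report, write the month names the way they are capitalized in your data, for example “(January|February|March)”.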
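In Python, this pattern can be tested as shown below. Two points are assumptions of the sketch rather than statements about JetOctopus: Python matches case-sensitively by default, so re.IGNORECASE is passed to match the capitalized titles, and the stricter variant uses \b word boundaries, which may not be available in every regex flavor. The last two titles are made up for contrast.

```python
import re

months = ("january|february|march|april|may|june|july|"
          "august|september|october|november|december")

# The pattern from the article, compiled case-insensitively.
article_pattern = re.compile(rf".*({months})", re.IGNORECASE)

# Stricter variant: the month must be a whole word.
strict_pattern = re.compile(rf"\b({months})\b", re.IGNORECASE)

titles = [
    "Best offers in January",
    "2 bestsellers in March",
    "Maybe the best deals",
    "Winter sale",
]

loose = [t for t in titles if article_pattern.search(t)]
strict = [t for t in titles if strict_pattern.search(t)]

print(loose)
print(strict)
```

The loose pattern also picks up “Maybe the best deals”, because “may” appears inside “Maybe”. Short month names are the usual source of such false positives, so it is worth reviewing the results before building a segment on them.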

In conclusion, regular expressions provide a powerful way to filter and manipulate data in JetOctopus. By understanding the basic rules and applying them effectively, you can extract valuable insights and streamline your data analysis process.

About Sofia
Technical SEO specialist. Sofia has almost 10 years of experience, the last 5 of them in JavaScript SEO. She is convinced that SEO is a very technical part of digital marketing and that effective SEO is impossible without logs and in-depth data analysis.
