You may have noticed that every data table lets you filter data with regular expressions, also known as regex. In this article, we cover the basic regex rules and show how to apply them.
In JetOctopus, you can use regex rules to filter data in various reports, including logs, crawl results, the Google Analytics section, and Google Search Console. Regular expressions can be used to filter a wide range of data, create page segments, and more.
For instance, you can use the regular expression [0-9]+$ to find all URLs that end with one or more digits.
Regular expressions are powerful tools that allow you to filter data more efficiently and easily. Now, let’s explore how to use regex to filter data in JetOctopus.
To use regex, follow these steps in the required data table.
1. Select the desired filter, such as “Page URL.”
2. Choose whether you want to obtain a list of all URLs that contain a part matching the regular expression (Contain by REGEXP) or exclude pages matching the rule from the results (NOT Contain by REGEXP).
To illustrate, let’s consider an example. Suppose you want to find all URLs that contain the word “example” followed by one or more letters or digits. You can use the regex rule “example[a-zA-Z0-9]+” for this.
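Outside JetOctopus, the two filter modes can be approximated with a short Python sketch. The sample URLs and the helper names contain_by_regexp and not_contain_by_regexp are hypothetical, and Python's re module may differ in details from the regex engine JetOctopus uses.

```python
import re

# Hypothetical sample URLs, used only for illustration.
urls = [
    "https://site.test/example123",
    "https://site.test/examples/list",
    "https://site.test/example",
    "https://site.test/about",
]

pattern = re.compile(r"example[a-zA-Z0-9]+")


def contain_by_regexp(items, rx):
    """Keep items in which some part matches the expression."""
    return [item for item in items if rx.search(item)]


def not_contain_by_regexp(items, rx):
    """Keep items in which no part matches the expression."""
    return [item for item in items if not rx.search(item)]


matched = contain_by_regexp(urls, pattern)
excluded = not_contain_by_regexp(urls, pattern)
print(matched)
print(excluded)
```

The URL ending in “example” lands in the excluded list because the plus sign requires at least one letter or digit after the word.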
Starting with simpler expressions and gradually progressing to more complex ones is recommended. Here are some basic regex rules.
Dot (.): The dot is a special character that matches any single character. For example, “exam.le” matches “example,” “exam3le,” and “examble.”
Asterisk (*): Represents zero or more occurrences of the preceding character or expression. For instance, “[0-9]*” matches a run of digits of any length, including no digits at all.
Plus (+): Indicates one or more occurrences of the preceding character or expression. For example, “[0-9]+” requires at least one digit.
Question mark (?): Denotes zero or one occurrence of the preceding character or expression. “[0-9]?” matches either no digit or exactly one digit.
Character classes: Placed within square brackets, they list the characters that can match at a single position. “[0-9]” matches any single digit from 0 to 9. “[a-z]” matches any single lowercase letter from a to z.
Pipe (|): Separates alternative characters or groups of characters, functioning as an “or” operator. “2023|2022” matches any string that contains “2023” or “2022.” One of the two is enough, and a string that contains both also matches.
\d: Matches a decimal digit character. “d\d” matches “d1” and “d2” but not “dd.”
$: Indicates the end of a line. Use it when you know how your URLs end, for example with two digits or with a slash. For example, “example\.[a-z]{2}$” matches “example.ua” and “example.uk” but not “example.com.”
{2}: Specifies that the preceding character or expression must occur exactly twice. “[0-9]{2}” matches “22,” “39,” “00,” etc.
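The basic rules above can be checked with a small Python table of patterns and sample strings. This is a minimal sketch: the “page” samples are invented for illustration, and re.search is used here to imitate a “contains” style filter, which is an assumption about how the JetOctopus filter behaves.

```python
import re

# (pattern, sample string, expected result of a "contains" search)
checks = [
    (r"exam.le", "exam3le", True),       # dot: any single character
    (r"exam.le", "examle", False),       # dot needs exactly one character
    (r"page[0-9]*$", "page", True),      # asterisk: zero digits is fine
    (r"page[0-9]+$", "page", False),     # plus: at least one digit required
    (r"page[0-9]+$", "page42", True),
    (r"page[0-9]?$", "page7", True),     # question mark: zero or one digit
    (r"page[0-9]?$", "page77", False),
    (r"2023|2022", "sale-2022", True),   # pipe: either alternative
    (r"d\d", "d1", True),                # \d: a decimal digit
    (r"d\d", "dd", False),
    (r"example\.[a-z]{2}$", "example.ua", True),    # $ and {2}
    (r"example\.[a-z]{2}$", "example.com", False),
]

results = [bool(re.search(p, s)) for p, s, _ in checks]
for (p, s, expected), got in zip(checks, results):
    print(f"{p!r:28} {s!r:14} -> {got} (expected {expected})")
```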
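It is important to escape special characters whenever you want to match them literally. For instance, if your URL contains a dot that you want to filter by, use “\.” to match the dot character itself. Characters such as {}^$.|*+? as well as parentheses, square brackets, and the backslash have special meanings in regex syntax, so put a backslash in front of them. The forward slash (/) is not special in most regex engines, but escaping it as “\/” is harmless.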
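The effect of escaping can be seen in a short Python sketch. The URL below is a made-up example, and re.escape is a Python helper, not a JetOctopus feature; it is shown only as a convenient way to produce an escaped rule that you could then paste into a filter.

```python
import re

url = "https://exampledomain.com/price?id=1"

# Unescaped dot: "." matches any character, so "X" is accepted too.
loose = re.search(r"exampledomain.com", "exampledomainXcom")

# Unescaped "?": it makes the preceding "e" optional instead of
# matching a literal question mark, so the literal text is not found.
unescaped = re.search(r"price?id", url)

# Escaped by hand: "\?" matches a literal question mark.
escaped = re.search(r"price\?id", url)

# re.escape adds the backslashes for a literal string automatically.
auto = re.escape("exampledomain.com/price?id")
print(auto)
print(bool(loose), bool(unescaped), bool(escaped))
```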
Example 1: Finding all URLs on a subdomain
“[a-z]+\.exampledomain\.(com|net)” matches “ua.exampledomain.com,” “usa.exampledomain.net,” and “staging.exampledomain.net,” but not “exampledomain.com.”
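A quick way to sanity-check this rule before using it in a filter is to run it over a few host names in Python. This is only a sketch: the list of hosts is taken from the example above, and Python's re module is assumed to behave like the JetOctopus engine for this simple pattern.

```python
import re

subdomain_rule = re.compile(r"[a-z]+\.exampledomain\.(com|net)")

hosts = [
    "ua.exampledomain.com",
    "usa.exampledomain.net",
    "staging.exampledomain.net",
    "exampledomain.com",
]

# Keep only hosts in which some part matches the rule.
on_subdomain = [h for h in hosts if subdomain_rule.search(h)]
print(on_subdomain)
```

Note that “www.exampledomain.com” also satisfies this rule, because “www” is itself a run of lowercase letters followed by a dot.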
Example 2: Finding nested URLs of the first level
If your site uses nested URLs and you know which characters appear in them, you can use regex to filter URLs by nesting level. For example, if your URLs consist of lowercase letters and hyphens, you can use the following regex to filter nested first-level URLs:
“\.com\/[a-z-]+$” matches “exampledomain.com/black-top,” “exampledomain.com/black,” and “exampledomain.com/a-top.”
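The following Python sketch shows which URLs this rule keeps and which it drops. The three extra URLs at the end of the list are hypothetical and were added to show the limits of the rule.

```python
import re

first_level = re.compile(r"\.com\/[a-z-]+$")

urls = [
    "exampledomain.com/black-top",
    "exampledomain.com/black",
    "exampledomain.com/a-top",
    "exampledomain.com/black-top/size",  # second level: dropped
    "exampledomain.com/black-top/",      # trailing slash: dropped
    "exampledomain.com/top-10",          # contains digits: dropped
]

first_level_urls = [u for u in urls if first_level.search(u)]
print(first_level_urls)
```

If your first-level URLs end with a slash or contain digits, the character class and the ending of the rule need to be adjusted accordingly.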
Example 3: Finding nested second-level URLs
If your URLs contain both letters and numbers, you can use the following regex to filter nested second-level URLs:
“\.com\/[0-9a-z-]+\/[0-9a-z-]+$” matches “exampledomain.com/black-top/size-1,” “exampledomain.com/black-1-size/xxl,” and “exampledomain.com/a-top/2345.”
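The same idea extends to deeper levels by repeating the segment part of the rule. The sketch below uses a hypothetical helper, level_rule, that builds the rule for a given depth in Python; it is an illustration of the pattern, not something JetOctopus provides.

```python
import re


def level_rule(depth, segment=r"[0-9a-z-]+"):
    """Build a rule for URLs with exactly `depth` path segments after .com."""
    return re.compile(r"\.com" + (r"\/" + segment) * depth + "$")


second_level = level_rule(2)
print(second_level.pattern)

urls = [
    "exampledomain.com/black-top/size-1",
    "exampledomain.com/black-1-size/xxl",
    "exampledomain.com/a-top/2345",
    "exampledomain.com/black-top",   # first level: dropped
    "exampledomain.com/a/b/c",       # third level: dropped
]

second_level_urls = [u for u in urls if second_level.search(u)]
print(second_level_urls)
```

Calling level_rule(2) produces the same expression as the one shown above, so the generated pattern can be copied into a filter as is.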
Example 4: Finding all titles with months of the year
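If you want to find titles, URLs, or H1 tags that contain the name of a month (or of a city, a state, and so on), use the following regex:
“.*(january|february|march|april|may|june|july|august|september|october|november|december)” matches titles such as “Best offers in January” and “2 bestsellers in March,” provided the filter ignores letter case. If matching turns out to be case-sensitive, add the capitalized spellings as alternatives, for example “(January|january).”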
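Letter case matters for this rule, as a short Python sketch shows. Python's re module is case-sensitive by default and is used here only for illustration; whether a JetOctopus filter behaves the same way is something to verify in the interface. The title “Winter sale” is a made-up negative example.

```python
import re

months = ("january|february|march|april|may|june|july|august|"
          "september|october|november|december")

titles = ["Best offers in January", "2 bestsellers in March", "Winter sale"]

# Case-sensitive search: lowercase month names do not match "January".
case_sensitive = [t for t in titles if re.search(f"({months})", t)]

# Case-insensitive search, here switched on with re.IGNORECASE.
case_insensitive = [
    t for t in titles if re.search(f"({months})", t, re.IGNORECASE)
]

print(case_sensitive)
print(case_insensitive)
```

Keep in mind that short alternatives such as “may” also match inside longer words like “mayor,” so review the filtered results before relying on them.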
In conclusion, regular expressions provide a powerful way to filter and manipulate data in JetOctopus. By understanding the basic rules and applying them effectively, you can extract valuable insights and streamline your data analysis process.