How-To: Extract Patterns With the Smart Pattern Builder
Regular expressions, or regex, are character sequences that define a search pattern. They are very useful for finding, extracting, and managing the strings in a dataset that match that pattern. To use them, however, you need to know (or find) the exact expression that extracts every match you are looking for.
The smart pattern builder makes it easier to find and formulate regular expressions by automatically suggesting expressions that extract text similar to a sample you select. Let’s see how it works in practice.
Prerequisites
Some familiarity with basic data preparation in Dataiku DSS (we recommend completing the Core Concepts course series beforehand).
An instance of Dataiku DSS, version 9.0 or above.
Detect Regex Patterns Automatically
To detect a pattern with the smart pattern builder, you first need to find an example of a substring that matches the pattern and highlight it.
Then, from the displayed options, you need to select Extract text like [the substring you highlighted].
In this example, we want to identify all mentions of Boeing plane models in the flight reviews and extract them in a new column. To do this:
In the first row of the content column, we find Boeing 787 and highlight it.
From the displayed options, we select Extract text like Boeing 787.
The smart pattern window opens. Under Detected patterns, it lists suggested regular expressions that extract substrings similar to the one you highlighted. The Match rate bar shows the proportion of rows that contain a substring matching the suggested regex pattern.
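To make the idea of a match rate concrete, here is a minimal Python sketch that computes the share of rows containing at least one match. It is not how Dataiku DSS computes the bar internally. The `match_rate` helper and the sample rows are hypothetical, and the pattern includes a `\s` between "Boeing" and the character class, which is an assumption made so that the pattern can match text such as "Boeing 787".

```python
import re


def match_rate(pattern: str, rows: list[str]) -> float:
    """Return the fraction of rows that contain at least one match."""
    if not rows:
        return 0.0
    compiled = re.compile(pattern)
    hits = sum(1 for row in rows if compiled.search(row))
    return hits / len(rows)


# Hypothetical review snippets, invented for illustration.
sample_rows = [
    "Flew on a Boeing 787 last week.",
    "Legroom on the Boeing Max was tight.",
    "Friendly crew, but the flight left late.",
    "Great food and service.",
]

# Assumed pattern: the \s is added so the space after "Boeing" can match.
rate = match_rate(r"Boeing\s[a-zA-Z0-9_]+", sample_rows)
print(f"{rate:.0%}")  # prints 50%
```

A pattern with a high match rate is not necessarily the right one: a very general expression can match most rows while extracting the wrong text.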
In the Input sample section, you can highlight additional examples in order to help the smart pattern builder generate a better pattern suggestion.
Under Input sample, we find and highlight Boeing Max as well, in order to feed the smart pattern builder some additional examples.
The (Boeing[a-zA-Z0-9_]+) regex now appears at the top of the list, ranked as the most likely matching pattern. We can now select it with reasonable confidence.
(Boeing[a-zA-Z0-9_]+) is a good fit for this use case and ([A-Z][a-z]+[0-9]+) is not because the latter extracts any capitalized word followed by digits, whatever that word is, while the former extracts only occurrences of “Boeing” followed by any combination of letters, digits, or underscores.
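The difference between the two suggestions can be checked outside DSS with a short Python sketch. As printed, neither pattern contains a whitespace token, so neither would match "Boeing 787" written with a space. The sketch therefore assumes variants with `\s` inserted. That is an assumption for illustration, not necessarily what the builder generates, and the review texts and the `first_match` helper are hypothetical.

```python
import re

# Assumed variants of the two suggestions, with \s added so that the
# space between the word and what follows it can match.
BOEING = re.compile(r"(Boeing\s[a-zA-Z0-9_]+)")
GENERIC = re.compile(r"([A-Z][a-z]+\s[0-9]+)")


def first_match(pattern: re.Pattern, text: str):
    """Return the first captured substring, or None if nothing matches."""
    found = pattern.search(text)
    return found.group(1) if found else None


# Hypothetical review snippets, invented for illustration.
reviews = [
    "Flight 370 was operated by a Boeing 787.",
    "The Boeing Max had more legroom than I expected.",
]

for text in reviews:
    print(first_match(BOEING, text), "|", first_match(GENERIC, text))
# prints:
# Boeing 787 | Flight 370
# Boeing Max | None
```

The generic pattern picks up "Flight 370", which is not a plane model, and misses "Boeing Max" because "Max" contains no digits. The Boeing-specific pattern handles both rows.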
The smart pattern builder helps users find appropriate regular expressions even if they have limited knowledge of regex syntax. To verify that a suggestion truly satisfies your use case, however, it helps to understand the basics of regex or to use a regex tester or explainer.
Once the pattern is selected, a new step is created in the script. The step takes the content column as input and uses the selected regex to extract occurrences of the word “Boeing” followed by a string of letters and/or numbers. By default, it stores the occurrences in a column named content_extracted_1.
You can choose to rename the prefix of the output column. Additionally, you can choose to extract all occurrences within each row, and not just the first one, by activating the Extract all occurrences checkbox.
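The difference between extracting the first occurrence and extracting all occurrences can be sketched in plain Python. This only illustrates the behavior described above and is not the DSS processor itself. The `extract` helper, its parameters, and the sample row are hypothetical. The pattern again assumes a `\s` after "Boeing", and storing multiple matches as a list is also an assumption, since the format DSS uses is not specified here.

```python
import re

# Assumed pattern: \s added so the space after "Boeing" can match.
PATTERN = re.compile(r"(Boeing\s[a-zA-Z0-9_]+)")


def extract(rows, column="content", prefix="content_extracted_",
            all_occurrences=False):
    """Return copies of the rows with an added extraction column."""
    output = []
    for row in rows:
        text = row[column]
        if all_occurrences:
            # findall returns every captured substring, in order.
            value = PATTERN.findall(text)
        else:
            # search stops at the first match in the row.
            found = PATTERN.search(text)
            value = found.group(1) if found else None
        output.append({**row, f"{prefix}1": value})
    return output


# Hypothetical row, invented for illustration.
rows = [{"content": "Swapped from a Boeing 787 to a Boeing 777 at the gate."}]

print(extract(rows)[0]["content_extracted_1"])
# prints: Boeing 787
print(extract(rows, all_occurrences=True)[0]["content_extracted_1"])
# prints: ['Boeing 787', 'Boeing 777']
```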