Tutorial | Smart pattern builder for string pattern extraction¶
Regular expressions, or regex, are character sequences arranged in a pattern which can be very useful for finding, extracting, and managing sets of strings in a dataset that correspond to a specific pattern.
In order to use them, however, one needs to craft the exact expression that will extract all the matches (and only those matches) that correspond to a pattern.
Dataiku’s smart pattern builder makes it easier to formulate regular expressions by dynamically generating suggested regular expressions based on text selections you provide in the interactive tool. Let’s see how it works in practice.
You can also find a short video walkthrough of this feature in the video below.
Some familiarity with basic data preparation in Dataiku.
An instance of Dataiku - version 9.0 and above.
Download this CSV file of airline reviews and upload it to any Dataiku project.
Our goal is to identify rows that mention Boeing plane models. Because we don’t already have a neat categorical column of plane models, or know which models might exist, we want to use a regular expression to find any mentions of Boeing models. Once we extract this pattern, we can easily filter for these rows.
Upload the airline_reviews CSV file into a Dataiku project.
From this dataset, create a Prepare recipe.
Drag to expand the width of the content column, where the natural language data is found, so that Boeing 787 is visible in the first row.
We don’t know exactly what regular expression to write to capture all of the possible variations of Boeing models, and so let’s use the smart pattern builder.
In the first row of the content column, find
Boeing 787, and highlight it.
From the displayed options, select Extract text like Boeing 787.
Another way to create the same step would be to add an Extract with regular expression step from the processor library. Choose a column, and then select “Generate with Smart Pattern”.
The smart pattern window pops open. Under Detected patterns, it displays a list of suggested regular expressions based on the current selections. From just this one selection, however, the first suggestion is too broad. It only matches some amount of characters before the number ‘787’.
The Match rate bar shows you what percentage of the rows in the current sample contain substrings that correspond to the suggested regex pattern.
Let’s add more examples of what we’re looking for and see how the suggested patterns change.
Click Wrap lines to more easily read each entry in the Input sample.
Use the Cmd/Ctrl+F on your browser to find
Boeing 777. Highlight it to add a second selection.
Do the same for
The (Boeing[a-zA-Z0-9_]+) regex now appears at the top and has been ranked as the most likely matching pattern.
Scroll through the Input sample to see what the currently suggested regular expression selects and does not select.
When satisfied that this pattern matches the strings we are after, click OK.
The reason why (Boeing[a-zA-Z0-9_]+) is a good fit for the use case in this example and ([A-Z][a-z]+[0-9]+) is not is that the latter extracts any combination of letters followed by any combination of numbers, while the former extracts any occurrence of “Boeing” followed by any combination of letters and/or numbers.
Although the smart pattern builder helps users find appropriate regular expressions even if they have a limited knowledge of regex syntax, in order to verify if the suggestions truly satisfy one’s use case, it’s helpful to understand the basics of regex and/or use regex testers/explainers such as this one.
Clicking OK creates a new step in the script which takes input from the content column and uses the selected regex to extract occurrences of the word “Boeing” followed by a string of letters and/or numbers. It stores the occurrences in a column named by default content_extracted_1.
Once you have chosen a regular expression using the smart pattern builder, there are a few other settings that can be configured within this processor, such as extracting all occurrences within each row, and not just the first one, by activating the Extract all occurrences checkbox.
Select the option to Create a special ‘found’ column. This will make our goal of filtering based on matches very easy!