Concept: Regular Expressions in Dataiku

In this lesson, we’ll look at:

  • what regular expressions are,

  • when they can be used in Dataiku,

  • and how Dataiku’s smart pattern builder can assist in crafting regular expressions that achieve your tasks.

String Data

String data is often messy. You may not have perfectly organized categories. Or you may be searching for very specific text attributes within a large corpus of natural language.

Regular expressions often play a key role in allowing you to derive value from this kind of data.

A regular expression is a sequence of characters that defines a search pattern, which you can use to find or extract components of string data.

The idea is to define a pattern that matches certain sequences of characters, and then use the matches found by that pattern for some operation, such as filtering or flagging rows.
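To make the idea concrete outside of Dataiku, here is a minimal Python sketch using the standard `re` module. The sample rows, the `comment` column, and the `has_order_ref` flag are all hypothetical names invented for illustration.

```python
import re

# Hypothetical sample rows; the "comment" column name is an assumption.
rows = [
    {"id": 1, "comment": "Order #4521 arrived late"},
    {"id": 2, "comment": "Great service, thank you"},
    {"id": 3, "comment": "Refund for order #88 please"},
]

# Pattern: a '#' followed by one or more digits.
order_ref = re.compile(r"#[0-9]+")

# Flag: add a boolean column recording whether the pattern matched.
for row in rows:
    row["has_order_ref"] = order_ref.search(row["comment"]) is not None

# Filter: keep only the rows where the pattern matched.
matching_rows = [row for row in rows if row["has_order_ref"]]
matching_ids = [row["id"] for row in matching_rows]
print(matching_ids)
```

The same pattern drives both operations: flagging keeps every row and records the result, while filtering discards the rows that did not match.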

Slide depicting how a regular expression can extract text.

Regular Expression Components

Consider that, within a trove of text data, you might want to build a pattern around:

  • Anchors, like the start or end of a string;

  • Classes of characters, like digits, word characters, or whitespace;

  • Groups and ranges;

  • And quantifiers, which specify how many times a character or group should repeat.

Slide depicting common components in a regular expression.
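As a rough illustration of these four kinds of components, the following Python sketch shows one small example of each. The sample strings are invented, and the syntax shown is that of Python’s `re` module, which may differ in details from the regular expression flavor used by a given Dataiku processor.

```python
import re

# Anchors: ^ matches the start of the string, $ matches the end.
starts_with_order = re.search(r"^Order", "Order 66 shipped") is not None
ends_with_shipped = re.search(r"shipped$", "Order 66 shipped") is not None

# Character classes: \d is a digit, \s is whitespace, \w is a word character.
digits = re.findall(r"\d", "Room 42B")

# Groups and ranges: parentheses capture a group, [A-C] is a character range.
seat = re.search(r"([A-C])(\d)", "Seat B7")
seat_parts = seat.groups()

# Quantifiers: {4} means exactly four repetitions (+ is one or more, * is zero or more).
years = re.findall(r"\b\d{4}\b", "Founded 1998, renamed in 2015, 30 staff")

print(digits, seat_parts, years)
```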

For example, if we wanted to extract mentions of iPhone models written in a variety of ways (such as “iphone 6s” or “iPhone X”), we might use a regular expression that looks like this: (i[pP]hone\ *[0-9]*[sS]*[xX]*).
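To see what this pattern picks up, here is a short Python sketch that runs it over an invented sentence. The sample text is an assumption; only the pattern itself comes from the lesson.

```python
import re

# The pattern from the lesson: "i", then "p" or "P", then "hone",
# followed by optional spaces, digits, "s"/"S", and "x"/"X".
iphone = re.compile(r"(i[pP]hone\ *[0-9]*[sS]*[xX]*)")

# Hypothetical sample text.
text = "Traded my iphone 6s for an iPhone X, then an iPhone11."

models = iphone.findall(text)
print(models)
```

Because every part after “hone” is optional, the pattern also matches a bare “iPhone” and can capture a trailing space when no model number follows, so trimming the results may be useful. It also requires a lowercase “i”, so a spelling such as “IPhone” would not match.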

Regular Expressions within Dataiku

Regular expressions can be used in many places within Dataiku, particularly in the Prepare recipe.

For example, some processor steps, such as those for transforming strings or removing empty values, can be applied to multiple columns using a regular expression. Instead of having to manually choose all of the columns to which a step should apply, the step will be applied to any column matching the regular expression pattern.

Slide depicting Prepare recipe processors that can be applied to columns based on a regular expression pattern.
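The idea of selecting columns by pattern can be sketched in plain Python. This is only an analogy, not Dataiku’s implementation: the record, the column names, and the choice to match the whole column name are assumptions made for the example.

```python
import re

# Hypothetical record; column names and values are assumptions.
record = {"name_first": " Ada ", "name_last": " Lovelace", "age": "36"}

# Select every column whose name starts with "name_".
column_pattern = re.compile(r"^name_.*")
selected = [col for col in record if column_pattern.fullmatch(col)]

# Apply one string transformation (trimming whitespace) to just those columns.
cleaned = {
    col: value.strip() if col in selected else value
    for col, value in record.items()
}
print(selected, cleaned)
```

If a new column such as `name_middle` were added later, the same pattern would pick it up without any change to the step.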

Another processor makes it easy to count the occurrences of matches to a regular expression pattern. This example uses the previous pattern to count iPhone mentions.

Slide depicting a Prepare recipe processor that counts the occurrences of a regular expression pattern.
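Counting matches per row can be approximated in Python as follows. The review texts are invented, and this sketch only mirrors the concept rather than the processor’s actual behavior.

```python
import re

# The iPhone pattern from earlier in the lesson.
iphone = re.compile(r"(i[pP]hone\ *[0-9]*[sS]*[xX]*)")

# Hypothetical text column, one string per row.
reviews = [
    "Upgraded from an iPhone 6 to an iPhone X.",
    "No complaints so far.",
    "iphone 5s battery is weak",
]

# One count per row: the number of non-overlapping matches in that row.
counts = [len(iphone.findall(review)) for review in reviews]
print(counts)
```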

Smart Pattern Builder

This still leaves us, though, with the challenge of crafting the correct regular expression. For that, we can turn to the smart pattern builder, available in the “Extract with regular expression” processor.

While in a Prepare recipe, either add the “Extract with regular expression” step from the processor library or directly begin highlighting examples of the text you want to extract.

Dataiku screenshot of a Prepare recipe with text highlighted showing the option to extract text like this.

Choosing the “Extract text like” option makes the highlighted text the first selection for Dataiku’s smart pattern builder.

From this selection, Dataiku suggests possible regular expressions that can match it. Select more examples of the text you wish to extract, and the suggested regular expressions will update.

Dataiku screenshot of the smart pattern builder dialog.
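To build intuition for why adding examples changes the suggestions, the sketch below checks a few hand-written candidate patterns against a set of examples. It is not how Dataiku generates its suggestions, which this lesson does not describe; the examples and candidate patterns are hypothetical.

```python
import re

# Hypothetical examples a user might highlight.
examples = ["iPhone 6", "iphone 5s"]

# Hypothetical candidate patterns, from narrow to broad.
candidates = [
    r"iPhone [0-9]",
    r"i[pP]hone [0-9]",
    r"i[pP]hone [0-9]+[sS]?",
]

# A candidate is kept only if it matches every example in full.
surviving = [
    pattern
    for pattern in candidates
    if all(re.fullmatch(pattern, example) for example in examples)
]
print(surviving)
```

With only the first example, all three candidates would survive. Adding the second example rules out the two narrower ones, which is the kind of refinement that extra selections make possible.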

This particular pattern may be too narrow depending on our objective, but now that you’ve seen the basics, you can start experimenting with regular expressions on your own!