Concept | Prepare recipe
The Prepare recipe is a visual recipe in Dataiku that allows you to create data cleansing, normalization, and enrichment scripts in an interactive way.
To prepare your data, you must add steps to the recipe script.
Using the processor library
An essential advantage of the Prepare recipe is its library of more than 100 data processors. Most processors are designed to handle one specific task, such as filtering rows, rounding numbers, extracting regular expressions, concatenating or splitting columns, and much more.
Processors let you perform a wide variety and combination of tasks. One processor, for example, lets you write expressions in a Formula language similar to what you might find in a spreadsheet program. You can use it to create new columns from those already present, drawing on a range of built-in functions.
Another processor even lets you write a Python function that is applied to each row.
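As a rough illustration of what a row-level Python function can look like, the sketch below takes one row as a dictionary and returns the value for a new column. The function name `process` and the column names `first_name` and `last_name` are assumptions made for this example; check the processor's documentation for the exact signature and modes your Dataiku version expects.

```python
def process(row):
    """Return a display name built from two columns of a single row.

    `row` is assumed to behave like a dict mapping column names to cell values.
    """
    # Treat missing or empty cells as empty strings before cleaning them up.
    first = (row.get("first_name") or "").strip()
    last = (row.get("last_name") or "").strip()
    # Join only the non-empty parts, then capitalize each word.
    return " ".join(part for part in (first, last) if part).title()


# Local check outside Dataiku, using plain dictionaries as stand-in rows.
sample_rows = [
    {"first_name": " ada ", "last_name": "lovelace"},
    {"first_name": "grace", "last_name": None},
]
print([process(r) for r in sample_rows])
```

Testing the function on a few hand-written dictionaries like this is a convenient way to check the logic before pasting it into a step.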
In addition to directly adding steps from the processor library, you can add steps to the script in a number of other ways.
Using the Analyze window
Another way to add steps to the script is through the Analyze window. Within a Prepare recipe, the Analyze window can guide data preparation, for example by helping you merge categorical values.
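To make the idea of merging categorical values concrete, here is a small plain-Python sketch. It is not Dataiku's implementation; it only shows the effect of mapping several spellings of a category to one canonical value. The helper name `merge_categories` and the sample values are hypothetical.

```python
def merge_categories(values, mapping):
    """Replace each value by its canonical form when one is defined in `mapping`."""
    return [mapping.get(v, v) for v in values]


# Hypothetical variants of the same category, as they might appear in a column.
country_mapping = {
    "USA": "United States",
    "U.S.": "United States",
    "US": "United States",
}
countries = ["USA", "France", "U.S.", "US", "Germany"]

# Values without an entry in the mapping are left as they are.
merged = merge_categories(countries, country_mapping)
print(merged)
```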
Manually moving the columns
You can also directly drag columns to adjust their order, or switch from the Table view to the Columns view to apply certain steps to more than one column at a time.
When adding new steps to the script, you’ll notice how the step output is immediately visible. This is possible because the step is being applied to the same sample of the dataset found in the Explore tab. The quick feedback allows you to work incrementally, quickly modifying your transformation steps.
Note that the steps in the script constitute a list of instructions. These instructions are not immediately applied to the dataset itself. For example, adding a Delete Column step removes that column from the step preview, but it does not actually delete the column in the dataset, as it would in a spreadsheet. Only when you choose to run the recipe will Dataiku execute the instructions on the full input dataset, and thereby produce a new output dataset. The original input dataset always remains unchanged.
If a script starts to grow in complexity, a number of features can help you manage it.
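The distinction between previewing a script and running it can be sketched in plain Python. This is a conceptual model only, not how Dataiku is implemented: the helpers `delete_column`, `round_column`, and `apply_script`, and the sample data, are all hypothetical. The script is a list of instructions, the preview applies it to a sample, and a run applies it to the full input to build a separate output.

```python
def delete_column(name):
    """Build a step that drops column `name` from every row."""
    def step(rows):
        return [{k: v for k, v in row.items() if k != name} for row in rows]
    return step


def round_column(name, digits):
    """Build a step that rounds the numeric column `name` to `digits` decimals."""
    def step(rows):
        return [{**row, name: round(row[name], digits)} for row in rows]
    return step


def apply_script(script, rows):
    """Apply each step in order and return new rows; the input list is not modified."""
    for step in script:
        rows = step(rows)
    return rows


input_dataset = [
    {"id": 1, "price": 10.456, "notes": "a"},
    {"id": 2, "price": 3.14159, "notes": "b"},
    {"id": 3, "price": 7.98, "notes": "c"},
]

# The script is only a list of instructions; building it changes nothing.
script = [delete_column("notes"), round_column("price", 1)]

# Preview: apply the script to a small sample for quick feedback.
sample = input_dataset[:2]
preview = apply_script(script, sample)

# Run: apply the same script to the full input, producing a separate output.
output_dataset = apply_script(script, input_dataset)

print(preview)
print("notes" in input_dataset[0])  # the input still has its column
```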
You can disable steps.
You can organize individual steps into groups of steps.
You can add colors and comments to steps in order to send reminders to yourself and colleagues.
You can also copy and paste steps within the same recipe or to another recipe, even if that recipe is in another project or on another Dataiku instance.
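As a conceptual sketch of why disabling, grouping, and commenting help, the plain-Python model below attaches an `enabled` flag, a `group` label, and a `comment` to each step, and skips disabled steps when the script is applied. The `Step` class and `run_enabled` function are hypothetical names for this illustration and do not reflect Dataiku's internal representation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    func: Callable[[List[dict]], List[dict]]
    enabled: bool = True   # a disabled step stays in the script but is skipped
    group: str = ""        # label used to organize related steps
    comment: str = ""      # reminder for yourself or colleagues


def run_enabled(steps, rows):
    """Apply only the enabled steps, in order, and return the resulting rows."""
    for step in steps:
        if step.enabled:
            rows = step.func(rows)
    return rows


rows = [{"name": "ada"}, {"name": ""}]

steps = [
    Step(
        "Remove empty names",
        lambda rs: [r for r in rs if r["name"]],
        group="Cleaning",
        comment="Confirm with the data owner that empty names can be dropped.",
    ),
    Step(
        "Uppercase names",
        lambda rs: [{**r, "name": r["name"].upper()} for r in rs],
        enabled=False,
        group="Formatting",
    ),
]

with_step_disabled = run_enabled(steps, rows)

# Re-enabling the step brings it back without having to rewrite it.
steps[1].enabled = True
with_step_enabled = run_enabled(steps, rows)

print(with_step_disabled, with_step_enabled)
```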