Prepare data#
Once you have a sense of the dataset, you’re ready to start cleaning it and creating new features using a Prepare recipe.
Important
Recipes in Dataiku contain the transformation steps, or processing logic, that act upon datasets. They can be visual or code-based (Python, R, SQL, etc.).
From the job_postings dataset, click the Actions button.
From the menu of visual recipes, select Prepare.
Click Create Recipe, accepting the default output name and storage location.
Add a step from the processor library#
You can add steps to the empty script of the Prepare recipe in many different ways. Let’s start with the processor library, the complete collection of available data transformation steps.
Click + Add a New Step to open the processor library.
Filter for Split / Extract, and select the Split column processor.
After adding the step to the script, provide location as the column and a comma (,) as the delimiter.
Click Truncate, and set the maximum number of columns to keep to 3.
Observe the preview output columns highlighted in blue.
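To make the logic of this step concrete, here is a minimal plain-Python sketch of splitting a column on a delimiter and keeping at most three output columns. It is not Dataiku code: the function name split_column and the sample row are hypothetical, and it assumes that Truncate keeps the first chunks and discards the rest, and that missing chunks are left empty.

```python
def split_column(row, column, delimiter, max_columns):
    """Return a copy of row with `column` split into numbered output columns."""
    value = row.get(column)
    chunks = value.split(delimiter) if value is not None else []
    # Assumption: truncating keeps the first `max_columns` chunks and drops the rest.
    chunks = chunks[:max_columns]
    out = dict(row)
    for i in range(max_columns):
        # Assumption: output columns with no matching chunk are left empty (None).
        out[f"{column}_{i}"] = chunks[i] if i < len(chunks) else None
    return out


# Made-up sample row, for illustration only.
row = {"location": "US, NY, New York"}
result = split_column(row, "location", ",", 3)
```

Because the split is on the comma alone, this sketch keeps any spaces that follow each comma in the resulting values.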
Add a step from the column header#
You can also add frequent and suggested steps directly from the column header.
Still in the Script tab of the Prepare recipe, click on the location_0 column header to open a dropdown menu of actions.
Click Rename.
Enter country as the new name, and click OK.
Repeat this process for the columns location_1 and location_2, entering the new names state and city.
When finished, you’ll have one step with three column renamings.
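A single step holding three renamings can be pictured as one mapping from old names to new names. The plain-Python sketch below illustrates that idea only; rename_columns and the sample row are hypothetical, and this is not how Dataiku stores the step.

```python
# One "step" with three renamings, expressed as a single mapping.
RENAMINGS = {
    "location_0": "country",
    "location_1": "state",
    "location_2": "city",
}


def rename_columns(row, renamings):
    """Return a copy of row with column names replaced according to renamings."""
    return {renamings.get(name, name): value for name, value in row.items()}


# Made-up sample row, for illustration only.
row = {"title": "Analyst", "location_0": "US", "location_1": "NY", "location_2": "New York"}
renamed = rename_columns(row, RENAMINGS)
```

Columns that do not appear in the mapping, such as title here, pass through unchanged.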
Add a step to multiple columns#
It’s also possible to add certain steps to multiple columns at once.
Switch from the Table view to the Columns view.
Check the box next to four natural language columns: company_profile, description, requirements, and benefits.
Click Actions to open the dropdown.
From the menu, choose Simplify text.
In the script, observe four new steps, each one normalizing a natural language column.
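For intuition about what each of these steps does, here is a rough Python sketch of text normalization applied to several columns at once. It assumes that Simplify text, with its default settings, lowercases the text and removes accents and punctuation; the exact rules of the processor may differ. The function names and the sample row are hypothetical.

```python
import re
import unicodedata

TEXT_COLUMNS = ["company_profile", "description", "requirements", "benefits"]


def simplify_text(value):
    """Lowercase a string and strip accents and punctuation (assumed behaviour)."""
    if value is None:
        return None
    # Decompose accented characters, then drop the combining accent marks.
    decomposed = unicodedata.normalize("NFD", value)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = no_accents.lower()
    # Replace anything that is not a word character or whitespace with a space.
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    # Collapse runs of whitespace left behind by the previous replacement.
    return re.sub(r"\s+", " ", no_punct).strip()


def simplify_columns(row, columns):
    """Apply simplify_text to the listed columns and leave the others untouched."""
    return {
        name: simplify_text(value) if name in columns else value
        for name, value in row.items()
    }


# Made-up sample row, for illustration only.
row = {"title": "Café Manager", "description": "Café Manager, FULL-TIME!"}
simplified = simplify_columns(row, TEXT_COLUMNS)
```

Only the columns named in TEXT_COLUMNS are changed, which mirrors selecting specific columns before choosing the action.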
Add a Formula step#
Dataiku also has its own spreadsheet-like Formula language with common mathematical and string operations, comparison and logical operators, and conditional statements that you can use to create new columns.
Switch back to the Table view.
Click + Add a New Step.
Remove the Split / Extract filter, search for a Formula step, and add it.
In the script, name the output column len_company_profile.
Click Open Editor Panel.
Start typing length(company_profile), noting the availability of auto-completion.
Click Apply, and observe the new column preview highlighted in blue.
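As a point of comparison, the formula length(company_profile) corresponds roughly to taking the character count of a string. The sketch below shows that in plain Python; add_length_column and the sample row are hypothetical, and returning an empty value for a missing input is an assumption rather than documented Formula behaviour.

```python
def add_length_column(row, source_column, output_column):
    """Return a copy of row with the character count of source_column added."""
    value = row.get(source_column)
    # Assumption: a missing input yields an empty output instead of 0.
    length = len(value) if value is not None else None
    return {**row, output_column: length}


# Made-up sample row, for illustration only.
row = {"company_profile": "We build data tools."}
result = add_length_column(row, "company_profile", "len_company_profile")
```

Passing the source and output column names as parameters means the same helper can be reused for another column by changing only those two arguments.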
Duplicate a step#
The Prepare recipe has many other conveniences, such as grouping, coloring, commenting, duplicating, and copy-pasting steps.
Click the horizontal three dots at the right of the last step.
Click Duplicate step.
Change the output column name to len_description.
Change the expression to length(description).
Run the Prepare recipe#
You could definitely keep preparing this dataset, but let’s stop here. This recipe is a repeatable record of transformation steps that you can apply to the entire input dataset whenever required.
Click the Run button (or type the keyboard shortcut @ + r + u + n).
When the job is finished, click Explore dataset job_postings_prepared.
After inspecting the output dataset, navigate back to the Flow (g + f).
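The idea of a recipe as a repeatable record of steps can be sketched as an ordered list of functions applied to every row of the input. This is only a conceptual illustration in plain Python, not a description of how Dataiku executes a Prepare recipe; run_script, the two step functions, and the sample rows are hypothetical.

```python
def run_script(rows, steps):
    """Apply every step, in order, to every row and return the prepared rows."""
    prepared = []
    for row in rows:
        for step in steps:
            row = step(row)
        prepared.append(row)
    return prepared


def rename_country(row):
    """Rename location_0 to country, leaving other columns as they are."""
    return {("country" if name == "location_0" else name): value
            for name, value in row.items()}


def add_len_description(row):
    """Add the character count of description (empty if description is missing)."""
    text = row.get("description")
    return {**row, "len_description": None if text is None else len(text)}


# The script is just data: the same list can be rerun on new input at any time.
SCRIPT = [rename_country, add_len_description]

# Made-up sample rows, for illustration only.
rows = [{"location_0": "US", "description": "Great job"}]
prepared = run_script(rows, SCRIPT)
```

Each step returns a new row rather than modifying its input, so the input rows stay unchanged and the script can be rerun safely.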