Prepare data

Once you have a sense of the dataset, you’re ready to start cleaning and creating new features using a Prepare recipe.

Important

Recipes in Dataiku contain the transformation steps, or processing logic, that act upon datasets. They can be visual or code-based (Python, R, SQL, etc.).

  1. From the job_postings dataset, click the Actions button.

  2. From the menu of visual recipes, select Prepare.

  3. Click Create Recipe, accepting the default output name and storage location.

Dataiku screenshot of the dialog for a Prepare recipe.

Add a step from the processor library

You can add steps to the empty script of the Prepare recipe in many different ways. Let’s start with the processor library, the complete collection of available data transformation steps.

  1. Click + Add a New Step to open the processor library.

  2. Filter for Split / Extract, and select the Split column processor.

  3. After adding the step to the script, provide location as the column and a comma (,) as the delimiter.

  4. Click Truncate, and set the maximum number of columns to keep to 3.

  5. Observe the preview output columns highlighted in blue.

Dataiku screenshot of the Split column step in a Prepare recipe.
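
To make the logic of this step concrete, here is a minimal sketch in plain Python. It is not how Dataiku implements the Split column processor. The function name split_column, the sample rows, and the choice to discard any pieces beyond the maximum are all assumptions made for illustration.

```python
def split_column(rows, column, delimiter, max_columns):
    """Split `column` on `delimiter` into column_0, column_1, ...

    Assumption: pieces beyond `max_columns` are discarded, and rows with
    fewer pieces get None in the remaining output columns.
    """
    result = []
    for row in rows:
        value = row.get(column)
        pieces = value.split(delimiter)[:max_columns] if value else []
        new_row = dict(row)  # keep the input columns untouched
        for i in range(max_columns):
            new_row[f"{column}_{i}"] = pieces[i] if i < len(pieces) else None
        result.append(new_row)
    return result


# Hypothetical sample rows, not taken from the job_postings dataset.
rows = [
    {"location": "US, NY, New York"},
    {"location": "GB, LND, London, Soho"},
    {"location": "NZ"},
]
split_rows = split_column(rows, "location", ",", 3)
```

Because the delimiter is a bare comma, any space that follows it stays in the output value (for example " NY"), which is worth keeping in mind when inspecting the preview.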

Add a step from the column header

You can also add frequent and suggested steps directly from the column header.

  1. Still in the Script tab of the Prepare recipe, click on the location_0 column header to open a dropdown menu of actions.

  2. Click Rename.

  3. Enter country as the new name, and click OK.

  4. Repeat this process for the columns location_1 and location_2, entering the new names state and city, respectively.

  5. When finished, you’ll have one step with three column renamings.

Dataiku screenshot of the rename column step in a Prepare recipe.
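
The single step holding three renamings behaves like one mapping from old names to new names. The sketch below expresses that idea in plain Python. The names RENAMES and rename_columns are hypothetical, and the sample row is invented for illustration.

```python
# One mapping plays the role of the single step with three renamings.
RENAMES = {
    "location_0": "country",
    "location_1": "state",
    "location_2": "city",
}


def rename_columns(rows, renames):
    """Return new rows with keys renamed; columns not in `renames` are kept as-is."""
    return [
        {renames.get(name, name): value for name, value in row.items()}
        for row in rows
    ]


# Hypothetical sample row.
sample = [{"location_0": "US", "location_1": "NY", "location_2": "New York", "title": "Analyst"}]
renamed = rename_columns(sample, RENAMES)
```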

Add a step to multiple columns

It’s also possible to add certain steps to multiple columns at once.

  1. Switch from the Table view to the Columns view.

  2. Check the boxes next to the four natural language columns: company_profile, description, requirements, and benefits.

  3. Click Actions to open the dropdown.

  4. From the menu, choose Simplify text.

  5. In the script, observe four new steps, each one normalizing a natural language column.

Dataiku screenshot of the Simplify text step in a Prepare recipe.
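
As a rough illustration of what normalizing a text column can mean, the sketch below lowercases text, strips accents, replaces punctuation with spaces, and collapses whitespace. This is an approximation written for this tutorial, not the Simplify text processor itself, whose exact rules and options may differ. The function names are hypothetical.

```python
import re
import unicodedata

TEXT_COLUMNS = ["company_profile", "description", "requirements", "benefits"]


def normalize_text(text):
    """Lowercase, strip accents, drop punctuation, and collapse whitespace."""
    if text is None:
        return None
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = no_accents.lower()
    # Replace anything that is not a word character or whitespace with a space.
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    return re.sub(r"\s+", " ", no_punct).strip()


def simplify_columns(rows, columns):
    """Apply normalize_text to each listed column, one column at a time."""
    return [
        {name: normalize_text(value) if name in columns else value
         for name, value in row.items()}
        for row in rows
    ]
```

Applying the same function to each of the four columns mirrors the four separate steps that appear in the script.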

Add a Formula step

Dataiku also has its own spreadsheet-like Formula language with common mathematical and string operations, comparison and logical operators, and conditional statements that you can use to create new columns.

  1. Switch back to the Table view.

  2. Click + Add a New Step.

  3. Remove the Split / Extract filter, search for a Formula step, and add it.

  4. In the script, name the output column len_company_profile.

  5. Click Open Editor Panel.

  6. Start typing length(company_profile), noting the availability of auto-completion.

  7. Click Apply, and observe the new column preview highlighted in blue.

Dataiku screenshot of the Formula step in a Prepare recipe.
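
Conceptually, a Formula step evaluates one expression per row and stores the result in the output column. The plain Python sketch below mimics length(company_profile) under that reading. The helper names and the sample rows are hypothetical, and returning None for a missing value is an assumption that may not match how the Formula language treats empty cells.

```python
def length(value):
    """Character count of a string; None for a missing value (an assumption)."""
    return len(value) if value is not None else None


def add_formula_column(rows, output_column, formula):
    """Evaluate `formula` on each row and store the result in `output_column`."""
    return [{**row, output_column: formula(row)} for row in rows]


# Hypothetical sample rows.
postings = [
    {"company_profile": "We build tools."},
    {"company_profile": None},
]
with_length = add_formula_column(
    postings,
    "len_company_profile",
    lambda row: length(row["company_profile"]),
)
```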

Duplicate a step

The Prepare recipe has many other conveniences, such as grouping, coloring, commenting, duplicating, and copy-pasting steps.

  1. Click the horizontal three dots at the right of the last step.

  2. Click Duplicate step.

  3. Change the output column name to len_description.

  4. Change the expression to length(description).

Dataiku screenshot of the duplicate step option in a Prepare recipe.

Run the Prepare recipe

You could definitely keep preparing this dataset, but let’s stop here. This recipe is a repeatable record of transformation steps that you can apply to the entire input dataset whenever required.

  1. Click the Run button (or type the keyboard shortcut @ + r + u + n).

  2. When the job is finished, click Explore dataset job_postings_prepared.

  3. After inspecting the output dataset, navigate back to the Flow (g + f).

Dataiku screenshot of the Run button in a Prepare recipe.
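
The idea of a recipe as a repeatable record of steps can be sketched as an ordered list of functions that is replayed over every row. This is only a mental model written in plain Python, not Dataiku's execution engine. All names and the sample row are hypothetical, and only three of the steps from this section are included to keep the sketch short.

```python
def rename(old, new):
    """Build a step that renames one column."""
    def step(row):
        return {(new if name == old else name): value for name, value in row.items()}
    return step


def add_length(source, output):
    """Build a step that stores the character count of `source` in `output`."""
    def step(row):
        value = row.get(source)
        # Assumption: a missing value produces None rather than 0.
        return {**row, output: len(value) if value is not None else None}
    return step


# The "script": an ordered, reusable record of transformation steps.
SCRIPT = [
    rename("location_0", "country"),
    add_length("company_profile", "len_company_profile"),
    add_length("description", "len_description"),
]


def run_script(rows, script):
    """Apply every step, in order, to every row of the input."""
    output = []
    for row in rows:
        for step in script:
            row = step(row)
        output.append(row)
    return output


# Hypothetical sample row.
sample = [{"location_0": "US", "company_profile": "We build", "description": "Engineer"}]
prepared = run_script(sample, SCRIPT)
```

Because the script is data that is kept separate from the rows, running it again on the same input, or on a refreshed input, applies the same transformations in the same order.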