Quick Start | Dataiku for data preparation#

Get started#

Recent advancements in generative AI have made it easier than ever to apply for jobs. But be careful! Scammers have also been known to create fake job postings in the hopes of stealing personal information from applicants. Let’s see if you — with Dataiku’s help — can tell a real job posting from a fake one!

Objectives#

In this quick start, you’ll:

  • Interactively explore a dataset.

  • Clean data and create new features.

  • Import new data into a Dataiku project.

  • Join two datasets together.

  • Optional: Write code in a notebook and insert it into a visual Flow.

Tip

To check your work, you can review a completed version of this entire project from data preparation through MLOps on the Dataiku gallery.

Create an account#

To follow along with the steps in this tutorial, you need access to a Dataiku instance (version 11.0 or above). If you do not already have access, you can get started in one of two ways:

  • Start a 14-day free trial. See this how-to for help if needed.

  • Install the free edition locally for your operating system.

Open Dataiku#

The first step is getting to the homepage of your Dataiku Design node.

  1. Go to the Launchpad.

  2. Click Open Instance in the Design node tile of the Overview panel once your instance has powered up.

  3. If you encounter any difficulties, see this how-to.

Important

If using a self-managed version of Dataiku, including the locally downloaded free edition on Mac or Windows, open the Dataiku Design node directly in your browser.

Create the project#

Let’s start by creating a Dataiku project that already includes a labeled dataset of real and fake job postings.

  1. From the Dataiku Design homepage, click + New Project.

  2. Click DSS tutorials in the dropdown menu.

  3. In the dialog, click Quick Starts in the left-hand panel.

  4. Choose Data Preparation Quick Start, and then click OK.

Dataiku screenshot of the dialog for creating a new project.

Note

You can also download the starter project from this website and import it as a zip file.

Explore data#

See a screencast covering this section’s steps

Before actually preparing the data, let’s explore it briefly.

  1. If you’re not already there, click Flow in the left-most menu of the top navigation bar (or use the keyboard shortcut g + f).

  2. Double click on the job_postings dataset to open it.

Dataiku screenshot of the Flow showing the job postings dataset.

Tip

There are many other keyboard shortcuts beyond g + f. Type ? to pull up a menu or see the complete list in the reference documentation.

Compute the row count#

One of the first things to recognize when exploring a dataset in Dataiku is that you are viewing only a sample of the data. This enables you to work interactively even with very large datasets.

  1. From within the Explore tab of the job_postings dataset, click the Sync icon (near the top left of the screen) to compute the row count of the entire dataset.

  2. Click the Sample button to open the Sample settings panel.

  3. Click the Sampling method dropdown to explore options besides the default First records method. No changes are required.

  4. When ready, click the Sample button again to collapse the panel.

Dataiku screenshot of the sampling tab for a dataset.
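As a loose pandas analogy (Dataiku computes samples on the backend, so this is illustrative only), the default first records method behaves like taking the head of a DataFrame, while random sampling draws rows at random:

import pandas as pd

df = pd.read_csv("job_postings.csv")  # hypothetical local copy of the data
first_records = df.head(10_000)       # like the default "first records" sample
random_sample = df.sample(n=min(10_000, len(df)), random_state=0)  # like a random sample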

Analyze column distributions#

When exploring a dataset, you’ll also want a quick overview of the columns’ distributions and summary statistics.

  1. Click on the header of the first column job_id to open a menu of options.

  2. Select Analyze.

  3. Use the arrows at the top left of the dialog to cycle through the distribution of each column, including the target variable, the fraudulent column.

Dataiku screenshot of the Analyze tool.

Tip

Applying the Analyze tool on the fraudulent column shows that about 5% of records in the sample are labeled as fake (a value of 1), whereas the remaining 95% are real job postings (a value of 0).

At the top of the window, you can switch the dropdown from Sample to Whole data, and click Save and Compute to see whether this statistic differs for the whole dataset.
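In pandas terms, what Analyze reports here is roughly the column’s normalized value counts (a conceptual sketch with toy data):

import pandas as pd

# Toy data mirroring the ~5% fake rate seen in the sample
df = pd.DataFrame({"fraudulent": [0] * 19 + [1]})
print(df["fraudulent"].value_counts(normalize=True))  # 0 -> 0.95, 1 -> 0.05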

Prepare data#

See a screencast covering this section’s steps

Once you have a sense of the dataset, you’re ready to start cleaning and creating new features using a Prepare recipe.

Important

Recipes in Dataiku contain the transformation steps, or processing logic, that act upon datasets. They can be visual or code-based (Python, R, SQL, etc.).

  1. From the job_postings dataset, click the Actions button near the top right.

  2. From the menu of visual recipes, select Prepare.

  3. Click Create Recipe, accepting the default output name and storage location.

Dataiku screenshot of the dialog for a Prepare recipe

Add a step from the processor library#

You can add steps to the empty script of the Prepare recipe in many different ways. Let’s start with the processor library, the complete collection of available data transformation steps.

  1. At the bottom left, click + Add a New Step to open the processor library.

  2. Filter for Split / Extract, and select the Split column processor.

  3. After adding the step to the script, provide location as the column and , as the delimiter.

  4. Click Truncate, and set the maximum number of columns to keep to 3.

  5. Observe the preview output columns highlighted in blue.

Dataiku screenshot of the Split column step in a Prepare recipe.
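Conceptually, the Split column step with truncation behaves like this pandas sketch (an approximation, not Dataiku’s actual implementation):

import pandas as pd

# Toy data mirroring the location column's "Country, State, City" format
df = pd.DataFrame({"location": ["US, NY, New York", "GB, LND, London"]})

# Split on "," and keep at most the first three resulting columns
parts = df["location"].str.split(",", expand=True).iloc[:, :3]
parts.columns = ["location_0", "location_1", "location_2"]
print(parts)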

Add a step from the column header#

You can also add frequent and suggested steps directly from the column header.

  1. Still in the Script tab of the Prepare recipe, click on the location_0 column header to open a dropdown menu of actions.

  2. Click Rename.

  3. Enter country as the new name.

  4. Repeat this process for the columns location_1 and location_2, entering the new names state and city.

  5. When finished, you’ll have one step with three column renamings.

Dataiku screenshot of the rename column step in a Prepare recipe.

Add a step to multiple columns#

It’s also possible to add certain steps to multiple columns at once.

  1. Near the top right, switch from the Table view to the Columns view.

  2. Check the box next to four natural language columns: company_profile, description, requirements, and benefits.

  3. Near the top left, click Actions to open the dropdown.

  4. From the menu, choose Simplify text.

  5. In the Script on the left, observe four new steps, each one normalizing a natural language column.

Dataiku screenshot of the Simplify text step in a Prepare recipe.
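As a rough approximation of the normalization this step performs (the processor itself offers further options):

import re

def simplify(text):
    # Rough approximation: lowercase, then collapse punctuation and symbols to spaces
    if text is None:
        return None
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

print(simplify("We're Hiring!! Apply NOW."))  # -> "we re hiring apply now"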

Add a Formula step#

Dataiku also has its own spreadsheet-like Formula language with common mathematical and string operations, comparison and logical operators, and conditional statements that you can use to create new features.
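For example, a hypothetical conditional expression (the threshold and output values here are purely illustrative) could flag postings whose company profile seems suspiciously short:

if(length(company_profile) < 100, "short profile", "normal profile")

The steps below use the same length() function to create a simpler length feature.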

  1. Switch back to the Table view.

  2. Click + Add a New Step near the bottom left of the screen.

  3. Remove the Split / Extract filter, search for a Formula step, and add it.

  4. In the script, name the output column len_company_profile.

  5. Click Open Editor Panel.

  6. Start typing length(company_profile), noting the availability of auto-completion.

  7. Click Apply, and observe the new column preview highlighted in blue.

Dataiku screenshot of the Formula step in a Prepare recipe.

Duplicate a step#

The Prepare recipe has many other conveniences, such as grouping, coloring, commenting, duplicating, and copy-pasting steps.

  1. Click the three horizontal dots at the right of the last step.

  2. Click Duplicate step.

  3. Change the output column name to len_description.

  4. Change the expression to length(description).

Dataiku screenshot of the duplicate step option in a Prepare recipe.

Run the Prepare recipe#

You could definitely keep preparing this dataset, but let’s stop here. This recipe is a repeatable record of transformation steps that you can apply to the entire input dataset whenever required.

  1. At the bottom left, click the green Run button (or type the keyboard shortcut @ + r + u + n).

  2. When the job is finished, click Explore dataset job_postings_prepared at the bottom of the screen.

  3. After inspecting the output dataset, navigate back to the Flow (g + f).

Dataiku screenshot of the Run button in a Prepare recipe.

Import data#

See a screencast covering this section’s steps

If you look at the Flow, you’ll notice different icons on the blue squares representing datasets. These icons represent the underlying storage connection for each dataset. Dataiku supports connections to many possible data sources that an instance administrator can manage.

For example, the initial job_postings dataset is an uploaded file, but the output of the Prepare recipe is stored in the default location on your instance. That might be cloud storage, a SQL database, or even a local filesystem.

Upload a file#

For now, let’s demonstrate importing a new dataset into Dataiku.

  1. Download the earnings_by_education.csv file.

  2. From the Flow, click + Dataset near the top right of the screen.

  3. Choose Upload your files.

    Dataiku screenshot of the Flow highlighting data connections.
  4. Click Select Files, and choose the earnings_by_education.csv file.

  5. Rather than immediately creating the dataset, click Configure Format.

Dataiku screenshot of the data import process.

Infer the dataset’s schema#

Configuring the format allows you to adjust the dataset’s schema, or the name and storage type of columns in this context.

  1. While creating the new uploaded files dataset, navigate to the Schema subtab.

  2. Click Infer Types from Data so Dataiku can try to guess the correct storage types based on the current sample.

  3. Click Confirm, and notice how the median_weekly_earnings_usd column changed from a string to an integer.

  4. Near the top right, click Create to finish importing the dataset.

  5. After looking at the new dataset, navigate back to the Flow (g + f).

Dataiku screenshot of the data import process.
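This inference is loosely analogous to how pandas guesses column dtypes when reading a CSV (an analogy only, not how Dataiku implements it):

import pandas as pd

# pandas likewise infers storage types from the values it reads
df = pd.read_csv("earnings_by_education.csv")
print(df.dtypes)  # median_weekly_earnings_usd should come back as an integer type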

Join data#

See a screencast covering this section’s steps

This second dataset documents median weekly earnings in USD by education level, according to the US Bureau of Labor Statistics.

The categories in the education_level column in the earnings_by_education dataset are the same as the categories in the required_education column in the job_postings_prepared dataset. Based on this relationship, let’s enrich the job_postings_prepared dataset with the earnings data.

Create a Join recipe#

Let’s do this with another kind of visual recipe: a Join recipe.

  1. From the Flow, click to select job_postings_prepared. (This will be the “left” dataset).

  2. Open the Actions tab on the right, and click Join with from the menu of visual recipes.

  3. On the left-hand side of the dialog, click No dataset selected, and select earnings_by_education. (This will be the “right” dataset).

  4. Click Create Recipe.

Dataiku screenshot of the dialog for a Join recipe.

Configure the Join step#

You have a number of different options when deciding how datasets should be joined.

Choose a join type#

Dataiku can perform the standard types of joins familiar to SQL users, depending on which records from each dataset you wish to retain.

  1. On the Join step, click on Left join to view the various join options.

  2. Leave the default Left join so that the output dataset retains all records of job postings even if a matching education level is not found.

Dataiku screenshot of the join types in a Join recipe.
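If a pandas mental model helps, the left join configured in the next steps is conceptually equivalent to the following sketch (the earnings value is illustrative):

import pandas as pd

postings = pd.DataFrame({"required_education": ["Bachelor's Degree", "Vocational"]})
earnings = pd.DataFrame({
    "education_level": ["Bachelor's Degree"],
    "median_weekly_earnings_usd": [1500],  # illustrative value only
})

# Left join: keep every job posting, whether or not a match is found
joined = postings.merge(
    earnings, how="left",
    left_on="required_education", right_on="education_level",
)
print(joined)  # the unmatched row gets NaN in the earnings columns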

Add a join condition#

Once you know the type of join required for your use case, you can define the conditions for the join key.

  1. On the Join step, click Add a Condition.

  2. For the job_postings_prepared dataset on the left, change the join column to required_education.

  3. Click OK.

Dataiku screenshot of the Join step for a Join recipe.

Handle unmatched rows#

By default, the Join recipe drops any unmatched rows, but you can also send unmatched rows to another dataset by clicking the dropdown menu beneath the join key. This can be especially useful for verifying assumptions about the datasets or, more generally, anytime you want to handle unmatched rows in some particular way.

  1. On the Join step, select the dropdown menu underneath the join condition.

  2. Select Send unmatched rows to other output dataset(s).

  3. Click + Add Dataset.

  4. Give the name unmatched.

  5. Click Use Dataset.

Dataiku screenshot of the Join step showing unmatched rows sent to another dataset.
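In pandas terms, collecting unmatched rows resembles filtering on a merge indicator (a conceptual sketch, not Dataiku’s implementation):

import pandas as pd

postings = pd.DataFrame({"required_education": ["Bachelor's Degree"]})
earnings = pd.DataFrame({"education_level": ["Bachelor's Degree",
                                             "Some high school coursework"]})

# An outer merge with indicator=True labels where each row found a match
merged = postings.merge(earnings, how="outer", indicator=True,
                        left_on="required_education", right_on="education_level")
print(merged[merged["_merge"] == "right_only"])  # rows with no match on the left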

Select columns for the output dataset#

The next step is to choose which columns should appear in the output.

  1. In the left-hand panel, navigate to the Selected columns step.

  2. Under earnings_by_education, open the dropdown menu, and choose Manually select columns.

  3. Uncheck education_level since we already have the same information in the required_education column of the left dataset.

  4. Click Save (or cmd/ctrl + s).

Dataiku screenshot of the Selected columns step for a Join recipe.

Check the computation engine for the Join recipe#

Before running the Join recipe, you may be curious where the actual computation will take place.

  1. At the bottom left, click the Gear icon to open the recipe engine settings.

  2. Click Show to view the non-selectable engines as well.

  3. Click Close when finished.

Dataiku screenshot of a Join recipe highlighting the computation engine.

In most cases, you’ll never need to adjust the recipe engine setting manually, but it can be helpful to know what’s happening under the hood.

You can think of Dataiku as an orchestrator on top of your data infrastructure. In most cases, your instance administrator will create connections to this infrastructure. Then, you’ll manipulate data through recipes (visual or code), but you often won’t need to move the data into Dataiku.

Tip

For this tutorial, you’ll use the DSS engine. However, you could use a Sync or a Prepare recipe to move the data into a storage location that would enable more powerful engine options, such as the in-database SQL engine.

Build a dataset from the Flow#

In many cases, you won’t want to open an individual recipe in order to run it. Instead, you can instruct Dataiku to build whatever dataset you want, and Dataiku will run whatever recipes are necessary to make that happen.

  1. Before running the Join recipe, navigate back to the Flow (g + f).

  2. Click once to select the empty (unbuilt) job_postings_prepared_joined dataset.

  3. In the Actions panel, click Build.

  4. Since all build modes produce the same result in this case, leave the default “Build Only This” mode, and click Build Dataset.

Dataiku screenshot of the build dialog for a dataset.

Note

In other learning resources, such as the MLOps quick start, you’ll learn how to automate builds of datasets and other Flow objects with scenarios.

Check unmatched rows#

In addition to the main output dataset, the Join recipe also created a dataset containing any unmatched rows.

  1. When the job finishes, refresh the screen.

  2. From the Flow, double click the unmatched dataset to open it.

  3. Return to the Flow when finished viewing it.

Dataiku screenshot of a dataset of unmatched rows.

The value “Some high school coursework” in the earnings_by_education dataset did not have an exact match in the job_postings_prepared dataset, and so this row appears here.

Tip

Aside from changing the data values, you could fix this by returning to the join condition in the Join recipe, and checking the option for a Case insensitive match. Feel free to try this out yourself and re-run the Join recipe or move ahead!

Optional: Write code#

Note

This section is entirely optional. If you don’t wish to get started coding with Dataiku, feel free to skip ahead to the conclusion.

See a screencast covering this section’s steps

Dataiku’s visual tools can handle a wide variety of data preparation tasks. They also enable team members with diverse skill sets to collaborate on the same platform. However, certain tasks might require a highly customized solution. Or, in some cases, you might just prefer to code! That choice is always yours.

Create a code notebook#

To create a code recipe, you’ll often start in a code notebook. Here you’ll use Python, but the process is very similar for languages like R or SQL.

  1. From the Flow, select the job_postings_prepared dataset (not the job_postings_prepared_joined dataset).

  2. In the Actions tab, click on the Lab button. Alternatively, navigate to the Lab tab of the right side panel (shown here).

  3. Under Code Notebooks, click New.

  4. From the available types of notebooks, select Python.

  5. Click Create with the default settings, including the builtin code environment.

Dataiku screenshot of the dialog for creating a Python notebook.

Note

This notebook uses the builtin code environment, but you can also create your own Python or R code environments with a custom set of packages.

Write code in a notebook#

You now have a Jupyter notebook. What’s special, though, is how the Dataiku API provides a direct way to read the dataset into a pandas DataFrame. Regardless of the data’s storage location, the code is the same.
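The starter cells generated for you look roughly like the following sketch (the variable names in your notebook may differ slightly):

import dataiku
import pandas as pd

# The same two lines work whether the data sits in cloud storage,
# a SQL database, or a local filesystem
dataset = dataiku.Dataset("job_postings_prepared")
df = dataset.get_dataframe()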

  1. Run the starter cells in the notebook.

  2. Replace the last cell with the following code block that creates a numeric feature min_salary out of the string feature salary_range:

# Define a function to extract the minimum salary
def extract_min_salary(salary_range):
    if pd.isna(salary_range):  # Keep missing values as missing
        return None
    try:
        return int(salary_range.split('-')[0])  # Take the lower bound of e.g. "40000-60000"
    except (ValueError, AttributeError):  # Handle values that aren't numeric ranges
        return None

# Apply the function to create "min_salary" as a nullable integer column
df['min_salary'] = df['salary_range'].apply(extract_min_salary).astype('Int64')
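To sanity-check the new column, you might preview it next to the original in a new notebook cell (an optional check, continuing from the cells above):

# Compare the raw string and the extracted lower bound side by side
df[['salary_range', 'min_salary']].dropna().head()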

Note

This column could use more data preparation! Some values appear to be rounded off (50 instead of 50,000). Is the currency US dollars, euros, or pounds? A few values are dates. But you get the idea!

Convert a code notebook to a code recipe#

This code notebook only exists in the Lab, a space for experimental prototyping. To include it in the project’s Flow, you’ll need to convert the notebook into a recipe.

  1. Near the top right of the notebook, click + Create Recipe, and click OK with the default Python recipe selected.

  2. Under Outputs, click +Add, and name the output job_postings_python.

  3. Click Create Dataset.

  4. Click Create Recipe.

Dataiku screenshot highlighting code recipe creation in a notebook.

Once in the recipe interface, there is just one adjustment to make.

  1. Edit the last line of code in the recipe to match the following.

job_postings_python.write_with_schema(df)
  2. Click Run (or type @ + r + u + n).

  3. When the job finishes, navigate back to the Flow (g + f) to see the code recipe and its output.

Note

As your project evolves, you’ll iterate back and forth between experimental work in a notebook and production work in a recipe. Once your code recipe stabilizes, you may consider turning it into a plugin recipe, a reusable component which places your custom code behind a visual interface so more users can benefit from it.

Adjust the Flow#

You’ve now converted your notebook code into a recipe in the Flow! But the input to the Join recipe needs to change for this work to be included in the same data pipeline.

  1. Double click on the Join recipe to open it.

  2. On the Join step, click on job_postings_prepared to replace it with another dataset.

  3. Choose job_postings_python as the replacement.

  4. Click Replace Dataset.

  5. Click Run again, and navigate back to the Flow when the job finishes.

Dataiku screenshot of the Join dialog including the Python recipe output.

You’ve now inserted Python code into a visual Flow!

Dataiku screenshot of a Flow including visual and code recipes.

Tip

This was just the very beginning of what Dataiku has to offer coders. To find more resources for coders, please see the quick start tutorial in the Developer Guide.

What’s next?#

Congratulations! You’ve taken your first steps toward preparing data with Dataiku.

You’re now ready to begin the Core Designer learning path and challenge yourself to earn the Core Designer certificate.

Another option is to dive into the world of machine learning. In the Quick Start | Dataiku for machine learning, you can use the dataset you’ve just prepared to train a model that classifies job postings as real or fake.

Note

You can also find more resources on data preparation in the following spaces: