Optional: Write code#
Important
If you don’t wish to get started coding with Dataiku, feel free to skip ahead to the conclusion.
Dataiku’s visual tools can handle a wide variety of data preparation tasks. They also enable team members with diverse skill sets to collaborate on the same platform. However, certain tasks might require a highly customized solution. Or, in some cases, you might just prefer to code! That choice is always yours.
Create a code notebook#
To create a code recipe, you’ll often start in a code notebook. Here you’ll use Python, but the process is very similar for languages like R or SQL.
From the Flow, select the job_postings_prepared dataset (not the job_postings_prepared_joined dataset).
Navigate to the Lab tab of the right side panel.
Under Code Notebooks, click New.
From the available types of notebooks, select Python.
Click Create with the default settings, including the built-in code environment.
Note
This notebook uses the built-in code environment, but you can also create your own Python or R code environments with a custom set of packages.
Write code in a notebook#
You now have a Jupyter notebook. What’s special, though, is that the Dataiku API gives you a direct way to load the dataset as a pandas DataFrame. Regardless of where the data is stored, the line of code that reads it is the same.
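As a rough sketch, the read step in the starter cells looks something like the code below. The dataiku package is normally importable only inside a Dataiku notebook or recipe, so this version falls back to a small stand-in DataFrame elsewhere; the fallback rows are hypothetical and exist only so the sketch runs anywhere.

```python
import pandas as pd

try:
    import dataiku  # normally available inside a Dataiku notebook or recipe
except ImportError:
    dataiku = None  # running outside Dataiku

if dataiku is not None:
    # The same two lines apply wherever the dataset is stored
    dataset = dataiku.Dataset("job_postings_prepared")
    df = dataset.get_dataframe()
else:
    # Hypothetical stand-in rows, not taken from the real dataset
    df = pd.DataFrame({"salary_range": ["20000-28000", None]})

print(df.head())
```

Only the dataset name changes from one dataset to another; the call itself does not depend on the storage type.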
When the kernel is ready, run the starter cells in the notebook.
Replace the last cell with the following code block that creates a numeric feature min_salary out of the string feature salary_range:
# Define a function to extract the minimum salary
def extract_min_salary(salary_range):
    if pd.isna(salary_range):  # Keep missing values as missing
        return None
    try:
        min_salary = int(salary_range.split('-')[0])  # Extract minimum salary
        return min_salary
    except (ValueError, AttributeError):
        return None  # Handle invalid values (non-numeric or non-string)

# Apply the function to create the "min_salary" column
df['min_salary'] = df['salary_range'].apply(extract_min_salary).astype('Int64')
Note
This column could use more data preparation! Some values appear to be abbreviated (50 instead of 50,000). The currency is unclear: is it US dollars, euros, pounds, or something else? A few values are dates. But you get the idea!
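To see how the function treats different kinds of input, the sketch below applies it to a few hypothetical salary_range values. The sample strings are illustrative and not taken from the dataset, and the narrow except clause reflects an assumption about which errors are expected here.

```python
import pandas as pd

def extract_min_salary(salary_range):
    """Return the lower bound of a 'min-max' string, or None if it can't be parsed."""
    if pd.isna(salary_range):  # Keep missing values as missing
        return None
    try:
        return int(salary_range.split('-')[0])
    except (ValueError, AttributeError):
        return None  # Non-numeric text or non-string input

# Hypothetical sample: a normal range, an abbreviated range, a date-like value, a missing value
sample = pd.DataFrame({"salary_range": ["20000-28000", "50-110", "Oct-20", None]})
sample["min_salary"] = sample["salary_range"].apply(extract_min_salary).astype("Int64")
print(sample)
```

The date-like and missing entries both end up as missing values in min_salary, while the abbreviated range passes through as 50, which is the kind of value that still needs cleaning. Casting to the nullable Int64 type keeps the column as integers even though some entries are missing.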
Convert a code notebook to a code recipe#
This code notebook only exists in the Lab, a space for experimental prototyping. To include it in the project’s Flow, you’ll need to convert the notebook into a recipe.
Click + Create Recipe, and click OK with the default Python recipe selected.
Under Outputs, click +Add, and name the output job_postings_python.
Click Create Dataset.
Click Create Recipe.
Once in the recipe interface, there is just one adjustment to make.
Edit the last line of code in the recipe to match the following.
job_postings_python.write_with_schema(df)
Click Run (or type @+r+u+n).
When the job finishes, navigate back to the Flow (g+f) to see the code recipe and its output.
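Put together, the finished recipe has roughly the shape sketched below: read the input dataset, transform the DataFrame, and write it to the output dataset. This is an approximation rather than the exact code Dataiku generates. The add_min_salary helper is a hypothetical wrapper introduced here for readability, and the fallback branch with made-up rows exists only so the sketch can run outside Dataiku.

```python
import pandas as pd

def extract_min_salary(salary_range):
    """Return the lower bound of a 'min-max' string, or None if it can't be parsed."""
    if pd.isna(salary_range):
        return None
    try:
        return int(salary_range.split("-")[0])
    except (ValueError, AttributeError):
        return None

def add_min_salary(df):
    """Hypothetical helper wrapping the notebook's transformation."""
    out = df.copy()
    out["min_salary"] = out["salary_range"].apply(extract_min_salary).astype("Int64")
    return out

try:
    import dataiku  # normally available inside a Dataiku recipe
except ImportError:
    dataiku = None

if dataiku is not None:
    # Read the recipe input
    job_postings_prepared = dataiku.Dataset("job_postings_prepared")
    df = add_min_salary(job_postings_prepared.get_dataframe())
    # Write the recipe output, setting the dataset schema from the DataFrame
    job_postings_python = dataiku.Dataset("job_postings_python")
    job_postings_python.write_with_schema(df)
else:
    # Outside Dataiku: run the same transformation on hypothetical rows
    df = add_min_salary(pd.DataFrame({"salary_range": ["20000-28000", None]}))
    print(df)
```

Keeping the transformation in its own function separates it from the dataset I/O, which makes it easier to try out in a notebook before running it in the Flow.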
Note
As your project evolves, you’ll iterate back and forth between experimental work in a notebook and production work in a recipe. Once your code recipe stabilizes, you may consider turning it into a plugin, a reusable component which wraps your code under a visual interface so more users can benefit from it.
Adjust the Flow#
You have now converted exploratory code in a Lab notebook into production code in a Flow recipe! However, for this work to be included in the existing data pipeline, the input to the Join recipe needs to change.
Double-click the Join recipe to open it.
On the Join step, click on the Replace dataset icon next to job_postings_prepared.
Choose job_postings_python as the replacement.
Click Replace Dataset.
Click Run, and navigate back to the Flow when the job finishes.
You have now inserted Python code into a visual Flow! Any user building a dataset like job_postings_prepared_joined is effectively running this Python recipe too, whether or not they know Python.
Tip
This was just the very beginning of what Dataiku has to offer coders! To find more resources for coders, please see the Quickstart Tutorial in the Developer Guide or the Developer learning path in the Academy.