Business Analyst Quick Start¶
Dataiku is a collaborative, end-to-end data science and machine learning platform that unites data analysts, data scientists, data engineers, architects, and business users in a common space to bring faster business insights.
In this quick start tutorial, you will be introduced to many of the essential capabilities of Dataiku by walking through a simple use case: predicting customer churn. Working towards that goal, you will:
connect to data sources,
prepare training data,
build machine learning models,
use one to generate predictions on new, unseen records, and
communicate results in a dashboard.
This hands-on tutorial is geared towards a Business Analyst entirely new to Dataiku. It focuses on using the point-and-click interface. If you are more interested in what Dataiku offers for coders, you might prefer to check out the quick start tutorials for a Data Scientist or a Data Engineer.
When you’re finished, you will have built the workflow below and understand all of its components!
To follow along with the steps in this tutorial, you will need access to a Dataiku instance (version 9.0 or above). If you do not already have access, you can get started in one of two ways:
install the free edition locally, or
start a 14-Day Free Online Trial.
You’ll also need to download this country_gdp.csv file to be uploaded during the tutorial.
A few tips to help you on your way!
Take a moment to arrange these instructions and your Dataiku instance in the most productive way for you and your workstation—for example, using separate tabs or windows.
For each section below, written instructions are recorded in bullet points, but you can also find a screencast at the end of each section that records all of the actions described. Use these short videos as your guide!
New terms have short explanations in blue boxes, but are often linked to more detail in the product documentation.
To compare against your work, you can also find a read-only completed version of the final project in the public gallery.
Create a Project¶
Let’s get started! After creating an account and logging in, the first page you’ll see is the Dataiku homepage. From this page, you’ll be able to browse projects, recent items, dashboards, and applications shared with you.
A Dataiku project is a holder for all work on a particular activity.
You can create a new project in a few different ways. You can start a blank project or import a zip file. You might also have projects already shared with you based on the user groups to which you belong.
From the Dataiku homepage, click on +New Project.
Choose DSS Tutorials > Quick Start > Business Analyst.
Click OK when the tutorial has been successfully created.
You can also download the starter project from this website and import it as a zip file.
Connect to Data¶
After creating a project, you’ll find the project homepage. It is a convenient high-level overview of the project’s status and recent activity. Clicking on the project title in the top left corner lets you return to it at any point.
When someone shares a project with you, it often will already have data inside of it. This project has four datasets in the Flow, but let’s add a new one.
The Flow is the visual representation of how data, recipes (steps for data transformation), and models work together to move data through an analytics pipeline.
A blue square in the Flow represents a dataset. The icon on the square represents the type of dataset, such as an uploaded file, or its underlying storage connection, such as a SQL database or cloud storage.
From the project homepage, click Go to Flow (or use the corresponding keyboard shortcut).
Click +Dataset in the top right corner of the Flow.
Click Upload your files.
Add the country_gdp.csv file by dragging and dropping or using the button to choose a file.
Before creating it, Dataiku provides a preview of the incoming dataset.
Click the blue Create button to create the dataset within Dataiku with the default name country_gdp.
Navigate back to the Flow from the left-most menu in the top navigation bar (or use the corresponding keyboard shortcut).
No matter what kind of dataset the blue square represents, the methods and interface in Dataiku for exploring, visualizing, and analyzing it are the same.
Now we have all of the raw data needed for this project. Let’s explore what’s inside these datasets.
From the Flow, double click on the crm_last_year dataset to open it.
Configure a Sample¶
On opening a dataset, you’ll find the Explore tab. It looks similar to a spreadsheet, but with a few key differences. For starters, you are viewing only a sample of the dataset.
By default, the Explore tab shows the first 10,000 rows of the dataset, but many other configurations are possible. Working with a sample makes it easy to work interactively (filtering, sorting, etc.) on even very large datasets.
Near the top left of the screen, click on Configure sample.
As a demonstration, change the Sampling method from “First records” to Random (nb. records).
Click Save and Refresh Sample.
On the right sidebar, click to open the Details panel (marked with an “i”).
Under Status, click Compute, and then click to refresh the count of records.
Although the sample includes only 10,000 records, the actual dataset is larger. Dataiku itself imposes no limit on how large a dataset can be.
Observe the Schema¶
Look closer at the header of the table. At the top of each column is the column name, its storage type, and its meaning.
Datasets in Dataiku have a schema. The schema of a dataset is the list of columns, with their names and types.
A column in Dataiku has two kinds of “types”:
The storage type indicates how the dataset backend should store the column data, such as “string”, “boolean”, or “date”.
The meaning is a rich semantic type automatically detected from the sample content of the column, such as URL, IP address, or sex.
Note how, at this point, the storage type of all columns is a string, but Dataiku has inferred the meanings to be “Text”, “Date (unparsed)”, “Decimal”, “Gender”, and “Integer”.
Click the column heading of the customerid column, and then select Analyze from the menu.
Use the arrows at the top left of the dialog to scroll through each column.
Note how the Analyze tool displays different insights for categorical and numerical variables.
Join the Data¶
Thus far, the Flow only contains datasets. To take action on datasets, you need to apply recipes.
The web and CRM (customer relationship management) data both contain a column of customer IDs. Let’s join these two datasets together using a visual recipe.
Recipes in Dataiku contain the transformation steps, or processing logic, that act upon datasets. A circle connecting the input and output datasets represents a recipe in the Flow.
Visual recipes (yellow circles) accomplish the most common data transformation operations, such as cleaning, grouping, and filtering, through a point-and-click interface.
Code recipes (orange circles) allow you to define your own processing logic in a language such as Python, R, or SQL.
Plugin recipes allow coders to wrap custom code behind a visual interface, thereby extending the native capabilities of Dataiku.
From the Flow, select the web_last_year dataset.
Open the Actions sidebar near the top right of the screen, and choose Join with from the “Visual recipes” section.
Choose crm_last_year as the second input dataset.
Name the output dataset training_data instead of the default name.
Create the recipe.
Once the recipe is initiated, you can adjust the settings.
On the Join step, click on Left Join to observe the different types of joins available.
Click Add a Condition to define how the datasets should be joined.
The customer_id column from the web_last_year dataset should be matched with the customerid column from the crm_last_year dataset. Click OK.
With the join condition set, you can adjust which columns to include in the output.
On the left hand side, click to move to the Selected columns step.
Uncheck the customerid column from the crm_last_year dataset (we already have the customer_id column from the left dataset).
On the Output step (further down on the left), note the output column names and storage types.
For a demonstration, before running it, Save the recipe, and return to the Flow.
See how the Join recipe and output dataset are represented in the Flow. The output dataset is drawn in an outline because it has not yet been built. You have only defined the instructions as to how it should be created.
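The logic of the Join recipe is the same as a standard database join. Purely as an illustration (Dataiku runs this visually, with no code; the miniature data below is hypothetical, though the column names come from the tutorial), a left join on the two customer ID columns could be sketched in pandas:

```python
import pandas as pd

# Hypothetical miniature stand-ins for the tutorial's datasets.
web_last_year = pd.DataFrame({
    "customer_id": ["A1", "B2", "C3"],
    "pages_visited": [12, 5, 8],
})
crm_last_year = pd.DataFrame({
    "customerid": ["A1", "B2"],
    "churn": [0, 1],
})

# Left join: keep every row from the left (web) dataset; rows with no
# matching customer ID in the CRM data get empty (NaN) values.
training_data = web_last_year.merge(
    crm_last_year, how="left", left_on="customer_id", right_on="customerid"
).drop(columns=["customerid"])  # exclude the duplicate key column

print(training_data)
```

Note that customer "C3" has no CRM match, so its churn value comes out empty. This is exactly what produces the empty values investigated later in the tutorial.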
Run a Recipe¶
Now the recipe is ready to be run. However, you can choose to run the recipe with different execution engines, depending on the infrastructure available to your instance and the type of recipe at hand.
Dataiku will select the available execution engine that it detects to be optimal. For this tutorial, that is the DSS engine.
If, for example, the input and output datasets were stored in a SQL database, and the recipe was a SQL-compatible operation, then the in-database engine would be chosen. This is not just for SQL code recipes. Many visual recipes and processors in the Prepare recipe can be translated to SQL queries.
In other situations, such as when working with HDFS or S3 datasets, the Spark engine would be available if it is installed.
From the Flow, double click to open the Join recipe.
In the Join recipe, click the gear icon at the bottom left of the screen to view the available (and not available) engines. Keep the DSS engine.
When the recipe is finished running, click to open the output dataset.
Note the icon on the output dataset of the Join recipe. If you are working on Dataiku Online, you will see an S3 icon, as that is where Dataiku is storing the output. If you are working on a local instance, however, you will see the icon for a managed filesystem dataset.
Investigate Missing Data¶
In datasets like crm_last_year and web_last_year, you’ll notice a completely green bar beneath the schema of each column. However, in training_data, this same bar has a small gray portion for columns birth, price_first_item_purchased, sex, and churn.
For each column, this data quality bar indicates the extent to which the current sample is not empty and matches the inferred meaning.
Green represents non-empty “OK” values that match the column’s inferred meaning.
Red indicates non-empty values that do not match the column’s inferred meaning.
Gray represents empty values.
Open training_data to confirm the empty values in columns like churn.
Click the column heading of the churn column, and then select Analyze from the menu.
Change “Sample” to “Whole data”, and click Save and Compute on this column to determine how many values are empty for this column.
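What the Analyze tool computes here is essentially a count of empty values per column. As a sketch in pandas terms (the toy data is hypothetical; only the churn column name comes from the tutorial):

```python
import pandas as pd

# Hypothetical sample with some empty values, standing in for training_data.
df = pd.DataFrame({"churn": [1, 0, None, 1, None]})

empty = df["churn"].isna().sum()   # number of empty values
share_ok = 1 - empty / len(df)     # share shown green in the data quality bar
print(empty, share_ok)
```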
Adjust a Recipe¶
The empty values in the output dataset come from rows in web_last_year not having a matching customer ID in crm_last_year. In this case, let’s remove these rows by changing the Join type.
From the training_data dataset, click Parent Recipe near the top right of the screen to return to the settings of the Join recipe.
On the Join step, click on Left Join.
Choose an Inner join to drop rows that do not have a match in both datasets.
All of the other settings remain the same so click Run to execute the recipe with the new Join type.
Open the output dataset once more, and confirm that there are no longer any empty values in columns like churn.
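The effect of switching the join type can be sketched in pandas terms (toy data, for illustration only; Dataiku's visual recipe does this without code):

```python
import pandas as pd

# Hypothetical miniature data; column names mirror the tutorial's datasets.
web = pd.DataFrame({"customer_id": ["A1", "B2", "C3"], "pages": [12, 5, 8]})
crm = pd.DataFrame({"customerid": ["A1", "B2"], "churn": [0, 1]})

left = web.merge(crm, how="left", left_on="customer_id", right_on="customerid")
inner = web.merge(crm, how="inner", left_on="customer_id", right_on="customerid")

# The left join keeps all 3 web rows (one with an empty churn value);
# the inner join drops the unmatched row entirely.
print(len(left), len(inner))  # 3 2
```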
The system of dependencies in the Flow between input datasets, recipes, and output datasets makes it possible to adjust or re-run recipes as needed without fear of altering the input data. As you will see later, it also provides the groundwork for a variety of different build strategies.
Data cleaning and preparation is typically one of the most time-consuming tasks for anyone working with data. Let’s see how this critical step can be accomplished with visual tools in Dataiku.
You can use the Prepare recipe for data cleaning and feature generation in a visual and interactive way.
In a Prepare recipe, the inferred meanings for each column allow Dataiku to suggest relevant actions in many cases. Let’s see how this works for a date column.
With the training_data dataset selected, initiate a Prepare recipe from the Actions sidebar on the right.
The default name of the output dataset, training_data_prepared, fits well, so just create the recipe.
Dataiku recognizes the birth column as an unparsed date. The format though is not yet confirmed.
From the birth column header dropdown, select Parse date.
In the “Smart Date” dialog, click Use Date Format to accept the detected format.
Return to the birth column header dropdown, and select Delete.
From the birth_parsed column header dropdown, select Compute time since.
On the left side of the screen, change the “Output time unit” to Years.
Change the name of the “Output column” to age.
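The two date steps amount to parsing strings into real dates and then computing elapsed years. A rough pandas equivalent is below (the birth values, date format, and reference date are all assumptions for illustration):

```python
import pandas as pd

# Illustrative stand-in for the birth column (unparsed date strings).
df = pd.DataFrame({"birth": ["1985-03-02", "1990-11-20"]})

# Parse the dates, then compute the time since in years, mirroring the
# Parse date and Compute time since steps of the Prepare recipe.
df["birth_parsed"] = pd.to_datetime(df["birth"], format="%Y-%m-%d")
df["age"] = (pd.Timestamp("2020-01-01") - df["birth_parsed"]).dt.days / 365.25

print(df[["birth_parsed", "age"]])
```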
You deleted the birth column in the Prepare script above. However, this column remains in the input dataset. It will only be removed from the output dataset. Unlike a spreadsheet tool, this system of input and output datasets makes it easy to track what changes have occurred.
The age column has some suspiciously large values. The same Analyze tool used to explore data can also be used to take certain actions in a Prepare recipe.
From the age column header dropdown, select Analyze.
In the Actions dropdown of the Outliers section, choose Clear rows outside 1.5 IQR to filter out rows with abnormally high and low ages.
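The 1.5 IQR rule keeps values between Q1 - 1.5*IQR and Q3 + 1.5*IQR, where Q1 and Q3 are the first and third quartiles. The same filter can be sketched in pandas (toy values for illustration; the visual action does this for you):

```python
import pandas as pd

# Hypothetical ages, including one abnormally large value.
ages = pd.Series([25, 32, 41, 29, 37, 120])

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only rows inside the 1.5 x IQR fences, mirroring the
# "Clear rows outside 1.5 IQR" action.
kept = ages[(ages >= low) & (ages <= high)]
print(kept.tolist())  # [25, 32, 41, 29, 37]
```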
In addition to the suggested actions for each column, you can also click +Add New Step at the bottom of the script on the left to search the processors library for more than 100 different transformations.
Resolve IP Addresses¶
Let’s try out a few more processors to further enrich the dataset.
From the ip column header dropdown, select Resolve GeoIP.
Extract only the country, GeoPoint, and continent code.
Delete the original ip column.
Fuzzy Join Data¶
In the earlier Join recipe, the customer IDs in each input dataset needed to match exactly. In some cases though, the ability to join based on a “close” match is what you really need.
Return to the Flow, and open the country_gdp dataset to remember what it contains.
Country names or abbreviations are one example where fuzzy matching may be a solution. For example, we would be able to match “UK” in one dataset with “U.K.” in another.
From the Flow, double click to open the Prepare recipe.
At the bottom of the recipe script, click to +Add a New Step.
Select Fuzzy join with other dataset.
Select ip_country as the “Join column (here)”.
Add country_gdp as the “Dataset to join with”.
Select Country as the “Join column (in other dataset)”.
Select GDP_cap as the “Column to retrieve”.
Remove the prefix.
Select “Normalize text”.
This processor works fine for joining this small lookup table, but there is also a more powerful Fuzzy Join recipe that works with larger datasets.
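The idea behind text normalization for fuzzy matching can be sketched as follows. This is a deliberately crude normalization (lowercase, strip punctuation) on made-up lookup values; Dataiku's “Normalize text” option is more capable:

```python
import re

def normalize(name):
    """Collapse variants like "U.K." and "uk" to the same key."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

# Hypothetical lookup table standing in for country_gdp.
gdp_lookup = {normalize(c): gdp for c, gdp in [("U.K.", 42000), ("France", 38000)]}

# Country spellings from the "left" dataset still find their match.
for country in ["UK", "france"]:
    print(country, gdp_lookup.get(normalize(country)))
```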
Write a Formula¶
You might also want to create new features from those already present. For this purpose, Dataiku has a Formula language, similar to what you would find in a spreadsheet tool.
Dataiku formulas are a powerful expression language available in many places on the platform to perform calculations, manipulate strings, and much more. This language includes:
common mathematical functions, such as round, sum and max;
comparison operators, such as > and <=;
logical operators, such as && and ||;
tests for missing values, such as isBlank();
string operations with functions like contains() and length();
conditional if-then statements.
At the bottom of the recipe script, click to +Add a New Step.
Name the “Output column” (for example, high_gdp_country).
Click Open Editor Panel to open the Formula editor.
Type the following formula, and click Apply when finished:

if(numval("GDP_cap") > 40000, 1, 0)

As this is the last step for now, click Run to execute the Prepare recipe and produce the output dataset.
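For reference, the formula flags rows whose GDP per capita exceeds 40000. The same flag could be computed in pandas (the values and output column name here are only placeholders):

```python
import pandas as pd

# Hypothetical GDP_cap values; the formula if(numval("GDP_cap") > 40000, 1, 0)
# evaluates to 1 when GDP per capita exceeds 40000 and 0 otherwise.
df = pd.DataFrame({"GDP_cap": [65000, 12000, 41000]})
df["high_gdp"] = (df["GDP_cap"] > 40000).astype(int)

print(df["high_gdp"].tolist())  # [1, 0, 1]
```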
This is only a small window into what can be accomplished in a Prepare recipe. See the Processors reference for a complete list.
The ability to quickly visualize data is an important part of the data cleaning process. It’s also essential to be able to communicate results.
Create a Chart¶
Let’s visualize how customer churn differs by age and sex.
Open the training_data_prepared dataset.
Navigate to the Charts tab.
From the list of columns on the left, drag the churn column to the Y-axis and age to the X-axis.
Click on the churn dropdown in the Y-axis field to change the aggregation function to “Sum”.
Click on the age dropdown in the X-axis field to change the binning mode to “Fixed size intervals”.
Reduce the bin size to a smaller value, such as 10.
Click the box to “Generate one tick per bin”.
Drag the sex column to the “And” field.
Be aware that charts, by default, are built on the same sample as in the Explore tab. To see the chart on the entire dataset, you need to change the sampling method.
On the left hand panel, switch from “Columns” to Sampling & Engine.
Uncheck “Use same sample as explore”.
Change the Sampling method to No sampling (whole data).
Click Save and Refresh Sample.
Create a Map¶
Let’s also explore the geographic distribution of customer churn.
At the bottom of the Charts tab, click +Chart.
Click the Chart type dropdown, and choose the Globe icon. Then select a Scatter Map.
Drag ip_geopoint to the “Geo” field.
Drag churn to the “Details” field.
Click on the churn dropdown, and click to treat it as an alphanumeric variable.
Click on the color droplet to the left of the churn box, and change the palette to Set2.
In the empty field to the right of the “Details” field, reduce the base radius of the points to 2.
Click to edit the chart title to Churn by IP Address.
Note how the sampling method (none) remains the same when creating the second chart.
Create a Statistical Worksheet¶
You can also run a number of common statistical tests to analyze your data with more rigor. You retain control over the sampling settings, confidence levels, and where the computation should run.
Navigate to the Statistics tab of the training_data_prepared dataset.
Click +Create your First Worksheet > Bivariate analysis.
From the list of available variables on the left, select ip_continent and price_first_item_purchased.
Click the plus to add them to the selected factor(s) section.
Add churn to the Response section in the same way.
Click Create Card.
From the “Sampling and filtering” dropdown near the top left of screen, change the Sampling method to No sampling (whole data), and click Save and Refresh Sample.
Explore the output of your first card in your first statistical worksheet.
Feel free to add more cards to your existing worksheet or create a new one with a different kind of statistical analysis!
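A bivariate analysis simply relates one variable to another. As a minimal sketch of the idea (toy data; the worksheet itself runs formal statistical tests that this does not attempt), here is a group comparison of first-purchase price between churners and non-churners:

```python
import pandas as pd

# Hypothetical sample: does first-purchase price differ between churners
# (churn = 1) and non-churners (churn = 0)?
df = pd.DataFrame({
    "churn": [0, 0, 1, 1, 0, 1],
    "price_first_item_purchased": [40.0, 55.0, 12.0, 18.0, 60.0, 9.0],
})

# Mean price per group: the simplest bivariate view of these two columns.
means = df.groupby("churn")["price_first_item_purchased"].mean()
print(means)
```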
Having explored and prepared the training data, now let’s start building machine learning models to predict which customers will churn.
Dataiku contains a powerful automated machine learning engine that allows you to create highly optimized models with minimal user intervention.
At the same time, you retain full control over all aspects of a model’s design, such as which algorithms to test or which metrics to optimize. A high level of customization can be accomplished with the visual interface or expanded even further with code.
Train a Model¶
In order to train a model, you need to enter the Lab.
The Lab is the place in Dataiku optimized for experimentation and discovery. Keeping experimentation (such as model training) in the Lab helps avoid overcrowding the Flow with unnecessary items that may not be used in production.
From the Flow, select the training_data_prepared dataset.
From the Actions sidebar, click on the Lab.
Under “Visual ML”, choose AutoML Prediction.
Select churn as the target variable on which to create the prediction model.
Leave the “Quick Prototypes” setting and the default name, and click Create.
Without reviewing the design, click Train to start building models.
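Conceptually, a training session like this splits the data, fits one or more classifiers, and evaluates them on held-out rows. A very rough sketch with scikit-learn on synthetic data follows; Dataiku's AutoML handles all of this for you, including the algorithm selection and tuning that this sketch omits:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 3 numeric features, a churn-like binary target
# driven mostly by the first feature plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(scale=0.3, size=200) > 0).astype(int)

# Hold out a test set, fit a random forest, and score it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

print("test accuracy:", model.score(X_test, y_test))
```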
Inspect a Model¶
You have full transparency into how the random forest and logistic regression models were produced in the Design tab, as well as a number of tools to understand their interpretation and performance in the Result tab.
Let’s explore the results of the initial model training session.
From the Result tab of the visual analysis “Quick modeling of churn on training_data_prepared”, see the two results in Session 1 on the left side of the screen.
Click Random forest, which happened to be the highest-performing model in this case.
Explore some of the model report, such as the variable importance chart, confusion matrix, and ROC curve.
Navigate to the Subpopulation analysis panel.
Choose sex as the variable.
Click Compute to examine whether the model behaves identically for male and female customers.
Iterate on a Model¶
You now have a baseline model, but you may be able to improve its performance by adjusting the design in many different ways.
When finished viewing the model report, click on Models near the top of the page to return to the results of the training session.
Navigate from the Result tab to the Design tab near the top center of the page.
Examine some of the conditions under which the first models were trained, such as the split of the train and test sets in the “Train / Test Set ” panel.
In the Features handling panel, turn off two features (ip_geopoint and birth_parsed) for which we have already extracted their information into ip_country and age.
In the Feature generation panel, add an explicit pairwise interaction between ip_country and sex.
In the Algorithms panel, add a new algorithm to the training session, such as “XGBoost”.
When satisfied with your changes, click Train near the top right corner.
On the Training models dialog, click Train once again to build models with these new settings.
When training models, you can adjust many more possible settings covered in the product documentation.
Deploy a Model¶
The amount of time devoted to iterating the model design and interpreting the results can vary considerably depending on your objectives. In this case, let’s assume you are satisfied with the best performing model, and you want to use it to generate predictions for new data on which the model has not been trained.
From the Result tab of the analysis, click to open the report of the best performing model (in this case, the XGBoost model from Session 2).
Click Deploy near the top right corner.
Click Create on the following dialog.
Observe two new green objects in the Flow:
a training recipe (green circle), and
a deployed or saved model (green diamond)
Double click to open the saved model from the Flow.
Once you have a model deployed to the Flow, you can use it to generate predictions on new, unseen data. This new data, however, first needs to be prepared in a certain way because the deployed model expects a dataset with the exact same set of features as the training data.
Although there are more sophisticated ways of preparing a dataset for scoring (such as using the Stack and Split recipes), let’s keep things simple, and send the new data through the same pipeline by copying recipes.
From the Flow, select both the Join and Prepare recipes. (On a Mac, hold Shift to select multiple objects).
From the Actions sidebar, choose Copy.
From the Flow View menu at the bottom left, choose Copy To.
Rename training_data_prepared_1 as scoring_data_prepared.
Rename training_data_1 as scoring_data.
The Flow now looks like a mess! That’s because the recipes have been copied to the Flow, but their inputs need to change.
Open the copied Join recipe (compute_training_data_1).
On the Join step, replace the input datasets:
Click on web_last_year, and choose web_this_year as the replacement.
Click on crm_last_year, and choose crm_this_year as the replacement.
Instead of running the recipe right away, click Save, and update the schema.
Return to the Flow to see the parallel pipeline for the scoring data.
When designing a Flow under normal circumstances, a better practice would be to build the dataset after copying the Join recipe by itself. If you open the copied Prepare recipe, you’ll notice errors due to the schema changes. This is done only to introduce the concept below.
Build a Pipeline¶
Now you have a pipeline created for the scoring data, but the datasets themselves are empty. Note how they are transparent with a dotted outline. Let’s instruct Dataiku to build the prepared scoring data, along with any upstream datasets needed to make that happen.
When you make changes to recipes within a Flow, or if there is new data arriving in your source datasets, you will want to rebuild datasets to reflect these changes. There are multiple options for propagating these changes.
A non-recursive option builds only the selected dataset using its parent recipe.
Various recursive options, such as smart reconstruction, check upstream for changes in the Flow and rebuild only those datasets necessary to produce the requested output.
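The recursive options above can be pictured as a walk up the dependency graph: a dataset needs rebuilding if it, or anything upstream of it, has changed. A minimal sketch of that idea (the dataset names mirror this tutorial's Flow; the logic is illustrative, not Dataiku's actual implementation):

```python
# Each dataset maps to the datasets its parent recipe reads from.
parents = {
    "scoring_data": ["web_this_year", "crm_this_year"],
    "scoring_data_prepared": ["scoring_data"],
}

def needs_rebuild(dataset, changed):
    """A dataset is rebuilt if it changed, or if any upstream dataset did."""
    return dataset in changed or any(
        needs_rebuild(p, changed) for p in parents.get(dataset, [])
    )

# A change in a source dataset propagates all the way downstream.
print(needs_rebuild("scoring_data_prepared", changed={"crm_this_year"}))  # True
```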
From the Flow, open the outlined scoring_data_prepared dataset.
Dataiku tells you that it is empty. Click Build.
Choose Recursive instead of the default “Non-recursive” option.
Leave “Smart reconstruction” as the Build mode, but choose Preview instead of “Build Dataset”, just for the purpose of demonstration.
Clicking Preview computes job dependencies and shows the Jobs page.
Actions in Dataiku, such as running a recipe or training a model, generate a job. You can view ongoing or recent jobs on the Jobs page, accessible from the top navigation bar (or with a keyboard shortcut).
Examine the two activities in this job preview:
One runs the copied Join recipe.
Another runs the copied Prepare recipe.
At the top right, click Run to initiate the job that will first build scoring_data and then scoring_data_prepared.
When the job has finished, view the scoring_data_prepared dataset.
Return to the Flow to see the previously outlined pipeline now fully built.
The system of dependencies between objects in the Flow, teamed with these various build strategies, makes it possible to automate the process of rebuilding datasets and retraining models using scenarios.
Apply a Score Recipe¶
You now have a dataset with the exact same set of features as the training data. The only difference is that, in the data ready to be scored, the values for the churn column, which is the target variable, are entirely missing, as we do not know which new customers will churn.
Let’s use the model to generate predictions of whether the new customers will churn.
From the Flow, select the scoring_data_prepared dataset.
From the sidebar, click on the schema button (third icon from the top). Observe how it is the same schema as the training_data_prepared dataset.
Open the scoring_data_prepared dataset. Observe how the churn column is empty as expected.
From the Flow, select the deployed prediction model, and add a Score recipe from the Actions sidebar.
Choose scoring_data_prepared as the input dataset.
Name the output scoring_data_scored.
Click Create Recipe.
Leave the default recipe settings, click Run, and then return to the Flow.
Inspect the Scored Data¶
Let’s take a closer look at the scored data.
Open the scoring_data_scored dataset, and observe the three new columns appended to the end:
proba_0 is the probability a customer remains.
proba_1 is the probability a customer churns.
prediction is the model’s prediction of whether the customer will churn.
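The three columns relate in a simple way: for each row, proba_0 + proba_1 = 1, and prediction applies a cutoff to proba_1. As an illustration only (the probability values are made up, and 0.5 is just an assumed cutoff; Dataiku actually optimizes this threshold):

```python
# Toy scored rows standing in for scoring_data_scored.
rows = [
    {"proba_0": 0.82, "proba_1": 0.18},
    {"proba_0": 0.35, "proba_1": 0.65},
]

threshold = 0.5  # assumed cutoff; the deployed model uses an optimized one
for row in rows:
    # Predict churn (1) when the churn probability clears the threshold.
    row["prediction"] = 1 if row["proba_1"] >= threshold else 0

print([r["prediction"] for r in rows])  # [0, 1]
```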
Use the Analyze tool on the prediction column to see what percentage of new customers the model expects to churn.
How does the predicted churn rate on the new data compare to the churn rate observed in the training data?
Congratulations! In a short amount of time, you were able to:
connect to data sources;
prepare training data;
create visualizations and statistical analyses;
train a model and apply it to new data.
This simple example was completed using only the native visual tools. However, each of these steps could have been extended or customized with your own code.
Your project does not have to stop here. Following this point, you might want to:
create a scenario to rebuild the datasets and retrain the model on a time-based or event-based trigger;
document your workflow and results on a wiki;
share the output datasets with other Dataiku projects; sync them to other connections; or use plugins to export them to tools like Tableau, Power BI, or Qlik.
This quick start tutorial is only the tip of the iceberg when it comes to the capabilities of Dataiku. To learn more, please visit the Academy, where you can find more courses, learning paths, and certifications to test your knowledge.