Explore the Flow
In this section, you will explore a Dataiku DSS project and its Flow.
Explore the Project Homepage
Let’s start by navigating to the project homepage to learn about its purpose and contents.
Go back to the Dataiku DSS homepage and open the AI Consumer Quick Start (Tutorial) project by clicking on its tile.
You will land on the project homepage. It is a convenient high-level overview of the project’s status and recent activity.
From here, you can see the project title and tags and check the overall status of the project. You can also read the description and the list of to-do items, view the project contributors, or check out recent user activity in the Timeline.
In the project homepage, you’ll also find useful collaboration features such as discussions and wikis.
From the project homepage, scroll down a bit to view the project description (which also includes a Go to the Course on Dataiku Academy button for quick access to the tutorial instructions) and the project to-do list.
From the displayed project items above, click Wiki to open the Project Read Me wiki article and read about the purpose and contents of the project.
Click AI Consumer Quick Start (Tutorial) in the top navigation bar to go back to the project homepage.
Explore Flow Items
From the project homepage, click Go to Flow. Alternatively, you can use the shortcut G + F.
Note
The Flow is the visual representation of how datasets, recipes (steps for data transformation), and models work together to move data through an analytical pipeline.
The Flow of this project is divided into two Flow zones, “Data Preparation” and “ML Fraud Prediction”. We will dig deeper into the concept of Flow zones in the next section.
Notice that the Flow is composed of several types of elements, or Flow items:
Blue squares represent datasets. The icon on the square represents the type of dataset or its underlying storage connection, such as a file-based system, an SQL database, or cloud storage.
Yellow circles represent visual recipes (data transformation steps that don’t require coding and can be built with the Dataiku DSS visual UI).
Green elements represent machine learning (ML) elements.
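To make the idea of Flow items concrete, here is a minimal sketch that models this project's Flow as a directed graph in plain Python. It does not use Dataiku's API. Only the four transactions_* dataset names that appear in this tutorial are real; the input dataset names, the intermediate joined dataset, and all recipe and model names are hypothetical placeholders, and the exact set of ML elements is an assumption.

```python
from collections import Counter, deque

# Flow items and their kind. Names marked "hypothetical" are placeholders.
FLOW_ITEMS = {
    "input_a": "dataset",                    # hypothetical
    "input_b": "dataset",                    # hypothetical
    "input_c": "dataset",                    # hypothetical
    "join_recipe": "recipe",                 # hypothetical name
    "joined": "dataset",                     # hypothetical intermediate dataset
    "prepare_recipe": "recipe",              # hypothetical name
    "transactions_joined_prepared": "dataset",
    "split_recipe": "recipe",                # hypothetical name
    "transactions_known": "dataset",
    "transactions_unknown": "dataset",
    "train_recipe": "ml",                    # assumed ML element
    "trained_model": "ml",                   # assumed ML element
    "score_recipe": "ml",                    # assumed ML element
    "transactions_unknown_scored": "dataset",
}

# Directed edges: data moves from the first item to the second.
FLOW_EDGES = [
    ("input_a", "join_recipe"),
    ("input_b", "join_recipe"),
    ("input_c", "join_recipe"),
    ("join_recipe", "joined"),
    ("joined", "prepare_recipe"),
    ("prepare_recipe", "transactions_joined_prepared"),
    ("transactions_joined_prepared", "split_recipe"),
    ("split_recipe", "transactions_known"),
    ("split_recipe", "transactions_unknown"),
    ("transactions_known", "train_recipe"),
    ("train_recipe", "trained_model"),
    ("trained_model", "score_recipe"),
    ("transactions_unknown", "score_recipe"),
    ("score_recipe", "transactions_unknown_scored"),
]


def downstream(start, edges):
    """Return every item reachable from `start` by following the edges."""
    children = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


kind_counts = Counter(FLOW_ITEMS.values())
after_prepared = downstream("transactions_joined_prepared", FLOW_EDGES)

print(dict(kind_counts))
print(sorted(after_prepared))
```

Walking the edges from transactions_joined_prepared collects everything that depends on it, which is the same reading you get by following the arrows in the Flow from left to right.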
Explore Flow Zones
Now, let’s discover the concept of Flow zones, and find out more about the operations performed in each Flow zone. As mentioned above, the Flow of this project is divided into two Flow zones: “Data Preparation” and “ML Fraud Prediction”.
Note
Data science projects tend to grow quickly, accumulating a large number of recipes and datasets in the Flow. This can make the Flow hard to read and navigate. Large projects can be better managed by dividing them into Flow zones.
Zones can be defined in the Flow, and datasets and recipes can be moved into different zones. You can work within a single zone or the whole Flow, and collapse zones to create a simplified view of the Flow.
First, let’s explore the “Data Preparation” Flow zone.
Click the full screen icon in the upper right corner of the Data Preparation Flow zone window, or double-click anywhere in the white space inside the window, in order to open the Flow zone.
This Flow zone contains three input datasets, which are joined together using a Join recipe and then prepared using a Prepare recipe, resulting in the transactions_joined_prepared dataset.
Notice the dashed lines around the blue square icon of the transactions_joined_prepared dataset. They indicate that this dataset is shared into another Flow zone – in this case, the “ML Fraud Prediction” one.
Before getting started with exploring the data, let’s also briefly look at the “ML Fraud Prediction” Flow zone.
Click the “X” button in the upper right corner of the screen to exit the Data Preparation Flow zone and go back to the initial view.
Click the full screen icon in the upper right corner of the ML Fraud Prediction Flow zone window, or double-click anywhere in the white space inside the window, in order to open the Flow zone.
By doing this, you have navigated to another zone of the Flow, which starts from the transactions_joined_prepared dataset and contains all the downstream Flow items. As its name suggests, this Flow zone relates to the machine learning tasks in the project.
Note
As an AI Consumer, you do not interact directly with the machine learning (ML) elements of the Flow and you do not need to understand their full extent, but you are nevertheless able to draw, consume and manipulate actionable ML insights (as seen in the dashboards and the Dataiku app of the previous two sections).
In this Flow zone, the transactions_joined_prepared dataset is split into two datasets using a Split recipe:
transactions_known contains the transactions that have been identified as either authorized or unauthorized; and
transactions_unknown contains the transactions for which we don’t know whether they were authorized or not.
The transactions_known dataset is then used to train a machine learning model to predict whether a transaction is authorized or not. The trained model is then applied to transactions_unknown, producing the transactions_unknown_scored dataset. This final dataset lists all the previously “unknown” transactions, each flagged by the ML model as either authorized or unauthorized (that is, potentially fraudulent).
Click the “X” button in the upper right corner of the screen to exit the ML Fraud Prediction Flow zone and go back to the initial view.
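The split, train, and score sequence described above can be sketched in plain Python. This is an illustration only, not the project's actual recipes or model: the column names (id, amount, authorized), the sample rows, and the nearest-mean rule used as a stand-in "model" are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical rows: column names and values are invented for illustration.
transactions_joined_prepared = [
    {"id": 1, "amount": 20.0, "authorized": 1},
    {"id": 2, "amount": 35.0, "authorized": 1},
    {"id": 3, "amount": 900.0, "authorized": 0},
    {"id": 4, "amount": 1100.0, "authorized": 0},
    {"id": 5, "amount": 25.0, "authorized": None},   # label unknown
    {"id": 6, "amount": 950.0, "authorized": None},  # label unknown
]

# Split step: rows with a known label go one way, the rest go the other.
transactions_known = [
    row for row in transactions_joined_prepared if row["authorized"] is not None
]
transactions_unknown = [
    row for row in transactions_joined_prepared if row["authorized"] is None
]


def train(known_rows):
    """Stand-in 'model': the mean amount observed for each label."""
    return {
        label: mean(r["amount"] for r in known_rows if r["authorized"] == label)
        for label in (0, 1)
    }


def score(model, unknown_rows):
    """Predict the label whose mean amount is closest to the row's amount."""
    scored = []
    for row in unknown_rows:
        prediction = min(model, key=lambda label: abs(row["amount"] - model[label]))
        scored.append({**row, "prediction": prediction})
    return scored


model = train(transactions_known)
transactions_unknown_scored = score(model, transactions_unknown)

for row in transactions_unknown_scored:
    print(row["id"], row["prediction"])
```

The point of the sketch is the shape of the pipeline: the model only ever learns from rows whose outcome is known, and its predictions are added as a new column on the rows whose outcome is not.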
View the Flow Using Tags
To enhance the interpretability of the Flow, you can use different Flow Views.
For example, you can filter your views based on Tags.
Note
The Tags view lets you see which objects in your Flow are associated with previously-defined tags. Tags help you organize your work and understand the purpose of objects in your Flow.
In this view, objects with an associated tag are highlighted depending on the selected tags. This view can be particularly helpful for understanding large or complicated Flows or when multiple people are working on the same Flow.
To use the Tags view:
Click View: default in the lower left corner and select Tags from the dropdown menu.
Activate all five tags and observe their effects in the Flow:
the input-datasets tag highlights in blue the three input datasets;
the data-prep-for-analysis tag highlights in yellow the intermediary datasets and recipes used for the purpose of preparing the dataset for analysis;
the for-analysis tag highlights in pink the dataset that we will be using for analysis, transactions_joined_prepared;
the data-prep-for-ML tag highlights in light green the recipe and the datasets relevant to the preparation of the data that feeds the machine learning model;
the ML tag highlights in dark green the machine learning operations and the output dataset containing the predictions made by the model.
Click the “X” to close the tags menu and return to the default view.
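The behavior of the Tags view can be pictured as a simple lookup from active tags to highlighted objects. The sketch below is plain Python rather than Dataiku's API. The tag names and colors come from the list above; the input dataset names are hypothetical placeholders, and the assignment of transactions_known and transactions_unknown to the data-prep-for-ML tag is an assumption.

```python
# Tag colors as described in the Tags view steps above.
TAG_COLORS = {
    "input-datasets": "blue",
    "data-prep-for-analysis": "yellow",
    "for-analysis": "pink",
    "data-prep-for-ML": "light green",
    "ML": "dark green",
}

# Tags carried by a few Flow objects. Names marked "hypothetical" are placeholders.
OBJECT_TAGS = {
    "input_a": {"input-datasets"},                 # hypothetical
    "input_b": {"input-datasets"},                 # hypothetical
    "input_c": {"input-datasets"},                 # hypothetical
    "transactions_joined_prepared": {"for-analysis"},
    "transactions_known": {"data-prep-for-ML"},    # assumed tagging
    "transactions_unknown": {"data-prep-for-ML"},  # assumed tagging
    "transactions_unknown_scored": {"ML"},
}


def highlight(active_tags):
    """Map each object that carries an active tag to that tag's color."""
    highlighted = {}
    for name, tags in OBJECT_TAGS.items():
        for tag, color in TAG_COLORS.items():
            if tag in tags and tag in active_tags:
                highlighted[name] = color
                break
    return highlighted


print(highlight({"for-analysis"}))
print(highlight(set(TAG_COLORS)))
```

Activating a single tag highlights only the objects that carry it, while activating all five colors every tagged object, which matches what you see when toggling tags in the view.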