Review the Flow#
See a screencast covering this section's steps.
One of the first concepts a user needs to understand about Dataiku is the Flow. The Flow is the visual representation of how datasets, recipes (steps for data transformation), and models work together to move data through an analytics pipeline.
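Although this tutorial works entirely in the browser, the same Flow items are also exposed through Dataiku's public Python API. The snippet below is a minimal sketch, not a required step; the instance URL, API key, and project key (`dss.example.com`, `YOUR_API_KEY`, `MY_PROJECT`) are all placeholders you would replace with your own values.

```python
import dataikuapi

# Placeholders: substitute your instance URL, an API key, and your project key.
client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
project = client.get_project("MY_PROJECT")

# Each entry describes one Flow item; depending on your DSS version these are
# plain dicts or list-item objects that support the same key access.
for ds in project.list_datasets():
    print("Dataset:", ds["name"], "- storage type:", ds["type"])

for recipe in project.list_recipes():
    print("Recipe:", recipe["name"], "- type:", recipe["type"])
```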
See the Flow’s visual grammar#
Dataiku has its own visual grammar to organize AI, machine learning, and analytics projects in a collaborative way.
Shape | Item | Icon
---|---|---
Square | Dataset | The icon on the square represents the dataset's storage location, such as Amazon S3, Snowflake, PostgreSQL, etc.
Circle | Recipe | The icon on the circle represents the type of data transformation, such as a broom for a Prepare recipe or coiled snakes for a Python recipe.
Diamond | Model | The icon on the diamond represents the type of modeling task, such as prediction, clustering, time series forecasting, etc.
Tip
In addition to shape, color has meaning too.
Datasets are blue. Those shared from other projects are black.
Visual recipes are yellow. Code recipes are orange. LLM recipes are pink. Plugin recipes are red.
Machine learning elements are green.
Take a look at the items in the Flow now!
If not already there, from the left-most menu in the top navigation bar, click on the Flow (or use the keyboard shortcut g + f).
Tip
There are many other keyboard shortcuts beyond g + f. Type ? to pull up a menu, or see the Accessibility page in the reference documentation.
Use the right panel to review an item’s details#
To collaborate on a project, you’ll need to quickly get up to speed on what someone else’s Flow accomplishes. Let’s try to figure out the purpose of this one.
Click once on the job_postings dataset to select it.
Click the Details icon in the right panel to learn more about this item.
Click on the Schema tab underneath to see its columns.
Click on the test_scored dataset at the end of the pipeline, and review the same tabs. Note the addition of a prediction column.
Review the recipes that transform job_postings to test_scored, beginning with the Prepare recipe at the start of the pipeline. Click once to select each one, and review the Details tab to help determine what it does. (A programmatic way to check the same schemas appears in the sketch below.)
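If you prefer to verify these details in code, for example from a Jupyter notebook inside this project, the internal dataiku package can read the same schemas. This is a minimal sketch; the dataset names come from this project's Flow.

```python
import dataiku

# Inside DSS (e.g., a notebook in this project), datasets are addressable
# by the names shown in the Flow.
job_postings = dataiku.Dataset("job_postings")
test_scored = dataiku.Dataset("test_scored")

# read_schema() returns one entry per column, with its name and storage type.
for column in job_postings.read_schema():
    print(column["name"], column["type"])

# The scored dataset should show the extra prediction column noted above.
print([c["name"] for c in test_scored.read_schema()])
```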
Note
The model in this project happens to be a simple AutoML model. However, you can think of it as a placeholder for any kind of model — not only those built in Dataiku, but also custom models imported into Dataiku.
You could read the project's wiki (use the keyboard shortcut g + w) for more information, but from just browsing the Flow, you probably already have a good idea of what this project does. The pipeline prepares some data and builds a prediction model in order to classify a job posting as real or fake.
The readability of the Flow eases the challenge of bringing users of diverse skill sets and responsibilities onto the same platform. For example:
The Flow has visual recipes (in yellow) that can be understood by all, but also custom code (in orange).
The Flow is divided into two interconnected Flow zones, which can be useful for teams focused on different stages of a project.
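For teams scripting against a project, recent DSS versions also expose Flow zones through the public API. A minimal sketch, reusing the placeholder client and project objects from the first snippet; the availability of list_zones() depends on your DSS version.

```python
# Continuing from the client/project placeholders defined earlier.
flow = project.get_flow()

# Each zone groups the Flow items for one stage of the project.
for zone in flow.list_zones():
    print("Zone:", zone.name)
```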
Build the Flow#
Unlike the initial uploaded datasets, the downstream datasets appear as outlines. This is because they have not been built, meaning that the relevant recipes have not been run to populate these datasets. However, this is not a problem because the Flow contains the recipes required to create these outputs at any time.
Click to open the Flow Actions menu in the bottom right.
Click Build all.
Leaving the default options, click Build to run the recipes necessary to create the items furthest downstream.
When the job completes, refresh the page to see the built Flow.
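The same build can also be triggered programmatically. A minimal sketch, again reusing the placeholder client from the first snippet; the build() helper on a dataset object is available in recent versions of dataikuapi.

```python
# Continuing from the client/project placeholders defined earlier.
dataset = project.get_dataset("test_scored")

# A recursive build runs every upstream recipe needed to populate this
# dataset, similar to Build all in the Flow Actions menu. The call waits
# for the job to finish before returning.
job = dataset.build(job_type="RECURSIVE_BUILD")
print("Build complete; refresh the Flow to see the filled-in items.")
```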