Tutorial | Flow views¶
Flow views (accessed at the bottom left menu) provide various options for displaying the Flow with different levels of detail. These views can help with organizing large Flows so that they are easier to navigate. They can also guide the optimization of a Flow.
Let’s get started!¶
In this hands-on lesson, you will learn to:
create and manage tags for better data governance;
create and manage Flow Zones to create a higher-level view of the Flow;
use Flow Zones to isolate experimental branches of the Flow; and
leverage available view options to highlight details about your Flow.
Advanced Designer Prerequisites
This lesson assumes that you have basic knowledge of working with Dataiku DSS datasets and recipes.
If not already on the Advanced Designer learning path, completing the Core Designer Certificate is recommended.
You’ll need access to an instance of Dataiku DSS (version 9.0 or above) with the following plugins installed:
Census USA (minimum version 0.3)
These plugins are available through the Dataiku Plugin store, and you can find the instructions for installing plugins in the reference documentation. To check whether the plugins are already installed on your instance, go to the Installed tab in the Plugin Store to see a list of all installed plugins.
We also recommend that you complete the Flow Views: Zones, Tags, & More lesson beforehand.
Plugin Installation for Dataiku Cloud Users
Users of Dataiku Cloud should note that plugin installation follows a different path compared to on-premises or local instances.
Navigate to the Plugins tab of your launchpad.
Click Add a Plugin.
Search for the plugin by name, in this case US Census. (“Reverse geocoding” is already available by default, and so does not need to be installed.)
These tutorials use only a Design node, so click Install on Design.
After installation, it may take a few minutes before the plugin’s components appear, depending on the number of existing plugins and code environments on the instance.
Create the project¶
You can use a project from the previous Plugin Store hands-on tutorial. If you skipped it, create this project instead:
Click +New Project > DSS Tutorials > Advanced Designer > Flow Views & Automation (Tutorial).
You can also download the starter project from this website and import it as a zip file.
Change Dataset Connections (Optional)
Aside from the input datasets, all of the others are empty managed filesystem datasets.
You are welcome to leave the storage connection of these datasets in place, but you can also use another storage system depending on the infrastructure available to you.
To use another connection, such as a SQL database, follow these steps:
Select the empty datasets from the Flow. (On a Mac, hold Shift to select multiple datasets).
Click Change connection in the “Other actions” section of the Actions sidebar.
Use the dropdown menu to select the new connection.
For a dataset that is already built, changing to a new connection clears the dataset, so it will need to be rebuilt.
Another way to select datasets is from the Datasets page (G+D). There are also programmatic ways of doing operations like this that you’ll learn about in the Developer learning path.
The screenshots below demonstrate using a PostgreSQL database.
Whether starting from an existing or fresh project, ensure that the entire Flow is built.
See Build Details Here if Necessary
To build the entire Flow, click Flow Actions at the bottom right corner of the Flow.
Select Build all.
Build with the default “Build required dependencies” option for handling dependencies.
See the article on Dataset Building Strategies and the product documentation on Rebuilding Datasets to learn more about strategies for building datasets.
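Conceptually, the default “Build required dependencies” strategy walks upstream from the target dataset and rebuilds only what is out of date, in dependency order. The sketch below is an illustrative model of that idea (the dataset names and the `required_builds` helper are hypothetical; this is not Dataiku’s actual build logic):

```python
# Conceptual sketch of a "build required dependencies" strategy (illustrative,
# not Dataiku's implementation): walk upstream from the target and rebuild
# only out-of-date datasets, in dependency order.

def required_builds(deps, stale, target):
    """Return the out-of-date upstream datasets (plus the target) in build order.

    deps:   dict mapping each dataset to the list of datasets it is built from
    stale:  set of datasets whose current data is out of date
    target: the dataset we ultimately want to build
    """
    order, visited = [], set()

    def visit(ds):
        if ds in visited:
            return
        visited.add(ds)
        for upstream in deps.get(ds, []):
            visit(upstream)  # handle inputs before the dataset itself
        # Rebuild if the dataset is the target, is stale, or has a rebuilt input
        if ds == target or ds in stale or any(u in order for u in deps.get(ds, [])):
            order.append(ds)

    visit(target)
    return order

# Hypothetical mini-Flow: transactions -> transactions_prepared -> transactions_joined
deps = {
    "transactions_prepared": ["transactions"],
    "transactions_joined": ["transactions_prepared", "merchants"],
}
print(required_builds(deps, stale={"transactions_prepared"}, target="transactions_joined"))
# -> ['transactions_prepared', 'transactions_joined']
```

If nothing upstream is stale, only the target itself is built; that is the efficiency the default dependency handling aims for.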
Flow zones help to organize large Flows so that they are easier to navigate at a higher level of abstraction. This can help to quickly onboard new team members to projects, as they will be able to grasp the overall purpose of the Flow before getting into the details.
We’ll begin by showing how to:
create Flow zones and move objects into zones to create a higher-level view of the Flow;
manage the contents and properties of existing Flow zones; and
use Flow zones to isolate experimental branches of the Flow.
Creating Flow zones¶
To create your first zone:
From the top right corner of the Flow, click + Zone.
Enter Fraud detection as the name of the zone.
This creates an empty zone named Fraud detection, and reveals the Default zone, which presently contains all of the Flow objects.
Moving objects to Flow zones¶
To move objects into the Fraud detection Flow zone:
In the Default zone, select the branch of the Flow starting with the three datasets produced by the Download recipes and ending with the dataset transactions_unknown_scored.
Hold down Shift or Command while dragging a box to select several Flow items at once; Shift-click any unwanted objects to deselect them.
In the right panel, or by right-clicking to open the context menu, select Move to a flow zone.
Confirm in the modal dialog that Fraud detection is selected as the destination zone, and then click Move.
Another tool for managing large flows is Flow folding. You can learn more in the product documentation.
Moving a dataset moves its parent recipe¶
When you move a dataset to a new Flow zone, its parent recipe comes with it.
Let’s create a new Flow zone directly from selected objects in the Flow. For example:
Click to enter the Default zone for a better view.
Select the six objects used primarily in the Plugin Store tutorial:
the managed folder and two datasets about income per census tract: income_per_tract_usa (folder and dataset) and income_per_tract_usa_copy.
the three datasets about merchant information: transactions_by_merchant_id, merchant_census_tracts, and merchant_census_tracts_joined.
From the right panel, select Move to a Flow Zone.
Within the modal dialog, click New Zone, and name it Merchant analysis.
Even though you did not explicitly select them, the dialog warns that moving these datasets will have the additional effect of moving several recipes into the new zone. These are the parent recipes of the selected datasets. A recipe and its outputs always live in the same zone, and so it can be helpful to think in terms of moving recipes as opposed to moving datasets.
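The rule that a recipe and its outputs always live in the same zone can be pictured with a tiny toy model (hypothetical names and classes, not the Dataiku API):

```python
# Toy model of the zone-move behavior described above (hypothetical, not the
# Dataiku API): moving a dataset to a zone drags its parent recipe along,
# because a recipe and its outputs always live in the same zone.

class Flow:
    def __init__(self):
        self.zone_of = {}        # object name -> zone name
        self.parent_recipe = {}  # dataset -> recipe that produces it

    def add(self, name, zone="Default", parent=None):
        self.zone_of[name] = zone
        if parent:
            self.parent_recipe[name] = parent
            self.zone_of.setdefault(parent, zone)

    def move_to_zone(self, dataset, zone):
        self.zone_of[dataset] = zone
        recipe = self.parent_recipe.get(dataset)
        if recipe:  # the parent recipe follows its output dataset
            self.zone_of[recipe] = zone

flow = Flow()
flow.add("compute_income", zone="Default")  # hypothetical recipe name
flow.add("income_per_tract_usa_copy", zone="Default", parent="compute_income")

flow.move_to_zone("income_per_tract_usa_copy", "Merchant analysis")
print(flow.zone_of["compute_income"])  # the recipe moved too
```

This is why the dialog warns about extra recipes moving: you selected datasets, but each move implicitly relocates the recipe that produces them.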
The default zone¶
Finally, we can rename the Default zone to something more descriptive.
Right-click on the Default zone, and select Edit from the context menu.
Enter Transactions analysis as the new name.
Although we have renamed the default zone, Transactions analysis is still the “Default” zone. You can see this when viewing the object in the right panel. Note also that, unlike the other zones, it cannot be deleted.
This can be helpful to remember in some situations. For example, when deleting a Flow zone, its objects return to the default zone. If you were to delete the Fraud detection zone, objects there would be transferred to the Transactions analysis zone.
Flow zone views¶
To see the zones in a project at any given time, there is a special Flow Zones view:
From the View menu in the lower left corner of the Flow, select Flow Zones.
Click Hide Zones.
This view shows the entire Flow as one, but with the Flow objects colored according to their assigned zone.
Since Flow zones are DSS objects, you can give them descriptions and tags, or hold discussions on them. You can access this functionality in the right panel, as you can with other Flow objects.
Isolating experimental work¶
Lastly, you can use Flow Zones to mark off “experimental” work within a Flow. For this goal, it is helpful to understand the difference between “moving” and “sharing” objects to a Flow zone.
When moving objects to a Flow zone, as done above, the parent recipes also move to the destination zone. Moving objects helps with the high-level organization of the project.
Another option is to share a dataset to a new Flow zone. A dataset shared to a Flow zone is available to work on in a fresh space; however, it has not been moved from its original zone.
When sharing a dataset (instead of moving), the parent recipe is not shared, and so you can create a new zone with any dataset of interest without disturbing the main Flow of a project.
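The move/share distinction can be sketched with another small toy model (hypothetical, not the Dataiku API): sharing makes a dataset visible in a second zone without relocating it, and its parent recipe stays put.

```python
# Toy model (hypothetical, not the Dataiku API) of sharing a dataset to a
# zone: the dataset becomes visible in the extra zone, while its home zone
# and its parent recipe are left untouched.

class Flow:
    def __init__(self):
        self.home_zone = {}    # object -> the zone it lives in
        self.shared_into = {}  # dataset -> set of extra zones it is visible in

    def add(self, name, zone):
        self.home_zone[name] = zone
        self.shared_into[name] = set()

    def share_to_zone(self, dataset, zone):
        # Unlike a move, the home zone (and the parent recipe) is untouched
        self.shared_into[dataset].add(zone)

    def visible_in(self, dataset):
        return {self.home_zone[dataset]} | self.shared_into[dataset]

flow = Flow()
flow.add("compute_merchant_census_tracts_joined", "Merchant analysis")  # parent recipe
flow.add("merchant_census_tracts_joined", "Merchant analysis")

flow.share_to_zone("merchant_census_tracts_joined", "Experimental")
print(flow.visible_in("merchant_census_tracts_joined"))
```

In other words, sharing gives you a working copy of the reference in a new zone, which is why it is well suited to experimental branches.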
Let’s see how this works.
Close the Fraud detection zone to return to the main view of the Flow, where all zones are visible.
Expand the Merchant analysis zone, and right-click on the last output dataset merchant_census_tracts_joined.
Right-click to open the context menu, and select Share to a flow zone.
In the modal dialog, click New Zone, and name it Experimental.
Notice that the dataset shared to the Experimental zone is light blue. If it had been moved instead of shared, it would be the normal shade of blue and would have brought its parent recipe with it.
Try this out for yourself by deleting the Experimental zone and moving, instead of sharing, the dataset merchant_census_tracts_joined.
Other Flow views¶
We have seen Views such as tags and Flow zones, but Dataiku offers many other informative views, such as connections, recipe engines, and code environments.
Dataiku Cloud users will see different specific results than the images below, but the intention is the same.
Let’s have a look at the connections used in this project.
From the “View” menu in the lower left corner of the Flow, select Connections.
Since this Flow leverages only a single connection (here a PostgreSQL database; in yours, possibly the managed filesystem), this view does not add much information. In production, though, different datasets may be stored in different connections, and the Connections view would then provide an overview of where each dataset is stored.
Recipe engines view¶
Another informative view is the Recipe engines view. Since we changed the connection to an SQL database, we expect the recipes to leverage the SQL engine. Let’s check if that is the case.
First expand all Flow Zones by right-clicking the header of any zone, and selecting Expand all.
Then from the View menu, click Recipe engines, and select only the checkbox for the “Sql” engine.
We can see that most recipes that have SQL datasets as input and output leverage the SQL engine.
The reason this is not true for some Prepare and Window recipes varies:
Some Prepare recipes use a non-SQL compatible processor. (See Details on the in-database (SQL) engine to learn more).
One Prepare recipe appears to have a possible storage type-casting issue that could be investigated.
Some Window recipes limit the window frame to a time interval and so, unlike the other Window recipes, cannot use the SQL engine.
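The eligibility rules listed above can be approximated as a simple predicate. The sketch below is illustrative only (the field names are hypothetical; this is not Dataiku’s actual engine-selection logic):

```python
# Sketch of the SQL-engine eligibility rules described above (illustrative,
# not Dataiku's engine-selection code). A recipe can run in-database only if
# its data lives in SQL and every operation translates to SQL.

def can_use_sql_engine(recipe):
    # All inputs and outputs must live in an SQL connection
    if not all(ds["sql"] for ds in recipe["inputs"] + recipe["outputs"]):
        return False
    # Every Prepare processor must be translatable to SQL
    if any(not step["sql_compatible"] for step in recipe.get("steps", [])):
        return False
    # Window frames limited to a time interval block the SQL engine here
    if recipe.get("window_frame") == "time_interval":
        return False
    return True

window_recipe = {
    "inputs": [{"sql": True}], "outputs": [{"sql": True}],
    "window_frame": "time_interval",
}
print(can_use_sql_engine(window_recipe))  # False: time-interval frame blocks SQL
```

A check like this explains the pattern in the Recipe engines view: most recipes on SQL datasets run in-database, while a few fall back for processor- or frame-specific reasons.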
On your own, try out some of the other available views to see what value they can bring to managing complicated Flows.
Great job! Now you have some hands-on experience working with Tags, Flow Zones, and some of the other available Views.
If you have not already done so, register for the Academy course on Flow Views & Actions to validate your knowledge of this material.