Concept: Connection Changes & Flow Item Reuse
In this lesson, you’ll learn to perform advanced Flow actions, such as:
changing dataset connections,
reusing Flow items, and
hiding parts of your Flow.
This content is also included in the free Dataiku Academy course Flow Views & Actions, which is part of the Advanced Designer learning path. Register for that course if you’d like to track and validate your progress alongside concept videos, summaries, hands-on tutorials, and quizzes.
Suppose your Flow consists of datasets stored in a filesystem, but you want to store them in another location, such as a SQL database or HDFS.
Dataiku DSS allows you to change the connection of multiple datasets at the same time, either from the Datasets page or directly from the Flow. The options available for the new connection depend on the connections that have already been set up on the instance.
When changing connections, you also have the option to “Drop data” from the original storage location, which prevents unused datasets from taking up storage space.
Further, you have the option to “reuse connection settings if possible”, which reuses the file format settings that were previously set up in the Format/Preview page of the dataset.
In addition, Dataiku DSS warns you that “Changing dataset connections can break the computations or lead to different results”.
This situation can happen, for instance, if you try to store a dataset with an ‘array’ type column in a PostgreSQL database. The connection change itself is saved successfully, but you get an error message when you try to build the dataset, because the array column cannot be stored in the SQL table.
When you change a dataset’s connection, only its schema is transferred, not its data. After the change, the dataset is empty and needs to be rebuilt.
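Before changing a connection, it can help to look for column types that may not survive the move. The snippet below is a minimal sketch of such a check, not a DSS feature. It assumes the schema is available as a dictionary with a "columns" list of name/type entries, and both the function name `find_risky_columns` and the set of "risky" types are hypothetical choices made for illustration.

```python
# Hypothetical pre-check: flag columns whose type may not be storable
# in a SQL table. The type list is an assumption, not DSS's own rule.
RISKY_TYPES_FOR_SQL = {"array", "map", "object"}


def find_risky_columns(schema, risky_types=RISKY_TYPES_FOR_SQL):
    """Return the names of columns whose type is in risky_types."""
    return [
        col["name"]
        for col in schema["columns"]
        if col["type"] in risky_types
    ]


# Illustrative schema (made-up column names).
schema = {
    "columns": [
        {"name": "customer_id", "type": "string"},
        {"name": "tags", "type": "array"},
        {"name": "total", "type": "double"},
    ]
}

risky = find_risky_columns(schema)
print(risky)  # ['tags']
```

Any column reported here is a candidate for conversion (for example, to a string) before you rebuild the dataset on the new connection.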
You can repeat the previous steps to change the connection of the datasets back to a file system, or to a different database, as needed.
Flow Item Reuse
Dataiku DSS also allows you to reuse existing recipes in your Flow.
Imagine a visual recipe that takes one input dataset. If you have another dataset to which you want to apply the same recipe, you can copy the existing recipe and point the copy at the new dataset. Running the duplicate recipe then builds its output.
You can also duplicate a code recipe. When you duplicate a code recipe, your input and output settings are automatically reflected in the Input/Output section of the recipe. But keep in mind that you must still manually change the input and output dataset names inside your code.
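The manual renaming step can be illustrated with a small helper that rewrites quoted dataset names in a recipe’s source code. This is only a sketch under stated assumptions: `rename_datasets_in_code` is a hypothetical function, not part of DSS, and the recipe text and dataset names are made up. It replaces a name only when it appears as a complete quoted string.

```python
import re


def rename_datasets_in_code(code, renames):
    """Replace quoted dataset names in recipe source code.

    Only exact names inside matching single or double quotes are replaced,
    so variable names that merely contain a dataset name are left alone.
    """
    for old, new in renames.items():
        pattern = r"(['\"])" + re.escape(old) + r"\1"
        code = re.sub(
            pattern,
            lambda m, new=new: m.group(1) + new + m.group(1),
            code,
        )
    return code


# Source of a duplicated code recipe, held as plain text (illustrative).
original = (
    'orders = dataiku.Dataset("orders")\n'
    'orders_df = orders.get_dataframe()\n'
    'out = dataiku.Dataset("orders_prepared")\n'
    'out.write_with_schema(orders_df)\n'
)

updated = rename_datasets_in_code(
    original,
    {"orders": "returns", "orders_prepared": "returns_prepared"},
)
print(updated)
```

The helper leaves Python variable names such as `orders_df` untouched, so you would still review the duplicated recipe by hand before running it.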
If you want to copy multiple datasets or recipes at once, use the Copy subflow option, from the Flow Actions menu, and select the objects you want to add.
Selecting a recipe also selects its corresponding input dataset. The selected objects are color-coded, so you can easily identify which ones will be duplicated (the green objects) and which will be shared (the yellow objects).
If you want an object to be duplicated, not shared, select it and click Add.
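The duplicated-versus-shared rule can be pictured with a toy model. The sketch below is not how DSS decides internally; it assumes that a selected recipe is duplicated, that an input dataset pulled in with its recipe is shared by default, and that explicitly adding the dataset turns it into a duplicate. The function `classify_selection` and all object names are hypothetical.

```python
def classify_selection(recipe_inputs, selected_recipes, added_datasets):
    """Label each selected object as 'duplicated' or 'shared'.

    recipe_inputs: dict mapping recipe name -> list of input dataset names.
    selected_recipes: recipes chosen for the copy.
    added_datasets: datasets explicitly added to the selection.
    """
    labels = {}
    for recipe in selected_recipes:
        labels[recipe] = "duplicated"
        for dataset in recipe_inputs[recipe]:
            # An input comes along with its recipe; in this model it is
            # only duplicated when it was also added explicitly.
            if dataset in added_datasets:
                labels[dataset] = "duplicated"
            else:
                labels[dataset] = "shared"
    for dataset in added_datasets:
        labels[dataset] = "duplicated"
    return labels


# Illustrative Flow fragment (made-up names).
recipe_inputs = {
    "prepare_orders": ["orders"],
    "join_customers": ["orders_prepared", "customers"],
}

labels = classify_selection(
    recipe_inputs,
    selected_recipes=["join_customers"],
    added_datasets={"customers"},
)
print(labels)
```

In this example, `customers` was added explicitly and is labeled as duplicated, while `orders_prepared` was only pulled in as an input and is labeled as shared.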
Once you finish selecting objects, you can copy them to one of three destinations:
Choosing Existing project allows you to select a different available project to which you can copy the objects.
Choosing Current project copies the objects into the same project that you’re working in. In this case, you have to rename the objects to avoid duplicate names. You can also change the input datasets of the copied recipes if desired.
Finally, choosing Create project creates a new project to be used as the destination of the copied objects.
When you copy a recipe to an existing project or a new project, the recipe’s input dataset will be shared between the source and the destination projects.
In the destination project, this dataset is colored black. In the source project, the dataset has a curved arrow sign to indicate that it is shared with another project.
To stop sharing the dataset with the destination project, you can take action from the source project. This deletes the dataset from the Flow of the destination project.
Alternatively, you can stop sharing the dataset by taking action from the destination project. This doesn’t delete the dataset from the Flow of the destination project, but you will no longer be able to open it.
Hide/Show Flow Items
Finally, Dataiku DSS provides the means to hide or show parts of your Flow, as needed. This is especially useful when working with large Flows.
By right-clicking a dataset or a recipe, you can access options that include Hide all upstream and Hide all downstream. These options hide all the upstream or downstream objects connected to the selected object.
To show the hidden objects again, simply click the plus (+) sign.
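To make “upstream” and “downstream” concrete, the sketch below models a Flow as a list of directed edges and collects everything reachable from a chosen object in either direction. It is an illustration of the idea only, assuming a simple edge-list representation. The function `reachable` and the object names are hypothetical and do not reflect how DSS stores or hides Flow items.

```python
from collections import deque


def reachable(edges, start, direction="downstream"):
    """Return all nodes reachable from start, excluding start itself."""
    neighbors = {}
    for src, dst in edges:
        if direction == "downstream":
            neighbors.setdefault(src, []).append(dst)
        else:
            # Walk the edges backwards to find upstream objects.
            neighbors.setdefault(dst, []).append(src)

    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, []):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Illustrative Flow: datasets and recipes alternate along each path.
edges = [
    ("orders", "prepare_orders"),
    ("prepare_orders", "orders_prepared"),
    ("orders_prepared", "join_customers"),
    ("customers", "join_customers"),
    ("join_customers", "orders_joined"),
]

hidden_upstream = reachable(edges, "orders_prepared", "upstream")
hidden_downstream = reachable(edges, "orders_prepared", "downstream")
print(sorted(hidden_upstream))    # ['orders', 'prepare_orders']
print(sorted(hidden_downstream))  # ['join_customers', 'orders_joined']
```

Note that `customers` appears in neither set for `orders_prepared`: it feeds the same join recipe but is neither upstream nor downstream of the selected dataset.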
Congrats! Now you’ve seen how to change dataset connections, reuse Flow items, and hide or show parts of your Flow as needed.