Reference | Relocatable datasets#

What relocatable means#

When you create a new managed dataset, Dataiku uses the settings of the chosen connection to determine where the dataset is stored and under which table name or path.

Dataiku Cloud makes managed datasets relocatable by default, as this is good practice. In short, if a user creates another dataset within the same connection, the two datasets will not overlap, which avoids potential conflicts.

Otherwise, conflicts could arise when:

  • Creating two datasets with the same name in different projects.

  • Duplicating a project in a Dataiku instance.

  • Publishing a project to the Automation node.

How Dataiku Cloud makes managed datasets relocatable#

To avoid overlapping, it is a good practice to use variables in the connection settings for creating new datasets. Even if a user makes two datasets with the same name in the same connection, the variable ensures they will be different.

By default, Dataiku Cloud adds the variables ${projectKey} and ${node} to ensure the datasets are relocatable. Your connection will make all datasets it contains relocatable by default unless you change those settings.

These variables are included in the following dataset fields:

  • For SQL databases: Table prefix

  • For filesystem connections: Path prefix
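The effect of these variables can be illustrated with a minimal Python sketch. It is not Dataiku code: the prefix pattern `${projectKey}_${node}_`, the node values `design` and `automation`, and the helper `resolve_table_name` are assumptions chosen for illustration, and the actual default pattern on your connection may differ.

```python
from string import Template

# Hypothetical prefix pattern; the real default on a Dataiku Cloud
# connection may use a different separator or ordering.
TABLE_PREFIX = Template("${projectKey}_${node}_")


def resolve_table_name(project_key: str, node: str, dataset_name: str) -> str:
    """Expand the prefix variables, then append the dataset name."""
    prefix = TABLE_PREFIX.substitute(projectKey=project_key, node=node)
    return prefix + dataset_name


# The same dataset name in the same connection, created from
# different projects or nodes (all values are illustrative).
names = {
    resolve_table_name("SALES", "design", "customers"),
    resolve_table_name("MARKETING", "design", "customers"),
    resolve_table_name("SALES", "automation", "customers"),
}
```

Because each combination of project key and node expands to a different prefix, the three datasets resolve to three distinct table names even though they share the name `customers`.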

Limits and exceptions#

If your connection was created before the ${node} variable mechanism was implemented, its datasets will not be relocatable when transferred between the Design and Automation nodes. As a result, if you publish a project to the Automation node without remapping the connections (i.e., all datasets use the same connection on both nodes), the two projects will write to the same datasets and cause conflicts.
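The following sketch shows why a missing ${node} variable leads to a collision. The two prefix patterns and the node values are hypothetical and only illustrate the mechanism; they are not taken from an actual connection.

```python
from string import Template


def resolve(prefix_pattern: str, project_key: str, node: str, dataset: str) -> str:
    """Expand a prefix pattern; variables the pattern does not use are ignored."""
    prefix = Template(prefix_pattern).substitute(projectKey=project_key, node=node)
    return prefix + dataset


# Hypothetical patterns: an older connection without ${node},
# and a newer one that includes it.
OLD_PATTERN = "${projectKey}_"
NEW_PATTERN = "${projectKey}_${node}_"

old_design = resolve(OLD_PATTERN, "SALES", "design", "customers")
old_automation = resolve(OLD_PATTERN, "SALES", "automation", "customers")

new_design = resolve(NEW_PATTERN, "SALES", "design", "customers")
new_automation = resolve(NEW_PATTERN, "SALES", "automation", "customers")

# Without ${node}, both nodes resolve to the same table.
collides_without_node = old_design == old_automation
# With ${node}, each node gets its own table.
collides_with_node = new_design == new_automation
```

With the older pattern, the Design and Automation projects resolve to the same table and overwrite each other; with ${node} in the prefix, they do not.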

In the same way, if you manually edit the table in the settings of a dataset, that dataset will no longer be relocatable.

How can I check if my connection makes managed datasets relocatable?#

If your connection was created after January 6, 2023, or is a Redshift, BigQuery, or managed Snowflake connection, Dataiku Cloud makes its datasets relocatable by default, and no action is needed.
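This rule can be written as a small check. The sketch below is illustrative only: the connection type labels are hypothetical strings, not values returned by any Dataiku API, and the text does not say how a connection created exactly on January 6, 2023 is treated, so the sketch counts only strictly later dates.

```python
from datetime import date

CUTOFF = date(2023, 1, 6)

# Hypothetical labels for the connection types named in the text.
ALWAYS_RELOCATABLE_TYPES = {"redshift", "bigquery", "managed_snowflake"}


def relocatable_by_default(created_on: date, connection_type: str) -> bool:
    """Return True if the rule above says datasets are relocatable by default."""
    created_after_cutoff = created_on > CUTOFF
    type_is_covered = connection_type.lower() in ALWAYS_RELOCATABLE_TYPES
    return created_after_cutoff or type_is_covered
```

For example, `relocatable_by_default(date(2022, 5, 1), "bigquery")` is true because of the connection type, while the same date with a type outside that set is false.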

If you are unsure whether your connection makes datasets relocatable, you can ask Support to verify that its table prefix or path prefix setting uses the ${projectKey} and ${node} variables.

You can also open the Settings tab of a dataset created in that connection and verify that the dataset was created with the variables ${projectKey} and ${node} in:

  • The path in the bucket for filesystem connections

  • The table in SQL connections
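If you copy the table name or path from the dataset settings, a check like the following minimal sketch can confirm that both variables are present. The helper `missing_variables` and the sample values are hypothetical; this is plain string matching and does not call any Dataiku API.

```python
REQUIRED_VARIABLES = ("${projectKey}", "${node}")


def missing_variables(setting_value: str) -> list:
    """Return the required variables that do not appear in the setting."""
    return [var for var in REQUIRED_VARIABLES if var not in setting_value]


# Illustrative values as they might appear in a dataset's settings.
sql_table = "${projectKey}_${node}_customers"
old_path = "/${projectKey}/customers"

sql_missing = missing_variables(sql_table)
path_missing = missing_variables(old_path)
```

An empty result means both variables are present. Here `old_path` lacks ${node}, which matches the older connection case described in the limits above.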

What if my connection does not support relocatable datasets between the Design and Automation nodes?#

If your connection was created before January 6, 2023, and is not a Redshift, BigQuery, or managed Snowflake connection, then the datasets in that connection are not relocatable between the Design and Automation nodes.

In this case, we strongly recommend using different connections on the Design and Automation nodes. Having two separate schemas or buckets ensures that the datasets are not created in the same place and do not overlap.

To do that:

  1. In the Launchpad, create or edit your connections. The form lets you specify whether each connection is used on the Design node, the Automation node, or both.

  2. When deploying a project to the Automation node, remap the connections in the Deployer > Deployments > Select the Deployment > Settings > Connections.

  3. Update the deployment, then run your Flow on the Automation node to create the datasets in the new connection.