Reference | Relocatable datasets#
What relocatable means#
When creating a new dataset, Dataiku uses the settings of the chosen connection to determine where to create it and under which table or name.
Dataiku Cloud makes managed datasets relocatable by default, which is good practice. In short, if a user creates another dataset within the same connection, the two datasets won’t overlap, which avoids conflicts.
Otherwise, conflicts could arise when:
Creating two datasets with the same name in different projects.
Duplicating a project in a Dataiku instance.
Publishing a project in the Automation node.
How Dataiku Cloud makes managed datasets relocatable#
To avoid overlapping, it’s a good practice to use variables in the connection settings for creating new datasets. Even if a user makes two datasets with the same name in the same connection, the variable ensures they will be different.
By default, Dataiku Cloud adds the variables ${projectKey} and ${node} to ensure the datasets are relocatable. Your connection will make all datasets it contains relocatable by default unless you change those settings.
These variables are included in the following dataset fields:
For SQL databases: Table prefix
For filesystem connections: Path prefix
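The effect of these variables can be illustrated with a minimal Python sketch. It isn't Dataiku's implementation: the prefix format `${projectKey}_${node}_`, the node values `design` and `automation`, and the helper `resolve_table_name` are all assumptions made for illustration. The real prefix is whatever appears in your connection settings.

```python
from string import Template

# Hypothetical table prefix; the actual value is defined in the connection settings.
TABLE_PREFIX = Template("${projectKey}_${node}_")


def resolve_table_name(dataset_name, project_key, node):
    """Expand the prefix variables, then append the dataset name."""
    prefix = TABLE_PREFIX.substitute(projectKey=project_key, node=node)
    return prefix + dataset_name


# Same dataset name, same connection, but different projects or nodes.
# The project keys and node values below are made up for the example.
design_sales = resolve_table_name("customers", "SALES", "design")
design_marketing = resolve_table_name("customers", "MARKETING", "design")
automation_sales = resolve_table_name("customers", "SALES", "automation")

print(design_sales)
print(design_marketing)
print(automation_sales)
```

Because the project key and the node are part of the resolved name, the three datasets end up in three different tables even though they share a name and a connection.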
Limits and exceptions#
If your connection was created before the ${node} variable mechanism was implemented, its datasets won’t be relocatable when transferring them between the Design and the Automation node. Therefore, when publishing a project in the Automation node without remapping the connections (that is, all datasets use the same connection in both nodes), the two projects will write to the same datasets and cause conflicts.
In the same way, if you edit the table in the settings of a dataset, the connection or the dataset will no longer be relocatable.
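The first limit can be sketched in Python as follows. This is an illustration only: the prefix formats, the node values, and the helper `resolve` are assumptions, not the actual settings of any connection.

```python
from string import Template


def resolve(prefix_template, dataset_name, **variables):
    """Expand the variables found in the prefix, then append the dataset name."""
    return Template(prefix_template).substitute(**variables) + dataset_name


# Hypothetical prefixes: one without ${node}, one with it.
legacy_prefix = "${projectKey}_"
current_prefix = "${projectKey}_${node}_"

# Without ${node}, the Design and Automation copies of a project
# resolve to the same table when they share a connection.
legacy_design = resolve(legacy_prefix, "customers", projectKey="SALES", node="design")
legacy_automation = resolve(legacy_prefix, "customers", projectKey="SALES", node="automation")

# With ${node}, each node gets its own table.
current_design = resolve(current_prefix, "customers", projectKey="SALES", node="design")
current_automation = resolve(current_prefix, "customers", projectKey="SALES", node="automation")

print(legacy_design == legacy_automation)
print(current_design == current_automation)
```

The first comparison is true (both nodes write to the same table) and the second is false, which is why remapping connections matters for connections that lack the ${node} variable.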
How to check if a connection makes managed datasets relocatable?#
If your connection was created after January 6, 2023, or is a Redshift, BigQuery, or managed Snowflake connection, Dataiku Cloud makes datasets relocatable by default. No action is required.
If you have any doubt about whether your connection makes datasets relocatable, you can ask Support to verify that it uses the variables ${projectKey} and ${node} in its table prefix or path prefix setting.
You can also open the settings tab of a dataset created in the connection to verify that the dataset has been created with the variables ${projectKey} and ${node} in:
The path in the bucket for filesystem connections
The table in SQL connections
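If you want to check several datasets, the manual verification above can be approximated with a small Python helper. This is a sketch under assumptions: `missing_variables` is a hypothetical function, and the sample values stand in for a table name or path copied by hand from a dataset's settings. It doesn't call any Dataiku API.

```python
import re

# The two variables this article says should appear in the table or path.
REQUIRED_VARIABLES = ("projectKey", "node")


def missing_variables(setting_value):
    """Return the required variables that do not appear as ${name} in the value."""
    found = set(re.findall(r"\$\{(\w+)\}", setting_value))
    return [name for name in REQUIRED_VARIABLES if name not in found]


# Hypothetical values copied from dataset settings.
relocatable = missing_variables("${projectKey}_${node}_customers")
no_node = missing_variables("${projectKey}_customers")
hard_coded = missing_variables("/data/customers")

print(relocatable)
print(no_node)
print(hard_coded)
```

An empty list means both variables are present. A non-empty list names the variables that are missing, which suggests the dataset is not relocatable between projects, between nodes, or both.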
What if my connection doesn’t support relocatable datasets between the Design and Automation nodes?#
If your connection was created before January 6, 2023, and isn’t a Redshift, BigQuery, or a managed Snowflake connection, then the datasets contained in that connection aren’t relocatable between the Design and Automation nodes.
We strongly recommend using different connections between the Design and the Automation nodes in this case. Having two separate schemas or buckets will ensure that the datasets aren’t created in the same place and won’t overlap.
To do that:
When creating or editing a connection in the Launchpad, the form allows you to specify if you want the connection on the Design node, the Automation node, or both.
Remap the connections when deploying a project to the Automation node: in the Deployer, go to Deployments, select the deployment, then open Settings > Connections.
You will then need to update the deployment and execute your Flow in the Automation node to create the datasets in the new connection.