Data Connections on Dataiku Cloud

Dataiku supports many ways to connect to data sources. Explore resources here for connecting to data sources specifically using Dataiku Cloud.

Reference | Supported data connections

Dataiku Cloud allows you to connect to multiple sources of data as read-only sources or read-and-write storage.

Note

A read-only data source tells Dataiku how to access data stored externally. Dataiku remembers the location of the original source datasets, and because the connection is read-only, no data is stored or modified in the original system. You typically use these datasets as the entry point (leftmost part) of your Flow.

A read-and-write data storage lets Dataiku not only read the data, but also create new datasets (write) and, for SQL data storage, perform in-database computation, which improves performance.

From Dataiku Cloud, you can connect to the following:

Type                    Read Only Data Sources    Read / Write Data Storage
Snowflake               X                         X
Azure Synapse           X                         X
Google BigQuery         X                         X
Amazon Redshift         X                         X
PostgreSQL              X                         X
Oracle                  X                         X
SQL Server              X                         X
MySQL                   X                         X
Amazon S3               X                         X
Azure Blob Storage      X                         X
Google Cloud Storage    X                         X
MongoDB                 X

With data connector plugins, you can also connect to the following: Salesforce, Zendesk, and Google Sheets.

Note

Depending on your subscription plan, not all connectors may be available.

How-to | Add a new data connection

  1. From the Launchpad of your space, navigate to the Connections panel.

  2. Click the Add a Connection button.

  3. Choose your connection type from the Read Only Data Sources or Read/Write Data Storage sections.

  4. Fill in the connection details, and then click Test.

  5. Once the test succeeds, add the connection. You will get a confirmation message, along with a message listing the IP addresses you might need to whitelist to allow the connection.

Reference | Relocatable datasets

What relocatable means

When creating a new dataset, Dataiku uses the settings of the chosen connection to determine where the dataset is created and under which table or name.

Dataiku Cloud makes managed datasets relocatable by default, as this is good practice. In short, if a user creates another dataset within the same connection, the two datasets will not overlap, which avoids potential conflicts.

Otherwise, conflicts could arise when:

  • Creating two datasets with the same name in different projects,

  • Duplicating a project in a Dataiku instance,

  • Publishing a project in the Automation node.

How Dataiku Cloud makes managed datasets relocatable

To avoid overlapping, it is a good practice to use variables in the connection settings for creating new datasets. Even if a user makes two datasets with the same name in the same connection, the variable ensures they will be different.

By default, Dataiku Cloud adds the variables ${projectKey} and ${node} to ensure the datasets are relocatable. Your connection will make all datasets it contains relocatable by default unless you change those settings.

These variables are included in the following dataset fields:

  • For SQL databases: Table prefix

  • For filesystem connections: Path prefix
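
To illustrate why these variables prevent overlap, here is a minimal Python sketch of the substitution. It is not Dataiku's implementation: the prefix format `${projectKey}_${node}_`, the node values `design` and `automation`, and the helper names `expand_variables` and `table_name` are all assumptions made for the example. The actual default prefix on your connection may differ.

```python
import re

def expand_variables(template: str, variables: dict) -> str:
    """Replace each ${name} placeholder with its value from `variables`."""
    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"undefined variable: {name}")
        return variables[name]
    return re.sub(r"\$\{(\w+)\}", substitute, template)

# Hypothetical table prefix; the real default used by Dataiku Cloud may differ.
TABLE_PREFIX = "${projectKey}_${node}_"

def table_name(dataset_name: str, project_key: str, node: str) -> str:
    """Build the physical table name for a dataset (illustrative only)."""
    prefix = expand_variables(TABLE_PREFIX, {"projectKey": project_key, "node": node})
    return prefix + dataset_name

# The same dataset name in different projects or nodes maps to different tables.
design = table_name("customers", "SALES", "design")
automation = table_name("customers", "SALES", "automation")
other_project = table_name("customers", "MARKETING", "design")

print(design)         # SALES_design_customers
print(automation)     # SALES_automation_customers
print(other_project)  # MARKETING_design_customers
```

Because the project key and the node are part of the prefix, a duplicated project or a project published to the Automation node resolves to its own tables rather than reusing the originals.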

Limits and exceptions

If your connection was created before the ${node} variable mechanism was implemented, its datasets will not be relocatable when transferring them between the Design and the Automation node. Therefore, when publishing a project in the Automation node without remapping the connections (i.e., all datasets use the same connection in both nodes), the two projects will write to the same datasets and cause conflicts.

In the same way, if you edit the table in the settings of a dataset, the connection or the dataset will no longer be relocatable.

FAQ | How can I check if my connection makes managed datasets relocatable?

If your connection was created after January 6, 2023, or is a Redshift, BigQuery, or managed Snowflake connection, Dataiku Cloud makes datasets relocatable by default; no action is needed on your part.
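
The rule above can be written as a small decision function. This is only a sketch of the rule as stated here, not an official check: the type labels (`redshift`, `bigquery`, `managed-snowflake`) and the function name are hypothetical, and the behavior for a connection created exactly on January 6, 2023 is not specified by the text, so the sketch treats that day as "not after".

```python
from datetime import date

# Cutoff date taken from the text above.
CUTOFF = date(2023, 1, 6)

# Hypothetical labels for the connection types that are always relocatable.
ALWAYS_RELOCATABLE_TYPES = {"redshift", "bigquery", "managed-snowflake"}

def relocatable_by_default(created_on: date, connection_type: str) -> bool:
    """Return True if the connection falls under the rule described above."""
    created_after_cutoff = created_on > CUTOFF
    exempt_type = connection_type.lower() in ALWAYS_RELOCATABLE_TYPES
    return created_after_cutoff or exempt_type

print(relocatable_by_default(date(2023, 2, 1), "postgresql"))  # True
print(relocatable_by_default(date(2022, 5, 1), "postgresql"))  # False
print(relocatable_by_default(date(2021, 1, 1), "redshift"))    # True
```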

If you have any doubt about whether your connection makes datasets relocatable, as described above, you can ask Support to verify that it uses the above-mentioned variables (table prefix or path prefix) in the settings.

You can also open the Settings tab of a dataset created in the connection to verify that the dataset has been created with the variables ${projectKey} and ${node} in:

  • the path in the bucket for filesystem connections,

  • the table in SQL connections.
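
If you copy the table name or path from the dataset settings, a check along the following lines can confirm that both variables are present. This is a minimal sketch on a plain string: the function name `missing_variables` and the sample values are hypothetical, and it does not call any Dataiku API.

```python
import re

# Variables that should appear in the table name or path, per the text above.
REQUIRED_VARIABLES = ("projectKey", "node")

def missing_variables(setting_value: str) -> list:
    """Return the required variables that do not appear as ${name} in the value."""
    found = set(re.findall(r"\$\{(\w+)\}", setting_value))
    return [name for name in REQUIRED_VARIABLES if name not in found]

# Sample values are made up for illustration.
print(missing_variables("${projectKey}_${node}_customers"))  # []
print(missing_variables("${projectKey}/customers"))          # ['node']
print(missing_variables("customers"))                        # ['projectKey', 'node']
```

An empty list means both variables are present; any name returned points to a variable missing from the table or path setting.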

FAQ | What if my connection does not support relocatable datasets between the Design and Automation nodes?

If your connection was created before January 6, 2023, and is not a Redshift, BigQuery, or a managed Snowflake connection, then the datasets contained in that connection are not relocatable between the Design and Automation node.

We strongly recommend using different connections between the Design and the Automation nodes in this case. Having two separate schemas or buckets will ensure that the datasets are not created in the same place and will not overlap.

To do that:

  1. When creating or editing a connection in the Launchpad, use the form to specify whether the connection should be used on the Design node, the Automation node, or both.

  2. When deploying a project to the Automation node, remap the connections in Deployer > Deployments > Select the Deployment > Settings > Connections.

  3. Update the deployment and run your Flow in the Automation node to create the datasets in the new connection.