Data Connections on Dataiku Cloud¶
Dataiku supports many ways to connect to data sources. Explore resources here for connecting to data sources specifically using Dataiku Cloud.
Reference | Supported data connections¶
Dataiku Cloud allows you to connect to multiple sources of data as read-only sources or read-and-write storage.
Note
A read-only data source tells Dataiku how to access data stored externally. Dataiku only remembers the location of the original source datasets; no data is stored or modified in the original system. You typically use these datasets as the entry point (leftmost part) of your Flow.
A read-and-write data storage lets Dataiku not only read the data, but also create new datasets (write) and, for SQL data storage, perform in-database computation, which improves performance.
From Dataiku Cloud, you can connect to the following sources:
| Type | Read Only Data Sources | Read / Write Data Storage |
|---|---|---|
| Snowflake | X | X |
| Azure Synapse | X | X |
| Google BigQuery | X | X |
| Amazon Redshift | X | X |
| PostgreSQL | X | X |
| Oracle | X | X |
| SQL Server | X | X |
| MySQL | X | X |
| Amazon S3 | X | X |
| Azure Blob Storage | X | X |
| Google Cloud Storage | X | X |
| Databricks | X | X |
| Athena | X | X |
| MongoDB | X | |
With data connector plugins, you can also connect to the following: Salesforce, Zendesk, and Google Sheets.
Note
Depending on your subscription plan, not all connectors may be available.
How-to | Add a new data connection¶
From the Launchpad of your space, navigate to the Connections panel.
Click on the button Add a Connection.
Choose your connection type from the Read Only Data Sources or Read/Write Data Storage sections.
Fill in the connection details, and then click on Test.
Once the test succeeds, you can add the connection. You will get a confirmation message, as well as a message listing the IP addresses you might need to allowlist to permit the connection.
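If you use the Dataiku public API, you can also confirm from Python that the new connection is visible on your instance. The sketch below is illustrative only: the space URL and API key are placeholders, and listing connections requires an API key with sufficient (admin) privileges.

import dataikuapi

# Placeholders: replace with your space URL and an API key from your instance.
host = "https://your-space.app.dataiku.io"
api_key = "YOUR_API_KEY"

client = dataikuapi.DSSClient(host, api_key)

# list_connections() returns a dict keyed by connection name; the connection
# you just added in the Launchpad should appear in this list.
for name, definition in client.list_connections().items():
    print(name, "-", definition.get("type"))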

Reference | Protection of data sources¶
Restrict access to Dataiku Cloud IP addresses¶
Dataiku Cloud always connects to data sources with fixed IP addresses.
To protect access, you can configure an allow list in your data source firewall. Make sure to allow both IP addresses and add them to any database grant.
The IP addresses depend on your instance’s AWS region and are listed in the Launchpad connection forms.
Note
Do not hesitate to contact us if you need assistance.
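As an illustration, if your data source runs on AWS behind a security group, you could open its port to the two Dataiku Cloud IP addresses with a short boto3 call. This is only a sketch under assumptions: the security group ID, database port, and IP addresses below are placeholders; use the addresses shown in your Launchpad connection form.

import boto3

# Placeholders: replace with your security group, your database port, and the
# Dataiku Cloud IP addresses listed in the Launchpad connection form.
SECURITY_GROUP_ID = "sg-0123456789abcdef0"
DB_PORT = 5432
DATAIKU_CLOUD_IPS = ["203.0.113.10/32", "203.0.113.11/32"]

ec2 = boto3.client("ec2")
ec2.authorize_security_group_ingress(
    GroupId=SECURITY_GROUP_ID,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": DB_PORT,
        "ToPort": DB_PORT,
        "IpRanges": [{"CidrIp": ip, "Description": "Dataiku Cloud"} for ip in DATAIKU_CLOUD_IPS],
    }],
)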
Access data sources through a VPN server¶
VPN is a feature of the Dataiku Cloud Enterprise edition.
You can configure a VPN tunnel to access private data sources using the OpenVPN protocol. The private VPN server remains under your control and is connected to your data sources; Dataiku uses the VPN client to reach them.
Requirements:
Dataiku Cloud only supports OpenVPN servers,
The private subnet exposed by your OpenVPN server should not be in the following CIDR ranges: 10.0.0.0/16, 10.1.0.0/16, and 172.20.0.0/16
To configure it, you can go to Launchpad’s Extensions panel and add the VPN extension. You will have to fill in the following:
IP address of your OpenVPN server,
An OpenVPN configuration file for clients (a minimal sketch follows).
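For reference, a minimal OpenVPN client configuration file might look like the sketch below. This is an illustration only: the file must be generated for your own server, and the server address, port, cipher, and inline certificates are placeholders.

client
dev tun
proto udp
# Placeholder: replace with your OpenVPN server address and port
remote vpn.example.com 1194
nobind
persist-key
persist-tun
remote-cert-tls server
cipher AES-256-GCM
verb 3
# Inline <ca>, <cert>, and <key> blocks (or references to those files) go here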
When this extension is activated, all outgoing traffic from Dataiku goes through the VPN. Ensure that all your private data sources are reachable through the VPN and that your VPN can route traffic to the internet so your instance can function properly.
Note
Private DNS is not currently available. Only data sources with private addresses or public DNS names can be reached.
To enable VPN tunneling, the Dataiku instance needs to be restarted. This operation can take up to 15 minutes.
Access your Amazon S3 through AWS PrivateLink¶
AWS PrivateLink is a feature of the Dataiku Cloud Enterprise edition.
AWS PrivateLink provides private connectivity between your Dataiku instance and supported AWS services without exposing your traffic to the public internet.
Once activated, Dataiku Cloud will only connect to your S3 using one virtual private cloud (VPC) endpoint.
To configure it:
First, contact our support team so we can provide you with the endpoint to use. You will need to know the AWS region of your S3.
Add or edit an S3 connection in the Launchpad’s connection tab, activate the “Path mode”, and fill in the “Region / Endpoint” field with the value we gave you.
Ensure your S3 policy authorizes access to the endpoint.
Note
Athena’s Glue feature will not work with S3 connections using AWS PrivateLink.
An example of an S3 bucket policy configured to accept requests only from a VPC endpoint:
{
  "Version": "2012-10-17",
  "Id": "Policy1415115909152",
  "Statement": [
    {
      "Sid": "Access-to-specific-VPCE-only",
      "Principal": {"AWS": "ARN-OF-IAM-USER-ASSUMED-BY-DATAIKU"},
      "Action": "s3:*",
      "Effect": "Deny",
      "Resource": [
        "S3-BUCKET-ARN",
        "S3-BUCKET-ARN/*"
      ],
      "Condition": {"StringNotEquals": {"aws:sourceVpce": "VPCE-ID"}}
    }
  ]
}
Reference | Relocatable datasets¶
What relocatable means¶
When creating new datasets, Dataiku takes the settings of the chosen connection to determine where it will be created and under which table or name.
Dataiku Cloud makes managed datasets relocatable by default, as it is good practice. In short, this means that datasets created in the same connection, even from different projects, will not overlap, thus avoiding potential conflicts.
Otherwise, conflicts could arise when:
Creating two datasets with the same name in different projects,
Duplicating a project in a Dataiku instance,
Publishing a project in the Automation node.
How Dataiku Cloud makes managed datasets relocatable¶
To avoid overlaps, it is good practice to use variables in the connection settings used for creating new datasets. Even if a user creates two datasets with the same name in the same connection (for example, in two different projects), the variables ensure they resolve to different tables or paths (a small illustration follows the list below).
By default, Dataiku Cloud adds the variables ${projectKey} and ${node} to ensure the datasets are relocatable. Unless you change those settings, your connection makes all the datasets it contains relocatable by default.
These variables are included in the following dataset fields:
For SQL databases: Table prefix
For filesystem connections: Path prefix
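As a purely illustrative example, the snippet below mimics how such variable substitution keeps names distinct. The prefix template ${projectKey}_${node}_ is hypothetical; the exact prefix format of your connection may differ.

# Illustrative only: why ${projectKey} and ${node} variables prevent name collisions.
def resolve_table_name(prefix_template, dataset_name, project_key, node):
    prefix = (prefix_template
              .replace("${projectKey}", project_key)
              .replace("${node}", node))
    return prefix + dataset_name

# The same dataset name in two different projects resolves to two different tables.
print(resolve_table_name("${projectKey}_${node}_", "orders", "SALES", "design"))  # SALES_design_orders
print(resolve_table_name("${projectKey}_${node}_", "orders", "CHURN", "design"))  # CHURN_design_orders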
Limits and exceptions¶
If your connection was created before the ${node} variable mechanism was implemented, its datasets will not be relocatable when transferring them between the Design and Automation nodes. Therefore, when publishing a project to the Automation node without remapping the connections (i.e., all datasets use the same connection on both nodes), the two projects will write to the same datasets and cause conflicts.
In the same way, if you edit the table in the settings of a dataset, that dataset will no longer be relocatable.
FAQ | How can I check if my connection makes managed datasets relocatable?¶
If your connection was created after January 6, 2023, or is a Redshift, BigQuery, or managed Snowflake connection, Dataiku Cloud makes datasets relocatable by default; there is nothing you need to do.
If you have any doubt about whether your connection makes datasets relocatable, you can ask Support to verify that it uses the above-mentioned variables in its settings (table prefix or path prefix).
You can also open the Settings tab of a dataset created in the connection and verify that it was created with the variables ${projectKey} and ${node} in:
the path in the bucket for filesystem connections,
the table for SQL connections.
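You can also run this check programmatically with the Dataiku Python API client. The sketch below is illustrative only: the space URL, API key, project key, and dataset name are placeholders, and it assumes an SQL dataset whose table name is stored in its params.

import dataikuapi

# Placeholders: replace with your space URL, API key, project key, and dataset name.
client = dataikuapi.DSSClient("https://your-space.app.dataiku.io", "YOUR_API_KEY")
dataset = client.get_project("MY_PROJECT").get_dataset("my_dataset")

# For an SQL dataset, the table name is stored in the settings under params["table"].
params = dataset.get_settings().get_raw()["params"]
table = params.get("table", "")
print(table)
print("Relocatable variables present:", "${projectKey}" in table and "${node}" in table)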
FAQ | What if my connection does not support relocatable datasets between the Design and Automation nodes?¶
If your connection was created before January 6, 2023, and is not a Redshift, BigQuery, or a managed Snowflake connection, then the datasets contained in that connection are not relocatable between the Design and Automation node.
We strongly recommend using different connections between the Design and the Automation nodes in this case. Having two separate schemas or buckets will ensure that the datasets are not created in the same place and will not overlap.
To do that:
When creating or editing a connection in the Launchpad, the form allows you to specify whether the connection should be used on the Design node, the Automation node, or both.
Remap the connections when deploying a project to the Automation node in Deployer > Deployments > (select the deployment) > Settings > Connections.
You will then need to update the deployment and execute your Flow in the Automation node to create the datasets in the new connection.