Preferred Connections and Format for Dataset Storage

Introduction

In this article, you will learn about the default connection and format Dataiku DSS (DSS) uses to store datasets.

A dataset is the core object users manipulate in DSS. A dataset is a series of records with the same schema, analogous to a table in SQL. Datasets are stored in connections that relate to the underlying storage, such as SQL databases and object storage.

By default, DSS suggests a connection consistent with the input dataset when a user creates a new managed output dataset. Administrators can configure fallback connections and enforce a default connection. In addition, administrators can define the preferred file format to be used for new managed datasets.

Administrators can configure these settings globally and on a per-project basis.

Global Settings

Find the options to define a default connection for datasets in DSS under Administration > Settings > Engines & connections.

Here, you can:

  • set a preferred fallback connection for datasets,

  • force a connection (overriding the contextual connection DSS would otherwise propose, even when that one is a better fit), and

  • define a preferred storage format.

Fallback and Forced Connections

In most cases, the best choice is to let DSS propose the connection for an output dataset. DSS will default to using the same connection as the input dataset of a recipe. This enables functionality such as push-down computation of SQL-based operations.

You can complement this behavior with a fallback connection, which DSS uses whenever the input dataset offers no obvious consistent connection (for example, in recipes that use uploaded datasets as input).

Finally, administrators can force a default connection, which overrules the connection that DSS would otherwise have chosen. Where possible, rely on preferred connections rather than strictly enforcing a connection.
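The precedence described above (forced connection first, then the input dataset's connection, then the fallback) can be sketched in a few lines of Python. This is only an illustration of the decision order, not DSS code: the function name, its parameters, and the connection names are hypothetical, and modeling "no consistent connection" as `None` is an assumption.

```python
from typing import Optional


def resolve_output_connection(
    input_connection: Optional[str],
    fallback_connection: Optional[str] = None,
    forced_connection: Optional[str] = None,
) -> Optional[str]:
    """Illustrative only: pick a connection for a new managed output dataset."""
    # A forced connection overrules whatever would otherwise be proposed.
    if forced_connection:
        return forced_connection
    # Otherwise stay consistent with the recipe's input dataset.
    if input_connection:
        return input_connection
    # No consistent connection (assumed here to be None, e.g. an uploaded
    # dataset as input): use the fallback, if one is configured.
    return fallback_connection


# Hypothetical connection names, for illustration only.
print(resolve_output_connection("sql_warehouse", fallback_connection="s3_managed"))
print(resolve_output_connection(None, fallback_connection="s3_managed"))
print(resolve_output_connection("sql_warehouse", "s3_managed", forced_connection="nas_shared"))
```

The third call shows why forcing is the bluntest option: the output no longer lands next to its SQL input, so operations that could have been pushed down to the database may not be.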

Cloud-based object storage such as Amazon S3 and network-attached storage devices are good choices for the default connection. It is not recommended to use the local filesystem of the server running DSS as the default storage location.

Default File Format

In the Engines & connections panel, you can configure a default file format by adding a comma-separated list under Preferred storage formats.

Enter the list in order from most preferred format to least preferred.

Default file format options:

  • CSV_ESCAPING_NOGZIP_FORHIVE

  • CSV_UNIX_GZIP

  • CSV_EXCEL_GZIP

  • CSV_EXCEL_GZIP_BIGQUERY

  • CSV_NOQUOTING_NOGZIP_FORPIG

  • PARQUET_HIVE

  • AVRO

  • ORC

  • JSON

  • STRING
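As a rough sketch of how an ordered, comma-separated preference list like this can be interpreted, the Python below validates each entry against the identifiers listed above and then picks the first preferred format that a given connection supports. The helper names and the "first supported format wins" rule are assumptions made for illustration; the example does not call any DSS API, and the set of formats a connection supports is invented.

```python
from typing import List, Optional, Set

# Identifiers taken from the list of default file format options above.
KNOWN_FORMATS: Set[str] = {
    "CSV_ESCAPING_NOGZIP_FORHIVE",
    "CSV_UNIX_GZIP",
    "CSV_EXCEL_GZIP",
    "CSV_EXCEL_GZIP_BIGQUERY",
    "CSV_NOQUOTING_NOGZIP_FORPIG",
    "PARQUET_HIVE",
    "AVRO",
    "ORC",
    "JSON",
    "STRING",
}


def parse_preferred_formats(value: str) -> List[str]:
    """Split a comma-separated setting into an ordered list, rejecting typos."""
    formats = [item.strip() for item in value.split(",") if item.strip()]
    unknown = [fmt for fmt in formats if fmt not in KNOWN_FORMATS]
    if unknown:
        raise ValueError("Unknown format identifier(s): " + ", ".join(unknown))
    return formats


def pick_format(preferred: List[str], supported: Set[str]) -> Optional[str]:
    """Return the most preferred format that is also supported (assumed rule)."""
    for fmt in preferred:
        if fmt in supported:
            return fmt
    return None


preferred = parse_preferred_formats("PARQUET_HIVE, CSV_EXCEL_GZIP")
# Hypothetical connection that only supports CSV output.
print(pick_format(preferred, {"CSV_EXCEL_GZIP", "CSV_UNIX_GZIP"}))
```

Validating the identifiers before saving the setting is a cheap way to catch a misspelled entry that would otherwise be skipped silently.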

[Figure: The global Engines & connections panel.]

Project Settings

You can adjust the default configuration for preferred connections and file formats for each project in DSS. Doing so lets you choose options better suited to the context of that project.

Find and adjust the settings within the project under Settings > Engines & connections.

Uncheck the Use global settings box to override the global settings and set the preferred fallback connection, preferred connection, and preferred storage formats for the project.
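The relationship between the two levels can be pictured with a small Python model: while Use global settings is checked the project inherits the global values, and once it is unchecked the project's own values apply. The class and field names below are hypothetical and do not reflect how DSS stores these settings. Treating the override as all-or-nothing is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StorageSettings:
    """Hypothetical container for the three settings discussed above."""
    fallback_connection: Optional[str] = None
    preferred_connection: Optional[str] = None
    preferred_formats: str = ""


@dataclass
class ProjectSettings:
    # Mirrors the "Use global settings" checkbox; checked by default here.
    use_global_settings: bool = True
    overrides: StorageSettings = field(default_factory=StorageSettings)


def effective_settings(
    global_settings: StorageSettings, project: ProjectSettings
) -> StorageSettings:
    """Return the settings that apply to new managed datasets in a project."""
    if project.use_global_settings:
        return global_settings
    return project.overrides


# Hypothetical values, for illustration only.
global_settings = StorageSettings(
    fallback_connection="s3_managed", preferred_formats="PARQUET_HIVE,CSV_EXCEL_GZIP"
)
project = ProjectSettings(
    use_global_settings=False,
    overrides=StorageSettings(fallback_connection="nas_shared", preferred_formats="AVRO"),
)
print(effective_settings(global_settings, project).fallback_connection)
```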

[Figure: Per-project settings for Engines & connections.]

Recommendations

The selected formats should be closely aligned with the implemented use cases and the technology used. Therefore, Dataiku recommends defining this option jointly with the data science teams using the platform.

Avoid using the local filesystem as a preferred connection: datasets written there can fill the disk and affect the stability of the DSS platform. Cloud storage or network-attached storage devices are good choices for default connections because they separate the data created by DSS itself (for example, configuration and log files) from the data generated by end users (datasets).