Build Your Security Model - Connections - Metastore

The metastore catalog is a concept that originated from the Hive project. The metastore stores an association between paths (initially on HDFS) and virtual tables.
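As a rough mental model, a metastore can be pictured as a lookup table from table names to storage locations. The sketch below is a toy in-memory illustration only, not how Hive or DSS implement it. The class `ToyMetastore`, the `sales.orders` table, its path, and its columns are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableEntry:
    """One virtual table: where its files live and which columns it exposes."""
    location: str
    columns: tuple


class ToyMetastore:
    """In-memory stand-in for a metastore catalog (illustration only)."""

    def __init__(self):
        # Maps (database, table) to a TableEntry.
        self._tables = {}

    def register(self, database, table, location, columns):
        """Associate a virtual table with a storage path and a schema."""
        self._tables[(database, table)] = TableEntry(location, tuple(columns))

    def resolve(self, database, table):
        """Return the entry a query engine would use to find the files."""
        return self._tables[(database, table)]


metastore = ToyMetastore()
metastore.register(
    "sales",
    "orders",
    "hdfs:///data/sales/orders",  # hypothetical path
    [("order_id", "bigint"), ("amount", "double")],
)
entry = metastore.resolve("sales", "orders")
print(entry.location)
```

A query engine that is asked for `sales.orders` would consult the metastore in this way to learn which files to read and how to interpret them.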

In this article, we’ll show you how to configure Dataiku DSS (DSS) to interact with an internal or external metastore service. To learn more, visit Metastore catalog.

Configuring an Internal Metastore

When no external metastore service, such as a Hive Metastore or an AWS Glue Data Catalog, is available, you can use the DSS virtual metastore, in which DSS itself plays the role of the metastore. For more information about enabling this type of metastore service, visit DSS as virtual metastore.

To request the DSS virtual metastore service from the catalog, a DSS user performs the following steps:

  • From the Administration menu, select Catalog.

  • Choose the metastore service.

Configuring an External Metastore

DSS can interact with an AWS Glue Data Catalog when DSS is deployed on AWS. You’ll need an existing S3 connection, whose credentials are used to access the catalog.

Step 1 - Enable the AWS Glue Data Catalog

The first step is to enable the AWS Glue Data Catalog. To do this:

  • From the Administration menu, navigate to the Settings tab.

  • Under Compute & Scaling, select Metastore catalogs.

  • From the Metastore kind dropdown menu, select AWS Glue.

Metastore catalogs page in the Compute and Scaling section of the Settings tab within DSS settings.

Step 2 - Define an S3 Connection to the AWS Glue Data Catalog

The next step is to configure an S3 connection to use the AWS Glue Data Catalog.

Access to the AWS Glue Data Catalog is granted through the credentials defined in the S3 connection, which may be per-user credentials. For this reason, Dataiku recommends authenticating through an S3 connection.

When submitting Spark jobs, DSS will automatically configure Spark to use AWS Glue with the appropriate credentials.
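This article does not list the exact properties DSS sets. As a hedged illustration of what pointing Spark at AWS Glue generally involves, the sketch below builds the Hive client factory property that Amazon EMR documents for using the Glue Data Catalog as a metastore. The helper names `glue_spark_properties` and `as_spark_submit_args` are hypothetical, and the properties DSS actually generates may differ.

```python
# Client factory class documented by Amazon EMR for the Glue Data Catalog.
# Assumption: the matching Glue client library is on the Spark classpath.
GLUE_CLIENT_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_spark_properties():
    """Return Spark properties that route Hive metastore calls to AWS Glue."""
    return {
        "spark.sql.catalogImplementation": "hive",
        "spark.hadoop.hive.metastore.client.factory.class": GLUE_CLIENT_FACTORY,
    }


def as_spark_submit_args(properties):
    """Render a property dict as a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in sorted(properties.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


spark_args = as_spark_submit_args(glue_spark_properties())
print(" ".join(spark_args))
```

When DSS submits the job, there is no need to pass such arguments by hand. The sketch is only meant to show the kind of configuration being handled automatically.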

To configure an S3 connection to the AWS Glue Data Catalog:

  • Create an S3 connection, or note the name of an existing S3 connection that you want to rely on.

  • Set the Metastore type to AWS Glue.

  • Set Glue Auth to Use AWS credentials from a connection.

  • Enter the S3 connection name and then select Save.

Metastore catalogs page in the Compute and Scaling section of the Settings tab within DSS settings.

Step 3 - Configure the Connection to Sync with the Metastore

To synchronize the datasets created through a specific connection with an external metastore, you must enable synchronization in that connection’s settings.

To do this:

  • Select Keep datasets synced.

  • Select a Fallback metastore DB value so that the dataset metadata is written to the proper metastore database.

Once the connection is configured, the metadata of every new dataset is added to the external metastore, so the dataset can be imported from the Catalog menu (via the Connections explorer tab).

For example, the following connection named “athena” points to an AWS Glue catalog:

AWS Glue catalog connection in the Connections explorer tab of the Catalog.

External datasets (datasets that are generated outside of DSS) can also be imported this way.
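To check outside of DSS which tables a Glue database exposes, one option is to query the AWS Glue API directly. The following is a minimal sketch using the boto3 `get_tables` paginator. The function name `list_glue_tables`, the database name `dss_db`, and the region are hypothetical, and the caller is assumed to have AWS credentials with read access to Glue.

```python
def list_glue_tables(glue_client, database):
    """Return (table name, storage location) pairs for one Glue database.

    `glue_client` is expected to behave like boto3.client("glue").
    """
    paginator = glue_client.get_paginator("get_tables")
    tables = []
    for page in paginator.paginate(DatabaseName=database):
        for table in page["TableList"]:
            # Some entries (for example views) may have no storage location.
            location = table.get("StorageDescriptor", {}).get("Location")
            tables.append((table["Name"], location))
    return tables


# Example usage (requires boto3 and valid AWS credentials):
#
#   import boto3
#   glue = boto3.client("glue", region_name="eu-west-1")  # hypothetical region
#   for name, location in list_glue_tables(glue, "dss_db"):  # hypothetical DB
#       print(name, location)
```

Passing the client in as a parameter keeps the function easy to exercise with a stub, and lets the caller choose the credentials and region.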