DSS Metastore Catalog

Explore resources in this section to configure Dataiku to interact with internal or external metastore services, as well as to work with assets from metastore-aware engines.

Note

To learn more, visit Metastore catalog in the reference documentation.

Reference | Dataiku metastore catalog

The metastore catalog is a concept that originated from the Hive project. The metastore stores an association between paths (initially on HDFS) and virtual tables.

Dataiku DSS (DSS) provides its own metastore catalog, which stores the same kind of association between paths and virtual tables.

A “table” in the metastore catalog is made of the following:

  • A location of the files making up the data,

  • A schema (column names and types),

  • A storage format indicating the file format of the data files, and

  • Various other metadata.

Hadoop Distributed File System (HDFS) and cloud storage datasets such as Amazon S3, Azure Blob Storage, and Google Cloud Storage (GCS) can have an associated table in the metastore catalog.
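
As a concrete illustration, a table with those parts can be registered in a Glue Data Catalog with boto3. This is only a sketch: the database, table, path, columns, and CSV storage format below are all assumptions.

import boto3

# Create a Glue client; credentials come from the environment
# (the region is an assumption)
glue = boto3.client("glue", region_name="us-east-1")

# A metastore "table" ties together a data location, a schema, and a
# storage format (every name, path, and column here is hypothetical)
glue.create_table(
    DatabaseName="dss_metastore_db",
    TableInput={
        "Name": "my_s3_dataset",
        "StorageDescriptor": {
            "Location": "s3://my-bucket/path/to/dataset/",
            "Columns": [
                {"Name": "rate", "Type": "double"},
                {"Name": "city", "Type": "string"},
            ],
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    },
)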

Note

To find out about the three kinds of metastores that DSS can leverage, DSS integration points with the metastore, and engines and features that leverage the metastore, visit Metastore catalog.

How-to | Configure an internal metastore

When no external metastore service, such as a Hive metastore or an AWS Glue Data Catalog, is available, you can use the DSS virtual metastore, where DSS itself plays the role of the metastore. For more information about enabling this type of metastore service, visit DSS as virtual metastore.

To request the DSS virtual metastore service from the catalog, a DSS user performs the following steps:

  1. From the Administration menu, select Catalog.

  2. Choose the metastore service.

How-to | Configure an external metastore (AWS Glue Data Catalog)

DSS can interact with an AWS Glue Data Catalog when DSS is deployed on AWS. You’ll need an existing S3 connection from which you will retrieve credentials.

Step 1 - Enable the AWS Glue Data Catalog

The first step is to enable the AWS Glue Data Catalog.

  1. From the Administration menu, navigate to the Settings tab.

  2. Under Compute & Scaling, select Metastore catalogs.

  3. From the Metastore kind dropdown, select AWS Glue.

Metastore catalogs page in the Compute and Scaling section of the Settings tab within DSS settings.

Step 2 - Define an S3 Connection to the AWS Glue Data Catalog

The next step is to configure an S3 connection to use the AWS Glue Data Catalog.

Access to the AWS Glue Data Catalog is granted through the credentials defined in the S3 connection, which may be per-user credentials. For this reason, Dataiku recommends authenticating through an S3 connection.

When submitting Spark jobs, DSS will automatically configure Spark to use AWS Glue with the appropriate credentials.
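
DSS wires this up automatically, but for illustration, here is roughly how a Glue-backed catalog is enabled on a standalone Spark setup. The factory class is the standard AWS Glue Hive client; the rest of the sketch is an assumption, not what DSS generates.

from pyspark.sql import SparkSession

# Illustration only: DSS injects equivalent settings when it submits Spark
# jobs. On AWS, a Glue-backed Hive catalog is typically enabled by pointing
# the Hive metastore client at the AWS Glue factory class.
spark = (
    SparkSession.builder
    .appName("glue-catalog-sketch")
    .config(
        "spark.hadoop.hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
    )
    .enableHiveSupport()
    .getOrCreate()
)

# Tables synced to Glue now appear as regular catalog databases and tables
print(spark.catalog.listDatabases())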

To configure an S3 connection to the AWS Glue Data Catalog:

  1. Create an S3 connection or obtain the information of an existing S3 connection that you want to rely on.

  2. Set the Metastore kind option to AWS Glue.

  3. Set Glue Auth to Use AWS credentials from a connection.

  4. Enter the S3 connection name and then select Save.

Metastore catalogs page in the Compute and Scaling section of the Settings tab within DSS settings.

Step 3 - Configure the Connection to Sync with the Metastore

To synchronize the datasets created through a specific connection to an external metastore, configure the connection as follows:

  1. Select Keep datasets synced.

  2. Select a Fallback metastore DB value to direct the dataset metadata to the proper database.

Once the connection is configured, the metadata of every new dataset will be written to the external metastore, allowing you to import the dataset from the Catalog menu (via the Connections Explorer tab).

For example, the following connection named “athena” is pointing to an AWS Glue catalog:

AWS Glue catalog connection in the Connections explorer tab of the Catalog.

External datasets (datasets that are generated outside of DSS) can also be imported this way.

How-to | Synchronize a dataset to the metastore catalog

When a managed dataset is built, DSS automatically “pushes” the dataset’s definition as the corresponding table in the metastore. Metastore synchronization happens as part of the job run, after the dataset is built. You can also force the dataset to synchronize with the metastore.

To force the dataset to synchronize with the metastore, visit the advanced settings of the dataset to define the metastore database and table (if not defined at the connection level), then select Synchronize.

Force synchronization of an HDFS dataset to the Dataiku metastore catalog.
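
The same synchronization can be scripted through the Dataiku Python API client, which exposes this action on datasets. A minimal sketch, in which the host URL, API key, project key, and dataset name are all placeholders:

import dataikuapi

# Connect to the DSS instance (URL and API key are placeholders)
client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")

# Hypothetical project key and dataset name
dataset = client.get_project("MYPROJECT").get_dataset("mydataset")

# Push the dataset's definition to the metastore, the same action as the
# Synchronize button in the dataset's advanced settings
dataset.synchronize_hive_metastore()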

How-to | Import a dataset from the Hive metastore (HMS)

You can import HDFS datasets directly from the Hive metastore (HMS) into a DSS project.

For example, let’s say we have a Hadoop integration and a Hive metastore (through HiveServer2). To import the dataset, do the following:

  1. Select + DATASET > Search and import….

    Search for and import a dataset from a Hive metastore (HMS).

  2. Choose the Hive metastore.

  3. Select the tables you want to import.

    Selecting tables to import from the Hive catalog.

  4. Select the HDFS target connection that gives you access to the actual data paths.

  5. Create your new datasets.

    Select the HDFS target connection.

Similarly, if using Glue as your DSS metastore, you can import S3 datasets from the Glue metastore.

How-to | Interact with AWS Glue

AWS Glue Data Catalog is a metastore service that provides references to data. These references can include DSS managed datasets or external references filled in by external components such as crawlers and EMR. DSS can interact with AWS Glue either to store its own datasets’ metadata or to expose a data catalog. When the data catalog is exposed, it is reachable through the DSS Catalog > Connections Explorer.

Configure AWS Glue as the Metastore

An AWS Glue configuration is coupled with S3 connections because its main authentication mechanism relies on an already defined S3 connection.

To configure AWS Glue as the Metastore catalog:

  1. Select the Administration menu and navigate to the Settings tab.

  2. Select Metastore catalogs in the left panel.

Usage Parameters section of the Connections tab within DSS settings.

To configure AWS Glue as the Metastore, you must choose an authentication option:

  • Use AWS credentials from Environment (credentials file, environment variables, instance metadata).

  • Use AWS credentials from an already existing S3 connection.

Using AWS Credentials from an S3 Connection

When using the credentials from an S3 connection, you must first ensure that the IAM role defined to be assumed is properly configured to interact with AWS Glue. This is because all the AWS Glue interactions will be performed through this role.

The following policy example allows the role to interact with AWS Glue:

{
  "Version": "2012-10-17",
  "Statement": [
      {
          "Sid": "VisualEditor0",
          "Effect": "Allow",
          "Action": [
              "glue:GetDatabase",
              "glue:CreateTable",
              "glue:GetTables",
              "glue:CreateDatabase",
              "glue:DeleteTable",
              "glue:GetDatabases",
              "glue:GetTable",
              "glue:GetCatalogImportStatus"
          ],
          "Resource": [
              "arn:aws:glue:*:<AWS ACCOUNT ID>:table/*/*",
              "arn:aws:glue:*:<AWS ACCOUNT ID>:database/*",
              "arn:aws:glue:*:<AWS ACCOUNT ID>:catalog"
          ]
      }
  ]
}

Once the AWS Glue Metastore is configured, DSS users can browse the DSS Catalog to find AWS Glue databases. Which databases are visible depends on the policies you set for the IAM role used in the metastore configuration.
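
As a quick sanity check that the role's policy works, you can list the Glue databases it can see with boto3, outside of DSS. A minimal sketch; the region is an assumption and credentials are resolved from your environment:

import boto3

# Create a Glue client; credentials and role assumption come from the
# environment (the region is an assumption)
glue = boto3.client("glue", region_name="us-east-1")

# List every database the role is allowed to see in the Glue Data Catalog
for page in glue.get_paginator("get_databases").paginate():
    for database in page["DatabaseList"]:
        print(database["Name"])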

In addition, you can configure the S3 connection to persist its managed datasets’ metadata in a specific AWS Glue database.

Note

Every interaction with a dataset stored on an S3 bucket can be audited in AWS CloudTrail. For traceability, the CloudTrail records include the assumed role and the role session name.

Reference | Querying datasets from metastore-aware engines

You can query datasets that are synced to the metastore from metastore-aware engines, including Hive, Impala, SparkSQL, and Athena.

Querying HDFS Managed Datasets

Once the dataset definitions are synchronized to the Hive metastore (through HiveServer2), you can query HDFS managed datasets from metastore-aware engines.

In the following example, we’re using a Sort recipe to sort the rows of our dataset by averageprice. Since our dataset definitions are synchronized to the Hive metastore, we can successfully run the recipe with Hive or Impala engines.

Running a recipe on the Hive engine.

Querying S3 Datasets

Similarly, you can query S3 datasets from metastore-aware engines.

In the following example, we’re using a Sort recipe to sort the rows of our dataset by Rate.

Since we configured DSS to use the Glue metastore and made sure the S3 connection has the “Keep datasets synced” metastore property, Athena can find the definition of the input dataset and query it successfully.

The Athena engine uses the metastore catalog to get the dataset definition.

Running a recipe with the Athena engine.
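
Because the table definition lives in the Glue Data Catalog, the same dataset can also be queried outside DSS with Athena. A minimal sketch with boto3; the database, table, column, and results bucket are assumptions:

import boto3

# Create an Athena client (the region is an assumption)
athena = boto3.client("athena", region_name="us-east-1")

# Query the synced table by its metastore name (hypothetical names below)
response = athena.start_query_execution(
    QueryString="SELECT * FROM dss_metastore_db.my_s3_dataset ORDER BY rate LIMIT 10",
    QueryExecutionContext={"Database": "dss_metastore_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])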

How-to | Build a chart using a metastore-aware engine

You can also use metastore-aware engines when building a chart. For example, we can chart the same S3 dataset using Athena by changing the execution engine from the default DSS engine to an in-database engine.

  1. From your dataset, visit the Charts tab.

  2. Select the in-database engine option.

Computing a chart using a metastore-aware engine.

How-to | Query datasets from a metastore-aware notebook

Another advantage of having your datasets’ definitions synchronized to the metastore is that you can query the datasets directly from metastore-aware notebooks, including Hive, Impala, and SparkSQL notebooks.

Query an HDFS dataset from a SparkSQL notebook.
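
In a PySpark notebook, the same synced table can be queried through the metastore by name rather than through a DSS connection. A minimal sketch; the database and table names are assumptions:

from pyspark.sql import SparkSession

# In a DSS notebook a Hive-enabled session is typically already provided;
# this builder call is shown so the sketch is self-contained
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The table is resolved through the metastore
# ("dku" and "dss_mydataset" are hypothetical names)
df = spark.sql("SELECT * FROM dku.dss_mydataset ORDER BY averageprice")
df.show(10)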

Note

To query an S3 dataset from a SparkSQL notebook, configure the Interactive SparkSQL Execution mode. You’ll find this by visiting Administration > Settings > Compute & Scaling > Spark.

Configuring interactive SparkSQL execution mode.