How to Work with the DSS Metastore Catalog

In this article, we’ll describe the steps needed to perform the following tasks:

  • Synchronize a dataset to the metastore catalog,

  • Import a dataset from a Hive metastore, and

  • Query datasets from metastore-aware engines and notebooks.

The Dataiku DSS (DSS) metastore catalog stores the association between paths and virtual tables.

A “table” in the metastore catalog is made of the following:

  • A location of the files making up the data,

  • A schema (column names and types),

  • A storage format indicating the file format of the data files, and

  • Various other metadata.

Hadoop Distributed File System (HDFS) and cloud storage datasets such as Amazon S3, Azure Blob Storage, and Google Cloud Storage (GCS) can have an associated table in the metastore catalog.

To find out about the three kinds of metastores that DSS can leverage, DSS integration points with the metastore, and engines and features that leverage the metastore, visit Metastore catalog.

Synchronize a Dataset to the Metastore Catalog

When a managed dataset is built, DSS automatically “pushes” the dataset’s definition as the corresponding table in the metastore. Metastore synchronization happens as part of the job run, after the dataset is built. You can also force synchronization: in the dataset’s advanced settings, define the metastore database and table (if not defined at the connection level), then select Synchronize.

Force synchronization of an HDFS dataset to the Dataiku metastore catalog.
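You can also trigger this synchronization from code. Below is a minimal sketch using the dataikuapi Python client, assuming its DSSDataset object’s synchronize_hive_metastore() method corresponds to the Synchronize button; the host, API key, project key, and dataset name are placeholders.

```python
import dataikuapi

# Placeholders: replace with your DSS URL, an API key, and your identifiers.
client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
dataset = client.get_project("MYPROJECT").get_dataset("my_hdfs_dataset")

# Push the dataset's definition (location, schema, storage format) to the
# metastore, like selecting Synchronize in the dataset's advanced settings.
dataset.synchronize_hive_metastore()
```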

Import a Dataset from a Hive Metastore (HMS)

You can import HDFS datasets directly from the Hive metastore (HMS) into a DSS project.

For example, let’s say we have a Hadoop integration and a Hive metastore (through HiveServer2). To import the dataset, do the following:

  • Select + DATASET > Search and import….

Search for and import a dataset from a Hive metastore (HMS).

  • Choose the Hive metastore.

Browsing a connection to the Hive metastore.
  • Select the tables you want to import.

Selecting tables to import from the Hive catalog.
  • Select the HDFS “Target connection” that gives you access to the actual data paths.

  • Create your new datasets.

Select the HDFS target connection.
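The same import can be scripted. Here is a minimal sketch, assuming the dataikuapi client’s tables-import helper (init_tables_import and add_hive_table); the Hive database and table names are hypothetical.

```python
import dataikuapi

client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
project = client.get_project("MYPROJECT")

# Declare which Hive tables to import ("default" / "pos_sales" are
# hypothetical names), then prepare and execute the import.
tables_import = project.init_tables_import()
tables_import.add_hive_table("default", "pos_sales")
prepared = tables_import.prepare()
prepared.execute()
```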

Similarly, if using Glue as your DSS metastore, you can import S3 datasets from the Glue metastore.

Browsing a connection to the Glue metastore.

Query a Dataset from a Metastore-Aware Engine

You can query datasets that are synchronized to the metastore from metastore-aware engines, including Hive, Impala, SparkSQL, and Athena.

Querying HDFS Managed Datasets

Once the dataset definitions are synchronized to the Hive metastore (through HiveServer2), you can query HDFS managed datasets from metastore-aware engines.

In the following example, we’re using a Sort recipe to sort the rows of our dataset by averageprice. Since our dataset definitions are synchronized to the Hive metastore, we can successfully run the recipe with Hive or Impala engines.

Running a recipe on the Hive engine.
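The same kind of query can be issued from code running inside DSS. The sketch below, intended for a DSS Python recipe or notebook, assumes the dataiku package’s HiveExecutor; the dataset and column names are hypothetical.

```python
import dataiku
from dataiku.core.sql import HiveExecutor

# Hypothetical dataset name; the executor targets the Hive database that
# holds this dataset's synchronized table.
dataset = dataiku.Dataset("pos_sales_prepared")
executor = HiveExecutor(dataset=dataset)

# Plain HiveQL against the metastore table backing the dataset.
df = executor.query_to_df(
    "SELECT * FROM pos_sales_prepared ORDER BY averageprice")
print(df.head())
```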

Querying S3 Datasets

Similarly, you can query S3 datasets from metastore-aware engines.

In the following example, we’re using a Sort recipe to sort the rows of our dataset by Rate.

Since we configured DSS to use the Glue metastore and made sure the S3 connection has the “keep datasets synced” metastore property, Athena is able to find the definition of the input dataset to successfully query it.

The Athena engine uses the metastore catalog to get the dataset definition.

Running a recipe with the Athena engine.
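You can exercise the same lookup from code. Below is a minimal sketch, assuming a DSS connection named "athena" and a hypothetical Glue database.table pair; SQLExecutor2 sends the query over the SQL connection, and Athena resolves the table through the Glue entry that DSS keeps in sync.

```python
from dataiku.core.sql import SQLExecutor2

# "athena" is a hypothetical DSS connection name; the database.table pair
# below stands in for the Glue entry backing the S3 dataset.
executor = SQLExecutor2(connection="athena")
df = executor.query_to_df(
    'SELECT * FROM "mydb"."myproject_mydataset" ORDER BY "Rate"')
print(df.head())
```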

Building a Chart Using a Metastore-Aware Engine

You can also use metastore-aware engines when building a chart. For example, we can chart the same S3 dataset using Athena by changing the execution engine from the default DSS engine to an in-database engine.

  • From your dataset, visit the Charts tab.

  • Select the in-database engine option.

Computing a chart using a metastore-aware engine.

Query Datasets from Notebooks

Another advantage of having your datasets’ definitions synchronized to the metastore is that you can query the datasets directly from metastore-aware engine notebooks, including Hive, Impala, and SparkSQL notebooks.

Query an HDFS dataset from a SparkSQL notebook.
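In a PySpark notebook, for example, the Spark session can be created with Hive support so that tables synchronized by DSS are resolvable by name. A minimal sketch; the database and table names below are hypothetical.

```python
from pyspark.sql import SparkSession

# Enable Hive support so Spark reads table definitions from the metastore.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical database.table name: use the database and table defined in
# the dataset's advanced settings (or at the connection level).
df = spark.sql("SELECT * FROM default.myproject_mydataset LIMIT 10")
df.show()
```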

Note

To query an S3 dataset from a SparkSQL notebook, configure the Interactive SparkSQL Execution mode. You’ll find this by visiting Administration > Settings > Compute & Scaling > Spark.

Configuring interactive SparkSQL execution mode.