Concept | Data Catalog#

The Data Catalog is a central place for analysts, data scientists, and other collaborators to share and search for datasets across their organization.

From the Data Catalog homepage, which is accessible from the Applications menu on the top right, you can search for datasets and indexed tables using the search bar at the top. You can also use the different categories below to find datasets:

  • Data Collections

  • Datasets & Indexed Tables

  • Connection Explorer

  • Popular Datasets

You can also view the data lineage, which tracks data transformations throughout your pipeline.

Screenshot of the Data Catalog.

Data Collections#

In Data Collections, you can find curated groups of datasets, view information about those datasets, and reuse them in your own projects. Click any dataset in a collection to view its details, status, and schema. From here, you can also explore, publish, export, or watch the dataset, or mark it as a favorite.

Users with relevant permissions can publish datasets to a Data Collection from the Flow or from within a collection. Other users can then browse collections to find relevant datasets, read the documentation, and add datasets to their projects straight from the collection.

Screenshot of the Data Collections.

Tip

Data Collections, Workspaces, and the Feature Store are all central repositories for teams to share items in Dataiku. However, each of them has a different specialty.

  • Data Collections is the recommended place for analytics teams to share and search for input datasets to use in projects.

  • Workspaces are designed primarily for analytics teams to share end products (such as dashboards, applications, and datasets) with external audiences, collaborate with business teams, and gather feedback.

  • The Feature Store is the central registry where data scientists aggregate and share feature groups selected for their high quality.

Datasets & Indexed Tables#

In the Datasets & Indexed Tables tab of the Data Catalog, you can search for any dataset used in projects on your organization’s Dataiku instance.

With relevant permissions, you can also use a dataset in your own projects, or publish it to a Data Collection or the Feature Store. You can also view details such as which projects the dataset is used in, the data contact, and when the dataset was created, modified, or last built.

If your instance admin has indexed your external database connections, you can also toggle to search Indexed External Tables. This section allows you to search your organization’s indexed connections, preview tables and their schemas, and import them as Dataiku datasets.

Screenshot of the dataset search.

Note

For a more general look at different ways to search for items and information in Dataiku, see Concept | Searching in Dataiku.

Connection Explorer#

The Connection Explorer allows you to browse your organization’s remote connections, such as BigQuery, Hive, or SQL Server connections. You can browse, filter, and preview the tables on a connection, then import selected tables into your Dataiku projects.

Screenshot of the connection explorer.

Note

Administrators of data collections and of Dataiku projects can configure permissions that affect dataset visibility in a data collection. For more detail, see the reference documentation.

Data Lineage#

In the Data Lineage tab, you can view the lineage of a column and track its transformations through the data pipeline. Choose the project, dataset, and column you would like to track. See Concept | Data lineage for more information.

An example of the Dataset Lineage view in the Data Catalog.
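
To make the idea of column lineage more concrete, here is a minimal Python sketch of how the upstream origins of a single column could be traced through a pipeline. It does not use Dataiku's API or reflect how the Data Catalog is implemented. The `LINEAGE` mapping, the dataset and column names, and the `upstream_columns` helper are all hypothetical and exist only for illustration.

```python
from collections import deque

# Hypothetical column-level lineage (illustrative names, not Dataiku objects):
# each (dataset, column) maps to the upstream (dataset, column) pairs it derives from.
LINEAGE = {
    ("orders_enriched", "total_eur"): [("orders_prepared", "total"), ("fx_rates", "rate")],
    ("orders_prepared", "total"): [("orders_raw", "amount")],
    ("fx_rates", "rate"): [],
    ("orders_raw", "amount"): [],
}


def upstream_columns(dataset, column, lineage=LINEAGE):
    """Return every upstream (dataset, column) pair, in breadth-first order."""
    seen = []
    queue = deque(lineage.get((dataset, column), []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited through another path
        seen.append(node)
        queue.extend(lineage.get(node, []))
    return seen


# Trace where the "total_eur" column of "orders_enriched" comes from.
print(upstream_columns("orders_enriched", "total_eur"))
```

Under these assumptions, choosing a project, dataset, and column in the Data Lineage tab plays the same role as choosing the starting `(dataset, column)` pair in the sketch: the view then walks back through the transformations that produced it.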