Concept | Data Catalog#
The Data Catalog is a central place for analysts, data scientists, and other collaborators to share and search for datasets across their organization.
From the Data Catalog homepage, which is accessible from the Applications menu on the top right, you can search for datasets and indexed tables using the search bar at the top. You can also use the different categories below to find datasets:
Data Collections
Datasets & Indexed Tables
Connection Explorer
Popular Datasets
You can also view the data lineage, which tracks data transformations throughout your pipeline.
Data Collections#
In Data Collections, you can find curated groups of datasets, view information about those datasets, and reuse them in your own projects. Click on any dataset in a collection to view its details, status, and schema. From here, you can also explore, publish, export, watch, or mark the dataset as a favorite.
Users with relevant permissions can publish datasets to a Data Collection from the Flow or from within a collection. Other users can then browse collections to find relevant datasets, read the documentation, and add datasets to their projects straight from the collection.
Tip
Data Collections, Workspaces, and the Feature Store are all central repositories for teams to share items in Dataiku. However, each has a different specialty:
Data Collections is the recommended place for analytics teams to share and search for input datasets to use in projects.
Workspaces are designed primarily for analytics teams to share end products (such as dashboards, applications, and datasets) with external audiences, collaborate with business teams, and gather feedback.
The Feature Store is the central registry where data scientists aggregate and share feature groups that are promoted for their high quality.
Datasets & Indexed Tables#
In the Datasets & Indexed Tables tab of the Data Catalog, you can search for any dataset used in projects on your organization’s Dataiku instance.
With relevant permissions, you can also use a dataset in your own projects, or publish it to a data collection or the Feature Store. You can also view details such as which projects use the dataset, the data contact, and when the dataset was created, modified, or last built.
If your instance admin has indexed your external database connections, you can also toggle to search Indexed External Tables. This section allows you to search your organization’s indexed connections, preview tables and their schemas, and import them as Dataiku datasets.
Note
For a more general look at different ways to search for items and information in Dataiku, see Concept | Searching in Dataiku.
Connection Explorer#
The Connection Explorer allows you to browse your organization’s remote connections, such as BigQuery, Hive, or SQL Server connections. You can browse, filter, and preview the tables on a connection, then import selected tables into your Dataiku projects.
Note
Administrators of data collections and of Dataiku projects can configure permissions that affect dataset visibility in a data collection. For more detail, see the reference documentation.
Popular Datasets#
At the bottom of the Data Catalog homepage, you can also see a list of Popular Datasets, which Dataiku automatically populates with datasets that have a relatively high number of shares, a short time since the last rebuild, or trending popularity, among other metrics.
As with other datasets in the Data Catalog, you can publish a popular dataset to a Data Collection, Workspace, or Feature Store; preview it; or import it directly into a project.
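To give a feel for how several signals can combine into a single popularity ranking, here is a minimal sketch in plain Python. It is not Dataiku's actual formula, which this article does not describe. The `DatasetStats` fields, the weights, and the sample figures are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class DatasetStats:
    name: str
    shares: int              # how many times the dataset has been shared
    days_since_build: float  # time since the last rebuild, in days
    recent_views: int        # hypothetical stand-in for "trending popularity"


def popularity_score(d: DatasetStats) -> float:
    """Combine the three signals into one number (hypothetical weights)."""
    # Freshness is 1.0 for a dataset built just now and decays toward 0.
    freshness = 1.0 / (1.0 + d.days_since_build)
    return 2.0 * d.shares + 1.0 * d.recent_views + 10.0 * freshness


def rank_popular(datasets, top_n=3):
    """Return the names of the top_n datasets, highest score first."""
    ordered = sorted(datasets, key=popularity_score, reverse=True)
    return [d.name for d in ordered[:top_n]]


# Made-up sample data for illustration only.
stats = [
    DatasetStats("customers", shares=5, days_since_build=0.0, recent_views=12),
    DatasetStats("orders", shares=1, days_since_build=30.0, recent_views=3),
    DatasetStats("web_logs", shares=8, days_since_build=2.0, recent_views=4),
]
ranking = rank_popular(stats)
```

In this toy data, a freshly rebuilt and frequently viewed dataset ranks above one that is shared more often but is slightly stale. That is the kind of trade-off a multi-metric ranking has to make.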
Data Lineage#
In the Data Lineage tab, you can view the lineage of a column and track its transformations through the data pipeline. Choose the project, dataset, and column you would like to track. See Concept | Data lineage for more information.
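Conceptually, column lineage can be thought of as a graph in which each column points back to the columns it was derived from. The sketch below illustrates that idea in plain Python. It does not use the Dataiku API, and the dataset names, column names, and the `upstream_columns` helper are hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: each (dataset, column) maps to the
# (dataset, column) pairs it was directly derived from.
LINEAGE = {
    ("orders_enriched", "total_eur"): [
        ("orders_prepared", "total"),
        ("fx_rates", "eur_rate"),
    ],
    ("orders_prepared", "total"): [("orders_raw", "amount")],
}


def upstream_columns(dataset, column, lineage=LINEAGE):
    """Return every column that feeds, directly or indirectly, into the given one.

    Uses a breadth-first traversal, so direct parents are listed first.
    """
    start = (dataset, column)
    seen = {start}  # guards against cycles and duplicate visits
    ancestors = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in lineage.get(node, []):
            if parent not in seen:
                seen.add(parent)
                ancestors.append(parent)
                queue.append(parent)
    return ancestors


ancestors = upstream_columns("orders_enriched", "total_eur")
```

Choosing a project, dataset, and column in the Data Lineage tab corresponds to picking the starting node here. The traversal then walks back through each transformation until it reaches source columns that have no recorded parents.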