Dataiku enables users to split, or partition, datasets along meaningful dimensions. These partitions, or subsets of the original dataset, can then be computed independently.
Learn more about this technique in the following concept articles and tutorials.
To validate your knowledge of this area, register for the Partitioning course, an optional part of the Advanced Designer learning path, on the Dataiku Academy.
Concepts & tutorials
- Concept | Partitioning
- Concept | How partitioning adds value
- Concept | Partitioned datasets
- Concept | Jobs with partitioned datasets
- Tutorial | File-based partitioning
- Tutorial | Column-based partitioning
- Concept | Partitioning by pattern
- Concept | Partitioning in a scenario
- Tutorial | Partitioning in a scenario
- Concept | Partition redispatch and collection
- Tutorial | Repartition a non-partitioned dataset
Tip | Interacting with partitioned datasets using the Python API
If your recipe uses partitioned datasets as input or output, you need to take care to read and/or write the correct partitions.
Reading and writing
You don't need to specify the source or destination partitions in your code; Dataiku DSS handles the partition routing when reading and writing.
To read the input partitions (as defined by the partition dependencies), call `get_dataframe()` on the input dataset; it automatically returns only the relevant partitions. Likewise, writes to the output dataset are automatically directed to the partition being built.
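A minimal sketch of this pattern, assuming it runs inside a Dataiku DSS recipe; the dataset names (`orders`, `orders_prepared`) and the transformation are hypothetical, and a small stub stands in for the `dataiku` module so the sketch can also be exercised outside DSS:

```python
import types
import pandas as pd

try:
    import dataiku  # available inside a Dataiku DSS recipe
except ImportError:
    # Minimal stand-in so this sketch runs outside DSS. In a real job,
    # dataiku.Dataset resolves the partitions for you.
    class _StubDataset:
        _store = {}

        def __init__(self, name):
            self.name = name

        def get_dataframe(self):
            # The real call returns only the input partitions the job depends on.
            return self._store.get(
                self.name,
                pd.DataFrame({"order_date": ["2024-01-01"], "amount": [10.0]}),
            )

        def write_with_schema(self, df):
            # The real call writes to the output partition being built.
            self._store[self.name] = df

    dataiku = types.SimpleNamespace(Dataset=_StubDataset)

# Reading: no partition names appear in the code -- get_dataframe()
# returns exactly the partitions selected by the partition dependencies.
orders = dataiku.Dataset("orders")  # hypothetical dataset name
df = orders.get_dataframe()

df["amount_eur"] = df["amount"] * 0.9  # any ordinary transformation

# Writing: the frame is routed to the output partition(s) being built.
prepared = dataiku.Dataset("orders_prepared")  # hypothetical dataset name
prepared.write_with_schema(df)
```

Note that the recipe body is identical to a non-partitioned recipe; partition selection lives entirely in the Flow's dependency settings.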
For purposes other than reading or writing dataframes, you can access the name of the partition being built (as well as other flow variables) through the Python dictionary `dataiku.dku_flow_variables`, as described in the product documentation.
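A short sketch of looking up the current output partition value, assuming a partition dimension named `date`; the `DKU_DST_<dimension>` key pattern is my reading of the product documentation, so verify the exact variable names for your job. A stand-in dictionary is used when the code runs outside DSS:

```python
try:
    import dataiku
    flow_vars = dataiku.dku_flow_variables  # real dictionary inside a DSS job
except (ImportError, AttributeError):
    # Stand-in outside DSS; assumes a partition dimension named "date" and
    # the DKU_DST_<dimension> key pattern for destination partitions.
    flow_vars = {"DKU_DST_date": "2024-01-01"}

# Value of the output partition currently being built for the "date"
# dimension -- useful for logging, file paths, or calls to external systems.
partition_value = flow_vars.get("DKU_DST_date")
print(partition_value)
```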
Note that `dataset.get_write_partition()` is deprecated.