Concept | Stack recipe#

The Stack recipe appends the rows of two or more datasets into a single output dataset. This is the operation to reach for when you need to vertically align multiple datasets into one schema.

Stacking with other technologies#

Depending on your previous experience with data tools, you’ve likely already stacked datasets.

Technology	Tool/Function
Excel	Append Queries in Excel Power Query (or copy-paste)
SQL	`UNION ALL`
Python (pandas)	`pd.concat([dataset_a, dataset_b], ignore_index=True)`
R (tidyverse)	`bind_rows(dataset_a, dataset_b)`

You could perform the same SQL, Python, or R operation with a code recipe. However, in Dataiku, you’ll often want to use the visual Stack recipe for this kind of data transformation.

Stacking method overview#

The recipe includes six options for defining a stack condition:

The table below gives a first introduction to these methods and when you might need them.

Stack recipe methods#
Stack method	Main idea	Typical purpose
Union of input schemas	Keep all columns from all input datasets	Combine similar datasets without losing information
Intersection of input schemas	Keep only columns shared by all datasets	Standardize datasets to common fields
Use schema from one input dataset	Use one dataset as the “master schema”	Append supplemental data into a canonical structure
Map using column order	Match columns by position instead of name	Combine poorly labeled or headerless files
Manual column remapping	Explicitly map columns between datasets	Merge datasets with different naming conventions
Use a custom defined schema	Define exactly which output columns to keep	Create a curated standardized output

Example datasets#

To walk through these different stacking methods, review the following two datasets. Note that both datasets share two columns (name and age), while two other columns (height and sex) appear only in one dataset.

Dataset A#
name	age	height
Alice	22	5’8”
Bob	36	6’2”
Carol	19	5’10”

Dataset B#
name	age	sex
John	53	M
Alyssa	47	F

Union of input schemas#

The first method is stacking datasets based on the union of input schemas. In other words, any column in either dataset A or dataset B appears in the output. If there is any missing information, the recipe creates an empty value.

Note the empty values in the height and sex columns.

A and B stacked (Union of input schemas)#
name	age	height	sex
Alice	22	5’8”
Bob	36	6’2”
Carol	19	5’10”
John	53		M
Alyssa	47		F

Intersection of input schemas#

The second method is stacking datasets based on the intersection of input schemas. In other words, only columns in both dataset A and dataset B appear in the output.

In this example, only the columns name and age appear in both datasets.

The recipe drops columns height and sex since they aren’t found in both A and B.

A and B stacked (Intersection of input schemas)#
name	age
Alice	22
Bob	36
Carol	19
John	53
Alyssa	47

Use schema from dataset#

The third method is stacking datasets using the schema from one of the input datasets. If selecting dataset A as the reference schema, the recipe excludes columns not found in this schema (like dataset B’s sex column).

If choosing Dataset B as the reference schema, sex would appear in the output instead of height.

A and B stacked (Use schema from Dataset A)#
name	age	height
Alice	22	5’8”
Bob	36	6’2”
Carol	19	5’10”
John	53
Alyssa	47

Using column order#

The fourth method is stacking datasets based on column order. For this stacking method, the recipe matches columns based on their ordinal position, ignoring column names.

If taking dataset A’s schema as the reference order, the recipe writes the values of the dataset B’s third column (sex) under dataset A’s third column (height).

This option would be highly impractical for this data! Values from the sex column shouldn’t appear in a height column. Ordinal stacking wouldn’t be the right choice for this kind of data.

A and B stacked (Use column order from Dataset A)#
name	age	height
Alice	22	5’8”
Bob	36	6’2”
Carol	19	5’10”
John	53	M
Alyssa	47	F

Manually remapping#

The fifth method is manually selecting and remapping columns. This is the most flexible stacking method. It enables you to drop, rename, and reorder columns. You can also specify the source column from each input dataset to feed into the output column.

In this example, the recipe drops the age column, renames the height column, and reorders the sex column.

A and B stacked (Manual remapping)#
first_name	sex	height_ft_in
Alice	5’8”	22
Bob	6’2”	36
Carol	5’10”	19
John		53
Alyssa		F

Custom defined schema#

The sixth method is stacking datasets based on a custom schema. With this method, you can manually select which columns to include in the output.

In this example, the recipe selects only the name and height columns.

A and B stacked (Custom schema)#
name	height
Alice	5’8”
Bob	6’2”
Carol	5’10”
John
Alyssa

Next steps#

You’ve just learned how to stack your data! To perform the opposite operation, dividing a dataset into multiple outputs, check out the Split recipe.

Tip

You can find this content (and more) by registering for the Dataiku Academy course, Visual Recipes. When ready, challenge yourself to earn a certification!