From Excel to Dataiku¶
If you already use Excel, you will find that you can do perform the same operations and analytics in Dataiku, while also making the most of its big data capabilities! This guide leads you through some of the core concepts and capabilities of working in a spreadsheet, and how they are accomplished in Dataiku.
Tip
Excel users new to Dataiku may also want to check out the quick start tutorial geared to users with an Excel background.
Reference | Data cleaning¶
Excel supports a number of data cleaning operations, including:
Renaming columns
Removing columns
Assessing data quality
Finding and replacing invalid values

Each of these operations, and more, are handled as individual steps in Dataiku’s Prepare recipe. The Prepare recipe’s script provides a history of the cleaning and enrichment actions taken, and allows you to quickly re-apply the data preparation when new data arrives.
See also
Prepare recipe lessons in the Visual Recipes 101 course
Data preparation in the reference documentation
Reference | Using formulas¶
Excel provides the ability to create complex formulas for computing new values based on arbitrary selections of cells.

Dataiku’s approach to tabular data is similar to Excel’s structured references to columns. Instead of referring to cells within a table, Dataiku formulas refer to column names, and the formula is applied to each row in the table. Formulas are provided as a step in Dataiku’s Prepare recipe, along with an expression editor to validate your formula. The expression editor shows a preview of results of your formula, examples of correct syntax, and a reference guide to functions you can use in formulas.
See also
Prepare recipe lessons in the Visual Recipes 101 course
Formula language in the reference documentation
Reference | Working with dates¶
Excel has functions for generating and manipulating dates, like TODAY() and DATEDIF(). Date values are stored as integers in Excel, but are displayed in a format of your choice.

In Dataiku, the value displayed is the value that is stored, and while text like “1/5/17” may be clear to the human eye (or not–is that January 5 or May 1?) Dataiku must parse these text dates, taking into account their format and time zone, in order to create a true date column. Functions for manipulation of true date columns are provided as individual steps in the Prepare recipe’s script.
See also
Prepare recipe lessons in the Visual Recipes 101 course
Managing Dates in the reference documentation
Reference | Removing duplicates¶
Excel can remove duplicate values, using all columns or a subset to determine uniqueness of a row. Duplicates are simply removed, with no way to recover them later.

Dataiku’s Distinct recipe identifies and removes duplicate rows within a dataset. Additionally, it can track which rows had duplicates, and how many, in the original dataset.
See also
Distinct recipe lessons in the Visual Recipes 101 course
Distinct recipe in the reference documentation
Reference | Filtering rows¶
Excel has two separate tools for filtering rows: one that allows you to specify a simple filter on one column, and an advanced filter tool that uses filter criteria specified in worksheet cells.

In Dataiku, you can create simple or advanced row filters with the Filter recipe, a visual tool that helps you to create filters that can be read and understood in natural language.
Alternatively, you can filter rows as a step within a script of the Prepare recipe.
See also
Filter recipe lessons in the Visual Recipes 101 course
Prepare recipe lessons in the Visual Recipes 101 course
Filtering and flagging rows in the reference documentation
Reference | Sampling rows¶
In Excel you can create a column of random values with the RAND() function and then filter on values of that column to select a random subsample of rows.

In Dataiku, you can use the Sample step in a Sample/Filter recipe to draw a subsample of your data. This can be a random sample, or a weighted sample based on the different classes of a particular variable.
See also
Sample recipe lessons in the Visual Recipes 101 course
Sample recipe in the reference documentation
Reference | Split a dataset¶
You can split datasets in Excel by selecting and copying portions of a worksheet to a new file.
In Dataiku, you can create simple or advanced rules for splitting a dataset into two (or more) output datasets with the Split recipe, a visual tool that helps you to create split rules that can be read and understood in natural language.
See also
Split recipe lessons in the Visual Recipes 101 course
Split recipe in the reference documentation
Reference | Append datasets¶
You can merge two datasets in Excel by copying the contents of one dataset and pasting it at the end of the other. You need to make sure the columns of both datasets are in the same order before pasting.
In Dataiku, you can use the Stack recipe to append one or more datasets to another. The recipe automatically lines up columns with matching names across the datasets. Columns without a match may be kept or dropped from the output dataset.
See also
Stack recipe lessons in the Visual Recipes 101 course
Stack recipe in the reference documentation
Reference | Joining datasets¶
In Excel you can use the VLOOKUP function to enrich one dataset with columns from another, by matching values of a key column that is common to both datasets. A separate VLOOKUP statement should be created for each column added from the other dataset.

In Dataiku, the Join recipe matches values of a key column that is common to both datasets. Records with the same key values are paired up and share information. Depending on the type of join, records without a match may be kept or dropped. By default, all other columns are included in the output dataset, but you can choose a subset.
See also
Join recipe lessons in the Visual Recipes 101 course
Join recipe in the reference documentation
Reference | Aggregate and pivot¶
In Excel you can use pivot tables to compute aggregated column statistics.

In Dataiku, to produce a dataset equivalent to an Excel pivot table, use the Pivot recipe.
The Group Recipe can also produce aggregated statistics, but no columns are pivoted. Think of it as creating an Excel pivot table without specifying any Column fields.
See also
Pivot recipe lessons in the Visual Recipes 101 course
Group recipe lessons in the Visual Recipes 101 course
Reference | Sorting values¶
Excel has a Sort tool for arranging the rows of a dataset according to values of one or more Sort columns.

In Dataiku, the Sort Recipe offers the same functionality. Additionally, you can compute various measures of the ordering of each row, including the row number, rank, and dense rank.
See also
Sort recipe lessons in the Visual Recipes 101 course
Sort recipe in the reference documentation
Reference | Top values¶
In Excel, you can use the Sort tool and remove rows in order to extract the most extreme rows in a dataset.
In Dataiku, the TopN recipe handles this task more efficiently than a Sort + Filter. Additionally, you can retrieve the top/bottom rows by values of a grouping column, and compute various measures of the ordering of each row.
See also
TopN recipe lessons in the Visual Recipes 101 course
TopN recipe in the reference documentation