Tip | Excel to Dataiku operation mapping#

New terminology can be a stumbling block when learning a new software tool. The tables below provide an overview mapping some key operations in Excel to a similar feature in Dataiku.

Use formulas#

Excel

Dataiku

Excel provides the ability to create complex formulas for computing new values based on arbitrary selections of cells.

Dataiku’s approach to tabular data is similar to Excel’s structured references to columns. Instead of referring to cells within a table, Dataiku formulas refer to column names, and apply the formula to each row in the table.

Filter rows#

Excel

Dataiku

Excel has two separate tools for filtering rows. The first allows you to specify a simple filter on one column. The second is an advanced filter tool that uses filter criteria specified in worksheet cells.

In Dataiku, you can create simple or advanced row filters with the Filter recipe. Alternatively, you can filter rows as a step within a script of the Prepare recipe.

Clean data#

Excel

Dataiku

Excel supports a number of data cleaning operations, including renaming or removing columns, assessing data quality, or fixing invalid values.

Dataiku’s Prepare recipe handles each of these operations and more as individuals steps. The Prepare recipe’s script provides a history of the cleaning and enrichment actions taken, and allows you to re-apply the data preparation when new data arrives.

Join rows#

Excel

Dataiku

In Excel, you can use the VLOOKUP function to enrich one dataset with columns from another, by matching values of a key column that’s common to both datasets.

In Dataiku, the Join recipe matches values of a key column that’s common to both datasets. Records with the same key values are paired up and share information. Depending on the type of join, the recipe may keep or drop records without a match. By default, the recipe includes all other columns in the output dataset, but you can choose a subset.

Aggregate and pivot rows#

Excel

Dataiku

In Excel, you can use pivot tables to compute aggregated column statistics.

In Dataiku, to produce a dataset equivalent to an Excel pivot table, use the Pivot recipe. To produce a visualization of a pivot table, see Tutorial | Pivot tables.

The Group recipe can also produce aggregated statistics, but no columns are pivoted. Think of it as creating an Excel pivot table without specifying any Column fields.

Work with dates#

Excel

Dataiku

Excel has functions for generating and manipulating dates, like TODAY() and DATEDIF(). Excel stores date values stored as integers, but displays them in a format of your choice.

In Dataiku, the value displayed is the value that’s stored. While text like 1/5/17 may be clear to the human eye (or not — is that January 5 or May 1?) Dataiku must parse these text dates, taking into account their format and time zone. The Prepare recipe’s processor library provides functions for manipulation of date columns. Learn more in Concept | Date handling in Dataiku.

Append rows#

Excel

Dataiku

In Excel, you can merge two datasets by copying the contents of one dataset and pasting it at the end of the other. You need to make sure the columns of both datasets are in the same order before pasting.

In Dataiku, you can use the Stack recipe to append one or more datasets to another. The recipe automatically lines up columns with matching names across the datasets.

Split a worksheet#

Excel

Dataiku

You can split datasets in Excel by selecting and copying portions of a worksheet to a new file.

In Dataiku, you can create rules for splitting a dataset into two (or more) output datasets with the Split recipe.

Remove duplicates#

Excel

Dataiku

In Excel, you can remove duplicate values, using all columns or a subset to determine uniqueness of a row. Excel removes duplicates with no way to recover them later.

Dataiku’s Distinct recipe identifies and removes duplicate rows within a dataset. Additionally, it can track which rows had duplicates, and how many, in the original dataset.

Sort rows#

Excel

Dataiku

Excel has a Sort tool for arranging the rows of a dataset according to values of one or more Sort columns.

In Dataiku, the Sort recipe offers the same functionality. Additionally, you can compute various measures of the ordering of each row, including the row number, rank, and dense rank.

Extract top or bottom rows#

Excel

Dataiku

In Excel, you can use the Sort tool and remove rows to extract the most extreme rows in a dataset.

In Dataiku, the Top N recipe handles this task more efficiently than a Sort + Filter. Additionally, you can retrieve the top/bottom rows by values of a grouping column, and compute various measures of the ordering of each row.

Sample rows#

Excel

Dataiku

In Excel, you can create a column of random values with the RAND() function, and then filter on values of that column to select a random subsample of rows.

In Dataiku, you can use a Sample/Filter recipe to draw a subsample of your data. This can be a random sample, or a weighted sample based on the different classes of a particular variable.

Next steps#

The best place to get started transitioning from Excel to Dataiku is Quick Start | Excel to Dataiku.