From Excel to Dataiku¶
If you already use Excel, you will find that you can do perform the same operations and analytics in Dataiku, while also making the most of its big data capabilities! This guide leads you through some of the core concepts and capabilities of working in a spreadsheet, and how they are accomplished in Dataiku.
Excel users new to Dataiku may also want to check out the quick start tutorial geared to users with an Excel background.
Excel supports a number of data cleaning operations, including:
Assessing data quality
Finding and replacing invalid values
Each of these operations, and more, are handled as individual steps in Dataiku’s Prepare recipe. The Prepare recipe’s script provides a history of the cleaning and enrichment actions taken, and allows you to quickly re-apply the data preparation when new data arrives.
Excel provides the ability to create complex formulas for computing new values based on arbitrary selections of cells.
Dataiku’s approach to tabular data is similar to Excel’s structured references to columns. Instead of referring to cells within a table, Dataiku formulas refer to column names, and the formula is applied to each row in the table. Formulas are provided as a step in Dataiku’s Prepare recipe, along with an expression editor to validate your formula. The expression editor shows a preview of results of your formula, examples of correct syntax, and a reference guide to functions you can use in formulas.
Excel has functions for generating and manipulating dates, like TODAY() and DATEDIF(). Date values are stored as integers in Excel, but are displayed in a format of your choice.
In Dataiku, the value displayed is the value that is stored, and while text like “1/5/17” may be clear to the human eye (or not–is that January 5 or May 1?) Dataiku must parse these text dates, taking into account their format and time zone, in order to create a true date column. Functions for manipulation of true date columns are provided as individual steps in the Prepare recipe’s script.
Excel can remove duplicate values, using all columns or a subset to determine uniqueness of a row. Duplicates are simply removed, with no way to recover them later.
Dataiku’s Distinct recipe identifies and removes duplicate rows within a dataset. Additionally, it can track which rows had duplicates, and how many, in the original dataset.
Excel has two separate tools for filtering rows: one that allows you to specify a simple filter on one column, and an advanced filter tool that uses filter criteria specified in worksheet cells.
In Dataiku, you can create simple or advanced row filters with the Filter recipe, a visual tool that helps you to create filters that can be read and understood in natural language.
Alternatively, you can filter rows as a step within a script of the Prepare recipe.
In Excel you can create a column of random values with the RAND() function and then filter on values of that column to select a random subsample of rows.
In Dataiku, you can use the Sample step in a Sample/Filter recipe to draw a subsample of your data. This can be a random sample, or a weighted sample based on the different classes of a particular variable.
You can split datasets in Excel by selecting and copying portions of a worksheet to a new file.
In Dataiku, you can create simple or advanced rules for splitting a dataset into two (or more) output datasets with the Split recipe, a visual tool that helps you to create split rules that can be read and understood in natural language.
You can merge two datasets in Excel by copying the contents of one dataset and pasting it at the end of the other. You need to make sure the columns of both datasets are in the same order before pasting.
In Dataiku, you can use the Stack recipe to append one or more datasets to another. The recipe automatically lines up columns with matching names across the datasets. Columns without a match may be kept or dropped from the output dataset.
In Excel you can use the VLOOKUP function to enrich one dataset with columns from another, by matching values of a key column that is common to both datasets. A separate VLOOKUP statement should be created for each column added from the other dataset.
In Dataiku, the Join recipe matches values of a key column that is common to both datasets. Records with the same key values are paired up and share information. Depending on the type of join, records without a match may be kept or dropped. By default, all other columns are included in the output dataset, but you can choose a subset.
In Excel you can use pivot tables to compute aggregated column statistics.
In Dataiku, to produce a dataset equivalent to an Excel pivot table, use the Pivot recipe.
The Group Recipe can also produce aggregated statistics, but no columns are pivoted. Think of it as creating an Excel pivot table without specifying any Column fields.
Excel has a Sort tool for arranging the rows of a dataset according to values of one or more Sort columns.
In Dataiku, the Sort Recipe offers the same functionality. Additionally, you can compute various measures of the ordering of each row, including the row number, rank, and dense rank.
In Excel, you can use the Sort tool and remove rows in order to extract the most extreme rows in a dataset.
In Dataiku, the TopN recipe handles this task more efficiently than a Sort + Filter. Additionally, you can retrieve the top/bottom rows by values of a grouping column, and compute various measures of the ordering of each row.