Tutorial | SQL from a Python recipe in Dataiku¶
SQL is the most pervasive way to make data analysis queries. However, writing advanced logic like loops and conditions is often difficult in SQL. There are some options like stored queries, but they require learning new languages.
Dataiku lets you run SQL queries directly from a Python recipe. This lets you:
Sequence several SQL queries.
Dynamically generate new SQL queries to execute in the database.
Use SQL to obtain some aggregated data for further numerical processing in Python.
In this tutorial, you will:
Run a query in a SQL notebook.
Use Python to generate a more complex SQL query.
Execute a SQL query in-database from a Python recipe.
To get started, create the initial starter project with the data already uploaded.
From the Dataiku homepage, click +New Project > DSS tutorials > Developer > SQL in Python.
You can also download the starter project from this website and import it as a zip file.
The sfo_monthly_landing_data dataset records the number of cargo landings and total landed cargo weight at the SFO airport. From 2005 to 2015, the dataset contains one record for each month, airline, type and details of aircraft. The line includes the number of aircraft of this type of this company that landed this month, and the number of records.
What we would like to do is write a recipe to obtain, for each month and airline, a breakdown in several columns of the total landing weight by aircraft manufacturer. In essence, that’s a kind of crosstab / pivot table.
To make things a bit more complex, we are not interested in the small aircraft manufacturers. We only want the top 5 manufacturers, meaning those with the highest count of landings.
With these constraints, doing that in SQL only would be fairly complex. Let’s do it with a little bit of Python calling SQL.
Your Dataiku project has the source data in a filesystem (or uploaded) dataset. As this is an import from a CSV file, Dataiku has assigned only string types to the input columns. If you go to the Settings > Schema tab of the dataset, you’ll see the columns declared as string.
We could set the types manually in the dataset, but we could also let the Prepare recipe do it. Since anyway we need to copy our dataset to a SQL database, using a Prepare recipe instead of a Sync recipe will give us the typing inference for free.
Create a Prepare recipe from the sfo_monthly_landing_data dataset.
Choose to store the output in your SQL database connection.
Name the output dataset
Click Run since no preparation steps are needed.
In order to create the column per manufacturer, we’re going to use the
CASE expression. For example, let’s use an SQL notebook to calculate the total landing weights of Airbus planes.
Create a new SQL notebook on the sfo_prepared dataset.
In the Tables tab of the left hand panel, click the + next to the sfo_prepared dataset to copy a query that selects all columns and rows from this table.
Adjusting the table name if needed, copy and Run the query below that sums the weight of each row if it is Airbus.
SELECT SUM(CASE WHEN "Aircraft Manufacturer" = 'Airbus' THEN "Total Landed Weight" ELSE 0 END) AS airbus_weight FROM "DKU_TUTORIAL_SQLINPYTHON_sfo_prepared"
Be sure to recognize the correct reference to your table name. Run on a local PostgreSQL connection, the full table name is
"DKU_TUTORIAL_SQLINPYTHON_sfo_prepared". However, if using a managed Snowflake connection on Dataiku Cloud, for example, the complete table name would include instance information. You’ll need this table name for the remainder of the tutorial.
Your project key might also differ if projects of the same name already existed when creating the project.
Now let’s have Python send a SQL query that creates a pandas dataframe of the top five manufacturers.
From the sfo_prepared dataset, create a new Python notebook.
Replace the starter code with the code below.
Adjust the table name as needed depending on your instance, and then Run the cell.
# -*- coding: utf-8 -*- import dataiku import pandas as pd, numpy as np from dataiku import pandasutils as pdu # Import the class that allows us to execute SQL on the Studio connections from dataiku import SQLExecutor2 # Get a handle on the input dataset sfo_prepared = dataiku.Dataset("sfo_prepared") # We create an executor. We pass to it the dataset instance. This way, the # executor knows which SQL database should be targeted executor = SQLExecutor2(dataset=sfo_prepared) # Get the 5 most frequent manufacturers by total landing count # (over the whole period) mf_manufacturers = executor.query_to_df( """ SELECT "Aircraft Manufacturer" as manufacturer, SUM("Landing Count") as count FROM "DKU_TUTORIAL_SQLINPYTHON_sfo_prepared" GROUP BY "Aircraft Manufacturer" ORDER BY count DESC LIMIT 5 """) # The "query_to_df" method returns a Pandas dataframe that # contains the manufacturers
Feel free to print out and inspect the
mf_manufacturers dataframe created from this code.
You can learn more in the Developer Guide about performing SQL queries with the Python API.
Now that we have a dataframe with the top manufacturers, let’s use a loop to generate a Python list of these pesky
CASE WHEN expressions.
In a new cell at the end of the same Python notebook, copy and Run the following:
cases =  for (row_index, manufacturer, count) in mf_manufacturers.itertuples(): cases.append( """ SUM(CASE WHEN "Aircraft Manufacturer" = '%s' THEN "Total Landed Weight" ELSE 0 END) AS "weight_%s" """ % (manufacturer, manufacturer))
We can now use this list of case expressions to create the final SQL query.
In a new cell at the end of the same Python notebook, copy and Run the following (again adjusting the table name as necessary):
final_query = """ SELECT "Activity Period", "Operating Airline", COUNT(*) AS airline_count, %s FROM "DKU_TUTORIAL_SQLINPYTHON_sfo_prepared" GROUP BY "Activity Period", "Operating Airline" """ % (",".join(cases)) print(final_query)
To finish, we need to execute the query with a Python recipe.
Click + Create Recipe > Python recipe > OK.
Add a new output dataset named
Click Create Dataset and Create Recipe.
Replace the default output at the end of the recipe with the code below, and then click Run.
result = executor.query_to_df(final_query) output_dataset = dataiku.Dataset("sfo_pivot") output_dataset.write_with_schema(result)
The output is exactly what we wanted.
Here is a vertical stacked bar chart with the sum of the weights of the top five manufacturers in the Y field and the operating airlines in the X field, sorted by descending airline counts.
Fairly unsurprisingly, Boeing is very dominant, and most airlines are mono-carrier. However, United Airlines has a non-negligible Airbus cargo fleet.
In this first version, we executed both queries using
query_to_df, meaning that the Python code actually received the whole data in memory, and sent it back to the database for storing the output dataset.
It would be better (and critical in the case of big datasets) that the final query be performed fully in-database.
Fortunately, Dataiku makes that easy. Executing SQL queries in-database and handling the work of dropping/creating table is what the SQL query does. The
SQLExecutor2 class lets you run a query as if it was a SQL Query recipe.
Return to the Python recipe.
Replace the final three lines of the code by the code block below, and then click Run.
output_dataset = dataiku.Dataset("sfo_pivot") SQLExecutor2.exec_recipe_fragment(output_dataset, final_query)
Everything works the same, but now the data has not been streamed to Python. Everything stayed in the database!