Hands-On Tutorial: Create a New Dataset With an SQL Query Recipe

Note

This lesson is part of a series of “Usage of SQL and Dataiku tutorials” that begins with the Hands-On Tutorial: Sync Recipe.

Tip

This content is also included in a free Dataiku Academy course Dataiku & SQL, which is an optional part of the Core Designer learning path. Register for the course there if you prefer to track and validate your progress alongside concept videos, text summaries, hands-on tutorials, and quizzes.

Now that we have datasets that correspond to tables in a database, we want all our processing to be done in-database. One way to do this is to write our own SQL code and execute it in a code recipe.

  • With the orders_copy dataset selected, select the SQL code recipe from the Actions menu.

  • Choose the SQL Query type, and click Set to define the output dataset.

  • Name it orders_by_customer; by default it will choose to store data into the same location.

  • Click Create Dataset and Create Recipe.

The recipe form is now populated with the following code, which selects all columns from the DKU_TUTORIAL_SQL_orders_copy table in the dku_shirt schema.

SELECT *
  FROM "dku_tshirt"."DKU_TUTORIAL_SQL_orders_copy"

Hint

The examples here identify tables with the schema prefix “dku_tshirt” because the datasets were configured to be created in this schema in the PostgreSQL database (see the Integration with SQL Databases course). If you did not follow the configuration steps from the Integration with SQL Databases course, then the schema prefix may not be required for your table names.

Our current goal with the past orders is to group them by customer, aggregating their past interactions. In the Basics 102 course, we accomplished this with a visual Group recipe, but it can also be easily accomplished with SQL code.

Edit the query to read the following:

SELECT customer_id,
       AVG(pages_visited) AS pages_visited_avg,
       SUM(tshirt_price*tshirt_quantity) AS total
    FROM "dku_tshirt"."DKU_TUTORIAL_SQL_orders_copy"
    GROUP BY customer_id;

The query result has the following columns:

  • customer_id

  • pages_visited_avg, representing the average number of pages on the Haiku T-shirt website visited by the customer during orders

  • total, representing the sum total of the value of orders made by the customer, where the value of each order is the price of each t-shirt multiplied by the number of t-shirts purchased

Click on Validate.

  • Dataiku parses your query, verifies that the syntax is correct, and fetches the output dataset column names and types.

Click Run.

  • Dataiku starts a job that creates the output table in the PostgreSQL database and fills it with the results of your query. Go to the Flow and see the newly added recipe and dataset.

Note

For more information about SQL code recipes, please see the product documentation.