Tutorial | Dataset plugin component example#
Custom datasets work in a similar way to custom recipes; however, custom datasets can only be written in Python.
You write Python code that reads rows from the data source, or writes rows to it.
You write a JSON descriptor that declares the configuration parameters.
The user is shown a visual interface in which they can enter the dataset’s configuration parameters.
The dataset then behaves like all other Dataiku datasets. For example, you can then run a preparation recipe on this custom dataset. Custom datasets can be used, for example, to connect to external data sources like REST APIs.
For our custom dataset, we’re going to read the Dataiku RaaS (Randomness as a Service) REST API. This API returns random numbers, so we want to use it to create a new type of dataset.
To use the API, we have to perform a GET query on http://raas.dataiku.com/api.php. For example, visiting http://raas.dataiku.com/api.php?nb=5&max=200&apiKey=secret returns 5 random numbers between 0 and 200.
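To see what the connector will have to send, the query string above can be assembled programmatically. A minimal sketch using only the Python standard library (no Dataiku dependency):

```python
from urllib.parse import urlencode

# Build the RaaS query string from its three parameters
base_url = "http://raas.dataiku.com/api.php"
params = {"nb": 5, "max": 200, "apiKey": "secret"}
url = base_url + "?" + urlencode(params)
print(url)  # http://raas.dataiku.com/api.php?nb=5&max=200&apiKey=secret
```

The response body is a JSON array of integers, which is what the connector will parse.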
Create the custom dataset#
Custom datasets are a bit more challenging to write than custom recipes since we can’t start from a regular dataset, but must build the custom dataset from scratch.
Go to the plugin developer page.
Create a new dev plugin (or reuse the previous one).
In the dev plugin page, click on +New Component.
Choose Dataset.
Select Python as the language.
Give the new dataset type an id, like raas, and click Add.
Use the editor to modify files.
We’ll start with the connector.json file. Our custom dataset needs the user to input three parameters:
Number of random numbers
Range
API Key
So let’s create our params array:
"params": [
{
"name": "apiKey",
"label": "RAAS API Key",
"type": "STRING",
"description" : "You can enter more help here"
},
{
"name": "nb",
"label": "Number of random numbers",
"type": "INT",
"defaultValue" : 10 /* You can have the data prefilled */
},
{
"name": "max",
"label": "Max value",
"type": "INT"
}
]
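This params array sits inside the connector.json descriptor alongside some metadata. As a rough sketch of the surrounding file (the exact top-level keys, such as meta, readable, and writable, are an assumption and may vary between Dataiku versions; check the generated file in your dev plugin):

```json
{
    "meta": {
        "label": "RaaS random numbers",
        "description": "Reads random numbers from the Dataiku RaaS API"
    },
    "readable": true,
    "writable": false,
    "params": [
        { "name": "apiKey", "label": "RAAS API Key", "type": "STRING" },
        { "name": "nb", "label": "Number of random numbers", "type": "INT", "defaultValue": 10 },
        { "name": "max", "label": "Max value", "type": "INT" }
    ]
}
```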
For the Python part, we need to write a Python class.
In the constructor, we’ll retrieve the parameters:
# perform some more initialization
self.key = self.config["apiKey"]
self.nb = int(self.config["nb"])
self.max = int(self.config["max"])
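Since these values come from user input, a little defensive parsing can make the connector fail fast with a clear message instead of crashing later. A standalone sketch (parse_config is a hypothetical helper of ours, not part of the Dataiku API):

```python
def parse_config(config):
    # Hypothetical helper: coerce and sanity-check the three parameters
    key = config.get("apiKey", "")
    nb = int(config.get("nb", 10))        # mirrors the defaultValue in connector.json
    max_value = int(config.get("max", 0))
    if not key:
        raise ValueError("apiKey is required")
    if nb <= 0 or max_value <= 0:
        raise ValueError("nb and max must be positive integers")
    return key, nb, max_value

print(parse_config({"apiKey": "secret", "nb": "5", "max": "200"}))  # ('secret', 5, 200)
```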
We know the schema of our dataset in advance: it will have a single column named “random” containing integers. So, in get_read_schema, let’s return this schema:
def get_read_schema(self):
    return {
        "columns" : [
            { "name" : "random", "type" : "int" }
        ]
    }
Finally, the core of the connector is the generate_rows method. This method is a generator over dictionaries. Each yield in the generator becomes a row in the dataset.
If you don’t know about generators in Python, you can have a look at this page from the Python wiki dedicated to generators.
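As a quick illustration of the pattern in plain Python (no Dataiku dependency), each yield below produces one dictionary, just as each yield in generate_rows produces one dataset row:

```python
def demo_rows(numbers):
    # Turn a plain list into a stream of row-like dictionaries
    for n in numbers:
        yield {"random": n}

rows = list(demo_rows([42, 7, 13]))
print(rows)  # [{'random': 42}, {'random': 7}, {'random': 13}]
```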
We’ll be using the requests library to perform the API calls.
The final code of our dataset is:
from dataiku.connector import Connector
import requests


class MyConnector(Connector):

    def __init__(self, config):
        Connector.__init__(self, config)  # pass the parameters to the base class
        self.key = self.config["apiKey"]
        self.nb = int(self.config["nb"])
        self.max = int(self.config["max"])

    def get_read_schema(self):
        return {
            "columns" : [
                { "name" : "random", "type" : "int" }
            ]
        }

    def generate_rows(self, dataset_schema=None, dataset_partitioning=None,
                      partition_id=None, records_limit=-1):
        req = requests.get("http://raas.dataiku.com/api.php", params={
            "apiKey": self.key,
            "nb": self.nb,
            "max": self.max
        })
        array = req.json()
        for random_number in array:
            yield { "random" : random_number }
(The other methods are not required at this point, so we removed them.)
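Note that generate_rows receives a records_limit argument, which the code above ignores; its default of -1 conventionally means "no limit". Honoring it avoids fetching more rows than a sample or preview needs. A standalone sketch of the stopping logic (limited_rows is our illustrative helper, not a Dataiku API):

```python
def limited_rows(numbers, records_limit=-1):
    # Stop early once records_limit rows have been yielded; -1 means unlimited
    for i, n in enumerate(numbers):
        if records_limit >= 0 and i >= records_limit:
            return
        yield {"random": n}

print(len(list(limited_rows(range(10), records_limit=3))))  # 3
```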
Use the plugin#
In the new dataset menu, you can now see your new dataset (try reloading your browser if this is not the case). You are presented with a UI to set the 3 required parameters.
Set secret as API Key.
Set anything as nb and max.
Click Test.
Your random numbers appear!
Hit Create, and you have created a new type of dataset. You can now use it like any other Dataiku dataset.
About caching#
There is no specific caching mechanism in custom datasets. Custom datasets are often used to access external APIs, and you may not want to perform another call on the API each time Dataiku needs to read the input dataset.
It is therefore highly recommended that the first thing you do with a custom dataset is use a Sync or Prepare recipe to make a cached copy of the data in a first-party data store.