Access info about datasets
You generally want to avoid hard-coding connection information, table names, etc. in your recipe code. Dataiku can give you some connection / location information about the datasets that you are trying to read or write.
For all datasets, you can use the dataset.get_location_info() method. It returns a structure containing an info dict. The keys in the info dict depend on the specific kind of dataset; print the dict to see what is available (for example, in a Jupyter notebook). Here are a few examples:
import dataiku

# "myfs" is a Filesystem dataset
dataset = dataiku.Dataset("myfs")
locinfo = dataset.get_location_info()
print(locinfo["info"])
# Sample output:
# {
#     "path" : "/data/input/myfs"
# }
# "sql" is a PostgreSQL dataset
dataset = dataiku.Dataset("sql")
locinfo = dataset.get_location_info()
print(locinfo["info"])
# Sample output:
# {
#     "databaseType" : "PostgreSQL",
#     "schema" : "public",
#     "table" : "mytablename"
# }
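For instance, you can use this information to build a query instead of hard-coding the table name. The sketch below is a minimal illustration based on the keys shown above; the PostgreSQL-style identifier quoting and the query itself are assumptions made for the example.

import dataiku

dataset = dataiku.Dataset("sql")
info = dataset.get_location_info()["info"]

# Build a fully qualified table identifier from the location info instead of
# hard-coding it (double-quote identifier quoting assumed, as in PostgreSQL)
qualified_table = '"{}"."{}"'.format(info["schema"], info["table"])
query = "SELECT COUNT(*) FROM {}".format(qualified_table)
print(query)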
In addition, for filesystem-like datasets (Filesystem, HDFS, S3, etc.), you can use the get_files_info() method to get details about all files in a dataset (or partition).
dataset = dataiku.Dataset("non_partitioned_fs")
fi = dataset.get_files_info()
for filepath in fi["globalPaths"]:
# Returns a path relative to the root path of the dataset.
# The root path of the dataset is returned by get_location_info
print filepath["path"]
# Size in bytes of that file
print filepath["size"]
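Since these paths are relative to the dataset's root path, you can combine get_files_info() with get_location_info() to reconstruct full paths and, for example, total up the dataset size. A minimal sketch under those assumptions; the posixpath-based joining and the leading-slash handling are simplifications for the example.

import posixpath
import dataiku

dataset = dataiku.Dataset("non_partitioned_fs")
root = dataset.get_location_info()["info"]["path"]
files = dataset.get_files_info()["globalPaths"]

total_size = 0
for f in files:
    # Join the dataset root path with each relative path (a leading "/" on
    # the relative path is stripped here; adjust to your storage layout)
    print(posixpath.join(root, f["path"].lstrip("/")))
    total_size += f["size"]

# Total size of all files in the dataset, in bytes
print("Total size: {} bytes".format(total_size))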
dataset = dataiku.Dataset("partitioned_fs")
fi = dataset.get_files_info()
for (partition_id, partition_filepaths) in fi["pathsByPartition"].items():
print partition_id
for filepath in partition_filepaths:
# Returns a path relative to the root path of the dataset.
# The root path of the dataset is returned by get_location_info
print filepath["path"]
# Size in bytes of that file
print filepath["size"]
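The per-partition structure also makes it easy to see how much data each partition holds. A minimal sketch using only the keys shown above; the dataset name is a placeholder:

import dataiku

dataset = dataiku.Dataset("partitioned_fs")
fi = dataset.get_files_info()

# Sum the file sizes (in bytes) within each partition
sizes_by_partition = {
    partition_id: sum(f["size"] for f in filepaths)
    for partition_id, filepaths in fi["pathsByPartition"].items()
}

for partition_id, size in sorted(sizes_by_partition.items()):
    print("{}: {} bytes".format(partition_id, size))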