Code Sample | Access info about datasets

You generally want to avoid hard-coding connection information, table names, etc. in your recipe code. Dataiku can give you some connection / location information about the datasets that you are trying to read or write.

For all datasets, you can use the dataset.get_location_info() method. It returns a structure containing an info dict. The keys in the info dict depend on the specific kind of dataset. Print the dict to see its contents (for example, in a Jupyter notebook). Here are a few examples:

import dataiku

# myfs is a Filesystem dataset
dataset = dataiku.Dataset("myfs")
locinfo = dataset.get_location_info()
print(locinfo["info"])

{
  "path" : "/data/input/myfs"
}
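
For example, you can use the returned path instead of hard-coding the directory when you need to work with the files yourself. A minimal sketch, assuming the "myfs" dataset above and that its root path is accessible from where the code runs:

import os

import dataiku

dataset = dataiku.Dataset("myfs")
root_path = dataset.get_location_info()["info"]["path"]

# List the files under the dataset's root path instead of
# hard-coding a directory such as "/data/input/myfs"
for name in os.listdir(root_path):
  print(os.path.join(root_path, name))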

# sql is a PostgreSQL dataset
dataset = dataiku.Dataset("sql")
locinfo = dataset.get_location_info()
print(locinfo["info"])

{
  "databaseType" : "PostgreSQL",
  "schema" : "public",
  "table" : "mytablename"
}
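
Similarly, the schema and table names let you build SQL references without hard-coding them in your recipe. A minimal sketch, assuming the "sql" dataset above (the query itself is only an illustration):

import dataiku

dataset = dataiku.Dataset("sql")
info = dataset.get_location_info()["info"]

# Build a fully-qualified table reference from the location info
qualified_table = '"%s"."%s"' % (info["schema"], info["table"])
query = "SELECT COUNT(*) FROM %s" % qualified_table
print(query)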

In addition, for filesystem-like datasets (Filesystem, HDFS, S3, etc.), you can use the get_files_info() method to get details about all files in a dataset (or partition).

dataset = dataiku.Dataset("non_partitioned_fs")
fi = dataset.get_files_info()

for filepath in fi["globalPaths"]:
  # Returns a path relative to the root path of the dataset.
  # The root path of the dataset is returned by get_location_info
  print filepath["path"]
  # Size in bytes of that file
  print filepath["size"]
dataset = dataiku.Dataset("partitioned_fs")
fi = dataset.get_files_info()

for (partition_id, partition_filepaths) in fi["pathsByPartition"].items():
  print partition_id

  for filepath in partition_filepaths:
    # Returns a path relative to the root path of the dataset.
    # The root path of the dataset is returned by get_location_info
    print filepath["path"]
    # Size in bytes of that file
    print filepath["size"]
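
For example, you can aggregate the per-file sizes to see how much data each partition holds. A minimal sketch, assuming the "partitioned_fs" dataset above:

import dataiku

dataset = dataiku.Dataset("partitioned_fs")
fi = dataset.get_files_info()

# Total size (in bytes) and file count per partition
for (partition_id, partition_filepaths) in fi["pathsByPartition"].items():
  total_bytes = sum(filepath["size"] for filepath in partition_filepaths)
  print("%s: %d files, %d bytes" % (partition_id, len(partition_filepaths), total_bytes))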