How-to | Interact with AWS Glue
AWS Glue Catalog is a Metastore service that provides references to data. These references can point to DSS managed datasets or can be populated by external components such as crawlers and EMR. DSS can interact with AWS Glue either to store the metadata of its own datasets or to expose a data catalog. When the data catalog is exposed, it is reachable through the DSS Catalog > Connections Explorer.
Configure AWS Glue as the Metastore
An AWS Glue configuration is coupled with S3 connections because its main authentication mechanism relies on an already defined S3 connection.
To configure AWS Glue as the Metastore catalog:
Select the Administration menu and navigate to the Settings tab.
Select Metastore catalogs in the left panel.
DSS can leverage three kinds of metastores.
To configure AWS Glue as the Metastore, you must choose an authentication option:
Use AWS credentials from Environment (credentials file, environment variables, instance metadata).
Use AWS credentials from an already existing S3 connection.
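Before choosing the environment option, it can help to check which credential sources are actually present for the operating-system user that runs DSS. The following is a minimal sketch, not part of DSS: the function name `environment_credential_hints` is hypothetical, and it assumes the standard AWS variable names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` and the default credentials file location `~/.aws/credentials`. It does not probe instance metadata.

```python
import os
from pathlib import Path


def environment_credential_hints(env=None, home=None):
    """List the AWS credential sources that appear to be available.

    Hypothetical helper for illustration only. It checks the standard
    environment variables and the default credentials file; it does not
    query instance metadata and does not validate the credentials.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    hints = []
    # Standard AWS environment variables for static credentials.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        hints.append("environment variables")
    # Default location of the shared credentials file.
    if (home / ".aws" / "credentials").is_file():
        hints.append("credentials file")
    return hints


if __name__ == "__main__":
    found = environment_credential_hints()
    print(found if found else "no static credentials found; instance metadata may still apply")
```

Run it as the same user that runs DSS, since both the environment variables and the home directory are specific to that user.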
Using AWS Credentials from an S3 Connection
When using the credentials from an S3 connection, first ensure that the IAM role the connection is set to assume is properly configured to interact with AWS Glue, because all AWS Glue interactions are performed through this role.
The following policy example allows the role to interact with AWS Glue:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:CreateTable",
                "glue:GetTables",
                "glue:CreateDatabase",
                "glue:DeleteTable",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetCatalogImportStatus"
            ],
            "Resource": [
                "arn:aws:glue:*:<AWS ACCOUNT ID>:table/*/*",
                "arn:aws:glue:*:<AWS ACCOUNT ID>:database/*",
                "arn:aws:glue:*:<AWS ACCOUNT ID>:catalog"
            ]
        }
    ]
}
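To avoid editing the `<AWS ACCOUNT ID>` placeholder by hand in three places, the policy above can be generated for a given account. This is a minimal sketch using only the Python standard library. The function name `build_glue_policy` and the account ID `123456789012` are illustrative assumptions; the actions and resource patterns are copied from the example policy.

```python
import json

# Actions copied from the example policy above.
GLUE_ACTIONS = [
    "glue:GetDatabase",
    "glue:CreateTable",
    "glue:GetTables",
    "glue:CreateDatabase",
    "glue:DeleteTable",
    "glue:GetDatabases",
    "glue:GetTable",
    "glue:GetCatalogImportStatus",
]


def build_glue_policy(account_id: str) -> dict:
    """Return the example policy with the account ID filled in.

    Hypothetical helper for illustration only.
    """
    # AWS account IDs are 12-digit numbers.
    if not (account_id.isascii() and account_id.isdigit() and len(account_id) == 12):
        raise ValueError("expected a 12-digit AWS account ID")

    prefix = f"arn:aws:glue:*:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": list(GLUE_ACTIONS),
                "Resource": [
                    f"{prefix}:table/*/*",
                    f"{prefix}:database/*",
                    f"{prefix}:catalog",
                ],
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder account ID; replace with your own.
    print(json.dumps(build_glue_policy("123456789012"), indent=4))
```

The printed JSON can be pasted into the IAM policy editor. The resource patterns remain as broad as in the example; narrowing them to specific databases or tables is a separate decision.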
Once the AWS Glue Metastore is configured, DSS users can browse the DSS Catalog to find AWS Glue databases. Which databases they can see depends on the policies you set for the IAM role that interacts with the Metastore.
In addition, you can configure the S3 connection to persist its managed datasets’ metadata in a specific AWS Glue database.
Note
Every interaction with a dataset stored in an S3 bucket can be audited in AWS CloudTrail. For traceability, CloudTrail records the assumed role and the role session name.
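As an illustration of where this information sits, the sketch below extracts the role name and role session name from a CloudTrail record. It assumes the standard CloudTrail layout, in which `userIdentity.type` is `AssumedRole` and `userIdentity.arn` has the form `arn:aws:sts::<account>:assumed-role/<role>/<session>`. The function name and the sample values (`dss-glue-role`, `dss-session-example`) are hypothetical, and the session names DSS actually uses are not specified here.

```python
from typing import Optional, Tuple


def assumed_role_identity(record: dict) -> Optional[Tuple[str, str]]:
    """Return (role_name, session_name) for an assumed-role CloudTrail record.

    Hypothetical helper for illustration only. Returns None when the
    record was not produced by an assumed role or the ARN is malformed.
    """
    identity = record.get("userIdentity", {})
    if identity.get("type") != "AssumedRole":
        return None

    # Expected form: arn:aws:sts::<account>:assumed-role/<role>/<session>
    resource = identity.get("arn", "").split(":", 5)[-1]
    parts = resource.split("/")
    if len(parts) != 3 or parts[0] != "assumed-role":
        return None
    return parts[1], parts[2]


if __name__ == "__main__":
    # Trimmed sample record with made-up values.
    sample = {
        "eventSource": "s3.amazonaws.com",
        "eventName": "GetObject",
        "userIdentity": {
            "type": "AssumedRole",
            "arn": "arn:aws:sts::123456789012:assumed-role/dss-glue-role/dss-session-example",
        },
    }
    print(assumed_role_identity(sample))
```

Applying this to exported CloudTrail records makes it possible to group S3 and Glue events by role and session when reviewing activity.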