The pipeline definition, setup and helper scripts, the implementation of the Lambda functions, etc. are contained in this repository. To set up and deploy an ETL pipeline, clone the Git repository (or download its contents) and execute the setup scripts.
See Architecture for a detailed description of how the pipeline works.
- AWS account
- AWS CLI installed and configured
- AWS CDK (TypeScript) installed
- Docker for running and building local images
The ETL pipeline is implemented for the Amazon Web Services (AWS) public cloud.
The deployment of the ETL pipeline as a CloudFormation stack will use the AWS credentials provided during the `cdk` tool invocation.
By default, this will use the locally configured credentials (account, AWS access key and secret, region) set up in the AWS CLI configuration.
Use `aws configure` to set up access to AWS and the target region.
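As an illustration, an interactive `aws configure` session asks for the following values (the access key shown is a placeholder):

aws configure
# AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
# AWS Secret Access Key [None]: ****************
# Default region name [None]: us-east-1
# Default output format [None]: json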
Please make sure that the user account used to set up the ETL pipeline has access to the AWS Systems Manager (SSM) parameter store, e.g. by attaching the managed policy `AmazonSSMReadOnlyAccess`.
- When using multiple AWS accounts or users, setting the `AWS_PROFILE` environment variable in the executing shell to the target account/user will automatically use this profile for all operations. Alternatively, the active profile can be selected with the `--profile=myprofile` command line switch when invoking a CDK command (see the example after this list).
- When running this within an EC2 instance configured with an IAM role, the permissions of the role associated with the EC2 instance will be used.
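For illustration, both ways of selecting a profile look like this; `myprofile` is a placeholder for a profile defined in your AWS CLI configuration:

export AWS_PROFILE=myprofile
cdk deploy
# or, equivalently, select the profile for a single invocation only:
cdk deploy --profile=myprofile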
Run the following command in a terminal within the folder containing the ETL pipeline package:
aws s3 ls
This should provide a list of S3 buckets in the account and region.
The AWS Cloud Development Kit (AWS CDK) is used to define all infrastructure components, synthesize AWS CloudFormation templates and apply them to create a CloudFormation stack in the AWS account of the user.
The AWS CDK, including the `cdk` command line tool as well as `nodejs` and `npm`, needs to be installed as described in the Working with the AWS CDK guide.
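As a minimal sketch (assuming `nodejs` and `npm` are already installed), the CDK command line tool can be installed globally via npm and the installation verified afterwards:

npm install -g aws-cdk
cdk --version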
Before the first deployment, the target AWS account and region need to be bootstrapped for the CDK, which provisions the resources the CDK requires. These resources include an Amazon S3 bucket for storing files and IAM roles that grant permissions needed to perform deployments. See the CDK Bootstrap guide for details.
The required resources are defined in an AWS CloudFormation stack, called the bootstrap stack, which is usually named `CDKToolkit`. Like any AWS CloudFormation stack, it appears in the AWS CloudFormation console once it has been deployed.
In contrast to the regular pipeline deployment, the bootstrapping needs to be done only once per account and region. Also, it needs to be done with admin permissions in that account!
Note: bootstrapping will create IAM roles and policies with all permissions required to deploy the ETL pipeline and related infrastructure! Please make sure to review them (the roles and policies can be found in the AWS CloudFormation console in the `CDKToolkit` stack).
Run this command to bootstrap CDK:
cdk bootstrap
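If the target account and region should be specified explicitly instead of being derived from the active profile, the bootstrap command also accepts them as an argument (the values below are placeholders):

cdk bootstrap aws://123456789012/us-east-1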
The bootstrapping (described above) needs to be run as a user with admin permissions. In order to deploy the pipeline, no admin permissions are required.
However, the user deploying the pipeline requires at least permission to read AWS Systems Manager (SSM) parameters. This can be achieved by attaching the managed policy `AmazonSSMReadOnlyAccess` to the executing user.
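A minimal sketch of attaching this policy via the AWS CLI; the user name `etl-deployer` is a placeholder for the actual deploying user:

aws iam attach-user-policy \
  --user-name etl-deployer \
  --policy-arn arn:aws:iam::aws:policy/AmazonSSMReadOnlyAccess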
Availability of this minimal set of permissions can be verified by running this command (which determines the version of the bootstrap data):
aws ssm get-parameters-by-path --path "/cdk-bootstrap/" --recursive
This should return something like this:
{
    "Parameters": [
        {
            "Name": "/cdk-bootstrap/hnb659fds/version",
            "Type": "String",
            "Value": "19",
            "Version": 2,
            "LastModifiedDate": "2023-11-15T13:51:59.126000+01:00",
            "ARN": "arn:aws:ssm:us-east-1:123456789012:parameter/cdk-bootstrap/hnb659fds/version",
            "DataType": "text"
        }
    ]
}
The source files to be converted to RDF can be provided in the `source-files` folder, in which case they will automatically be uploaded to the respective S3 bucket. Alternatively, they can be uploaded directly to the `source-files` bucket, which is created during pipeline creation.
Source files are accepted in the following formats:
- JSON: `.json` or `.json.gz`
- JSONL (a text file with one JSON document per line): `.jsonl` or `.jsonl.gz`
- XML: `.xml` or `.xml.gz`
- CSV (with a header line for the column names!): `.csv` or `.csv.gz`

For all variants, the files may also be compressed using GZIP (`.gz`).
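As an example, a gzipped JSON source file could be uploaded with the AWS CLI; the bucket name below is a placeholder, the actual name is shown in the stack outputs after deployment:

aws s3 cp ./source-files/my-data.json.gz s3://<source-files-bucket-name>/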
RDF mappings need to be provided in the `mappings` folder and follow the RDF Mapping Language (RML) format.
The mapping files should contain a logical source with an empty `rml:source`, which during the actual RML mapping process will be replaced with one reading from `stdin`:
<my-mapping>
    rml:logicalSource [
        rml:source [] ;
        rml:referenceFormulation ql:JSONPath ;
        rml:iterator "$.root[*]"
    ] ;
Mapping files should be provided as RDF files in Turtle format (`.ttl`).
The deployment can be triggered using the `cdk deploy` command. It uses the configuration provided as environment variables and in the `.env` file. See the section Parameter Reference below for details on all available parameters!
cdk deploy
After synthesizing the CloudFormation templates, the CDK tool will ask once more for confirmation and display all IAM roles and permissions to be created.
The actual deployment will be triggered after confirming this by entering `y` (yes).
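As an illustrative sketch, the configuration parameters described in the Parameter Reference below can also be passed inline for a single deployment (the values are placeholders):

PIPELINE_ID=my-data AWS_REGION=us-east-1 cdk deploy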
The deployment can be watched in the CloudFormation console for the stack `EtlPipeline`. All created resources are listed in the Resources tab and all activities are listed in the Events tab.
To delete the ETL pipeline, the CloudFormation stack can be torn down with the following command:
cdk destroy
Please note that some resources like the `output` S3 bucket are intentionally not deleted to avoid accidental data loss. These need to be deleted manually.
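When the data is no longer needed, the remaining bucket can be removed manually, for example with the AWS CLI; the bucket name is a placeholder, and `--force` also deletes all objects still contained in the bucket:

aws s3 rb s3://<output-bucket-name> --force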
The ETL pipeline can be parameterized with configuration parameters set either as environment variables or in a `.env` file located in the folder from which the `cdk deploy` command is executed.
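For illustration, a `.env` file combining the parameters described below might look like this (all values are placeholders taken from the examples in this section):

PIPELINE_ID=my-data
AWS_ACCOUNT=123456789012
AWS_REGION=us-east-1
INSTANCE_TYPE=t3.xlarge
SSH_PUB_KEY=ssh-ed25519 AAAA...XXX [email protected]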
The following configuration options are available:
ID of the pipeline. If provided, it will be appended to the pipeline name and allows provisioning multiple independent instances of the pipeline. The ID should consist only of letters, numbers and dashes; no whitespace or special characters.
Example:
PIPELINE_ID=my-data
This will result in the resources of the provisioned pipeline being prefixed with `EtlPipeline-my-data`. When unset, the resources will be prefixed with just `EtlPipeline`.
The AWS account in which to deploy the pipeline.
When unset, the current account of the profile will be used.
Example:
AWS_ACCOUNT=123456789012
AWS region in which to deploy the pipeline.
When unset, the current region of the profile or the default region of the account will be used.
Example:
AWS_REGION=us-east-1
The SSH public key to store on the EC2 instance for ingestion. This can be used to connect to the EC2 instance (the command is specified in the output after the ETL pipeline deployment) using SSH.
The key is `sshPubKey`, the value is the full public key. The public key can typically be found in the `~/.ssh/` folder, e.g. in the file `~/.ssh/id_ed25519.pub`.
When unset, no remote access will be possible.
Example:
SSH_PUB_KEY=ssh-ed25519 AAAA...XXX [email protected]
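One convenient way to set this parameter (assuming the key file location from the example above, and that the variable is set in the shell rather than in the `.env` file) is to read the public key directly from the key file:

export SSH_PUB_KEY="$(cat ~/.ssh/id_ed25519.pub)"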
The ETL pipeline creates an EC2 instance on which the RDF ingestion is performed. The instance needs to have enough resources to process all RDF files and create a GraphDB data journal using the GraphDB Preload tool. See the GraphDB Sizing guide for requirements when selecting an EC2 instance type.
The key is `instanceType`, the value is the canonical name of the instance type, e.g. `t3.xlarge`.
This parameter is optional; the default value when unset is `t3.xlarge`.
Example:
INSTANCE_TYPE=t3.xlarge
When specified, the email address will be registered for notifications from this pipeline. Notifications are sent as JSON documents to the provided email address. Further subscriptions may be added to the SNS topic manually.
Note: upon registration the address will receive a subscription confirmation email. Only after confirming the subscription by clicking on the link contained in the confirmation message will further notifications be forwarded to that address!