How Data-Pipeline’s RDS to Redshift copy template works ? (and limitations)

Written by mannem on . Posted in Data Pipelines

The template contains 4 Activities.

1. RDSToS3CopyActivity – Creates a CSV file in S3 based on mySQL table.
2. RedshiftTableCreateActivity – Translates MySQL table to PSQL and creates a table(if it does not exist).
3. S3ToRedshiftCopyActivity – Runs a Redshift COPY command.
4. S3StagingCleanupActivity – Cleans up the staging S3 directory.

RedshiftTableCreateActivity is the key activity and where all the complexity lies.

It runs a shell script using ShellCommandActivity which Translates the MySQL table structure to Psql syntax , and creates a table on Redshift with translated table structure. the data-type translation between RDS and Redhsift is provided here

http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-redshift.html
Now, if you dig more into the ShellCommandActivity, The conversion script is downloaded as part of s3://datapipeline-us-east-1/sample-scripts/dbconv.sh , which in-turn downloads the translator python script

curl -O https://s3.amazonaws.com/datapipeline-us-east-1/sample-scripts/mysql_to_redshift.py

You can check out the contents of this script on how exactly it translates.

Ex:

mysql> create table TestTable (Id int not null , Lname varchar(20));

According to translation table , this activity , translates to a psql script , which runs on the ec2 instance.

Make note of limitations on this script while using it !

mysql_to_redshift.py :

Tags: , , , , , , ,

Trackback from your site.

Comments (1)

Leave a comment

  • cloudformation

    cloudformation

    pipeline

    Data-pipelines

    directoryservice

    directoryservicez

    cloudtrail

    cloudtrail

    config

    config

    trustedadvisor

    Trustedadvisor

  • snap

    Snapshot

    glacier

    Glacie

    storagegw

    Storage Gatewa

    s3

    S3

    cloudFront

    Cloud Front

  • r53

    Route 53

    lambda

    lambd

    directConnect

    DirectConnect

    vpc

    VPC

    kinesis

    Kinesis

    emr

    Emr

  • sns

    SNS

    transcoder

    Transcoder

    sqs

    SQS

    cloudsearch

    Cloud Search

    appstream

    App Stream

    ses

    SES

  • opsworks

    opsworks

    cloudwatch

    Cloud Watch

    beanstalk

    Elastic Beanstalk

    codedeploy

    Code Deploy

    IAM

    IAM

  • dynamodb

    dynamodb

    rds

    RDS

    elasticache

    ElastiCache

    redshift

    Redshift

    simpledb

    simpledb