SqlActivity & RedshiftCopyActivity fail? Use ShellCommandActivity instead
SqlActivity and RedshiftCopyActivity have several limitations. If the psql/SQL commands are too complex, these activities fail to prepare the statements correctly and throw errors that are hard to diagnose. In those cases you always have the option of falling back to ShellCommandActivity to run your complex script yourself.
PART 1:
The following shell script takes the arguments referenced in the scriptArgument list of the ShellCommandActivity and runs a COPY command to load files from S3 into PostgreSQL RDS. A second example shows a copy from S3 to Redshift.
Run a psql command to copy from S3 to PostgreSQL RDS
#!/usr/bin/env bash
# Arguments are passed in via the ShellCommandActivity's scriptArgument list.
s3_prefix=$1
db_host=$2
db_name=$3
db_username=$4
db_password=$5
db_tablename=$6
db_port=$7
dir=temp

# psql reads the password from this environment variable.
export PGPASSWORD=$db_password

# Install the postgres client on Amazon Linux.
sudo yum install -y postgresql94

# Copy from S3 to PostgreSQL RDS.
mkdir -p $dir
rm -rf $dir/*
aws s3 cp $s3_prefix $dir/ --recursive

# Load every downloaded file into the target table, one \COPY per file.
IFS=$'\n'
for f in $(find $dir -name '*'); do
  if [[ -f "$f" ]]; then
    echo "Processing $f file..."
    psql -h $db_host -d $db_name -U $db_username -p $db_port -c "\COPY $db_tablename FROM '$f' CSV"
  fi
done
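For a quick sanity check outside Data Pipeline, the script can be run by hand with the same seven arguments. Every value below, including the script name copy_s3_to_rds.sh, is a placeholder:

# Argument order: s3_prefix db_host db_name db_username db_password db_tablename db_port
bash copy_s3_to_rds.sh s3://mybucket/csv-input/ \
    mydb.abc123.us-east-1.rds.amazonaws.com mydb admin 'MySecretPassword' mytable 5432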
Run a psql command to copy from S3 to Redshift
#!/usr/bin/env bash
# Arguments are passed in via the ShellCommandActivity's scriptArgument list.
manifest=$1
db_host=$2
db_name=$3
db_username=$4
db_password=$5
db_tablename=$6
db_port=$7
dir=temp

# psql reads the password from this environment variable.
export PGPASSWORD=$db_password

# Install the postgres client on Amazon Linux.
sudo yum install -y postgresql94

# Copy from S3 to Redshift using the manifest file provided in
# scriptArgument 1 of the ShellCommandActivity.
# http://docs.aws.amazon.com/redshift/latest/dg/loading-data-files-using-manifest.html
psql -h $db_host -d $db_name -U $db_username -p $db_port -c "\COPY $db_tablename FROM '$manifest' manifest CSV"
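For reference, the manifest passed in as $1 follows the format described in the linked Redshift documentation; a minimal example (bucket and file names are illustrative) looks like this:

{
  "entries": [
    {"url": "s3://mybucket/data/part-0000.csv", "mandatory": true},
    {"url": "s3://mybucket/data/part-0001.csv", "mandatory": true}
  ]
}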
A sample pipeline template to copy from S3 to PostgreSQL RDS
{ "objects": [ { "occurrences": "1", "period": "1 Day", "name": "RunOnce", "id": "DefaultSchedule", "type": "Schedule", "startAt": "FIRST_ACTIVATION_DATE_TIME" }, { "resourceRole": "DataPipelineDefaultResourceRole", "role": "DataPipelineDefaultRole", "instanceType": "t1.micro", "name": "Ec2Instance", "id": "Ec2Instance", "type": "Ec2Resource", "myComment": "This object is used to create an Amazon EC2 Instance that activities in the pipeline can run on.", "terminateAfter": "1 Hour" }, { "schedule": { "ref": "DefaultSchedule" }, "scriptUri": "#{myCopyScript}", "name": "ShellCommandActivityObj", "id": "ShellCommandActivityObj", "runsOn": { "ref": "Ec2Instance" }, "scriptArgument": [ "#{myInputS3Path}", "#{myPostgresHostname}", "#{myPostgresDatabaseName}", "#{myPostgresUsername}", "#{*myPostgresPassword}", "#{myPostgresTablename}", "#{myPostgresPortnumber}" ], "type": "ShellCommandActivity" }, { "failureAndRerunMode": "CASCADE", "schedule": { "ref": "DefaultSchedule" }, "resourceRole": "DataPipelineDefaultResourceRole", "pipelineLogUri": "#{myS3LogsPath}", "role": "DataPipelineDefaultRole", "scheduleType": "cron", "name": "Default", "id": "Default" } ], "parameters": [ { "description": "Postgres database password", "id": "*myPostgresPassword", "type": "String" }, { "description": "Postgres database table name", "id": "myPostgresTablename", "type": "String" }, { "description": "Postgres database Host Name", "id": "myPostgresHostname", "type": "String" }, { "description": "Postgres Database name", "id": "myPostgresDatabaseName", "type": "String" }, { "watermark": "s3://mybucket", "description": "S3 folder for logs", "id": "myS3LogsPath", "type": "AWS::S3::ObjectKey" }, { "description": "S3 path containing input CSV files", "id": "myInputS3Path", "type": "String" }, { "description": "Postgres Database Port Number", "id": "myPostgresPortnumber", "type": "String" }, { "description": "Postgres Database Username", "id": "myPostgresUsername", "type": "String" }, { "description": "S3 path for copy script", "id": "myCopyScript", "type": "String" } ] } |
You may use a similar definition to copy from S3 to Redshift.
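Once the definition and the shell script are uploaded to S3, the pipeline can be registered and activated from the AWS CLI. The following is a minimal sketch; the pipeline name, file paths, and parameter values are placeholders:

# Register the pipeline (returns a pipeline id such as df-0123456789ABC).
aws datapipeline create-pipeline --name CopyS3ToPostgres --unique-id copy-s3-to-postgres

# Upload the definition above (saved locally as pipeline.json) and supply parameter values.
aws datapipeline put-pipeline-definition \
    --pipeline-id df-0123456789ABC \
    --pipeline-definition file://pipeline.json \
    --parameter-values \
        myCopyScript=s3://mybucket/scripts/copy-to-rds.sh \
        myInputS3Path=s3://mybucket/csv-input/ \
        myS3LogsPath=s3://mybucket/logs/ \
        myPostgresHostname=mydb.abc123.us-east-1.rds.amazonaws.com \
        myPostgresDatabaseName=mydb \
        myPostgresUsername=admin \
        '*myPostgresPassword=MySecretPassword' \
        myPostgresTablename=mytable \
        myPostgresPortnumber=5432

# Activate the pipeline so the RunOnce schedule kicks in.
aws datapipeline activate-pipeline --pipeline-id df-0123456789ABC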
PART 2:
If your SQL script lives in S3 and you want to pass arguments to it, put psql variables in the script as placeholders:
Ex: insert into :v1 select * from source where source.id = :v2;
-> saved in S3 as s3://YourBucket/event_data_yyyymm.sql
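The -v flag of psql is what makes this work: psql substitutes each variable into the script before executing it. A quick illustration, assuming the file above and placeholder connection details:

# psql replaces :v1 and :v2 with the -v values before running the file.
psql -h myredshift.example.com -p 5439 -d mydb -U myuser \
    -v v1=tempsdf -v v2="'123'" -f event_data_yyyymm.sql
# The statement that actually executes:
#   insert into tempsdf select * from source where source.id = '123';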
#!/bin/bash -x
#---- A sample shell script to run psql queries; build on it based on your requirements ----
sudo yum install postgresql -y
sudo yum install aws-cli -y

# The Redshift password is passed as the 2nd scriptArgument.
export PGPASSWORD=$2

# Resolve the SQL file name from the S3 path in the 1st scriptArgument, then download it.
scriptFile=$(aws s3 ls $1 --region us-west-2 | awk '{print $4}')
aws s3 cp $1 /home/ec2-user/ --region us-west-2

# The table name and condition value are passed via the 3rd and 4th
# scriptArgument entries of the ShellCommandActivity object.
psql -h redshift.awscloud.com -p 5439 -d DBName -U UserName -f /home/ec2-user/$scriptFile -v v1=$3 -v v2=$4
aws s3 cp /home/ec2-user/v2.sql s3://bucket/ --region us-west-2
#-------------------------------------------------
$1 -> S3 location of the SQL file
$2 -> Redshift password
$3 -> table name
$4 -> condition value
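For a local test outside the pipeline, the script can be invoked directly with the same four arguments (the password is a placeholder):

bash shellCommands.sh s3://YourBucket/event_data_yyyymm.sql MyRedshiftPassword tempsdf "'123'"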
A snippet of the ShellCommandActivity definition could be:
"scriptUri": "s3://YourBucket/shellCommands.sh", "name": "ShellCommandActivityObj", "scriptArgument": [ "s3://YourBucket/event_data_yyyymm.sql", "#{*myRedshiftPass}", "tempsdf", "'123'" ], "type": "ShellCommandActivity" |