Posts Tagged ‘postgre’

spark-redshift library from databricks – Installation on EMR and using InstanceProfile

Written by mannem on . Posted in EMR || Elastic Map Reduce

ear earearspark-redshift is a library to load data into Spark SQL DataFrames from Amazon Redshift, and write them back to Redshift tables. Amazon S3 is used to efficiently transfer data in and out of Redshift, and JDBC is used to automatically trigger the appropriate COPY and UNLOAD commands on Redshift.This library is more suited to ETL than interactive queries, since large amounts of data could be extracted to S3 for each query execution

spark-redshift installation instructions on EMR:
Steps 1-8 shows how to compile your own spark-redshift package(JAR). You can directly skip to step 9 if you wish to use pre-compiled JAR’s(V 0.6.1). Later we use spark-shell to invoke these JAR’s and run scala code to query Redshift table and put contents into a dataframe. This guide assumes you had followed github page https://github.com/databricks/spark-redshift and its tutorial.

Download pre-req’s to compile the your own JAR’s

————————————————-

If you wanna skip above steps , you can download the pre-compiled JAR’s using –

wget https://s3.amazonaws.com/emrsupport/spark/spark-redshift-databricks/minimal-json-0.9.5-SNAPSHOT.jar

wget https://s3.amazonaws.com/emrsupport/spark/spark-redshift-databricks/spark-redshift_2.10-0.6.1-SNAPSHOT.jar

With these JAR’s you can skip all above options (1-8)

————————————————-

10. In SCALA shell, you can run the following commands to init SQLContext. Note that the below code automatically uses IAM role’s (Instance profile ) cred’s to authenticate against S3

Sample Scala code uses Instance profile

————————————————-


Sample spark-sql invocation

spark-sql --jars RedshiftJDBC41-1.1.13.1013.jar,spark-redshift_2.10-0.6.1-SNAPSHOT.jar,minimal-json-0.9.5-SNAPSHOT.jar


Note that while running the above commands , the spark-redshift executes a COPY/Unload with a command like below:

These temporary credentials are from IAM role’s (EMR_EC2_DefaultRole) and this role should have policy that allows S3 access to atleast temp bucket mentioned in the command.


SqlActivity & RedshiftCopyActivity fails ? Use shellCommandActivity instead

Written by mannem on . Posted in Data Pipelines

There are several limitations of SQLActivity and RedshiftCopyActivity. If the psql/sql commands are too complex, these activities fail to prepare the statements correctly and will throw out some errors which cannot be easily rectified. So, you always have the option to use shellCommandActivity to run your complex script.

This article guides you to create a shell Script and corresponding Data-Pipeline template to run your complex script directly(part 1)or when present in S3(Part 2 ). As the true purpose of Data-Pipelines is automation, The script can also take arguments that you can reference using placeholders.

PART 1:

The following shell script takes arguments referenced in ScriptArguments object of the ShellCommandActivity. Its runs COPY command to copy files from S3 to PostgreRDS. Another example shows a copy from S3 to Redshift.

Run a PSQL command to copy from S3 to PostgreRDS

Run a PSQL command to copy from S3 to Redshift

A sample Pipeline template to copy from s3 to RDSPostgres

You may use a similar definition to copy from S3 to Redshift.


PART 2:

If your script is in S3 and you wanna pass arguments to your script:

Ex: insert into :v1 select * from source where source.id = :v2; -> s3://YourBucket/event_data_yyyymm.sql

$1 -> S3File location
$2 -> Password for Redshift
$3 -> Table Name
$4 -> condition value

A snippet of Definition for ShellCommandActivity can be :

  • cloudformation

    cloudformation

    pipeline

    Data-pipelines

    directoryservice

    directoryservicez

    cloudtrail

    cloudtrail

    config

    config

    trustedadvisor

    Trustedadvisor

  • snap

    Snapshot

    glacier

    Glacie

    storagegw

    Storage Gatewa

    s3

    S3

    cloudFront

    Cloud Front

  • r53

    Route 53

    lambda

    lambd

    directConnect

    DirectConnect

    vpc

    VPC

    kinesis

    Kinesis

    emr

    Emr

  • sns

    SNS

    transcoder

    Transcoder

    sqs

    SQS

    cloudsearch

    Cloud Search

    appstream

    App Stream

    ses

    SES

  • opsworks

    opsworks

    cloudwatch

    Cloud Watch

    beanstalk

    Elastic Beanstalk

    codedeploy

    Code Deploy

    IAM

    IAM

  • dynamodb

    dynamodb

    rds

    RDS

    elasticache

    ElastiCache

    redshift

    Redshift

    simpledb

    simpledb