Posts Tagged ‘rds’

SqlActivity & RedshiftCopyActivity fail? Use ShellCommandActivity instead

Written by mannem on . Posted in Data Pipelines

SqlActivity and RedshiftCopyActivity have several limitations. If the psql/SQL commands are too complex, these activities fail to prepare the statements correctly and throw errors that cannot easily be rectified. In that case, you always have the option to use ShellCommandActivity to run your complex script instead.

This article guides you through creating a shell script and a corresponding Data Pipeline template to run your complex script either inline (Part 1) or from S3 (Part 2). Since the true purpose of Data Pipeline is automation, the script can also take arguments that you reference using placeholders.

PART 1:

The following shell script takes arguments referenced in the scriptArgument entries of the ShellCommandActivity. It runs a COPY command to copy files from S3 to RDS PostgreSQL. Another example shows a copy from S3 to Redshift.

Run a psql command to copy from S3 to RDS PostgreSQL:
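A minimal sketch of such a script is shown below, under assumed conventions: $1 is the S3 file, $2 the RDS endpoint, $3 the database password, and $4 the target table, all passed in as scriptArgument entries. The endpoint, user, and database names are illustrative, not from the original post.

#!/bin/bash
# Hypothetical sketch: $1 = S3 file, $2 = RDS endpoint, $3 = DB password, $4 = target table
aws s3 cp "$1" /tmp/input.csv
export PGPASSWORD="$3"
psql -h "$2" -U your_user -d your_db \
  -c "\copy $4 FROM '/tmp/input.csv' CSV HEADER"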

Run a psql command to copy from S3 to Redshift:
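A similar hedged sketch for Redshift, where the COPY reads directly from S3 on the cluster side; the cluster endpoint, database, table, and IAM role ARN below are placeholders:

#!/bin/bash
# Hypothetical sketch: $1 = Redshift master password, $2 = target table, $3 = S3 prefix to load
export PGPASSWORD="$1"
psql -h your-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com -p 5439 -U master -d mydb \
  -c "COPY $2 FROM '$3' CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/YourRedshiftCopyRole' DELIMITER ',' IGNOREHEADER 1;"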

A sample pipeline template to copy from S3 to RDS PostgreSQL

You may use a similar definition to copy from S3 to Redshift.


PART 2:

If your script is in S3 and you want to pass arguments to it:

Ex: the file s3://YourBucket/event_data_yyyymm.sql contains: insert into :v1 select * from source where source.id = :v2;

$1 -> S3File location
$2 -> Password for Redshift
$3 -> Table Name
$4 -> condition value

A snippet of the definition for the ShellCommandActivity could be:
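The original JSON snippet is not reproduced here; instead, the sketch below shows the kind of wrapper script such a ShellCommandActivity could run, assuming the argument order listed above and an illustrative Redshift endpoint. It pulls the .sql file from S3 and hands the remaining arguments to psql as the :v1/:v2 variables.

#!/bin/bash
# Hypothetical wrapper for Part 2:
# $1 = S3 location of the .sql file, $2 = Redshift password,
# $3 = table name (fills :v1), $4 = condition value (fills :v2)
aws s3 cp "$1" /tmp/script.sql
export PGPASSWORD="$2"
psql -h your-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com -p 5439 -U master -d mydb \
  -v v1="$3" -v v2="$4" -f /tmp/script.sql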

How Data Pipeline's RDS to Redshift copy template works (and its limitations)

Written by mannem on . Posted in Data Pipelines

The template contains 4 Activities.

1. RDSToS3CopyActivity – Creates a CSV file in S3 based on the MySQL table.
2. RedshiftTableCreateActivity – Translates the MySQL table definition to PostgreSQL syntax and creates the table (if it does not exist).
3. S3ToRedshiftCopyActivity – Runs a Redshift COPY command.
4. S3StagingCleanupActivity – Cleans up the staging S3 directory.

RedshiftTableCreateActivity is the key activity and where all the complexity lies.

It runs a shell script using ShellCommandActivity that translates the MySQL table structure to PostgreSQL syntax and creates a table on Redshift with the translated structure. The data-type translation between RDS and Redshift is documented here:

http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-redshift.html
Now, if you dig deeper into the ShellCommandActivity, you will see that the conversion script is downloaded from s3://datapipeline-us-east-1/sample-scripts/dbconv.sh, which in turn downloads the translator Python script:

curl -O https://s3.amazonaws.com/datapipeline-us-east-1/sample-scripts/mysql_to_redshift.py

You can check out the contents of this script to see how exactly it translates.
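For example, assuming both files are publicly readable over HTTPS (the dbconv.sh URL is inferred from the S3 path above), you can pull them down locally and read them:

curl -O https://s3.amazonaws.com/datapipeline-us-east-1/sample-scripts/dbconv.sh
curl -O https://s3.amazonaws.com/datapipeline-us-east-1/sample-scripts/mysql_to_redshift.py
less mysql_to_redshift.py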

Ex:

mysql> create table TestTable (Id int not null , Lname varchar(20));

According to the translation table, this activity translates that statement into a PostgreSQL script, which runs on the EC2 instance.
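A hedged sketch of what the translated DDL could look like for the example table, applied with psql against an illustrative cluster endpoint (the exact output depends on mysql_to_redshift.py and the options passed to it):

# Hypothetical result of the translation
psql -h your-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com -p 5439 -U master -d mydb \
  -c "CREATE TABLE IF NOT EXISTS TestTable (Id INT NOT NULL, Lname VARCHAR(20));"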

Make note of the limitations of this script while using it!

mysql_to_redshift.py:

Installing and using Apache Sqoop to export/import large datasets (MySQL, S3, CSV/TSV, etc.) on an EMR cluster

Written by mannem on . Posted in EMR || Elastic Map Reduce


Often the export/import activity is limited by several performance bottlenecks, so it may be faster to use a distributed transfer instead of a normal one. Bottlenecks include read throughput, write throughput, and how the code parses the data (inline, batch, etc.). Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

http://sqoop.apache.org/

This guide shows you how to:
> Install Sqoop, and
> Export/import MySQL tables (from S3 to RDS and from RDS to S3, respectively).

Consider a sample MySQL table in RDS:

> mysql -h myrds.crezaaaruhfx.us-west-2.rds.amazonaws.com -u mannem -p

mysql> describe dailyStockPrices;

mysql> SELECT table_name AS "Table",
-> round(((data_length + index_length) / 1024 / 1024), 2) "Size in MB"
-> FROM information_schema.TABLES
-> WHERE table_schema = "pipetest"
-> AND table_name = "dailyStockPrices";

> My MySQL table's fields contain a lot of commas, so I chose TSV format instead of CSV for import/export.
> If I used CSV format, Sqoop would get confused parsing the data.

Sqoop on EMR 4.4.0+ is available as a sandbox application

Starting with EMR release 4.4.0, Sqoop 1.4.6 is available as a sandbox application. It can be installed by simply selecting the option while provisioning the EMR cluster. By default, Sqoop on EMR has MariaDB and PostgreSQL drivers installed. To use an alternate set of JDBC connectors with Sqoop, install them in /usr/lib/sqoop/lib.

http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-sandbox.html#emr-sqoop
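For example, to add a MySQL JDBC connector (the driver version and Maven Central URL below are assumptions; use whichever driver your database needs):

# Drop an additional JDBC driver into Sqoop's lib directory on the master node
sudo curl -L -o /usr/lib/sqoop/lib/mysql-connector-java-5.1.39.jar \
  https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.39/mysql-connector-java-5.1.39.jar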

Sqoop on EMR 3.x.x can be installed with the following script:
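The original script is not reproduced here; the following is a minimal sketch of the same idea, assuming Sqoop 1.4.6 from the Apache archive and a MySQL connector version of your choice (adjust versions and paths for your cluster):

#!/bin/bash
# Hypothetical install of Sqoop 1.4.6 on an EMR 3.x.x node
set -e
cd /home/hadoop
wget https://archive.apache.org/dist/sqoop/1.4.6/sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
tar -xzf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha sqoop
# Add a MySQL JDBC driver so Sqoop can reach RDS MySQL
wget -O sqoop/lib/mysql-connector-java-5.1.39.jar \
  https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.39/mysql-connector-java-5.1.39.jar
# Make sqoop available on the PATH
echo 'export SQOOP_HOME=/home/hadoop/sqoop' >> ~/.bashrc
echo 'export PATH=$PATH:$SQOOP_HOME/bin' >> ~/.bashrc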

Import Commands:

Sqoop import
This command copies a MySQL table from RDS to S3. The S3 file content is TSV and the file name will be in the “part-m-00000” format.
Note that -m 1 runs the import with a single mapper task; increase the value to run multiple mappers in parallel.
sqoop import --connect jdbc:mysql://myrds.crezaaaruhfx.us-west-2.rds.amazonaws.com/pipetest --username mannem --password Password123 --table dailyStockPrices --target-dir s3://mannem/sqoopmatrix -m 1 --fields-terminated-by '\t' --lines-terminated-by '\n'
Check S3 contents
hadoop fs -cat s3://mannem/sqoopmatrix/part-m-00000
Sqoop command usage:
http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html

Export commands:

Before exporting, the destination MySQL/PostgreSQL table should already be created with a matching schema.
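For example, a quick way to pre-create a matching MySQL destination table (the _copy table name is purely illustrative):

# Create an empty destination table with the same schema as the source
mysql -h myrds.crezaaaruhfx.us-west-2.rds.amazonaws.com -u mannem -p \
  -e "CREATE TABLE IF NOT EXISTS pipetest.dailyStockPrices_copy LIKE pipetest.dailyStockPrices;"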
Export to RDS (MySQL)
This command copies the TSV file from S3 to the MySQL table.
sqoop export --connect jdbc:mysql://myrds.crezaaaruhfx.us-west-2.rds.amazonaws.com/pipetest --username mannem --password Password123 --table dailyStockPrices --input-fields-terminated-by '\t' --input-lines-terminated-by '\n' --export-dir s3://mannem/sqoopmatrix/part-m-00000
Export to Redshift(PSQL)
sqoop export --connect jdbc:redshift://$MYREDSHIFTHOST:5439/mydb --table mysqoopexport --export-dir s3://mybucket/myinputfiles/ --driver com.amazon.redshift.jdbc41.Driver --username master --password Mymasterpass1
Export command with a MariaDB connection string
sqoop export --connect jdbc:mariadb://$HOSTNAME:3306/mydb --table mysqoopexport --export-dir s3://mybucket/myinputfiles/ --driver org.mariadb.jdbc.Driver --username master --password Mymasterpass1
Export using Secure Sockets Layer (SSL) encryption
sqoop export --connect "jdbc:mariadb://$HOSTNAME:3306/mydb?verifyServerCertificate=false&useSSL=true&requireSSL=true" --table mysqoopexport --export-dir s3://mybucket/myinputfiles/ --driver org.mariadb.jdbc.Driver --username master --password Mymasterpass1

WEB LOG ANALYSIS

Written by mannem on . Posted in Architectures


Amazon Web Services provides services and infrastructure to build reliable, fault-tolerant, and highly available web applications in the cloud. In production environments, these applications can generate huge amounts of log information. This data can be an important source of knowledge for any company that is operating web applications. Analyzing logs can reveal information such as traffic patterns, user behavior, marketing profiles, etc.

However, as the web application grows and the number of visitors increases, storing and analyzing web logs becomes increasingly challenging. This diagram shows how to use Amazon Web Services to build a scalable and reliable large-scale log analytics platform. The core component of this architecture is Amazon Elastic MapReduce, a web service that enables analysts to process large amounts of data easily and cost-effectively using a Hadoop hosted framework.

