
Installing and using Apache Sqoop to export/import large datasets (MySQL, S3) (CSV/TSV, etc.) on an EMR cluster

Written by mannem on . Posted in EMR || Elastic Map Reduce


Often the export/import activity is limited by several performance bottlenecks, so the transfer can be faster if a distributed transfer is used instead of a normal, single-threaded one. These bottlenecks include read throughput, write throughput, and how the code parses the data (inline, batch, etc.). Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

http://sqoop.apache.org/

This guide shows you how to:
> Install Sqoop, and
> Export/import MySQL tables (from S3 to RDS, and from RDS to S3).

Consider a sample MySQL table in RDS:

> mysql -h myrds.crezaaaruhfx.us-west-2.rds.amazonaws.com -u mannem -p

mysql> describe dailyStockPrices;

mysql> SELECT table_name AS "Table",
-> round(((data_length + index_length) / 1024 / 1024), 2) "Size in MB"
-> FROM information_schema.TABLES
-> WHERE table_schema = "pipetest"
-> AND table_name = "dailyStockPrices";

> My MySQL table's fields contain a lot of commas, so I chose TSV format instead of CSV for import/export.
> If I had used CSV format, Sqoop would get confused parsing the data.

Sqoop on EMR 4.4.0+ is pre-packaged

Starting with EMR release 4.4.0, Sqoop 1.4.6 is available as a sandbox application. It can be installed by simply selecting that option while provisioning the EMR cluster. By default, Sqoop on EMR ships with MariaDB and PostgreSQL drivers. To use an alternate set of JDBC connectors with Sqoop, install them in /usr/lib/sqoop/lib.

http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-sandbox.html#emr-sqoop
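For example, to use MySQL's own Connector/J instead of the bundled MariaDB driver, you could drop its JAR into that directory on the master node. This is a sketch; the version number is an assumption, so check Maven Central for the current release:

```shell
# Sketch: install MySQL Connector/J into Sqoop's lib directory on EMR.
# VERSION is an assumption -- substitute the current release.
VERSION=5.1.38
JAR="mysql-connector-java-${VERSION}.jar"
wget "https://repo1.maven.org/maven2/mysql/mysql-connector-java/${VERSION}/${JAR}"
sudo cp "${JAR}" /usr/lib/sqoop/lib/
```

Sqoop picks up any JAR in /usr/lib/sqoop/lib at launch, so no restart or configuration change is needed afterwards.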

Sqoop on EMR 3.x.x can be installed with a bootstrap action script.

Import Commands:

Import from RDS to S3
This command copies a MySQL table from RDS to S3. The S3 content type is TSV, and file names will be in the “part-m-00000” format.
Note that -m 1 runs a single map task, i.e. the transfer is not parallelized.
sqoop import --connect jdbc:mysql://myrds.crezaaaruhfx.us-west-2.rds.amazonaws.com/pipetest --username mannem --password Password123 --table dailyStockPrices --target-dir s3://mannem/sqoopmatrix -m 1 --fields-terminated-by '\t' --lines-terminated-by '\n'
Check S3 contents
hadoop fs -cat s3://mannem/sqoopmatrix/part-m-00000
Sqoop command usage:
http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
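A distributed transfer, which is Sqoop's main advantage, needs more than one mapper plus a --split-by column so Sqoop can partition the work among them. A sketch of the same import parallelized; the column name "id" is hypothetical, and should be an indexed numeric column with evenly distributed values:

```shell
# Sketch: the same import with 4 parallel map tasks. "id" is a hypothetical
# split column -- Sqoop divides its min..max range among the mappers.
sqoop import --connect jdbc:mysql://myrds.crezaaaruhfx.us-west-2.rds.amazonaws.com/pipetest \
  --username mannem --password Password123 --table dailyStockPrices \
  --target-dir s3://mannem/sqoopmatrix \
  -m 4 --split-by id \
  --fields-terminated-by '\t' --lines-terminated-by '\n'
# Output is then split across part-m-00000 ... part-m-00003.
```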

Export commands:

Before exporting, the destination MySQL/PostgreSQL table should already be created with a matching schema.
Export to RDS (MySQL)
This command copies the TSV file from S3 into a MySQL table.
sqoop export --connect jdbc:mysql://myrds.crezaaaruhfx.us-west-2.rds.amazonaws.com/pipetest --username mannem --password Password123 --table dailyStockPrices --input-fields-terminated-by '\t' --input-lines-terminated-by '\n' --export-dir s3://mannem/sqoopmatrix/part-m-00000
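The destination table's schema is not shown in this post; a hypothetical pre-creation step might look like the following, where the column names and types are placeholders that should mirror your actual source table:

```shell
# Hypothetical: pre-create the destination table before running sqoop export.
# Column names and types are placeholders -- match your real source schema.
mysql -h myrds.crezaaaruhfx.us-west-2.rds.amazonaws.com -u mannem -p pipetest <<'SQL'
CREATE TABLE IF NOT EXISTS dailyStockPrices (
  symbol      VARCHAR(16),
  trade_date  DATE,
  close_price DECIMAL(12,4)
);
SQL
```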
Export to Redshift (using the Redshift JDBC driver)
sqoop export --connect jdbc:redshift://$MYREDSHIFTHOST:5439/mydb --table mysqoopexport --export-dir s3://mybucket/myinputfiles/ --driver com.amazon.redshift.jdbc41.Driver --username master --password Mymasterpass1
Export with a MariaDB connection string
sqoop export --connect jdbc:mariadb://$HOSTNAME:3306/mydb --table mysqoopexport --export-dir s3://mybucket/myinputfiles/ --driver org.mariadb.jdbc.Driver --username master --password Mymasterpass1
Export using SSL (Secure Socket Layer) encryption
Note the quotes around the connection string: without them, the shell would treat the & characters as command separators.
sqoop export --connect "jdbc:mariadb://$HOSTNAME:3306/mydb?verifyServerCertificate=false&useSSL=true&requireSSL=true" --table mysqoopexport --export-dir s3://mybucket/myinputfiles/ --driver org.mariadb.jdbc.Driver --username master --password Mymasterpass1
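All of the commands above pass the database password on the command line, where it is visible in the process list and shell history. Sqoop also supports a --password-file option; a sketch, where the HDFS path is an assumption:

```shell
# Store the password in a file readable only by you, then reference the file
# instead of passing --password on the command line.
echo -n 'Password123' > sqoop.password
chmod 400 sqoop.password
# hadoop fs -put sqoop.password /user/hadoop/sqoop.password  # copy to HDFS (path is an assumption)
# sqoop export --connect jdbc:mysql://.../pipetest --username mannem \
#   --password-file /user/hadoop/sqoop.password --table dailyStockPrices ...
```

The file must contain only the password itself (no trailing newline, hence echo -n), or Sqoop will send the newline as part of the credential.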