
EMR Config for a Big Cluster to Create Many Partitions in Hive

Written by mannem. Posted in AWS BIG DATA, EMR || Elastic Map Reduce

EMR Configuration:




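The original configuration snippet is not preserved here. As a hedged sketch, an EMR `configurations` JSON for a cluster that creates many Hive partitions might raise Hive's dynamic-partition limits like the following; the property values are illustrative assumptions, not the author's exact settings:

```json
[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.exec.dynamic.partition": "true",
      "hive.exec.dynamic.partition.mode": "nonstrict",
      "hive.exec.max.dynamic.partitions": "20000",
      "hive.exec.max.dynamic.partitions.pernode": "5000",
      "hive.exec.max.created.files": "200000"
    }
  }
]
```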

Create Cluster CLI Command:

In the command below, search for <code>replace</code> and substitute the relevant parameters, such as your own security groups, SSH key, IAM roles, EMR log bucket, etc.



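The original command is not preserved here. As a hedged sketch, an `aws emr create-cluster` call for such a cluster might look like the following; every `replace` placeholder and the instance sizing are assumptions to swap for your own values:

```shell
aws emr create-cluster \
  --name "replace-cluster-name" \
  --release-label emr-4.4.0 \
  --applications Name=Hadoop Name=Hive \
  --instance-type m3.xlarge \
  --instance-count 10 \
  --use-default-roles \
  --ec2-attributes KeyName=replace-ssh-key,AdditionalMasterSecurityGroups=replace-sg \
  --log-uri s3://replace-emr-log-bucket/logs/ \
  --configurations file://hive-config.json
```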
Installing and Using Apache Sqoop to Export/Import Large Datasets (MySQL, S3) (CSV/TSV) on an EMR Cluster

Written by mannem. Posted in EMR || Elastic Map Reduce


Export/import jobs are often limited by performance bottlenecks, so the activity can be much faster when a distributed transfer is used instead of a single-threaded one. These bottlenecks include read throughput, write throughput, and how the code parses the data (inline or batch, etc.). Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

This guide shows you how to:
> Install Sqoop, and
> Export/import MySQL tables (from S3 to RDS, and from RDS to S3).

Consider a sample MySQL table in RDS:

> mysql -h -u mannem -p

mysql> describe dailyStockPrices;

mysql> SELECT table_name AS "Table",
-> round(((data_length + index_length) / 1024 / 1024), 2) "Size in MB"
-> FROM information_schema.TABLES
-> WHERE table_schema = "pipetest"
-> AND table_name = "dailyStockPrices";

> My MySQL table has a lot of commas in its fields, so I chose TSV instead of CSV for import/export.
> With CSV, Sqoop would mis-parse the data on those embedded commas.
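The comma problem above is easy to see with a quick shell check; the sample line below is made up for illustration, and naive comma-splitting stands in for what a delimiter-based parser does:

```shell
# A field containing a comma: naive comma-splitting yields the wrong column count
line_csv='AAPL,"Apple, Inc.",100.5'
line_tsv=$'AAPL\tApple, Inc.\t100.5'
csv_fields=$(echo "$line_csv" | awk -F',' '{print NF}')   # 4 -- the embedded comma splits the field
tsv_fields=$(echo "$line_tsv" | awk -F'\t' '{print NF}')  # 3 -- tabs are unambiguous here
echo "$csv_fields $tsv_fields"
```

The same three-column row parses as four fields under a comma delimiter but correctly as three under a tab delimiter, which is why TSV is the safer choice for this dataset.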

Sqoop on EMR 4.4.0+ is pre-installed

Starting with EMR release 4.4.0, Sqoop 1.4.6 is available as a sandbox application and can be installed simply by selecting it while provisioning the EMR cluster. By default, Sqoop on EMR has MariaDB and PostgreSQL drivers installed. To use an alternate set of JDBC connectors with Sqoop, install them in /usr/lib/sqoop/lib.
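For example, installing the MySQL Connector/J jar into Sqoop's lib directory might look like this hedged sketch; the download URL and version are assumptions, so check the current connector release before use:

```shell
# Hypothetical connector version/URL -- verify against the current MySQL Connector/J release
wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.38.tar.gz
tar -xzf mysql-connector-java-5.1.38.tar.gz
sudo cp mysql-connector-java-5.1.38/mysql-connector-java-5.1.38-bin.jar /usr/lib/sqoop/lib/
```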

Sqoop on EMR 3.x.x can be installed with the following script:
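The original script is not preserved here; a hedged sketch of a manual Sqoop install on an EMR 3.x master node might look like the following (the mirror URL, Sqoop version, and paths are assumptions):

```shell
# Hypothetical mirror/version -- verify before use
cd /home/hadoop
wget https://archive.apache.org/dist/sqoop/1.4.6/sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
tar -xzf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
export SQOOP_HOME=/home/hadoop/sqoop-1.4.6.bin__hadoop-2.0.4-alpha
export PATH=$PATH:$SQOOP_HOME/bin
```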

Import Commands:

Sqoop import
This command copies the MySQL table from RDS to S3. The S3 file content is TSV, and the file name will be in “part-m-00000” format.
Note that -m 1 runs only a single mapper task; increase this value to run multiple mappers in parallel.
sqoop import --connect jdbc:mysql:// --username mannem --password Password123 --table dailyStockPrices --target-dir s3://mannem/sqoopmatrix -m 1 --fields-terminated-by '\t' --lines-terminated-by '\n'
Check S3 contents
hadoop fs -cat s3://mannem/sqoopmatrix/part-m-00000
Sqoop command usage:

Export commands:

Before exporting, the destination MySQL/PostgreSQL table should already exist with a matching schema.
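Since the table's actual schema is not shown above, here is a purely hypothetical DDL for a daily-stock-price table of this shape; every column name and type is my assumption, not the real dailyStockPrices schema:

```sql
CREATE TABLE dailyStockPrices (
  symbol     VARCHAR(10),
  trade_date DATE,
  open       DECIMAL(12,4),
  high       DECIMAL(12,4),
  low        DECIMAL(12,4),
  close      DECIMAL(12,4),
  volume     BIGINT
);
```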
Export to RDS (MySQL)
This command copies the TSV file from S3 to the MySQL table.
sqoop export --connect jdbc:mysql:// --username mannem --password Password123 --table dailyStockPrices --input-fields-terminated-by '\t' --input-lines-terminated-by '\n' --export-dir s3://mannem/sqoopmatrix/part-m-00000
Export to Redshift (PostgreSQL)
sqoop export --connect jdbc:redshift://$MYREDSHIFTHOST:5439/mydb --table mysqoopexport --export-dir s3://mybucket/myinputfiles/ --driver --username master --password Mymasterpass1
Export command with a MariaDB connection string
sqoop export --connect jdbc:mariadb://$HOSTNAME:3306/mydb --table mysqoopexport --export-dir s3://mybucket/myinputfiles/ --driver org.mariadb.jdbc.Driver --username master --password Mymasterpass1
Export using Secure Socket Layer (SSL) encryption
sqoop export --connect "jdbc:mariadb://$HOSTNAME:3306/mydb?verifyServerCertificate=false&useSSL=true&requireSSL=true" --table mysqoopexport --export-dir s3://mybucket/myinputfiles/ --driver org.mariadb.jdbc.Driver --username master --password Mymasterpass1