hbase snapshot / export

Written by mannem on . Posted in EMR || Elastic Map Reduce

I observed that exporting large Hbase tables with Hbase provided ‘Export’ utility is very high CPU bound. If you are using default cluster configurations, the mappers may consume 100% CPU and may crash the regionServer(core-node) and your hbase. This article discusses some tuning on your map/reduce hbase export job. Also it focus on an alternate Hbase utility called ‘ExportSnapshot’ which mitigates problems with Hbase Export utility.

This article touches import/export tools that ship with hbase and shows how to use them efficiently

A small intro on Hbase Utilities.


Export is a utility that will dump the contents of table to HDFS in a sequence file.


CopyTable is a utility that can copy part or of all of a table, either to the same cluster or another cluster. The target table must first exist. The usage is as follows:


The ExportSnapshot tool copies all the data related to a snapshot (hfiles, logs, snapshot metadata) to another cluster. The tool executes a Map-Reduce job, similar to distcp, to copy files between the two clusters, and since it works at file-system level the hbase cluster does not have to be online.

The main difference between Exporting a Snapshot and Copying/Exporting a table is that ExportSnapshot operates at HDFS level.

This means that Master and Region Servers are not involved in this operations.

Consequently, no unnecessary caches for data are created and there is no triggering of additional GC pauses due to the number of objects created during the scan process.

So, when Exporting snapshot, the job is not longer CPU bound.

Please see :

  • http://hbase.apache.org/book.html#ops.snapshots
  • http://blog.cloudera.com/blog/2013/03/introduction-to-apache-hbase-snapshots/
  • http://www.cloudera.com/content/cloudera/en/documentation/core/v5-3-x/topics/cm_bdr_managing_hbase_snapshots.html
  • Using snapshots, i was able to Export my hbase table from Source cluster with EMR 3.0.0 AMI to Target cluster with latest(at the time of writing this post) EMR 3.9.0 AMI.

    This job was considerably very fast than Hbase Export.

    Prepping source and target clusters:


    —Enabling snapshots on Hbase—

    You must add the following property to hbase-site.xml of HBase Master node:
    ‘hbase.snapshot.enabled’ property with value ‘true’.


    Stop the HMaster process.

    > sudo service hbase-master stop

    service-nanny will restart this process again. This will load the new configuration.

    —Enabling communication between source and Target clusters—

    To transfer snapshot from source cluster to target cluster,

    The source cluster need to communicate with target node’s name-node process listening on hdfs default port 9000(hdfs getconf -confKey fs.default.name)

    if both these clusters are using default EMR security groups, then they can communicate by default.Otherwise you may need to enable the security groups to allow communication over that port.

    (Check telnet 9000 from source cluster’s master)


    —Creating a snapshot on Source cluster—

    On source cluster(EMR 3.0.0 )

    > hbase shell
    hbase(main):001:0> snapshot 'sourceTable', 'snapshotName'

    check if the snapshot is created at HDFS /hbase/.hbase-snapshot/

    —Exporting this snapshot(snapshotName) to Hdfs /hbase/.hbase-snapshot/ of Target cluster(EMR 3.9.0)—

    with Master node’s internal IP
    listening on HDFS port 9000.
    Using 16 mappers
    with ‘ExportSnapshot’

    > hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot snapshotName -copy-to hdfs:// -mappers 16


    On Target cluster, check if the snapshot is exported

    > hadoop fs -ls /hbase/.hbase-snapshot/

    —Clone this snapshot to actual Hbase table on Target cluster—

    > hbase shell
    hbase(main):001:0> clone_snapshot 'snapshotName', 'newTableName'


    Installing and using Apache sqoop to export/Import large datasets (MySQL, S3) (CSV/TSV..) on EMR cluster

    Written by mannem on . Posted in EMR || Elastic Map Reduce


    Often times the export/import activity may be limited on several performance bottlenecks. So, the activity may be faster if a distributed transfer is used instead of normal transfer. Some of the bottlenecks include Read Throughput , Write throughput , how the code parses the data(Inline or Batch etc,. Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.


    This guide shows you
    > To Install sqoop and
    > Export/Import MySQL tables (from S3 to RDS) ,(from RDS to S3) respectively.

    Considering a sample MySQL Table in RDS

    > mysql -h myrds.crezaaaruhfx.us-west-2.rds.amazonaws.com -u mannem -p

    mysql> describe dailyStockPrices;

    mysql> SELECT table_name AS "Table",
    -> round(((data_length + index_length) / 1024 / 1024), 2) "Size in MB"
    -> FROM information_schema.TABLES
    -> WHERE table_schema = "pipetest"
    -> AND table_name = "dailyStockPrices";

    > My MySQL table dataset has lot of commas in the fields, so I choose TSV format instead of CSV to import/export.
    > If I used CSV format, Sqoop will get confused parsing data.

    Sqoop on EMR 4.4.0 + is pre-installed

    Starting from EMR AMI version 4.4.0 , Sqoop 1.4.6 is available as sandbox. This can be installed by simply selecting this option while provisioning the EMR cluster. By default, Sqoop on EMR has a MariaDB and PostgresSQL driver installed. To install an alternate set of JDBC connectors for Sqoop, you need to install them in /usr/lib/sqoop/lib.


    Sqoop on EMR 3.x.x can be installed with the following script:

    Import Commands:

    Import sqoop
    This command copies MySQL table from RDS to S3. The S3 file content type is TSV & File name will be in “part-m-00000” format.
    Note that with -m 1 , I am using single mapper task to run in parallel.
    sqoop import --connect jdbc:mysql://myrds.crezaaaruhfx.us-west-2.rds.amazonaws.com/pipetest --username mannem --password Password123 --table dailyStockPrices --target-dir s3://mannem/sqoopmatrix -m 1 --fields-terminated-by '\t' --lines-terminated-by '\n'
    Check S3 contents
    hadoop fs -cat s3://mannem/sqoopmatrix/part-m-00000
    Sqoop command usage:

    Export commands :

    Before export, The Destination MySQL/PSQL table should already be created with a similar schema.
    Export to RDS(MySQL)
    This command copies TSV file from S3 to MySQL Table.
    sqoop export --connect jdbc:mysql://myrds.crezaaaruhfx.us-west-2.rds.amazonaws.com/pipetest --username mannem --password Password123 --table dailyStockPrices --input-fields-terminated-by '\t' --input-lines-terminated-by '\n' --export-dir s3://mannem/sqoopmatrix/part-m-00000
    Export to Redshift(PSQL)
    sqoop export --connect jdbc:redshift://$MYREDSHIFTHOST:5439/mydb --table mysqoopexport --export-dir s3://mybucket/myinputfiles/ --driver com.amazon.redshift.jdbc41.Driver --username master --password Mymasterpass1
    Export commands with mariadb connection string
    sqoop export --connect jdbc:mariadb://$HOSTNAME:3306/mydb --table mysqoopexport --export-dir s3://mybucket/myinputfiles/ --driver org.mariadb.jdbc.Driver --username master --password Mymasterpass1
    Export with using Secure Socket Layer encryption
    sqoop export --connect jdbc:mariadb://$HOSTNAME:3306/mydb?verifyServerCertificate=false&useSSL=true&requireSSL=true --table mysqoopexport --export-dir s3://mybucket/myinputfiles/ --driver org.mariadb.jdbc.Driver --username master --password Mymasterpass1

    How to retrieve Cluster ID / JobFlow ID from EMR master node.

    Written by mannem on . Posted in EMR || Elastic Map Reduce

    You may look at  /mnt/var/lib/info/ on Master node to find lot of info about your EMR cluster setup.

    More specifically /mnt/var/lib/info/job-flow.json contains the jobFlowId or ClusterID.

    I was able to install a JSON parser on my Master node to retrieve jobFlowID.

    # Install JSON PARSER
    # Install pkgs required for jsawk
    wget http://pkgs.repoforge.org/js/js-1.60-1.el6.rf.x86_64.rpm
    wget http://pkgs.repoforge.org/js/js-devel-1.60-1.el6.rf.x86_64.rpm
    sudo yum install -y js-1.60-1.el6.rf.x86_64.rpm
    sudo yum install -y js-devel-1.60-1.el6.rf.x86_64.rpm
    # Install JSAWK curl -L http://github.com/micha/jsawk/raw/master/jsawk > jsawk
    chmod 755 jsawk && sudo mv jsawk /usr/bin/
    # Parse and print job flow ID
    jsawk 'return this.jobFlowId' < /mnt/var/lib/info/job-flow.json