I observed that exporting large HBase tables with the HBase-provided 'Export' utility is heavily CPU bound. With the default cluster configuration, the mappers may consume 100% CPU and may crash the RegionServer (core node) and HBase itself. This article discusses some tuning of the MapReduce HBase export job. It also focuses on an alternative HBase utility called 'ExportSnapshot', which mitigates the problems with the Export utility.
A short introduction to the HBase utilities:
Export is a utility that dumps the contents of a table to HDFS in a sequence file.
CopyTable is a utility that can copy part or all of a table, either to the same cluster or to another cluster. The target table must already exist.
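For reference, the two utilities are usually invoked roughly as in the sketch below. The table names, output directory, and ZooKeeper address are placeholders, not values from my clusters, and the helper function names are hypothetical. Each helper only prints the command line so it can be reviewed before it is run.

```shell
# Hypothetical helpers: each one prints an HBase command line instead of running it.

# Export <table> to a sequence-file directory <outdir> on HDFS.
export_cmd() {
  local table="$1" outdir="$2"
  echo "hbase org.apache.hadoop.hbase.mapreduce.Export ${table} ${outdir}"
}

# CopyTable <table> into an existing table <new_name>.
# An optional third argument is the target cluster's ZooKeeper address
# (quorum:port:znode); when given, the copy goes to that cluster.
copytable_cmd() {
  local table="$1" new_name="$2" peer="${3:-}"
  local cmd="hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=${new_name}"
  if [ -n "${peer}" ]; then
    cmd="${cmd} --peer.adr=${peer}"
  fi
  echo "${cmd} ${table}"
}

# Usage (placeholders):
#   export_cmd sourceTable /backup/sourceTable
#   copytable_cmd sourceTable copyOfSource zk-host:2181:/hbase
```

Both commands scan the table through the RegionServers, which is where the CPU load described above comes from.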
The ExportSnapshot tool copies all the data related to a snapshot (HFiles, logs, snapshot metadata) to another cluster. The tool runs a MapReduce job, similar to distcp, to copy files between the two clusters, and since it works at the file-system level the HBase cluster does not have to be online.
The main difference between exporting a snapshot and copying or exporting a table is that ExportSnapshot operates at the HDFS level.
This means that the Master and the RegionServers are not involved in the operation.
Consequently, no unnecessary data caches are created, and no additional GC pauses are triggered by the number of objects created during the scan process.
So, when exporting a snapshot, the job is no longer CPU bound.
Using snapshots, I was able to export my HBase table from a source cluster running the EMR 3.0.0 AMI to a target cluster running the latest (at the time of writing this post) EMR 3.9.0 AMI.
This job was considerably faster than HBase Export.
Prepping source and target clusters:
—Enabling snapshots on HBase—
Add the following property to hbase-site.xml on the HBase master node:
the 'hbase.snapshot.enabled' property with the value 'true'.
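As a sketch, the property block looks like the XML printed by the first helper below, and the second helper checks whether a given hbase-site.xml already contains it. The function names are hypothetical, the path to hbase-site.xml depends on your installation, and the check assumes the name and value elements sit on consecutive lines.

```shell
# Print the property block to paste inside <configuration> in hbase-site.xml.
snapshot_property_xml() {
  cat <<'EOF'
<property>
  <name>hbase.snapshot.enabled</name>
  <value>true</value>
</property>
EOF
}

# Succeed if the given hbase-site.xml already enables snapshots.
# Assumes <value> is on the line directly after <name>.
snapshots_enabled() {
  local site_xml="$1"
  grep -A1 '<name>hbase.snapshot.enabled</name>' "${site_xml}" \
    | grep -q '<value>true</value>'
}

# Usage (replace the path with the location of your hbase-site.xml):
#   snapshots_enabled /path/to/hbase-site.xml || snapshot_property_xml
```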
Stop the HMaster process.
> sudo service hbase-master stop
service-nanny restarts the process automatically, and the restarted HMaster loads the new configuration.
—Enabling communication between source and target clusters—
To transfer the snapshot, the source cluster needs to communicate with the target cluster's NameNode process, which listens on the default HDFS port 9000 (check with: hdfs getconf -confKey fs.default.name).
If both clusters use the default EMR security groups, they can communicate by default. Otherwise, you may need to change the security groups to allow traffic on that port.
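Before starting the export, connectivity can be checked from the source cluster's master node. The sketch below is one way to do it, not the only one: the function names are hypothetical, port_open relies on bash's /dev/tcp pseudo-device and the GNU timeout command, and the IP address is the target master used later in this post.

```shell
# Build the NameNode URI of the target cluster (the port defaults to 9000).
namenode_uri() {
  local host="$1" port="${2:-9000}"
  echo "hdfs://${host}:${port}"
}

# Succeed if a TCP connection to host:port opens within 5 seconds.
# /dev/tcp is a bash feature, so the probe is run through bash explicitly.
port_open() {
  local host="$1" port="$2"
  timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Usage on the source cluster's master node:
#   port_open 172.31.42.191 9000 && hadoop fs -ls "$(namenode_uri 172.31.42.191)/"
```

If the port check fails while both clusters are up, the security groups are the first thing to look at.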
—Creating a snapshot on the source cluster—
On the source cluster (EMR 3.0.0):
> hbase shell
hbase(main):001:0> snapshot 'sourceTable', 'snapshotName'
Check that the snapshot has been created under /hbase/.hbase-snapshot/ on HDFS.
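One way to script this check is sketched below. The helper name is hypothetical; it reads an 'hadoop fs -ls' listing on standard input and looks for the snapshot directory in it.

```shell
# Read an 'hadoop fs -ls' listing on stdin; succeed if it contains a
# directory entry ending in /.hbase-snapshot/<name>.
has_snapshot() {
  local name="$1"
  grep -q "/\.hbase-snapshot/${name}\$"
}

# Usage on the source cluster:
#   hadoop fs -ls /hbase/.hbase-snapshot/ | has_snapshot snapshotName && echo "snapshot found"
# The HBase shell can list snapshots as well:
#   echo "list_snapshots" | hbase shell
```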
—Exporting this snapshot (snapshotName) to HDFS /hbase/.hbase-snapshot/ on the target cluster (EMR 3.9.0)—
The target cluster's master node has the internal IP 172.31.42.191 and listens on HDFS port 9000.
The export below uses 16 mappers:
> hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot snapshotName -copy-to hdfs://172.31.42.191:9000/hbase -mappers 16
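The same command can be parameterized so that the snapshot name, target host, and mapper count are easy to change. This is a minimal sketch: the function name is hypothetical, and the defaults (16 mappers, port 9000) are simply the values used in this post. The helper prints the command rather than running it.

```shell
# Print the ExportSnapshot command for a snapshot and a target NameNode.
# Arguments: snapshot name, target host, mappers (default 16), port (default 9000).
export_snapshot_cmd() {
  local snapshot="$1" target_host="$2" mappers="${3:-16}" port="${4:-9000}"
  echo "hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot" \
       "-snapshot ${snapshot}" \
       "-copy-to hdfs://${target_host}:${port}/hbase" \
       "-mappers ${mappers}"
}

# Usage: review the printed command, then run it, for example:
#   export_snapshot_cmd snapshotName 172.31.42.191
#   export_snapshot_cmd snapshotName 172.31.42.191 | sh
```

The mapper count is worth tuning per cluster, since each mapper copies files in parallel and competes for network and disk bandwidth.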
On the target cluster, check that the snapshot has been exported:
> hadoop fs -ls /hbase/.hbase-snapshot/
—Cloning this snapshot into an actual HBase table on the target cluster—
> hbase shell
hbase(main):001:0> clone_snapshot 'snapshotName', 'newTableName'
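The clone and a quick sanity check can also be run non-interactively by piping commands into the HBase shell. The sketch below assumes the snapshot and table names used above; the helper name is hypothetical. It uses a small limited scan rather than a full count, because counting a large table takes a long time.

```shell
# Print HBase shell commands that clone a snapshot and then inspect the new table.
clone_commands() {
  local snapshot="$1" table="$2"
  printf "clone_snapshot '%s', '%s'\n" "${snapshot}" "${table}"
  printf "describe '%s'\n" "${table}"
  # Read only a few rows to confirm that data is visible.
  printf "scan '%s', {LIMIT => 5}\n" "${table}"
}

# Usage on the target cluster:
#   clone_commands snapshotName newTableName | hbase shell
```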