hbase snapshot / export

Written by mannem on . Posted in EMR || Elastic Map Reduce

I observed that exporting large HBase tables with the HBase-provided ‘Export’ utility is heavily CPU bound. With default cluster configurations, the mappers may consume 100% CPU and may crash the RegionServer (core node) and, with it, HBase. This article discusses some tuning of the map/reduce HBase export job. It also focuses on an alternative HBase utility called ‘ExportSnapshot’, which mitigates the problems seen with the Export utility.

This article touches on the import/export tools that ship with HBase and shows how to use them efficiently.

A short introduction to the HBase utilities:

————————————————-
org.apache.hadoop.hbase.mapreduce.Export

Export is a utility that dumps the contents of a table to HDFS as sequence files.
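
As a rough sketch of how an Export run might be launched, the helper below only builds and prints the command line; nothing is executed. The output directory /backup/sourceTable and the scanner caching value are assumptions for illustration. The two -D properties are standard HBase/Hadoop settings, but the values shown are not tuned recommendations.

```shell
# Build (but do not run) an Export command line.
# $1 = table name, $2 = HDFS output directory (hypothetical values below).
# SCANNER_CACHING is an optional environment override; 100 is an assumed default.
build_export_cmd() {
  table="$1"
  outdir="$2"
  printf 'hbase org.apache.hadoop.hbase.mapreduce.Export -D hbase.client.scanner.caching=%s -D mapreduce.map.speculative=false %s %s\n' \
    "${SCANNER_CACHING:-100}" "$table" "$outdir"
}

# Print the command for review; paste it into a shell on the master to run it.
build_export_cmd sourceTable /backup/sourceTable
```

Disabling speculative map execution avoids duplicate mappers scanning the same regions, which would otherwise add load to RegionServers that are already busy.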

org.apache.hadoop.hbase.mapreduce.CopyTable

CopyTable is a utility that can copy part or all of a table, either to the same cluster or to another cluster. The target table must already exist.
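
A minimal sketch of a CopyTable invocation follows. CopyTable takes optional arguments such as --new.name and --peer.adr, followed by the source table name. The table names are hypothetical, and the peer address assumes ZooKeeper on the target master at the default port 2181 with the default /hbase znode. The helper only prints the command.

```shell
# Build (but do not run) a CopyTable command line.
# $1 = source table, $2 = destination table name,
# $3 = optional peer address "zk-quorum:zk-port:znode-parent" for a remote cluster.
build_copytable_cmd() {
  src="$1"
  dst="$2"
  peer="${3:-}"
  cmd="hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=$dst"
  if [ -n "$peer" ]; then
    cmd="$cmd --peer.adr=$peer"
  fi
  printf '%s %s\n' "$cmd" "$src"
}

# Copy within the same cluster under a new name (hypothetical names).
build_copytable_cmd sourceTable sourceTableCopy
# Copy to another cluster (assumed ZooKeeper address on the target master).
build_copytable_cmd sourceTable sourceTable 172.31.42.191:2181:/hbase
```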

org.apache.hadoop.hbase.snapshot.ExportSnapshot

The ExportSnapshot tool copies all the data related to a snapshot (hfiles, logs, snapshot metadata) to another cluster. The tool executes a Map-Reduce job, similar to distcp, to copy files between the two clusters, and since it works at file-system level the hbase cluster does not have to be online.
————————————————-

The main difference between Exporting a Snapshot and Copying/Exporting a table is that ExportSnapshot operates at HDFS level.

This means that the Master and Region Servers are not involved in the operation.

Consequently, no unnecessary data caches are created, and the many objects created during a scan do not trigger additional GC pauses.

So, when exporting a snapshot, the job is no longer CPU bound.

Please see:

  • http://hbase.apache.org/book.html#ops.snapshots
  • http://blog.cloudera.com/blog/2013/03/introduction-to-apache-hbase-snapshots/
  • http://www.cloudera.com/content/cloudera/en/documentation/core/v5-3-x/topics/cm_bdr_managing_hbase_snapshots.html
Using snapshots, I was able to export my HBase table from a source cluster on the EMR 3.0.0 AMI to a target cluster on the EMR 3.9.0 AMI (the latest at the time of writing this post).

This job was considerably faster than HBase Export.

    Prepping source and target clusters:

    ————————————————-

    —Enabling snapshots on Hbase—

    Add the property ‘hbase.snapshot.enabled’ with value ‘true’ to hbase-site.xml on the HBase master node:

    /home/hadoop/hbase/conf/hbase-site.xml
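
For reference, the property described above would look roughly like the fragment printed below, placed inside the <configuration> element of hbase-site.xml. The helper function names are made up for this sketch, and the check only looks for the property name, not its value.

```shell
# Print the <property> element to add inside <configuration> in hbase-site.xml.
snapshot_property_xml() {
  cat <<'EOF'
<property>
  <name>hbase.snapshot.enabled</name>
  <value>true</value>
</property>
EOF
}

# Return 0 if the given file declares the property name.
# This does not validate the value; inspect the file if in doubt.
has_snapshot_property() {
  grep -q '<name>hbase.snapshot.enabled</name>' "$1"
}

snapshot_property_xml
# Example check on the master node:
#   has_snapshot_property /home/hadoop/hbase/conf/hbase-site.xml && echo "property present"
```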

    Stop the HMaster process.

    > sudo service hbase-master stop

    service-nanny will restart the process automatically, which loads the new configuration.

    —Enabling communication between source and Target clusters—

    To transfer a snapshot from the source cluster to the target cluster, the source cluster needs to reach the target cluster’s NameNode process, which listens on port 9000, the HDFS default on these clusters (confirm with: hdfs getconf -confKey fs.default.name).

    If both clusters use the default EMR security groups, they can communicate by default. Otherwise you may need to change the security groups to allow traffic on that port.

    (Check by running telnet against port 9000 of the target master from the source cluster’s master.)

    ————————————————-

    —Creating a snapshot on Source cluster—

    On the source cluster (EMR 3.0.0):

    > hbase shell
    hbase(main):001:0> snapshot 'sourceTable', 'snapshotName'

    Check that the snapshot was created in HDFS under /hbase/.hbase-snapshot/
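
If the snapshot step needs to be scripted rather than typed interactively, one option is to pipe the statement into hbase shell, which reads commands from standard input. The sketch below only prints the statement; the function name is made up for illustration, and the table and snapshot names are the ones used above.

```shell
# Print the HBase shell statement that takes a snapshot.
# $1 = table name, $2 = snapshot name.
snapshot_statement() {
  printf "snapshot '%s', '%s'\n" "$1" "$2"
}

snapshot_statement sourceTable snapshotName

# To run it for real on the source master and then verify in HDFS:
#   snapshot_statement sourceTable snapshotName | hbase shell
#   hadoop fs -ls /hbase/.hbase-snapshot/
```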

    —Exporting this snapshot(snapshotName) to Hdfs /hbase/.hbase-snapshot/ of Target cluster(EMR 3.9.0)—

    The target master node’s internal IP is 172.31.42.191, with HDFS listening on port 9000. The export uses 16 mappers with ‘ExportSnapshot’:

    > hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot snapshotName -copy-to hdfs://172.31.42.191:9000/hbase -mappers 16
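
The same command can be parameterized so the target address and mapper count are not hard-coded. This is a sketch only: the function name and the TARGET_HDFS_PORT variable are made up for illustration, and the defaults (port 9000, 16 mappers) simply mirror the values used above. It prints the command instead of running it.

```shell
# Build (but do not run) an ExportSnapshot command line.
# $1 = snapshot name, $2 = target NameNode host or IP, $3 = mapper count (default 16).
# TARGET_HDFS_PORT is an optional environment override (default 9000).
build_export_snapshot_cmd() {
  snapshot="$1"
  target_host="$2"
  mappers="${3:-16}"
  port="${TARGET_HDFS_PORT:-9000}"
  printf 'hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot %s -copy-to hdfs://%s:%s/hbase -mappers %s\n' \
    "$snapshot" "$target_host" "$port" "$mappers"
}

# Reproduces the command shown above.
build_export_snapshot_cmd snapshotName 172.31.42.191
```

Because each mapper copies files in parallel, the mapper count is the main knob for trading export speed against network and disk load on both clusters.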

    ————————————————-

    On the target cluster, check that the snapshot was exported:

    > hadoop fs -ls /hbase/.hbase-snapshot/

    —Clone this snapshot to actual Hbase table on Target cluster—

    > hbase shell
    hbase(main):001:0> clone_snapshot 'snapshotName', 'newTableName'

    ————————————————-
