Using AWS Athena to query S3 Server Access Logs.

Written by mannem on . Posted in EMR || Elastic Map Reduce

S3 server access logs can grow very large over time, and it is hard for a single machine to process, query, and analyze them all. So, we can use distributed computing to query the logs quickly. The obvious candidates are Apache Hadoop/Hive/Spark/Presto, but the simplicity of the serverless AWS Athena service makes it even easier. This article shows how to use Athena to process your S3 access logs, with example queries and some partitioning considerations that can help you query terabytes of logs in just a few seconds.

The server access log files consist of a sequence of newline-delimited log records. Each log record represents one request and consists of space-delimited fields. A sample record is shown below.

This article is based on the log format of access logs specified at https://docs.aws.amazon.com/AmazonS3/latest/dev/LogFormat.html. The fields and their format may change over time, and the DDL query should be updated accordingly.

Sample Data:
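(The original sample is not reproduced here. The single record below is illustrative only, assembled from the documented field layout; the owner ID, request ID, and sizes are made-up placeholders.)

79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be example-bucket [06/Feb/2014:00:00:38 +0000] 192.0.2.3 79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be 3E57427F3EXAMPLE REST.GET.OBJECT gdelt/1980.csv.gz "GET /gdelt/1980.csv.gz HTTP/1.1" 200 - 2662992 2662992 70 10 "-" "aws-cli/1.7.36 Python/2.7.9 Linux/3.14.44-32.39.amzn1.x86_64" -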

Data Format considerations:

On first look, the data format appears simple: a text file with a space field delimiter and newline (\n) record delimiter. However, there is a catch in this data format: columns like Time, Request-URI, and User-Agent can contain spaces in their data (for example [06/Feb/2014:00:00:38 +0000], "GET /gdelt/1980.csv.gz HTTP/1.1", and aws-cli/1.7.36 Python/2.7.9 Linux/3.14.44-32.39.amzn1.x86_64), which will mess up the SerDe parsing.

For the Request-URI and User-Agent fields, the data is enclosed in quotes with spaces inside. This can be parsed by any SerDe that supports quoting. Unfortunately, Athena does not support such SerDes, like org.apache.hadoop.hive.serde2.OpenCSVSerde, which does have a quote feature.

The org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe included by Athena does not support quotes yet.

A custom SerDe called com.amazon.emr.hive.serde.s3.S3LogDeserializer ships with all EMR AMIs just for parsing these logs, and a DDL referencing it would create the table easily. However, this SerDe is not supported by Athena.

Given these limitations, org.apache.hadoop.hive.serde2.RegexSerDe seems like the only feasible option. Writing a regex for your log structure is a bit time consuming, but I was able to write it with the help of the ELB log structure example and http://regexr.com/.

I used the following Regex for the S3 server access logs :

([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (-|[0-9]*) (-|[0-9]*) ([-0-9]*) ([-0-9]*) ([-0-9]*) ([-0-9]*) ([^ ]*) (\"[^\"]*\") ([^ ]*)$

Also found at http://regexr.com/3erct, where I removed some extra escape characters (backslashes). In the DDL, we need to add these escape characters back, as in the DDL below.

([^ ]*) – matches a continuous (non-space) string. Because number fields like 'time' can contain a '-' character, I did not use the ([-0-9]*) regex that is used for numbers. For the same reason, I had to use STRING for all fields in the DDL. We can use Presto string functions to convert the strings appropriately and apply comparison operators in DML queries.

\\[(.*?)\\] or \\\[(.*?)\\\] – is for the Time field. For a string like [06/Feb/2014:00:00:38 +0000], it matches and captures 06/Feb/2014:00:00:38 +0000 (without the brackets), which is easier to use when querying between times.

\\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" – is for Request-URI field – removes the quotes and splits the "GET /mybucket?versioning HTTP/1.1" into three groups.

(\"[^\"]*\") – is for User-Agent – matches the whole string with quotes.

(-|[0-9]*) – is for HTTPstatus, which is always a number like 200 or 404. If you expect a '-' string, you can use ([^ ]*) instead.

CREATE TABLE DDL :
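The original DDL block is not preserved here; the sketch below pairs the regex above (verbatim) with RegexSerDe. The column names are my own, and all columns are STRING as discussed. You can paste the CREATE TABLE statement into the Athena console, or submit it with the AWS CLI as shown (the results bucket is a placeholder):

cat > create_s3_access_logs.sql <<'SQL'
CREATE EXTERNAL TABLE IF NOT EXISTS s3_access_logs (
  bucketowner string,
  bucket string,
  requestdatetime string,
  remoteip string,
  requester string,
  requestid string,
  operation string,
  key string,
  requestmethod string,
  requesturi string,
  httpversion string,
  httpstatus string,
  errorcode string,
  bytessent string,
  objectsize string,
  totaltime string,
  turnaroundtime string,
  referrer string,
  useragent string,
  versionid string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (-|[0-9]*) (-|[0-9]*) ([-0-9]*) ([-0-9]*) ([-0-9]*) ([-0-9]*) ([^ ]*) (\"[^\"]*\") ([^ ]*)$'
)
LOCATION 's3://s3server-accesslogs-bucket/logs/';
SQL

aws athena start-query-execution \
  --query-string file://create_s3_access_logs.sql \
  --query-execution-context Database=default \
  --result-configuration OutputLocation=s3://my-athena-query-results-bucket/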

Where s3://s3server-accesslogs-bucket/logs/ is the bucket and prefix of your access logs.

Once you create the table, you can query this table for specific information :

Example Queries :
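The original queries are not preserved; here are a couple of illustrative ones against the table sketched above (parse_datetime's format string assumes the access-log time format shown earlier). Run them in the Athena console, or submit them with the CLI as shown above:

cat > sample_queries.sql <<'SQL'
-- top requester IPs by request count
SELECT remoteip, count(*) AS requests
FROM s3_access_logs
GROUP BY remoteip
ORDER BY requests DESC
LIMIT 20;

-- 4xx/5xx responses in a given time window
SELECT requestdatetime, remoteip, requesturi, httpstatus
FROM s3_access_logs
WHERE (httpstatus LIKE '4%' OR httpstatus LIKE '5%')
  AND parse_datetime(requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z')
      BETWEEN parse_datetime('01/Jan/2017:00:00:00 +0000', 'dd/MMM/yyyy:HH:mm:ss Z')
          AND parse_datetime('02/Jan/2017:00:00:00 +0000', 'dd/MMM/yyyy:HH:mm:ss Z');
SQL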

Any field can be set to “-” to indicate that the data was unknown or unavailable, or that the field was not applicable to this request.

Caveats :

1. From time to time, AWS might extend the access log record format by adding new fields to the end of each line. The DDL/regex must be written to handle trailing fields that it does not understand.

Partitioning considerations :

Over time, the logs in your target prefix (s3://s3server-accesslogs-bucket/logs/) can grow big. This is especially true if you use the same bucket/prefix for other buckets' access logs. If you enable logging on multiple source buckets that point to the same target bucket, the target bucket will have access logs for all those source buckets, but each log object will contain access log records for a specific source bucket.

Amazon S3 uses the following object key format for the log objects it uploads to the target bucket: TargetPrefixYYYY-mm-DD-HH-MM-SS-UniqueString. So, it basically creates many files, each containing a batch of log records, in the S3 bucket.

Querying can be slower if there are a large number of small files in text format.

To speed things up, we need at least one of these options to consolidate these logs:
1. PARTITIONING – query only a subset of files instead of going through all of them.
2. CONVERT TO COLUMNAR STORAGE – like ORC or Parquet, for faster query performance.
3. ENABLE COMPRESSION – Snappy or GZ.
4. BLOOM FILTERS.

Unfortunately, we cannot do this just by defining a DDL. We need to copy/move our text-based data set with the above options enabled. One option is to use AWS EMR to periodically structure and partition the S3 access logs so that you can query those logs easily with Athena. At this moment, it is not possible to use Athena itself to convert non-partitioned data into partitioned data.

Here's an example of converting non-partitioned S3 access logs to partitioned S3 access logs:
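The original example is not preserved; below is a sketch of one way to do it on an EMR cluster with Hive. It assumes the raw-text table from the DDL above has also been created in this cluster's Hive metastore; the table name, partition scheme, and output location are my own choices:

cat > partition_s3_access_logs.hql <<'HQL'
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- columnar (ORC), Snappy-compressed copy of the logs, partitioned by day
CREATE EXTERNAL TABLE IF NOT EXISTS s3_access_logs_orc (
  bucketowner string, bucket string, requestdatetime string, remoteip string,
  requester string, requestid string, operation string, key string,
  requestmethod string, requesturi string, httpversion string,
  httpstatus string, errorcode string, bytessent string, objectsize string,
  totaltime string, turnaroundtime string, referrer string, useragent string,
  versionid string
)
PARTITIONED BY (day string)
STORED AS ORC
LOCATION 's3://s3server-accesslogs-bucket/logs-partitioned/'
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- rewrite the raw text logs into the partitioned ORC table
INSERT OVERWRITE TABLE s3_access_logs_orc PARTITION (day)
SELECT bucketowner, bucket, requestdatetime, remoteip, requester, requestid,
       operation, key, requestmethod, requesturi, httpversion, httpstatus,
       errorcode, bytessent, objectsize, totaltime, turnaroundtime, referrer,
       useragent, versionid,
       -- derive a partition value like '2014-02-06' from '06/Feb/2014:00:00:38 +0000'
       from_unixtime(unix_timestamp(requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z'), 'yyyy-MM-dd') AS day
FROM s3_access_logs;
HQL

hive -f partition_s3_access_logs.hql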

Other Input Regex’s :

CloudFront
'^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$'
ELB access logs
'([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*):([0-9]*) ([.0-9]*) ([.0-9]*) ([.0-9]*) (-|[0-9]*) (-|[0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" ?(\"[^\"]*\")? ?([A-Z0-9-]+)? ?([A-Za-z0-9.-]*)?$'
Apache web logs
'([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?'

You may also like:
Using Athena to Query CloudTrail logs.
Parsing AWS Cloudtrail logs with EMR Hive and Spark

EMR s3DistCp "--groupBy" Regex examples

Written by mannem on . Posted in EMR || Elastic Map Reduce

Sometimes the regex could be confusing on S3DistCp groupBy. I usually use some online regex tools like https://regex101.com/ to better work with string matching and grouping.

Here are some examples that I explored so far :

Example 1 :

Example 2 :

s3-dist-cp --src s3://support.elasticmapreduce/training/datasets/gdelt/ --dest hdfs:///gdeltWrongOutput1/ --groupBy '.*(\d{6}).*'

This command would not merge any files, but would copy all files whose names contain six digits (like 20130401.export.CSV.gz) to the destination.

Example 3 :

http://stackoverflow.com/questions/38374107/how-to-emr-s3distcp-groupby-properly

Example 4 :

If you want to concatenate the matching files in the root directory and all matching files inside a 'sample_directory' into a single file, and compress the result in gzip format, a groupBy regex like the one at http://regexr.com/3ftn9 will concatenate all matched file contents and create one .gz file, as sketched below.
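A sketch of what that invocation can look like (the bucket, paths, and exact regex are placeholders; because every matched file yields the same captured group, '.csv', all of them are concatenated into a single output object, which --outputCodec=gz then compresses):

s3-dist-cp --src s3://my-bucket/ \
           --dest s3://my-bucket/merged/ \
           --groupBy '.*(\.csv)' \
           --outputCodec=gz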

Services in EMR – upstart

Written by mannem on . Posted in EMR || Elastic Map Reduce

Service management in EMR 4.x and 5.x is handled by upstart, not the traditional SysVinit scripts.

You can view services by running the below command:
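For example (a minimal sketch; the exact job list depends on the applications installed on your cluster):

sudo initctl list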

Services can be queried using the upstart commands, for example:
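For example, to check the status of HiveServer2 (job names correspond to the .conf files under /etc/init/):

sudo status hive-server2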

Services can be stopped and started with the following commands:
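For example, again using HiveServer2:

sudo stop hive-server2
sudo start hive-server2
sudo restart hive-server2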

More upstart commands can be found here : http://upstart.ubuntu.com/cookbook/

Getting stack trace/Heap dump of a process in EMR

Written by mannem on . Posted in EMR || Elastic Map Reduce

In the latest EMR AMIs, different applications like Hive and Hadoop are installed with corresponding Unix users.

Example: the HiveServer2 process is run as the hive user.

To take a stack trace or heap dump of this process, you need to run the tool as the corresponding user that spawned the process.

You can use the following commands:
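For example, for HiveServer2 (a sketch; the pgrep pattern below is just one way to find the pid):

HS2_PID=$(pgrep -f org.apache.hive.service.server.HiveServer2)
sudo -u hive jstack "$HS2_PID" > /tmp/hs2-threads.txt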

or
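or, to capture a heap dump using the pid found above (the output path is arbitrary):

sudo -u hive jmap -dump:live,format=b,file=/tmp/hs2-heap.hprof "$HS2_PID"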

Parsing AWS Cloudtrail logs with EMR Hive and Spark

Written by mannem on . Posted in cloudtrail, EMR || Elastic Map Reduce


AWS CloudTrail is a web service that records AWS API calls for your account and delivers log files to you. The recorded information includes the identity of the API caller, the time of the API call, the source IP address of the API caller, the request parameters, and the response elements returned by the AWS service.

CloudTrail usually delivers the logs to an S3 bucket as gzipped JSON files, segregated by account, region, and date (account/region/date/logFile.json.gz).

CloudTrail recommends using the aws-cloudtrail-processing-library, but it may be complex if you wish to run ad-hoc queries over a huge number of big log files quickly. With many files, it is also harder to download all logs to a single Linux/Unix node, unzip them, and run regex matching on all of them. So, we can use a distributed environment like Apache Hive on an AWS EMR cluster to parse/organize all these files using very simple SQL-like commands.

This article guides you through querying your CloudTrail logs using EMR Hive and provides some example queries that may be useful in different scenarios. It assumes that you have a running EMR cluster with the Hive application installed and that you have explored it a bit.

Directory and File structure of Cloudtrail logs :
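The original illustration is not preserved here; CloudTrail's documented key layout looks roughly like this (bucket and optional prefix are yours):

s3://<bucket>/<optional-prefix>/AWSLogs/<account-id>/CloudTrail/<region>/<YYYY>/<MM>/<DD>/
    <account-id>_CloudTrail_<region>_<YYYYMMDD>T<HHmm>Z_<unique-string>.json.gz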



The following Hive queries show how to create a Hive table that references the CloudTrail S3 bucket. CloudTrail data is processed by the CloudTrailInputFormat implementation, which defines the input data splits and key/value records. The CloudTrailLogDeserializer class defined in the SerDe is called to format the data into a record that maps to the columns and data types in the table. Data to be written (such as via an INSERT statement) is translated by the Serializer class defined in the SerDe into the format that the OUTPUTFORMAT class (HiveIgnoreKeyTextOutputFormat) can write.

These classes are part of the /usr/share/aws/emr/goodies/lib/EmrHadoopGoodies-x.jar files and are automatically included in Hive's classpath. Hive can also automatically decompress the .gz files. All you need to do is run queries similar to SQL commands; some sample queries are included below.
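A sketch only: the table and column names are mine, the column list is partial, and the SerDe/InputFormat class names below are the ones commonly used for CloudTrail tables on EMR/Athena. Verify them against the goodies jar on your cluster before relying on this.

cat > create_cloudtrail_table.hql <<'HQL'
CREATE EXTERNAL TABLE IF NOT EXISTS cloudtrail_logs (
  eventversion STRING,
  useridentity STRUCT<type:STRING, principalid:STRING, arn:STRING, accountid:STRING, username:STRING>,
  eventtime STRING,
  eventsource STRING,
  eventname STRING,
  awsregion STRING,
  sourceipaddress STRING,
  useragent STRING,
  errorcode STRING,
  errormessage STRING,
  requestid STRING,
  eventid STRING,
  eventtype STRING,
  recipientaccountid STRING
)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://my-cloudtrail-bucket/AWSLogs/123456789012/CloudTrail/us-east-1/2016/12/05/';

-- a sample query: which API calls are most frequent in this slice of logs?
SELECT eventsource, eventname, count(*) AS calls
FROM cloudtrail_logs
GROUP BY eventsource, eventname
ORDER BY calls DESC
LIMIT 20;
HQL

hive -f create_cloudtrail_table.hql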

EMR uses an instance profile role on its nodes to authenticate requests made to the CloudTrail bucket. The default IAM policy on the EMR_EC2_DefaultRole allows access to all S3 buckets. If your cluster does not have access, you may need to make sure the instance profile/role has access to the necessary CloudTrail S3 bucket.
Do not run any INSERT OVERWRITE on this Hive table. If EMR has write access to the S3 bucket, an INSERT OVERWRITE may delete all logs from it. Please check the Hive language manual before attempting any commands.
CloudTrail JSON elements are extensive and differ for different kinds of requests. This SerDe (which is more or less abandoned by the EMR team) doesn't include all possible fields in CloudTrail. For example, if you try to query requestparameters, it gives FAILED: SemanticException [Error 10004]: Line 6:28 Invalid table alias or column reference 'requestparameters'.
If your CloudTrail bucket has a large number of files, Tez's grouped splits or MR's input splits calculation may take considerable time and memory. Make sure you allocate enough resources to the hive-client or tez-client.

TEZ: https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works



Alternatively, you can use Apache Spark, whose spark-shell can query CloudTrail logs using the following library: https://github.com/awslabs/timely-security-analytics

Here are some instructions on this design:



hive server 2 crashing with OutOfMemoryError (OOM) ?

Written by mannem on . Posted in EMR || Elastic Map Reduce


Oftentimes HiveServer2 can be a single point of failure. It can easily crash with an OOM. If HiveServer2 restarts now and then, it is likely due to OOM, where it is set to be killed and re-spawned. We need to check the JVM options to see the behavior of this process on OOM, analyze thread dumps, and check the underlying issue to mitigate such crashes.

This post explains how to identify an OOM issue on HiveServer2, how to analyze thread/heap dumps, and how to mitigate the issue. Some of the discussion is generic to EMR.

PART 1 : How to identify OOM issue on HiveServer2

Default JVM options of HS2 on EMR 4.7.2 AMI :
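The original dump of the HS2 command line is not reproduced here, but you can inspect it on your own cluster; the flags discussed in the observations below appear in output like the following:

sudo -u hive jps -v | grep -i hiveserver2     # JVM flags of the HiveServer2 process
ps -ef | grep -i "[h]iveserver2"              # full command line, including the hive logging options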

Some observations from above JVM:

1. The heap space for the HiveServer2 JVM is 1000 MB, and this option is sourced from hive-env.sh.
2. Hive logging options like (-Dhive.log.dir=/var/log/hive -Dhive.log.file=hive-server2.log -Dhive.log.threshold=INFO -server) are sourced from /etc/init/hive-server2.conf.
3. All other options starting from '-server' are set by EMR.
4. If this JVM hits an OutOfMemoryError, the -XX:OnOutOfMemoryError=kill -9 %p option will kill HS2 on OOM.

If for some reason HiveServer2 goes OOM, it simply gets killed and there is no way to find out why; hive-server2.log and hive-server2.out may not show the OOM at all. You can check hive-server2.log to verify that HiveServer2 is getting restarted. So, we tweak the JVM options for HS2 to include GC verbose logging and to dump the heap on OutOfMemoryError.

With the following settings, if the HS2 process does go OOM, it will log an error like the one below to /var/log/hive/hive-server2.out. This message means that, for some reason, the garbage collector of the HS2 JVM is taking an excessive amount of time and recovering very little memory in each run.
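The stripped example most likely showed the standard JVM message for this condition, quoted here for reference:

java.lang.OutOfMemoryError: GC overhead limit exceeded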

An example script to enable GC verbose logging on the HS2 JVM, for EMR 4.x.x and 5.x.x, can look like the following:
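A minimal sketch (paths follow the EMR 4.x/5.x layout; the GC log and heap-dump locations are my own choices):

#!/bin/bash
# Append GC-verbose and heap-dump-on-OOM flags for the HiveServer2 service only
# ("$SERVICE" is set by the hive launcher script), then restart HS2 via upstart.
sudo tee -a /etc/hive/conf/hive-env.sh > /dev/null <<'EOF'
if [ "$SERVICE" = "hiveserver2" ]; then
  export HADOOP_OPTS="$HADOOP_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
    -Xloggc:/var/log/hive/hiveserver2-gc.log \
    -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/hive"
fi
EOF

sudo stop hive-server2
sudo start hive-server2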



PART 2 : How to analyze Thread/Heap dumps

With the above options, if HS2 fails with an OOM, it will log the error to /var/log/hive/hive-server2.out. It will also write a heap dump to a file like /var/log/hive/java_pid8548.hprof. There are multiple ways to open and analyze this file; I found the "Eclipse Memory Analyzer" tool helpful in identifying memory leaks and exploring the dump across Java classes and sub-classes.




PART 3 : How to Mitigate this issue

1. Tweaking hive-site.xml by checking stack trace:

The best way to mitigate OOMs is to check the stack trace of the HiveServer2 process and see if there are any leaks. It is also a good idea to check for the top memory consumers and act accordingly. For each of those threads, see if any of the hive-site.xml settings would reduce its memory consumption. Sometimes you may not have control over any of the thread specifics, in which case you may need to further increase the heap space.

For example, in the stack trace:

– if you see multiple threads getting blocked, you can tune a setting like hive.server2.thrift.http.max.worker.threads

2. Mitigating OOM obviously involves increasing the Heap space for Hive-server2.

The heap space is defined in hive-env.sh using an environment variable. It is important to identify the memory requirements of the three Hive services (HiveServer2, Hive metastore, and Hive clients) in advance, using load/concurrency tests, before moving to production. If you observe any OOMs, you may need to increase memory accordingly.

Identifying hive services and increasing memory :

If the line 'export HADOOP_HEAPSIZE=2048' is present in hive-env.sh, it applies to all three Hive services after a restart. So, you can use if-statements to provide different settings (HADOOP_HEAPSIZE and HADOOP_OPTS) for each of the three services, as shown below.

Example contents of hive-env.sh :
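A sketch of what such a hive-env.sh can look like (heap sizes are placeholders; "$SERVICE" is set by the hive launcher script):

# /etc/hive/conf/hive-env.sh -- per-service memory settings
if [ "$SERVICE" = "hiveserver2" ]; then
  export HADOOP_HEAPSIZE=4096      # HiveServer2
fi
if [ "$SERVICE" = "metastore" ]; then
  export HADOOP_HEAPSIZE=2048      # Hive metastore
fi
if [ "$SERVICE" = "cli" ]; then
  export HADOOP_HEAPSIZE=1024      # Hive clients (CLI)
fi
# options appended to every Hive JVM
export HADOOP_OPTS="$HADOOP_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/hive"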

3. High availability and LoadBalancing of Hiveserver2 :

Increasing the heap space for HS2 may help. However, as per Cloudera, after a certain memory limit there is a possibility that you will continue to hit OOMs no matter how much you increase the memory for HS2. In that case, "Cloudera recommends splitting HiveServer2 into multiple instances and load balancing once you start allocating >12 GB to HiveServer2. The objective is to size to reduce impact of Java garbage collection on active processing by the service."

You may checkout the limitations here : https://www.cloudera.com/documentation/enterprise/5-5-x/topics/cdh_ig_hive_install.html#concept_alp_4kl_3q

4. Watch out for any bugs in the LockManager :

https://hive.apache.org/javadocs/r2.0.0/api/org/apache/hadoop/hive/ql/lockmgr/

If you are using Hive's table lock manager service (by setting hive.support.concurrency to true), check whether there are issues with the lock manager, which is responsible for maintaining locks for concurrent user support. The lock manager can be ZooKeeperHiveLockManager based, or EmbeddedLockManager, which is a shared lock manager for a dedicated HiveServer2 where all locks are managed in memory. If the issue seems to be with these lock managers, you may need to adjust their configs.

5. Check for generic version issues :

Search Apache JIRA for known HiveServer2 issues affecting your Hive version.

6. GC Tuning :

If the HS2 JVM is spending too much time on GC , you might consider some GC Tuning as discussed here : http://www.cubrid.org/blog/dev-platform/how-to-tune-java-garbage-collection/



spark-redshift library from databricks – Installation on EMR and using InstanceProfile

Written by mannem on . Posted in EMR || Elastic Map Reduce

spark-redshift is a library to load data into Spark SQL DataFrames from Amazon Redshift and write them back to Redshift tables. Amazon S3 is used to efficiently transfer data in and out of Redshift, and JDBC is used to automatically trigger the appropriate COPY and UNLOAD commands on Redshift. This library is more suited to ETL than to interactive queries, since large amounts of data could be extracted to S3 for each query execution.

spark-redshift installation instructions on EMR:
Steps 1-8 show how to compile your own spark-redshift package (JAR). You can skip directly to step 9 if you wish to use the pre-compiled JARs (v0.6.1). Later we use spark-shell to load these JARs and run Scala code that queries a Redshift table and puts the contents into a DataFrame. This guide assumes you have followed the GitHub page https://github.com/databricks/spark-redshift and its tutorial.

Download the prerequisites to compile your own JARs:

————————————————-

If you want to skip the above steps, you can download the pre-compiled JARs:

wget https://s3.amazonaws.com/emrsupport/spark/spark-redshift-databricks/minimal-json-0.9.5-SNAPSHOT.jar

wget https://s3.amazonaws.com/emrsupport/spark/spark-redshift-databricks/spark-redshift_2.10-0.6.1-SNAPSHOT.jar

With these JARs you can skip all of the above steps (1-8).

————————————————-

10. In the Scala shell, you can run the following commands to initialize the SQLContext. Note that the code below automatically uses the IAM role's (instance profile's) credentials to authenticate against S3.

Sample Scala code uses Instance profile
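The original snippet is not preserved; the sketch below launches spark-shell with the JARs downloaded above (the same JDBC driver jar as in the spark-sql invocation further down) and reads a Redshift table into a DataFrame. The Redshift endpoint, credentials, table, and tempdir are placeholders; no S3 keys are set, so the instance profile is used for the S3 transfer:

spark-shell --jars RedshiftJDBC41-1.1.13.1013.jar,spark-redshift_2.10-0.6.1-SNAPSHOT.jar,minimal-json-0.9.5-SNAPSHOT.jar <<'SCALA'
import org.apache.spark.sql.SQLContext

// init the SQLContext from the spark-shell's SparkContext
val sqlContext = new SQLContext(sc)

// read a Redshift table; spark-redshift UNLOADs it to the S3 tempdir first
val df = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com:5439/mydb?user=masteruser&password=MyPassword1")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-temp-bucket/spark-redshift/")
  .load()

df.show()
SCALA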

————————————————-


Sample spark-sql invocation

spark-sql --jars RedshiftJDBC41-1.1.13.1013.jar,spark-redshift_2.10-0.6.1-SNAPSHOT.jar,minimal-json-0.9.5-SNAPSHOT.jar


Note that while running the above commands, spark-redshift executes a COPY/UNLOAD on Redshift, passing temporary S3 credentials in the command.

These temporary credentials come from the IAM role (EMR_EC2_DefaultRole), and this role should have a policy that allows S3 access to at least the temp bucket mentioned in the command.


Using a single hive warehouse for all EMR(Hadoop) clusters

Written by mannem on . Posted in EMR || Elastic Map Reduce

Because EMR/Hadoop clusters are transient, tracking all the databases and tables created across clusters may be difficult. So, instead of having different warehouse directories across clusters, you can use a single permanent Hive warehouse across all EMR clusters. S3 is a great choice, as it is persistent storage with a robust architecture providing redundancy and read-after-write consistency.

For each cluster:

This can be configured using the hive.metastore.warehouse.dir property in hive-site.xml.

You may need to update this setting on all nodes.
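For new clusters, one way to apply this to every node is the hive-site configuration classification at launch (a sketch; the instance settings are illustrative and the warehouse path matches the examples below):

aws emr create-cluster \
  --release-label emr-5.2.0 \
  --applications Name=Hive \
  --instance-type m3.xlarge --instance-count 3 \
  --use-default-roles \
  --configurations '[{"Classification":"hive-site","Properties":{"hive.metastore.warehouse.dir":"s3n://bucket/hive_warehouse"}}]'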

On a single hive session:

this can be configured using a command like set hive.metastore.warehouse.dir=s3n://bucket/hive_warehouse;

or initialize the Hive CLI with the following invocation: hive -hiveconf hive.metastore.warehouse.dir=s3n://bucket/hive_warehouse

Note that with the above configuration, all databases and tables that use the default location will be stored on S3 under a path like s3://bucket/hive_warehouse/myHiveDatabase.db/

YARN Log aggregation on EMR Cluster – How to ?

Written by mannem on . Posted in EMR || Elastic Map Reduce


Why Log aggregation ?

User logs of Hadoop jobs serve multiple purposes. First and foremost, they can be used to debug issues while running a MapReduce application – correctness problems with the application itself, race conditions when running on a cluster, and debugging task/job failures due to hardware or platform bugs. Secondly, one can do historical analyses of the logs to see how individual tasks in job/workflow perform over time. One can even analyze the Hadoop MapReduce user-logs using Hadoop MapReduce(!) to determine any performance issues.

More info found here:
http://hortonworks.com/blog/simplifying-user-logs-management-and-access-in-yarn/

Without log aggregation, when you try to access the job history logs from the ResourceManager UI, you see this error:

“Aggregation is not enabled. Try the nodemanager at ip-172-31-41-154.us-west-2.compute.internal:8041”

This article describes how various EMR versions default the YARN log aggregation option. It also guides you through enabling YARN log aggregation on EMR AMIs that do not have it by default, both on a live cluster and while launching a cluster.

EMR Log aggregation vs YARN Log aggregation :

EMR has its own log aggregation into an S3 bucket for persistent logging and debugging. The EMR web console provides a feature similar to "yarn logs -applicationId" if you turn on the debugging feature.

YARN log aggregation stores the application container logs in HDFS, whereas EMR's LogPusher (the process that pushes logs to S3 as a persistent option) needs the files on the local file system. The default behavior of YARN is to copy the container logs from the local disks of the core nodes to HDFS and then, after aggregation, DELETE those local files on the individual core nodes. Since they are deleted by the NodeManager on the individual nodes, EMR has no way to save those logs to more persistent storage such as S3. So, historically, EMR had to disable YARN log aggregation so that the container logs stayed on the local machines and could be pushed to S3. For Hue's integrated features, such as displaying application logs in the Hue console, when you install Hue on EMR (either AMI 3.x.x or 4.x.x), yarn.log-aggregation-enable is turned on by default and container logs might then be missing from S3.

This behavior changed in later EMR AMIs, such as 4.3.0, where both YARN log aggregation and EMR log aggregation can work together. EMR basically introduced a patch, described in https://issues.apache.org/jira/browse/YARN-3929, into all EMR AMIs after 4.3.0 (Hadoop branch-2.7.3-amzn-0). Note that this patch has not been adopted upstream and is now part of EMR's Hadoop.

With this commit, there is an option to keep the files on the local machines after log aggregation, managed by the "yarn.log-aggregation.enable-local-cleanup" property in yarn-site.xml on the respective core/task nodes. This property is not public and can only be set on EMR distributions. It defaults to FALSE, which means cleanup on the local machines will NOT take place. LogPusher needs these logs on the local machines to push them to S3, and LogPusher is responsible for removing those local logs after they are copied over and after a certain retention period (4 hours for containers).

For the logs to be deleted from local disks, we need to flip it to true with the configurations API while launching the cluster. On a live cluster, yarn-site.xml on all core/task nodes should be updated and the NodeManager restarted. After the restart, old container logs might still be present.

*** This option might not be recommended, because LogPusher will NOT be able to push those local container logs to the customer's (service's) S3 bucket if it is set to true,
** and the only source of container logs will then be the aggregated logs on HDFS, which is not as persistent.


EMR 4.x.x :

AMI 4.0.0 doesn't support Hue, so yarn.log-aggregation-enable=false is the default.

To enable it on 4.0.0:

Step 1: For 4.0.0, you may need to enable it by following the procedure for 3.x.x below.

As AMI 4.x.x uses upstart, you need to use the following command rather than "sudo service ...":

sudo restart hadoop-mapreduce-historyserver

/etc/init/ folder has all service configuration files

Step 2: /etc/hadoop/conf.empty/yarn-site.xml is the configuration file that needs to be edited.

Step 3: While launching the cluster, you can use the yarn-site release-label classification to specify the necessary changes to enable log aggregation.

Configuration that needs to be enabled :

yarn.log-aggregation-enable=true

You can refer to the following page to configure your cluster.
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-configure-apps.html
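For example, a create-cluster sketch with that classification (the other options are illustrative):

aws emr create-cluster \
  --release-label emr-4.0.0 \
  --applications Name=Hadoop Name=Hive \
  --instance-type m3.xlarge --instance-count 3 \
  --use-default-roles \
  --configurations '[{"Classification":"yarn-site","Properties":{"yarn.log-aggregation-enable":"true"}}]'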

4.1.0,4.2.0:

If you don't install Hue, you need to turn it on explicitly using EMR's configuration (yarn.log-aggregation-enable=true) via the yarn-site classification.

If you turn on yarn.log-aggregation-enable, either explicitly or via Hue, application container logs will not be saved to your S3 log location; they will stay on HDFS for YARN's log aggregation feature.

4.3.0 and above / 5.0.0 and Above

YARN log aggregation is enabled by default and the logs will also get pushed to your S3 log bucket.


EMR 3.x.x :

Here’s how you enable log aggregation on 3.x.x EMR cluster:

Step 1:

Change yarn-site.xml settings to enable log aggregation

On Master node:
> vim /home/hadoop/conf/yarn-site.xml

Step 2:

Copy this configuration on all nodes.

yarn node -list|sed -n "s/^\(ip[^:]*\):.*/\1/p" | xargs -t -I{} -P10 scp -o StrictHostKeyChecking=no -i ~/MyKeyName.pem /home/hadoop/conf/yarn-site.xml hadoop@{}://home/hadoop/conf/

Where MyKeyName.pem is the private key for SSH’ing into slaves.

The above command lists all slave nodes and uses scp to copy (replace) the yarn-site.xml present on the master onto the slave nodes.

Step 3:

Restart hadoop HistoryServer daemon on all nodes.

yarn node -list|sed -n "s/^\(ip[^:]*\):.*/\1/p" | xargs -t -I{} -P10 ssh -o StrictHostKeyChecking=no -i ~/MyKeyName.pem hadoop@{} "sudo service hadoop-mapreduce-historyserver restart"




hbase snapshot / export

Written by mannem on . Posted in EMR || Elastic Map Reduce

I observed that exporting large HBase tables with the HBase-provided 'Export' utility is very CPU bound. If you are using default cluster configurations, the mappers may consume 100% CPU and may crash the RegionServer (core node) and your HBase. This article discusses some tuning for your map/reduce HBase export job. It also focuses on an alternative HBase utility called 'ExportSnapshot', which mitigates the problems with the HBase Export utility.

This article touches on the import/export tools that ship with HBase and shows how to use them efficiently.

A small intro on Hbase Utilities.

————————————————-
org.apache.hadoop.hbase.mapreduce.Export

Export is a utility that will dump the contents of a table to HDFS in a sequence file.
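Its basic invocation looks like this (the table name and HDFS output directory are placeholders):

hbase org.apache.hadoop.hbase.mapreduce.Export 'myTable' /user/hadoop/myTable-export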

org.apache.hadoop.hbase.mapreduce.CopyTable

CopyTable is a utility that can copy part or all of a table, either to the same cluster or to another cluster. The target table must first exist. The usage is as follows:
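A sketch of the invocation (--peer.adr is the target cluster's ZooKeeper quorum, client port, and znode parent; the names are placeholders):

hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
  --new.name=targetTable \
  --peer.adr=target-zk-host:2181:/hbase \
  sourceTable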

org.apache.hadoop.hbase.snapshot.ExportSnapshot

The ExportSnapshot tool copies all the data related to a snapshot (hfiles, logs, snapshot metadata) to another cluster. The tool executes a Map-Reduce job, similar to distcp, to copy files between the two clusters, and since it works at file-system level the hbase cluster does not have to be online.
————————————————-

The main difference between exporting a snapshot and copying/exporting a table is that ExportSnapshot operates at the HDFS level.

This means that the Master and RegionServers are not involved in this operation.

Consequently, no unnecessary caches for data are created and there is no triggering of additional GC pauses due to the number of objects created during the scan process.

So, when exporting a snapshot, the job is no longer CPU bound.

Please see :

  • http://hbase.apache.org/book.html#ops.snapshots
  • http://blog.cloudera.com/blog/2013/03/introduction-to-apache-hbase-snapshots/
  • http://www.cloudera.com/content/cloudera/en/documentation/core/v5-3-x/topics/cm_bdr_managing_hbase_snapshots.html
    Using snapshots, I was able to export my HBase table from a source cluster on the EMR 3.0.0 AMI to a target cluster on the latest (at the time of writing this post) EMR 3.9.0 AMI.

    This job was considerably faster than HBase Export.

    Prepping source and target clusters:

    ————————————————-

    —Enabling snapshots on Hbase—

    You must add the 'hbase.snapshot.enabled' property with value 'true' to hbase-site.xml on the HBase Master node:

    /home/hadoop/hbase/conf/hbase-site.xml

    Stop the HMaster process.

    > sudo service hbase-master stop

    service-nanny will restart this process again. This will load the new configuration.

    —Enabling communication between source and Target clusters—

    To transfer snapshot from source cluster to target cluster,

    The source cluster needs to communicate with the target master node's NameNode process, listening on the HDFS default port 9000 (hdfs getconf -confKey fs.default.name).

    If both these clusters are using the default EMR security groups, they can communicate by default. Otherwise, you may need to modify the security groups to allow communication over that port.

    (Check with telnet <target-master-ip> 9000 from the source cluster's master.)

    ————————————————-

    —Creating a snapshot on Source cluster—

    On source cluster(EMR 3.0.0 )

    > hbase shell
    hbase(main):001:0> snapshot 'sourceTable', 'snapshotName'

    check if the snapshot is created at HDFS /hbase/.hbase-snapshot/

    —Exporting this snapshot(snapshotName) to Hdfs /hbase/.hbase-snapshot/ of Target cluster(EMR 3.9.0)—

    using the target master node's internal IP 172.31.42.191, listening on HDFS port 9000, with 16 mappers, using 'ExportSnapshot':

    > hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot snapshotName -copy-to hdfs://172.31.42.191:9000/hbase -mappers 16

    ————————————————-

    On Target cluster, check if the snapshot is exported

    > hadoop fs -ls /hbase/.hbase-snapshot/

    —Clone this snapshot to actual Hbase table on Target cluster—

    > hbase shell
    hbase(main):001:0> clone_snapshot 'snapshotName', 'newTableName'

    ————————————————-
