
Common issues of disks going full on an EMR cluster (or, in general, any Hadoop/Spark cluster)

Written by mannem. Posted in AWS BIG DATA, EMR || Elastic Map Reduce

A disk going full can make YARN on EMR UNHEALTHY, so customers need to identify and proactively predict how each application (Hadoop, Spark, etc.) can occupy disk space and act accordingly. This article focuses on the most common ways an EMR cluster's disks fill up and recommends actions for each specific scenario.

Some common factors occupying disk space on EMR:

  • HDFS (/mnt/hdfs/), i.e., dfs.datanode.data.dir, may be the one occupying the most space.
  • YARN container logs (/mnt/var/log/hadoop-yarn/containers/).
  • Files localized during a Hadoop/Spark job run using the YARN framework (yarn.nodemanager.local-dirs/filecache , ../usercache/filecache , ../usercache//appcache/<app-id>/), where yarn.nodemanager.local-dirs is usually /mnt/yarn/ in a single-disk setup. For instances with multiple (non-root) disks, yarn.nodemanager.local-dirs lists multiple directories as a comma-separated value.
  • Spark application history logs (hdfs:///var/log/spark/apps/).
  • It may also be a combination of all of the above.

YARN LOGS:

/mnt/var/log/hadoop-yarn/

If it's the logs (/var/log is symlinked to /mnt/var/log/) that occupy the most space in that list, we can use multiple mount points for the yarn.nodemanager.log-dirs setting (comma separated). Currently, EMR uses only one mount point for storing YARN container logs.

The container logs on local machines should ideally be deleted by the following components, in this order:
1. By the YARN NodeManager, after log aggregation. – (Logic altered by the EMR team)
2. By LogPusher, after the retention period.
3. By Instance Controller's DiskSpaceManager (DSM), when its heuristics are satisfied.

1.
In YARN, if log aggregation is turned on (with the yarn.log-aggregation-enable config), container logs are copied to HDFS when the Spark application completes and, after aggregation, they are expected to be deleted from the local machine by the NodeManager's AppLogAggregatorImpl. On EMR, however, we keep them on local machines because LogPusher needs those logs to push them to S3 (LogPusher cannot push logs from HDFS). So EMR introduced a feature in its Hadoop branch-2.7.3-amzn (not adopted in open source) via an internal commit.

With this commit, there is an option to keep the files on local machines after log aggregation, managed by the "yarn.log-aggregation.enable-local-cleanup" property in yarn-site.xml on the respective core/task nodes. This property is not public and can only be set on EMR distributions. In the latest EMR AMIs, this option is set to 'false', which means the cleanup WILL NOT take place.

– For the logs to be deleted from local disks, we need to flip it to true with the configurations API while launching the cluster. On a live cluster, yarn-site.xml on all core/task nodes should be updated and the NM restarted. After the restart, old container logs might still be present. (Read below.)

*** This option might not be recommended, because LogPusher will NOT be able to push those local container logs to the customer's (service's) S3 bucket if it is set to true,
** and the only source of container logs will be the aggregated logs on HDFS, which are not as persistent.
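
For reference, a minimal configurations-API sketch (assuming the yarn-site classification) that flips this cleanup flag at cluster launch; keep the caveat above in mind, since LogPusher will then no longer ship these container logs to S3:

[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.log-aggregation.enable-local-cleanup": "true"
    }
  }
]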

2. Since the logs are not cleaned up on the local machine (LogPusher needs them in local dirs), we rely on that same LogPusher to delete those local files after a retention period of 4 hours, specifically for "/var/log/hadoop-yarn/containers" (/etc/logpusher/hadoop.config). LogPusher will only delete logs if they have not been touched in four hours.

3. Instance Controller's DiskSpaceManager is a kind of fail-safe to avoid the disk filling up: if disk usage goes beyond a certain percentage, DSM marks some files (including local container logs) for deletion. DSM does seem to have issues deleting the log files because of user/permission issues; ideally it needs to list and delete logs from all users (yarn/spark/hive), not just the hadoop user's logs.

Hadoop & Spark Streaming Jobs :

In a streaming application (Hadoop or Spark on YARN), it is reasonable to expect that a log will be touched at least once every four hours for the entire lifetime of the streaming job, so LogPusher never deletes the file. This can fill the disk, which can lead the customer to want to spread logs across multiple mounts. Spreading across multiple mounts is not the best solution: we specifically put logs on one mount to leave space on the customer's cluster for data.

The correct solution here is to implement/configure log rotation for container logs. This way, if we rotate on an hourly basis, we:

  • keep the overall size of each log down
  • give LogPusher a chance to upload and delete old logs
  • save disk space for the customer, and
  • avoid having to add unnecessary features to LogPusher

Log rotation for Spark can be enabled via /etc/spark/conf/log4j.properties, rotating to ${spark.yarn.app.container.log.dir}/spark-%d{yyyy-MM-dd-HH-mm-ss}.log.
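
A minimal sketch of what such a log4j.properties could look like (log4j 1.x syntax; the appender name, hourly DatePattern, and layout are illustrative assumptions rather than the exact config from this post):

# illustrative sketch: rotate Spark container logs hourly
log4j.rootLogger=INFO, rolling
log4j.appender.rolling=org.apache.log4j.DailyRollingFileAppender
log4j.appender.rolling.File=${spark.yarn.app.container.log.dir}/spark.log
# '.'yyyy-MM-dd-HH rolls the file every hour
log4j.appender.rolling.DatePattern='.'yyyy-MM-dd-HH
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n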

Similarly log rotation can be done for Hadoop YARN logs using
/etc/hadoop/conf/container-log4j.properties
/etc/hadoop/conf/log4j.properties

HDFS DATA:

/mnt/hdfs/
If HDFS is occupying the most space, we might need to monitor the HDFS CloudWatch metric and trigger an auto-scaling (or manual) resize accordingly. After the resize, the blocks will NOT be re-balanced; only new data will go to the node you just added. Old HDFS data blocks will not balance out automatically, so you will need to re-balance HDFS until disk utilization on this node goes below 90%. More details are covered in the HDFS balancer documentation.
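
A quick sketch of kicking off that re-balance manually (the threshold value is illustrative):

# run the HDFS balancer as the hdfs user; -threshold is the allowed deviation
# (in %) of each DataNode's utilization from the cluster average
sudo -u hdfs hdfs balancer -threshold 10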
 
HDFS utilization > 90% doesn’t necessarily mean disk on a particular node will be > 90%. This really depends on HDFS replication and how blocks are spread around.

https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/emr-metricscollected.html

– The customer might need to check the HDFS replication factor: is it too large? (See the commands below.)

– Was there a recent scale-down that led to HDFS decommissioning, where blocks are moved to the remaining core nodes, thus filling them up?
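
A couple of commands for checking and, if appropriate, lowering replication (the paths below are hypothetical examples):

# what replication factor is configured cluster-wide?
hdfs getconf -confKey dfs.replication
# what replication factor does an existing file actually have?
hdfs dfs -stat '%r' /user/hadoop/some-file
# lower replication on existing data (and wait for it to complete)
hdfs dfs -setrep -w 2 /user/hadoop/some-dir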

YARN LOCALIZED FILES:

http://hortonworks.com/blog/resource-localization-in-yarn-deep-dive/

/mnt/yarn/

If /mnt/yarn/ (yarn.nodemanager.local-dirs), i.e., YARN localized files, is filling up, it can happen at different stages of the application.

1. /mnt/yarn/ (yarn.nodemanager.local-dirs)

On EMR, /mnt/yarn/ is configured in yarn-site.xml as yarn.nodemanager.local-dirs. The directories listed in this parameter are used:

–  During a MapReduce job, intermediate data and working files are written to temporary local files. Because this data includes the potentially very large output of map tasks, you need to ensure that the yarn.nodemanager.local-dirs property, which controls the location of local temporary storage for YARN containers, is configured to use disk partitions that are large enough.

– During resource localization by the YARN NM, i.e., when the NM downloads resources from a supported source (such as HDFS or HTTP) to the NodeManager node's local directory.

– After the job finishes, the NodeManagers automatically clean up the localized files, immediately by default.

Scenario 1: /mnt/yarn/usercache/hadoop/appcache/ is occupying the most space.

Localized files rarely fill up volumes; it's usually intermediate data from mappers that fills this up.

Troubleshooting steps  :

1. Confirm the existence of large intermediate output files. In this case, a single application's mapper attempts produced many GBs of intermediate data that were about to fill up the disk space on one core node.

2. We can also confirm from the mapper syslogs that this directory is being used for intermediate data (mapreduce.cluster.local.dir).

3. You can also refer to the NodeManager log entries written during resource localization.

This directory is used by multiple YARN applications and their containers during their lifecycle. First, check whether the application and its containers are still running and currently occupying disk space. If they are running and the corresponding appcache is filling the disk, then your application needs that cache; the NM does not delete any appcache that is currently being used by running containers, so you will need to provision more space to handle your application's cache. If multiple applications are running and filling up your appcache together, you might need to limit the parallelism of your applications or provision bigger volumes.

If the applications (and their containers) are not running and you continue to see the disk filling up with appcache, you might need to tune the NM to trigger its deletion service sooner. Some NM parameters you can configure (in yarn-site.xml) to change how the NM decides to trigger the deletion service and remove the appcache (a configuration sketch follows the list):

yarn.nodemanager.localizer.cache.cleanup.interval-ms : interval between cache cleanups.
yarn.nodemanager.localizer.cache.target-size-mb : target size of the localizer cache in MB, per local directory.
yarn.nodemanager.delete.thread-count : number of threads used by the NodeManager's deletion service.
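
A hedged yarn-site.xml sketch tightening these knobs (the values are illustrative, not recommendations):

<property>
  <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
  <value>300000</value> <!-- clean up every 5 minutes -->
</property>
<property>
  <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
  <value>5120</value> <!-- aim to keep the localizer cache around 5 GB per local dir -->
</property>
<property>
  <name>yarn.nodemanager.delete.thread-count</name>
  <value>8</value>
</property>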

You can also try running the YARN shared-cache admin cleanup command.

(To run it, make sure yarn-site.xml on the machine where you run it has the yarn.sharedcache.admin.address property defined (the default is 0.0.0.0:8047). You might even try the master IP instead of 0.0.0.0.)

Another parameter to watch out for is yarn.nodemanager.delete.debug-delay-sec: the number of seconds after an application finishes before the NodeManager's DeletionService deletes the application's localized file directory and log directory. It is set to 0 by default, which means the deletion service does not wait. If you set this to a large number, the appcache will not be deleted right after the application finishes and will exist until that delay has passed.

References :
https://hortonworks.com/blog/resource-localization-in-yarn-deep-dive/
https://hadoop.apache.org/docs/r2.7.3/hadoop-yarn/hadoop-yarn-common/yarn-default.xml

 

Spark’s usercache & SPARK on YARN :

/mnt/yarn/usercache/hadoop/appcache/


Ex:

2.8G  ./mnt/yarn/usercache/hadoop/appcache/application_1474295547515_0187/blockmgr-42cdcd45-fe7d-4f0d-bf4b-5c819d4ef15e
3.5G  ./mnt/yarn/usercache/hadoop/appcache/application_1474295547515_0187/blockmgr-840ac0bf-b0dd-4573-8f74-aa7859d83832

In the usercache directory, suppose there are a lot of big folders like blockmgr-b5b55c6f-ef8a-4359-93e4-9935f2390367 filling up with blocks from the block manager. That could mean you're persisting a bunch of RDDs to disk, or you have a huge shuffle. The first step is to figure out which of those it is and avoid the issue by caching in memory or designing to avoid huge shuffles. You can also consider increasing spark.shuffle.memoryFraction to use more memory for shuffling and spill less.

– In cluster mode, the local directories used by the Spark executors and the Spark driver will be the local directories configured for YARN (Hadoop YARN config yarn.nodemanager.local-dirs). If the user specifies spark.local.dir, it will be ignored. In client mode, the Spark executors will use the local directories configured for YARN while the Spark driver will use those defined in spark.local.dir. This is because the Spark driver does not run on the YARN cluster in client mode, only the Spark executors do.

– Access the application cache through yarn.nodemanager.local-dirs on the nodes on which containers are launched. This directory contains the launch script, JARs, and all environment variables used for launching each container.

http://spark.apache.org/docs/latest/running-on-yarn.html

spark.local.dir (default: /tmp) : Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks.

NOTE: In Spark 1.0 and later this is overridden by the SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager, which are the /mnt/ directories you were concerned about.

http://spark.apache.org/docs/latest/configuration.html
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence

 

SPARK HISTORY LOGS:

/var/log/spark/apps/

hdfs dfs -du -h /
15.7 M   /apps
0        /tmp
2.1 G    /user
199.3 G  /var

Within /var/log/spark/apps/ there are currently 15,326 files, ranging from 100 KB to 30 MB in size.

26.1 M   /var/log/spark/apps/application_1489085921151_9995_1
6.6 M    /var/log/spark/apps/application_1489085921151_9996_1
28.2 M   /var/log/spark/apps/application_1489085921151_9997_1
6.0 M    /var/log/spark/apps/application_1489085921151_9998_1
24.4 M   /var/log/spark/apps/application_1489085921151_9999_1

So why is this happening and how can I get these log files cleaned up once they have been saved to s3?

Those are Spark history logs. Their retention settings are separate from the YARN container log settings and can be configured to clean up at shorter intervals, as described here:

http://spark.apache.org/docs/latest/monitoring.html

The following spark-defaults configurations might help in cleaning up these logs:

spark.history.fs.cleaner.enabled : true
spark.history.fs.cleaner.interval : 1d
spark.history.fs.cleaner.maxAge : 7d

EMR edit software settings: [{"Classification":"spark-defaults","Properties":{"spark.history.fs.cleaner.maxAge":"7d","spark.history.fs.cleaner.interval":"1d","spark.history.fs.cleaner.enabled":"true"}}]

You can also disable history logs (event logs) if you don't care about them; for large files it doesn't work well anyway.
To disable them, pass "--conf spark.eventLog.enabled=false" to spark-submit.

But EMR's Apppusher might need these event logs to display the Spark application logs on the EMR console.

 

Some other factors to consider:

If there is an NM or RM restart during localization, there might be stale files in the usercache that are not deleted by the deletion service and persist after job completion. So you might need to delete them manually sometimes.

ENABLING DEBUG LOGGING – EMR MASTER GUIDE

Written by mannem. Posted in AWS BIG DATA, EMR || Elastic Map Reduce

This post contains different configurations and procedures to enable logging on the different daemons of an AWS EMR cluster.
[Please contribute to this article to add additional ways to enable logging]

HBASE on S3 :

This will enable logging of the calls HBase makes through EMRFS.

This is important for troubleshooting S3 consistency issues and failures on an HBase-on-S3 cluster.

Enabling DEBUG on Hive Metastore daemon (its Datastore) on EMR :

or

Logs at /var/log/hive/user/hive/hive.log

HUE:

Set use_get_log_api=true in the beeswax section of the hue.ini configuration file.
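
In hue.ini that looks roughly like this (a minimal sketch):

[beeswax]
  use_get_log_api=true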

Hadoop and MR :

Enable GC verbose on Hive Server 2 JVM:

WIRE OR DEBUG logging on EMR to check calls to S3 and DDB for DynamoDb connector library :

Paste the following into the log4j configurations of Hadoop / Hive / Spark, etc., as appropriate:

/etc/hadoop/conf/log4j.properties
/etc/hadoop/conf/container-log4j.properties
/etc/hive/conf/hive-log4j2.properties
/etc/spark/conf/..

https://github.com/awslabs/emr-dynamodb-connector/blob/master/emr-dynamodb-hive/src/test/resources/log4j.properties

Debug on S3 Calls from EMR HIVE :

These metrics can be obtained from hive.log when debug logging is enabled for the aws-java-sdk. To enable this logging, add the following line to '/etc/hive/conf/hive-log4j.properties'. The Configuration API can be used as well.
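
The exact line from the original post is not shown here; a commonly used equivalent (assuming log4j 1.x syntax and the AWS SDK for Java's com.amazonaws package) is:

# log AWS SDK activity from Hive
log4j.logger.com.amazonaws=DEBUG
# or, for a less noisy per-request summary:
log4j.logger.com.amazonaws.request=DEBUG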

Enable DEBUG logging for Http Connection pool:

(from Spark) by adding the following to /etc/spark/conf/log4j.properties:
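
The original lines are not reproduced here; a sketch along these lines (assuming the unshaded Apache HttpClient under org.apache.http; a shaded client would need its shaded package prefix instead) is:

log4j.logger.org.apache.http=DEBUG
# very verbose: logs request/response bytes on the wire
log4j.logger.org.apache.http.wire=DEBUG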

*Tez overwrites the loglevel options we have passed. Please see the related items.*

Enabling Debug on Hadoop log to log calls by EMRFS :

/etc/hadoop/conf/log4j.properties
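
The original property is not shown; a hedged example, assuming the EMRFS classes live under the com.amazon.ws.emr.hadoop.fs package on your AMI, would be:

log4j.logger.com.amazon.ws.emr.hadoop.fs=DEBUG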

You can use the same logging config for other applications like Spark/HBase using their respective log4j config files as appropriate. You can also use EMR log4j configuration classifications like hadoop-log4j or spark-log4j to set those configs while starting the EMR cluster (see below for a sample JSON for the configuration API).

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
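
A sample configuration-API JSON of that kind (the logger name is the same hedged assumption as above):

[
  {
    "Classification": "hadoop-log4j",
    "Properties": {
      "log4j.logger.com.amazon.ws.emr.hadoop.fs": "DEBUG"
    }
  }
]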

DEBUG on EMR Logpusher Logs :

Edit this file manually on the master/slave nodes and restart LogPusher.

/etc/logpusher/logpusher-log4j.properties

(Might need to stop Service-nanny before stopping Logpusher, to properly stop/start Logpusher)

DEBUG on Spark classes :

Use the following EMR config to set DEBUG level for relevant class files.
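
The original config block is not included here; a sketch using the spark-log4j classification (the org.apache.spark logger name is an illustrative choice) would be:

[
  {
    "Classification": "spark-log4j",
    "Properties": {
      "log4j.logger.org.apache.spark": "DEBUG"
    }
  }
]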

DEBUG using spark shell:

Execute the following commands after invoking spark-shell to enable DEBUG logging on the respective Spark classes, such as MemStore. You can use the same approach if you want to reduce the amount of logging from INFO (the default coming from log4j.properties in the Spark conf) to ERROR.

EMRFS CLI command like EMRFS SYNC :

/etc/hadoop/conf/log4j.properties

Logs will go to console output; we might need to redirect them to a file, or do both.
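
For example, a hedged invocation that keeps the console output and also writes a copy to disk (the bucket/prefix and log file path are placeholders):

emrfs sync s3://my-bucket/my-prefix/ 2>&1 | tee /tmp/emrfs-sync.log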

Enable Debug on Boto3 client :

EMR s3DistCp “–groupBy” Regex examples

Written by mannem. Posted in EMR || Elastic Map Reduce

The regex for S3DistCp --groupBy can sometimes be confusing. I usually use online regex tools like https://regex101.com/ to work out the string matching and grouping.

Here are some examples that I explored so far :

Example 1 :

Example 2 :

s3-dist-cp --src s3://support.elasticmapreduce/training/datasets/gdelt/ --dest hdfs:///gdeltWrongOutput1/ --groupBy '.*(\d{6}).*'

This command would not merge any files; it would just copy all files whose names contain a run of 6 digits, such as 20130401.export.CSV.gz, to the destination.
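
By contrast, a sketch that actually merges files uses the captured group as the grouping key; for the gdelt file names above, capturing the yyyyMM portion (the destination path is illustrative) concatenates all files of the same month into one output file:

s3-dist-cp --src s3://support.elasticmapreduce/training/datasets/gdelt/ \
           --dest hdfs:///gdeltByMonth/ \
           --groupBy '.*(\d{6})\d{2}\.export.*'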

Example 3 :

http://stackoverflow.com/questions/38374107/how-to-emr-s3distcp-groupby-properly

Example 4 :

If you want to concatenate matching files in the root directory, and all matching files inside a 'sample_directory', into a single file and compress it in gzip format, the regex shown on http://regexr.com/3ftn9 will concatenate all matched file contents and create one .gz file.
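
A hedged command shape for that (the bucket, prefix, and capture group are placeholders rather than the exact regex from the link): because every matched file yields the same captured value, they are all concatenated into one output, and --outputCodec=gz gzips it.

s3-dist-cp --src s3://my-bucket/ \
           --dest s3://my-bucket/merged/ \
           --groupBy '.*(sample_file).*' \
           --outputCodec=gz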

HiveServer2 crashing with OutOfMemoryError (OOM)?

Written by mannem. Posted in EMR || Elastic Map Reduce


Oftentimes HiveServer2 can be a single point of failure, and it can easily crash with OOM. If HiveServer2 restarts now and then, it is likely due to OOM, where it is set to be killed and re-spawned. We need to check the JVM options to see how this process behaves on OOM, analyze thread/heap dumps, and identify the underlying issue to mitigate it.

This post explains how to identify an OOM issue on HiveServer2, how to analyze thread/heap dumps, and how to mitigate the issue. Some of the discussion is generic to EMR.

PART 1 : How to identify OOM issue on HiveServer2

Default JVM options of HS2 on EMR 4.7.2 AMI :

Some observations from above JVM:

1. Heap space for the HiveServer2 JVM is 1000 MB, and this option is sourced from hive-env.sh.
2. Hive logging options (-Dhive.log.dir=/var/log/hive -Dhive.log.file=hive-server2.log -Dhive.log.threshold=INFO -server) are sourced from /etc/init/hive-server2.conf.
3. All other options starting from '-server' are set by EMR.
4. If this JVM hits an OutOfMemoryError, the -XX:OnOutOfMemoryError='kill -9 %p' option will kill HS2 on OOM.

If for some reason HiveServer2 goes OOM, it simply gets killed and there is no way to find out why: hive-server2.log and hive-server2.out may not show the OOM at all (you can check hive-server2.log to verify that HiveServer2 was restarted). So, we tweak the JVM options for HS2 to include GC verbose logging and to dump the heap on OutOfMemoryError.

With the following settings, if the HS2 process does go OOM, it will print an error like the one below to /var/log/hive/hive-server2.out. That message means the garbage collector of this HS2 JVM is taking an excessive amount of time and recovering very little memory in each run.

An example script to enable GC verbose logging on the HS2 JVM on EMR 4.x.x and 5.x.x could look like the following:
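
This is a hand-written sketch rather than the original script from the post; the JVM flags are standard HotSpot options, but verify the hive-env.sh path, log paths, and service names on your EMR release.

# append GC-verbose and heap-dump flags for HiveServer2 via hive-env.sh
sudo tee -a /etc/hive/conf/hive-env.sh > /dev/null <<'EOF'
if [ "$SERVICE" = "hiveserver2" ]; then
  export HADOOP_OPTS="$HADOOP_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
    -Xloggc:/var/log/hive/hiveserver2-gc.log \
    -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/hive"
fi
EOF

# restart HS2 so the new JVM options take effect (upstart on EMR 4.x)
sudo restart hive-server2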



PART 2 : How to analyze Thread/Heap dumps

With the above options, if HS2 fails with OOM it will log the error to /var/log/hive/hive-server2.out and will also dump the heap to a file like /var/log/hive/java_pid8548.hprof. There are multiple ways to open and analyze this file; I found the Eclipse Memory Analyzer Tool helpful for identifying memory leaks and exploring thread dumps across Java classes and sub-classes.




PART 3 : How to Mitigate this issue

1. Tweaking hive-site.xml by checking stack trace:

The best way to mitigate OOMs is to check the stack trace of the HiveServer2 process and see if there are any leaks. It is also a good idea to check the top memory consumers and act accordingly. For each of those threads, see if any hive-site.xml setting would reduce its memory consumption. Sometimes you may not have control over any of the thread specifics, in which case you may need to further increase the heap space.

For example : In the stack trace ,

– if you see multiple threads getting blocked, you can edit a setting like hive.server2.thrift.http.max.worker.threads

2. Mitigating OOM obviously involves increasing the heap space for HiveServer2.

The heap space is defined in hive-env.sh using an environment variable. It is important to identify the memory requirements of the three Hive services (HiveServer2, Hive metastore, and Hive clients) in advance, using load/concurrency tests, before moving to PROD. If you observe any OOMs, you may need to increase memory accordingly.

Identifying hive services and increasing memory :

If the string 'export HADOOP_HEAPSIZE=2048' is present in hive-env.sh, it is applied to all three Hive services when they are restarted. You can therefore use if-statements to provide different settings (HADOOP_HEAPSIZE and HADOOP_OPTS) for each of the three services.

Example contents of hive-env.sh :
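
The original file contents are not reproduced here; a hedged sketch (the heap sizes are illustrative) that branches on the $SERVICE variable Hive passes to hive-env.sh:

# default heap for Hive clients
export HADOOP_HEAPSIZE=2048

if [ "$SERVICE" = "hiveserver2" ]; then
  export HADOOP_HEAPSIZE=4096
fi

if [ "$SERVICE" = "metastore" ]; then
  export HADOOP_HEAPSIZE=3072
fi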

3. High availability and LoadBalancing of Hiveserver2 :

Increasing heap space for HS2 may help. However, per Cloudera, beyond a certain memory limit there is a possibility that you continue to hit OOM no matter how much memory you give HS2. In that case, "Cloudera recommends splitting HiveServer2 into multiple instances and load balancing once you start allocating >12 GB to HiveServer2. The objective is to size to reduce impact of Java garbage collection on active processing by the service."

You may checkout the limitations here : https://www.cloudera.com/documentation/enterprise/5-5-x/topics/cdh_ig_hive_install.html#concept_alp_4kl_3q

4. Watch out for any bugs in the LockManager :

https://hive.apache.org/javadocs/r2.0.0/api/org/apache/hadoop/hive/ql/lockmgr/

If you are using Hive's Table Lock Manager Service (by setting hive.support.concurrency to true), check whether there are issues with the lock manager, which is responsible for maintaining locks for concurrent-user support. The lock manager can be ZooKeeperHiveLockManager-based, or it can be EmbeddedLockManager, a shared lock manager for a dedicated Hive server where all locks are managed in memory. If the issue seems to be with these lock managers, you may need to edit their configs.

5. Check for generic version issues :

Check the Apache JIRA for known HiveServer2 issues in the Hive version you are running.

6. GC Tuning :

If the HS2 JVM is spending too much time on GC , you might consider some GC Tuning as discussed here : http://www.cubrid.org/blog/dev-platform/how-to-tune-java-garbage-collection/



YARN Log aggregation on EMR Cluster – How to ?

Written by mannem. Posted in EMR || Elastic Map Reduce


Why Log aggregation ?

User logs of Hadoop jobs serve multiple purposes. First and foremost, they can be used to debug issues while running a MapReduce application – correctness problems with the application itself, race conditions when running on a cluster, and debugging task/job failures due to hardware or platform bugs. Secondly, one can do historical analyses of the logs to see how individual tasks in job/workflow perform over time. One can even analyze the Hadoop MapReduce user-logs using Hadoop MapReduce(!) to determine any performance issues.

More info found here:
https://hortonworks.com/blog/simplifying-user-logs-management-and-access-in-yarn/

Without log aggregation, when you try to access the job history logs from the ResourceManager UI, you see this error:

“Aggregation is not enabled. Try the nodemanager at ip-172-31-41-154.us-west-2.compute.internal:8041”

This article explains how various EMR versions default the YARN log aggregation option. It also guides you through enabling YARN log aggregation on EMR AMIs that do not have it enabled by default (both on a live cluster and while launching a cluster).

EMR Log aggregation vs YARN Log aggregation :

EMR has its own log aggregation into an S3 bucket for persistent logging and debugging. The EMR web console provides a feature similar to "yarn logs -applicationId" if you turn on the debugging feature.

YARN log aggregation stores the application container logs in HDFS, whereas EMR's LogPusher (the process that pushes logs to S3 as a persistent option) needs the files on the local file system. The default behavior of YARN is to copy the container logs from the local disks of core nodes to HDFS and then, after aggregation, DELETE those local files on the individual core nodes. Since they are deleted by the NodeManager on the individual nodes, EMR has no way to save those logs to more persistent storage such as S3. So, historically, EMR had to disable YARN log aggregation so that the container logs stayed on local machines and could be pushed to S3. For Hue's integrated features, such as displaying application logs in the Hue console, when you install Hue on EMR on either AMI 3.x.x or 4.x.x, yarn.log-aggregation-enable is turned on by default and container logs might be missing from S3.

This behavior changed in later EMR AMIs, such as 4.3.0, where both YARN log aggregation and EMR log aggregation can work together. EMR basically introduced a patch, described in https://issues.apache.org/jira/browse/YARN-3929, to all EMR AMIs after EMR 4.3.0 (Hadoop branch-2.7.3-amzn-0). Note that this patch is not adopted in open source and is now part of EMR Hadoop.

With this commit, there is an option to keep the files on local machines after log aggregation, managed by the "yarn.log-aggregation.enable-local-cleanup" property in yarn-site.xml on the respective core/task nodes. This property is not public and can only be set on EMR distributions. It is set to FALSE by default, which means the cleanup on local machines WILL NOT take place. LogPusher needs these logs on local machines to push them to S3, and LogPusher is responsible for removing those local logs after they are copied over and after a certain retention period (4 hours for containers).

For the logs to be deleted from local disks, we need to flip it to true with the configurations API while launching the cluster. On a live cluster, yarn-site.xml on all core/task nodes should be updated and the NM restarted. After the restart, old container logs might still be present.

*** This option might not be recommended, because LogPusher will NOT be able to push those local container logs to the customer's (service's) S3 bucket if it is set to true,
** and the only source of container logs will be the aggregated logs on HDFS, which are not as persistent.


If you decide not to rely on either EMR's LogPusher (which pushes to S3) or the YARN NodeManager (which aggregates logs to HDFS), and instead have your own monitoring solution that uploads logs to, say, Elasticsearch or Splunk, then there are a few things to consider:

1. Disable YARN log aggregation using yarn.log-aggregation-enable = false. This means the YARN NodeManager on core nodes will not push the respective container logs on local disk to centralized HDFS. Note that LogPusher can still delete (and can try to upload) the container logs to S3 after 4 hours (see /etc/logpusher/hadoop.config).
2. Once log aggregation is disabled, yarn.nodemanager.log.retain-seconds comes into the picture; it deletes logs on local disks after 3 hours by default. This means the NodeManager may remove logs even before LogPusher tries to send them to S3 and delete them itself. So make sure you increase this time so that your custom monitoring application has enough time to send logs to your preferred destination.


EMR 4.3.0 and above / 5.0.0 and above :

YARN log aggregation is enabled by default and the logs will also get pushed to your S3 log bucket.


EMR 4.x.x :

AMI 4.0.0 doesn't support Hue, so yarn.log-aggregation-enable=false is the default.

To enable it on 4.0.0:

Step 1: For 4.0.0, you may need to enable it by following the procedure for 3.x.x below.

As AMI 4.x.x uses upstart, you need to use the following command rather than "sudo service":

sudo restart hadoop-mapreduce-historyserver

The /etc/init/ folder has all the service configuration files.

Step 2: /etc/hadoop/conf.empty/yarn-site.xml is the configuration file that needs to be edited.

Step 3: While launching the cluster, you can use the yarn-site release-label classification to make the changes needed to enable log aggregation.

Configuration that needs to be enabled :

yarn.log-aggregation-enable=true
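
For example, via the configuration API (a minimal sketch):

[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.log-aggregation-enable": "true"
    }
  }
]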

You can refer to the following page to configure your cluster.
https://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-configure-apps.html

EMR 4.1.0, 4.2.0 :

If you don't install Hue, you need to turn it on explicitly using EMR's configuration API (yarn.log-aggregation-enable=true) with the yarn-site classification.

If you turn on yarn.log-aggregation-enable explicitly or via Hue, application container logs will not be saved to your S3 location; they will stay on HDFS for YARN's log aggregation feature.


EMR 3.x.x :

Here's how you enable log aggregation on a 3.x.x EMR cluster:

Step 1:

Change yarn-site.xml settings to enable log aggregation

On Master node:
> vim /home/hadoop/conf/yarn-site.xml
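
Add something like the following inside <configuration> (the retention value is an illustrative example):

<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>604800</value> <!-- keep aggregated logs in HDFS for 7 days -->
</property>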

Step 2:

Copy this configuration on all nodes.

yarn node -list|sed -n "s/^\(ip[^:]*\):.*/\1/p" | xargs -t -I{} -P10 scp -o StrictHostKeyChecking=no -i ~/MyKeyName.pem /home/hadoop/conf/yarn-site.xml hadoop@{}:/home/hadoop/conf/

Where MyKeyName.pem is the private key for SSH’ing into slaves.

The above command lists all slave nodes and uses scp to copy (replace) the yarn-site.xml present on the master onto the slave nodes.

Step 3:

Restart hadoop HistoryServer daemon on all nodes.

yarn node -list|sed -n "s/^\(ip[^:]*\):.*/\1/p" | xargs -t -I{} -P10 ssh -o StrictHostKeyChecking=no -i ~/MyKeyName.pem hadoop@{} "sudo service hadoop-mapreduce-historyserver restart"




WEB LOG ANALYSIS

Written by mannem. Posted in Architectures


Amazon Web Services provides services and infrastructure to build reliable, fault-tolerant, and highly available web applications in the cloud. In production environments, these applications can generate huge amounts of log information. This data can be an important source of knowledge for any company that is operating web applications. Analyzing logs can reveal information such as traffic patterns, user behavior, marketing profiles, etc.

However, as the web application grows and the number of visitors increases, storing and analyzing web logs becomes increasingly challenging. This diagram shows how to use Amazon Web Services to build a scalable and reliable large-scale log analytics platform. The core component of this architecture is Amazon Elastic MapReduce, a web service that enables analysts to process large amounts of data easily and cost-effectively using a Hadoop hosted framework.