YARN Log aggregation on EMR Cluster – How to?

Written by mannem on . Posted in EMR || Elastic Map Reduce


Why Log aggregation ?

User logs of Hadoop jobs serve multiple purposes. First and foremost, they can be used to debug issues while running a MapReduce application – correctness problems with the application itself, race conditions when running on a cluster, and debugging task/job failures due to hardware or platform bugs. Secondly, one can do historical analyses of the logs to see how individual tasks in job/workflow perform over time. One can even analyze the Hadoop MapReduce user-logs using Hadoop MapReduce(!) to determine any performance issues.

More info found here:
https://hortonworks.com/blog/simplifying-user-logs-management-and-access-in-yarn/

Without log aggregation, when you try to access the job history logs from the ResourceManager UI, you see this error:

“Aggregation is not enabled. Try the nodemanager at ip-172-31-41-154.us-west-2.compute.internal:8041”

This article explains how various EMR versions default the YARN log aggregation option. It also shows how to enable YARN log aggregation on EMR AMIs that do not have it on by default (both on a live cluster and while launching one).

EMR Log aggregation vs YARN Log aggregation:

EMR has its own log aggregation into an S3 bucket for persistent logging and debugging. The EMR web console provides a feature similar to “yarn logs -applicationId” if you turn on the debugging feature.

YARN log aggregation stores the application container logs in HDFS, whereas EMR’s LogPusher (the process that pushes logs to S3 as the persistent option) needs the files on the local file system. By default, YARN copies the container logs from the local disks of core nodes to HDFS and then, after aggregation, DELETES those local files on each node. Since the NodeManager on each node deletes them, EMR has no way to save those logs to a more persistent store such as S3. So, historically, EMR had to disable YARN log aggregation so that the container logs stay on the local machines and can be pushed to S3. For Hue’s integrated features, such as displaying application logs in the Hue console, yarn.log-aggregation-enable is turned on by default when you install Hue on EMR (either AMI 3.x.x or 4.x.x), and container logs might then be missing from S3.

This behavior changed in later EMR AMIs such as 4.3.0, where YARN log aggregation and EMR log aggregation can work together. EMR introduced a patch described in https://issues.apache.org/jira/browse/YARN-3929 into all EMR AMIs after EMR 4.3.0 (Hadoop branch-2.7.3-amzn-0). Note that this patch is not adopted in open source and is now part of EMR’s Hadoop.

With this patch, there is an option to keep the files on the local machines after log aggregation, managed by the “yarn.log-aggregation.enable-local-cleanup” property in yarn-site.xml on the respective core/task nodes. This property is not public and can only be set on EMR distributions. It defaults to FALSE, which means the cleanup on the local machines WILL NOT take place. LogPusher needs these logs on the local machines to push them to S3, and LogPusher is the one responsible for removing those local logs after they are copied over and after a certain retention period (4 hours for containers).

For the logs to be deleted from the local disks instead, flip this property to true with the configurations API while launching the cluster. On a live cluster, update yarn-site.xml on all core/task nodes and restart the NodeManager. After the restart, old container logs might still be present.
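As a sketch, the yarn-site classification to pass through the configurations API (or the AWS CLI --configurations option) at launch would look something like:

[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.log-aggregation.enable-local-cleanup": "true"
    }
  }
]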

*** This option is not recommended because, if it is set to true, LogPusher will NOT be able to push those local container logs to the customer’s (service’s) S3 bucket, and the only source of container logs will be the aggregated logs on HDFS, which is not as persistent.


If you decide not to rely on either EMR’s LogPusher (which pushes to S3) or the YARN NodeManager (which aggregates logs to HDFS), and have your own monitoring solution that uploads logs to, let’s say, Elasticsearch or Splunk, then there are a few things to consider:

1. Disable YARN log aggregation with yarn.log-aggregation-enable = false. This means the YARN NodeManager on core nodes will not push the container logs on its local disk to centralized HDFS. Note that LogPusher can still delete (and will try to upload) the container logs to S3 after 4 hours (see /etc/logpusher/hadoop.config).
2. Once log aggregation is disabled, yarn.nodemanager.log.retain-seconds comes into the picture and deletes logs on local disks, by default after 3 hours. This means the NodeManager may remove logs even before LogPusher tries to send them to S3 and delete them itself. So make sure you increase this time so that your custom monitoring application has enough time to ship the logs to your preferred destination (see the sketch after this list).
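As a sketch, a yarn-site classification that does both at cluster launch might look like the following (86400 seconds, i.e. 24 hours, is only an illustrative value; pick whatever gives your log shipper enough headroom):

[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.log-aggregation-enable": "false",
      "yarn.nodemanager.log.retain-seconds": "86400"
    }
  }
]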


EMR 4.3.0 and above / 5.0.0 and above:

YARN log aggregation is enabled by default and the logs will also get pushed to your S3 log bucket.
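On these releases you can pull a job’s container logs straight from the aggregated copy on the master node, for example (the application id below is just a placeholder):

yarn logs -applicationId application_1507311646202_0001 | less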


EMR 4.x.x:

AMI 4.0.0 doesn’t support Hue, so yarn.log-aggregation-enable=false is the default.

To enable it on 4.0.0 ->

Step 1: For 4.0.0, you may need to enable it by following the procedure described for 3.x.x below.

As AMI 4.x.x uses upstart, you need to use the following command rather than “sudo service”:

sudo restart hadoop-mapreduce-historyserver

The /etc/init/ folder has all the service configuration files.
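If you also need to bounce the YARN NodeManager on a node (for example after editing yarn-site.xml on a live cluster, as mentioned earlier), the same upstart pattern applies; hadoop-yarn-nodemanager is my assumption for the job name on these AMIs:

sudo stop hadoop-yarn-nodemanager
sudo start hadoop-yarn-nodemanager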

Step 2: /etc/hadoop/conf.empty/yarn-site.xml is the configuration file that needs to be edited.

Step 3: While launching the cluster, you can use the yarn-site release-label classification to make the necessary change to enable log aggregation.

Configuration that needs to be enabled:

yarn.log-aggregation-enable=true
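A minimal sketch of that yarn-site classification (in the format described in the configuration guide linked below) is:

[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.log-aggregation-enable": "true"
    }
  }
]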

You can refer to the following page to configure your cluster.
https://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-configure-apps.html

4.1.0, 4.2.0:

If you don’t install Hue, you need to turn it on explicitly with EMR’s configuration (yarn.log-aggregation-enable=true) using the yarn-site classification.

If you turn on yarn.log-aggregation-enable, either explicitly or by installing Hue, application container logs will not be saved to your S3 location; they will stay in HDFS for YARN’s log aggregation feature.
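Those aggregated logs land in HDFS under the directory configured by yarn.nodemanager.remote-app-log-dir. To inspect them from the master node (the path below is only an assumption for EMR defaults; check yarn-site.xml on your cluster if it differs):

hadoop fs -ls -R /var/log/hadoop-yarn/apps/ | head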


EMR 3.x.x:

Here’s how you enable log aggregation on a 3.x.x EMR cluster:

Step 1:

Change yarn-site.xml settings to enable log aggregation

On Master node:
> vim /home/hadoop/conf/yarn-site.xml
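Inside the <configuration> element, add the aggregation flag. The retention property is optional, and the value below (7 days) is only an example:

<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>604800</value>
</property>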

Step 2:

Copy this configuration on all nodes.

yarn node -list|sed -n "s/^\(ip[^:]*\):.*/\1/p" | xargs -t -I{} -P10 scp -o StrictHostKeyChecking=no -i ~/MyKeyName.pem /home/hadoop/conf/yarn-site.xml hadoop@{}:/home/hadoop/conf/

Where MyKeyName.pem is the private key for SSH’ing into the slaves.

The above command lists all slave nodes and uses scp to copy (replace) the yarn-site.xml present on the master onto the slave nodes.

Step 3:

Restart the Hadoop HistoryServer daemon on all nodes.

yarn node -list|sed -n "s/^\(ip[^:]*\):.*/\1/p" | xargs -t -I{} -P10 ssh -o StrictHostKeyChecking=no -i ~/MyKeyName.pem hadoop@{} "sudo service hadoop-mapreduce-historyserver restart"






Comments (2)

  • Karthik


    Hello,
    How to get aggregated logs from s3 like yarn logs. Is there a way other than manually/programmatically putting them together?


    • mannem


      Well, they are already put together and organized into EMR S3 log bucket with objects named according to the application-id and container-id.
      Ex :
      s3://emrlogbucket/j-2UZUOX8QK123/containers/application_1507311646202_0001/container_1507311646202_0001_01_000001/stderr.gz

      This is just like ‘yarn logs’, where you have options like -containerId and -applicationId.

      All you need to do is use the S3 APIs or even Hadoop to cat those logs.
      hadoop fs -cat s3://emrlogbucket/j-2UZUOX8QK123/containers/application_1507311646202_0001/container_1507311646202_0001_01_000001/stderr.gz

