
Things to check in EMR Instance state logs

Written by mannem. Posted in AWS BIG DATA, EMR || Elastic Map Reduce


EMR writes instance state logs every 15 minutes. They capture several important metrics for your EMR nodes, such as CPU load, memory and disk usage. In this article, I will show you how to find these metrics and how to interpret them when diagnosing an issue.

 

Finding the instance state logs of an EMR cluster:

On EMR nodes, go to /emr/instance-state/

On S3, you will find them under a path like

s3://emr-log-bucket/j-QHD70YCKZWTG/node/i-0d861d80c83e33ec0/daemons/instance-state/

Things to check:

CPU load average:

If the load average is higher than the vCPU count of the instance type, the node is overloaded. That can cause communication problems between daemons, and with them all sorts of issues with HDFS and with shuffles in jobs.

The CPU load average is the average number of processes that are running or waiting to run over the past 1, 5 and 15 minutes (on Linux it also counts processes blocked in uninterruptible I/O). For example, if the log shows load averages of 1.10, 1.00 and 0.46, that means:

  • load average over the last 1 minute is 1.10
  • load average over the last 5 minutes is 1.00
  • load average over the last 15 minutes is 0.46

Let's say my EC2 instance type is m5.large, which has 2 vCPUs according to

https://aws.amazon.com/ec2/instance-types/m5/

If my load average stays above 2, then I should be concerned.
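
The check above can be sketched in a few lines of Python. This is only an illustration, not an EMR tool: the function names and the sample line are hypothetical, and it assumes the log contains an uptime-style line of the form "load average: a, b, c".

```python
import re

def parse_load_averages(line):
    """Return the (1 min, 5 min, 15 min) load averages from an uptime-style line."""
    match = re.search(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", line)
    if match is None:
        raise ValueError("no load average found in line")
    return tuple(float(value) for value in match.groups())

def load_is_high(line, vcpu_count):
    """True if any of the three load averages exceeds the vCPU count."""
    return any(avg > vcpu_count for avg in parse_load_averages(line))

# Hypothetical sample line; the real log line may be formatted differently.
sample = " 10:15:01 up 3 days,  2:01,  0 users,  load average: 1.10, 1.00, 0.46"
print(load_is_high(sample, vcpu_count=2))  # m5.large has 2 vCPUs
```

With the numbers from the example, none of the three averages exceeds 2, so the check reports no problem.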

top:

Check whether any processes are using a lot of CPU or memory.

Process list (ps):

Search for a process such as ‘HRegionServer’ to verify that it is running. Then look at the previous instance state log to see whether the PID (process ID) of that process has changed. If the PID changed, the process was restarted between the two snapshots, most probably because it was killed for running out of memory (OOM).
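
As a rough sketch, the PID comparison between two snapshots could look like this in Python. The function names and sample lines are hypothetical, and the code assumes `ps -ef` style output where the PID is the second column; adjust the column index if the process list in your logs is laid out differently.

```python
def pids_for(ps_text, name):
    """Collect PIDs of lines mentioning `name` (assumes the PID is the 2nd column)."""
    pids = set()
    for line in ps_text.splitlines():
        fields = line.split()
        if name in line and len(fields) > 1 and fields[1].isdigit():
            pids.add(int(fields[1]))
    return pids

def pid_changed(previous_ps, current_ps, name):
    """True if the process appears in both snapshots but under different PIDs."""
    before = pids_for(previous_ps, name)
    after = pids_for(current_ps, name)
    return bool(before) and bool(after) and before != after

# Hypothetical excerpts from two consecutive instance state logs.
previous = "hbase  4021  1  2 09:00 ?  00:03:10 java HRegionServer start"
current = "hbase  9177  1  5 10:07 ?  00:00:12 java HRegionServer start"
print(pid_changed(previous, current, "HRegionServer"))
```

If the process is missing from the newer snapshot altogether, `pids_for` returns an empty set, which is worth checking separately.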

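vmstat (r and b columns):

r = processes that are running or waiting for CPU time. If it stays above the vCPU count, the node is short on CPU.

b = blocked processes (in uninterruptible sleep, usually waiting on I/O). Processes shouldn't be blocked, so this should normally be 0.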
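
A minimal Python sketch for pulling the r and b columns out of vmstat output follows. The function name and the sample text are hypothetical, and it assumes the standard vmstat layout where a header row starting with "r  b" precedes the numeric rows.

```python
def vmstat_r_b(vmstat_text):
    """Return a list of (r, b) pairs from the numeric rows of vmstat output."""
    pairs = []
    header_seen = False
    for line in vmstat_text.splitlines():
        fields = line.split()
        if fields[:2] == ["r", "b"]:
            header_seen = True  # numeric rows follow this header
        elif (header_seen and len(fields) >= 2
              and fields[0].isdigit() and fields[1].isdigit()):
            pairs.append((int(fields[0]), int(fields[1])))
    return pairs

# Hypothetical sample in the usual vmstat layout.
sample = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 123456   7890 456789    0    0     1     2   10   20  3  1 96  0  0
 5  2      0 101010   7890 456789    0    0   900   800  300  400 40 10 20 30  0
"""
blocked = [pair for pair in vmstat_r_b(sample) if pair[1] > 0]
print(blocked)
```

Any row that ends up in `blocked` is a sample where at least one process was blocked.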

dmesg:

Use this to spot OS-level issues. For example, if the OS runs out of memory, you will see the kernel killing processes, sometimes important ones.
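
One way to scan a dmesg dump for these kills is a simple pattern search, sketched below in Python. The function name, the pattern list and the sample lines are illustrative assumptions; the exact kernel wording varies between kernel versions, so treat the pattern as a starting point.

```python
import re

# Phrases the Linux kernel commonly logs around OOM kills (wording varies by version).
OOM_PATTERN = re.compile(r"out of memory|oom-killer|killed process", re.IGNORECASE)

def oom_lines(dmesg_text):
    """Return the dmesg lines that look related to the OOM killer."""
    return [line for line in dmesg_text.splitlines() if OOM_PATTERN.search(line)]

# Hypothetical dmesg excerpt for illustration only.
sample = """\
[12345.678901] java invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[12345.679100] Out of memory: Kill process 4021 (java) score 812 or sacrifice child
[12345.679300] Killed process 4021 (java) total-vm:8123456kB, anon-rss:6123456kB
[12400.000000] EXT4-fs (xvdb): mounted filesystem with ordered data mode
"""
for line in oom_lines(sample):
    print(line)
```

The process ID in a "Killed process" line can then be matched against the process list from the previous snapshot.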

free -m:

Use this to check free memory. Do not rely on it too heavily: free -m is only recorded every 15 minutes, so it is not a true representation of memory usage over the entire period.

df -h:

Use this to check disk space.
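
To flag nearly full disks automatically, the df -h output can be parsed as in the Python sketch below. The function name, the 90% threshold and the sample output are assumptions for illustration; it expects the usual six-column layout (Filesystem, Size, Used, Avail, Use%, Mounted on) and skips rows that do not fit it.

```python
def full_filesystems(df_text, threshold_percent=90):
    """Return (mount_point, use_percent) for filesystems at or above the threshold."""
    hits = []
    for line in df_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6 or not fields[4].endswith("%"):
            continue  # wrapped or unexpected row
        used = fields[4].rstrip("%")
        if used.isdigit() and int(used) >= threshold_percent:
            hits.append((fields[5], int(used)))
    return hits

# Hypothetical df -h output.
sample = """\
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       10G  9.2G  0.8G  92% /
/dev/xvdb       100G   20G   80G  20% /mnt
"""
print(full_filesystems(sample))
```

The threshold is a judgment call; pick whatever level leaves enough headroom for HDFS and shuffle data on your nodes.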