EMR records instance state logs every 15 minutes. These logs help you identify several important metrics for your EMR nodes, such as memory and disk usage. In this article, I will show you how to recognize the important metrics and how to interpret them when diagnosing an issue.
Finding Instance State logs of EMR cluster :
On EMR nodes, go to
On S3, you will find them in a path like
Things to check :
CPU load average :
If the load average is higher than the CPU count of the instance type, daemons may have trouble communicating with each other, which can cause all sorts of problems with HDFS and with shuffles in jobs.
# how long have we been up
17:32:57 up 35 min, 0 users, load average: 1.10, 1.00, 0.46
CPU load average is the average number of processes being executed or waiting to be executed over the past 1, 5 and 15 minutes. So the numbers shown above mean:
- load average over the last 1 minute is 1.10
- load average over the last 5 minutes is 1.00
- load average over the last 15 minutes is 0.46
Let's say my EC2 instance type is m5.large, which has 2 vCPUs according to
If my load average is larger than 2, then I should be concerned.
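As a rough illustration of this check, the Python sketch below parses an uptime line like the one shown above and compares each load average to a vCPU count. The function names parse_load_averages and is_overloaded are hypothetical, and the "any value above the vCPU count" rule is only one simple way to apply the guideline. You supply the vCPU count for your own instance type.

```python
import re


def parse_load_averages(uptime_line):
    """Return the (1 min, 5 min, 15 min) load averages from an uptime line."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load average found in: %r" % uptime_line)
    return tuple(float(value) for value in match.groups())


def is_overloaded(uptime_line, vcpu_count):
    """True if any of the three load averages exceeds the vCPU count.

    Assumption: this threshold rule is a simplification for illustration.
    """
    return any(load > vcpu_count for load in parse_load_averages(uptime_line))


line = "17:32:57 up 35 min, 0 users, load average: 1.10, 1.00, 0.46"
print(parse_load_averages(line))  # (1.1, 1.0, 0.46)
print(is_overloaded(line, 2))     # False
```

With 2 vCPUs, none of the three values above crosses the threshold, so this node would not be flagged.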
Check whether any processes are occupying a lot of CPU and memory.
Search for running processes such as 'HRegionServer' to verify that the process is running. Compare with the previous instance state log to see whether the PID (process ID) of that process has changed. If the PID has changed, the process was most probably killed by an out-of-memory (OOM) condition at some point between the two logs.
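The PID comparison can be sketched in Python as follows. This is only an illustration: it assumes the process list in the log has a ps-aux-like layout with the PID in the second whitespace-separated column, which you should verify against your own logs. The function names pids_for and pid_changed are hypothetical, and the two sample lines are made up.

```python
def pids_for(snapshot_text, process_name):
    """Collect the PIDs of all lines that mention process_name.

    Assumption: ps-aux-like layout, PID in the second column.
    """
    pids = set()
    for line in snapshot_text.splitlines():
        fields = line.split()
        if process_name in line and len(fields) > 1 and fields[1].isdigit():
            pids.add(int(fields[1]))
    return pids


def pid_changed(previous_snapshot, current_snapshot, process_name):
    """True if the process appears in both snapshots under different PIDs."""
    before = pids_for(previous_snapshot, process_name)
    after = pids_for(current_snapshot, process_name)
    return bool(before) and bool(after) and before != after


# Made-up sample lines, for illustration only.
previous = "hbase 4321 2.0 10.5 123456 7890 ? Sl 17:00 0:42 java HRegionServer start"
current = "hbase 9876 1.5 9.8 123456 7890 ? Sl 17:20 0:05 java HRegionServer start"
print(pid_changed(previous, current, "HRegionServer"))  # True
```

A True result only says the PID differs between the two logs. The reason for the restart still has to be confirmed elsewhere.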
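vmstat r and b columns :
r = processes running or waiting for run time.
b = blocked processes (in uninterruptible sleep). Ideally, no processes should be blocked.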
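To make the vmstat check concrete, here is a small Python sketch that pulls the r and b columns out of vmstat output and flags samples with blocked processes. It assumes the standard Linux vmstat layout, where a header row starting with "r b" precedes the data rows. The function name runnable_and_blocked and the sample numbers are hypothetical.

```python
def runnable_and_blocked(vmstat_text):
    """Return a list of (r, b) pairs, one per vmstat data row.

    Assumption: standard Linux vmstat layout, with a header row whose
    first two columns are 'r' and 'b'.
    """
    samples = []
    seen_header = False
    for line in vmstat_text.splitlines():
        fields = line.split()
        if fields[:2] == ["r", "b"]:
            seen_header = True
            continue
        if (seen_header and len(fields) >= 2
                and fields[0].isdigit() and fields[1].isdigit()):
            samples.append((int(fields[0]), int(fields[1])))
    return samples


# Made-up sample output, for illustration only.
sample = """procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 123456   7890 234567    0    0     5    10  100  200  3  1 95  1  0
 3  2      0 120000   7890 234567    0    0   900    40  300  500 10  5 60 25  0
"""
samples = runnable_and_blocked(sample)
print(samples)                             # [(1, 0), (3, 2)]
print([b > 0 for _, b in samples])         # [False, True]
```

In this made-up sample, the second row has two blocked processes and would be worth a closer look.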
Check for OS-level issues: if the OS is out of memory, you will see it killing important processes seemingly at random.
Check free memory. Do not rely too heavily on this, as free -m is only recorded every 15 minutes and is not a true representation of memory usage over the entire period.
Check disk space.