Posts Tagged ‘YARN’

Spark UI vs. Spark History Server UI: which one to use and why?

Written by mannem on . Posted in AWS BIG DATA, EMR || Elastic Map Reduce

Is the job running?

1. If you have Spark applications running, you should use the Spark UI. This UI is hosted on the Spark driver:
– In YARN cluster mode, the driver runs inside the YARN Application Master, which is placed on an arbitrary core node.
– In YARN client mode, the driver runs on the node where spark-submit was invoked, which on EMR is typically the master node itself.
To access the Spark UI, go to the YARN ResourceManager UI first. Then navigate to the corresponding Spark application and use its “Application Master” link. If you look at the link, it takes you to the application master’s web UI at port 20888. This is a proxy running on the master node, listening on port 20888, which exposes the Spark UI (which itself runs on either a core node or the master node).

2. You can also access the Spark UI by going directly to the driver hostname and port where it is hosted.
For example, when I ran spark-submit in cluster mode, it spun up application_1569345960040_0007. In my driver logs I see the messages below:
19/09/24 22:29:15 INFO Utils: Successfully started service 'SparkUI' on port 35395.
19/09/24 22:29:15 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://ip-10-0-0-69.myermdomain.com:35395
Here ip-10-0-0-69.myermdomain.com is one of my core nodes.
So I can go to
http://ip-10-0-0-69.myermdomain.com:35395
which automatically redirects me to the master node’s proxy server listening on port 20888:
http://ip-10-0-0-113.ec2.internal:20888/proxy/application_1569345960040_0007/
Note that these links are temporary and only show the UI while the Spark application is running.
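
The driver log line and the proxy link above can be tied together with a small helper. This is only a minimal sketch in Python and not part of the original setup: the function names find_spark_ui_url and yarn_proxy_url are made up for illustration, the regular expression assumes the driver log format shown above, and 20888 is simply the proxy port observed on my cluster.

```python
import re
from typing import Optional

# Assumed log format: "... Bound SparkUI to 0.0.0.0, and started at http://host:port"
# (\s+ also tolerates the URL being wrapped onto the next line).
SPARK_UI_LINE = re.compile(r"Bound SparkUI to [^\s,]+, and started at\s+(http://\S+)")


def find_spark_ui_url(driver_log: str) -> Optional[str]:
    """Return the Spark UI URL announced in the driver log, or None if absent."""
    match = SPARK_UI_LINE.search(driver_log)
    return match.group(1) if match else None


def yarn_proxy_url(master_host: str, application_id: str, proxy_port: int = 20888) -> str:
    """Build the master-node proxy link for a running YARN application."""
    return f"http://{master_host}:{proxy_port}/proxy/{application_id}/"
```

Given the log lines above, find_spark_ui_url returns the direct driver URL, and yarn_proxy_url("ip-10-0-0-113.ec2.internal", "application_1569345960040_0007") rebuilds the proxy link that the browser ends up on.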

Is the job completed?

If you want to see the UI after the Spark job has completed, use the Spark History Server UI directly at http://master-public-dns-name:18080/.
The Spark History Server can also be used for running jobs via the “Show Incomplete Applications” button. It does this using the Spark event logs, which are enabled on EMR by default.
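
If you would rather check running versus completed applications without clicking through the UI, the History Server also exposes a REST API under /api/v1 (described in Spark’s monitoring documentation). Here is a rough sketch in Python, assuming the History Server is reachable on port 18080 as above. The helper names history_api_url and list_applications are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def history_api_url(history_server: str, status: str = "running") -> str:
    """Build the /api/v1/applications URL filtered by application status."""
    if status not in ("running", "completed"):
        raise ValueError("status must be 'running' or 'completed'")
    base = history_server.rstrip("/")
    return f"{base}/api/v1/applications?{urlencode({'status': status})}"


def list_applications(history_server: str, status: str = "running", timeout: float = 10.0):
    """Fetch the application list as parsed JSON (needs network access to the server)."""
    with urlopen(history_api_url(history_server, status), timeout=timeout) as response:
        return json.load(response)
```

For example, list_applications("http://master-public-dns-name:18080") would return the applications the History Server currently considers incomplete, provided the port is reachable from where you run it.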

Differences between Spark UI and Spark History UI

However, the Spark History Server has some differences compared to the Spark UI (for running apps, of course). Some that I observed are:
– The Spark UI has a “Kill” button so you can kill Spark stages, while the Spark History Server doesn’t.
– The Spark UI has a “SQL” tab which shows more information about spark-sql jobs, while the Spark History Server doesn’t.
– The Spark UI can pull up live thread dumps for executors, while the Spark History Server can’t.
– The Spark UI gives the most up-to-date info (like “Total Uptime”) on tasks, while there can be a bit of lag in the Spark History Server UI.

ENABLING DEBUG LOGGING – EMR MASTER GUIDE

Written by mannem on . Posted in AWS BIG DATA, EMR || Elastic Map Reduce

This guide collects configurations and procedures for enabling debug logging on the different daemons of an AWS EMR cluster.
[Please contribute to this article to add additional ways to enable logging]

HBASE on S3 :

This enables logging of the calls made through EMRFS from HBase.

It is important for troubleshooting S3 consistency issues and failures on an HBase on S3 cluster.

Enabling DEBUG on Hive Metastore daemon (its Datastore) on EMR :

or

Logs at /var/log/hive/user/hive/hive.log

HUE:

Set use_get_log_api=true in the beeswax section of the hue.ini configuration file.

Hadoop and MR :

Enable GC verbose on Hive Server 2 JVM:

WIRE OR DEBUG logging on EMR to check calls to S3 and DDB for DynamoDb connector library :

Paste the following into the log4j configuration files of Hadoop, Hive, Spark, etc.

/etc/hadoop/conf/log4j.properties
/etc/hadoop/conf/container-log4j.properties
/etc/hive/conf/hive-log4j2.properties
/etc/spark/conf/..

https://github.com/awslabs/emr-dynamodb-connector/blob/master/emr-dynamodb-hive/src/test/resources/log4j.properties

Debug on S3 Calls from EMR HIVE :

These metrics can be obtained from hive.log once debug logging is enabled in aws-java-sdk. To enable this logging, add the following line to '/etc/hive/conf/hive-log4j.properties'. The Configuration API can be used as well.

Enable DEBUG logging for Http Connection pool:

For Spark, add the following to /etc/spark/conf/log4j.properties

*Tez overwrites the loglevel options we have passed. Please see the related items.*

Enabling Debug on Hadoop log to log calls by EMRFS :

/etc/hadoop/conf/log4j.properties

You can use the same logging config for other applications like Spark or HBase, using their respective log4j config files as appropriate. You can also use an EMR log4j configuration classification like hadoop-log4j or spark-log4j to set those configs while starting the EMR cluster. (See below for sample JSON for the configuration API.)
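
As a hedged illustration of what such a configuration classification looks like, the Python sketch below builds the JSON for a hadoop-log4j classification. The Classification/Properties layout is the standard EMR configuration format. The helper name log4j_classification is made up, and the logger name com.amazon.ws.emr.hadoop.fs is my assumption for the EMRFS package, so verify it against your EMR release before relying on it.

```python
import json
from typing import Dict


def log4j_classification(classification: str, loggers: Dict[str, str]) -> dict:
    """Build one EMR configuration object that sets log4j logger levels."""
    return {
        "Classification": classification,
        "Properties": {
            f"log4j.logger.{name}": level for name, level in loggers.items()
        },
    }


# Assumption: com.amazon.ws.emr.hadoop.fs is the package EMRFS logs under.
config = [log4j_classification("hadoop-log4j", {"com.amazon.ws.emr.hadoop.fs": "DEBUG"})]
print(json.dumps(config, indent=2))
```

The printed JSON can be supplied as the configurations when creating the cluster. Swapping "hadoop-log4j" for "spark-log4j" targets the Spark log4j file instead.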

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html

DEBUG on EMR Logpusher Logs :

Edit this file manually on the master and slave nodes, then restart Logpusher.

/etc/logpusher/logpusher-log4j.properties

(You might need to stop service-nanny before stopping Logpusher, so that Logpusher stops and starts properly.)

DEBUG on Spark classes :

Use the following EMR config to set DEBUG level for relevant class files.

DEBUG using spark shell:

Execute the following commands after invoking spark-shell to enable DEBUG logging on the respective Spark classes, like Memstore. You can use the same approach if you want to reduce the amount of logging from INFO (the default, which comes from log4j.properties in the Spark conf) to ERROR.
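
The commands above are meant for spark-shell. As a related sketch from the pyspark shell instead, SparkContext.setLogLevel changes the log level at runtime. Note that this sets the level for the whole application rather than for individual classes, so it is coarser than per-class log4j settings. The wrapper name set_log_level is hypothetical, and sc is assumed to be the SparkContext that the pyspark shell provides.

```python
# Log levels accepted by SparkContext.setLogLevel.
VALID_LEVELS = {"ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN"}


def set_log_level(sc, level: str) -> str:
    """Validate the level name, apply it to the given SparkContext, and return it."""
    normalized = level.upper()
    if normalized not in VALID_LEVELS:
        raise ValueError(f"unsupported log level: {level!r}")
    sc.setLogLevel(normalized)
    return normalized


# In the pyspark shell, where `sc` already exists:
#   set_log_level(sc, "DEBUG")   # more verbose
#   set_log_level(sc, "ERROR")   # quieter than the INFO default
```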

EMRFS CLI command like EMRFS SYNC :

/etc/hadoop/conf/log4j.properties

Logs go to the console output. You may need to redirect them to a file, or do both.
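
One way to do both, sketched in Python, is to run the command as a subprocess and copy each output line to the console and to a file. The helper name run_and_tee, the log path, and the bucket in the usage comment are all placeholders of mine.

```python
import subprocess
import sys
from typing import List


def run_and_tee(command: List[str], log_path: str) -> int:
    """Run a command, echoing its combined stdout/stderr to the console and a file."""
    with open(log_path, "w") as log_file:
        process = subprocess.Popen(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr so debug output is captured too
            text=True,
        )
        for line in process.stdout:
            sys.stdout.write(line)
            log_file.write(line)
        return process.wait()


# Hypothetical usage on the master node (bucket and path are placeholders):
#   run_and_tee(["emrfs", "sync", "s3://my-bucket/my-path"], "/tmp/emrfs-sync.log")
```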

Enable Debug on Boto3 client :
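
A minimal sketch of one way to do this, using only Python’s standard logging module: boto3 and botocore log through loggers named "boto3" and "botocore", so raising their level to DEBUG and attaching a handler surfaces the request and response details. The function name enable_boto_debug is mine. boto3 also ships a boto3.set_stream_logger helper that serves a similar purpose.

```python
import logging


def enable_boto_debug(level: int = logging.DEBUG):
    """Attach a stream handler at the given level to the boto3 and botocore loggers."""
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s [%(levelname)s] %(message)s")
    )
    loggers = []
    for name in ("boto3", "botocore"):
        logger = logging.getLogger(name)
        logger.setLevel(level)
        if not logger.handlers:  # avoid duplicate output on repeated calls
            logger.addHandler(handler)
        loggers.append(logger)
    return loggers
```

Call enable_boto_debug() before creating the client. Be aware that DEBUG output from botocore is verbose and can include request details, so avoid leaving it on in shared logs.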