
Expectations while using EMRFS

Written by mannem on . Posted in AWS BIG DATA, EMR || Elastic Map Reduce

Expectations while using EMRFS:

– The hadoop fs -put command updates both S3 and the EMRFS table (the file is created in S3 and an entry is added to the EMRFS table).
Ex) hadoop fs -put ./111.txt s3://mannem/emrfstest1
– The hadoop fs -rm command updates S3 but does not update the EMRFS table (the object is deleted in S3, but its entry remains in the EMRFS table).
Ex) hadoop fs -rm -f s3://mannem/emrfstest1/111.txt
– The hadoop fs -mv command renames the object in S3 (the new name is created and the old one deleted), but only adds the new name to the EMRFS table (the existing entry is not removed).
Ex) hadoop fs -mv s3://mannem/emrfstest1/emrfs-didnt-work.png s3://mannem/emrfstest1/emrfs-didnt-work_new.png
– Adding files from the S3 console (web UI) adds them to S3 but not to the EMRFS table.
– Deleting files from the S3 console (web UI) deletes them in S3, but they are not removed from the EMRFS table.
– Renaming a file in the S3 console (web UI) changes the name in S3, but does not rename it in the EMRFS table.
– The EMRFS CLI (for example, emrfs delete or emrfs sync) does not change the actual data in S3; it only adds or deletes entries in the DynamoDB metadata table used by EMRFS.
– EMRFS uses DynamoDB. On EMR clusters, you can view the metadata table name with the emrfs describe-metadata command, or see it in the EMR web console.


– In the EMRFS table, the S3 prefix is stored as the hash key and the object name as the range key.
– You can optionally specify the number of retries, and the time to wait before the next retry, when an exception occurs: http://docs.aws.amazon.com/emr/latest/ManagementGuide/emrfs-retry-logic.html
– Please refer to http://docs.aws.amazon.com/emr/latest/ManagementGuide/emrfs-files-tracked.html for related information.
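To make these behaviors concrete, the following is a quick sketch using the EMRFS CLI from the master node (the s3://mannem/emrfstest1 prefix is just the example path used above):

# Show the DynamoDB metadata table backing EMRFS on this cluster
emrfs describe-metadata

# Per the behavior above, this removes the object from S3 but leaves its entry in the EMRFS table
hadoop fs -rm -f s3://mannem/emrfstest1/111.txt

# List the differences between what S3 contains and what the EMRFS table tracks
emrfs diff s3://mannem/emrfstest1

# Re-synchronize the EMRFS metadata for the prefix with what is actually in S3
emrfs sync s3://mannem/emrfstest1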

EMR vCPU vCore issue

Written by mannem on . Posted in AWS BIG DATA, EMR || Elastic Map Reduce

Several customers are confused when they see that the vCores used by EMR differ from the EC2 vCPUs. This article clarifies why EMR had to use vCores, some problems that exist with Instance Fleets, and how to work around them.

When you choose an instance type using the AWS EMR Management Console, the number of vCPUs shown for each instance type is actually the number of YARN vCores for that instance type, not the number of EC2 vCPUs.
http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-instance-purchasing-options.html

1. The EMR console picks the yarn.nodemanager.resource.cpu-vcores value for the respective instance type from a predefined fixed mapping that EMR maintains for every instance type/family. For some instance families, like M5, EMR made the vCore count the same as the EC2 vCPU count, while for some other instance types (like the M4 family) the setting is double the actual EC2 vCPUs.

For example: EMR uses 80 vCores for m4.10xlarge, whereas EC2 reports 40 vCPUs.

2. So it seems that the intent here is to report vCore usage at the YARN level, as opposed to the actual EC2 instance level.

3. The discrepancy on the EMR console exists because it tries to represent a cluster’s compute power from a YARN perspective. Since EMR clusters run applications according to their YARN settings, this was presumably deemed a better representation of the compute resources than EC2’s vCPU count.

4. The reason this is done is to ensure that YARN runs enough containers to max out the CPU as much as possible. When some instance families were introduced, EMR determined that for a majority of use cases the instances’ CPUs would usually be underutilized without doubling this value, because most of the time applications are I/O bound. That is, if vCores were set to the actual number of vCPUs for these instance types, you would get about one YARN container per vCPU, but those containers would spend most of their time blocked on I/O, so you could usually run more containers in order to max out the CPU.

5. Amazon EMR makes an effort to provide optimal default configuration settings for each instance type for the broadest range of use cases (types of applications and engines like MapReduce and Spark). However, you may need to adjust these settings manually for the needs of your application. This value can be changed via the Configuration API by setting yarn.nodemanager.resource.cpu-vcores through the “yarn-site” classification for your different applications and workloads, as sketched below.
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html

https://hadoop.apache.org/docs/r2.8.3/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
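For example, on a cluster where every node should advertise the same fixed vCore count, a “yarn-site” classification like the sketch below could be passed through the Configuration API (48 is only an illustrative value; pick whatever suits your instance types and workload):

[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.resource.cpu-vcores": "48"
    }
  }
]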

Problems:

Sometimes customers do not want to double the vCores for each vCPU. Some use cases can cause containers to compete for a single hyperthread (vCPU), causing subpar performance, despite other nodes still having idle hyperthreads.

For example, if a customer has:

– 1 m4.10xlarge core node (where 40 vCPUs are doubled to 80 vCores), and

– 1 m5.12xlarge task node (with a 1:1 mapping of vCPU to vCore, i.e. 48),

then we cannot define “yarn.nodemanager.resource.cpu-vcores” as a fixed value for this cluster.

– For Instance Fleets, where multiple instance families can be specified by the customer and fulfilled by EMR to satisfy the target capacity, it is hard to set a single EMR configuration that encompasses all instance families.

– A similar issue exists for uniform instance groups when different instance families are used across core and task groups.

Workarounds:

————————————————-
– I tested the following settings, applied via the “yarn-site” classification. They tell the YARN NodeManager (NM) on each node to use the vCPUs of its underlying EC2 Linux instance as vCores, ignoring the defaults set by EMR. This means an m4.10xlarge node (NM) will only allocate 40 vCores for YARN containers, because it has 40 logical processors. The correct number of vCores should be confirmed from the ResourceManager UI or using YARN commands like “yarn node -status ip-172-31-44-155.ec2.internal:8041”.

https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YarnCommands.html#node

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html
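For example, from the master node (the node address below is the hypothetical one used in this post; substitute one of your own NodeManagers):

# List the NodeManagers registered with the ResourceManager, along with their node IDs
yarn node -list

# Show the resources (memory and vCores) reported by a specific NodeManager
yarn node -status ip-172-31-44-155.ec2.internal:8041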

Please see what these parameters mean here:
https://hadoop.apache.org/docs/r2.8.3/hadoop-yarn/hadoop-yarn-common/yarn-default.xml

JSON:
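A minimal sketch of such a “yarn-site” classification, assuming the standard Hadoop hardware-detection properties from yarn-default.xml (linked above) are the ones intended:

[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.resource.detect-hardware-capabilities": "true",
      "yarn.nodemanager.resource.count-logical-processors-as-vcores": "true",
      "yarn.nodemanager.resource.cpu-vcores": "-1"
    }
  }
]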

Note: If you use the above parameters, the AWS EMR console (or a DescribeCluster API call) will still show 80 vCores (double the vCPUs) for m4.10xlarge, because that information comes from the predefined fixed mapping for each instance type and is NOT pulled from your live nodes. For instance fleets, EMR will still count 80 vCores towards the capacity calculations used to launch resources that satisfy your target capacity units per fleet. This can cause a discrepancy between the vCore capacity you need and what is actually provisioned and used.

– For this reason, another suggestion is to use the same instance family (like M5) for all instance groups, or for all of the fleets in an instance fleets cluster, so that you get consistency with a predictable 1:1 vCPU-to-vCore mapping.