
EMR YARN Node Labels — For effective driver and executor placement



By Vishal Periyasamy Rajendran, Technical Blog, May 11, 2022

Our experts are thinkers AND doers focused on accelerating business outcomes. To showcase our deep expertise, we created a blog series called “The Digital Build.”

YARN Node Labels:

A node label is a way to group nodes with similar characteristics, so that Spark jobs can specify where they run. Node labels partition the cluster; by default, every node belongs to the DEFAULT partition.

Understanding EMR Node Types:

Master node: The master node manages the cluster and typically runs the master components of distributed applications. Major services such as the Spark History Server and the YARN ResourceManager run on the master node.

Core node: A node with software components that run tasks and store data in the Hadoop Distributed File System (HDFS) on your cluster. Multi-node clusters have at least one core node.

Task node: A node with software components that only runs tasks and adds capacity for parallel computation on data, such as Hadoop MapReduce tasks and Spark executors. Task nodes don't run the DataNode daemon, nor do they store data in HDFS.

Types of YARN node partitions:

Exclusive: Containers are allocated only to nodes whose partition exactly matches the request (e.g. containers requesting the CORE partition are allocated to nodes with partition=CORE, and containers requesting the DEFAULT partition are allocated to DEFAULT partition nodes).

Non-exclusive: A non-exclusive partition shares its idle resources with containers that request the DEFAULT partition.

Non-Exclusive node partitions
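
The difference between the two partition types can be sketched as a small placement check. This is a simplified model for illustration only, not YARN's scheduler logic; the function name `can_place` and the `DEFAULT` constant are assumptions made for this example, and "eligible" here ignores whether the node actually has enough free resources for an exact match.

```python
DEFAULT = "DEFAULT"  # illustrative name for the default (unlabeled) partition


def can_place(requested: str, node_partition: str,
              node_exclusive: bool, node_has_idle: bool) -> bool:
    """Return True if a container request is eligible for a node (simplified)."""
    # A request that names the node's own partition is always eligible.
    if requested == node_partition:
        return True
    # A non-exclusive partition lends idle resources to DEFAULT requests only.
    if requested == DEFAULT and not node_exclusive and node_has_idle:
        return True
    return False


if __name__ == "__main__":
    print(can_place("CORE", "CORE", True, False))
    print(can_place(DEFAULT, "TASK", False, True))
    print(can_place(DEFAULT, "TASK", True, True))
```

In this model a DEFAULT request can borrow from an idle non-exclusive TASK node, but never from an exclusive one.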

For example, consider two node labels:

CORE -> For EMR core nodes
TASK -> For EMR task nodes

First, register the node label list to the resource manager:

#Example for Non-exclusive node partitioning

yarn rmadmin -addToClusterNodeLabels "CORE(exclusive=false),TASK(exclusive=false)"

We can verify the node labels on the cluster using:

yarn cluster --list-node-labels

Note: Neither command can be run during a bootstrap action because, on EMR, the Hadoop installation takes place after bootstrap. Run them as a step after the cluster has been initialized.
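
Since the registration command has to run as a step, one possible approach is to build the step definition in code. The sketch below assembles the `yarn rmadmin -addToClusterNodeLabels` arguments and wraps them in a step that uses `command-runner.jar`. The helper names `add_labels_args` and `emr_step` are hypothetical, and actually submitting the step is left out.

```python
from typing import Dict, List


def add_labels_args(labels: Dict[str, bool]) -> List[str]:
    """Build the `yarn rmadmin -addToClusterNodeLabels` argument list.

    `labels` maps a label name to its exclusivity flag.
    """
    spec = ",".join(
        f"{name}(exclusive={'true' if exclusive else 'false'})"
        for name, exclusive in labels.items()
    )
    return ["yarn", "rmadmin", "-addToClusterNodeLabels", spec]


def emr_step(name: str, args: List[str]) -> dict:
    """Wrap a command in an EMR step definition that uses command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


if __name__ == "__main__":
    step = emr_step(
        "Register node labels",
        add_labels_args({"CORE": False, "TASK": False}),
    )
    print(step["HadoopJarStep"]["Args"])
```

A dictionary of this shape could then be passed in the `Steps` list of the boto3 EMR client's `add_job_flow_steps` call, assuming boto3 is how the cluster is managed.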

YARN Node mapping Types:

Centralized mapping
Distributed mapping
Delegated-Centralized mapping

Centralized YARN mapping:

Node-to-label mapping can be done through the Resource Manager using its admin CLI or API. The command below registers nodes under a label with centralized mapping:

yarn rmadmin -replaceLabelsOnNode "node1[:port]=CORE node2=TASK" [-failOnUnknownNodes]

As in the previous case, this cannot be included in a bootstrap action. The best alternative is therefore the EMR default mapping configuration (i.e. distributed YARN mapping).
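
For centralized mapping, the host-to-label string can be generated rather than typed by hand. This is a minimal sketch; the helper name `replace_labels_args` and the host names in the usage example are hypothetical.

```python
from typing import Dict, List


def replace_labels_args(mapping: Dict[str, str],
                        fail_on_unknown: bool = False) -> List[str]:
    """Build the `yarn rmadmin -replaceLabelsOnNode` argument list."""
    # Each entry is "host[:port]=LABEL"; entries are separated by spaces.
    pairs = " ".join(f"{node}={label}" for node, label in mapping.items())
    args = ["yarn", "rmadmin", "-replaceLabelsOnNode", pairs]
    if fail_on_unknown:
        args.append("-failOnUnknownNodes")
    return args


if __name__ == "__main__":
    print(replace_labels_args({"node1": "CORE", "node2": "TASK"}))
```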

Distributed YARN mapping:

Node-to-label mapping is set by a Node Labels Provider configured in the Node Manager. YARN has two providers: a script-based provider and a configuration-based provider.

In the case of script, the Node Manager is configured with a script path, and the script emits the labels of the node.
In the case of config, node labels are configured directly in the Node Manager's yarn-site.xml.

A dynamic refresh of the label mapping is supported in both of these options.
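
As an illustration of the script-based provider, the sketch below is an executable script that prints the node's partition. YARN reads an output line starting with `NODE_PARTITION:` as the node's label; the script would be referenced through `yarn.nodemanager.node-labels.provider.script.path` with the provider set to `script`. How the script learns the node's role is an assumption here: `EMR_NODE_ROLE` is a hypothetical environment variable, not something EMR sets, and the fallback to CORE exists only to keep the example runnable.

```python
#!/usr/bin/env python3
import os

VALID_PARTITIONS = {"CORE", "TASK"}


def partition_line(role: str) -> str:
    """Return the line the script-based provider reads for this node."""
    partition = role.strip().upper()
    if partition not in VALID_PARTITIONS:
        raise ValueError(f"unknown node role: {role!r}")
    # YARN treats an output line starting with "NODE_PARTITION:" as the label.
    return f"NODE_PARTITION:{partition}"


if __name__ == "__main__":
    # EMR_NODE_ROLE is a hypothetical variable used only for this sketch.
    print(partition_line(os.environ.get("EMR_NODE_ROLE", "CORE")))
```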

YARN site XML configuration (yarn-site.xml):

Core node yarn configuration overwrite:

#Default configuration in EMR.

yarn.node-labels.configuration-type="distributed"

yarn.scheduler.capacity.root.default.default-node-label-expression="CORE"

yarn.scheduler.capacity.root.accessible-node-labels="CORE,TASK"

#Default false after EMR version 5.19.0 and later.

yarn.node-labels.enabled="true"

yarn.scheduler.capacity.root.default.accessible-node-labels="CORE,TASK"

Task node yarn configuration overwrite:

#Default configuration in EMR.
yarn.node-labels.configuration-type="distributed"

yarn.scheduler.capacity.root.default.default-node-label-expression="TASK"

yarn.scheduler.capacity.root.accessible-node-labels="CORE,TASK"

#Default false after EMR version 5.19.0 and later.

yarn.node-labels.enabled="true"

yarn.scheduler.capacity.root.default.accessible-node-labels="CORE,TASK"
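
On EMR, overrides like these are usually supplied as configuration objects with a `Classification` and a `Properties` map. The sketch below generates a `yarn-site` object from the properties listed above, with the default label expression as the only value that differs between core and task nodes. The function name `yarn_site_classification` is hypothetical, and attaching the result to a specific instance group is left to the reader.

```python
import json
from typing import Dict, List


def yarn_site_classification(default_label: str) -> List[dict]:
    """Build a yarn-site configuration object for one instance group."""
    properties: Dict[str, str] = {
        "yarn.node-labels.enabled": "true",
        "yarn.node-labels.configuration-type": "distributed",
        "yarn.scheduler.capacity.root.accessible-node-labels": "CORE,TASK",
        "yarn.scheduler.capacity.root.default.accessible-node-labels": "CORE,TASK",
        "yarn.scheduler.capacity.root.default.default-node-label-expression": default_label,
    }
    return [{"Classification": "yarn-site", "Properties": properties}]


if __name__ == "__main__":
    # Print the core-node variant as JSON, ready to paste into a cluster config.
    print(json.dumps(yarn_site_classification("CORE"), indent=2))
```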

Capacity Scheduler Configuration:

Once node labeling is configured, we need to assign a capacity percentage for each node label in capacity-scheduler.xml:

yarn.scheduler.capacity.root.accessible-node-labels.CORE.capacity="100"

yarn.scheduler.capacity.root.accessible-node-labels.TASK.capacity="100"

yarn.scheduler.capacity.root.default.accessible-node-labels.CORE.capacity="100"

yarn.scheduler.capacity.root.default.accessible-node-labels.TASK.capacity="100"

yarn.scheduler.capacity.root.accessible-node-labels="*"

yarn.scheduler.capacity.root.default.accessible-node-labels="*"
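
The capacity properties follow a regular pattern per queue and per label, so they can be generated. This is a sketch under the assumption that every queue gets the same percentage for every label, as in the listing above; the function name `label_capacity_properties` is hypothetical.

```python
from typing import Dict, Iterable


def label_capacity_properties(labels: Iterable[str],
                              queues: Iterable[str] = ("root", "root.default"),
                              capacity: int = 100) -> Dict[str, str]:
    """Generate capacity-scheduler properties for each queue and label."""
    label_list = list(labels)
    props: Dict[str, str] = {}
    for queue in queues:
        prefix = f"yarn.scheduler.capacity.{queue}.accessible-node-labels"
        # Let the queue access every label.
        props[prefix] = "*"
        for label in label_list:
            # Percentage of the label's partition that this queue may use.
            props[f"{prefix}.{label}.capacity"] = str(capacity)
    return props


if __name__ == "__main__":
    for key, value in label_capacity_properties(["CORE", "TASK"]).items():
        print(f'{key}="{value}"')
```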

After assigning node labels, we can verify the label status using the resource manager.

Figure: Hadoop node labels in the Resource Manager console

When launching a Spark job, we can configure driver and executor placement based on node labels, either with the --conf argument or by overriding the Spark default configuration file.

#Launches executors on TASK nodes
--conf spark.yarn.executor.nodeLabelExpression="TASK"

#Launches the application master (which hosts the driver in cluster mode) on CORE nodes
--conf spark.yarn.am.nodeLabelExpression="CORE"
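
Put together, a cluster-mode submission with both placements might be assembled as follows. This is a sketch: the helper name `spark_submit_args` and the application path in the usage example are hypothetical, and only the two node label settings come from the text above.

```python
from typing import List


def spark_submit_args(app: str, driver_label: str = "CORE",
                      executor_label: str = "TASK") -> List[str]:
    """Build a spark-submit command that pins the AM/driver and executors."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        # In cluster mode the driver runs inside the application master.
        "--conf", f"spark.yarn.am.nodeLabelExpression={driver_label}",
        "--conf", f"spark.yarn.executor.nodeLabelExpression={executor_label}",
        app,
    ]


if __name__ == "__main__":
    # The S3 path is a placeholder for illustration.
    print(" ".join(spark_submit_args("s3://my-bucket/job.py")))
```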

Use cases for node labeling:

In most data engineering projects that use EMR, Spot instances are preferred for TASK nodes to reduce overall cost, but this raises the question of Spark job stability. If the driver of a submitted Spark job is launched on a task node and that node is lost, due to Spot price fluctuation or any other reason, the job fails. YARN node labels avoid this by controlling driver and executor placement across nodes when a Spark job is launched in cluster mode.
Some Spark jobs benefit from running on nodes with more memory or more powerful CPUs. With YARN node labels, you can mark nodes with labels such as "MEMORY_NODES" (for nodes with more RAM) or "CPU_NODES" (for nodes with powerful CPUs) so that Spark jobs can choose the nodes on which to run their containers. The YARN Resource Manager schedules jobs based on those node labels.

Caveat on node labeling:

When the driver is configured to always launch on CORE nodes, EMR job concurrency depends heavily on the size of the CORE nodes: once their capacity runs out, additional jobs stay in the PENDING state.
EMR auto scaling is affected, since containers are not allocated uniformly across CORE and TASK nodes.
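
The first caveat can be made concrete with a rough estimate of how many drivers fit on the CORE nodes at once. This is a back-of-the-envelope sketch with hypothetical numbers. It assumes Spark's default driver memory overhead (10% of driver memory, at least 384 MB) and ignores vcores, YARN's allocation rounding, and any other containers running on the CORE nodes.

```python
def max_concurrent_drivers(core_nodes: int, yarn_memory_per_node_mb: int,
                           driver_memory_mb: int) -> int:
    """Estimate how many driver containers fit on the CORE nodes."""
    # Spark's default driver overhead is 10% of driver memory, at least 384 MB.
    overhead_mb = max(384, int(driver_memory_mb * 0.10))
    container_mb = driver_memory_mb + overhead_mb
    # A container cannot span nodes, so count what fits on each node.
    per_node = yarn_memory_per_node_mb // container_mb
    return core_nodes * per_node


if __name__ == "__main__":
    # Hypothetical cluster: 2 CORE nodes with 12288 MB for YARN, 4096 MB drivers.
    print(max_concurrent_drivers(2, 12288, 4096))
```

With these example numbers each driver container needs 4505 MB, so two fit per node and a fifth concurrent job would wait in PENDING.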

Author: Vishal Periyasamy.

