HDFS NameNode Performance Diagnostics

27 November 2018

Technology

Foreword

Hadoop Distributed File System (HDFS) is a key component of G-Research’s Big Data platform, and our primary cluster has grown rapidly this year. Consequently, we have to ensure that our HDFS NameNode scales as our cluster grows.

Recently, we began to notice intermittent increases in NameNode RPC processing latency, and users of the cluster were experiencing slow HDFS operations occasionally too. Overall per-call latencies were increasing by a factor of ten, and in some cases individual calls were taking tens of seconds to complete. This was leading to noticeable delays in jobs completing.

This blog post describes some of the diagnostic tools that are available when investigating problems with NameNode RPC response time, and how we used them to identify the root cause of the slow-downs at G-Research.

Background

NameNode ensures consistency of the distributed file system by limiting concurrent namespace access to a single-writer or multiple-readers. This single global lock is known as the FSNamesystem lock, and is implemented using Java’ s ReentrantReadWriteLock class.

Understanding that long read lock holds will block writers, but not other readers is useful background knowledge to have when investigating NameNode performance problems.

Debug Logging

In cases where you can reproduce a performance problem on demand, enabling debug logging exposes the elapsed time for each HDFS call. For example:

export HADOOP_ROOT_LOGGER=DEBUG,console
hdfs dfs -du -s -h /user/nmarchant
18/09/23 17:38:10 DEBUG ipc.ProtobufRpcEngine: Call: getContentSummary took 29ms

In the above example, the HDFS getContentSummary call only took 29 milliseconds. If the total elapsed time for all the HDFS calls accounts for a small fraction of the elapsed time of the command, it is clear that NameNode RPC latency is not root cause.

Metrics

If you are using Apache Ambari to manage your cluster, the HDFS – NameNodes Grafana dashboard that comes with the Ambari Metrics service provides a number of useful NameNode metrics.

In our case, we could see that RPC Client Port Processing Time was increasing in-step with Lock Queue Length and Threads Blocked e.g. waiting for a lock. Therefore, our focus shifted to FSNamesystem lock contention.

In cases where you suspect that high RPC latency is load related the HDFS – Top-N and HDFS – Users dashboards contain detailed RPC workload metrics.

In addition, NameNode metrics includes an rpcdetailed context, which provides per-method operation counts and average elapsed time. By default Ambari Metrics configures a ten second sampling period for its hadoop-metrics2 sink, so the averages are fairly granular.

There are no built-in dashboards for visualizing the rpcdetailed metrics, but you can create your own custom ones, or use the Ambari Metrics API to fetch them.

For example, to retrieve all rpcdetailed metrics for host masternode1.net for a specific time period (epoch time in seconds or milliseconds):

curl -X GET -u $USER 
'http://amscollector.net:6188/ws/v1/timeline/metrics?metricNames=rpcdetailed.rpcdetailed.%&appId=namenode&hostname=masternode1.net&startTime=1541495871&endTime=1541496175'

We were interested in per-method timings because we wanted to understand whether the increases in overall RPC processing time affected all HDFS operations, or just a subset. For example, imagine if someone (or something) decided to run hdfs dfs -du on root (“/”). This would recursively scan the entire file system tree, holding the FSNamesystem lock for as long as it takes to process 5000 content summary counts (assuming the default setting for dfs.content-summary.limit) . Readers wouldn’t be blocked, but writers would be.

For example, using these metrics we were able to see that RPC processing time was regularly increasing getBlockLocations latency, which we knew would impact our user’s jobs.

NameNode Logs

NameNode logs a number of performance related events, so a log aggregation tool should be used ensure the HDFS logs are retained and easily searched.

FSNamesystem Lock Warnings

NameNode will write a log message whenever the FSNamesystem read or write lock is held for longer than the configured threshold. By default, dfs.namenode.write-lock-reporting-threshold and dfs.namenode.read-lock-reporting-threshold are set to 1 second and 5 seconds respectively. Additionally, by setting dfs.namenode.lock.detailed-metrics.enabled to true (by default it is false) NameNode will include the operation that initiated the lock hold. In our experience, the overhead of enabling this is not noticeable, and knowing what operation acquired the lock is very useful.

2018-11-05 17:19:32,885 INFO  namenode.FSNamesystem (FSNamesystem.java:writeUnlock(1689)) - FSNamesystem write lock held for 36575 ms via
java.lang.Thread.getStackTrace(Thread.java:1556)
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:945)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1690)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4190)
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:1109)
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:645)
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
java.security.AccessController.doPrivileged(Native Method)
javax.security.auth.Subject.doAs(Subject.java:422)
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
Number of suppressed write-lock reports: 0
Longest write-lock held interval: 36575

User to Group Lookup Warnings

Whenever looking up a single user to group takes longer than the configured threshold, NameNode will log a warning message. By default hadoop.security.groups.cache.warn.after.ms is set to 5 seconds.

2018-11-05 17:19:32,884 WARN  security.Groups (Groups.java:fetchGroupList(361)) - Potential performance problem: getGroups(user=tbarker) took 36575 milliseconds.

Sync of Transaction Range Warnings

JournalNode will log a warning message whenever writes to the edit log take more than 1 second. Since NameNode processing is constrained by how quickly the JournalNodes can write the edit logs, these messages will help you identify I/O performance issues.

2018-10-19 05:18:49,213 WARN  server.Journal (Journal.java:journal(398)) - Sync of transaction range 3770473568-3770473568 took 10000ms

Detected Pause in JVM or Host Machine Warnings

NameNode logs a warning whenever the JvmPauseMonitor thread detects a garbage collection pause.

2018-10-30 16:44:43,208 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 6537ms

Garbage Collection Logs

Whilst the HDFS – Name Nodes dashboard includes garbage collection times and counts, Ambari Metrics ten-second sampling period means that long duration garbage collection events are averaged out. Therefore, it’s a good idea to independently review the NameNode garbage collection logs.

Long duration stop-the-world events (anything above a few tens of milliseconds) should not occur often. Frequent full garbage collections are a concern, especially when the size of the heap does not reduce significantly (which indicates that a larger heap size is required).

The following examples assume the use of the Concurrent Mark and Sweep collector.

Minor GC

2018-11-06T11:31:02.451+0000: 1455731.876: [GC (Allocation Failure) 2018-11-06T11:31:02.451+0000: 1455731.876: [ParNew: 3386482K->30147K(3774912K), 0.0442505 secs] 23296029K->19939896K(33135040K), 0.0448028 secs] [Times: user=0.18 sys=0.17, real=0.05 secs]

The GC event started at 2018-11-06T11:31:02.451+0000
It was a minor GC (GC)
It was triggered by an allocation request that could not fit into the Young generation (Allocation Failure)
The ParNew collector was used
The Young generation was 3386482K used before the collection and 30147K afterwards, with a total heap size of 3774912K: 3386482K->30147K(3774912K)
The application was stopped for 0.05 seconds (real=0.05 secs)

Full GC

Old Generation collections occur in multiple phases, and Young Generation collections can occur in parallel. For clarity, only Old Generation collection messages are included in this snippet:

2018-10-20T15:11:41.898+0000: 171.324: [GC (CMS Initial Mark) [1 CMS-initial-mark: 20589200K(29360128K)] 21075667K(33135040K), 0.0253652 secs] [Times: user=0.17 sys=0.03, real=0.03 secs]
2018-10-20T15:11:41.924+0000: 171.350: [CMS-concurrent-mark-start]
2018-10-20T15:11:58.260+0000: 187.686: [CMS-concurrent-mark: 13.679/16.336 secs] [Times: user=57.64 sys=4.57, real=16.33 secs]
2018-10-20T15:11:58.261+0000: 187.686: [CMS-concurrent-preclean-start]
2018-10-20T15:12:02.526+0000: 191.951: [CMS-concurrent-preclean: 3.779/4.265 secs] [Times: user=19.22 sys=2.00, real=4.27 secs]
2018-10-20T15:12:02.526+0000: 191.951: [CMS-concurrent-abortable-preclean-start]
2018-10-20T15:12:02.864+0000: 192.289: [CMS-concurrent-abortable-preclean: 0.236/0.338 secs] [Times: user=1.46 sys=0.13, real=0.34 secs]
2018-10-20T15:12:02.865+0000: 192.291: [GC (CMS Final Remark) [YG occupancy: 2837467 K (3774912 K)]2018-10-20T15:12:02.866+0000: 192.291: [Rescan (parallel) , 0.2059621 secs]2018-10-20T15:12:03.072+0000: 192.497: [weak refs processing, 0.0001084 secs]2018-10-20T15:12:03.072+0000: 192.497: [class unloading, 0.0192222 secs]2018-10-20T15:12:03.091+0000: 192.516: [scrub symbol table, 0.0101548 secs]2018-10-20T15:12:03.101+0000: 192.527: [scrub string table, 0.0010216 secs][1 CMS-remark: 21714369K(29360128K)] 24551836K(33135040K), 0.2380998 secs] [Times: user=1.66 sys=0.02, real=0.24 secs]
2018-10-20T15:12:03.104+0000: 192.529: [CMS-concurrent-sweep-start]
2018-10-20T15:12:12.204+0000: 201.629: [CMS-concurrent-sweep: 8.791/9.100 secs] [Times: user=28.15 sys=2.81, real=9.10 secs]
2018-10-20T15:12:12.204+0000: 201.629: [CMS-concurrent-reset-start]
2018-10-20T15:12:12.293+0000: 201.719: [CMS-concurrent-reset: 0.089/0.089 secs] [Times: user=0.12 sys=0.03, real=0.08 secs]

Only CMS Initial Mark and and Final Remark phases are stop-the-world.

Initial Mark
- Started at 2018-10-20T15:11:41.898+0000
- The Old Generation has 29360128K of memory available, and 20589200K is used: 20589200K(29360128K)
- The heap is 33135040K in size and 21075667K is used after the collection: 21075667K(33135040K)
- The application was stopped for 0.03 seconds (real=0.03 secs)
Final Remark
- Started at 2018-10-20T15:12:02.865+0000
- The Old Generation has 29360128K of memory available and is 21714369K used after the collection: 21714369K(29360128K)
- The heap is 33135040K in size and 24551836K is used: 24551836K(33135040K)
- The application was stopped for 0.08 seconds (real=0.08 secs)

Our Diagnosis

In our case, the spikes in RPC processing latency were caused by occasional slow user group lookups due to Linux sssd cache misses. This was evidenced by FSNamesystem read and write lock hold warnings in the NameNode logs with corresponding user to group lookup warnings, where the elapsed time for each was almost identical. Evidently, the operations in question were acquiring the lock ahead of the user group lookup, and so were only as fast (or slow) as the lookup itself.

2018-11-06 10:27:18,628 WARN  security.Groups (Groups.java:fetchGroupList(361)) - Potential performance problem: getGroups(user=dleake) took 17939 milliseconds.
2018-11-06 10:27:18,628 INFO  namenode.FSNamesystem (FSNamesystem.java:readUnlock(1638)) - FSNamesystem read lock held for 17940 ms via
java.lang.Thread.getStackTrace(Thread.java:1556)
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:945)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.readUnlock(FSNamesystem.java:1639)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4151)
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1137)
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:866)
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
java.security.AccessController.doPrivileged(Native Method)
javax.security.auth.Subject.doAs(Subject.java:422)
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
Number of suppressed read-lock reports: 0
Longest read-lock held interval: 17940

By clearing the sssd cache and running hadoop org.apache.hadoop.security.UserGroupInformation as the users who frequently appeared in the warnings we were consistently able to reproduce the slow user to group lookup warnings:

$ hadoop org.apache.hadoop.security.UserGroupInformation
Getting UGI for current user
User: dberrecloth
Group Ids:
18/11/06 12:04:23 WARN security.Groups: Potential performance problem: getGroups(user=dberrecloth) took 25269 milliseconds.
Groups: sales, hr, finance
UGI: dberrecloth (auth:KERBEROS)
Auth method KERBEROS
Keytab false
============================================================

By enabling debug logging for sssd we could see that whilst individual LDAP queries returned in a few milliseconds, there would be hundreds of queries executed serially for users with a large number of group memberships (by default sssd will return group members when processing group lookup calls). Fortunately, sssd features a number of configuration parameters for optimizing LDAP queries, and we are currently assessing which would be the most effective in our environment.

Austin Hackett, Big Data Platform Engineer