It is no secret that the volume of data is exploding, and the challenge of properly storing and maintaining that data has not gone away.
One primary solution for storing and managing a vast volume of data is HDFS, the Hadoop Distributed File System. This central component is responsible for maintaining the data of the Hadoop platform. Once at scale, however, it can be very difficult to get a breakdown of what has been stored in HDFS.
In most cases, users of large multi-tenant HDFS instances rely on email broadcasts, alerting systems, or command-line utilities to scan directories and understand their own datasets in terms of capacity. Lacking an efficient way to get a quick breakdown of usage by user and directory, administrators of those large HDFS instances also struggle to keep file counts or capacities within stable bounds when no one is actively maintaining a retention policy.
(A graph of one year's worth of file growth on one of our largest production Hadoop clusters, culminating in 100M+ new files in 2017. ~400M is the current NameNode scalability limit.)
One major issue we have seen with large HDFS instances is the difficulty in performing large and complicated scans of the entire file system.
There are several architectures others deploy to do this. The most common is to parse the FsImage, a binary file that represents all the metadata stored in the HDFS instance. Everything needed to perform this parsing, namely the OfflineImageViewer component, comes standard with any flavor of Apache HDFS. This is the architecture PayPal has deployed ever since Hadoop was first brought in. We would ship the processed FsImage to Elasticsearch and Kibana, and from there generate nightly usage reports and email them out. While it worked, in most cases context was lost, or users checked their reports far too late.
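The legacy pipeline amounts to two steps: dump the FsImage to delimited text with the stock OfflineImageViewer (`hdfs oiv -p Delimited -i <fsimage> -o <out.tsv>`), then aggregate the rows into per-user usage. Below is a minimal Python sketch of the aggregation step; the sample rows and the column order (Path, Replication, ModificationTime, AccessTime, PreferredBlockSize, BlocksCount, FileSize, NSQUOTA, DSQUOTA, Permission, UserName, GroupName) are assumptions based on typical Delimited-processor output, so verify against your Hadoop version.

```python
import csv
import io
from collections import defaultdict

# Sample rows in the tab-delimited shape emitted by:
#   hdfs oiv -p Delimited -i <fsimage> -o <out.tsv>
# Column order is an assumption; check your Hadoop version's output header.
SAMPLE = """\
/user/alice/a.parquet\t3\t2017-01-01 00:00\t2017-01-01 00:00\t134217728\t1\t1048576\t0\t0\t-rw-r--r--\talice\thadoop
/user/bob/b.parquet\t3\t2017-01-01 00:00\t2017-01-01 00:00\t134217728\t2\t268435456\t0\t0\t-rw-r--r--\tbob\thadoop
/user/alice/c.log\t3\t2017-01-01 00:00\t2017-01-01 00:00\t134217728\t1\t4096\t0\t0\t-rw-r--r--\talice\thadoop
"""

def usage_by_user(delimited_text):
    """Aggregate file count and raw bytes per owner from OIV Delimited output."""
    counts = defaultdict(int)
    sizes = defaultdict(int)
    for row in csv.reader(io.StringIO(delimited_text), delimiter="\t"):
        if not row or row[0] == "Path":  # skip blank lines and the header row
            continue
        user, size = row[10], int(row[6])  # UserName and FileSize columns
        counts[user] += 1
        sizes[user] += size
    return dict(counts), dict(sizes)

counts, sizes = usage_by_user(SAMPLE)
print(counts)  # {'alice': 2, 'bob': 1}
print(sizes)   # {'alice': 1052672, 'bob': 268435456}
```

On a 300M-file image this scan alone takes tens of minutes, which is exactly the cost the rest of this post works to eliminate.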
Processing the FsImage has another issue: parsing the image into plaintext drops certain metadata attributes along the way. Examples of lost metadata include extended attributes, delegation tokens, and storage policies.
(Architecture graph of processing an FsImage from a large (300M+ files) HDFS cluster. 3 minutes for edit log tailing. 60 minutes to produce and ship image. 30 minutes to parse and ship image once more.)
As shown above, fetching and parsing the FsImage can be a long process. I have personally clocked an end-to-end processing time of anywhere between 1.5 and 2 hours to generate a nightly report. Keep in mind it is unrealistic to publish a report every couple of hours using this architecture because of its reliance on the active HDFS cluster. First, an FsImage on large HDFS instances is typically produced only every 3-4 hours. Second, even if the FsImage were produced every hour, processing it that often could overwhelm the network: transferring the FsImage across hosts is a major network event, and it affects NameNode performance slightly, but measurably.
As part of HSRE, the Hadoop Platform team at PayPal, we looked externally for a long time and were not satisfied with the market solutions. Part of our difficulty was that all of them were deeply tied to the architecture pictured above, and as a result shared its long processing times and reliance on the active HDFS instance. We had a strong desire to find a solution because data growth spikes were happening at PayPal within windows as small as a single day.
Thus we looked internally instead, toward creating a new, faster architecture, and that's how NameNode Analytics was born.
(Architecture graph of NNA. Stays up-to-date tailing EditLog from JournalNodes (which can handle many readers). Only needs to talk to cluster NameNode once for bootstrap. Only latency is the ~3 minutes of EditLog tailing.)
Traditionally, you would have to pull the FsImage out of the active HDFS cluster to perform this processing. However, through my years of work on and understanding of HDFS architecture, I discovered that by utilizing an isolated NameNode, just for the purpose of getting up-to-the-minute updates to metadata, we could create a system that generates vastly superior reports for end users and targeted directories, all without relying on critical pieces of the active HDFS cluster.
The first trick was to keep our isolated NameNode up to date through JournalNode EditLog tailing, the same process that keeps the Standby NameNode of the active HDFS cluster up to date. The second trick was to provide a query engine on top that could facilitate fast traversal of the entire in-memory metadata set.
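Conceptually, that query engine is simple: because the entire metadata set is resident in memory, a query is just a filtered scan over all inodes, applying every predicate to every entry. Here is a toy Python sketch of the idea; the real NNA operates on actual NameNode INode objects in Java, and the `INode` fields below are a simplified stand-in.

```python
from dataclasses import dataclass

@dataclass
class INode:
    """Toy stand-in for a NameNode inode; real inodes carry far more metadata."""
    path: str
    is_file: bool
    user: str
    size: int
    replication: int

def query(inodes, predicates):
    """Scan the whole in-memory metadata set, keeping inodes matching every predicate."""
    return [i for i in inodes if all(p(i) for p in predicates)]

# A tiny in-memory metadata set standing in for the full namespace.
inodes = [
    INode("/user/alice/big.parquet", True, "alice", 512 << 20, 3),
    INode("/user/alice", False, "alice", 0, 0),
    INode("/user/bob/tiny.log", True, "bob", 1024, 3),
]

# e.g. "all of alice's files larger than 1 MiB"
hits = query(inodes, [
    lambda i: i.is_file,
    lambda i: i.user == "alice",
    lambda i: i.size > (1 << 20),
])
print([i.path for i in hits])  # ['/user/alice/big.parquet']
```

Because the scan never touches disk or the active cluster, even a full pass over hundreds of millions of inodes completes quickly, and it parallelizes naturally.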
Through these efforts we found we could now generate richer, metadata-enriched reports every 30 minutes on our largest HDFS installations, and perform near-real-time queries every 3 minutes, or as fast as EditLog tailing allows; something we could not achieve before without severely impacting HDFS and Hadoop job performance.
Perhaps the greatest result from utilizing NNA has been the ability to offload users’ large metadata scans (which typically utilize expensive calls and can affect job performance) away from the Active NameNode and instead onto NNA.
In short, NNA has sped up our reporting intervals, offloaded expensive metadata scans, opened up an interactive query API for HDFS, and overall provided a better HDFS experience for end-users and administrators. Some example screenshots / queries are below:
Screenshots of the main NNA Status Page:
(The top half of the main NNA status page. Showcasing a breakdown of files and directories and offering a way (search bar in the top right) to breakdown those numbers by user.)
(The bottom half of the main NNA status page. Shows main breakdown of tiny, small, medium, large, old files and directories with quotas.)
Histogram breakdown of File Types on the cluster:
(Near-real-time query showing the breakdown of different types of files on the filesystem. Y-axis is number of files; X-axis is the file type.)
Histogram breakdown of File Replication Factors on the cluster:
(Near real time query showing the breakdown of different levels of file replication factors on the filesystem. Y-axis is number of files; X-axis is the replication factor buckets.)
Histogram breakdown of File Storage Policies on the cluster:
(Near real time query showing the breakdown of different storage policies of files on the filesystem. Y-axis is number of files; X-axis is the storage policies.)
(Many more types of queries, histograms, summations, and searches are possible. Check the documentation at the bottom for more info.)
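Each histogram above corresponds to a single REST call against NNA. The sketch below shows how such query URLs might be constructed; the host is hypothetical, and the parameter names (`set`, `type`, `sum`) and histogram type names are assumptions from memory of the NNA docs, so verify them against the project documentation linked below.

```python
from urllib.parse import urlencode

NNA_HOST = "http://nna.example.com:8080"  # hypothetical NNA endpoint

def histogram_url(hist_type, set_name="files", sum_field="count"):
    """Build an NNA /histogram query URL.

    Parameter names (`set`, `type`, `sum`) are assumptions; check the
    NNA documentation for the exact REST API.
    """
    return f"{NNA_HOST}/histogram?" + urlencode(
        {"set": set_name, "type": hist_type, "sum": sum_field}
    )

# The three histograms shown above:
print(histogram_url("fileType"))
print(histogram_url("fileReplica"))
print(histogram_url("storagePolicy"))
```

Since these are plain HTTP GETs, they drop straight into cron jobs, dashboards, or a time-series collector.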
Since this information can be obtained with relative ease, we even trend it internally in a time-series database and can now see exactly how users are behaving. If you are curious about all the different ways you can query HDFS using NameNode Analytics, I highly encourage you to watch my talk or read my slides, available at the DataWorks Summit 2018 session page: https://dataworkssummit.com/san-jose-2018/session/namenode-analytics-scouting-the-hdfs-metadata/.
NNA can do far more than I can list here. We have several stories where the level of insight NNA provides has saved us at PayPal. I highly encourage everyone to read our “Story Time!” slides!
One major and scary incident was when we reached the peak scalability limit of one of our NameNodes. Recall from earlier that ~400M files is the point where Java garbage collection starts to thrash the NameNode.
With NNA by our side we were able to quickly target directories for removal, and we didn't even have to communicate with the active HDFS cluster to find them. Here is what the event looked like, with file counts on the left and Java garbage collection times on the right:
(File count (red) and Block count (orange) of our largest production Hadoop cluster.)
(Java GC during this event; NameNode operations were grinding to a halt.)
(This incident was so severe that we risked a multi-hour HDFS outage had it not been addressed while the NameNodes were still alive and could process delete operations.)
Another event was when we hit an HDFS performance bug and had to change storage policies on several thousand directories to remediate it. We had no list of which directories were affected, and running a “recursive list status and grep” operation across the entire file system would have taken days. With NNA we got the list in about 2 minutes, targeted exactly the directories we needed, and restored RPC performance in about an hour.
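A hunt like that boils down to one filtered query against NNA's in-memory metadata instead of a days-long filesystem crawl. The sketch below shows how such a query URL might be built; the host, the endpoint name, and the `field:operator:value` filter syntax (including the `storagePolicy` field and the policy id used) are all assumptions to check against the NNA documentation.

```python
from urllib.parse import urlencode

NNA_HOST = "http://nna.example.com:8080"  # hypothetical NNA endpoint

def filter_url(filters, set_name="files", sum_field="count"):
    """Build an NNA filtered-search query URL.

    Filter entries follow an assumed `field:operator:value` form; the
    exact endpoint, field, and operator names should be confirmed
    against the NNA documentation.
    """
    return f"{NNA_HOST}/filter?" + urlencode(
        {"set": set_name, "filters": ",".join(filters), "sum": sum_field}
    )

# e.g. every file under a (hypothetical) buggy storage policy id 7
url = filter_url(["storagePolicy:eq:7"])
print(url)
```

The answer comes back from NNA's replica of the namespace, so the scan costs the active NameNode nothing.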
(The difference in client response times before and after we got away from the storage policy bug.)
Thank you for taking the time to read this far! There is a lot to cover when it comes to NNA, so please look out for more news and talks from me.
Plamen has been in the Hadoop field for 7 years at the time of writing. He has been a Senior Hadoop Engineer at PayPal since April 19th, 2016, and has been working on HDFS and HDFS-related projects for nearly all of those 7 years. He is an Apache Hadoop contributor, most notably as part of the team that introduced truncate (https://issues.apache.org/jira/browse/HDFS-3107), part of the team at WANdisco that worked on making HDFS a unified geographically distributed filesystem (https://patents.google.com/patent/US20150278244), and part of the team working on distributing the NameNode across HBase RegionServers (https://github.com/GiraffaFS/giraffa). He loves to talk HDFS, beer, crypto, and Melee. GitHub: zero45, Apache: zero45