Seven months after the launch of its AI benchmarking suite, the MLPerf consortium is releasing the first round of results based on submissions from Nvidia, Google and Intel. Of the seven benchmarks encompassed in v0.5 of the would-be benchmarking standard, Nvidia announced that it captured the lead spot in six. Separately, Google (which led the creation of the benchmark) said results show Google Cloud “offers the most accessible scale for machine learning training.”
MLPerf supporting companies (as of Dec. 12, 2018)
As HPCwire covered in May, MLPerf is an emerging AI benchmarking suite “for measuring the speed of machine learning software and hardware.” Started by a small group from academia and industry, including Google, Baidu, Intel, AMD, Harvard, and Stanford, the project has grown considerably in the last half-year. At last count, the website lists 31 supporting companies: the aforementioned Google, Intel, AMD, and Baidu as well as ARM, Nvidia, Cray, Cisco, Microsoft and others (but not IBM or Amazon).
According to the consortium, each training benchmark is defined by a dataset and a quality target, and the suite provides a reference implementation for each benchmark that uses a specific model. The following table summarizes the seven benchmarks in v0.5 of the suite, which spans five categories (image classification, object detection, translation, recommendation and reinforcement learning). Time to train is the main performance metric.
MLPerf Training v0.5 is a benchmark suite (Source: MLPerf)
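As a rough illustration of the time-to-train metric, here is a minimal sketch that times a training loop until a quality target is reached. It is not MLPerf's reference code: the function names, the toy model whose accuracy rises by a fixed step per epoch, and the 0.75 target are all assumptions made for the example.

```python
import time
from typing import Callable, Tuple


def time_to_train(train_one_epoch: Callable[[], None],
                  evaluate: Callable[[], float],
                  quality_target: float,
                  max_epochs: int = 100) -> Tuple[float, int]:
    """Train until evaluate() meets quality_target; return (seconds, epochs)."""
    start = time.perf_counter()
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        if evaluate() >= quality_target:
            return time.perf_counter() - start, epoch
    raise RuntimeError("quality target not reached within max_epochs")


class ToyModel:
    """Hypothetical stand-in for a real model: accuracy rises by a fixed step."""

    def __init__(self, step: float = 0.25) -> None:
        self.accuracy = 0.0
        self.step = step

    def train_one_epoch(self) -> None:
        self.accuracy = min(1.0, self.accuracy + self.step)

    def evaluate(self) -> float:
        return self.accuracy


model = ToyModel()
seconds, epochs = time_to_train(model.train_one_epoch, model.evaluate,
                                quality_target=0.75)
print(f"reached target after {epochs} epochs in {seconds:.6f} s")
```

The point of the design is that the stopwatch stops only when the quality target is met, so a faster system cannot score well by training a worse model.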
Nvidia revealed today that its platforms outperformed the competition by up to 5.3x (faster time to results), showing leading single-node and at-scale results for six of the workloads. Nvidia opted not to submit for the reinforcement learning benchmark because, as Ian Buck, vice president and general manager of accelerated computing at Nvidia, explained in an advance press briefing, it is for the most part CPU-based and does not have meaningful acceleration in its current form.
Nvidia submitted for all of the six accelerated benchmarks in two categories — single node (testing up to 16 V100 GPUs in the DGX-2H platform) and at-scale (testing in various configurations, up to 640 GPUs).
In a blog post published today, Nvidia stated that “a single DGX-2 node can complete many of these workloads in under twenty minutes. And in the case of our at-scale submission, we’re completing these tasks in under seven minutes in all but one of the tests.”
Test Platform: DGX-2H – Dual-Socket Xeon Platinum 8174, 1.5TB system RAM, 16 x 32 GB Tesla V100 SXM-3 GPUs connected via NVSwitch (Source: Nvidia, see endnotes for details)
Test Platform: For Image Classification and Translation (non-recurrent), DGX-1V Cluster. For Object Detection (Heavy Weight), Object Detection (Light Weight) and Translation (recurrent), DGX-2H Cluster. Each DGX-1V: Dual-Socket Xeon E5-2698 v4, 512GB system RAM, 8 x 16 GB Tesla V100 SXM-2 GPUs. Each DGX-2H: Dual-Socket Xeon Platinum 8174, 1.5TB system RAM, 16 x 32 GB Tesla V100 SXM-3 GPUs connected via NVSwitch. (Source: Nvidia, see endnotes for details)
While there are faster ResNet50 competitions out there, they aren’t under the standard MLPerf guidelines, Nvidia told us.
“By improving and delivering on the full-stack optimization and our performance at scale, we decrease training times, which makes research and deployment of AI faster and we improve the cost efficiency,” said Ian Buck in the press briefing. “If I take a DGX Station and look at its value over four years, it’s roughly $1.50/hr, so a little over $6 to train a ResNet50.” Buck added that Titan RTX, announced last week with a list price of $2,499, comes out to just over $2.00 to train a single ResNet50.
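The arithmetic behind Buck's cost figures can be sketched as follows. This is a back-of-the-envelope illustration, not Nvidia's calculation: it assumes the hardware price is spread evenly over four years of round-the-clock use, ignores power and hosting costs, and applies the same four-year period to the Titan RTX, which Buck did not state explicitly. The function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap days


def amortized_hourly_cost(price_usd: float, years: float) -> float:
    """Spread a purchase price evenly over every hour of the period."""
    return price_usd / (years * HOURS_PER_YEAR)


def implied_training_hours(cost_per_run_usd: float,
                           hourly_cost_usd: float) -> float:
    """Hours of machine time that a quoted per-run cost corresponds to."""
    return cost_per_run_usd / hourly_cost_usd


# Titan RTX: $2,499 list price, assumed four-year amortization.
titan_hourly = amortized_hourly_cost(2499, 4)
titan_hours = implied_training_hours(2.00, titan_hourly)

# DGX Station: Buck quoted roughly $1.50/hr and a little over $6 per run.
dgx_station_hours = implied_training_hours(6.00, 1.50)

print(f"Titan RTX: ${titan_hourly:.4f}/hr, about {titan_hours:.1f} h per run")
print(f"DGX Station: about {dgx_station_hours:.1f} h per run")
```

Under these assumptions the Titan RTX works out to roughly $0.07 per hour, so a $2.00 run implies about 28 hours of training, while the DGX Station figures imply about four hours. Because the quoted costs were "just over" and "a little over" the round numbers, the implied times are approximate lower bounds.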
Speaking to the value of the “industry’s first comprehensive AI benchmark” and what that means for customers, Buck stated: “Nvidia is no stranger to benchmarks; we certainly have them in the graphics space, we have them in the supercomputing space and we now have them as well in the AI world. Providing a common benchmark, a common set of rules as long as it’s appropriately governed can provide perspective to customers and the rest of the community on the state of everyone’s solution. It also provides a nice common platform for people to innovate, to measure innovation and help companies move the ball forward in improving the performance.”
Google also took time to promote its results today in a blog post, claiming Google Cloud “offers the most accessible scale for machine learning training” and “a 19% TPU performance advantage on a chip-to-chip basis.”
“The results show Google Cloud’s TPUs (Tensor Processing Units) and TPU Pods as leading systems for training machine learning models at scale, based on competitive performance across several MLPerf tests,” wrote Urs Hölzle, Senior Vice President of Technical Infrastructure, Google.
“For data scientists, ML practitioners, and researchers, building on-premise GPU clusters for training is capital-intensive and time-consuming—it’s much simpler to access both GPU and TPU infrastructure on Google Cloud,” said Hölzle.
This graphic from Google compares absolute training times for Nvidia’s DGX-2 machine, containing 16 V100 GPUs, with results using 1/64th of a TPU v3 Pod (16 TPU v3 chips used for training and 4 TPU v2 chips used for evaluation). The three benchmarks shown are image classification (ResNet-50), object detection (SSD), and neural machine translation (NMT).
Training time comparison between 1/64th of a TPU v3 Pod (16 TPU v3 chips used for training, plus four separate Cloud TPU v2 chips used for evaluation) and an Nvidia DGX-2 (16 V100 GPUs) (Source: Google Cloud)
The inaugural MLPerf testing only had three submitters: Nvidia, Google and Intel. All submitted for the closed division, which compares hardware platforms or software frameworks on an “apples-to-apples” basis. None submitted to the open division, which allows any ML approach that can reach the target quality and is intended to foster innovation. See results here: https://mlperf.org/results/
As noted on the MLPerf GitHub page, “MLPerf v0.5.0 is the ‘alpha’ release of an agile benchmark, and the benchmark is still evolving based on feedback from the community.” Changes under consideration include “raising target quality, adopting a standard batch-size-to-hyperparameter table, scaling up some benchmarks (especially recommendation), and adding new benchmarks.” The current suite is limited to training workloads, but according to Nvidia, there are plans to add inference-focused benchmarks. The consortium is working on releasing interim versions of the suite (v0.5.1 and v0.5.2) in the first half of 2019 with a full version 1.0 release planned for the third quarter of 2019.
IBM and Nvidia today announced a new turnkey AI solution that combines IBM Spectrum Scale scale-out file storage with Nvidia’s GPU-based DGX-1 AI server to provide what the companies call “the highest performance in any tested converged system” while supporting data science practices and AI data pipelines (data prep, training, inference, archive) in which data volumes continually grow.
Called IBM SpectrumAI with Nvidia DGX, the all-flash offering is designed to be an AI data infrastructure and is configurable from a single IBM Elastic Storage Server, to a rack of nine Nvidia DGX-1 servers with 72 Nvidia V100 Tensor Core GPUs, up to multi-rack configurations. IBM said Spectrum storage scales “practically linearly,” with random read data throughput requirements to feed multiple GPUs. The system has demonstrated 120GB/s of data throughput in a rack, according to IBM.
As with other AI multi-vendor partnerships involving Nvidia, the offering is built to handle tasks that would otherwise require data scientists.
“The key thing is we’re removing the impediments to getting AI to be an automated process,” Eric Herzog, VP, Product Marketing and Management, IBM Storage Systems, said in an interview.
IBM SpectrumAI with Nvidia DGX
SpectrumAI provides services across different storage choices, including AWS. The Spectrum Scale parallel file system can share data with IBM Cloud Object Storage and tape with shared metadata services provided by Spectrum Discover.
Herzog said the reference architecture for SpectrumAI with Nvidia DGX encompasses IBM Spectrum Scale, for high performance data ingest; Spectrum Discover, to make data accessible via data cataloguing and indexing within an unstructured environment; and an API and software that re-uses the workflows created by Spectrum Discover. “You can use the workflows over and over again,” Herzog said, “minimizing the time the data scientists spend on data prep – it’s automated.”
Herzog said Spectrum Scale has the capacity to support expanding, data-hungry AI implementations at scale.
“It’s important that it be able to grow,” he said. “IBM Spectrum Scale and IBM Cloud Object Storage have, in production today, installations of more than an exabyte. And both of them can scale to multi-exabyte configurations. In fact, this reference architecture is spec’ed up to a capacity of 8 exabytes – not counting the archive side, just counting the primary storage side. So it lets you create giant sets needed for AI. AI is mimicking what we do from a human perspective and allow this vast amount of data to be absorbed.”
Built specifically for AI implementations, DGX-1 servers contain eight Tesla V100 GPUs and deliver a petaflops of mixed-precision throughput, with 256 GB of system memory, according to Nvidia. The DGX software stack is designed for GPU-accelerated training, including the RAPIDS framework. Nvidia said adoption of new AI frameworks is simplified by the container model supported by the Nvidia NGC container repository of applications.
“IBM’s strategy is to make AI/ML/DL more accessible and more performant,” said Ashish Nadkarni, group VP, infrastructure systems, platforms and technologies, at industry watcher IDC. “IBM SpectrumAI with Nvidia DGX is designed to provide a tested and supported platform. For those who are choosing Nvidia DGX servers for the open source frameworks and high-throughput GPU platforms, IBM Spectrum Scale can add intelligent, scalable, secured, metadata-rich, cloud-integrated, multiprotocol, high performing, and efficient storage in an easy to deploy solution from their top tier business partners.”
Herzog said the offering will be sold strictly through an IBM and Nvidia reseller channel of five companies, which he said will expand next year.
SAN JOSE, Calif., Dec. 12, 2018 — WekaIO, an innovation leader in high-performance, scalable file storage for data intensive applications, has announced TRE ALTAMIRA, a world leader in measuring ground and structural movement from space, is using Matrix on Amazon Web Services (AWS) to process their dataset of satellite imagery for their customers. TRE ALTAMIRA chose the Matrix file system for its high-performance storage capabilities on AWS, which has eliminated production capacity limits, significantly reduced costs, and removed barriers to product innovation.
TRE ALTAMIRA uses satellite radar technology to measure ground and structural movement for clients in sectors ranging from oil and gas, to civil engineering, and geo hazards. Its technology has played a key role in such high-profile rail-transit projects as the Grand Paris Express and Canada Line. Processing the 4TB datasets required to provide such information to clients requires significant processing power and high-performance storage; requirements that limited the company’s production capacity to 30 analyses per week. Thus, TRE ALTAMIRA found its growth capped at this number and its ability to innovate similarly hamstrung by the lack of available resources to test and develop promising new algorithms.
“The situation was untenable, and we needed an urgent solution to solve the fast-growing resource constraints. Help came in the form of a recommendation for WekaIO Matrix on AWS. It was astonishing how easy the implementation and the integration was versus trying to build and manage our own Lustre® file system using Ubuntu instances,” said Alessandro Menegaz, IT Manager at TRE ALTAMIRA. “The improvements we got in terms of performance and costs savings have been great too. Recently we completed one of the most demanding analyses that we have ever made in less than two weeks with Matrix. In contrast, with the Lustre system we built, a similar analysis a year ago took two months and we spent three times as much. Plus, the WekaIO support is unparalleled.”
Time-to-market has also improved exponentially, as Matrix eliminated limitations on the number of concurrent analyses. An average execution now takes just 12 hours, regardless of how many orders are being processed simultaneously, which has simplified the sales process by unlocking the ability to offer guaranteed delivery times. Finally, TRE ALTAMIRA says WekaIO Matrix on AWS has opened new areas of innovation for its business; enabling it to revisit products and services that were previously discarded for want of resources.
“TRE ALTAMIRA is the perfect example of a data-intensive use case whose explosive potential could only be fully tapped by the combination of Matrix on AWS, our cloud-native storage solution that enables high-performance computing at massive scale and elasticity,” said WekaIO CEO, Liran Zvibel. “Here you have a customer that began immediately reaping tremendous returns on its investment from the get-go—faster time-to-market, better cost-management, new product innovation potential. We couldn’t have hoped for a better outcome with the power of the Matrix solution on AWS.”
WekaIO helps companies manage, scale and futureproof their data center so they can solve real problems that impact the world. WekaIO Matrix, one of the world’s fastest shared parallel file systems and WekaIO’s flagship product, leapfrogs legacy storage infrastructures by delivering simplicity, scale, and faster performance for a fraction of the cost. In the cloud or on-premises, WekaIO’s NVMe-native high-performance software-defined storage solution removes the barriers between the data and the compute layer, thus accelerating artificial intelligence, machine learning, genomics, research, and analytics workloads.
Dec. 12, 2018 — High-performance computing (HPC) specialists are looking forward to the technological improvements that should arrive in the coming years as supercomputers approach the exascale. New approaches in hardware design (including new processors and high-bandwidth memory) and in application development (for example, code parallelization and data processing) will expand the power of supercomputing and therefore make it possible to solve new kinds of complex problems.
Panel discussion on the Future of HPC in Engineering
Industrial research and development in engineering is one application area of high-performance computing that is likely to benefit from these advances. In an effort to expand the transfer of HPC expertise to engineering industry and research, a new European Union Centre of Excellence (CoE) for engineering applications called EXCELLERAT kicked off on December 11 and 12 in Stuttgart, the location of its coordinating organization High-Performance Computing Center Stuttgart (HLRS). The kick-off event concluded with a panel discussion moderated by Andreas Wierse (SICOS GmbH) with Bastian Koller (HLRS), Erwin Laure (KTH), Matthias Meinke (RWTH), Thomas Gerhold (DLR), and Gerd Büttner (Airbus). The expert panel of scientists, application owners, and application users discussed how EXCELLERAT will benefit the use of HPC in different research areas, such as aerospace and automotive. With its roughly €8 million in EU funding, the centre will accelerate technology transfer of leading-edge HPC developments to the engineering sector.
EXCELLERAT’s goal is to facilitate the development of important codes for high-tech engineering, including maximizing their scalability to ever larger computing architectures and supporting the technology transfer that will enable their uptake within the industrial environment. “These activities”, says project coordinator and HLRS managing director Bastian Koller, “will support engineers through the entire HPC engineering application lifecycle, including data pre-processing, code optimization, application execution, and post-processing. In addition, EXCELLERAT will provide training that prepares engineers in industry to take advantage of the opportunities that the latest HPC technologies offer.”
Six codes to define success of EXCELLERAT
Keys to the future success of EXCELLERAT are six codes – namely Nek5000, Alya, AVBP, Fluidity, FEniCS, and FLUCS – that are being brought into the project by consortium partners. The codes have been developed for academic applications in engineering fields such as aerospace, automotive, combustion, and fluid dynamics. To support their integration into real-life industrial applications, all consortium partners will work closely with end users outside the consortium. This will ultimately lead to fast feedback-cycles in all areas of the HPC engineering application lifecycle, from consultation on methodology and code implementation to data transfer and code optimization. End users will benefit from their commitment by gaining first-hand access to the project results.
Consortium partners cover expertise along the HPC engineering application lifecycle
The consortium of 13 partners in 7 European countries covers various areas of expertise. Project coordinator HLRS will provide consulting services for industrial HPC users on how to ensure their supercomputers are used most effectively. To support post-processing, HLRS will offer its extensive expertise in the implementation and application of parallel visualization tools for analyzing the large datasets produced by simulations.
Teratec and SICOS-BW are HPC advisors for industry and active at the intersection between computer science and the private sector. They will identify potential industrial users of EXCELLERAT services in their networks.
For machine-learning applications, the Fraunhofer Institute for Algorithms and Scientific Computing SCAI will provide expertise on computational algorithms and data analytics. The Barcelona Supercomputing Center (BSC), Royal Institute of Technology (KTH), German Aerospace Center (DLR), RWTH Aachen, and French research centre CERFACS will supply the codes.
The participating HPC centres HLRS, ARCTUR, CINECA, EPCC, BSC, and KTH will provide access to their respective supercomputing systems in Germany, Italy, Slovenia, Scotland, Spain and Sweden, and will establish links to other systems, such as the forthcoming European pre-exascale machines. IT service provider SSC will ensure secure data transfer using their SWAN platform for proprietary use cases.
There’s a nice snapshot of advancing work to develop improved neural network “synapse” technologies posted yesterday on IEEE Spectrum. Lower power, ease of use, manufacturability, and performance are all key parameters in the search for optimum neural network technologies. Several promising approaches were presented at last week’s IEEE International Electron Devices Meeting in San Francisco. Currently there is a virtual alphabet soup of possibilities: resistive RAM, flash memory, magnetoresistive RAM (MRAM), electrochemical RAM (ECRAM) and phase change memory (PCM), among the contenders.
As pointed out in the IEEE Spectrum article (Searching for the Perfect Artificial Synapse for AI written by Samuel Moore), “Rather than use the logic and memory of ordinary CPUs to represent these (neural net connections), companies and academic researchers have been working on ways of representing them in arrays of different kinds of nonvolatile memories. That way, key computations can be made without having to move any data. AI systems based on resistive RAM, flash memory, MRAM, and phase change memory are all in the works, but they all have their limitations.”
IBM presented interesting work with ECRAM technology as noted in this excerpt from the article:
“The ECRAM cell looks a bit like a CMOS transistor. A gate sits atop a dielectric layer, which covers a semiconducting channel and two electrodes, the source and drain. However, in the ECRAM, the dielectric is lithium phosphorous oxynitride, a solid-state electrolyte used in experimental thin-film lithium-ion batteries. In an ECRAM, the part that would be the silicon channel in a CMOS transistor is made from tungsten trioxide, which is used in smart windows, among other things.
“To set the level of resistance—the synapse’s “weight” in neural networks terms—you pulse a current across the gate and source electrodes. When this pulse is of one polarity, it drives lithium ions into the tungsten layer, making it more conductive. Reverse the polarity, and the ions flee back into the lithium phosphate, reducing conductance.
“Reading the synapse’s weight just requires setting a voltage across the source and drain electrodes and sensing the resulting current. The separation of the read current path from the write current path is one of the advantages of ECRAM, says Jianshi Tang at IBM T.J. Watson Research Center. Phase change and resistive memories have to both set and sense conductance by running current through the same path. So reading the cell can potentially cause its conductance to drift.”
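The separate write and read paths described in the excerpt can be illustrated with a toy model. This is a sketch under loud assumptions, not a device simulation: the linear update, the step size, the conductance bounds and the arbitrary units are all hypothetical, and real ECRAM cells behave less ideally. The `column_current` helper shows the commonly described idea of summing currents from several synapses to form a weighted sum in place.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ToySynapse:
    conductance: float = 0.5  # arbitrary units (hypothetical starting weight)
    step: float = 0.125       # change per write pulse (hypothetical)
    g_min: float = 0.0
    g_max: float = 1.0

    def write_pulse(self, polarity: int) -> None:
        """Gate-source pulse: +1 raises conductance, -1 lowers it."""
        if polarity not in (1, -1):
            raise ValueError("polarity must be +1 or -1")
        updated = self.conductance + polarity * self.step
        self.conductance = max(self.g_min, min(self.g_max, updated))

    def read(self, voltage: float) -> float:
        """Source-drain read: current = conductance * voltage, state untouched."""
        return self.conductance * voltage


def column_current(synapses: Sequence[ToySynapse],
                   voltages: Sequence[float]) -> float:
    """Sum the read currents of several synapses: a weighted sum of inputs."""
    if len(synapses) != len(voltages):
        raise ValueError("need one voltage per synapse")
    return sum(s.read(v) for s, v in zip(synapses, voltages))


synapse = ToySynapse()
synapse.write_pulse(+1)
synapse.write_pulse(+1)          # two potentiating pulses
before = synapse.conductance
current = synapse.read(0.2)      # reading does not disturb the weight
after = synapse.conductance

total = column_current([ToySynapse(0.5), ToySynapse(0.25)], [1.0, 2.0])
```

In this model `read` never modifies `conductance`, which mirrors the advantage Tang describes: sensing the weight does not run current through the path used to set it.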
Moore’s article is worth a quick read. He notes, “There may be no perfect synapse for neuromorphic chips and deep learning devices. But it seems clear from the variety of new, experimental ones revealed at IEDM last week that there will be better ones than we have today.”
Dec. 12, 2018 — The European Open Science Cloud (EOSC) was launched during an event hosted by the Austrian Presidency of the European Union. The inauguration marks the conclusion of a long process of consultation and reflection with stakeholders led by the European Commission.
During the event Professor Karel Luyben, representing the European Association of Universities of Technology CESAER, was installed as chairman of the Executive Board of EOSC. Professor Luyben is proud of his appointment to the EOSC: “The EOSC is an important step towards our ambition: to make all research accessible, findable and searchable. It is an honour to be able to contribute to this”. According to him, there is still plenty of work to be done: “Research data is not only about figures, but also about texts, images, sound or algorithms, for example, and that across all scientific disciplines. We have to work hard on international standards in order to be able to share this with each other.” He has a clear ultimate goal in mind: “In ten or twenty years’ time we should no longer be talking about ‘open science’. Then open is simply the norm.”
The event also introduced the new EOSC Portal. The EOSC Portal provides access to data, services and resources. It is a source of up-to-date information about the EOSC initiative, including best practices, governance and user stories. All service providers, research communities and other entities with a stake in EOSC are invited to visit the EOSC Portal and explore how they can benefit from its functionalities.
For several years now the big cloud providers – Amazon, Microsoft Azure, Google, et al – have been transforming from technology consumers into technology creators in hardware and software. The most recent example is Amazon’s announcement two weeks ago that it had developed its own Arm-based chips (Graviton) and was using them in select instances. Yesterday, the New York Times ran an article, Amazon’s Homegrown Chips Threaten Silicon Valley Giant Intel.
The Times piece, written by Cade Metz, focuses largely on goliath Intel, which currently holds about a 96.6 percent share of server chip sales, according to IDC. That’s down from 98.6 percent in the last year or so due to AMD’s re-emergence in the server chip business with its Epyc CPU line introduced in June 2017.
Intel is an easy target, having dominated the CPU market for years and earned a reputation as a tough-minded, tough-acting supplier. Here’s a brief excerpt from Metz’s article which is a good, fast read:
“Amazon’s new chip arrives as market forces are rapidly undercutting chip makers and their $412 billion in yearly sales. Online operations like Amazon and Google have grown so large, they can save significant money by making chips tailored to their needs rather than buying them from longtime suppliers…
“About 35 percent of the server chips sold around the world go to about 10 companies, including large internet companies like Amazon and a handful of telecommunication firms, said Shane Rau, a chip analyst with the research firm IDC. That just one of them is shifting its business is terrible news for Intel.
“Each one of these companies is so large, they represent a market unto themselves,” Mr. Rau said.
These days Intel is facing a wide range of challenges – CPU rivals, surging GPU use (Nvidia, AMD), emergence of specialty chips (e.g. AI-oriented such as Google’s TPU), and export restrictions. Yet the bigger threat may not be just to Intel, but to chip-makers more generally and perhaps other portions of the technology supply chain as big cloud providers take more control over their technology needs. The ramifications could ripple through HPC, enterprise, and PC markets.
According to Metz, “Amazon executives believe the new chip, which was designed to be more energy efficient, will help reduce the cost of electrical power in its data centers. It said it was offering a cloud-computing service that would allow business customers to use its new chip. The cost of the service could be 45 percent lower than other options…
“…Inside these cloud operations, the internet companies are also reinventing how computers are built. They are designing much of their own hardware, from the servers to the networking gear that connects servers. Facebook is doing similar work with the technology in its data centers.”
Metz ends the piece quoting a long-time semiconductor executive saying, “We will look back at this as a watershed moment.” We’ll see.
SAN JOSE, Calif., Dec. 11, 2018 — Quantum Corp. today announced the opening of its new Executive Briefing Center in Englewood, Colorado. The conference center and technology lab showcase Quantum’s integrated solutions and provide a place for key customers and partners to get direct access to Quantum executives, product managers, developers, technology, and leaders from all business functions to drive new levels of collaboration and customer-led innovation. Designed to facilitate discussion and demonstration, the Executive Briefing Center houses a technology proof of concept lab and serves as a training and collaboration hub for business partners.
“We designed our new Executive Briefing Center to create an inviting space for collaboration with our customers and partners, in order to better focus on the business challenges they face and understand how Quantum innovation can be harnessed to address them,” said Jamie Lerner, CEO and President of Quantum. “It provides a center for us to explore the emerging industry trends that are shaping the next generation of Quantum solutions.”
An Immersive Experience in Innovation
The Executive Briefing Center provides an immersive experience across the breadth of Quantum’s innovative solutions, products and services. Visitors can engage in hands-on learning in Quantum’s Tech Center, which was developed to provide interactive demonstrations of the latest innovations in storage. Demonstrations are available for offerings spanning file, block, object, and tape storage for high performance video and image workflows, as well as leading solutions for enterprise backup and other technologies from Quantum’s portfolio.
Highlighting the Latest Developments
The Executive Briefing Center opening is one of several recent initiatives rolled out by Quantum. Earlier this year Quantum announced enhancements to the Xcellis high performance storage portfolio, including StorNext® 6.2 which bolsters performance for 4K and 8K content while enhancing integration with cloud-based workflows and global collaborative environments. More recently Quantum announced a new ruggedized, in-vehicle storage solution designed specifically for mobile and remote capture of video and other IoT sensor data. These and other technologies are available for demonstration at the Executive Briefing Center.
Quantum is a leading expert in scale-out tiered storage, archive and data protection, providing solutions for capturing, sharing, managing and preserving digital assets over the entire data lifecycle. From small businesses to major enterprises, more than 100,000 customers have trusted Quantum to address their most demanding data workflow challenges. Quantum’s end-to-end, tiered storage foundation enables customers to maximize the value of their data by making it accessible whenever and wherever needed, retaining it indefinitely and reducing total cost and complexity. See how at www.quantum.com/customerstories.
Call it a corollary to Murphy’s Law: When a system is most in demand, when end users are most dependent on the system performing as required, when it’s crunch time – that’s when the system is most likely to blow up. Or make you wait in line to use it.
They know this at Mellanox as well as anyone. With nearly 3,000 employees worldwide, the supplier of Ethernet and InfiniBand interconnects for servers, storage, and hyper-converged infrastructure (used in half of the top 500 supercomputers) is under ongoing pressure to innovate a product line that includes high performance network and multicore processors, network adapters, switches, cables, software and silicon.
Doron Sayag of Mellanox
For the hundreds of developers supporting the company’s silicon design-related work, crunch time can overtax the Mellanox on-prem data center. The company needed a stable engineering HPC capability that could perform well in the on-premises high performance computing environment and burst transparently into Microsoft Azure Cloud during tape-outs – i.e., peak load periods.
“The idea is that we have a grid of servers that provide the compute calculations,” said Doron Sayag, IT enterprise computing services senior manager. “And we’d like that the engineer has as simple an environment as possible, one in which he can submit a job request for his calculation according to the specific tool he’s working with, and he doesn’t care if the job runs on server number 1 or 2, he just wants it to run as fast as possible and get the results back. This is the main reason we use the job scheduler.”
The Mellanox HPC cluster runs electronic design automation (EDA) pipeline software and simulation. The silicon engineering operation, which utilizes in-house tools and commercial applications, is computationally intensive. Sayag said superior workload orchestration is critical. For several years, the organization used Sun Grid Engine (SGE), an open source scheduler, but it lacked certain features and capabilities that Sayag and his colleagues felt were necessary.
“We were running an open source job scheduler, but it presented stability issues,” he said. “We wanted to replace it with a robust enterprise-supported solution that was cloud-enabled. Also, we were looking for ways to add not just more features for product development but also to increase the productivity of the system we have. That’s what drove us to explore other options on the market.”
In the fall of last year, the silicon design group decided to switch from SGE to a solution from Univa, comprised of Navops Launch, Univa Grid Engine and Unisight, with features such as resource management, quotas and limits, and priority and utilization policies, all working to enable Mellanox’s existing on-premise infrastructure and workflows to encompass the cloud.
Along with cloud bursting, the applications provide a window into what the silicon operation is doing and the burdens placed on its on- and off-prem systems resources.
Univa Grid Engine schema
“I get statistics on the jobs running with the Unisight feature,” Sayag said. “We get a GUI and we get graphs according to our spec job types, we get information like run times, number of failures and so on, an overview of what we’re doing and where we have bottlenecks. Then we can prioritize tasks to fix them.”
He cited the example of scheduling compilation jobs. “Then we see that the compilations are taking a lot of time,” he said, “so engineers writing the compilation configuration can make them more efficient, reducing the run times. Then we see the results in Unisight,” confirming improved efficiencies.
He said Navops Launch helps the silicon operation increase its on-premises data center efficiencies while addressing unmet peak performance needs with a cloud bursting ‘pay-as-you-go’ scenario – which enables the team to run as many jobs as required. It all happens quite transparently, Sayag said, so that developers aren’t aware when bursting happens. With compute resource wait times virtually eliminated, Mellanox estimates that the Univa solution saves one third of the time of a skilled FTE.
“By extending Univa Grid Engine to the Microsoft Azure Cloud, we gained practically infinite capacity from our hybrid cloud solution in a very cost-effective manner. The impact to our engineering teams is noticeable in terms of throughput and turnaround time, even though use of the cloud is completely transparent to them.”
Univa CEO Gary Tyreman said that while its hybrid HPC cloud orchestration software is broadly adopted in the life sciences, biotech and pharmaceuticals verticals, Mellanox is a pioneer among EDA semiconductor companies, most of which are “very sensitive to intellectual property exposure concerns.” He added: “The surprise I think to us is for a semiconductor company and a technology company to be embracing HPC in the cloud.”
Dec. 11, 2018 — In response to recent media reports, Super Micro Computer, Inc. has conducted testing on its motherboards and found no evidence of malicious hardware. The investigation’s results were announced in a letter from the CEO, which is published in full below.
Super Micro Computer, Inc.
980 Rock Avenue
San Jose, CA 95131 USA
December 11, 2018
Testing Finds No Malicious Hardware on Supermicro Motherboards
Dear Valued Customer,
Recent reports in the media wrongly alleged that bad actors had inserted a malicious chip or other hardware on our products during our manufacturing process.
Because the security and integrity of our products is our highest priority, we undertook a thorough investigation with the assistance of a leading, third-party investigations firm. A representative sample of our motherboards was tested, including the specific type of motherboard depicted in the article and motherboards purchased by companies referenced in the article, as well as more recently manufactured motherboards.
Today, we want to share with you the results of this testing: After thorough examination and a range of functional tests, the investigations firm found absolutely no evidence of malicious hardware on our motherboards.
These findings were no surprise to us. As we have stated repeatedly, our process is designed to protect the integrity and reliability of our products. Among other safeguards:
We test our products at every step of the manufacturing process. We test every layer of every board we manufacture throughout the process.
We require that Supermicro employees be onsite with our assembly contractors, where we conduct multiple inspections, including automated optical, visual, electrical, and functional tests.
The complexity of our motherboard design serves as an additional safeguard. Throughout our supply chain, each of our boards is tested repeatedly against its design to detect any aberration and to reject any board that does not match its design.
To guard against tampering, no single employee, team, or contractor has unrestricted access to our complete board design.
We regularly audit our contractors for process, quality, and controls.
We appreciate the industry support regarding this matter from many of our customers, like Apple and AWS. We are also grateful for numerous senior government officials, including representatives of the Department of Homeland Security, the Director of National Intelligence, and the Director of the FBI, who early on appropriately questioned the truth of the media reports.
As we have stated repeatedly since these allegations were reported, no government agency has ever informed us that it has found malicious hardware on our products; no customer has ever informed us that it found malicious hardware on our products; and we have never seen any evidence of malicious hardware on our products.
Today’s announcement should lay to rest the unwarranted accusations made about Supermicro’s motherboards. We know that many of you are also addressing these issues with your own customers. To assist in those conversations, we have prepared a short video that highlights our quality assurance process.
We appreciate your patience as we have diligently conducted a thorough investigation into the reports. We are truly proud of the security, integrity, and quality of our products. And we are proud to stand by our products. Please contact our team if you have any questions.