AI Contest Seeks to Cut Data Center Costs

DOE/Thomas Jefferson National Accelerator Facility

NEWPORT NEWS, VA – Who, or rather what, will be the next top model?

Data scientists and developers at the U.S. Department of Energy's Thomas Jefferson National Accelerator Facility are trying to find out, exploring some of the latest artificial intelligence (AI) techniques to help make high-performance computers more reliable and less costly to run.

The models in this case are artificial neural networks trained to monitor and predict the behavior of a scientific computing cluster, where torrents of numbers are constantly crunched. The goal is to help system administrators quickly identify and respond to troublesome computing jobs, reducing downtime for scientists processing data from their experiments.

In almost fashion-show style, these machine learning (ML) models are judged to see which is best suited for the ever-changing dataset demands of experimental programs. But unlike the hit reality TV series "America's Next Top Model" and its international spinoffs, it doesn't take an entire season to pick a winner. In this contest, a new "champion model" is crowned every 24 hours based on its ability to learn from fresh data.

"We're trying to understand characteristics of our computing clusters that we haven't seen before," said Bryan Hess, Jefferson Lab's scientific computing operations manager and a lead investigator – or judge, so to speak – in the study. "It's looking at the data center in a more holistic way, and going forward, that's going to be some kind of AI or ML model."

While these models don't win any glitzy photoshoots, the project recently took the spotlight in the peer-reviewed magazine IEEE Software as part of a special issue dedicated to machine learning operations (MLOps) in data centers.

The results of the study could have big implications for Big Science.

The Need

Large-scale scientific instruments, such as particle accelerators, light sources and radio telescopes, are critical DOE facilities that enable scientific discovery. At Jefferson Lab, it's the Continuous Electron Beam Accelerator Facility (CEBAF), a DOE Office of Science User Facility relied on by a global community of more than 1,650 nuclear physicists.

Experimental detectors at Jefferson Lab collect faint signatures of tiny particles originating from the CEBAF electron beams. Because CEBAF produces beam 24/7, those signals translate into mountains of data: on the order of tens of petabytes per year, enough to fill a typical laptop's hard drive roughly every half hour.

Particle interactions are processed and analyzed in Jefferson Lab's data center using high-throughput computing clusters with software tailored to each experiment.

Among the blinking lights and bundled cables, complex jobs requiring several processors (cores) are the norm. The fluid nature of these workloads means many moving parts – and more things that could go wrong.

Certain compute jobs or hardware problems can result in unexpected cluster behavior, referred to as "anomalies." These can include memory fragmentation or input/output (I/O) overcommitment, resulting in delays for scientists.

"When compute clusters get bigger, it becomes tough for system administrators to keep track of all the components that might go bad," said Ahmed Hossam Mohammed, a postdoctoral researcher at Jefferson Lab and an investigator on the study. "We wanted to automate this process with a model that flashes a red light whenever something weird happens.

"That way, system administrators can take action before conditions deteriorate even further."

A DIDACT-ic Approach

To address these challenges, the group developed an ML-based management system called DIDACT (Digital Data Center Twin). The acronym is a play on the word "didactic," which describes something that's designed to teach. In this case, it's teaching artificial neural networks.

DIDACT is a project funded by Jefferson Lab's Laboratory Directed Research & Development (LDRD) program, which provides the resources for laboratory staff to pursue projects that could make rapid and significant contributions to critical national science and technology problems of mission relevance and/or advance the laboratory's core scientific and technical capabilities.

The DIDACT system is designed to detect anomalies and diagnose their source using an AI approach called continual learning.

In continual learning, ML models are trained on data that arrive incrementally, similar to the lifelong learning experienced by people and animals. The DIDACT team trains multiple models in this fashion, with each representing the system dynamics of active computing jobs, then selects the top performer based on that day's data.

The models are variations of unsupervised neural networks called autoencoders, which learn to compress and then reconstruct the patterns of normal cluster behavior; data they reconstruct poorly stand out as anomalies. One is equipped with a graph neural network (GNN), which looks at relationships between components.
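
For a flavor of how that works, here is a minimal, purely illustrative Python sketch of reconstruction-error anomaly scoring – not the DIDACT code itself, and every name in it is hypothetical:

```python
# Purely illustrative sketch of autoencoder-based anomaly detection,
# assuming each node's health metrics arrive as a fixed-length vector.
# None of these names come from the DIDACT codebase.
import torch
import torch.nn as nn

class MetricsAutoencoder(nn.Module):
    def __init__(self, n_features: int, latent_dim: int = 8):
        super().__init__()
        # Squeeze the metrics into a small latent code, then rebuild them.
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 32), nn.ReLU(),
            nn.Linear(32, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(),
            nn.Linear(32, n_features),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

def anomaly_scores(model: MetricsAutoencoder, x: torch.Tensor) -> torch.Tensor:
    """Mean squared reconstruction error per sample: a model trained on
    normal behavior rebuilds normal inputs well, so high scores suggest
    something unusual is happening on the cluster."""
    model.eval()
    with torch.no_grad():
        return ((x - model(x)) ** 2).mean(dim=1)
```

An operator, or a monitoring dashboard, would then alert on scores above a threshold established during normal operation.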

"They compete using known data to determine which had lower error," said Diana McSpadden, a Jefferson Lab data scientist and lead on the MLOps study. "Whichever won that day would be the 'daily champion.' "

The method could one day help reduce downtime in data centers and optimize critical resources – meaning lower costs and improved science.

Here's how it works.

The Next Top Model

To train the models without affecting day-to-day compute needs, the DIDACT team developed a testbed cluster called the "sandbox." Think of the sandbox as a runway where the models are scored, in this case based on their ability to train.

The DIDACT software is an ensemble of open-source and custom-built code used to develop and manage ML models, monitor the sandbox cluster, and write out the data. All those numbers are visualized on a graphical dashboard.

The system includes three pipelines for the ML "talent." One is for offline development, like a dress rehearsal. Another is for continual learning, where the live competition takes place. The third runs in real time: each time a new top model emerges, it becomes the primary monitor of cluster behavior there – until it's unseated by the next day's winner.
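
Putting those pieces together, one day of the cycle might be sketched as follows, assuming (our assumption, not DIDACT's published interface) that each candidate exposes incremental-update and scoring methods:

```python
# Hypothetical sketch of one continual-learning day: update every candidate
# on freshly arrived data, crown the champion, and promote it to the
# real-time pipeline until tomorrow's winner replaces it.
from typing import Callable, Dict, Protocol, Sequence

class ContinualModel(Protocol):
    """Assumed interface for a candidate model; illustrative only."""
    def partial_fit(self, batch: Sequence) -> None: ...
    def error(self, batch: Sequence) -> float: ...

def daily_cycle(
    candidates: Dict[str, ContinualModel],
    fresh_data: Sequence,
    holdout: Sequence,
    deploy: Callable[[ContinualModel], None],
) -> str:
    for model in candidates.values():
        model.partial_fit(fresh_data)   # incremental update, not a full retrain
    champion = min(candidates, key=lambda name: candidates[name].error(holdout))
    deploy(candidates[champion])        # becomes the live cluster monitor
    return champion
```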

"DIDACT represents a creative stitching together of hardware and open-source software," said Hess, who is also the infrastructure architect for the High Performance Data Facility Hub being built at Jefferson Lab in partnership with DOE's Lawrence Berkeley National Laboratory. "It's a combination of things that you normally wouldn't put together, and we've shown that it can work. It really draws on the strength of Jefferson Lab's data science and computing operations expertise."

In future studies, the DIDACT team would like to explore an ML framework that optimizes a data center's energy usage, whether by reducing the water flow used in cooling or by throttling down cores based on data-processing demands.

"The goal is always to provide more bang for the buck," Hess said, "more science for the dollar."

Further Reading

Establishing Machine Learning Operations for Continual Learning in Computing Clusters

Jefferson Lab Devotes $3 Million to Testing New Ideas

High Performance Data Facility Hub

24s: A Businesslike Name for a 'High-Performing Machine'

Rolling in the Deep: Norfolk Street Flooding Predicted in Seconds With Machine Learning Models

Steering Electrons Out of the Drift with Deep Learning

Unlocking Hidden Potential Through Artificial Intelligence
