Scientists today have access to more data, and more precise data, than at any point in history.
Even as massive supercomputers unlock the secrets of how materials perform over centuries and how individual molecules move within cells, internet-enabled consumer devices are becoming more prevalent and more useful.
Smart thermostats and washing machines do more than make consumers’ lives easier—they also generate valuable data regarding power and water usage patterns over a day or across a city. Smartwatch readings could reveal how physical activity rates vary in neighborhoods with and without public parks, while car GPS information might help identify inefficiencies within a highway system.
“While the exponentially increasing volume of data creates exciting opportunities for scientists,” said EECS Dongarra Professor Michela Taufer, “it also creates very serious theoretical and logistical challenges in data management. Not only do we have to manage and process the data, we also have to do it in a way that is equitable, efficient, and economical.”
Analyzing the data is also complicated by limited computational resources; supercomputers can analyze large datasets astonishingly quickly—in days or weeks rather than months—but reserving time on them is expensive.
Fortunately, Taufer has a solution.
As the head of UT’s Global Computing Laboratory, Taufer has long focused on bringing high performance computing (HPC) to scientists. HPC aggregates computing power from multiple sources to make data analytics far more time-efficient.
“HPC has already enabled all sorts of important scientific discoveries,” Taufer said. “Now, supercomputers keep getting faster while cloud and edge computing are also getting more powerful. This means that any mode of managing data must operate comprehensively, allowing data from these different sources to flow together like streams flowing into a river. We have to build pipelines that are deep and wide enough to accommodate a flood of data from all directions.”
Taufer is now embarking on a collaborative research project that will upgrade HPC networks to accommodate that flood. The joint effort, co-led by Professor Ewa Deelman from the University of Southern California, is supported by a $624,000 award from the Software and Hardware Foundations program of the National Science Foundation (NSF).
Along with Deelman and EECS Research Assistant Professor Jack Marquez, Taufer will develop a catalogue of common dataflow motifs.
“A dataflow motif is like a set of dance steps,” said Taufer. “It tells us the possible routes data can take at any given time and place. The first step in managing data is understanding what kind of data is moving and how it flows.”
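To make the idea concrete, here is a minimal, purely illustrative sketch of how such a catalogue might be represented in code; the class, fields, and example entries below are hypothetical and are not drawn from the project itself.

```python
from dataclasses import dataclass

# Hypothetical sketch: one way a dataflow motif could be written down in code.
# The class, fields, and example entries are illustrative, not the project's design.

@dataclass
class DataflowMotif:
    name: str            # short label for the motif
    sources: list[str]   # where the data originates (sensors, instruments, simulations)
    sinks: list[str]     # where the data ends up (edge gateway, cloud storage, HPC cluster)
    pattern: str         # shape of the movement, e.g. "fan-in", "fan-out", "pipeline"

# A toy catalogue of common motifs, analogous to a set of known "dance steps."
CATALOGUE = [
    DataflowMotif("sensor fan-in", ["thermostat", "smartwatch"], ["edge gateway"], "fan-in"),
    DataflowMotif("simulation fan-out", ["supercomputer"], ["archive", "visualization node"], "fan-out"),
]

for motif in CATALOGUE:
    print(f"{motif.name}: {', '.join(motif.sources)} -> {', '.join(motif.sinks)} ({motif.pattern})")
```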
The team will then design a software program to optimize the distribution of data to the computer hardware that can perform analysis, whether locally or in the cloud.
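As a rough, hypothetical illustration of that kind of placement decision (the project’s actual optimization software is not described here, and the function name, threshold, and cost figures below are invented):

```python
# Hypothetical sketch of a placement rule: decide whether an analysis task runs
# on local hardware or in the cloud. The function name, threshold, and cost
# figures are invented for illustration only.

def place_task(data_gb: float, local_free_gb: float, cloud_cost_per_gb: float = 0.05) -> str:
    """Return where to run the analysis: locally if the data fits, otherwise in the cloud."""
    if data_gb <= local_free_gb:
        return "local"  # fastest and cheapest when local capacity allows
    estimated_cost = data_gb * cloud_cost_per_gb
    return f"cloud (approx. ${estimated_cost:.2f} in transfer costs)"

print(place_task(data_gb=40, local_free_gb=100))    # -> local
print(place_task(data_gb=500, local_free_gb=100))   # -> cloud (approx. $25.00 in transfer costs)
```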
Perhaps the most impactful aspect of Taufer’s project will be training new HPC experts in how to use the motif catalogue and new software, creating a broad community of specialists who can help scientists optimize their dataflow pipelines.
These experts, recruited from the UT and USC graduate student populations, will be able to identify potential data analysis bottlenecks before they happen, then adjust the pipeline to maximize both processing efficiency and accuracy.
Taufer and Deelman strongly believe in promoting participation in computer science from historically underrepresented populations, particularly women. They plan to recruit HPC trainees through partnerships with campus groups like Systers, a student-run volunteer organization at UT focused on recruiting, mentoring, and retaining women in the EECS department.
“This is a very exciting time to be working with Big Data,” Taufer said. “As we are able to generate more and more data, we will also be able to execute increasingly extensive analyses. We anticipate that this will accelerate scientific discovery faster than ever before.”
Contact
Izzie Gall (865-974-7203, egall4@utk.edu)