Data science is becoming ubiquitous across all areas of knowledge and is reshaping data-driven discovery in strategic research and industry domains. A major aspect of data science is understanding the characteristics and complexities of data in order to select, extract, construct, and analyze the features that best represent and model a problem. This understanding of the nature of data is crucial for developing interpretable models.
A new project by Research Assistant Professor Silvina Caino-Lores addresses the challenge of understanding scientific data, specifically data generated through large-scale simulations. She has been awarded a grant from the South Big Data Hub for this project, “Training Next-Generation Data Scientists in Non-Deterministic Scientific Data Generation.” The award is part of the S.E.E.D.S. Grant Program: Southern Engagement and Enrichment in Data Science.
Due to the complexity and size of these simulations, the use of high-performance computing (HPC) and parallel programming techniques such as the Message Passing Interface (MPI) is crucial for generating simulated data efficiently. Achieving high speed and scalability requires asynchronous execution of simulation code. However, such asynchronous executions often exhibit non-deterministic behavior: for example, multiple runs of the same code can produce different patterns of execution due to changes in the order of simulated events.
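One concrete way this affects scientific data is through floating-point reductions: because floating-point addition is not associative, the order in which asynchronous partial results arrive can change the final value. The sketch below is a minimal, hypothetical illustration (not code from the project), using a shuffled summation to stand in for run-to-run variation in arrival order.

```python
import random

# Partial results from (hypothetical) parallel workers. Because
# floating-point addition is not associative, summing these in
# different orders gives different answers.
partials = [1e16, 1.0, -1e16]

left_to_right = (partials[0] + partials[1]) + partials[2]  # the 1.0 is absorbed
reordered = (partials[0] + partials[2]) + partials[1]      # the 1.0 survives

# Same inputs, different results purely due to evaluation order.
assert left_to_right != reordered

# Shuffling emulates partial results arriving in an unpredictable
# order across runs of the same asynchronous simulation:
random.shuffle(partials)
total = 0.0
for p in partials:
    total += p
print(total)  # value depends on the (random) accumulation order
```

This is the kind of effect a data scientist might otherwise misread as a change in the underlying science rather than an artifact of execution order.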
It is essential that data scientists understand how non-determinism can affect the generation and analysis of scientific data. This project bridges the knowledge gap between data science and HPC-enabled domain science. The goal is to develop and deliver training modules targeted at data scientists and data science students who seek to understand the impact of non-determinism on their data.