
Foundry-ML - Data, Models, Science
Foundry-ML is a platform for discovering and sharing machine-learning-ready datasets. To use a dataset, load it directly into a DataFrame with Python and it's ready to go!
The Problem With the Current State of ML Data
Many datasets used in machine learning research are not fully accessible to the public, which makes the research difficult to reproduce and build on. The barriers usually come down to the same few reasons:
- The infrastructure isn't there to share it. ML requires large datasets, and many common hosting options can't accommodate them.
- There's no context for what's in the data. People often post their data online without considering how someone else will interpret it. Most of the time, it can't be interpreted.
- The data has no structure and isn't formatted uniformly. Cleaning up a dataset to the point where it is actually usable can take weeks to months.
- The data isn't ready to be used in an ML workflow. Once the data is structured, formatted, and cleaned, it still needs more work before it fits into an ML work environment.
Where Foundry-ML Comes In
Foundry-ML solves each of those issues:
- Our infrastructure is capable of transferring terabytes of data. We use Globus behind the scenes to make this possible.
- We require metadata for every dataset. It includes the dataset keys with a description of each one, the data type, the data size, and many more fields that make the data easy to understand.
- Each dataset fits our structure and format standards, making it easy to use and ready to go in an ML environment. You can load the data directly into a DataFrame and start coding!
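To make the metadata requirement concrete, here is a minimal sketch of what checking a dataset's metadata could look like. It is an illustration only: the field names (`short_name`, `data_type`, `n_items`, `keys`), the helper functions, and the example record are all hypothetical and are not Foundry-ML's actual schema or API.

```python
# Hypothetical sketch: these field names are illustrative assumptions,
# not Foundry-ML's real metadata schema.
REQUIRED_FIELDS = ("short_name", "data_type", "n_items", "keys")
REQUIRED_KEY_FIELDS = ("key", "type", "description")


def find_metadata_problems(metadata):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    # Every dataset-level field must be present.
    for field in REQUIRED_FIELDS:
        if field not in metadata:
            problems.append(f"missing field: {field}")
    # Every key (column) must carry a name, a role, and a description.
    for i, entry in enumerate(metadata.get("keys", [])):
        for field in REQUIRED_KEY_FIELDS:
            if not entry.get(field):
                problems.append(f"keys[{i}] missing: {field}")
    return problems


def columns_by_role(metadata, role):
    """List the key names whose declared role matches `role`."""
    return [entry["key"] for entry in metadata["keys"] if entry["type"] == role]


# Made-up example record, used only to exercise the helpers above.
example = {
    "short_name": "example_dataset",
    "data_type": "tabular",
    "n_items": 3,
    "keys": [
        {"key": "feature_a", "type": "input", "description": "An input column"},
        {"key": "label", "type": "target", "description": "The value to predict"},
    ],
}

print(find_metadata_problems(example))
print(columns_by_role(example, "input"))
```

Because each key declares its role and description up front, a user can tell which columns are inputs and which are targets without opening the data itself.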
Acknowledgments
This work is a collaboration between the University of Chicago, Argonne National Laboratory, and the University of Wisconsin-Madison.
This work was supported by the National Science Foundation under NSF Award Number: 1931306 "Collaborative Research: Framework: Machine Learning Materials Innovation Infrastructure".
This work was performed under financial assistance award 70NANB14H012 from the U.S. Department of Commerce, National Institute of Standards and Technology as part of the Center for Hierarchical Materials Design (CHiMaD).