Researchers face an increasingly complex data landscape in which data are obtained from many different sources (e.g., instruments, computers, published datasets), stored in disjoint storage systems, and analyzed on an array of high-performance and cloud computers. Given the increasing speed at which data are produced, combined with increasingly complex scientific processes and the data management, munging, and organization activities required to make sense of data, researchers face new bottlenecks in the discovery process. Improving data lifecycle management practices is essential to enhancing productivity, facilitating reproducible research, and encouraging collaboration. We posit that researchers require automated methods for managing their data such that tedious and repetitive tasks (e.g., transferring, archiving, and analyzing) are accomplished without continuous user input.

The Globus automation services enable users to create, share, and run reliable, secure, and efficient distributed data management pipelines. Globus Flows makes it possible, for example, for a researcher to specify that the arrival of new data should trigger, in turn, a transfer to a remote computer, analysis of those data, updates to registries, and email notifications to collaborators.
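A pipeline like the one above can be expressed as a flow definition, a JSON-style document of chained states. The sketch below, written as a Python dict, is illustrative only: the endpoint IDs, paths, and analysis function are hypothetical placeholders, and the action-provider URLs are assumed here rather than taken from the text.

```python
# A minimal sketch of a flow definition: a transfer step chained to an
# analysis step. All IDs and paths are hypothetical placeholders supplied
# at run time via the flow's input document ("$.input.…" references).
flow_definition = {
    "Comment": "Transfer new data to a remote computer, then analyze it",
    "StartAt": "TransferData",
    "States": {
        "TransferData": {
            "Type": "Action",
            # Assumed Globus Transfer action provider URL.
            "ActionUrl": "https://actions.globus.org/transfer/transfer",
            "Parameters": {
                "source_endpoint_id.$": "$.input.source_endpoint",
                "destination_endpoint_id.$": "$.input.dest_endpoint",
                "transfer_items": [
                    {
                        "source_path.$": "$.input.source_path",
                        "destination_path.$": "$.input.dest_path",
                    }
                ],
            },
            "ResultPath": "$.TransferResult",
            "Next": "AnalyzeData",
        },
        "AnalyzeData": {
            "Type": "Action",
            # Assumed compute action provider; the function ID is a placeholder.
            "ActionUrl": "https://compute.actions.globus.org",
            "Parameters": {
                "endpoint.$": "$.input.compute_endpoint",
                "function.$": "$.input.analysis_function",
            },
            "ResultPath": "$.AnalysisResult",
            "End": True,
        },
    },
}
```

Registry updates and collaborator notifications would be added as further states in the same chain; each state consumes the results of its predecessor through the shared flow state document.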

Our research focuses on three core areas: (1) developing cyberinfrastructure that enables autonomous scientific pipelines to be described and executed; (2) investigating policies to efficiently schedule the placement and execution of tasks; and (3) exploring techniques and programming models that enable non-technical users to design and configure custom automations.

Further reading