A Provenance-Based Infrastructure for Creating Reproducible Papers - Juliana Freire and Claudio Silva
While computational experiments have become an integral part of the scientific method, repeating such experiments remains a challenge: they often require specific hardware, non-trivial software installation, and complex manipulations to obtain results. Generating and sharing repeatable results takes substantial work with current tools. A crucial technical challenge, then, is to make this easier for (i) the author of the paper, (ii) the reviewer of the paper, and, if the author is willing to disseminate code to the community, (iii) the eventual readers of the paper. Although a number of tools have been developed that attack sub-problems related to the creation of reproducible papers, no end-to-end solution is available. Besides giving authors the ability to link results to their provenance, such a solution should enable reviewers to assess the correctness and relevance of the experimental results described in a submitted paper. Furthermore, upon publication, readers should be able to repeat and build upon the computations embedded in the papers. But even when the provenance associated with a result is available and contains a precise and executable specification of the computational process (i.e., a workflow), shipping that specification to run in an environment different from the one in which it was designed raises many challenges. From hard-coded locations for input data to dependencies on specific versions of software libraries and hardware, adapting a workflow to run in a new environment can be difficult and sometimes impossible.
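One concrete portability problem mentioned above is hard-coded input-data locations. As a minimal sketch (the function name `relocate_paths` and the dictionary-based workflow specification are illustrative assumptions, not part of any real tool), a workflow's absolute file paths can be rewritten against a new data root when the workflow is unpacked in a different environment:

```python
# Hypothetical sketch: rewrite absolute data paths recorded in a
# workflow specification so the workflow can run on a machine where
# the input data lives under a different root directory.
import posixpath

def relocate_paths(spec, old_root, new_root):
    """Map file parameters under old_root to the same layout under new_root."""
    relocated = {}
    for name, path in spec.items():
        if path.startswith(old_root):
            rel = posixpath.relpath(path, old_root)
            relocated[name] = posixpath.join(new_root, rel)
        else:
            relocated[name] = path  # leave unrelated paths untouched
    return relocated

spec = {"volume": "/home/alice/data/head.vti", "log": "/tmp/run.log"}
print(relocate_paths(spec, "/home/alice/data", "/srv/shared/data"))
# → {'volume': '/srv/shared/data/head.vti', 'log': '/tmp/run.log'}
```

This only addresses path relocation; version mismatches in libraries and hardware dependencies, as the text notes, are harder and sometimes cannot be resolved automatically.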
We posit that integrating data acquisition, derivation, analysis, and visualization as executable components throughout the publication process will make it easier to generate and share repeatable results. To this end, we have built an infrastructure to support the life-cycle of 'reproducible publications'---their creation, review, and re-use. In particular, our design considers the following desiderata: Lower Barrier for Adoption---it should help authors in the process of assembling their submissions; Flexibility---it should support multiple mechanisms that give authors different choices as to how to package their work; Support for the Reviewing Process---reviewers should be able to unpack and reproduce the experiments, as well as validate them. We have used VisTrails, a provenance-enabled, workflow-based data exploration tool, as a key component of our infrastructure. We leverage VisTrails' provenance infrastructure to systematically capture useful meta-data, including workflow provenance, source code, and library versions. We have also taken advantage of the extensibility of the system to integrate components and tools that address issues central to reproducible papers, including: linking results to their provenance; the ability to repeat results, explore parameter spaces, and interact with results through a Web-based interface; and the ability to upgrade the specification of computational experiments to work in different environments and with newer versions of software. In this talk, we outline challenges we have encountered and present some of the components we have developed to address them. We also present a demo showing real-world uses of our infrastructure.
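The systematic meta-data capture described above can be sketched as follows. This is a hypothetical illustration, not VisTrails code: the function `capture_provenance` and its arguments are assumed names, standing in for the kind of record (library versions, input hashes, parameters, platform) that would let a result be linked back to the environment that produced it.

```python
# Hypothetical sketch: bundle the metadata needed to link a workflow
# result to its provenance -- parameters, a hash of the input data,
# library versions, and the execution platform.
import hashlib
import json
import platform
import sys

def capture_provenance(params, input_bytes, libraries):
    """Return a deterministic JSON provenance record for one run."""
    record = {
        "python_version": sys.version.split()[0],
        "platform": platform.system(),
        "parameters": params,
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "library_versions": libraries,
    }
    # sort_keys makes records from identical runs byte-identical,
    # so differing environments are easy to detect by comparison.
    return json.dumps(record, sort_keys=True)

rec = capture_provenance({"iso_value": 0.5}, b"volume-data", {"vtk": "5.0"})
print(json.loads(rec)["library_versions"])
# → {'vtk': '5.0'}
```

Because the record is deterministic, a reviewer re-running an experiment can compare provenance records directly: any difference in libraries, platform, or inputs shows up as a textual diff.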