Virtual Appliances, Cloud Computing, and Reproducible Research - Bill Howe
Science in every discipline is becoming data-intensive, requiring researchers to interact with their data solely through computational and statistical methods rather than through direct manipulation. Perhaps paradoxically, these in silico experiments are often more difficult to reproduce than traditional "manual" laboratory techniques: the software pipelines used to acquire and process data have complex, version-sensitive interdependencies; datasets are too large to transport efficiently from place to place; and interfaces are often complex and underdocumented.
At the UW eScience Institute, we are exploring the use of virtual machines and cloud computing to mitigate these challenges. A virtual machine can capture a researcher's entire working environment as a snapshot, including the data, software, dependencies, intermediate results, logs and other usage history, operating system and file system context, convenience scripts, and more. These virtual machines can then be saved, made publicly available, and referenced in a publication. This approach not only facilitates reproducibility but also incurs essentially zero overhead for the researcher. Coupled with cloud computing, it offers additional benefits: experimenters need not allocate local resources to host the virtual machine, large datasets and long-running computations can be managed efficiently, and resource costs are more easily shared between producer and consumer.
In this talk, I motivate this approach with case studies from our experience and consider some of the implications and future directions.