Report from Alan Turing Institute Symposium now publicly available

A report has now been published on the Alan Turing Institute Symposium on Reproducibility for Data-Intensive Research, which was held on 6-7 April 2016 at the University of Oxford.

It was organised by a team led by Oxford comprising Bodleian Libraries, the e-Research Centre, Computer Science and Oxford University Press, and also involving representatives of the Alan Turing Institute (ATI) joint venture partners (the universities of Cambridge, Edinburgh, Oxford, UCL and Warwick), the University of Manchester, Newcastle University and the British Library.

The key aim of the symposium was to address the challenges around reproducibility of data-intensive research in science, social science and the humanities, and to present recommendations for the ATI to take forward. Achieving reproducibility requires that papers in experimental science describe the results and provide a sufficiently clear protocol to allow successful repetition and extension of their analyses. However, according to a recent report from the January 2016 Dagstuhl seminar on the topic of reproducibility of data-oriented experiments in e-science, most computational experiments are only specified informally in papers, and the code that produced the results is seldom available.

The ATI is the UK's leading data science institute and as such has a key role in supporting and promoting the reproducibility of data-intensive research. The report summarises three key ways in which this can be achieved: firstly, through funding and conducting world-leading research into technical, social and cultural aspects of reproducibility; secondly, through implementing practical mechanisms that support reproducibility during the pursuit of interdisciplinary data science research; and thirdly, through acting as an advocate, exemplar and champion of reproducible research, engaging with the community and developing partnerships with existing institutions in this area. The report is intended to inform the data science research programme and researcher practices, to enable the development of key policies, and to inform the development of the ATI's data and computer infrastructure.

The symposium format was an interactive workshop, opening with a keynote presentation by Professor Carole Goble CBE, University of Manchester, who gave a definition of reproducibility in the context of computational data analytics. Professor Jared Tanner, University of Oxford, gave an introduction to the mission and objectives of the ATI. Members of the Organising Committee then chaired five workshop sessions on:

• Data provenance to support reproducibility
• Computational models and simulations
• Reproducibility for real-time big data
• Publication of data-intensive research
• Novel architectures and infrastructures to support reproducibility

A major goal of the symposium was to encourage researchers to exchange information and ideas around the topic and to maximise participation from the developing ATI community. The symposium convened an invited interdisciplinary group of researchers who employ data-intensive computational methods in their research, from many areas of computer science. Stakeholders from key institutions such as the Digital Curation Centre and the Software Sustainability Institute, and from publishers and data repositories such as Elsevier, GigaScience and F1000Research also attended, as well as a small cohort of early career researchers from the Turing Institute joint venture partners.

Professor David De Roure, who was a member of the Organising Committee and co-authored the report, chaired the session on Reproducibility for Real-Time Big Data with Dr Suzy Moat (University of Warwick) and Dr Eric Meyer (University of Oxford).

This session discussed reproducibility in the context of real-time big data and new forms of digital scholarship, characterised by machines and people operating together at scale. The participants considered how we might reproduce data science research using social media analytics, which examines new social processes at the scale of the population and in real time, and looked ahead to our increasingly automated future, asking whether it is meaningful to automate reproducibility, and if and how we should keep the human in the loop.

The discussions ranged over issues such as the need to clarify what is meant by best practice in relation to reproducibility, embracing rather than rejecting challenges to existing assumptions, the conflict between ethics and reproducibility, using machine learning approaches alongside statistical methods to improve calibration or reduce bias, and confirming that interdisciplinarity is key to data science and its reproducibility. A series of recommendations was developed, including that the ATI has a role in fostering collaboration between researchers using established methods and those working with new methods, and in particular between social scientists and data scientists, in order to better understand new and emerging forms of data, for example those created through the Internet of Things.

Dr Susanna-Assunta Sansone, the e-Research Centre's Associate Director of Life, Natural and BioMedical Sciences, also gave a lightning talk, 'Data sharing stories from Scientific Data' (Nature Publishing Group).

Further information
The report, the symposium programme, speaker biographies and delegate list, and slides and video recordings from the presentations and talks are publicly available online on the Open Science Framework.