Dr Alejandra Gonzalez-Beltran, Research Lecturer, talks about her recent Kellogg College Junior Research Fellowship, the reporting of statistical methods used in data analysis, and the technical challenges around data stewardship.
When did you start at the Centre and what was your first role here?
I joined the Centre in June 2012 as a Senior Research Software Engineer. Now, I am a Research Lecturer. From the start, I have been involved in projects related to enhancing and extending the Investigation/Study/Assay (ISA) infrastructure for tracking metadata about biological experiments. I have also contributed to the BioSharing portal of standards, databases and policies in the life, environmental and biomedical sciences.
What is your background?
I am a computer scientist: after a degree in Computer Science from the Universidad Nacional de Rosario, Argentina (equivalent to BSc + MSc), I was awarded a PhD in Computer Science from Queen's University Belfast, UK. My PhD work was about efficient ways to access information on a distributed network using probabilistic data structures. Then I worked as a post-doctoral researcher at University College London, UK, collaborating with the UK National Cancer Research Institute, the US National Cancer Institute, the UCL Cancer Institute and others, on methods to find and integrate distributed cancer data, as well as on best practices for recording information about therapy experiments.
Summarise the research you are doing / your research interests in a few sentences.
My research interests involve applying Computer Science methodologies to applications in the life, environmental and biomedical domains. In particular, I develop models, methods and software tools for data curation, data discovery, knowledge management, data publication and data analysis, with the aim of enabling data sharing, data re-use and reproducible research.
Why is this important (to the scientific community / the world at large)?
Technological advances have propelled data generation to levels previously unimaginable. Managing the generated data in an efficient way is paramount for the advancement of science, as it would help to avoid duplication of efforts, would enable data sharing and improve data re-use, and would support reproducibility, which requires a detailed description of the methods used, as well as the availability of the software tools yielding the results.
I collaborate with scientists, technologists, service providers, journals and communities developing standards that support data sharing, interoperability, re-use and reproducibility. I hope to contribute to innovative ways of enhancing scholarly communication and the ways in which all the outcomes of research are made available as I believe it would accelerate science and discoveries.
What would you like to do next, funding permitting?
I would like to explore further the issues around reporting the statistical methods used in data analysis and how these relate to the quality of the data produced. For this purpose, I want to apply the STATistical Ontology (STATO) we have built to help report and reason over (logically connect) the statistical methods applied in different experiments.
Some questions that the ontology can answer:
Are you involved in any wider collaborations? Why are these important?
I am involved in multiple projects with collaborations in the UK, Europe, the US, China and beyond. These collaborations are crucial for the work we do, as making data FAIR (Findable, Accessible, Interoperable and Reusable) requires the wide adoption of standards, which by definition have to be agreed on by large communities of practitioners and all stakeholders.
What publication/paper are you most proud of and why?
I have published papers on the software tools I have worked on, such as Risa (a tool that bridges the information about an experiment and the data analysis using the R language) and linkedISA (a tool that converts experiments described using spreadsheets to linked data, allowing users to search and establish connections between datasets).
These papers are important, but it is also great to apply these tools to specific use cases, demonstrating how they can help in making data reproducible. So, I would highlight a paper for which we collaborated with other researchers and publishers and made recommendations on how to report results to move "From Peer-Reviewed to Peer-Reproduced in Scholarly Publishing: The Complementary Roles of Data Models and Workflows in Bioinformatics".
Have you received any awards or fellowships?
I have recently been elected a Junior Research Fellow at Kellogg College; the fellowship will start in Michaelmas Term (October 2016). This is a great opportunity to increase the links between the Centre and Kellogg College, promote my research and potentially create new collaborations. In addition, I hope to use the fellowship to raise the profile of women in Science, Technology, Engineering and Mathematics by organising seminars and outreach activities.
I have also received a Lockey grant to fund my trip to the "Semantics, Analytics, Visualisation: Enhancing Scholarly Data" workshop, which I co-chaired, and the 25th International World Wide Web Conference, which took place in Montreal, Canada, in April 2016.
Previously, I won the ORCID codefest that took place at Oxford, UK, in May 2013 and, as a prize, was invited to participate in the ORCID and DataCite Interoperability Network (ODIN) codesprint and first year conference at CERN, Switzerland.
And even before that, I won a best paper award for my paper and presentation on "Ontology-based queries over cancer data" at the 3rd International Workshop on Semantic Web Applications and Tools for the Life Sciences (SWAT4LS 2010).
What do you think the most important issues/challenges in your field will be in the next decade and how is the Centre placed to address them?
There are technical challenges around data stewardship: distinguishing what data to preserve, evaluating data quality, ensuring the accessibility and privacy of confidential data, and providing easy-to-use tools to help in all of the above. But there are also important social challenges that need to be tackled. These revolve around highlighting the societal benefits of making the research process more transparent, as well as revising the academic credit system, so that researchers are rewarded for sharing their well-described data and methods instead of seeing sharing as a disadvantage.
What do you think the Centre does best?
The Centre's interdisciplinary nature is definitely what makes us unique compared to other departments across the University. It gives us a stimulating working environment with common interest groups (e.g. in linked data or machine learning), while we apply these methodologies to different application domains.
Watch Alejandra's presentation on scholarly publishing in the life sciences at the 16th Annual BioInformatics Open Source Conference.