Dr David Johnson

Senior Research Associate/Research Software Engineer Dr David Johnson talks about his recent Teaching Award, ethical and privacy issues around data use - and an abundance of cake.

When did you start at the Centre and what was your first role here?

I joined the e-Research Centre in September 2015 as a Senior Research Associate/Research Software Engineer.

What is your background?

I did all my Computer Science degrees (Bachelor’s, Master’s and doctorate) at the University of Reading, just down the road from Oxford. After receiving my PhD in 2010, I joined the Computing Laboratory (now called the Department of Computer Science) here at Oxford as a postdoc, then spent a couple of years at Imperial College London, where I helped set up their Data Science Institute.

Summarise the research you are doing / your research interests in a few sentences.

My primary role is as a Research Software Engineer, where I work in a group that develops infrastructure to support data management and data sharing for the life sciences (biology, medicine, ecology etc.). While my role as a software engineer is technical, I have an interest in applying software engineering practice and tooling in my work. I believe data engineering is still a few steps behind software engineering in the maturity of its best practices and tools. People working with scientific data and ‘Big Data’ could learn a lot from the state of the art in software engineering research.

Why is this important (to the scientific community / the world at large)?

What counts as ‘big’ in ‘Big Data’ is a moving target. What I think is most important is being able to maximize the value of data. I believe this can be achieved through well-planned computational infrastructure and tools, as well as through best practices and a culture of openness, sharing and creativity.

What would you like to do next, funding permitting?

I like the idea of working towards the automation of data curation, or towards making it easier for machines to deal with messy data. No disrespect to data curators (I have data curator friends – sorry guys!), but data wrangling and curation are often spoken of as the most significant bottleneck in building data processing pipelines. If we can loosen this bottleneck, we can further unleash the potential of data science. In software engineering there’s a lot of work on automated testing and continuous integration of software code, and I think this is something that we can learn from in trying to automate data processing pipelines.

Are you involved in any wider collaborations? Why are these important?

I am very lucky to have been recruited to work on two projects, one national and one international. The Horizon 2020 PhenoMeNal project is developing European-scale digital infrastructure for medical metabolomics (studying the human metabolism), which allows me to work with researchers across the whole of Europe. I also work on the COPO (Collaborative Open Plant ‘Omics) project, developing data sharing infrastructure for plant science, funded by the BBSRC.

Throughout my research career, I’ve worked on EU-funded projects. I believe that to enable world-leading science, you have to be able to use and keep developing computing infrastructure, and to build scientific communities, without borders. The EU’s research programmes have put Europe, the UK, and us at Oxford in a prime position for world-leading research.

What publication/paper are you most proud of and why?

My paper 'The role of markup for enabling interoperability in health informatics', published in Frontiers in Physiology, describes how markup languages (machine-readable codes that can be embedded into other documents and data) are really useful for enabling data exchange in biomedicine and health research. I feel it’s important because it highlights how mathematical models can be decoupled from computer source code, while also discussing how software engineering concepts, such as Generics and Inheritance, can enable interoperability and reusability in computer-based modelling.

Have you received any awards or fellowships?

I was recently elected to a Junior Research Fellowship at Kellogg College, along with my Centre colleague Alejandra Gonzalez-Beltran. I also received an MPLS (Maths, Physical and Life Sciences division) Teaching Award, to run a trial in teaching with personal activity data generated by the students themselves. If successful, I hope to encourage MPLS to roll out the teaching method across the division. [Read how the teaching method has been adopted by Oxford University's Continuing Education Department on their 10-week evening Applied Data Science course.] I am very proud of these achievements, and you can read more about my JRF here and about the teaching awards on the MPLS division website.

What do you think the most important issues/challenges in your field will be in the next decade and how is the Centre placed to address them?

In the life sciences, a major challenge lies in the more specific area of medical/clinical data. For years I can recall researchers discussing, time and again, the same ethical and privacy issues surrounding the use of clinical records and patient data for research purposes, and how these issues limit access to and use of such data. I do not think the solution to this is a technical one, but rather one of social engineering. There needs to be better education on what data is, what it means, and what potential good and harm can come from sharing personal data, such as one’s genome, spending patterns, or even personal likes and dislikes on social media sites like Facebook. Perhaps then individuals would be more open to giving consent for their data to be used for research purposes. The Centre is well placed to address this challenge, as it is inherently a multidisciplinary problem and the Centre is already multidisciplinary in nature.

What do you think the Centre does best?

There’s always an awful lot of cake going around.