Open data sharing accelerates COVID-19 research

Open access data benefits millions of scientists around the world and is essential for a rapid response to the COVID-19 pandemic

Artist's impression of open access COVID-19 data sharing. Credit: Spencer Phillips, EMBL-EBI.

Scientific information is described as open access (OA) when it's available online and free of charge to the end user. This can be in the form of scientific research articles or data such as genomic sequences, protein structure information, or scientific images. OA to research data and publications is vital for making research reusable, allowing data to be explored and re-analysed and leading to new discoveries in a diverse range of contexts.

OA is more important now than ever before. We've moved from a time when a single human genome would take 10 years to analyse, costing approximately $100 million, to a time when thousands of genetic sequences are produced every day, and a single human genomic sequence takes up only around 1 GB of storage space. This abundance of sequence data can be accessed and shared across the globe, all thanks to OA databases.

The ability of scientists to produce, store, and readily access genomic data has led to the inception of large-scale sequencing projects that only 10 years ago would have been completely unfeasible. Initiatives such as the Darwin Tree of Life project, which aims to sequence the genomes of 60 000 species across the UK and Ireland, and the 100 000 Genomes Project, which sequenced the genomes of NHS patients affected by a rare disease or cancer, demonstrate this striking increase in scale.

OA data sharing has greatly accelerated life science research and furthered our understanding of health and disease. As new projects improve clinical reporting from genomic sequence information, clinically relevant data are becoming increasingly easy to access. Large-scale sequencing projects such as the Pan-Cancer Analysis of Whole Genomes (PCAWG) have led to major discoveries, such as the chronology of genomic changes in many different cancer types. Open access to genome data is transforming clinical diagnosis and personalised treatment.

OA at EMBL

EMBL's European Bioinformatics Institute (EMBL-EBI) maintains a vast range of freely available, open access data resources. These allow scientists to upload, access, and analyse a broad variety of biological datasets. Researchers can access reference genome annotation through Ensembl, 3D protein structural data through the Protein Data Bank in Europe (PDBe), and life science publications and preprints through Europe PMC, to name just a few of the services available.

EMBL is funded by the public, and EMBL's OA policy requires that all scientific publications coming out of EMBL are made freely available in Europe PMC. Any article deposited in Europe PMC can be searched in full and is accessible to anyone in the world with an internet connection.
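As a rough illustration of what "searchable by anyone" can look like in practice, the sketch below builds a query URL for the Europe PMC REST search service. It only constructs the URL and does not send a request. The function name `build_search_url` and its parameters are hypothetical names chosen for this example, and the endpoint and the `OPEN_ACCESS:y` filter reflect the public Europe PMC web service as commonly documented, so they should be checked against the current documentation before use.

```python
from urllib.parse import urlencode

# Base address of the Europe PMC REST search service (assumed current).
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(query: str, open_access_only: bool = True, page_size: int = 25) -> str:
    """Return a Europe PMC search URL for `query`; nothing is fetched here."""
    if open_access_only:
        # Restrict results to open access articles using a field filter.
        query = f"({query}) AND OPEN_ACCESS:y"
    params = {"query": query, "format": "json", "pageSize": page_size}
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    # Paste the printed URL into a browser or pass it to an HTTP client.
    print(build_search_url("SARS-CoV-2"))
```

Keeping URL construction separate from the network call makes the query easy to inspect and reuse with whichever HTTP client a project already relies on.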

"Open access and open science policies are extremely important," says Jo McEntyre, Associate Director of EMBL-EBI Services. "They encourage people to make their research open and reusable, so it can be explored and re-analysed as new methods and technologies come online."

Accelerating research through OA

There are many benefits to making research data and publications freely accessible to both scientists and the public. Much of today's research is funded by the taxpaying public and so it only seems fair that any research funded in this way should be accessible to everyone who wishes to read it. OA also helps those researchers and institutions struggling with increasing journal subscription fees to access current research articles.

OA increases the visibility of research data and information. This has many positive effects on science globally, including an ability to rapidly build upon and react to existing research. Scientists can carry out huge collaborative projects on a global scale. One famous example of this is the Human Genome Project, which involved thousands of scientists all over the world who developed a deeply valuable and publicly available resource for future research.

OA SARS-CoV-2 data

Now that we find ourselves amid a global pandemic, the world is looking to scientists to find new treatments for COVID-19 and a vaccine against SARS-CoV-2. With thousands of new cases recorded around the world each day, there is no time to lose. Researchers are busy collecting large amounts of data relating to the pandemic, and OA sharing of these data is essential for accelerating our understanding of the biology and spread of COVID-19.

Submission of SARS-CoV-2 data to OA databases, such as UniProt or those of the International Nucleotide Sequence Database Collaboration (INSDC), makes the data rapidly and freely available to everyone. Submitted SARS-CoV-2 data are automatically incorporated into a host of further OA COVID-19 databases, from specialised databases to the overarching European COVID-19 Data Platform, to aid SARS-CoV-2 research.

Exploring the viral proteins

Developing new diagnostic tests and therapeutics depends on understanding the biology of the virus: how SARS-CoV-2's proteins function, which amino acid residues or sequences are involved in binding to receptors on host cells, which cell types the virus can infect, and how disease develops. To help researchers answer these questions, UniProt has launched a dedicated COVID-19 Portal featuring the latest SARS-CoV-2 proteins, receptors, and host protein entries.

For example, UniProt contains extensive annotation on glycosylation sites. The coronavirus spike protein, which allows the virus to invade a host cell, is heavily glycosylated, meaning it is covered in sugar molecules. Glycosylation plays a critical role in protein folding and in immune evasion: the sugars shield specific epitopes, the regions of the protein that the human immune system can recognise, from antibody neutralisation.
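To give a feel for what a glycosylation site looks like at the sequence level, here is a minimal sketch that scans a protein sequence for the common N-linked glycosylation motif: asparagine (N), then any residue except proline, then serine or threonine. This is a simple textbook heuristic for candidate sites, not the method UniProt uses for its curated annotation, and the function name `find_sequons` and the toy sequence are made up for illustration.

```python
import re

# N-X-S/T motif: asparagine, any residue except proline, then serine or threonine.
# The lookahead makes the match zero-width, so overlapping motifs are all reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(sequence: str) -> list[int]:
    """Return 1-based positions of asparagines that start an N-X-S/T motif."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence.upper())]


if __name__ == "__main__":
    toy_sequence = "MNGSANPTNAT"  # invented example, not a real protein
    print(find_sequons(toy_sequence))  # the N-P-T stretch is skipped because of proline
```

A motif match only flags a candidate site. Whether a site is actually glycosylated has to come from experimental evidence, which is the kind of information curated annotation records.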

UniProt also provides extensive annotations on protein function and protein sequence with evidence from existing literature. Researchers can also submit their COVID-19 papers referencing UniProt data to be collected into a community bibliography.

In the future, the COVID-19 UniProt Portal will also integrate coding variants of UniProt proteins and use text-mining approaches to identify papers associated with these variants. This will create a map of potentially critical variations associated with virus infection. There are also plans to integrate a protein COVID-19 knowledge graph to represent direct and indirect relationships between the virus and host proteins, host pathway mechanisms, and drug-target interactions to explore known and new therapeutics.
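The idea of a knowledge graph can be sketched in a few lines: facts are stored as (subject, relation, object) links, direct relationships are the links leaving a node, and indirect relationships are found by following links onward. The example below is an illustrative toy, not the planned UniProt design. The class and function names are hypothetical, and the compound entry is a placeholder rather than a real drug.

```python
from collections import defaultdict, deque


class KnowledgeGraph:
    """Toy directed graph of (subject, relation, object) facts."""

    def __init__(self):
        self._edges = defaultdict(set)  # subject -> {(relation, object), ...}

    def add(self, subject, relation, obj):
        self._edges[subject].add((relation, obj))

    def neighbours(self, subject):
        """Direct relationships: sorted (relation, object) pairs for a subject."""
        return sorted(self._edges.get(subject, ()))

    def reachable(self, start):
        """Indirect relationships: every node reachable by following edges."""
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for _, target in self._edges.get(node, ()):
                if target not in seen and target != start:
                    seen.add(target)
                    queue.append(target)
        return seen


def build_example_graph():
    kg = KnowledgeGraph()
    kg.add("spike glycoprotein", "binds", "ACE2")            # virus-host interaction
    kg.add("ACE2", "part_of", "renin-angiotensin system")    # host pathway
    kg.add("hypothetical_compound", "targets", "ACE2")       # placeholder drug-target link
    return kg


if __name__ == "__main__":
    kg = build_example_graph()
    print(kg.neighbours("spike glycoprotein"))
    print(kg.reachable("spike glycoprotein"))
```

Following edges from the spike glycoprotein reaches the host pathway through ACE2, which is the kind of indirect relationship such a graph is meant to surface.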

To ensure the most up-to-date SARS-CoV-2 data are always available, the UniProt COVID-19 Portal is updated independently of the general UniProt release cycle.

The European COVID-19 Data Platform

EMBL-EBI launched the European COVID-19 Data Platform in conjunction with the European Commission, the European Open Science Cloud, ELIXIR, and a number of partner institutions. The Platform enables rapid access to datasets and results pertaining to the SARS-CoV-2 pandemic, which will accelerate research and support the development of diagnostics, therapeutics, and effective vaccines.

The Platform organises, for example, the collection and analysis of viral sequence data to provide global open data sharing through the SARS-CoV-2 Data Hubs. Six months after its launch, the Platform features over 60 000 SARS-CoV-2 sequences, along with additional molecular data including proteins, compounds, and drug targets. It also contains the Federated European Genome-phenome Archive (EGA), designed to support national data management requirements for genomic and clinical data as part of healthcare or biomedical research projects. It includes a secure authorised access mechanism to support research use of human data across Europe. All these data are then connected to the COVID-19 Data Portal, which makes them readily available to researchers.

The COVID-19 Data Portal

One of the biggest challenges in a fast-moving pandemic is to share data and findings in a coordinated way. To address this challenge, EMBL-EBI and partners operate the COVID-19 Data Portal, which features SARS-CoV-2 data from EMBL-EBI data resources including the European Nucleotide Archive (ENA), UniProt, PDBe, the Electron Microscopy Data Bank (EMDB), Expression Atlas, and Europe PMC. The Portal is constantly updated with new datasets and tools.

"Users can upload their SARS-CoV-2 data and get access to data from other sources around the world," says Guy Cochrane, Team Leader for Data Coordination and Archiving at EMBL-EBI. "We're working hard to make the Portal intuitive and easy to use."

In its first six months, the COVID-19 Data Portal has seen nearly 3 million web requests and thousands of data submissions. Over 300 institutions from 30 countries have deposited data and the Portal now offers open access to over 180 000 scientific publication records relating to the COVID-19 outbreak.

A better response to future pandemics

"If 2020 has taught us anything, it's that no country alone can stop a pandemic. Collaboration is key," explains Guy. "One of the things that gives me hope is that we are now in a much stronger position. We have enhanced international partnerships and we have built robust data-sharing infrastructure. The European COVID-19 Data Platform is helpful in the short term, and more importantly it's a model for how to share infectious disease data in the future, enabling collaboration between countries and disciplines. It means we can reuse and adapt the data infrastructure to help understand, monitor, and stop other infectious diseases - and this is a very encouraging thought."

Using the COVID-19 Data Platform in research

Guy Cochrane, one of the scientists behind the development of the Portal, and Andrea Zaliani, a scientist using the Portal for COVID-19 research, share their insights on the importance of open access (OA) during the pandemic and on how researchers are using the COVID-19 Data Platform.

Guy Cochrane is Team Leader for Data Coordination and Archiving at EMBL-EBI.

Guy is the head of the European Nucleotide Archive (ENA), a comprehensive repository for public nucleotide sequence data. He is also jointly responsible for the inception, maintenance, and development of the COVID-19 Data Platform.

Andrea Zaliani is Senior Bioinformatics Scientist at Fraunhofer Institute for Translational Medicine and Pharmacology (TMP).

Andrea has extensive experience in pharmaceutical research and development, including the chemogenomic application of bioinformatics tools. He is currently conducting bioinformatics research on COVID-19 at Fraunhofer TMP.

Q: How do you think open access data plays a role in the COVID-19 pandemic?

Zaliani: I am all in favour of open access data sharing: it helps to cut the time, costs, and relative stress of biomedical research. Open access data sharing has been a paradigm for quick answers in this pandemic. The focus shift for quite a number of biosafety level 4 laboratories working on viruses has been unprecedented. One thing is for certain, this could not have happened so quickly without open access data sharing.

Cochrane: One of the key features of the COVID-19 Data Platform is the volume of open access SARS-CoV-2 raw sequence data available. Open access to these data is really important to accelerate accurate understanding of the genetic variation of the virus. The reason we care about this variation is because it informs us about the biology, transmission, and spread of the virus, which in turn leads us to drug discovery, intervention, and vaccine development.

Q: How have you or other scientists used the COVID-19 Data Platform for COVID-19 research?

Zaliani: We have been producing COVID-19 screening data, and the COVID-19 Data Portal has been very effective for us. With a sudden change of focus during the pandemic, we needed to have a reference point where we could search for data, and to share our own data with the public. The non-flashy surface of the Portal offers users all the functions they need. I hope soon we will be able to tap into in vivo model data as well.

I will definitely recommend anyone involved in COVID-19 research to check out the Portal and appreciate the data curation which has been devoted to it.

Cochrane: The European COVID-19 Data Platform allows scientists around the world to access different data types relating to COVID-19, from the virus itself or from the patients affected by the virus. There are three technical components behind the Platform. The SARS-CoV-2 Data Hubs enable scientists to manipulate, validate, interpret, and ultimately share viral data. The federated European Genome-phenome Archive (EGA) allows scientists securely to share genetic data related to humans. Finally, we have the COVID-19 Data Portal, which is a website that allows researchers to access data and submit their own data into the system.

Q: How do you see the COVID-19 Data Platform evolving over time?

Cochrane: Over the first six months since the Data Platform launched we have achieved a lot, including over 60 000 viral sequences from over 300 institutions across the world. These numbers are growing and growing, and we are regularly updating the COVID-19 Data Portal to ensure everyone has rapid access to these data. The project will also grow to include further partners across Europe. The COVID-19 Data Portal Sweden was recently released, and we envisage other countries will have their own Portals as the project matures.

We have seen a great deal of collaboration from researchers working through the Platform. Researchers who had never met before started excellent collaborations powered by videoconferences. This sense of unity has helped us build robust data-sharing infrastructures, such as the COVID-19 Data Platform, which could be repurposed in the future for other infectious diseases. We now have a model not just for times of crisis, but for how we do science in the future.