One in every 10 people worldwide is affected by a rare genetic disease, but about 50% of them remain undiagnosed despite rapid advances in genetic technology and testing. Even when a person does have access to testing, getting a diagnosis can take five years or more, which is sometimes too late for patients, who are often children, to start the right treatment.
This is partly because current clinical testing uses a method called short-read sequencing, which cannot access information in certain regions of the genome and so may miss crucial evidence to help make a diagnosis. But UC Santa Cruz researchers are pushing forward research on a cutting-edge alternative method, called long-read sequencing, which can provide a more comprehensive dataset for finding variation, eliminate the need for multiple specialized tests, and streamline the diagnosis of rare diseases.
A new study shows that long-read sequencing has the potential to improve the rate of diagnosis while reducing the time to diagnosis from years to days, in a single test and at a much lower cost. The study was published in The American Journal of Human Genetics and led by two core members of the UCSC Genomics Institute, Professor of Biomolecular Engineering (BME) Benedict Paten and Associate Professor of BME Karen Miga, as well as former UCSC postdoctoral scholar Jean Monlong.
"Rare diseases are something that people have been struggling to diagnose for so many years, and if we have a sequencing technology which streamlines diagnostic testing, I think that will be a huge contribution — and that is what we tested as part of this paper," said Shloka Negi, a UC Santa Cruz BME Ph.D. student who is the paper's first author.
"Today, the diagnostic yield of genetic sequencing is frustratingly low," Paten said. "One likely cause is the incomplete sequencing methods used in clinical practice. In this work, we test the hypothesis that new, more comprehensive long-read sequencing can generate additional information useful for genetic diagnosis. We were excited to discover numerous additional potentially interesting genetic variants and epigenetic signals in our cohort. While it is still early days, there is great promise in this information, and it will take time for the community to interpret and fully understand much of this new information."
Finding rare disease
This study focused on rare monogenic diseases, which are caused by a disruption to a single gene.
Scientists diagnose genetic diseases by searching through a patient's genetic material to find variants, which are differences in a gene that may prevent it from functioning properly. The typical approach for finding these variants uses a technique called short-read sequencing, which reads DNA in stretches of about 150 to 250 base pairs at a time, each base being adenine (A), cytosine (C), guanine (G), or thymine (T).
The limitation of short-read sequencing, however, is that it can miss crucial information in certain regions of the genome, such as sequence patterns that are much longer than 250 base pairs. It also can't perform "phasing," the process of determining which variants are inherited from the mother and which are from the father. Phasing tells clinicians whether two variants were inherited from the same parent, one from each parent, or not inherited at all. That information can be very useful for genetic diagnoses, especially when parental data is not available.
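To make the idea of phasing concrete, here is a minimal sketch in Python. It assumes phased genotypes written in the common VCF style, where "0|1" means the variant sits on the second copy of the chromosome and "1|0" means it sits on the first, and it assumes both variants fall in the same phase block. The function name `classify_pair` is hypothetical and is not part of the study's software.

```python
def classify_pair(gt_a: str, gt_b: str) -> str:
    """Say whether two heterozygous variants share a chromosome copy.

    Genotypes use VCF-style notation: "|" marks a phased call and
    "/" an unphased one. Both calls are assumed to belong to the
    same phase block, otherwise they cannot be compared.
    """
    if "|" not in gt_a or "|" not in gt_b:
        return "unphased"

    # Index (0 or 1) of the chromosome copy carrying each variant allele.
    copies_a = {i for i, allele in enumerate(gt_a.split("|")) if allele == "1"}
    copies_b = {i for i, allele in enumerate(gt_b.split("|")) if allele == "1"}

    # Only the simple case of two heterozygous variants is handled here.
    if len(copies_a) != 1 or len(copies_b) != 1:
        return "not a heterozygous pair"

    # Same copy: "cis". Opposite copies: "trans".
    return "cis" if copies_a == copies_b else "trans"


print(classify_pair("0|1", "0|1"))  # both variants on the same copy
print(classify_pair("0|1", "1|0"))  # one variant on each copy
print(classify_pair("0/1", "0|1"))  # first call carries no phase information
```

The distinction matters because two damaging variants on opposite copies of a gene can leave a person with no working copy, while two variants on the same copy leave the other copy intact.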
In contrast, long-read sequencing can read lengthy stretches of DNA at once, eliminating gaps that may lead scientists and clinicians to miss important information about gene variation. Long-read sequencing also provides direct phasing data as well as information about methylation, a chemical process in DNA that causes genes to be "turned on or off," and can contribute to disease.
"Long-read sequencing is going to be a lot better in certain cases, and we are taking steps to prove that," Negi said.
Leading in methods
UC Santa Cruz Genomics Institute researchers have a rich history of innovation and expertise in long-read sequencing and are actively developing methods to optimize sequencing and analysis for a wide range of health research applications. Many of the techniques researchers developed to achieve feats such as the first truly complete "telomere-to-telomere" reference genome are now being used to improve patient outcomes.
"Reinforcing earlier findings, we found that the benefits of using long-read sequencing were increased substantially by using a complete, so-called 'telomere-to-telomere' reference genome in place of the existing incomplete but widely used genomic reference," Miga said. "We anticipate that pangenomes — references that represent diverse human variation — will extract even more benefit from new long-read sequencing technologies."
Paten and Miga's labs partnered with clinicians to work on the cases of 42 patients with rare diseases — some of whom received a diagnosis via short-read methods or other specialized testing, and some of whom were still undiagnosed. In some cases, the researchers had access to parental genetic information, but in others, they did not.
Long-read sequencing of the patients was led by the Miga Lab using nanopore sequencing, a method for long-read sequencing pioneered at UCSC, to achieve highly accurate, end-to-end reads of the patients' genomes for about $1,000 per sample.
The genomic data was analyzed using computational methods developed in Paten's lab to find small and large variants, phasing data, and methylation data, all using one pipeline called the Napu pipeline. The analysis process takes around a day or less, depending on the computer processing speed, and costs $100.
Solving cases
After sequencing and analyzing the patient data, the researchers found that long reads provided a more exhaustive dataset than can be derived with short-read sequencing.
Long-read sequencing delivered a conclusive diagnosis for 11 of the 42 patients in the cohort, providing everything that was known from the short-read data plus new information, including additional rare candidate variants, long-range phasing, and methylation, all in a single, cost-efficient, and rapid protocol.
The 11 diagnosed cases included four of congenital adrenal hyperplasia (a rare condition where the adrenal glands are enlarged and fail to function properly). The gene responsible for this disease is in a particularly challenging region of the genome: it can't be characterized with short-read sequencing technology, and the current clinical test is cumbersome and incomplete.
"To solve these cases, we developed a new pangenomic tool that integrates new high-quality assemblies like the 'telomere-to-telomere' reference genome," said Monlong, who began this project as a postdoctoral scholar in Paten's lab and continued in his current position at INSERM in France. "We were excited to see that we could find and phase the pathogenic variants of all four patients suffering from this disease in our cohort. In the future, it might offer a rapid and comprehensive clinical test. We know many rare diseases involve regions of the human genome that have been historically difficult to study, so our results encourage us to extend our approach to more of those diseases that have been at a standstill for a long time."
Two further cases involved disorders of sex development, and one was a rare case of Leydig cell hypoplasia, in which underdeveloped Leydig cells in the testes impair male sexual development. The remaining four cases were neurodevelopmental disorders, each representing a long and challenging diagnostic odyssey that was finally resolved.
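As a quick check on the figures reported above, the short Python sketch below tallies the diagnosed cases by category and computes the share of the cohort they represent. The dictionary labels are shorthand chosen for this illustration, and the percentage is derived here from the reported counts rather than quoted from the study.

```python
# Diagnosed cases by category, as reported in the article.
# The labels are shorthand for this illustration only.
diagnosed_cases = {
    "congenital adrenal cases": 4,
    "disorders of sex development": 2,
    "Leydig cell hypoplasia": 1,
    "neurodevelopmental disorders": 4,
}

cohort_size = 42  # patients in the study cohort

total_diagnosed = sum(diagnosed_cases.values())
diagnostic_yield = total_diagnosed / cohort_size

print(f"{total_diagnosed} of {cohort_size} patients diagnosed "
      f"({diagnostic_yield:.1%})")
# prints "11 of 42 patients diagnosed (26.2%)"
```

The categories add up to the 11 diagnoses reported, which corresponds to roughly a quarter of the cohort.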
"Long read sequencing is likely the next best test for unsolved cases with either compelling variants in a single gene or a clear phenotype," Negi said. "It can serve as a single diagnostic test, reducing the need for multiple clinical visits and transforming a years-long diagnostic journey into a matter of hours."
On average, each patient had 280 genes (including some Mendelian disease genes, which are linked to inherited disorders caused by single-gene mutations) with significant protein-coding regions uniquely covered by long reads and undetected by short reads.
"There's so much more of the genome that the long reads can unlock," Negi said. "But, it will take some time until we can fully interpret this new information revealed by long reads. This data has been absent from our clinical databases, which were built using short-read analysis and mapping to the standard reference. We showed that long reads are uncovering about 5.8% more of the telomere-to-telomere genome that short reads simply couldn't access."
Other UC Santa Cruz researchers involved in this research include Brandy McNulty, Ivo Violich, Joshua Gardner, Todd Hillaker, and Sara O'Rourke.
This research was funded in part by the Chan Zuckerberg Initiative.