Researchers at USF Health and Weill Cornell Medicine, as part of an expansive, multi-institutional project investigating voice as a biomarker of disease, have reached a significant milestone: the first version of their clinically validated voice dataset has been published to an online artificial intelligence platform, where it will be an invaluable resource for researchers across the globe.
The National Institutes of Health-funded project, Voice as a Biomarker of Health, seeks to build an ethically sourced AI-enabled database of 10,000 human voices from patients with different illnesses to help doctors diagnose and treat diseases, such as cancer and depression, based on the sound of a patient's voice.
The initial data release includes more than 12,500 separate recordings from 306 participants across the United States and Canada. The dataset will be published on multiple platforms, including Health Data Nexus, and is available to the community of health researchers studying voice. The release comes at the end of the second year of the four-year, $14 million project, with several additional releases scheduled for the next two years. Already among the largest collections of human voices, the repository is expected to become the world's flagship database for voice AI and health by the end of the study.
"There is so much information in these first recordings and we are excited to receive feedback on it because what we are developing what will be an unequalled resource for the scientific community," said Yaël Bensoussan, MD, director of the USF Health Voice Center and co-lead of the project. "It is really important for us to understand what people can do with this initial data and what kinds of clinical questions they can answer."
As one of four precision health data projects funded by the NIH Common Fund's Bridge2AI program, Voice as a Biomarker of Health aims to introduce a transformative new method of diagnosing and treating diseases by training AI models to identify illnesses through changes in the human voice, with vast implications for the clinical setting.
The University of South Florida is the lead institution for the project in collaboration with Weill Cornell Medicine and 10 other institutions across the United States and Canada. Dr. Bensoussan of the USF Health Morsani College of Medicine, and Olivier Elemento, PhD, director of the Englander Institute for Precision Medicine at Weill Cornell Medicine, are the project's co-principal investigators.
While previous research utilizing voice and AI to detect disease is encouraging, it has been limited by the small size of available datasets, as well as concerns over data security, ownership and bias. Voice as a Biomarker of Health is addressing those shortcomings by bringing together experts in medical voice, AI engineering and ethics to generate a landmark voice database using privacy-preserving AI.
"Artificial intelligence is revolutionizing our ability to detect and understand disease, and this groundbreaking voice dataset is a monumental step forward in that journey," said Dr. Elemento, who is also a professor of physiology and biophysics at Weill Cornell Medicine. "These clinically validated data, combined with cutting-edge AI techniques, pave the way for new diagnostic possibilities and groundbreaking innovations that will transform patient care globally."
The newly published dataset is particularly notable for the breadth and quality of its recordings, which were collected in outpatient clinical settings across numerous institutions. The data are clinically validated and standardized across locations, and all participants perform the same tests and acoustic tasks. The three categories of acoustic tasks - respiratory; voice; and speech and linguistic - include more than 20 tasks, such as breathing at rest, coughing, enunciating "E" at long intervals, reading specific passages, free speech and other voice-related activities.
These high-quality, standardized data will be essential for validating existing voice algorithms and fueling new discoveries, said Dr. Bensoussan.
"Researchers will be able to use it as a benchmark dataset to confirm that their algorithms are valid," she said. "For example, some startups have already developed algorithms to diagnose voice biomarkers with their proprietary data, and our dataset can be used to see if it also works with people with different types of diseases."
Accompanying the data release is a Bridge2AI Voice Prep Kit, which offers researchers a host of tools for preprocessing and working with the data. The Bridge2AI consortium is also hosting the 2025 Voice AI Symposium and Hackathon, April 22-24 in Tampa, FL, which will connect clinician-scientists, researchers, patients and top minds in AI to advance the application of voice AI in health care and demonstrate the database's utility in pioneering new discoveries.
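As a purely illustrative sketch (not drawn from the official Bridge2AI Voice Prep Kit, whose tooling is not described here), a researcher might preprocess a single recording and extract common acoustic features with the open-source librosa library along the following lines; the file path and sampling rate are assumptions for illustration only.

# Illustrative sketch only: load one hypothetical recording and compute
# basic acoustic features with librosa; this is not the official Prep Kit.
import numpy as np
import librosa

# Hypothetical path to a downloaded recording (assumption, not a real file name)
path = "participant_0001_sustained_e.wav"

# Load audio at an assumed 16 kHz sampling rate; adjust to the release's native rate
y, sr = librosa.load(path, sr=16000)

# Mel-frequency cepstral coefficients, a standard front end for voice models
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Fundamental frequency (pitch) estimate via the pYIN algorithm
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

print("MFCC shape:", mfcc.shape)
print("Mean F0 (Hz):", np.nanmean(f0))

Features like these are the kind of standardized inputs that benchmark comparisons across algorithms typically rely on.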
"An undertaking of this size and scope is a team effort, with so many investigators and institutions in the United States and Canada coming together to fuel discoveries in health care," Dr. Bensoussan said. "From this work, and the research it will enable in the rest of the world, I think you are going to see a lot of progress and some very impactful products developed."
Voice as a Biomarker of Health lead investigators
· Yaël Bensoussan, MD, USF Health Morsani College of Medicine (co-principal investigator)
· Olivier Elemento, PhD, Weill Cornell Medicine (co-principal investigator)
· Alexandros Sigaras, Weill Cornell Medicine
· Anaïs Rameau, MD, Weill Cornell Medicine
· Maria Powell, PhD, Vanderbilt University
· Ruth Bahr, PhD, University of South Florida College of Behavioral and Community Sciences
· Philip Payne, PhD, Washington University in St. Louis
· David Dorr, MD, Oregon Health & Science University
· Jean-Christophe Belisle-Pipon, PhD, Simon Fraser University
· Vardit Ravitsky, PhD, University of Montreal
· Satrajit Ghosh, PhD, Massachusetts Institute of Technology
· Jennifer Siu, MD, SickKids Toronto
· Frank Rudzicz, PhD, University of Toronto
· Jordan Lerner-Ellis, PhD, University of Toronto