MADISON — University of Wisconsin–Madison researchers are warning that artificial intelligence tools gaining popularity in the fields of genetics and medicine can lead to flawed conclusions about the connection between genes and physical characteristics, including risk factors for diseases like diabetes.
The faulty predictions are linked to researchers' use of AI to assist genome-wide association studies. Such studies scan through hundreds of thousands of genetic variations across many people to hunt for links between genes and physical traits. Of particular interest are possible connections between genetic variations and certain diseases.
Genetics' link to disease not always straightforward
Genetics plays a role in the development of many health conditions. While changes in some individual genes are directly connected to an increased risk for diseases like cystic fibrosis, the relationship between genetics and physical traits is often more complicated.
Genome-wide association studies have helped to untangle some of these complexities, often using large databases of individuals' genetic profiles and health characteristics, such as the National Institutes of Health's All of Us project and the UK Biobank. However, these databases are often missing data about health conditions that researchers are trying to study.
"Some characteristics are either very expensive or labor-intensive to measure, so you simply don't have enough samples to make meaningful statistical conclusions about their association with genetics," says Qiongshi Lu, an associate professor in the UW–Madison Department of Biostatistics and Medical Informatics and an expert on genome-wide association studies.
The risks of bridging data gaps with AI
Researchers are increasingly attempting to work around this problem by bridging data gaps with ever more sophisticated AI tools.
"It has become very popular in recent years to leverage advances in machine learning, so we now have these advanced machine-learning AI models that researchers use to predict complex traits and disease risks with even limited data," Lu says.
Now, Lu and his colleagues have demonstrated the peril of relying on these models without also guarding against biases they may introduce. The team describes the problem in a paper recently published in the journal Nature Genetics. In it, Lu and his colleagues show that a common type of machine learning algorithm employed in genome-wide association studies can mistakenly link several genetic variations with an individual's risk for developing Type 2 diabetes.
"The problem is if you trust the machine learning-predicted diabetes risk as the actual risk, you would think all those genetic variations are correlated with actual diabetes even though they aren't," says Lu.
These "false positives" are not limited to these specific variations and diabetes risk, Lu adds, but are a pervasive bias in AI-assisted studies.
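The mechanism Lu describes can be illustrated with a small simulation. This is a minimal sketch under invented assumptions, not the analysis from the paper: every variable name, effect size and sample size below is hypothetical. A "proxy" measurement tracks the true trait but is also shifted by a genetic variant that has no effect on the trait itself. A simple model is trained on a small labeled subset to predict the trait from the proxy, and the variant is then tested against both the true trait and the predicted trait.

```python
import math
import numpy as np


def assoc_pvalue(genotype, trait):
    """Two-sided p-value for a linear association (normal approximation)."""
    g = genotype - genotype.mean()
    y = trait - trait.mean()
    n = len(g)
    r = (g @ y) / math.sqrt((g @ g) * (y @ y))
    z = r * math.sqrt((n - 2) / (1.0 - r * r))
    return math.erfc(abs(z) / math.sqrt(2.0))


def simulate(n=20000, n_labeled=2000, seed=0):
    """Hypothetical setup: all effect sizes here are made up for illustration."""
    rng = np.random.default_rng(seed)
    g_causal = rng.binomial(2, 0.3, n).astype(float)    # truly affects the trait
    g_spurious = rng.binomial(2, 0.3, n).astype(float)  # affects only the proxy

    true_trait = 0.3 * g_causal + rng.normal(size=n)
    # The proxy tracks the trait but also carries its own genetic signal.
    proxy = 0.8 * true_trait + 0.5 * g_spurious + rng.normal(size=n)

    # "Machine learning" step: fit a line on the small labeled subset,
    # then predict the trait for everyone.
    w, b = np.polyfit(proxy[:n_labeled], true_trait[:n_labeled], 1)
    predicted = w * proxy + b

    return {
        "true_vs_causal": assoc_pvalue(g_causal, true_trait),
        "true_vs_spurious": assoc_pvalue(g_spurious, true_trait),
        "predicted_vs_spurious": assoc_pvalue(g_spurious, predicted),
    }


if __name__ == "__main__":
    for name, p in simulate().items():
        print(f"{name}: p = {p:.3g}")
```

In this toy setup the spurious variant should appear strongly associated with the predicted trait even though, by construction, it has no effect on the true trait. A study that treated the prediction as the real measurement would report it as a hit.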
New statistical method can reduce false positives
In addition to identifying the problem with overreliance on AI tools, Lu and his colleagues propose a statistical method that researchers can use to guarantee the reliability of their AI-assisted genome-wide association studies. The method helps remove bias that machine learning algorithms can introduce when they're making inferences based on incomplete information.
"This new strategy is statistically optimal," Lu says, noting that the team used it to better pinpoint genetic associations with individuals' bone mineral density.
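The article does not spell out how the team's method works, so the following is not their procedure. As a rough illustration of how this kind of bias can be corrected in principle, the sketch below uses a generic idea: estimate a variant's effect on the predicted trait in everyone, then subtract the part of that effect that is really an association with the prediction error, measured in the smaller group where the true trait was recorded. All names and numbers are hypothetical.

```python
import numpy as np


def slope(x, y):
    """Least-squares slope of y on x."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))


def debias_demo(n=20000, n_labeled=2000, seed=1):
    """Toy example: the variant has NO effect on the true trait."""
    rng = np.random.default_rng(seed)
    variant = rng.binomial(2, 0.3, n).astype(float)
    true_trait = rng.normal(size=n)
    # The proxy used for prediction is shifted by the variant.
    proxy = 0.8 * true_trait + 0.5 * variant + rng.normal(size=n)

    # Only the first n_labeled people have the true trait measured.
    labeled = np.arange(n) < n_labeled
    w, b = np.polyfit(proxy[labeled], true_trait[labeled], 1)
    predicted = w * proxy + b

    # Naive estimate: treat the prediction as if it were the real trait.
    naive = slope(variant, predicted)

    # Correction: how strongly the variant tracks the prediction ERROR,
    # estimated where the truth is known. For simplicity the same labeled
    # people are reused for fitting and for this step.
    error_effect = slope(variant[labeled],
                         predicted[labeled] - true_trait[labeled])
    corrected = naive - error_effect
    return naive, corrected


if __name__ == "__main__":
    naive, corrected = debias_demo()
    print(f"naive effect estimate:     {naive:.3f}")
    print(f"corrected effect estimate: {corrected:.3f}")
```

Because the true effect here is zero by construction, the corrected estimate should land much closer to zero than the naive one. The trade-off is that the correction leans on the small labeled group, so the corrected estimate is noisier than the naive one.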
AI not the only problem with some genome-wide association studies
While the group's proposed statistical method could help improve the accuracy of AI-assisted studies, Lu and his colleagues also recently identified problems with similar studies that fill data gaps with proxy information rather than algorithms.
In another paper recently published in Nature Genetics, the researchers sound the alarm about studies that over-rely on proxy information in an attempt to establish connections between genetics and certain diseases.
For instance, large health databases like the UK Biobank hold extensive genetic information about large populations, but they have relatively little data on the incidence of diseases that tend to crop up later in life, like most neurodegenerative diseases.
For Alzheimer's disease specifically, some researchers have attempted to bridge that gap with proxy data gathered through family health history surveys, where individuals can report a parent's Alzheimer's diagnosis.
The UW–Madison team found that such proxy-information studies can produce "highly misleading genetic correlation" between Alzheimer's risk and higher cognitive abilities.
"These days, genomic scientists routinely work with biobank datasets that have hundreds of thousands of individuals. However, as statistical power goes up, biases and the probability of errors are also amplified in these massive datasets," says Lu. "Our group's recent studies provide humbling examples and highlight the importance of statistical rigor in biobank-scale research studies."