Advanced AI Tackles Visual Puzzles, Abstract Reasoning

University of Southern California

Artificial intelligence has learned to master language, generate art, and even beat grandmasters at chess. But can it crack the code of abstract reasoning, those tricky visual puzzles that leave humans scratching their heads? Researchers at the USC Viterbi School of Engineering's Information Sciences Institute (ISI) are putting AI's cognitive abilities to the test, pushing multi-modal large language models (MLLMs) to solve visual problems once reserved for human IQ tests. The result? A glimpse into how far AI has come, and where it still stumbles.

USC Viterbi ISI Research Assistants Kian Ahrabian and Zhivar Sourati recently investigated whether MLLMs can perform nonverbal abstract reasoning, a class of tasks that requires both visual perception and logical reasoning. They presented their findings at the Conference on Language Modeling (COLM 2024) in Philadelphia, PA, on October 7-9, 2024.

Jay Pujara, research associate professor of computer science at the USC Viterbi School of Engineering and an author on the paper, said, "Every day we're bombarded with new headlines about what AI can (and can't) do, which are often very surprising. We still have such a limited understanding of what new AI models can do, and until we understand these limitations we can't make AI better, safer, and more useful. This paper helps fill in a missing piece of the story of where AI struggles."

The Challenge: Can AI See and Think?

"We wanted to see if this new generation of large models, which are able to process images, can reason on their own," Ahrabian explained. "For example, if you see a yellow circle turning into a blue triangle, can the model apply the same pattern in a different scenario?"

To answer this question, the team tested 24 different MLLMs on puzzles based on Raven's Progressive Matrices, a well-known test of abstract reasoning. They found that open-source models struggled significantly. "They were really bad. They couldn't get anything out of it," Ahrabian said plainly.

In contrast, closed-source models such as GPT-4V, which are developed by private companies and not publicly available for modification, performed better. These models are typically trained with more advanced resources, including larger datasets and more powerful computing systems, giving them a noticeable edge. "We saw some nontrivial results with closed-source models," Ahrabian added. "Specifically, GPT-4V was relatively good at reasoning, but it's far from perfect."

Where the AI Stumbles

A critical part of the study involved dissecting where these models were failing. One key issue was the AI's ability to accurately process visual information. "We wanted to know if the models could see the details—like colors or lines colliding—and whether that was where they were going wrong," Ahrabian said.

To isolate the problem, the researchers provided detailed textual descriptions of the images, ensuring the models had all the necessary information in a different format. "Even when we removed the visual element and just gave them text, many models still couldn't reason effectively," Sourati explained. This revealed a crucial insight: the issue wasn't just with visual processing; it was with the reasoning itself. The team now had a clearer picture of what wasn't working, which allowed them to refine their focus and guide future improvements.

The Path Forward: Improving AI's Reasoning

One promising method the researchers explored was "Chain of Thought prompting," where the AI is prompted to think step by step through reasoning tasks. This approach led to significant improvements in some cases. "By guiding the models with hints, we were able to see up to 100% improvement in performance," Ahrabian noted.
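For readers curious what this looks like in practice, here is a minimal Python sketch of the difference between a direct prompt and a step-by-step prompt for the same text-only puzzle. It is an illustration only, not the researchers' code: the function name `build_prompts`, the puzzle wording, and the instruction phrasing are all hypothetical, and no model is actually called.

```python
def build_prompts(puzzle_description: str, choices: list[str]) -> dict[str, str]:
    """Build a direct prompt and a chain-of-thought prompt for one puzzle.

    Hypothetical helper for illustration; the prompt wording is an
    assumption, not the phrasing used in the study.
    """
    # Number the answer options so the model can reply with a single index.
    options = "\n".join(f"{i + 1}. {choice}" for i, choice in enumerate(choices))
    base = f"{puzzle_description}\n\nOptions:\n{options}\n\n"
    return {
        # Direct prompting: ask for the answer with no intermediate reasoning.
        "direct": base + "Answer with the number of the correct option.",
        # Chain-of-thought prompting: ask the model to reason before answering.
        "chain_of_thought": base
        + "Let's think step by step. Describe the pattern in the first row, "
        + "apply it to the second row, and then give the number of the correct option.",
    }


# Made-up puzzle in the spirit of the yellow-circle example quoted earlier.
prompts = build_prompts(
    "Row 1: a yellow circle turns into a blue triangle. "
    "Row 2: a yellow square turns into which shape?",
    ["a blue pentagon", "a yellow circle", "a red square"],
)
print(prompts["chain_of_thought"])
```

Both prompts carry the same puzzle and options; only the final instruction changes, which is what makes the comparison between the two styles fair.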

Despite the remaining challenges, the researchers are optimistic. The study's findings highlight both the current limitations of AI and the exciting possibilities for future advancements. As these models continue to develop, USC's research could pave the way for AI that not only understands but reasons—blurring the line between machine intelligence and human cognition.

New Research at a New Conference

Ahrabian and Sourati, Ph.D. students in the Thomas Lord Department of Computer Science, presented the paper, "The Curious Case of Nonverbal Abstract Reasoning with Multi-Modal Large Language Models," at COLM this week, marking the conference's inaugural year.

Pujara, who is also the director of the Center on Knowledge Graphs at ISI, commented, "AI is undergoing a major shift with the advent of language models. The emergence of new conferences like COLM to support this evolution is a great way to foster collaboration and inspire students eager to contribute to this rapidly advancing field."
