For the past year, ChatGPT has been able to analyse images as well as text, a feature introduced with its latest version - GPT-4V(ision).
For instance, if you upload a photograph of the contents of your fridge, ChatGPT can describe what's in the photo and then recommend potential meal ideas based on those ingredients, along with suitable recipes. Or you can photograph a hand-drawn sketch of how you'd like your new website to look and ChatGPT will take that image and provide you with the HTML code to make the site.
You can also upload a still frame from part way through a film. ChatGPT can identify the film and summarise the plot up to that point only. The list of applications is virtually endless.
As a researcher interested in face perception, I'm particularly curious about how ChatGPT handles face images - matching two different images of the same person, for example. But how can we judge just how good the chatbot is at recognising faces? To explore how well people perform with faces, psychologists have come up with numerous tests that assess different abilities, so I decided to try ChatGPT on some of these.
First, I tried it on the "reading the mind in the eyes" test. In this task, only the eye regions of photographs are presented, along with four descriptive words as options regarding what the person in the picture is thinking or feeling (with one of these being the correct answer).
The test, which you can try for yourself, is considered a measure of "theory of mind". This refers to someone's ability to infer another person's mental state and to interpret their behaviour in light of it. People typically score around 26-31 out of a possible 36. ChatGPT answered 29 questions correctly, slightly more than it managed in a recent study in which other researchers gave it the same test.
Going beyond facial expressions, I next tested ChatGPT on a task called the "Glasgow face matching test", in which participants are presented with 40 pairs of face images. Half of the pairs comprise two photos showing the same person, taken using different cameras. For the other half, the two photos show two different but similar-looking people.
When asked to decide if the images show the same person or not, the average score for participants is 81.3%. When I subjected ChatGPT to the test, it scored 92.5%.
Finally, I wanted to consider face recognition. To avoid uses that infringe on people's privacy, ChatGPT has been designed to refuse when asked to identify people in images. However, when pressed for its best "guess", it was willing to provide answers when I presented it with what's known as the "famous faces doppelgangers test".
A pair of faces is shown in each of 40 trials, along with a celebrity's name, and participants are asked to identify which face (left or right) is that particular celebrity. They're also asked whether they know the celebrity or not.
The task is made difficult because the other face is very similar in appearance to the celebrity - in other words, a doppelganger. People generally score around 81.5% for those trials where the celebrity is known to the person. (If they don't know who the celebrity is, their choice would simply be a guess.)
Impressively, ChatGPT scored 100% correct across all of the trials for this test.
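The three tests report their results in different ways (a count out of 36 for the first, percentages of 40 trials for the other two), so it can help to put them on a common footing. The short Python sketch below does that arithmetic. The function names are my own, chosen for illustration, and the raw counts for the two 40-trial tests are inferred from the percentages quoted above rather than taken from a scoring sheet.

```python
# A minimal sketch: convert the scores quoted in the article between
# raw counts and percentages. Function names are illustrative only.

def percent(correct, total):
    """Return a raw score as a percentage of the total."""
    return 100 * correct / total


def count_correct(pct, total):
    """Return the number of correct trials implied by a percentage score."""
    return round(pct * total / 100)


# "Reading the mind in the eyes": 36 items, scores quoted as raw counts.
eyes_chatgpt = percent(29, 36)
eyes_human_low = percent(26, 36)
eyes_human_high = percent(31, 36)

# Glasgow face matching test and doppelgangers test: 40 trials each,
# scores quoted as percentages (counts below are inferred, not reported).
matching_correct = count_correct(92.5, 40)
doppelganger_correct = count_correct(100, 40)

print(f"Eyes test: ChatGPT {eyes_chatgpt:.1f}%, "
      f"typical human range {eyes_human_low:.1f}% to {eyes_human_high:.1f}%")
print(f"Face matching: {matching_correct} of 40 pairs correct")
print(f"Doppelgangers: {doppelganger_correct} of 40 trials correct")
```

On these figures, ChatGPT's 29 out of 36 works out at roughly 80.6%, which sits inside the typical human range on the eyes test, while its 92.5% on face matching corresponds to 37 of the 40 pairs.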
Putting it all together
On the basis of my experience, ChatGPT seems well-equipped to perform tasks related to the recognition and identification of human faces - including their expressions. On these three tests, at least, it performed as well as or even better than people do.
Of course, these were my initial explorations rather than a peer-reviewed study, so more work is needed to firmly establish its abilities. But they do suggest that ChatGPT can handle face images.
ChatGPT is based on a type of artificial intelligence (AI) program called a large language model (LLM), which means that it has been trained on an extensive amount of text (and now image) data. This allows it to learn the structure and patterns that exist within the data, and subsequently generate sensible responses to almost any question or request by the user.
ChatGPT says that face images were also a significant part of its training data, although it doesn't store and recall specific images. Instead, it appears to rely on the general patterns and associations it has learned during its training. Other sources seem to confirm this.
Presumably, through exposure to numerous face images alongside text that included the word "suspicious", for example, it was able to develop a representation of that facial expression which was distinct from other expressions like "sarcastic".
Similarly, refining its representation of a celebrity's face through multiple exposures meant that it could subsequently differentiate them from other, similar-looking faces. However, again, this is admittedly informed speculation on my part.
Based on my results and other demonstrations of this latest version of the chatbot, it seems likely that ChatGPT's already remarkable performance across a wide variety of tasks will continue to improve with each new version released.