PULLMAN, Wash. – While large language models like ChatGPT can perform well on multiple-choice questions from financial licensing exams, they falter on more nuanced tasks.
A Washington State University-led study analyzed more than 10,000 responses to financial exam questions by the artificial intelligence language models Bard, Llama and ChatGPT.
The researchers asked the models not only to choose answers but also to explain the reasoning behind them, then compared those written explanations to ones given by human professionals. While two versions of ChatGPT performed best at these tasks, they still showed a high level of inaccuracy on more advanced topics.
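The article does not spell out how that similarity was measured, but as a rough, hypothetical sketch of the idea, a model's written explanation could be scored against an expert's with a simple text-similarity metric. The example answers and the TF-IDF approach below are illustrative assumptions, not the study's actual method.

```python
# Hypothetical illustration: score how closely a model's written explanation
# matches a human expert's, using TF-IDF vectors and cosine similarity.
# (The study's actual similarity measure is not described in this article.)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

model_answer = (
    "A Roth IRA is funded with after-tax dollars, so qualified "
    "withdrawals in retirement are tax-free."
)
expert_answer = (
    "Because Roth IRA contributions are made after tax, qualified "
    "distributions are not taxed when withdrawn in retirement."
)

# Build a shared vocabulary over both texts, then compare the two vectors.
vectors = TfidfVectorizer().fit_transform([model_answer, expert_answer])
similarity = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"Similarity to expert explanation: {similarity:.2f}")
```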
"It's far too early to be worried about ChatGPT taking finance jobs completely," said study author DJ Fairhurst of WSU's Carson College of Business. "For broad concepts where there have been good explanations on the internet for a long time, ChatGPT can do a very good job at synthesizing those concepts. If it's a specific, idiosyncratic issue, it's really going to struggle."
For this study, published in the Financial Analysts Journal, Fairhurst and co-author Daniel Greene of Clemson University used questions from licensing exams including the Securities Industry Essentials exam as well as the Series 6, 7, 65 and 66.
To move beyond the AI models' ability to simply pick the right answer, the researchers asked the models to provide written explanations. They also chose questions based on specific job tasks financial professionals might actually perform.
"Passing certification exams is not enough. We really need to dig deeper to get to what these models can really do," said Fairhurst.
Of all the models, the paid version of ChatGPT, version 4.0, performed the best, providing answers that were the most similar to those of human experts. Its accuracy was also 18 to 28 percentage points higher than that of the other models. However, this changed when the researchers fine-tuned the earlier, free version, ChatGPT 3.5, by feeding it examples of correct responses and explanations. After this tuning, it came close to ChatGPT 4.0 in accuracy and even surpassed it in providing answers similar to those of human professionals.
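The article does not describe the researchers' exact fine-tuning setup. As a rough sketch, assuming OpenAI's standard fine-tuning API for gpt-3.5-turbo were used, feeding the model correct answers and explanations might look something like the following; the example question, file name, and data are illustrative, not taken from the study.

```python
# Hypothetical sketch: fine-tuning gpt-3.5-turbo on exam questions paired
# with correct answers and expert-style explanations, via OpenAI's
# fine-tuning API. All example content and file names are made up.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

examples = [
    {
        "messages": [
            {"role": "user",
             "content": "Which account type allows tax-free qualified "
                        "withdrawals? Explain your reasoning."},
            {"role": "assistant",
             "content": "A Roth IRA. Contributions are made with after-tax "
                        "dollars, so qualified withdrawals in retirement "
                        "are tax-free."},
        ]
    },
    # ... more question/answer/explanation examples ...
]

# Write the examples in the JSONL chat format the fine-tuning API expects.
with open("exam_training.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Upload the training file and start a fine-tuning job on the base model.
training_file = client.files.create(
    file=open("exam_training.jsonl", "rb"), purpose="fine-tune"
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id, model="gpt-3.5-turbo"
)
print(job.id)
```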
Both models still fell short, though, when it came to certain types of questions. While they did well reviewing securities transactions and monitoring financial market trends, they gave more inaccurate answers in specialized situations such as determining clients' insurance coverage and tax status.
Fairhurst and Greene, along with WSU doctoral student Adam Bozman, are now working on other ways to determine what ChatGPT can and cannot do, with a project that asks it to evaluate potential merger deals. For this, they are taking advantage of the fact that ChatGPT was trained on data only through September 2021, using deals made after that date whose outcomes are known. Preliminary findings show that, so far, the AI model isn't very good at this task.
Overall, the researchers said that ChatGPT is still probably better used as a tool to assist established financial professionals than as a replacement for them. On the other hand, AI may change the way some investment banks employ entry-level analysts.
"The practice of bringing a bunch of people on as junior analysts, letting them compete and keeping the winners – that becomes a lot more costly," said Fairhurst. "So it may mean a downturn in those types of jobs, but it's not because ChatGPT is better than the analysts, it's because we've been asking junior analysts to do tasks that are more menial."