DeepSeek Tested: Performance vs. Other AI Tools

China's new DeepSeek Large Language Model (LLM) has disrupted the US-dominated market, offering a relatively high-performance chatbot at a significantly lower cost.

Author

  • Simon Thorne

    Senior Lecturer in Computing and ​Information Systems, Cardiff Metropolitan University

The reduced cost of development and lower subscription prices compared with US AI tools contributed to American chip maker Nvidia losing US$600 billion (£480 billion) in market value over one day. Nvidia makes the computer chips used to train the majority of LLMs, the underlying technology used in ChatGPT and other AI chatbots. DeepSeek uses cheaper Nvidia H800 chips rather than the more expensive state-of-the-art versions.

ChatGPT developer OpenAI reportedly spent between US$100 million and US$1 billion developing a recent version of its product, called o1. In contrast, DeepSeek completed its training in just two months at a cost of US$5.6 million, using a series of clever innovations.

But just how well does DeepSeek's AI chatbot, R1, compare with other, similar AI tools on performance?

DeepSeek claims its models perform comparably to OpenAI's offerings, even exceeding the o1 model in certain benchmark tests. However, these claims rest largely on benchmarks such as Massive Multitask Language Understanding (MMLU), which evaluate knowledge across multiple subjects using multiple-choice questions. Many LLMs are trained and optimised for such tests, making them unreliable as true indicators of real-world performance.
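Multiple-choice benchmarks of this kind reduce "knowledge" to picking the right letter, which is why a model tuned to the test format can score well without genuine understanding. A minimal sketch of how such scoring works (the questions and the fixed-guess "model" here are hypothetical illustrations, not actual MMLU data):

```python
# Minimal sketch of MMLU-style multiple-choice scoring.
# Questions and the stub "model" are hypothetical illustrations.

def score(questions, model_answer):
    """Fraction of questions where the model picks the correct letter."""
    correct = sum(1 for q in questions if model_answer(q) == q["answer"])
    return correct / len(questions)

questions = [
    {"prompt": "2 + 2 = ?",
     "choices": {"A": "3", "B": "4", "C": "5"}, "answer": "B"},
    {"prompt": "Capital of France?",
     "choices": {"A": "Paris", "B": "Rome"}, "answer": "A"},
]

# A stub standing in for an LLM that always guesses "A":
# exploiting quirks of the test format can inflate accuracy
# without any real-world capability behind it.
always_a = lambda q: "A"

print(score(questions, always_a))  # 0.5
```

Because the benchmark only checks the selected letter, any strategy that correlates with the answer key, including memorisation of the test set, lifts the score.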

An alternative methodology for the objective evaluation of LLMs uses a set of tests developed by researchers at Cardiff Metropolitan, Bristol and Cardiff universities - known collectively as the Knowledge Observation Group (KOG). These tests probe LLMs' ability to mimic human language and knowledge through questions that require implicit human understanding to answer. The core tests are kept secret, to avoid LLM companies training their models for these tests.

KOG deployed public tests inspired by work by Colin Fraser, a data scientist at Meta, to evaluate DeepSeek against other LLMs. The following results were observed:

The tests used to produce this table are "adversarial" in nature. In other words, they are designed to be "hard" and to test LLMs in ways that are not sympathetic to how they are designed. This means the performance of these models in this test is likely to differ from their performance in mainstream benchmarking tests.

DeepSeek scored 5.5 out of 6, outperforming OpenAI's o1 - its advanced reasoning (known as "chain-of-thought") model - as well as ChatGPT-4o, the free version of ChatGPT. But DeepSeek was marginally outperformed by Anthropic's Claude and OpenAI's o1 mini, both of which scored a perfect 6/6. It is interesting that o1 underperformed against its "smaller" counterpart, o1 mini.

DeepThink R1 - DeepSeek's chain-of-thought mode - underperformed compared with the standard DeepSeek chatbot, scoring 3.5.

This result shows how competitive DeepSeek's R1 chatbot already is, beating OpenAI's flagship models. It is likely to spur further development for DeepSeek, which now has a strong foundation to build upon. However, the Chinese tech company does have one serious problem the other LLMs do not: censorship.

Censorship challenges

Despite its strong performance and popularity, DeepSeek has faced criticism over its responses to politically sensitive topics in China. For instance, prompts related to Tiananmen Square, Taiwan, Uyghur Muslims and democratic movements are met with the response: "Sorry, that is beyond my current scope."

But this issue is not necessarily unique to DeepSeek, and the potential for political influence and censorship in LLMs more generally is a growing concern. The announcement of Donald Trump's US$500 billion Stargate LLM project, involving OpenAI, Nvidia, Oracle, Microsoft, and Arm, also raises fears of political influence.

Additionally, Meta's recent decision to abandon fact-checking on Facebook and Instagram suggests an increasing trend toward populism over truthfulness.

DeepSeek's arrival has caused serious disruption to the LLM market. US companies such as OpenAI and Anthropic will be forced to innovate their products to maintain relevance and match its performance and cost.

DeepSeek's success is already challenging the status quo, demonstrating that high-performance LLM models can be developed without billion-dollar budgets. It also highlights the risks of LLM censorship, the spread of misinformation, and why independent evaluations matter.

As LLMs become more deeply embedded in global politics and business, transparency and accountability will be essential to ensure that the future of LLMs is safe, useful and trustworthy.

The Conversation

Simon Thorne does not work for, consult, own shares in or receive funding from any company or organisation that would benefit from this article, and has disclosed no relevant affiliations beyond their academic appointment.
