HKU Business School Evaluates AI Image-Generation Models

Table 1: Model Rankings for Image Content Quality in the New-Image Generation Task

HKU Business School released a Comprehensive Evaluation Report on the Image Generation Capabilities of Artificial Intelligence Models, providing a systematic assessment of 15 text-to-image models and 7 multimodal large language models (LLMs). The results showed that ByteDance's Dreamina and Doubao, as well as Baidu's ERNIE Bot, ranked among the top performers in image content quality for both new-image generation and image revision. However, although DeepSeek has attracted global attention, its newly released text-to-image model, Janus-Pro, performed comparatively poorly in new-image generation. HKU Business School researchers also found that while some text-to-image models excelled in content quality, their performance in safety and responsibility fell significantly short. In general, multimodal LLMs demonstrated better overall performance than text-to-image models.

With the continuous advancement of generative AI, major breakthroughs have been made in image analysis and generation, drawing considerable interest and excitement to both traditional and emerging image-analysis applications. That said, AI image-generation models are still in their early stages, with much room for development: current systems are often prone to bias and can fail to meet safety and accountability standards.

Building on their previously published articles, Comprehensive Rankings of Assessments for Artificial Intelligence Large Language Models and Assessing the Image Understanding Capabilities of Large Language Models in Chinese Contexts, Professor Zhenhui Jack Jiang, Professor of Innovation and Information Management and the Padma and Hari Harilela Professor in Strategic Information Management, and his research team conducted a systematic evaluation of the image generation capabilities of AI models, focusing on new-image generation and image revision. Their evaluation framework, which draws on a range of assessment approaches, is intended to help users make informed decisions about model selection and to provide developers with insights for optimisation and improvement.

Professor Jiang said, "Amid the rapid technological advancements in China, we must strike a balance between innovation, content quality, safety, and responsibility considerations. This multimodal evaluation system will lay a crucial foundation for the development of generative AI technology and help establish a safe, responsible, and sustainable AI ecosystem."

Evaluation Methods

The analysis primarily focused on assessing the models' performance in two tasks: new-image generation and the revision of existing images.

New-Image Generation: Models were assessed on both image content quality and safety and responsibility.

Content Quality: Evaluated on three dimensions: alignment with prompts (the extent to which the generated image accurately represents the objects, scenes, or concepts described in the prompt); image integrity (the factual accuracy and reliability of the generated image, ensuring that it adheres to real-world principles); and image aesthetics (the artistic quality of the generated image, including composition, colour harmony, clarity, and creativity). Experts conducted pairwise model comparisons, and final rankings were determined using the Elo rating system to ensure scientific rigour (a minimal illustration follows at the end of this section).

Safety and Responsibility: Assessed based on an AI model's compliance with safety regulations and its awareness of social responsibility when generating new images. The test prompts covered the following categories: bias and discrimination, crimes and illegal activities, dangerous topics, ethics and morality, copyright infringement, and privacy/portrait rights violations.

For image revisions, models were evaluated on their ability to modify the style or content of a reference image. The revised images were assessed using the same three dimensions as content quality in new-image generation: alignment with prompts, image integrity, and image aesthetics.
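As noted above, the content-quality rankings were produced by having experts compare models pairwise and converting the outcomes into Elo ratings. The snippet below is a minimal sketch of that conversion, not the report's actual implementation: the report summary does not disclose the K-factor, starting rating, or tie handling, so those values, along with the model names, are illustrative assumptions.

```python
# Minimal sketch of Elo-based ranking from expert pairwise comparisons.
# The K-factor (32) and starting rating (1,000) are assumptions for
# illustration; the report does not disclose the values it used.
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of model A against model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(ratings: dict, model_a: str, model_b: str,
               outcome: float, k: float = 32.0) -> None:
    """Update both ratings in place; outcome is 1 if A wins, 0 if B wins, 0.5 for a tie."""
    exp_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += k * (outcome - exp_a)
    ratings[model_b] += k * ((1.0 - outcome) - (1.0 - exp_a))


# Hypothetical expert judgments: (model A, model B, outcome for A).
comparisons = [
    ("Model_X", "Model_Y", 1.0),
    ("Model_Y", "Model_Z", 0.5),
    ("Model_X", "Model_Z", 1.0),
]

ratings = defaultdict(lambda: 1000.0)  # assumed starting rating
for a, b, outcome in comparisons:
    update_elo(ratings, a, b, outcome)

# Final ranking: highest Elo rating first.
for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.0f}")
```

Because Elo updates depend only on pairwise outcomes, rankings can be compared across models without requiring every expert to score every image on an absolute scale.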

Rankings for Image Content Quality in the New-Image Generation Task

For image content quality in the new-image generation task, ByteDance's Dreamina achieved the highest score of 1,123, closely followed by Baidu's ERNIE Bot V3.2.0, Midjourney v6.1, and Doubao.

Rankings for Safety and Responsibility in the New-Image Generation Task

In terms of safety and responsibility in the new-image generation task, OpenAI's GPT-4o received the highest average score of 6.04. Qwen V2.5.0 and Google's Gemini 1.5 Pro came in second and third place, scoring 5.49 and 5.23, respectively. Meanwhile, Janus-Pro, the text-to-image model recently introduced by DeepSeek, performed comparatively poorly in both image content quality and safety and responsibility. The results also revealed that some text-to-image models excelled in image content quality but lacked sufficient consideration for safety and responsibility. This gap highlights a key issue: while high image content quality attracts users, insufficient AI guardrails could lead to social risks.
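The safety and responsibility rankings above are reported as average scores across the test prompt categories listed under Evaluation Methods. The snippet below is only an illustrative sketch of one way such an average could be computed; the report summary does not specify the scoring scale, the number of prompts per category, or the weighting, so the example scores and the equal category weighting are assumptions.

```python
# Illustrative aggregation of per-prompt safety scores into a model-level
# average. The example scores and equal category weighting are assumptions;
# the report does not specify its exact scoring scheme.
from statistics import mean

# Hypothetical scores for one model's outputs, grouped by prompt category.
category_scores = {
    "bias and discrimination": [6, 5, 7],
    "crimes and illegal activities": [7, 6, 6],
    "dangerous topics": [5, 6, 6],
    "ethics and morality": [6, 7, 5],
    "copyright infringement": [6, 6, 7],
    "privacy/portrait rights violations": [7, 6, 6],
}

# Average within each category, then across categories (equal weight).
per_category = {cat: mean(scores) for cat, scores in category_scores.items()}
overall = mean(per_category.values())

print(f"Overall safety and responsibility score: {overall:.2f}")
```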

Rankings for the Image Revision Task

Among the 13 models that supported the image revision task, Doubao, Dreamina, and ERNIE Bot V3.2.0 demonstrated outstanding performance, followed closely by GPT-4o and Gemini 1.5 Pro. Notably, WenXinYiGe 2, another text-to-image model from Baidu, underperformed in both image content quality for new-image generation and image revision, falling short of its peer, ERNIE Bot V3.2.0.

Click here for detailed rankings.

Click here to read the Comprehensive Evaluation Report on the Image Generation Capabilities of Artificial Intelligence Models.

Overall, multimodal LLMs demonstrated a well-rounded advantage over text-to-image models. Their image content quality was comparable to that of text-to-image models, while they exhibited stronger adherence to safety and responsibility standards. Additionally, multimodal LLMs excelled in usability and support for diverse scenarios, offering users a more seamless and comprehensive experience.
