FIFAWC: Dataset With Rich Group Activity Recognition

Higher Education Press

Group Activity Recognition (GAR), which aims to identify activities performed collectively in videos, has gained significant attention recently. Existing GAR datasets typically annotate only a single Group Activity (GA) instance per sample, carefully selected from the original videos. This approach, while precise, diverges significantly from real-world footage, which often contains multiple GA instances. Moreover, single-word annotations cannot capture the complex semantic information in a GA, which constrains research on other GA-related tasks.

To mitigate these limitations, a research team led by Wang Yun-Hong (Beihang University, China) published new research on 15 December 2024 in Frontiers of Computer Science, co-published by Higher Education Press and Springer Nature.

The team proposed FIFAWC, a novel dataset for GAR characterized by three notable distinctions:

  1. Comprehensive annotation: The team annotates every GA in each sample and retains the original frame count. Previous datasets, by contrast, annotate a single GA per sample and normalize samples to a uniform frame count. This makes FIFAWC more complex and increases its practical potential for advanced research.
  2. Semantic description: Each clip in FIFAWC is accompanied by an elaborate caption from sports commentators, ensuring content accuracy and professionalism. This positions FIFAWC as a data foundation for a variety of tasks, such as video captioning and retrieval.
  3. New scenario: FIFAWC departs from previous GAR datasets by featuring soccer match footage. The expansive playing area and rapid movements characteristic of soccer introduce new challenges, such as dynamic camera movements and smaller targets in frames, significantly raising the difficulty of GAR.

In the research, the team benchmarks FIFAWC on two tasks: traditional GAR and the newer task of GA video captioning. For GAR, they evaluate the classical detector-based approach ARG and the state-of-the-art detector-free approach DFWSGAR. The results in Table 2 of the paper show high accuracy at the category level but low accuracy at the sample level, because each sample contains multiple GAs, which reflects the complexity and challenge of FIFAWC. For GA video captioning, Table 3 of the paper lists results for the traditional captioning method PDVC and the Large Language Model-based VTimeLLM. Compared with the strong performance of PDVC on the ActivityNet dataset (25.87 CIDEr), the poor performance on FIFAWC indicates that further research is necessary for GA video captioning.
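
The gap between category-level and sample-level accuracy can be illustrated with a small sketch. It assumes, since the release does not define the metrics precisely, that category-level accuracy counts each per-category decision separately, while sample-level accuracy counts a sample as correct only when every GA in it is predicted correctly. The function names and the toy labels are hypothetical and are not taken from the paper or the dataset.

```python
from typing import List

Labels = List[List[int]]  # one row per sample, one 0/1 entry per GA category


def category_level_accuracy(y_true: Labels, y_pred: Labels) -> float:
    """Share of individual (sample, category) decisions that are correct."""
    total = sum(len(row) for row in y_true)
    correct = sum(
        t == p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return correct / total


def sample_level_accuracy(y_true: Labels, y_pred: Labels) -> float:
    """Share of samples whose full set of GA labels is predicted exactly."""
    exact = sum(true_row == pred_row for true_row, pred_row in zip(y_true, y_pred))
    return exact / len(y_true)


# Toy data (made up): 3 samples, 4 GA categories, several GAs per sample.
y_true = [[1, 1, 0, 0], [0, 1, 1, 0], [1, 0, 0, 1]]
y_pred = [[1, 0, 0, 0], [0, 1, 1, 0], [1, 0, 1, 1]]

print(f"category-level: {category_level_accuracy(y_true, y_pred):.3f}")
print(f"sample-level:   {sample_level_accuracy(y_true, y_pred):.3f}")
```

In this toy case only two of the twelve per-category decisions are wrong, yet two of the three samples fail the exact-match test. This is the kind of effect that multiple GAs per sample would be expected to produce under these assumed definitions.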

DOI: 10.1007/s11704-024-40027-3
