Making AI More Accessible In Football

Technology is improving football - from helping referees make more accurate decisions to developing better on-field tactics. ETH Zurich and FIFA are exploring how AI can make these advancements more accessible to competitions worldwide.

A precise analysis of a game by artificial intelligence is only possible when the digital and real players overlap perfectly. (Image: AIT Lab / ETH Zurich)

In Brief

  • Researchers from ETH Zurich have digitised sequences of play from FIFA World Cup 2022, creating a dataset in which 3D poses are available for all players on the pitch simultaneously.
  • The dataset is now being used as a reference as part of an international research challenge, organised by FIFA.
  • The aim is to develop technologies that use a single broadcast camera, as opposed to the expensive multi-camera systems currently in use.
  • This would one day make performance analysis, officiating or fan engagement affordable even for countries and leagues with limited resources.

Artificial intelligence (AI) is already being used in football today, analysing individual moves and assisting referees with offside decisions. Semi-Automated Offside Technology (SAOT) helps Video Assistant Referees (VARs) make fairer decisions by digitally tracking the movements and positions of players in real time.

Until now, computer-assisted systems have only been within reach for large football competitions. After all, these systems are complex and expensive: each stadium requires 10 to 12 static cameras that record the action from various angles. "All of the cameras must be perfectly synchronised in order to produce an accurate digital likeness," says Tianjian Jiang, a doctoral student in computer science.

Jiang is conducting research at ETH Zurich's Advanced Interactive Technologies (AIT) Lab. Together with colleagues from the lab, he is helping FIFA - the Fédération Internationale de Football Association - to explore technological solutions that would increase access to AI in football. The underlying idea is to simplify the system to such an extent that, rather than multiple cameras, it requires only one. After all, every professional competition has a camera that is used to record and broadcast the games. This broadcasting camera stands on the touchline and is the source of almost three-quarters of all footage of a televised game.

Fully digitised sequences of play

It will still be a few years before the video analysis of a game works reliably with just a single camera, but the AIT Lab has now taken a decisive step in this direction. The researchers have completely digitised almost 50 minutes of video recordings from various games in the 2022 FIFA World Cup.

The ETH dataset, known as WorldPose, contains over 2.5 million individual player poses in three dimensions. It is therefore possible to track all of the players on the field, from both teams, at the same time and to analyse where they're standing and what they're doing with or without the ball.
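
To make concrete what "3D poses for all players at the same time" means, here is a minimal sketch of how such a dataset could be organised. The record layout, field names, and values below are purely illustrative assumptions, not WorldPose's actual file format.

```python
from dataclasses import dataclass

# Hypothetical record layout for a multi-player 3D pose dataset.
# One entry per player per video frame; field names are assumptions.
@dataclass
class PlayerPose:
    frame: int                                    # video frame index
    player_id: int                                # stable ID across the sequence
    joints_3d: list[tuple[float, float, float]]   # one (x, y, z) per body joint, in metres

# One frame of play: every player on the pitch has a pose entry.
frame_0 = [
    PlayerPose(frame=0, player_id=7,
               joints_3d=[(12.4, 30.1, 0.9), (12.5, 30.2, 1.4)]),
    PlayerPose(frame=0, player_id=10,
               joints_3d=[(40.0, 22.5, 0.9), (40.1, 22.6, 1.4)]),
]

# Tracking all players simultaneously then becomes a per-frame query,
# e.g. each player's ground position from their first joint:
positions = {p.player_id: p.joints_3d[0][:2] for p in frame_0}
```

With poses stored this way, questions like "where is every player standing in frame N" reduce to simple lookups over the per-frame records.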

In machine learning, this is known as pose estimation. Unlike a human, a computer cannot see and therefore relies on data in order to detect where people or objects are within a space and how they are moving.

Through constant training, the computer learns to process and interpret information from image and video data. Computer vision requires large volumes of data, which the computer repeatedly analyses until it identifies differences and ultimately detects patterns. Algorithms allow the machine to learn by itself instead of having to be programmed by humans.

3D with just a single camera

There are already algorithms that can generate three-dimensional objects and bodies directly from a two-dimensional image. In "monocular pose estimation" (MPE), a computer uses images from a single camera to detect where people or objects are in the space, how they are moving and where they are heading. The computer therefore analyses each player's pose and trajectory without the sort of depth information that would be provided by a 3D camera or multiple cameras.
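
The missing depth information can be shown with the standard pinhole camera model: a 3D point (X, Y, Z) in camera coordinates lands at pixel (f·X/Z + cx, f·Y/Z + cy), so any two points on the same viewing ray produce the identical pixel. The sketch below uses illustrative intrinsics, not real camera data.

```python
# Pinhole projection with assumed intrinsics (focal length and
# principal point, in pixels) -- illustrative values only.
f, cx, cy = 1000.0, 960.0, 540.0

def project(X, Y, Z):
    """Project a 3D point in camera coordinates to a pixel."""
    return (f * X / Z + cx, f * Y / Z + cy)

# Two different 3D points on the same viewing ray...
near = project(2.0, 1.0, 10.0)
far  = project(4.0, 2.0, 20.0)   # twice as far, scaled by the same factor

# ...land on the identical pixel: a single image cannot tell them apart.
assert near == far
```

This ambiguity is exactly what MPE algorithms must resolve using learned priors about body shape and motion, rather than measured depth.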

Existing MPE methods are now very good at predicting the poses of individual players. However, they have trouble tracking several people at the same time - particularly over large distances such as those covered by footballers over a 90-minute game. "We want to find an algorithm that is accurate enough even over large distances," says Jiang.

In the future, such broadcast cameras should be able to perform AI analyses directly. (Image: FIFA)

Harder than expected

FIFA approached ETH Zurich in 2021 in search of a dataset so that computers could be trained to estimate poses. They also wanted to know how good existing MPE methods really were. To this end, FIFA provided the researchers with various video sequences from World Cup 2022 in Qatar, which were recorded using different cameras (stationary and movable), as well as further data such as the exact playing field dimensions within the individual stadiums.

This task kept the ETH researchers busy for three years - an eternity in the rapidly advancing world of AI. "At first, we thought we would quickly be able to obtain a precise dataset," Jiang recalls. "We already had a system that could represent poses and trajectories precisely in digital form, and we assumed that this would be easy to apply to the World Cup footage."

They soon realised that there's a big difference between simply digitising individual sequences and applying the system to a larger dataset. For example, the technical challenges included player obstruction, motion blur and problems with camera calibration. Distortions from the various cameras or the zoom of the broadcasting camera also proved to be tricky.

Lines need to match perfectly

To ensure that the digital players ended up precisely superimposed on top of the real players, the researchers first had to calibrate and align the video footage from a stadium's various static cameras, each viewing the pitch from a different angle. Calibration serves to precisely determine the specific properties of each camera, such as the focal length or sensor size, and to adjust the camera so that it records reality as accurately as possible. This is because every camera suffers from certain distortions due to its optics, such as when it comes to depicting straight lines.

Digital reference lines are then placed over the camera image as a visual aid. This overlay shows how well the calibration is working or if there are still distortions. "If the calibration is correct, the digital field line overlaps perfectly with the real one - from all angles," says Jiang.
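
The overlay check described above can be expressed numerically as a reprojection error: project known pitch points (say, points along a field line) through the camera model and measure how far they fall from where the line is observed in the image. The sketch below uses a deliberately simplified camera looking straight down with assumed intrinsics; real calibration also models camera rotation and lens distortion.

```python
import math

# Assumed intrinsics (pixels) and camera height above the pitch (metres);
# illustrative values for a camera looking straight down, no rotation.
f, cx, cy = 1200.0, 960.0, 540.0
cam_height = 20.0

def project_ground(x, y):
    """Project a pitch point (x, y, 0) into the image."""
    return (f * x / cam_height + cx, f * y / cam_height + cy)

# Known points along the halfway line (pitch coordinates, metres) and the
# pixels where that line was observed in the footage.
line_points = [(0.0, -10.0), (0.0, 0.0), (0.0, 10.0)]
observed    = [(960.0, -60.0), (960.0, 540.0), (960.0, 1140.0)]

errors = [math.dist(project_ground(x, y), obs)
          for (x, y), obs in zip(line_points, observed)]
rms = math.sqrt(sum(e * e for e in errors) / len(errors))
# If calibration is correct, the projected line overlaps the observed one
# and the RMS reprojection error is (near) zero.
```

In practice the calibration parameters are adjusted until this error is minimised over many such reference lines and viewpoints.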

The computer can then use the accurately coordinated parameters of the static cameras to estimate the players' poses and trajectories. Using the SMPL model, which is widely used in computer vision, the digital body is represented so that it is as close as possible to the human original.
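
SMPL's appeal is that it describes a whole body with a small parameter vector rather than raw geometry: 10 shape coefficients ("betas") for body proportions and 24 joint rotations in axis-angle form (72 pose parameters in total). The real model maps these parameters to a mesh of roughly 6,890 vertices via learned blend shapes and linear blend skinning; that mapping is omitted in the sketch below, which only shows the parameterization.

```python
# SMPL parameter dimensions: 10 shape coefficients and 24 joints with a
# 3-value axis-angle rotation each (the first joint is the global root).
NUM_SHAPE = 10
NUM_JOINTS = 24
POSE_DIM = NUM_JOINTS * 3   # 72 pose parameters

def make_smpl_params(betas=None, thetas=None):
    """Return a (shape, pose) parameter pair, zero-initialised by default;
    zero pose corresponds to the model's neutral T-pose."""
    betas = betas if betas is not None else [0.0] * NUM_SHAPE
    thetas = thetas if thetas is not None else [0.0] * POSE_DIM
    assert len(betas) == NUM_SHAPE and len(thetas) == POSE_DIM
    return betas, thetas

betas, thetas = make_smpl_params()
```

Fitting a digital player to footage then means optimising these 82 numbers per frame until the resulting body matches the image, rather than estimating thousands of surface points directly.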

This data is then used to "feed" the movable broadcasting camera, which is also calibrated - by moving it in all directions, for example, and zooming it in and out. If the real and digital data overlaps correctly, it is now possible to represent the exact position, trajectory and pose of the individual players on the pitch digitally in three dimensions - using only one camera.
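
One common way to recover a player's pitch position from a single calibrated camera (a standard technique, assumed here for illustration rather than taken from the paper) is to back-project the pixel at the player's feet and intersect that ray with the ground plane z = 0. The intrinsics and camera pose below are assumptions; the camera is simplified to look straight down.

```python
# Assumed intrinsics (pixels) and camera position: 20 m above the pitch
# origin, looking straight down (no rotation between camera and world).
f, cx, cy = 1000.0, 960.0, 540.0
cam_pos = (0.0, 0.0, 20.0)

def pixel_to_pitch(u, v):
    """Back-project a pixel to its intersection with the pitch plane z = 0."""
    # Ray direction through the pixel, in world coordinates.
    dx, dy = (u - cx) / f, (v - cy) / f
    # Intersect with z = 0: cam_z - t = 0  =>  travel t = cam_z along -z.
    t = cam_pos[2]
    return (cam_pos[0] + t * dx, cam_pos[1] + t * dy)

# A player's feet detected at pixel (1160, 640) map to a pitch position.
pos = pixel_to_pitch(1160.0, 640.0)
```

With this geometry in place, a single moving broadcast camera can in principle yield metric player positions, provided its calibration is tracked as it pans and zooms.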

Zoom pushed the system to its limits

Using their dataset, the ETH researchers were then able to make a detailed comparison of whether a single camera with existing MPE technology can reliably detect a player in an offside position. In their study, which was presented at the European Conference on Computer Vision in Milan, the computer scientists found that existing methods struggle with this new dataset, highlighting potential new research directions.

Pose estimations with just one camera can determine poses and movements in a small space with a high degree of accuracy, even in the case of a long focal length or if there is a long distance between the person and the camera. MPE models also perform relatively well with individual motion sequences, but they struggle to determine the relative positions of multiple players in the same space. Zooming in and out with the camera proved to be particularly demanding. "This confirmed to us that a lot of research is still needed in order to achieve a working and stable system," says Jiang.

Data published for competition

With the WorldPose dataset, the aim is now for other scientists to train their systems and develop algorithms so that accurate AI analysis is possible with a single movable camera in the future. To this end, FIFA has launched an Innovation Challenge. In addition to the ETH dataset, FIFA is also providing video sequences of football games for this international research competition, albeit - this time - only from the broadcasting camera.

"As we're sharing the data with others, this could speed up research in this area," says Jiang. "If models that provide precise analysis with a single camera one day achieve the same quality as our dataset, the technology will be suitable for widespread use."

So far, over 150 researchers around the world have already responded to the competition announcement. ETH Zurich is also continuing to train its systems. Jiang says: "We'll continue working on the dataset and develop further models ourselves."
