Towards Universal Soccer Video Understanding

1School of Artificial Intelligence, Shanghai Jiao Tong University, China
2CMIC, Shanghai Jiao Tong University, China    3Alibaba Group, China
Teaser

Overview. We present the largest soccer dataset to date, SoccerReplay-1988, and the first vision-language foundation model for soccer, MatchVision, capable of tasks like event classification and commentary generation.

Abstract

As a globally celebrated sport, soccer has attracted widespread interest from fans all over the world. This paper aims to develop a comprehensive multi-modal framework for soccer video understanding. Specifically, we make the following contributions: (i) we introduce SoccerReplay-1988, the largest multi-modal soccer dataset to date, featuring videos and detailed annotations from 1,988 complete matches, curated with an automated annotation pipeline; (ii) we present MatchVision, the first vision-language foundation model in the soccer domain, which leverages spatiotemporal information across soccer videos and excels in various downstream tasks; (iii) we conduct extensive experiments and ablation studies on event classification, commentary generation, and multi-view foul recognition, achieving state-of-the-art performance on all of them and substantially outperforming existing models, which demonstrates the superiority of our proposed data and model. We believe this work will offer a standard paradigm for sports understanding research. The code and model will be publicly available for reproduction.

SoccerReplay-1988 Dataset

Dataset statistics 1

SoccerReplay-1988 Dataset Overview. Left: Data Curation Pipeline. The collected soccer video data are automatically processed for temporal alignment, event summarization, and anonymization by our curation pipeline; Right: Statistics of Soccer Datasets. Our SoccerReplay-1988 significantly surpasses existing datasets in both scale and diversity. Here, # Anno. and # Com. refer to the number of event annotations and textual commentaries, respectively.
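As a rough illustration of the curation steps above, the sketch below shows how a single commentary entry might be temporally aligned and anonymized; the data fields, the kickoff-offset logic, and the player-name list are hypothetical stand-ins, not the released pipeline.

    import re
    from dataclasses import dataclass

    # Hypothetical record for one raw commentary entry collected alongside a match video.
    @dataclass
    class RawComment:
        minute: int   # in-match minute reported by the text source
        text: str     # raw commentary text, possibly containing player names

    def align_to_video(minute: int, kickoff_offset_sec: float) -> float:
        """Map an in-match minute to a video timestamp in seconds.
        Assumes the kickoff offset was estimated elsewhere (e.g., from the on-screen game clock)."""
        return kickoff_offset_sec + minute * 60.0

    def anonymize(text: str, player_names: list[str]) -> str:
        """Replace known player names with a generic [PLAYER] token (illustrative only)."""
        for name in player_names:
            text = re.sub(re.escape(name), "[PLAYER]", text)
        return text

    def curate(comments: list[RawComment], kickoff_offset_sec: float, player_names: list[str]):
        """Toy curation loop: temporal alignment plus anonymization for each commentary entry."""
        return [
            {"timestamp_sec": align_to_video(c.minute, kickoff_offset_sec),
             "comment": anonymize(c.text, player_names)}
            for c in comments
        ]

    if __name__ == "__main__":
        raw = [RawComment(minute=23, text="Great strike by Haaland from outside the box!")]
        print(curate(raw, kickoff_offset_sec=312.0, player_names=["Haaland"]))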

Dataset statistics 2

Event Labels and Commentaries in SoccerReplay-1988. The SoccerReplay-1988 dataset consists of full-match videos, event descriptions, and match-related information for 1,988 soccer matches. Left: Distribution of Event Labels across 24 Classes in SoccerReplay-1988. In our dataset, we summarize 24 classes of event labels for soccer scenarios with second-level timestamps. Right: Word Cloud of Commentaries. For each event in our dataset, we also provide a textual commentary that describes the event in more detail.
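For concreteness, one annotated event can be pictured roughly as the record below; the field names are our own illustration, not the dataset's official schema.

    # Hypothetical illustration of a single annotated event; field names are not the official schema.
    event_annotation = {
        "match_id": "league_2023-24_round12_home-vs-away",   # anonymized match identifier
        "timestamp": "00:37:42",                             # second-level timestamp into the video
        "event_label": "corner",                             # one of the 24 summarized event classes
        "commentary": "The ball is swung into the box from the corner, "
                      "but the keeper claims it comfortably.",
    }

    print(event_annotation["event_label"], "at", event_annotation["timestamp"])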

Method

Overall Structure

Overview of Our Proposed Soccer Visual Encoder: MatchVision. (a) The model architecture and its spatiotemporal feature extraction process; (b) Details of visual encoder pretraining, such as supervised training and video-language contrastive learning; (c) Implementation details of specific heads for various downstream tasks, including commentary generation, foul recognition, and event classification.
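As a minimal sketch of the video-language contrastive learning mentioned in (b), the snippet below computes a symmetric InfoNCE-style loss over a batch of paired video and text embeddings; the embedding dimension, temperature, and random inputs are placeholders rather than MatchVision's actual configuration.

    import torch
    import torch.nn.functional as F

    def video_text_contrastive_loss(video_emb: torch.Tensor,
                                    text_emb: torch.Tensor,
                                    temperature: float = 0.07) -> torch.Tensor:
        """Symmetric InfoNCE-style loss over paired (video, text) embeddings.
        Both tensors have shape (batch, dim); matching pairs share the same row index."""
        video_emb = F.normalize(video_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = video_emb @ text_emb.t() / temperature       # (batch, batch) similarity matrix
        targets = torch.arange(video_emb.size(0), device=video_emb.device)
        loss_v2t = F.cross_entropy(logits, targets)           # video -> text direction
        loss_t2v = F.cross_entropy(logits.t(), targets)       # text -> video direction
        return 0.5 * (loss_v2t + loss_t2v)

    # Toy usage with random embeddings standing in for the encoder outputs.
    loss = video_text_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
    print(loss.item())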

Results

Results1

Quantitative Results on Event Classification and Commentary Generation. Here, SN, MT, and SR represent curated SoccerNet-v2, MatchTime, and SoccerReplay-1988, respectively. Moreover, B, M, R-L, and C refer to BLEU, METEOR, ROUGE-L, and CIDEr metrics, respectively. Within each unit, we denote the best performance in RED and the second-best performance in BLUE.
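The captioning metrics reported here (BLEU, METEOR, ROUGE-L, CIDEr) are commonly computed with the pycocoevalcap package; the snippet below is a generic example with toy references and predictions, not our exact evaluation script.

    # pip install pycocoevalcap   (METEOR additionally requires a local Java runtime)
    from pycocoevalcap.bleu.bleu import Bleu
    from pycocoevalcap.meteor.meteor import Meteor
    from pycocoevalcap.rouge.rouge import Rouge
    from pycocoevalcap.cider.cider import Cider

    # Ground-truth references and generated commentaries, keyed by sample id.
    gts = {"0": ["The forward curls the free kick just over the crossbar."]}
    res = {"0": ["The free kick sails narrowly over the bar."]}

    for name, scorer in [("BLEU", Bleu(4)), ("METEOR", Meteor()),
                         ("ROUGE-L", Rouge()), ("CIDEr", Cider())]:
        score, _ = scorer.compute_score(gts, res)
        print(name, score)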

Results2

Ablation Studies on Downstream Tasks of Our New Benchmarks. Left: Event Classification. We explore the impact of various training settings of our MatchVision encoder on the SoccerReplay-test benchmark. Here, Sup., Contra., and SR refer to supervised training, visual-language contrastive learning, and the SoccerReplay-1988 dataset, respectively. Right: Commentary Generation. We investigate the impact of different training strategies and datasets on MatchVision using the SoccerReplay-test benchmark. 'V' and 'L' denote the visual encoder and the LLM decoder, respectively, and the accompanying markers indicate whether each component is trainable or frozen.
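The trainable-versus-frozen settings in the right table amount to toggling gradient updates per component; a minimal PyTorch pattern for this, with placeholder modules standing in for the visual encoder and LLM decoder, looks like the following.

    import torch
    import torch.nn as nn

    def set_trainable(module: nn.Module, trainable: bool) -> None:
        """Freeze or unfreeze all parameters of a component (e.g., the visual encoder or LLM decoder)."""
        for p in module.parameters():
            p.requires_grad = trainable

    # Placeholder modules standing in for the visual encoder ('V') and the LLM decoder ('L').
    visual_encoder = nn.Linear(512, 512)
    llm_decoder = nn.Linear(512, 512)

    # One ablation setting: frozen visual encoder, trainable LLM decoder.
    set_trainable(visual_encoder, False)
    set_trainable(llm_decoder, True)

    # Only parameters with requires_grad=True are handed to the optimizer.
    trainable_params = [p for m in (visual_encoder, llm_decoder)
                        for p in m.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)
    print(sum(p.numel() for p in trainable_params), "trainable parameters")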

Qualitative Results

Results3

Qualitative Results for Event Classification and Commentary Generation. GT denotes ground truth, while "w/o SR" and "w/ SR" indicate models trained without and with the SoccerReplay-1988 dataset, respectively. Training with the SoccerReplay-1988 dataset improves event classification performance. For commentary generation in particular, the expanded training data enables the MatchVoice model to demonstrate notable advantages in (a) more detailed descriptions, (b) greater linguistic variety, (c) higher accuracy in event depiction, (d) better adherence to updated rules, and (e) improved specificity when responding to particular scenarios.

Results4

More Qualitative Results for Commentary Generation of Different Classes. We provide more qualitative visualizations of commentary generation across various events on the field.

BibTeX


        @article{rao2024unisoccer,
          title   = {Towards Universal Soccer Video Understanding},
          author  = {Rao, Jiayuan and Wu, Haoning and Jiang, Hao and Zhang, Ya and Wang, Yanfeng and Xie, Weidi},
          journal = {arXiv preprint arXiv:2412.01820},
          year    = {2024},
        }