ICPR2022 Extracting subtitles with both visual and audio annotations

Organized by icprmsr
Reward $5,300

First phase

March 7, 2022, midnight UTC


Competition Ends
May 12, 2022, 11 p.m. UTC

Extracting subtitles with both visual and audio annotations

In this subtask, the training set consists of 50 hours of video with both visual and audio annotations and 200 hours of video with no annotation. Another 20 hours and 5 hours of video will be provided as the validation and testing sets, respectively. For the visual annotation, we will provide the characters of all text in key frames; for the audio modality, we will provide speech transcripts of each voice activity detection (VAD) segment. With these data, participants will be required to produce the subtitles for each video in our testing set, and the submitted results will be ranked by the character error rate (CER) metric.


  • the submission should be a .zip file containing all the results in XML format
  • results for all the test videos must be provided
  • we suggest using our code for splitting video frames, so that the frame ids stay consistent
  • please find the detailed evaluation method on our website


The submitted XML files should follow the format below.


<video_id> test1.mp4 </video_id>


<start_time> s1 </start_time>

<end_time> e1 </end_time>

<content> content1 </content>



<start_time> s2 </start_time>

<end_time> e2 </end_time>

<content> content2 </content>




<start_time> sn </start_time>

<end_time> en </end_time>

<content> contentn </content>
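
As a rough illustration of the layout above, the following Python sketch renders one result file from a list of subtitle segments. The function name `render_submission` and the tuple layout are hypothetical, the time values are passed through unchanged because the page does not state their unit, and the sketch assumes `video_id` is closed with `</video_id>`. Any enclosing elements required by the official format are not shown on this page, so the competition website remains the reference.

```python
from xml.sax.saxutils import escape


def render_submission(video_name, segments):
    """Render one video's subtitles in the flat tag layout shown above.

    segments: iterable of (start_time, end_time, content) tuples
    (hypothetical layout; times are written as given).
    """
    lines = [f"<video_id> {escape(video_name)} </video_id>"]
    for start, end, content in segments:
        lines.append(f"<start_time> {start} </start_time>")
        lines.append(f"<end_time> {end} </end_time>")
        # Escape &, < and > so the subtitle text cannot break the markup.
        lines.append(f"<content> {escape(content)} </content>")
    return "\n".join(lines) + "\n"
```

Escaping the subtitle text is a defensive choice for recognised text that may contain characters such as `&`.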



Evaluation criteria



The subtitle information in the video is provided by the audio and visual modalities. First, when both the audio and visual modalities contain subtitles, the final subtitle is taken from the two modalities together. For example, when the subtitle from the visual modality is "ICPR MSR competition" and the subtitle from the audio modality is "ICPR ICPR MSR competition", the final subtitle is "ICPR ICPR MSR competition". Second, when only the visual modality contains subtitles, we use the subtitle from the visual modality as the ground truth. Third, when only the audio modality contains subtitles, we use the subtitle from the audio modality as the ground truth.


Participants need to predict all subtitles in the video. The prediction files must be in XML format, and each XML file must have the same name as its video. For example, if the video name is "000001.mp4", the name of the XML file needs to be "000001.xml".
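
The naming rule can be sketched in Python as below. The helper names `xml_name_for` and `build_zip` are hypothetical, and placing the result files at the top level of the archive is an assumption, since the page does not describe the archive layout.

```python
import zipfile
from pathlib import Path


def xml_name_for(video_name):
    """Map a video file name to its result file name, e.g. 000001.mp4 -> 000001.xml."""
    return Path(video_name).with_suffix(".xml").name


def build_zip(results, zip_path):
    """Write one result file per video into a .zip archive.

    results: dict mapping video file name -> result text for that video.
    Files are stored at the archive root (an assumption, not stated on this page).
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for video_name, xml_text in sorted(results.items()):
            archive.writestr(xml_name_for(video_name), xml_text)
```

Before uploading, it may be worth checking that the archive contains one entry for every test video, since results for all test videos are required.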


Participants need to concatenate all the subtitles in time order. The CER between the predicted subtitle and the ground truth is then calculated to evaluate the models.
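
For local validation, the metric can be approximated with the usual definition of CER: the Levenshtein edit distance between the concatenated prediction and the concatenated ground truth, divided by the ground-truth length. This is a minimal sketch under that assumption, and the official scoring script on the website may differ in details. The `(start_time, text)` tuple layout is hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference_segments, predicted_segments):
    """CER over subtitles concatenated in order of start time.

    Each argument is a list of (start_time, text) tuples (hypothetical layout).
    """
    ref = "".join(text for _, text in sorted(reference_segments, key=lambda s: s[0]))
    hyp = "".join(text for _, text in sorted(predicted_segments, key=lambda s: s[0]))
    if not ref:
        raise ValueError("reference is empty, CER is undefined")
    return edit_distance(ref, hyp) / len(ref)
```

Because the distance is divided by the reference length only, a prediction with many spurious characters can score above 1.0.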


Please find the detailed evaluation method on our website

Note: To avoid the ambiguity in annotations, we perform the following preprocessing before evaluation:

  • English letters are treated as case insensitive;
  • Traditional and simplified Chinese characters are treated as the same label;
  • Blank spaces and symbols are removed;
  • Illegible videos do not contribute to the evaluation result.
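
Part of this preprocessing can be mirrored locally with a small Python sketch. The function name `normalize` is hypothetical, and treating every non-alphanumeric character as a "symbol" (so digits are kept) is an assumption. The traditional/simplified mapping is deliberately left out here because it needs a conversion table or a dedicated library.

```python
def normalize(text):
    """Lowercase the text and drop spaces and symbols.

    Keeps only characters for which str.isalnum() is true, which includes
    Latin letters, digits and CJK characters. Traditional/simplified Chinese
    unification is NOT handled in this sketch.
    """
    return "".join(ch for ch in text.lower() if ch.isalnum())
```

Applying the same normalization to both the prediction and the reference before scoring keeps local results closer to the described evaluation.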


Terms and Conditions

This page enumerates the terms and conditions of the competition.


Start: March 7, 2022, midnight


Start: March 12, 2022, midnight


Start: April 22, 2022, midnight

Competition Ends

May 12, 2022, 11 p.m.
