ICPR2022MSR: Extracting subtitles in visual modality with audio annotations

Organized by icprmsr
Reward $5,300

First phase: Registration
March 7, 2022, midnight UTC

End: Competition Ends
May 12, 2022, 11 p.m. UTC

Extracting subtitles in visual modality with audio annotations

To extract subtitles from video frames, a large number of keyframes must be annotated with bounding boxes and text contents, which is extremely costly. However, speech transcripts are much easier to obtain, and they contain almost all the content of the subtitles. In this subtask, we present a challenge that explores learning visual subtitles under the supervision of speech transcripts. We expect that annotations from the audio modality can improve subtitle extraction in the visual modality. We will provide 75 hours of video content, divided into 50, 5, and 20 hours as the training, validation, and testing sets, respectively. For the training set, only audio annotations will be provided, and participants are required to design subtitle OCR systems with these annotations. To pretrain an OCR system, participants may also use a limited number of open datasets and then fine-tune their models with audio supervision. Under these conditions, participants will be asked to produce subtitle text for each video in our testing set, and the submitted results will be ranked using the CER (character error rate) metric.


Submission

  • the submission should be a single .zip file containing all the results in XML format
  • results for all the test videos must be provided
  • we suggest using our frame-splitting code so that frame ids stay consistent (a minimal sketch of frame extraction follows this list)
  • please find the detailed evaluation method on our website
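Below is a minimal frame-splitting sketch using OpenCV (cv2). It is not the organizers' official splitting code: the sequential zero-based frame ids are only one plausible convention, so please verify them against the provided code before submitting.

import cv2

def split_frames(video_path, out_pattern="frame_{:06d}.png"):
    """Read a video and save every frame under a sequential frame id."""
    cap = cv2.VideoCapture(video_path)
    frame_id = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of stream
            break
        cv2.imwrite(out_pattern.format(frame_id), frame)
        frame_id += 1
    cap.release()
    return frame_id  # total number of frames written

For example, split_frames("test_video1.mp4") writes frame_000000.png, frame_000001.png, and so on.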

The submitted XML files should follow the format below.

<annotation>
  <video_id> test_video1.mp4 </video_id>
  <subtitle>
    <start_id> s1 </start_id>
    <end_id> e1 </end_id>
    <content> content1 </content>
  </subtitle>
  <subtitle>
    <start_id> s2 </start_id>
    <end_id> e2 </end_id>
    <content> content2 </content>
  </subtitle>
  .....
  <subtitle> *** </subtitle>
</annotation>
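For reference, here is a minimal sketch that writes one result file in the format above using only the Python standard library. Tag names follow the example; the exact whitespace inside elements is assumed not to matter to the evaluator.

import xml.etree.ElementTree as ET

def write_annotation(video_id, subtitles, out_path):
    """subtitles: a list of (start_id, end_id, content) tuples."""
    root = ET.Element("annotation")
    ET.SubElement(root, "video_id").text = video_id
    for start_id, end_id, content in subtitles:
        sub = ET.SubElement(root, "subtitle")
        ET.SubElement(sub, "start_id").text = str(start_id)
        ET.SubElement(sub, "end_id").text = str(end_id)
        ET.SubElement(sub, "content").text = content
    ET.ElementTree(root).write(out_path, encoding="utf-8")

For example, write_annotation("test_video1.mp4", [("s1", "e1", "content1"), ("s2", "e2", "content2")], "test_video1.xml") reproduces the structure shown above.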


Evaluation criteria

Prediction:

Participants need to predict the subtitle text in the video frames, remove subtitles repeated across adjacent consecutive frames, and provide the start frame and end frame of each subtitle. The submitted predictions should be in the XML format described above. A minimal sketch of the frame-merging step follows.
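This sketch assumes per-frame OCR output as a hypothetical list of (frame_id, text) pairs in frame order; identical text on consecutive frames is merged into one segment with its start and end frame ids. A real system might use a fuzzy string match instead of exact equality to tolerate OCR noise.

def merge_frames(frame_texts):
    """Collapse (frame_id, text) pairs into (start_id, end_id, content) segments."""
    segments = []
    for frame_id, text in frame_texts:
        if text and segments and segments[-1][2] == text and segments[-1][1] == frame_id - 1:
            # the same subtitle continues: extend the current segment
            start, _, content = segments[-1]
            segments[-1] = (start, frame_id, content)
        elif text:
            segments.append((frame_id, frame_id, text))
    return segments

For example, merge_frames([(0, "hi"), (1, "hi"), (2, "bye")]) returns [(0, 1, "hi"), (2, 2, "bye")].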


Evaluation:

For each subtitle in the ground truth, we take the prediction with the largest tIoU, requiring tIoU >= 0.5, as the matched prediction, and compute the CER between the two character sequences. If no prediction can be matched, the subtitle is counted as missing and its CER is 1. Then, among the remaining predictions, any one that has tIoU >= 0.5 with a ground-truth subtitle but was not the best match in the previous step is regarded as a false detection, also with a CER of 1. The final score is the average of all these CER values.
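The following sketch implements this matching scheme as we understand it, with segments represented as (start, end, text) tuples. The greedy best-tIoU matching and the CER-of-1 penalties follow the description above; the exact tie-breaking used by the official evaluator is an assumption.

def tiou(a, b):
    """Temporal IoU of two inclusive frame ranges (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union

def cer(ref, hyp):
    """Character error rate: Levenshtein distance / reference length."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)] / max(len(ref), 1)

def evaluate(gt_segments, pred_segments):
    scores, used = [], set()
    for g in gt_segments:
        # best-tIoU prediction, accepted only if tIoU >= 0.5
        cands = [(tiou(g[:2], p[:2]), i) for i, p in enumerate(pred_segments)]
        score, idx = max(cands, default=(0.0, -1))
        if score >= 0.5:
            used.add(idx)
            scores.append(cer(g[2], pred_segments[idx][2]))
        else:
            scores.append(1.0)  # missing subtitle
    for i, p in enumerate(pred_segments):
        # unmatched prediction overlapping some ground truth: false detection
        if i not in used and any(tiou(g[:2], p[:2]) >= 0.5 for g in gt_segments):
            scores.append(1.0)
    return sum(scores) / len(scores) if scores else 0.0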

Please find the detailed evaluation method on our website.


Note: To avoid ambiguity in the annotations, we perform the following preprocessing before evaluation (a minimal sketch follows the list):

  • English letters are not case-sensitive;
  • Traditional and simplified Chinese characters are treated as the same label;
  • Blank spaces and symbols are removed;
  • Illegible videos do not contribute to the evaluation result.
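A minimal sketch of this normalization, assuming the third-party opencc package for traditional-to-simplified conversion; the official evaluator's normalization may differ in detail.

import unicodedata
from opencc import OpenCC  # pip install opencc (assumed dependency)

_t2s = OpenCC("t2s")

def normalize(text):
    text = _t2s.convert(text)  # map traditional Chinese to simplified
    text = text.lower()        # English letters are case-insensitive
    # drop blank spaces and punctuation/symbol characters
    return "".join(ch for ch in text
                   if not ch.isspace()
                   and not unicodedata.category(ch).startswith(("P", "S")))

For example, normalize("Hello, 世界!") returns "hello世界".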

Terms and Conditions

This page enumerates the terms and conditions of the competition.

Registration

Start: March 7, 2022, midnight UTC

Development

Start: March 12, 2022, midnight UTC

Description: test run of the evaluation system

Evaluation

Start: April 22, 2022, midnight UTC

Competition Ends

May 12, 2022, 11 p.m. UTC
