To extract subtitles from video frames, a large number of keyframes must be annotated with bounding boxes and contents, which is extremely costly. However, speech transcripts are much easier to obtain, and they contain almost all of the content of the subtitles. In this subtask, we present a challenge that explores learning visual subtitles under the supervision of speech transcripts. We expect that annotations from the audio modality can improve subtitle extraction from the visual modality. We will provide 75 hours of video content, divided into training, validation, and testing sets of 50, 5, and 20 hours, respectively. For the training set, only audio annotations will be provided, and participants are required to design subtitle OCR systems with these annotations. To pretrain an OCR system, participants may also use a limited number of open datasets and then fine-tune their models with audio supervision. Under these conditions, participants will be asked to produce subtitle text for each video in our testing set, and the submitted results will be ranked using the CER metric.
The submitted XML files should follow the format below.
<annotation>
  <video_id> test_video1.mp4 </video_id>
  <subtitle>
    <start_id> s1 </start_id>
    <end_id> e1 </end_id>
    <content> content1 </content>
  </subtitle>
  <subtitle>
    <start_id> s2 </start_id>
    <end_id> e2 </end_id>
    <content> content2 </content>
  </subtitle>
  ...
  <subtitle> ... </subtitle>
</annotation>
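For reference, below is a minimal Python sketch (standard library only) that writes a file in this format. The write_submission name and the (start_id, end_id, content) tuple layout are our own illustration, not part of any official toolkit.

import xml.etree.ElementTree as ET

def write_submission(video_id, subtitles, path):
    # subtitles: iterable of (start_id, end_id, content) tuples
    root = ET.Element("annotation")
    ET.SubElement(root, "video_id").text = video_id
    for start_id, end_id, content in subtitles:
        sub = ET.SubElement(root, "subtitle")
        ET.SubElement(sub, "start_id").text = str(start_id)
        ET.SubElement(sub, "end_id").text = str(end_id)
        ET.SubElement(sub, "content").text = content
    tree = ET.ElementTree(root)
    ET.indent(tree)  # pretty-print (Python 3.9+)
    tree.write(path, encoding="utf-8", xml_declaration=True)

write_submission("test_video1.mp4",
                 [("s1", "e1", "content1"), ("s2", "e2", "content2")],
                 "test_video1.xml")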
Prediction:
Participants need to predict the subtitle information in each video frame, remove subtitles repeated across adjacent consecutive frames, and provide the start frame and end frame of each subtitle. The submitted prediction file should be an XML file in the format above.
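As an illustration of the deduplication step, here is a minimal Python sketch. The frame_texts input, one (frame_index, text) pair per decoded frame with an empty text when no subtitle is detected, is a hypothetical per-frame OCR output, and exact string equality is the simplest possible merge criterion.

def collapse_subtitles(frame_texts):
    # frame_texts: list of (frame_index, text) pairs with consecutive
    # integer frame indices. Returns (start_frame, end_frame, content)
    # spans with repeats in adjacent frames merged.
    spans = []
    for frame, text in frame_texts:
        if not text:
            continue
        if spans and spans[-1][2] == text and frame == spans[-1][1] + 1:
            spans[-1] = (spans[-1][0], frame, text)  # extend current span
        else:
            spans.append((frame, frame, text))       # start a new span
    return spans

In practice OCR output can jitter between adjacent frames, so a fuzzy comparison (e.g., a small edit-distance threshold) may merge spans more reliably than exact equality.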
Evaluation:
For each subtitle in the ground truth, we take the prediction with the largest tIoU, provided that tIoU >= 0.5, as the matched prediction, and compute the CER between the two character sequences. If no prediction can be matched, the subtitle is counted as missing and its CER is 1. Then, each remaining prediction that has a tIoU >= 0.5 with some ground-truth subtitle but was not the best match in the previous step is regarded as a false detection, with a CER of 1. The final score is the average of all these CER values.
Please find the detailed evaluation method on our website.
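The scorer on the website is authoritative; the Python sketch below only illustrates the matching and averaging logic described above. The inclusive frame-interval convention in tiou, the greedy one-to-one matching order, and the capping of per-pair CER at 1 are our assumptions.

def tiou(a, b):
    # temporal IoU of two (start, end) frame intervals, endpoints inclusive
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union

def cer(ref, hyp):
    # character error rate: Levenshtein distance normalized by len(ref)
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)] / max(len(ref), 1)

def evaluate(gts, preds):
    # gts, preds: lists of ((start_frame, end_frame), text)
    scores, matched = [], set()
    for g_span, g_text in gts:
        best, best_iou = None, 0.5
        for k, (p_span, p_text) in enumerate(preds):
            if k not in matched and tiou(g_span, p_span) >= best_iou:
                best, best_iou = k, tiou(g_span, p_span)
        if best is None:
            scores.append(1.0)                        # missing subtitle
        else:
            matched.add(best)
            scores.append(min(cer(g_text, preds[best][1]), 1.0))
    for k, (p_span, _) in enumerate(preds):
        if k not in matched and any(tiou(p_span, g) >= 0.5 for g, _ in gts):
            scores.append(1.0)                        # false detection
    return sum(scores) / len(scores) if scores else 0.0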
Note: To avoid ambiguity in annotations, we perform the following preprocessing before evaluation:
This page enumerates the terms and conditions of the competition.
Schedule:
Start: March 7, 2022, midnight
Start: March 12, 2022, midnight
Test Evaluation System Base: starts April 22, 2022, midnight
Competition ends: May 12, 2022, 11 p.m.