In this subtask, the training set comprises 50 hours of video with both visual and audio annotations, plus 200 hours of video with no annotation. Another 20 and 5 hours of video are provided as the validation and testing sets, respectively. For the visual annotation, we provide the characters of all text in key frames; for the audio annotation, we provide the speech transcript of each segment. With these data, participants are required to produce the subtitles (with start and end times) for each video in our testing set, and the submitted results will be ranked by the character error rate (CER) metric.
The submitted XML files should follow the format below.
<annotation>
<video_id> test1.mp4 </video_id>
<subtitle>
<start_time> s1 </start_time>
<end_time> e1 </end_time>
<content> content1 </content>
</subtitle>
<subtitle>
<start_time> s2 </start_time>
<end_time> e2 </end_time>
<content> content2 </content>
</subtitle>
...
<subtitle>
<start_time> sn </start_time>
<end_time> en </end_time>
<content> contentn </content>
</subtitle>
</annotation>
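The format above can be produced with the standard library's `xml.etree.ElementTree`. The sketch below is a minimal example, not official tooling; the helper name `write_submission` and the tuple layout of `subtitles` are our own assumptions.

```python
import xml.etree.ElementTree as ET

def write_submission(video_name, subtitles, out_dir="."):
    """Write predicted subtitles for one video in the required XML format.

    `subtitles` is a list of (start_time, end_time, content) tuples,
    sorted by start time.
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "video_id").text = video_name
    for start, end, content in subtitles:
        sub = ET.SubElement(root, "subtitle")
        ET.SubElement(sub, "start_time").text = str(start)
        ET.SubElement(sub, "end_time").text = str(end)
        ET.SubElement(sub, "content").text = content
    # The XML file must share its stem with the video: 000001.mp4 -> 000001.xml
    stem = video_name.rsplit(".", 1)[0]
    path = f"{out_dir}/{stem}.xml"
    ET.ElementTree(root).write(path, encoding="utf-8")
    return path
```

Note that the output file name is derived from the video name, matching the naming rule described under "Predictions" below.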
Annotation:
The subtitle information in a video is provided by the audio and visual modalities, and the ground truth is determined as follows. First, when both modalities contain subtitles, the final subtitle combines the two. For example, if the subtitle from the visual modality is "ICPR MSR competition" and the subtitle from the audio modality is "ICPR ICPR MSR competition", the final subtitle is "ICPR ICPR MSR competition". Second, when only the visual modality contains subtitles, the subtitle from the visual modality is the ground truth. Third, when only the audio modality contains subtitles, the subtitle from the audio modality is the ground truth.
Predictions:
Participants need to predict all subtitles in each video. Predictions must be submitted as XML files, and each XML file must have the same name as its video. For example, if the video name is "000001.mp4", the XML file must be named "000001.xml".
Evaluation:
Participants need to concatenate all the subtitles in order of time. The CER between the concatenated predicted subtitles and the ground truth is calculated to evaluate the models.
Please find the detailed evaluation method on our website.
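The evaluation metric can be sketched as follows: concatenate the subtitles in time order and compute the Levenshtein edit distance between the predicted and reference strings, normalized by the reference length. This is a minimal illustration; the official scorer on the website may apply additional text normalization.

```python
def edit_distance(ref, hyp):
    """Character-level Levenshtein distance via dynamic programming."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n]

def cer(ref_subtitles, hyp_subtitles):
    """CER between time-ordered concatenations of reference and prediction.

    Each argument is a list of (start_time, content) pairs.
    """
    ref = "".join(c for _, c in sorted(ref_subtitles))
    hyp = "".join(c for _, c in sorted(hyp_subtitles))
    return edit_distance(ref, hyp) / max(len(ref), 1)
```

A lower CER is better; 0.0 means the concatenated prediction matches the ground truth exactly.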
Note: To avoid ambiguity in annotations, we perform the following preprocessing before evaluation:
Timeline:
Start: March 7, 2022, midnight
Start: March 12, 2022, midnight
Start: April 22, 2022, midnight
End: May 12, 2022, 11 p.m.