M2MeT Challenge 2022: Task 2 (Multi-Speaker ASR) - Sub-track I (Fixed Training Condition)

Organized by M2MeT

Current Phase: Evaluation Set Phase (scoring on the Evaluation set; started Dec. 23, 2021, midnight UTC)

Competition Ends: Dec. 23, 2031, midnight UTC

Background & Task Overview

Recent developments in speech signal processing, such as speech recognition and speaker diarization, have inspired numerous applications of speech technologies. The meeting scenario is one of the most valuable and, at the same time, most challenging scenarios for these technologies, because it involves free speaking styles and complex acoustic conditions such as overlapping speech, an unknown number of speakers, far-field signals in large conference rooms, noise and reverberation.

However, the lack of large public corpora of real meeting data has been a major obstacle to advancement of the field. Since meeting transcription involves numerous related processing components, extensive information has to be carefully collected and labelled, such as speaker identity, speech context, onset/offset times, etc. All of this information requires precise and accurate annotation, which is expensive and time-consuming. Although several relevant datasets have been released, most of them suffer from various limitations, ranging from corpus setup (corpus size, number of speakers, variety of spatial locations relative to the microphone arrays, collection conditions, etc.) to corpus content (recording quality, accented speech, speaking style, etc.). Moreover, almost all publicly available meeting corpora are in English, and the differences among languages limit the development of Mandarin meeting transcription.

Therefore, we release the AliMeeting corpus, which consists of 120 hours of real recorded Mandarin meeting data, including far-field data collected by an 8-channel microphone array as well as near-field data collected by each participant's headset microphone. Moreover, we launch the Multi-channel Multi-party Meeting Transcription Challenge (M2MeT) as an ICASSP 2022 Signal Processing Grand Challenge. The challenge consists of two tracks, namely speaker diarization and multi-speaker ASR. We provide a detailed introduction to the dataset, rules, evaluation methods and baseline systems, aiming to further promote reproducible research in this field. For details, please also refer to the M2MeT paper we have already published: https://arxiv.org/abs/2110.07393

We will provide the code of the baseline systems for speech recognition and speaker diarization in the meeting scenario as a reference. The goal is to simplify the training and evaluation procedures, so that participants can easily and flexibly experiment with and verify neural network-based methods. GitHub link: https://github.com/yufan-aslp/AliMeeting

All teams need to submit a system description paper along with their results on the final test set. The organizers will select papers with high system ranking and technical quality and include them in the ICASSP 2022 Proceedings.

More details: https://www.alibabacloud.com/m2met-alimeeting

Evaluation

The challenge of multi-speaker ASR is to handle overlapped speech and to recognize the content of multiple speakers. In Track 2, the organizers will provide only the Train and Eval sets of AliMeeting and AISHELL4 as constrained training data. The final test set (Test) is the same as in Track 1. Participants are required to transcribe each speaker, but are not required to identify the corresponding speaker for each transcript.

The accuracy of the multi-speaker ASR system in Track 2 is measured by Character Error Rate (CER). For a given hypothesis output, the CER compares the minimum number of character insertions (Ins), substitutions (Subs) and deletions (Del) required to obtain the reference transcript against the total number of characters in the reference. Specifically, CER is calculated as: CER = (N_Ins + N_Subs + N_Del) / N_Total * 100%, where N_Ins, N_Subs and N_Del are the numbers of the three error types, and N_Total is the total number of reference characters.
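As a concrete illustration, the CER defined above can be computed with a standard character-level Levenshtein (edit-distance) dynamic program. This is only a minimal sketch for a single reference/hypothesis pair; the official scoring scripts in the baseline repository should be used for actual submissions.

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: (ins + subs + del) / len(ref), via Levenshtein distance."""
    m, n = len(ref), len(hyp)
    # dp[i][j] = minimum number of edits to turn hyp[:j] into ref[:i]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # all deletions
    for j in range(n + 1):
        dp[0][j] = j          # all insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[m][n] / max(m, 1)

# One deletion plus one substitution against a 6-character reference: 2/6.
print(f"{cer('今天天气很好', '今天气不好') * 100:.1f}%")  # prints 33.3%
```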

Considering the permutation invariant training (PIT) problem, we propose two schemes to calculate the CER of overlapping speech.

First, we sort the reference labels according to the start time of each utterance and join the utterances with a token; we call this the utterance-based first-in first-out (FIFO) method.

The second method is speaker based: utterances from the same speaker are combined into one reference string, and we then calculate the CER over all possible concatenation patterns of the speakers' references, keeping the best score.
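A minimal sketch of the speaker-based scheme might look like the following. The data layout here (one combined reference string per speaker, one serialized hypothesis string) and the function names are assumptions for illustration, not the official implementation:

```python
from itertools import permutations


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (two-row dynamic program)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution / match
        prev = cur
    return prev[-1]


def speaker_based_cer(speaker_refs, hyp):
    """Score the hypothesis against every concatenation order of the
    per-speaker reference strings and keep the lowest CER (PIT-style)."""
    total = sum(len(r) for r in speaker_refs)
    return min(edit_distance("".join(perm), hyp)
               for perm in permutations(speaker_refs)) / total


# The ordering ("再见", "你好") matches the hypothesis exactly, so CER = 0.
print(speaker_based_cer(["你好", "再见"], "再见你好"))  # prints 0.0
```

Note that with S speakers this enumerates S! concatenation orders, which is feasible only because meeting sessions involve a small number of speakers.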


Timeline (AOE Time)

November 17, 2021 : Registration deadline

November 19, 2021 : Train and Eval data release

January 13, 2022 : Test data release

January 17, 2022 : Final results submission deadline

January 24, 2022 : System description paper submission deadline

January 31, 2022 : Evaluation result and ranking release

February 10, 2022 : ICASSP2022 Grand Challenge paper acceptance

February 17, 2022 : Camera-ready paper submission deadline



Leaderboard

#  Username  Score
1  halsay    16.08
2  csf       17.48
3  shenchen  19.25