NTIRE 2025 Challenge on Video Quality Enhancement for Video Conferencing

Organized by jainvarun


 

Figure 2: Ground truth from (top) our synthetics framework and (bottom) our AutoAdjust solution. The input has suboptimal foreground illumination, which is fixed by adding a studio light setup in front of the subject; this setup is simulated in the synthetic data and predicted via global changes in the real data.

 

Introduction

Design a Video Quality Enhancement (VQE) model to enhance video quality in video conferencing scenarios by (a) improving lighting, (b) enhancing colors, (c) reducing noise, and (d) enhancing sharpness in video calls – giving a professional studio-like effect.

We provide participants with a differentiable Video Quality Assessment (VQA) model, along with training and test videos. Participants submit enhanced videos, which are then evaluated in our crowdsourced framework.

Our main website can be found at: https://www.microsoft.com/en-us/research/academic-program/ntire-2025-vqe

 

Motivation

Light is a crucial component of visual expression and key to controlling texture, appearance, and composition. Professional photographers often have sophisticated studio-lights and reflectors to illuminate their subjects such that the true visual cues are expressed and captured. Similarly, tech-savvy users with modern desk setups employ a sophisticated combination of key and fill lights to give themselves control over their illumination and shadow characteristics.

On the other hand, many users are constrained by their physical environment which may lead to poor positioning of ambient lighting or lack thereof. It is also commonplace to encounter flares, scattering, and specular reflections that may come from windows or mirror-like surfaces. Problems can be compounded by poor-quality cameras that may introduce sensor noise. This leads to poor visual experience during video calls and may have a negative impact on downstream tasks such as face detection and segmentation.

The current production light-correction solution in Microsoft Teams, called AutoAdjust, finds a global mapping of input to output colors that is updated sporadically. Since this mapping is global, the method is sometimes unable to find a correction that works well for both foreground and background. A better approach may be Image Relighting, which only performs local correction in the foreground and gives users the option to dim their background – creating a pop-out effect. A possible side effect of local correction is the reduction of local contrast, which often serves as a proxy to convey depth in 2D images – thereby making people appear dull in some cases.

 

Registration

Participants are required to register on the CodaLab website. The email used during registration will be used to add participants to our Slack workspace, which will be the default mode of day-to-day communication and where participants will submit their videos for subjective evaluation. For objective evaluation, please make sure your submission on the CodaLab website contains the correct (a) team name, (b) names, emails, and affiliations of your team members, and (c) team captain.

Please reach out to the challenge organizers at jain.varun@microsoft.com if you need assistance with registration.

 

Awards & Paper Submission

Top-ranking participants will receive a winner certificate. They will also be invited to submit their paper to NTIRE 2025 and participate in the challenge report – both of which will be included in the archived proceedings of the NTIRE workshop at CVPR 2025.

 

Citation

If you use our method, data, or code in your research, please cite:

@inproceedings{ntire2025vqe,
  title={{NTIRE} 2025 Challenge on Video Quality Enhancement for Video Conferencing: Datasets, Methods and Results},
  author={Varun Jain and Zongwei Wu and Quan Zou and Louis Florentin and Henrik Turbell and Sandeep Siddhartha and Radu Timofte and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  year={2025}
}
 

Subjective Results

We received 5 complete submissions for both the mid-point and final evaluations. For each team’s submission, we utilized our crowdsourced framework to evaluate their 3,000-video test set. This involved presenting human raters with 270K side-by-side video comparisons. Raters were asked to provide a preference rating on a scale of 1 to 5, where 1 and 5 represent strong preference for the left and right video respectively, and 2 and 4 represent weak preference. A rating of 3 indicates no preference. Furthermore, raters were prompted to specify if their decision was primarily influenced by (a) colors, (b) image brightness, or (c) skin tone.

Here are the Bradley–Terry scores for each team that maximize the likelihood of the observed P.910 voting:


Figure 5: Interval plots illustrating the mean P.910 Bradley–Terry scores and their corresponding 95% confidence intervals for the 5 submissions, input videos, and the provided baseline. (Top) Overall preference, and (bottom) factors influencing preference.

 

Timeline


Figure 4: Timeline for the challenge.

 
 

Problem Statement


Figure 1: P.910 study indicates that people prefer AutoAdjust over auto-corrected images (L), which are both preferred over the Image Relighting approach.

We ran three P.910 [2] studies totaling ~350,000 pairwise comparisons that measured people’s preference for AutoAdjust (A) and Image Relighting (R*) over No effect (N) and images auto-corrected using Lightroom (L). We used the Bradley–Terry model [1] to compute per-method scores and observed that people preferred our current AutoAdjust more than any other method in all three studies.

To take the next step towards achieving studio-grade video quality, one would need to (a) understand what people prefer and construct a differentiable Video Quality Assessment (VQA) metric, and (b) be able to train a Video Quality Enhancement (VQE) model that optimizes this metric. To solve the first problem, we used the abovementioned P.910 data and trained a VQA model that, given a pair of videos x1 and x2, gives the probability of x1 being better than x2. Given a test set, this information can be used to construct a ranking order of a given set of methods.
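
For intuition, here is a minimal sketch of how Bradley–Terry scores can be fitted from pairwise preference counts using the standard minorization–maximization update. The win matrix below is made up for illustration and is not our actual P.910 data; in practice, the 1–5 ratings are first converted into pairwise wins.

import numpy as np

def bradley_terry(wins, n_iter=1000, tol=1e-9):
    # wins[i, j] = number of times method i was preferred over method j
    n = wins + wins.T                        # total comparisons per pair
    w = wins.sum(axis=1)                     # total wins per method
    p = np.ones(wins.shape[0])
    for _ in range(n_iter):
        denom = n / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)
        p_new = w / denom.sum(axis=1)
        p_new /= p_new.sum()                 # fix the scale (scores are relative)
        if np.abs(p_new - p).max() < tol:
            return p_new
        p = p_new
    return p

# Illustrative win counts for three methods, e.g. No effect (N), Lightroom (L), AutoAdjust (A)
wins = np.array([[0., 40., 20.],
                 [60., 0., 35.],
                 [80., 65., 0.]])
print(bradley_terry(wins))                   # higher score = more preferred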

We would now like to invite researchers to participate in a challenge aimed at developing Neural Processing Unit (NPU) friendly VQE models that leverage our trained VQA model to improve video quality.
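
As a rough illustration of how the differentiable VQA model can drive training, the sketch below treats it as a frozen pairwise judge and maximizes the predicted probability that the enhanced output is preferred over its own input. The TinyVQE module and the vqa_model(enhanced, frames) interface are placeholders; the actual model, I/O format, and recommended training setup are defined in the starter code.

import torch
import torch.nn as nn

class TinyVQE(nn.Module):
    # Toy residual per-frame enhancer; real submissions will be far larger (within the MAC budget).
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 3, 3, padding=1))
    def forward(self, x):                          # x: (B, 3, H, W) in [0, 1]
        return torch.clamp(x + self.net(x), 0.0, 1.0)

def train_step(vqe, vqa_model, frames, opt):
    # vqa_model(a, b) is assumed to return P(a looks better than b) per sample.
    enhanced = vqe(frames)
    p_better = vqa_model(enhanced, frames)
    loss = -torch.log(p_better + 1e-6).mean()      # push outputs to be preferred over inputs
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()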

We look at the following properties of a video to judge its studio-grade quality:

  1. Foreground illumination – the person (all body parts and clothing) should be optimally lit.
  2. Natural colors – correction may make local or global color changes to make videos pleasing.
  3. Temporal noise – correct for image and video encoding artefacts and sensor noise.
  4. Sharpness – to make sure that correction algorithms do not introduce softness, the final image should at least be as sharp as the input.
 

We realize that there may be many other aspects to a good video. For simplicity, we discount all except the ones mentioned above. Specifically, submissions are not judged on:

  1. Egocentric motion – an unstable camera may introduce sweeping motion or small vibrations that we do not aim to correct.
  2. Masking of Background – spatial modification of the background, such as blurring or replacement with minimal changes to the foreground, may improve subjective scores, but we consider these out of domain.
  3. Makeup and beautification – it is commonplace for users to apply beautification filters, such as those found on Instagram and Snapchat, that alter their skin tone and facial features. We do not aim for that aesthetic.
  4. Removal of reflections on glasses and lens flare – although these are a common occurrence in video teleconferencing scenarios, we do not aim to remove reflections cast by screens and other light sources onto users’ glasses, due to the risk of altering the users’ eyes and gaze direction.
  5. Avatars – A solution that synthesizes a photorealistic avatar of the subject and drives it based on the input video may score the highest in terms of noise, illumination and color. If it indeed minimizes the total cost function that takes into account all these factors, it is acceptable.

Solutions that significantly rely on altering properties other than those discussed above will be asked to resubmit. Ensembles of models are allowed. Manual tweaking of hyperparameters for individual videos would lead to disqualification.

 

Baseline Solution & Starter Code

Since AutoAdjust was ranked higher than expert-edited images and Image Relighting methods, we will provide the participants with a baseline solution so that they can reproduce the AutoAdjust feature as currently shipped in Microsoft Teams.

More details can be found at our repository: https://github.com/varunj/cvpr-vqe

 

Compute Constraints

The goal is to have a computationally efficient solution that can be offloaded to the NPU for CoreML inference. We establish a qualifying criterion of CoreML uint8 or fp16 models with at most 20.0×10⁹ MACs/frame for an input resolution of 1280×720. We anticipate such a model to have a per-frame processing time of ~9 ms on an M1 Ultra powered Mac Studio and ~5 ms on an M4 Pro powered Mac Mini for the given input resolution. Submissions not meeting this criterion will not be considered for evaluation.
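
As a quick sanity check before CoreML conversion, per-frame MACs of a candidate PyTorch model can be estimated with an off-the-shelf counter such as thop; the tiny model below is only a placeholder, the budget still has to be verified on the converted CoreML model, and counting conventions may differ slightly between tools.

import torch
import torch.nn as nn
from thop import profile                     # pip install thop

model = nn.Sequential(                       # placeholder for your VQE network
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1),
)
frame = torch.randn(1, 3, 720, 1280)         # one 1280x720 RGB frame
macs, params = profile(model, inputs=(frame,))
print(f"{macs / 1e9:.2f} GMACs/frame (budget: 20.0), {params / 1e6:.2f}M params")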

 

Dataset

We provide 13,000 real videos for training, validation, and testing of VQE methods. The videos are 10s long, encoded at 30 FPS, and amount to a total of 3,900,000 frames. We keep 3,000 (23%) videos for testing and ranking submissions and make 10,000 (77%) available to the teams, who can split them between training and validation sets as they see fit. The teams are also free to use other publicly available open datasets but will have to be mindful of data drift.

In addition to this data, we also provide some paired data for supervised training as shown in Fig 2.

Note that it is possible, and encouraged, to learn a correction that is different from these ground truth labels and achieves a higher MOS score. Hence, these labels should be treated as suggestive improvements and not the global optima.

  1. Real data
    • Of the 13,000 videos, we selected 300 high-quality videos where P.910 voters voted strongly in favor of AutoAdjust. We assume these to be the ground truth.
    • P.910 done on these videos shows a MOS of 3.58 in favor of the target.
  2. Synthetic Portrait Relighting data
    • 1,500 videos for training and 500 videos for testing, each 5s long and encoded at 30 FPS.
    • The source image only has lighting from the HDRI scene. For the target, we add 2 diffuse light sources to simulate a studio lighting setup.
    • P.910 done on these videos shows a MOS of 4.06 in favor of the target indicating that these make for a better target compared to the current baseline AutoAdjust solution.
    • Some examples of these pairs are shown in Fig 2. and more details about the rendering framework can be found at [3].
 

   Dataset                             | Azure Blob Storage Link | Google Drive Link
   train.tar.gz                        | link                    | -
     train_supervised_synthetic.tar.gz | link                    | link
     train_supervised_real.tar.gz      | link                    | link
     train_unsupervised.tar.gz         | link                    | link
   test.tar.gz                         | link                    | -
     test_supervised_synthetic.tar.gz  | link                    | link
     test_unsupervised.tar.gz          | link                    | link

 

Folder Structure for Dataset and Submissions


Figure 3: Folder structure for the provided dataset and the expected submissions.

To ensure efficient evaluation, please organize your submissions according to the folder structure depicted in Fig. 3. Additionally, please strictly adhere to the video coding guidelines detailed within the starter code.

 

Metrics & Evaluating Submissions

Our final goal is to rank submissions based on P.910 scores. We will require the teams to submit their predictions on the 3,000-video real test set. We then compare the submissions relative to the given input as well as against each other. Similar to Fig 1, comparison using the Bradley–Terry model gives us the score for each submission that maximizes the likelihood of the observed P.910 voting. Our current P.910 framework has a throughput of ~210K votes per week.

In case two methods have a statistically insignificant difference in subjective scores, we will use the objective metrics discussed below to break ties.

S_obj = (1/N) · Σ_{i=1..N} VQA(x_i), i.e., the objective metric is the mean of the per-video VQA scores over the N test videos.

Due to the infeasibility of obtaining P.910 scores in real time, teams can use the mean VQA score S_obj given by the provided VQA model, as shown above, for continuous and independent evaluation.

For the 3,000 unsupervised videos, teams are required to submit the per-video VQA score along with the 11 auxiliary scores predicted by the VQA model as shown in Fig 3. For the synthetic test set, teams should report the per-video Root Mean Squared Error (RMSE). These scores will also be published on the leaderboard so that participants can track their progress relative to other teams. However, we do not rank teams based on these objective metrics since it might be possible to learn a correction that is different from and subjectively better than the ground truth provided.
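
For reference, here is a minimal sketch of the two objective quantities described above, under the assumption that per-video VQA scores are already available from the provided model and that frames are loaded as float arrays; the exact file naming and reporting format follow Fig. 3 and the starter code.

import numpy as np

def mean_vqa_score(per_video_scores):
    # S_obj: mean of the per-video VQA scores over the unsupervised test set
    return float(np.mean(per_video_scores))

def video_rmse(pred_frames, gt_frames):
    # Per-video RMSE against the synthetic ground truth
    pred = np.asarray(pred_frames, dtype=np.float64)
    gt = np.asarray(gt_frames, dtype=np.float64)
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

# Illustrative values only; real scores come from the provided VQA model and dataset.
print(mean_vqa_score([0.61, 0.58, 0.72]))
print(video_rmse(np.zeros((5, 720, 1280, 3)), 0.1 * np.ones((5, 720, 1280, 3))))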

 

How to Submit Videos

To submit test videos for subjective evaluation, create a zip archive of all 3,000 real videos. Then, make an entry in this spreadsheet to notify the organizers of your submission. Teams can choose one of the following options:

  1. Upload to your own cloud storage: Upload the zip file to your Google Drive, OneDrive, or other cloud storage service. Grant read and download access to jain.varun@microsoft.com and include the link in the spreadsheet.
  2. Upload to Azure Storage using azcopy: This option allows you to avoid using your own cloud storage. Refer to this link for instructions on installing the azcopy command-line interface (CLI) tool.
    • Upload your file: azcopy cp --check-length=false team_name.zip "https://ic3midata.blob.core.windows.net/cvpr-2025-ntire-vqe/final/?sv=2023-01-03&st=2025-02-19T06%3A25%3A24Z&se=2025-05-01T05%3A25%3A00Z&sr=c&sp=wl&sig=34%2Ff%2BtkD8hYYVD1u4d00m0PSnW%2Fkn5XhOByFqWnhZDA%3D"
    • To check your submission: azcopy ls "https://ic3midata.blob.core.windows.net/cvpr-2025-ntire-vqe?sv=2023-01-03&st=2025-02-19T06%3A25%3A24Z&se=2025-05-01T05%3A25%3A00Z&sr=c&sp=wl&sig=34%2Ff%2BtkD8hYYVD1u4d00m0PSnW%2Fkn5XhOByFqWnhZDA%3D"
 

References

[1] Bradley, Ralph Allan, and Milton E. Terry. “Rank analysis of incomplete block designs: I. The method of paired comparisons.” Biometrika 39, no. 3/4 (1952): 324-345.
[2] Naderi, Babak, and Ross Cutler. “A crowdsourcing approach to video quality assessment.” In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2810-2814. IEEE, 2024.
[3] Hewitt, Charlie, Fatemeh Saleh, Sadegh Aliakbarian, Lohit Petikam, Shideh Rezaeifar, Louis Florentin, Zafiirah Hosenie et al. “Look Ma, no markers: holistic performance capture without the hassle.” ACM Transactions on Graphics (TOG) 43, no. 6 (2024): 1-12.

 

Development

Start: Jan. 28, 2025, 5 a.m. UTC

Description: Train models, submit objective metrics and corrected videos from test set

Final

Start: March 7, 2025, 8 a.m.

Description: Submit model, training scripts, fact sheet, objective metrics and corrected videos from test set

Competition Ends

March 23, 2025, 7 a.m. UTC
