NTIRE 2025 Real-World Face Restoration Forum


> Questions about evaluation metrics

“Evaluation
The evaluation consists of the comparison of the face image before and after restoration. To comprehensively assess the results, we employ 6 non-reference image quality metrics, including NIQE, Clip-IQA, ManIQA, MUSIQ, FID (taking FFHQ as reference), and Q-Align. The score will be calculated by the following steps:

1. Given the low-quality image, calculate the increased ratio of the predicted image.
2. Calculate the weighted average value across the validation set or test set as the validation score or final score.”

Dear Organizers,

I have some questions regarding the evaluation method described:

1. For "Given the low-quality image, calculate the increased ratio of the predicted image", does this mean that, for example, using CLIP-IQA, the calculation follows this formula:

(CLIP-IQA(prediction) − CLIP-IQA(LQInput)) / CLIP-IQA(LQInput)

or is there another approach used to compute the increased ratio? (A minimal sketch of the formula I have in mind is included after question 2 below.)

2. For "Calculate the weighted average value across the validation set or test set as the validation score or final score", does this mean that each image in the validation/test set has a specific weight when computing the final score? If so, could you kindly clarify how these weights are determined?

I would really appreciate your guidance on these points. Thank you!

Posted by: changxin.zhou @ Feb. 25, 2025, 2:07 a.m.

Thank you very much for your insightful questions!

Regarding the evaluation metrics for the competition, participants are encouraged to optimize their models with the goal of improving the following six metrics: NIQE, Clip-IQA, ManIQA, MUSIQ, Q-Align, and FID. These metrics will be used to assess the performance of the restored images.

To clarify, each image in the test set contributes equally to the final score; there are no image-specific weights. The final score is the average, over all images in the set, of a weighted combination of the per-metric scores. The weights for each metric (NIQE, Clip-IQA, ManIQA, MUSIQ, Q-Align, FID) will be announced during the final test data release period. Below is a sample implementation showing how the final score is computed:

```python
# a..f are the pre-announced per-metric weights (each may be positive or negative).
def weighted_average(img):
    return (a * NIQE(img) + b * ClipIQA(img) + c * ManIQA(img)
            + d * MUSIQ(img) + e * Qalign(img) + f * FID(img))

def final_score(restored_images):
    # Submissions must first pass the AdaFace identity-consistency check.
    assert adaface_identity_consistency_test(restored_images)
    total_score = 0
    n = 450  # total number of test images across all sub-datasets
    for sub_dataset in restored_images:
        for image in sub_dataset:
            total_score += weighted_average(image)
    return total_score / n
```

Here, **a**, **b**, **c**, **d**, **e**, and **f** represent the pre-announced weights, which may be either positive or negative.

I hope this clarifies your questions.

Posted by: jkwang @ Feb. 25, 2025, 5:08 a.m.

Dear Organizers,
Could you kindly confirm if the weights a, b, c, d, e, and f for the evaluation metrics can be provided?
Thank you for your assistance!

Posted by: MakkaBazi @ Feb. 27, 2025, 6:08 a.m.

Thanks!
The weights for each metric (NIQE, Clip-IQA, ManIQA, MUSIQ, Q-Align, FID), i.e., a, b, c, d, e, f, will be announced during the final test data release period.

Posted by: jkwang @ Feb. 27, 2025, 6:11 a.m.

Dear Organizers, may I know whether the image format used in the metric calculation is BGR or RGB?

Posted by: changxin.zhou @ Feb. 28, 2025, 3:12 a.m.

It should be RGB. We will update the CodaLab evaluation code soon.
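
For reference, a minimal sketch of converting an image loaded with OpenCV (which reads in BGR channel order) to RGB before metric calculation; the OpenCV loading here is just an assumption for illustration, not the official evaluation code:

```python
import cv2  # OpenCV loads images in BGR channel order

def load_rgb(path):
    """Load an image from disk and return it in RGB order for metric calculation."""
    bgr = cv2.imread(path, cv2.IMREAD_COLOR)
    return cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
```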

Posted by: jkwang @ Feb. 28, 2025, 3:29 a.m.

The evaluation code has been updated to RGB. Apologies for any inconvenience, and thanks for pointing out this problem.

Posted by: jkwang @ Feb. 28, 2025, 7:29 a.m.

Hi, can you share the code of adaface_identity_consistency_test()?

Posted by: duanxiongwa @ March 3, 2025, 3:10 a.m.