New frontiers for zero-shot Image Captioning Evaluation Forum

> Evaluation metrics

Hello. We found that there is a big difference between the local scores and the online scores on the val set (BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR, ROUGE-L, CIDEr, SPICE). Could you please open-source the competition's evaluation code?

Posted by: jinx @ Feb. 18, 2023, 12:55 p.m.

Hi,

The evaluation code uses the "pycocoevalcap" library to compute the metrics.

For example, we use this line to import the implemented BLEU metric:
from pycocoevalcap.bleu.bleu import Bleu

and we multiply each metric by 100 for the final score.
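For reference, a minimal sketch of that computation on toy data (the real evaluation runs over the full prediction/ground-truth pairs):

from pycocoevalcap.bleu.bleu import Bleu

# Toy example only: each dict maps an image id to a list of caption strings.
gts = {"0": ["a dog runs on the beach"]}
preds = {"0": ["a dog running on a beach"]}

# Bleu(4) returns [Bleu_1, Bleu_2, Bleu_3, Bleu_4]; each is reported x100.
scores, _ = Bleu(4).compute_score(gts, preds)
for n, s in enumerate(scores, start=1):
    print(f"Bleu_{n}: {s * 100:.2f}")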

Posted by: p.ahn @ Feb. 27, 2023, 5:12 a.m.

Hi, I use the "pycocoevalcap" metrics too, but the submitted scores are much lower than the local scores.

Posted by: jinx @ Feb. 27, 2023, 5:20 a.m.

I also have a question: the val set includes "id", "gt", and "category", so does the test set include "category" information as well?

Posted by: jinx @ Feb. 27, 2023, 5:25 a.m.

Hi, please check the format of your submission in case you get a significantly lower score than expected.

We do not process the scores computed by the pycocoevalcap library any further.

Also, the category information is only used for computing statistics in the extended analysis of the results; it does not affect the final score.
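As a quick sanity check of the submission format, here is a minimal sketch (assuming a submission CSV with the "public_id" and "caption" columns used by the sample evaluation code below; the file name is just a placeholder):

import pandas as pd

# Placeholder file name: replace with your own submission CSV.
pred_df = pd.read_csv("pred.csv")

# The evaluation merges on "public_id" and reads "caption", so both columns
# must exist, ids must be unique, and no caption may be empty.
assert {"public_id", "caption"} <= set(pred_df.columns), "missing required columns"
assert pred_df["public_id"].is_unique, "duplicate public_id rows"
assert pred_df["caption"].notna().all(), "empty captions found"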

Posted by: p.ahn @ Feb. 28, 2023, 1:26 a.m.

Same problem here.

Posted by: TXHmercury @ March 3, 2023, 1:52 p.m.

The online metrics are much lower than the offline ones, and we have checked the CSV format.

Posted by: TXHmercury @ March 3, 2023, 1:53 p.m.

Hi, here is a sample of our evaluation code.
@TXHmercury, if you run this code with your last submission CSV and our validation GT CSV, you will see the same result as the leaderboard.

import pandas as pd
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.spice.spice import Spice

pred_df = pd.read_csv("pred.csv")
gt_df = pd.read_csv("nice-val-5k.csv")

# Align predictions and ground truth by image id.
combined_df = pred_df.merge(gt_df, on="public_id")

# Each scorer expects dicts mapping id -> list of caption strings.
dict_gt = {}
dict_pred = {}
for idx, row in combined_df.iterrows():
    dict_gt[str(row["public_id"])] = [row["caption_gt"]]
    dict_pred[str(row["public_id"])] = [row["caption"]]

# Bleu(4) returns a list [Bleu_1, Bleu_2, Bleu_3, Bleu_4].
score, _ = Bleu(4).compute_score(dict_gt, dict_pred)
idx = 1
for s in score:
    print(f'Bleu_{idx} : {s*100}')
    idx += 1

score, _ = Rouge().compute_score(dict_gt, dict_pred)
print(f'Rouge : {score*100}')

score, _ = Cider().compute_score(dict_gt, dict_pred)
print(f'Cider : {score*100}')

score, _ = Meteor().compute_score(dict_gt, dict_pred)
print(f'Meteor : {score*100}')

score, _ = Spice().compute_score(dict_gt, dict_pred)
print(f'Spice : {score*100}')

Posted by: syun.kim @ March 8, 2023, 4:56 a.m.

Thanks for the evaluation code. I notice that the official COCOEvalCap code (eval.py) contains a 'tokenize' step, which lowercases the predicted and ground-truth sentences and removes punctuation, and which is missing from your evaluation code. This may not be reasonable, because I have found that the ground-truth file of the validation set randomly contains uppercase letters (not only at the start of a sentence). This will make 'i eat an apple' and 'i eat an Apple' not match. I suggest that you either lowercase all letters in the test ground-truth file or add the tokenizer step to your evaluation code. Thanks very much!
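To make the mismatch concrete, a minimal sketch on the toy sentences above (pycocoevalcap's BLEU splits on whitespace and compares tokens case-sensitively):

from pycocoevalcap.bleu.bleu import Bleu

# A single uppercase letter breaks the n-gram match.
gt = {"0": ["i eat an Apple"]}
pred = {"0": ["i eat an apple"]}

scores, _ = Bleu(4).compute_score(gt, pred)
print(scores)  # Bleu_1 < 1.0 because "Apple" != "apple"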

Posted by: TXHmercury @ March 12, 2023, 12:34 p.m.

I agree as well: my CIDEr score is 0.804 on my local system when I apply the text preprocessing the user above mentioned!

Posted by: danielchoi @ March 16, 2023, 1:54 a.m.

Hi,

We have decided to add a tokenizing step to the evaluation code in order to prevent further confusion.

The change in the evaluation code has been announced by email to every registered participant.
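For anyone adapting the sample code above, a rough sketch of adding such a tokenizing step (simplified for illustration, not the exact official code; PTBTokenizer wraps the Stanford CoreNLP tokenizer, so Java must be installed):

from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer

# Toy data only; in practice dict_gt and dict_pred come from the merged CSVs
# as in the sample code above.
dict_gt = {"0": ["I eat an Apple."]}
dict_pred = {"0": ["i eat an apple"]}

# PTBTokenizer expects id -> list of {"caption": ...} dicts and returns
# id -> list of tokenized strings (lowercased, punctuation removed).
tokenizer = PTBTokenizer()
dict_gt = tokenizer.tokenize({k: [{"caption": c} for c in v] for k, v in dict_gt.items()})
dict_pred = tokenizer.tokenize({k: [{"caption": c} for c in v] for k, v in dict_pred.items()})

# After tokenizing both dicts, the scorers are called exactly as before,
# e.g. Bleu(4).compute_score(dict_gt, dict_pred).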

Thank you.

Posted by: p.ahn @ March 22, 2023, 4:54 a.m.

I apologize, but would it be possible for me to receive the updated evaluation code? I may not have received the email due to a late registration on my part.

Posted by: tkdgur658 @ April 6, 2023, 1:15 p.m.

We never shared the whole evaluation code with the participants, except for the example in this thread.

Our evaluation code is quite similar to that of other benchmarks that use the pycocoevalcap library.

Thank you.

Posted by: p.ahn @ April 7, 2023, 6:41 a.m.