PRINCE Out-of-distribution Generalization Challenge @ ECML-PKDD Forum


> Clarification on final evaluation

Hi, I wanted to know about the final score calculation. Was my best submission considered in the final evaluation? I am asking because I see a huge deviation in my final score compared to my dev score.

Posted by: ColdWheels @ June 6, 2022, 11:10 a.m.

Same here. I would love to know the score of my best dev submission, but it doesn't show up in the submission list. Thanks!

Posted by: mariofilho @ June 6, 2022, 11:51 a.m.

I also want to know: if there are such huge deviations between the dev and final phase scores and the ranking changes completely, then what is the point of having a dev phase leaderboard? How can one gauge the effectiveness of their solution?

Posted by: ColdWheels @ June 6, 2022, 12:08 p.m.

@ColdWheels: yes, your best phase 1 submission was evaluated in the final phase.

Remember that the unseen domain evaluated in the dev phase is not the same as the one in the final phase - hence it is expected that the scores of the two phases diverge.

Posted by: eustache @ June 6, 2022, 12:59 p.m.

@mariofilho: Codalab automatically migrates your best dev phase submission by cloning it into the final phase, assigning it a new id in the process.

So basically two cases can happen:

1) your best phase 2 submission is one of the 2 allowed submissions in final phase
=> then you know directly which one it is, as it corresponds to a manual submission made after June 1st

2) your best dev phase submission is also your best as evaluated in phase 2
=> then you should have a submission in Final that is a clone of your best Dev submission. Usually looking it up in the UI by filename and date allows you to recognize which Dev submission it corresponds to.

HTH

Posted by: eustache @ June 6, 2022, 1:11 p.m.

Ah - one thing that is maybe not straightforward in the UI is that there are 2 lists of submissions: one for Dev and one for Final.
So make sure to click on the Dev or Final phase in the "Participate" > "Submit / View Results" menu to see your submission list for each phase.

Posted by: eustache @ June 6, 2022, 1:15 p.m.

Thanks for clarifying. A few questions:

> 1) your best phase 2 submission is one of the 2 allowed submissions in final phase

Aren't 3 submissions allowed? 2 additional plus the best? The sub page says: "Your best submission from the previous phase is automatically cloned and used to compute the final score on the full test set. Additionally, you can submit up to 2 additional submissions that you choose during that phase." So I assumed I could send two more and still have my best evaluated.

> 2) your best dev phase submission is also your best as evaluated in phase 2
> => then you should have a submission in Final that is a clone of your best Dev submission. Usually looking it up in the UI by filename and date allows you to recognize which Dev submission it corresponds to.

Yes, I don't see the filename "submission-2022-05-22_15-41-39.851554.zip", which was my best Dev submission. Can you check it, please?

Posted by: mariofilho @ June 6, 2022, 1:18 p.m.

@mariofilho:

1) Indeed, every participant has 1 auto-migrated submission + up to 2 manual submissions in Final, for a total of 1 to 3 evaluated submissions.

2) We did check it; I'm sending you the details by e-mail.

Posted by: eustache @ June 6, 2022, 2:16 p.m.

Thanks for the info. I would really like to know how this metric is helpful for building models in the dev phase. If the final phase data is expected to have a distribution shift, what is the point of evaluating the models from the dev phase with this metric? Maybe some other, more robust metric could give an idea of the performance on the final data during the dev phase. Otherwise, in this case I somehow feel we squandered our time in the dev phase.

Posted by: ColdWheels @ June 7, 2022, 2:50 a.m.

@ColdWheels:

Participants had access to 3 labeled training domains, with the goal of building the most generalizable model. In the Dev phase everyone could test their model against a new, unseen domain, with access only to loss feedback, not labels.

The idea was not to overfit to the Dev phase domain but rather to check whether the model was generalizing.

To avoid this overfitting and stay within the OOD paradigm (= you have never seen the domain, nor have you had any prior feedback on it, e.g. in the form of a loss), the Final phase domain is a new, unseen domain, different from the Dev one. So indeed, choosing the best model wrt its Dev phase loss is perhaps not the best model selection strategy (i.e. chasing 1st place on the Dev leaderboard was not necessarily a good idea). Remember also that we warned in the documentation that model selection would probably be a big part of the challenge.

Nonetheless, at least some participants seem to have attained this goal, as the winning solution is the best against the Dev, Final and Robustness domains! It is hard to believe that this is pure luck... I guess we are all curious about their model selection strategy (and solution in general) and can't wait for their presentation at ECML to learn more :)
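
To give an idea of what a "model selection strategy" could look like in practice, here is a rough, purely illustrative sketch (not something the challenge prescribed): a leave-one-domain-out loop over the 3 labeled training domains, keeping the configuration whose average held-out-domain loss is lowest rather than the one that tops the Dev leaderboard. The toy Ridge model and random data below only stand in for your own pipeline, loss and training data.

```python
# Illustrative sketch only, not part of the official kit: leave-one-domain-out
# model selection. A toy Ridge regression and synthetic data make the snippet
# self-contained; substitute your own model, loss and training domains.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def lodo_score(alpha, data_by_domain):
    """Average loss over each training domain when that domain is held out."""
    losses = []
    for held_out, (X_te, y_te) in data_by_domain.items():
        # Train on the other domains, evaluate on the held-out one.
        X_tr = np.vstack([X for d, (X, _) in data_by_domain.items() if d != held_out])
        y_tr = np.concatenate([y for d, (_, y) in data_by_domain.items() if d != held_out])
        model = Ridge(alpha=alpha).fit(X_tr, y_tr)
        losses.append(mean_squared_error(y_te, model.predict(X_te)))
    return float(np.mean(losses))

# Stand-in for the 3 labeled training domains: {"domain": (X, y)}.
rng = np.random.default_rng(0)
data_by_domain = {d: (rng.normal(size=(50, 5)), rng.normal(size=50)) for d in ("d1", "d2", "d3")}

# Keep the hyper-parameter whose held-out-domain loss is lowest,
# rather than the one that scores best on the Dev leaderboard.
best_alpha = min([0.1, 1.0, 10.0], key=lambda a: lodo_score(a, data_by_domain))
print(best_alpha)
```

With only 3 domains this estimate is of course noisy, but it at least rewards configurations that transfer across domains instead of ones tuned to a single held-out set.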

Posted by: eustache @ June 7, 2022, 7:35 a.m.