Dear Organizer,
It is possible to overfit the challenge score by selecting the best submission for each ROI. The issue stems from the fact that 495 per-ROI scores on the test set can be obtained by clicking `Download output from scoring step`.
I ran an offline submission simulation to demonstrate this issue; please check out the demo at https://gist.github.com/huzeyann/480959f28456e36f3425b5adf94d0a4b
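To make the failure mode concrete without opening the notebook, here is a toy sketch (my illustration with assumed numbers, not the notebook itself): when the 495 per-ROI scores are noisy estimates of a submission's true quality, picking the best submission per ROI inflates the combined score purely by selecting favorable noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subs, n_rois = 8, 495                         # 495 per-ROI scores are exposed

true_quality = rng.normal(55.0, 1.0, size=n_subs)        # true score per model
roi_noise = rng.normal(0.0, 2.0, size=(n_subs, n_rois))  # per-ROI sampling noise
roi_scores = true_quality[:, None] + roi_noise           # observed per-ROI scores

best_single = roi_scores.mean(axis=1).max()    # honest best whole-brain score
cherry_picked = roi_scores.max(axis=0).mean()  # best submission chosen per ROI

print(f"best single submission: {best_single:.2f}")
print(f"per-ROI cherry-picking: {cherry_picked:.2f}")    # inflated by selection
```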
Proposed fixes:
## Plan A (The perfect fix)
- Keep showing per-ROI scores.
- Add another test set that is not used for the leaderboard.
- Compute the score on the new test set only once.
- The new test set can be the NSD synthetic images, which are still withheld.
## Plan B (The big fix)
- Stop showing per-ROI scores.
- Keep the current metric.
- Renew the whole dataset with a new pre-processing or denoising pipeline.
## Plan C (The dirty fix)
- Stop showing per-ROI scores.
- Keep the current data.
- Find another metric that is not highly correlated with the current `leaked` metric (a sketch of such a check follows).
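For Plan C, a candidate metric could be vetted by measuring how strongly it correlates with the current one across a set of submissions; a high correlation means the leaked per-ROI scores would still transfer. A minimal sketch with stand-in metrics and synthetic data (none of this is challenge code, and the stand-ins are not the actual challenge metric):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
truth = rng.normal(size=(100, 1000))            # (stimuli, voxels), synthetic

def metric_current(pred, truth):
    """Stand-in for the current metric: mean voxelwise Pearson r."""
    p = pred - pred.mean(0)
    t = truth - truth.mean(0)
    r = (p * t).sum(0) / (np.linalg.norm(p, axis=0) * np.linalg.norm(t, axis=0))
    return r.mean()

def metric_candidate(pred, truth):
    """Stand-in candidate: negative mean squared error."""
    return -((pred - truth) ** 2).mean()

# 20 fake submissions of varying quality
subs = [truth + rng.normal(scale=s, size=truth.shape)
        for s in np.linspace(0.5, 2.0, 20)]

a = [metric_current(p, truth) for p in subs]
b = [metric_candidate(p, truth) for p in subs]
print(f"metric correlation across submissions: {pearsonr(a, b)[0]:.2f}")
# A candidate correlating near 1 (as these two do) would still leak;
# Plan C needs one with a much lower correlation.
```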
Hi Huze, thank you for pointing out the risk of overfitting, and for taking the time to develop the notebook with the simulation! Here are our thoughts on this.
(1) We made the choice as a team to provide ROI-level test-result feedback for your benefit, knowing it may contribute to potential leakage, but we believe providing this feedback is better for the participants and for this challenge endeavor as a whole.
(2) We limited the number of daily and total challenge submissions to make overfitting more difficult in practice.
(3) If one wants to overfit the test data with their submissions, they should also be prepared to defend their scientific contribution at CCN and in their written report. The spirit of the challenge is to understand the brain, so while one may be able to overfit the challenge score, we look forward to reading the reports and hearing the CCN talks that describe scientifically meaningful approaches.
(4) Overall, while overfitting the challenge score may be possible, we believe it is in everyone’s best interest not to make changes to the challenge, especially considering that, according to your simulations, overfitting to the test data by exploiting the 495 (≈ 10^2.69) per-ROI scores results in small gains in prediction accuracy (<= 1%).
- Algonauts Organizers
I made a script to exploit this issue
https://gist.github.com/huzeyann/b5e2f69d4d07cbbe37accc3992bd30bb
The strategy is:
1. Initialize the score of every voxel to zero.
2. For each small ROI, fill its voxels with that ROI's score.
3. For each stream, fill the voxels not covered by a small ROI with the stream's score.
4. Fill the remaining voxels with the global score.
5. Choose the best submission for each voxel (sketched below).
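Here is a minimal sketch of steps 1–5, assuming boolean voxel masks per ROI and the per-ROI score tables downloaded from the scoring step (the names `small_roi_masks`, `stream_masks`, and the array shapes are illustrative, not my script's actual API):

```python
import numpy as np

def estimate_voxel_scores(roi_scores, small_roi_masks, stream_masks,
                          global_score, n_voxels):
    """Spread one submission's ROI-level scores onto voxels, most specific first."""
    v = np.zeros(n_voxels)                      # 1. initialize every voxel to zero
    covered = np.zeros(n_voxels, dtype=bool)
    for name, mask in small_roi_masks.items():  # 2. fill small-ROI voxels
        v[mask] = roi_scores[name]
        covered |= mask
    for name, mask in stream_masks.items():     # 3. streams, outside small ROIs
        v[mask & ~covered] = roi_scores[name]
    for mask in stream_masks.values():
        covered |= mask
    v[~covered] = global_score                  # 4. the rest gets the global score
    return v

def combine_submissions(predictions, voxel_scores):
    """5. Per voxel, keep the prediction of the highest-scoring submission.

    predictions:  (n_subs, n_stimuli, n_voxels)
    voxel_scores: (n_subs, n_voxels), one estimate_voxel_scores() row per submission
    """
    best = voxel_scores.argmax(axis=0)          # winning submission per voxel
    return np.take_along_axis(predictions, best[None, None, :], axis=0)[0]
```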
Using this script, I combined my first 4 submissions:
`model zoo: [55.64305289 55.7769935 56.12478256 54.74309283]`
The new submission score is 57.2101462142, a gain of roughly 1.1 points over the best single submission.
My best guess for the large gain: 1. unlike my 70-model simulation, where models differed only by minor hyperparameter changes, these 4 models were trained with major hyperparameter differences; 2. unlike the simulated non-overlapping random ROIs, the overlaps between small ROIs and streams amplify the overfitting.
The exact potential of this overfitting is still to be discovered, which would require a large number of submissions. Organizers, please consider increasing the daily submission limit to 250 and banning teams with multiple accounts.
Besides, there may be a better strategy for filling per-voxel scores than my script. To make this a fair game, can teams please share their strategies? Teams are welcome to use and modify my script.
Posted by: huze @ June 2, 2023, 8:25 a.m.
Update: I just combined all my 8 submissions. The script output is:
```
model zoo: [55.64305289 55.7769935 56.12478256 54.74309283 57.21014621 57.4574949
55.78534827 57.26675688]
best model zoo: 57.45749489666356
estimated overfit submission score:
mean: 60.167309410907244
median: 60.630195417481424
```
The real submission score is 58.2671: lower than the estimated overfit score, but still about 0.8 points above the best single submission (57.4575).
Hi Huze,
Thank you again very much for taking the time to run the simulations and for pointing out this issue! We are well aware of the possibility of overfitting to the test data, which is why we decided to limit the maximum number of submissions, to forbid users from having multiple Challenge accounts, and to make it mandatory for winners to share their code: we believe that in this way the risk of users trying to cheat is greatly reduced.
The Algonauts Team
Posted by: giffordale95 @ June 5, 2023, 10:19 a.m.