"New Results on Phase 2 Test Set: In terms of results on the new test set, our internal baseline testing of Phase 1 metric provides 0.6315. This is comparable to Phase 1 test set’s 0.6256. However, in terms of the Phase 2 patient-wise metric, the new test set produces a score of 0.6943. This is significantly less compared to the old test set’s 0.7514. The difference is because of 167 patients in the new test set compared to 40 in the old testset."
Is it possible to release the CSV file? We want to use it to compare with our own predictions and make sure that our model isn't wildly underperforming on Phase 2.
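For reference, a comparison like the one asked about here could be run with a few lines of Python once a baseline CSV is available. The file paths and the `id`/`prediction` column names below are hypothetical; adjust them to the actual file layout.

```python
import csv

def load_predictions(path, id_col="id", pred_col="prediction"):
    """Read a predictions CSV into a {sample id: prediction} dict.
    Column names are assumptions, not the challenge's actual schema."""
    with open(path, newline="") as f:
        return {row[id_col]: row[pred_col] for row in csv.DictReader(f)}

def agreement(ours, baseline):
    """Fraction of shared ids on which two prediction files agree,
    as a rough sanity check against wild underperformance."""
    shared = ours.keys() & baseline.keys()
    if not shared:
        return 0.0
    return sum(ours[k] == baseline[k] for k in shared) / len(shared)
```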
Posted by: Lancelot53 @ Aug. 31, 2023, 6:52 a.m.
We will not be releasing the Phase 1 test labels. However, we have extended CodaLab evaluations for Phase 1 until Sept. 3rd EOD. Please keep using this to evaluate your model. Additionally, the evaluation script for Phase 2 should be available on GitHub.
Posted by: OLIVES @ Aug. 31, 2023, 1:41 p.m.
I was actually referring to the output of your baseline model on the Phase 2 dataset, similar to how you released baseline_result.csv for Phase 1. Would that be possible?
Posted by: Lancelot53 @ Aug. 31, 2023, 1:45 p.m.
Thank you. We are choosing not to release the baseline CSV file for test set 2, since it might make it possible to infer patterns regarding personalizability.
However, the baseline model that produced our initial results is the same model we trained in Phase 1, and its training regimen is currently up on GitHub. Participants can still use this to train the model on the training set. As a sanity check, we suggest first evaluating it on the CodaLab Phase 1 test set to confirm you get similar results (~0.62).
Posted by: OLIVES @ Aug. 31, 2023, 3:03 p.m.
I understand. However, I tried running the repo without any modification beyond setting [--epoch 100 --model "resnet50"], and I am consistently scoring around 0.58 on the Phase 1 leaderboard. Is this expected?
Posted by: Lancelot53 @ Aug. 31, 2023, 6:47 p.m.
Yes, this is reasonable. Note that you are seeing results on only 70% of the data; the remaining 30% is under closed validation (and has generally tended higher). Moreover, in our experience, architectural choice creates a difference of roughly 0.05 (at least until results approach the 0.65-0.7 range).
Posted by: OLIVES @ Aug. 31, 2023, 8:31 p.m.