CXR-LT: Multi-Label Long-Tailed Classification on Chest X-Rays Forum


> The val/test labels in the LongTailCXR paper's GitHub are one-hot... not multi-label.

All samples are single-class, and the code uses softmax rather than sigmoid, i.e., a multi-class rather than a multi-label setting.
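For clarity, the distinction I mean looks roughly like this (a minimal PyTorch sketch, not the repo's actual code; the class count is made up):

```python
import torch
import torch.nn as nn

num_classes = 20                       # illustrative only
logits = torch.randn(4, num_classes)   # a batch of 4 images

# Multi-class (what the LongTailCXR code does): classes are mutually
# exclusive, probabilities sum to 1, and CrossEntropyLoss is used.
probs_mc = torch.softmax(logits, dim=1)
loss_mc = nn.CrossEntropyLoss()(logits, torch.tensor([0, 3, 7, 1]))

# Multi-label (what chest X-rays need): each finding gets its own
# independent probability, so sigmoid + BCEWithLogitsLoss is used.
probs_ml = torch.sigmoid(logits)
targets = torch.randint(0, 2, (4, num_classes)).float()  # multi-hot labels
loss_ml = nn.BCEWithLogitsLoss()(logits, targets)
```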

Posted by: peratham.bkk @ July 12, 2023, 5:31 a.m.

I am curious: most multi-label works/datasets/tasks on chest X-rays score roughly 0.7-0.9, depending on the setup, but the leaderboard looks like ~0.2-0.4.

Posted by: peratham.bkk @ July 12, 2023, 5:33 a.m.

If the current labels in the competition/leaderboard are one-hot, all multi-label systems will score around 0.2, since, given the training set's imbalance statistics and the way precision is calculated, only about 1 of every 5 predicted positives can match a one-hot label.
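A toy version of that arithmetic (hypothetical labels, just to show the mechanics):

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical 5-class example: the images truly have several findings,
# but the stored labels are one-hot (only one positive kept per image).
y_true_onehot = np.array([[1, 0, 0, 0, 0],
                          [0, 0, 1, 0, 0]])
# A sensible multi-label system predicts all findings it sees.
y_pred_multilabel = np.array([[1, 1, 0, 1, 0],
                              [1, 0, 1, 1, 1]])

# Micro precision = matches / predicted positives = 2 / 7, about 0.29,
# even if the extra predictions are clinically correct.
print(precision_score(y_true_onehot, y_pred_multilabel, average="micro"))
```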

Posted by: peratham.bkk @ July 12, 2023, 5:37 a.m.

Even on a harder dataset like ImageNet-LT, the state-of-the-art methods report higher F1/accuracy...

Posted by: peratham.bkk @ July 12, 2023, 5:45 a.m.

(Hopefully, posting this sample from MIMIC-CXR to illustrate the issue does not breach any agreement...)

Posted by: peratham.bkk @ July 12, 2023, 6:09 a.m.

You make a few different observations, which I'll try to address one by one.

First, about the data. Indeed, the data from https://github.com/VITA-Group/LongTailCXR is "multi-class". The associated paper (https://link.springer.com/chapter/10.1007/978-3-031-17027-0_3) studied *single-label* (multi-class) long-tailed learning. While the vast majority of advances in long-tailed learning are in the multi-class classification setting (where classes are mutually exclusive), this competition's setting is different in that it poses both a multi-label and a long-tailed problem for medical image classification. (Also, I would be extremely careful about using labels from the above repository for this competition, since some labels may be identical to this competition's val/test sets, which you are not permitted to use for model development in any way.)

Second, about performance. When you claim "most multilabel works/datasets/tasks in chest xrays got ~0.7-0.9", it is critical to note that almost all papers using MIMIC-CXR or NIH ChestXRay use AUROC as the performance metric. This is appropriately reflected in this competition's leaderboard, as top performers are reaching mean AUROC values above 0.83. As stated on this competition's Evaluation page (https://codalab.lisn.upsaclay.fr/competitions/12599#learn_the_details-evaluation), AUROC is known to become inflated in the presence of severe class imbalance. For this reason, we primarily adopt average precision (AP), which we feel more closely reflects model performance on our highly imbalanced dataset.
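As a quick illustration (synthetic data, not our actual label distribution), severe imbalance lets AUROC look strong while AP stays modest:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n = 10_000

# Synthetic rare finding with ~1% prevalence.
y_true = (rng.random(n) < 0.01).astype(int)

# A moderately informative score: positives shifted up by 1.5.
y_score = rng.normal(size=n) + 1.5 * y_true

print(roc_auc_score(y_true, y_score))            # roughly 0.85
print(average_precision_score(y_true, y_score))  # far lower, near 0.1
```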

Third, about difficulty. What makes you say ImageNet-LT is "harder"? It's certainly *larger*, but that might not reflect task difficulty. I would argue this task is quite difficult, which is what makes it worth studying! :)

Happy to continue discussing over email at cxr.lt.competitions.2023@gmail.com.

Greg Holste | CXR-LT Organizer

Posted by: gholste @ July 12, 2023, 1:18 p.m.

> First, about the data. Indeed, the data from https://github.com/VITA-Group/LongTailCXR is "multi-class". The associated paper studied *single-label* (multi-class) long-tailed learning. [...]

The datasets being used are the same (NIH and MIMIC-CXR), and they are multi-label in nature. Judging from the example I posted, were some positive labels set to zero to make the data multi-class? It's good that this competition is multi-label.
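Something like this (hypothetical preprocessing, just to illustrate what I mean by zeroing positives):

```python
import numpy as np

# Hypothetical NIH/MIMIC-CXR-style row: the image truly shows
# both Atelectasis and Effusion (multi-hot label).
classes = ["Atelectasis", "Cardiomegaly", "Effusion"]
multi_hot = np.array([1, 0, 1])

# Forcing this into a multi-class (one-hot) setting keeps only one
# positive and zeroes the rest, silently discarding real findings.
one_hot = np.zeros_like(multi_hot)
one_hot[np.argmax(multi_hot)] = 1   # [1, 0, 0]: Effusion is dropped
```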

> Second, about performance. When you claim "most multilabel works/datasets/tasks in chest xrays got ~0.7-0.9", it is critical to note that almost all papers using MIMIC-CXR or NIH ChestXRay use AUROC as the performance metric. [...]

I am talking about mAP or F1, as reported in most papers on the Papers with Code leaderboards.
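For reference, the metrics I mean, computed the way leaderboards usually do (toy arrays):

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score

# Toy multi-hot labels and per-class probabilities: 4 images, 3 classes.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_score = np.array([[0.8, 0.2, 0.6],
                    [0.3, 0.7, 0.1],
                    [0.6, 0.9, 0.2],
                    [0.1, 0.4, 0.7]])

# mAP: average precision per class, then macro-averaged.
mAP = average_precision_score(y_true, y_score, average="macro")

# Macro F1 needs a threshold; 0.5 is the common default.
macro_f1 = f1_score(y_true, (y_score >= 0.5).astype(int), average="macro")
print(mAP, macro_f1)
```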

> Third, about difficulty. What makes you say ImageNet-LT is "harder"? It's certainly *larger*, but that might not reflect task difficulty. [...]

"Harder" is a function of many factors: the number of categories, the imbalance factor, label noise, and so on. (See the literature, e.g., from SIGKDD, where these issues have been studied for more than 30 years.) These are simple rules of thumb that let a person roughly estimate what performance numbers to expect. Given the factors in this competition (cross-domain, with less visual variation), I would say it is somewhat similar to CIFAR-ish datasets in a black-and-white/grayscale setting.

> Happy to continue discussing over email at cxr.lt.competitions.2023@gmail.com.

I used that repo as my starting point and just screamed :)

Posted by: peratham.bkk @ July 13, 2023, 7:29 a.m.

(I am fully aware that, since this is a competition, the organizers can create the development/test splits in many ways to reflect their purposes, for example by including more samples from the less frequent classes in the development/test sets.)

Posted by: peratham.bkk @ July 13, 2023, 7:38 a.m.

((Some similar competitions addressing class imbalance and long-tailed classification in images/videos include iNaturalist and many of the FGVC tasks at the *CV conferences. This competition is different, though, since it involves medical data.))

Posted by: peratham.bkk @ July 13, 2023, 7:52 a.m.

To clarify, you're absolutely free to use any code from https://github.com/VITA-Group/LongTailCXR -- it's publicly available. The problem is that using any pretrained models or labels based on the MIMIC data in that repo may not be allowed, so I would avoid this.

Greg Holste | CXR-LT Organizer

Posted by: gholste @ July 13, 2023, 1:54 p.m.

To be clear, I am not trolling. I would personally withdraw or correct erroneous publications; it is better in the long run.

Posted by: peratham.bkk @ July 13, 2023, 3:08 p.m.