The 1st OmniLabel workshop will be held on June 18, 2023, in conjunction with CVPR 2023.
We aim to foster research on the next generation of visual perception systems that reason over label spaces going beyond a list of simple category names. Modern applications of computer vision require systems that understand a full spectrum of labels: from plain category names (“person” or “cat”), through descriptions that modify objects with attributes, actions, functions, or relations (“women with yellow handbag”, “parked cars”, or “edible item”), to specific referring descriptions (“the man in the white hat walking next to the fire hydrant”). Natural language is a promising direction not only to enable such complex label spaces, but also to train models on multiple datasets with different, and potentially conflicting, label spaces.
As a part of the workshop, we have a challenge on object detection with a novel evaluation dataset that goes beyond generic object detection, open-vocabulary detection, or referring expression datasets. We collected complex, free-form text descriptions of object instances in images. Please visit our website at https://www.omnilabel.org for all details about the benchmark dataset.
We invite the research community to participate in our challenge and test their vision-and-language (V&L) detection models on our benchmark. For now, we have released the validation set along with a Python toolkit for evaluation, which should allow participants to learn about the dataset and tune their models.
Challenge website: https://sites.google.com/view/omnilabel-workshop-cvpr23/challenge
Please refer to https://www.omnilabel.org/dataset/download for the detailed description of how to download and set up the data.
Allowed training datasets: Any
Allowed pre-trained models: Any
Disallowed: Do not use the validation or test sets of COCO, Objects365, or OpenImages for training. We also discourage using the OmniLabel validation set for training or fine-tuning.
1. Create an answer.json file in the following format:

[
  {
    image_id        ... the image ID this predicted box belongs to
    bbox            ... the bounding box coordinates of the object (x, y, w, h)
    description_ids ... list of description IDs that refer to this object
    scores          ... list of confidences, one for each description
  },
  ...
]
2. Create a ZIP archive by compressing the answer.json file. (See the example submission file example_dummy_result.zip and the sketch below.)
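For illustration, here is a minimal sketch of assembling such a submission with only the Python standard library; the prediction values and file names are dummy placeholders, not taken from the dataset:

```python
import json
import zipfile

# Dummy predictions in the format described in step 1; all values are placeholders.
predictions = [
    {
        "image_id": 42,                      # image this predicted box belongs to
        "bbox": [10.0, 20.0, 150.0, 200.0],  # (x, y, w, h) in image coordinates
        "description_ids": [321, 4321],      # descriptions that refer to this object
        "scores": [0.91, 0.75],              # one confidence per description
    },
]

with open("answer.json", "w") as f:
    json.dump(predictions, f)

# Step 2: compress answer.json into a ZIP archive for submission.
with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("answer.json")
```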
Please refer to https://www.omnilabel.org/task for the detailed description of the task.
Inputs:
Each sample in the dataset is a pair of an image I and a label space D. Each image I has a unique image_id. The label space D is a list of object descriptions, each with a unique description_id and a text, e.g., D = [{id=321, text="Person"}, {id=223, text="Donuts"}, {id=4321, text="Chocolate donut"}, {id=12, text="Donut with green glaze"}].
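Concretely, a single evaluation sample could be represented in plain Python as follows; this dict layout is only illustrative and is not the toolkit's actual data format:

```python
# Illustrative only: one evaluation sample pairs an image with its own label space D.
sample = {
    "image_id": 7,  # unique ID of image I
    "label_space": [
        {"id": 321, "text": "Person"},
        {"id": 223, "text": "Donuts"},
        {"id": 4321, "text": "Chocolate donut"},
        {"id": 12, "text": "Donut with green glaze"},
    ],
}
```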
Outputs:
The expected output of a model M is a set of triplets (bbox, score, description_id). Each triplet consists of a bounding box, a confidence score, and an index linking the prediction to an object description in D. A bounding box bbox = (x, y, w, h) consists of 4 coordinates in the image space and defines the extent of an object. The confidence score is a real-valued scalar expressing the confidence in the model's prediction. Finally, the index description_id points to an ID in the label space D and indicates that the box is described by the corresponding text. Note that one object in the image may be referred to by multiple object descriptions (e.g., “person” and “woman in red shirt”), in which case the output should be one bounding box that points to both descriptions.
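As a sketch of that last point, raw (bbox, score, description_id) triplets can be grouped by box so that one prediction carries several description IDs; the IDs and scores below are made up, and grouping by the exact box coordinates is an assumption for illustration only:

```python
from collections import defaultdict

# Hypothetical raw triplets for one image; ID 999 stands in for a description
# like "woman in red shirt" that refers to the same object as "Person" (ID 321).
triplets = [
    ((40.0, 30.0, 80.0, 160.0), 0.88, 321),
    ((40.0, 30.0, 80.0, 160.0), 0.71, 999),
    ((250.0, 60.0, 50.0, 50.0), 0.64, 12),
]

# Group triplets that share the same box, so one bbox points to several descriptions.
grouped = defaultdict(lambda: {"description_ids": [], "scores": []})
for bbox, score, desc_id in triplets:
    grouped[bbox]["description_ids"].append(desc_id)
    grouped[bbox]["scores"].append(score)

predictions = [
    {"image_id": 7, "bbox": list(bbox), **vals} for bbox, vals in grouped.items()
]
```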
To evaluate a model M on our task, we follow the basic evaluation protocol of standard object detection with the Average Precision (AP) metric. This metric computes precision-recall curves at various Intersection-over-Union (IoU) thresholds to evaluate both the classification and localization ability of a model M. Compared to traditional object detection, we adjust the evaluation protocol to account for the complex, free-form text object descriptions, which form a virtually infinite label space:
- The grouping of the label space is different. Instead of computing AP per class and taking the mean, we compute AP for different, and potentially overlapping, groups of the label space: one group computes AP over all descriptions, other groups only over free-form object descriptions of different lengths, etc.
- One predicted box can be matched to multiple object descriptions, e.g., "person" and "woman in red shirt".
Our evaluation toolkit is available on GitHub.
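The toolkit on GitHub is the reference implementation. Purely as a rough illustration of the matching step, a greedy IoU match between predictions and ground truth for a single description group could look like the sketch below; the threshold and tie-breaking here are assumptions, not the official protocol:

```python
def iou(box_a, box_b):
    """IoU between two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def greedy_match(pred_boxes, gt_boxes, iou_thr=0.5):
    """Match predictions (assumed sorted by descending score) to ground truth.

    Returns one True/False flag per prediction (true positive vs. false positive).
    """
    matched_gt = set()
    flags = []
    for pb in pred_boxes:
        best_iou, best_j = 0.0, -1
        for j, gb in enumerate(gt_boxes):
            if j in matched_gt:
                continue
            v = iou(pb, gb)
            if v > best_iou:
                best_iou, best_j = v, j
        if best_iou >= iou_thr:
            matched_gt.add(best_j)
            flags.append(True)
        else:
            flags.append(False)
    return flags
```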
The participants own the copyright of their source code and trained models.
Start: March 28, 2023, midnight
Start: May 2, 2023, midnight
End: May 26, 2023, 11:59 p.m.
Rank | Username | Score
---|---|---
1 | l_harold | 0.2981
2 | tuyentx | 0.2650
3 | sanghyeokchu | 0.2641