BioCreative VIII Task 3 - Genetic Phenotype Normalization from Dysmorphology Physical Examination

Organized by dweissen

Important dates (tentative):

  • Training data available: April 11, 2023
  • Test data available: September 15, 2023
  • System predictions for test data due: September 18, 2023
  • Short technical systems description paper due: October 10, 2023
  • Paper acceptance notification: October 20, 2023
  • Camera ready: October 27, 2023

Task Motivation

The dysmorphology physical examination is a critical component of the diagnostic evaluation in clinical genetics. This process catalogues often minor morphological differences of the patient's facial structure or body, but it may also identify more general medical signs such as neurologic dysfunction. The findings enable correlation of the patient with known rare genetic diseases. They therefore directly influence clinical diagnosis, the selection of genetic testing, and the interpretation of results, particularly when testing reveals variants of uncertain clinical significance. Beyond the clinic, such information is also useful to researchers attempting to delineate undescribed genetic conditions or to further our understanding of existing ones.

Although these medical findings are key information, they are nearly always captured within the electronic health record (EHR) as unstructured free text, making them unavailable for downstream computational analysis. Advanced Natural Language Processing methods are therefore required to retrieve the information from the records.

Task Definition: Automatic extraction and normalization of genetic conditions in dysmorphology physical examination reports

For the BioCreative VIII shared task, we call for automated systems to extract and normalize the key findings in observations written during dysmorphology physical examinations.

Dysmorphology physical examinations are frequently documented in the EHR as a series of organ system observations. For example:

PHYSICAL EXAMINATION    

FACE: slightly inverted triangular face shape

EYES: long palpebral fissures with slight downslant. Sparse lateral eyebrows.

EARS: Thin inferior helices, low-set

NOSE: Short, wide nasal bridge. Anteverted nares.

MOUTH: thin upper lip; palate intact

CHEST: supernumerary nipple inferior to left nipple

HANDS FEET: Long fingers, normal toes

NEUROLOGIC: Resting tremor. Wide-based, unsteady gait.

Similar to clinical workflows, we will standardize the description of dysmorphic findings using the Human Phenotype Ontology (HPO), an ontology specifically designed for human genetics.

A successful system should extract the spans of text referring to the key positive findings and normalize them to term IDs in the HPO. The system should ignore normal findings. For example, given the organ system observation "EYES: long palpebral fissures with slight downslant. Normal eyebrows.", a system should extract the spans of the two key findings, [long palpebral fissures] and [palpebral fissures with slight downslant], and normalize them to the term IDs HP:0000637 and HP:0000494, respectively. The system should ignore the normal finding [Normal eyebrows].

During the competition, participants will be able to perform one of the following subtasks:

  • Subtask 3a: Given an observation, participants of this subtask will be required to submit the HPO term IDs of all key findings mentioned in the observation.
  • Subtask 3b: Given an observation, participants will be required to submit the spans of the key findings and their corresponding HPO term IDs.
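
As an illustration of what each subtask asks for, the sketch below encodes the EYES example in Python. The `Finding` class, the `locate` helper, the field names, and the half-open character offsets are all assumptions made for this example; they are not the official submission format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One key finding (hypothetical structure, not the official format)."""
    hpo_id: str
    segments: tuple  # ((start, end), ...) half-open character offsets

    def text(self, observation: str) -> str:
        # Join the segments so that disjoint findings read as one mention.
        return " ".join(observation[s:e] for s, e in self.segments)


def locate(observation: str, phrase: str) -> tuple:
    """Return the (start, end) offsets of the first occurrence of phrase."""
    start = observation.index(phrase)
    return (start, start + len(phrase))


OBSERVATION = "EYES: long palpebral fissures with slight downslant. Normal eyebrows."

# Subtask 3b: spans plus HPO term IDs. "Normal eyebrows" is deliberately absent.
subtask_3b = [
    Finding("HP:0000637", (locate(OBSERVATION, "long palpebral fissures"),)),
    Finding("HP:0000494",
            (locate(OBSERVATION, "palpebral fissures with slight downslant"),)),
]

# Subtask 3a: only the HPO term IDs of the key findings.
subtask_3a = sorted({f.hpo_id for f in subtask_3b})
```

Note that the two spans share the characters of "palpebral fissures", which is why each finding carries its own offsets instead of a single label per token.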

Details:

  • Training set: 1716 de-identified observations with key and normal findings manually annotated and normalized with their corresponding HPO terms
  • Validation/Development set: 454 de-identified observations with key and normal findings manually annotated and normalized with their corresponding HPO terms
  • Test set: 966 de-identified observations with key and normal findings manually annotated and normalized with their corresponding HPO terms + 2427 decoy observations
  • Training, Validation, and Test sets were annotated with the version hp/releases/2022-06-11 of the HPO
  • Baseline systems: Multiple systems are available to the participants and can be adapted or extended to address our task, e.g. doc2HPO, NeuralCR, PhenoTagger, PhenoBERT, and txt2HPO
  • Registration link (Note: after registration, we communicate and release the data through dedicated Google groups; please make sure to check your spam folders)
  • Participating teams are required to submit a paper describing the system(s) they ran on the test data. Sample system descriptions can be found in previous years' proceedings.
  • Participating teams are also required to review at least one system description paper from another participating team.
  • Contact information: Davy Weissenbacher (davy.weissenbacher@cshs.org)

Evaluation Metrics

  • Subtask 3a: Given an observation, participants of this subtask will be required to submit the HPO term IDs of all key findings mentioned in the observation. This subtask will use the standard F1, Precision and Recall metrics, with the outcomes defined as follows:
    • True Positive: a mention of an HPO term in an observation labeled and normalized by an annotator, correctly detected and normalized by a system
    • True Negative: a mention of a normal finding correctly ignored by a system - note that a normal finding can be a negated HPO term
    • False Positive: an HPO term incorrectly detected or normalized by a system - i.e. this HPO term was either not labeled by an annotator as being mentioned in the observation, or the term was mentioned but it was negated or normalized with a different HPO term ID
    • False Negative: a mention of an HPO term in an observation labeled and normalized by an annotator but not detected or incorrectly normalized by a system
  • Subtask 3b: Given an observation, participants will be required to submit the spans of the key findings and their corresponding HPO term IDs. This subtask will use the standard metrics: Exact and Overlapping Precision, Recall and F1-score, with the outcomes defined as follows:
    • Exact:
      • True Positive: a span, possibly disjoint, of a mention of an HPO term in an observation labeled and normalized by an annotator, correctly detected and normalized by a system, with the predicted span exactly matching the labeled span
      • True Negative: a mention of a normal finding correctly ignored by a system
      • False Positive: a span incorrectly detected OR normalized by a system - i.e. the span predicted by the system was not labeled by an annotator as mentioning an HPO term, or the span predicted was a mention of a term but it was negated, or the predicted span and the labeled span differ, or the spans are identical but the span predicted is normalized with an incorrect HPO term ID
      • False Negative: a span of a mention of an HPO term labeled and normalized by an annotator but not detected OR incorrectly normalized by a system
    • Overlapping:
      • True Positive: a span, possibly disjoint, of a mention of an HPO term in an observation labeled and normalized by an annotator, correctly detected and normalized by a system, with the predicted span overlapping the labeled span exactly or partially
      • True Negative: a mention of a normal finding correctly ignored by a system
      • False Positive: a span incorrectly detected OR normalized by a system - i.e. the span predicted by the system was not labeled by an annotator as mentioning an HPO term, or the span predicted was a mention of a term but it was negated, or the spans predicted and labeled differ, or they overlap but the span predicted is normalized with an incorrect HPO term ID
      • False Negative: a span mentioning an HPO term labeled and normalized by an annotator but not detected OR incorrectly normalized by a system
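
The outcome definitions above can be sketched in Python as follows. This is not the official scorer: the tuple layouts, the function names, the greedy one-to-one matching of predicted to labeled spans, and the pooling of counts over all observations are assumptions made for illustration.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from outcome counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def score_3a(gold, pred):
    """gold, pred: sets of (observation_id, hpo_id) pairs."""
    return prf(len(gold & pred), len(pred - gold), len(gold - pred))


def spans_match(a, b, mode):
    """a, b: tuples of half-open (start, end) segments of one finding."""
    if mode == "exact":
        return sorted(a) == sorted(b)
    # Overlapping mode: the two spans share at least one character.
    return any(s1 < e2 and s2 < e1 for s1, e1 in a for s2, e2 in b)


def score_3b(gold, pred, mode="exact"):
    """gold, pred: lists of (observation_id, hpo_id, segments) triples."""
    unmatched = list(pred)
    tp = 0
    for obs_id, hpo_id, segments in gold:
        for cand in unmatched:
            if (cand[0] == obs_id and cand[1] == hpo_id
                    and spans_match(segments, cand[2], mode)):
                unmatched.remove(cand)  # each prediction is used at most once
                tp += 1
                break
    # Leftover predictions are false positives, unmatched gold false negatives.
    return prf(tp, len(unmatched), len(gold) - tp)


gold_3b = [("obs1", "HP:0000637", ((6, 29),)),
           ("obs1", "HP:0000494", ((11, 51),))]
pred_3b = [("obs1", "HP:0000637", ((6, 29),)),
           ("obs1", "HP:0000494", ((11, 29),))]  # right term, partial span

exact_scores = score_3b(gold_3b, pred_3b, mode="exact")
overlap_scores = score_3b(gold_3b, pred_3b, mode="overlapping")

gold_3a = {("obs1", "HP:0000637"), ("obs1", "HP:0000494")}
pred_3a = gold_3a | {("obs1", "HP:0001052")}  # one spurious term ID
scores_3a = score_3a(gold_3a, pred_3a)
```

In this sketch a finding that is detected but normalized with the wrong HPO term ID counts twice, once as a false positive and once as a false negative, which follows from the definitions above. True negatives do not enter precision, recall or F1.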

Terms and Conditions

By submitting results to this competition, you consent to the public release of your scores at the BioCreative workshop and in the associated proceedings, at the task organizers' discretion. Scores may include, but are not limited to, automatic and manual quantitative judgements, qualitative judgements, and such other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers.

You further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgement that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.

You further agree that your system may be named according to the team name provided at the time of submission, or to a suitable shorthand as determined by the task organizers.

You further agree to submit and present a short paper describing your system during the workshop.

You agree not to redistribute the training and test data except in the manner prescribed by its licence.

Challenges

Both steps, extraction and normalization, are particularly difficult on dysmorphology physical examinations given the current state of the art in natural language processing.

Extraction. This step is challenging due to the descriptive style of the examinations and their polarity. The observations are short reports where, for conciseness, the span of a finding can be disjoint or can overlap with the span of another finding. The EYES observation in the Task Definition is an example of overlapping findings, with the span palpebral fissures contributing to both the HP:0000637 and HP:0000494 terms. For disjoint findings, i.e. findings defined by non-consecutive segments of text, consider the term Short nasal bridge - HP:0003194 in the observation NOSE: Short, wide nasal bridge. Anteverted nares. Extractors should go beyond the standard sequence labeling approach, which is designed to extract contiguous and mutually exclusive named entities and therefore fails to capture disjoint and overlapping terms. As an additional challenge, the extractor should also resolve the polarity of the findings, that is, automatically detect and ignore normal findings and return only the key positive findings.
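
To make the limitation of flat sequence labeling concrete, the sketch below checks which findings in the NOSE observation could not be encoded with one contiguous, non-overlapping label per token. The helper names and the segment representation are assumptions for this example, and treating "wide nasal bridge" and "Anteverted nares" as key findings is likewise an assumption; only Short nasal bridge is named in the text above.

```python
def locate(text, phrase):
    """Return the half-open (start, end) offsets of the first occurrence."""
    start = text.index(phrase)
    return (start, start + len(phrase))


def flat_labeling_problems(findings):
    """findings: dict mapping a finding name to its list of segments.

    Returns the reasons why a flat, one-label-per-token scheme cannot
    represent the findings as given.
    """
    problems = []
    # A finding made of several segments is disjoint.
    for name, segments in findings.items():
        if len(segments) > 1:
            problems.append(f"{name}: disjoint")
    # Two findings that share characters overlap.
    items = list(findings.items())
    for i, (name_a, segs_a) in enumerate(items):
        for name_b, segs_b in items[i + 1:]:
            if any(s1 < e2 and s2 < e1
                   for s1, e1 in segs_a for s2, e2 in segs_b):
                problems.append(f"{name_a} / {name_b}: overlapping")
    return problems


NOSE = "NOSE: Short, wide nasal bridge. Anteverted nares."

findings = {
    "Short nasal bridge": [locate(NOSE, "Short"), locate(NOSE, "nasal bridge")],
    "wide nasal bridge": [locate(NOSE, "wide nasal bridge")],
    "Anteverted nares": [locate(NOSE, "Anteverted nares")],
}

problems = flat_labeling_problems(findings)
```

Here the words "nasal bridge" belong to two findings at once, and "Short" is separated from the rest of its finding by "wide", so a single label per token cannot recover both terms.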

Normalization. This step is also challenging, due to both the large scale of the HPO and its incompleteness. Standard strategies for multi-label classification are designed for small sets of classes. However, to be successful in our task, a normalizer should adapt traditional strategies to assign one term from among the 17,000 terms in the HPO to each finding detected in an observation. This must frequently be done without direct supervision, since our training set does not provide examples for all terms in the HPO. Furthermore, while specifically designed for human genetics and constantly improving, the HPO does not have standardized levels of term detail. As a consequence, a key finding may need to be matched with a close ancestor in the hierarchy of the ontology. This makes a strict string matching strategy ineffective, since the string of the ancestor in the HPO will differ from the string of the key finding in the observation. For example, there exist both Naevus flammeus of the eyelid - HP:0010733 and Nevus flammeus of the forehead - HP:0007413, but no term for the nose, leaving only the generic Nevus flammeus - HP:0001052 to normalize this abnormality of the nose when it is mentioned in an observation.
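
The nevus flammeus example can be sketched with a toy lexicon holding only the three terms named above. The back-off rule used here, picking the longest label whose words all appear in the mention, is a simple heuristic chosen for illustration; it is not a method prescribed by the task, and it stands in for a real traversal of the ontology hierarchy. The names `TOY_HPO`, `tokens` and `normalize` are hypothetical.

```python
import re

# Toy lexicon: only the three terms cited in the text, not the real HPO.
TOY_HPO = {
    "HP:0001052": "Nevus flammeus",
    "HP:0010733": "Naevus flammeus of the eyelid",
    "HP:0007413": "Nevus flammeus of the forehead",
}


def tokens(text):
    """Lowercased set of alphabetic words in text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def normalize(mention, lexicon=TOY_HPO):
    """Return (hpo_id, strategy) for a mention, or (None, 'unmatched')."""
    wanted = mention.strip().lower()
    # 1. Strict matching: the mention equals a term label.
    for hpo_id, label in lexicon.items():
        if label.lower() == wanted:
            return hpo_id, "exact"
    # 2. Back-off: labels whose words are all contained in the mention;
    #    prefer the longest such label, i.e. the most specific one.
    mention_tokens = tokens(mention)
    candidates = [(len(tokens(label)), hpo_id)
                  for hpo_id, label in lexicon.items()
                  if tokens(label) <= mention_tokens]
    if candidates:
        return max(candidates)[1], "back-off"
    return None, "unmatched"


forehead = normalize("Nevus flammeus of the forehead")
nose = normalize("nevus flammeus of the nose")
other = normalize("thin upper lip")
```

Strict matching resolves the forehead mention directly, whereas the nose mention only resolves through the back-off to the generic term. A realistic normalizer would also need synonyms and spelling variants (Naevus versus Nevus) and the actual parent-child links of the ontology, which this sketch does not model.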

Registration

This task is part of a larger competition: BioCreative VIII. To register, please follow the link "team registration page" in the section TEAM REGISTRATION of the BioCreative VIII page.

Registration is free but required to access the training and evaluation data. If you have any questions, please contact Davy Weissenbacher.

Task organizers

The results of the task will be released during the BioCreative VIII workshop and published in the proceedings of the event.

Task 3a, Normalization - Practice

Start: April 14, 2023, midnight

Task 3a, Normalization - Evaluation

Start: Sept. 14, 2023, 11 p.m.

Task 3a, Normalization - Post-Evaluation

Start: Sept. 19, 2023, 1 a.m.

Task 3b, Ext. and Norm. - Practice

Start: July 14, 2023, midnight

Task 3b, Ext. and Norm. - Evaluation

Start: Sept. 14, 2023, 11 p.m.

Task 3b, Ext. and Norm. - Post-Evaluation

Start: Sept. 19, 2023, 1 a.m.

Competition Ends

Never

# Username Score
1 QJW -
2 DUTIR-BioNLP -