Greetings everybody!
Hope you're having a nice week.
My name is Mariia, and I have a question about the tokenization process used to split the texts into the tokens that make up the "boundary_annotation" fields of the JSON data files.
More specifically, I'm interested in how newline characters are treated. Many documents contain two newline characters separated by spaces ("\n \n"), but, as far as I can tell, only one of them appears among the annotations. As an example, here is a fragment from the note S0004-06142005001000015-1 (2nd entry of the dev set): "[...] urinoma.\n \n Se realizó [...]" (note_text)
{
    "span": "urinoma.",
    "start_offset": 948,
    "end_offset": 956,
    "boundary": null
},
{
    "span": "\n",
    "start_offset": 956,
    "end_offset": 957,
    "boundary": null
},
{
    "span": "Se",
    "start_offset": 960,
    "end_offset": 962,
    "boundary": null
}
(boundary_annotation)
As can be seen, the newline token covers only offsets 956-957, immediately after the previous word, while "Se" starts at 960, so the second newline character is not represented in the annotations.
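For reference, this is roughly the check I run to compare each token against the slice of note_text it should cover (a sketch in Python; the file name, the top-level list-of-notes layout, and the "note_id" field are my assumptions, while "note_text", "boundary_annotation", "span" and the offset fields come from the data):

import json

# Load the dev set (file name assumed).
with open("dev.json", encoding="utf-8") as f:
    notes = json.load(f)

for note in notes:
    text = note["note_text"]
    for token in note["boundary_annotation"]:
        # Each span should equal the text slice its offsets point to.
        covered = text[token["start_offset"]:token["end_offset"]]
        if covered != token["span"]:
            print(note.get("note_id"), repr(token["span"]), "!=", repr(covered))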
Could you share more details on the tokenization process so that we can prepare the submission files correctly?
Thank you in advance for your answer.
Posted by: mchizhik @ March 29, 2023, 12:36 p.m.

Hello Mariia,
I apologize for the delay in my response. There was a small error in the intermediate step used to create the JSON files, which caused contiguous newlines separated by whitespace to be represented as a single newline character "\n". In this specific case, the desired token is "\n \n" and the offsets should be 956-959.
We will upload an updated JSON file once we have resolved this issue. The number of tokens will remain the same, so it should not cause any issues with already-implemented systems and submissions.
In terms of tokenization details, please note the following (a rough sketch illustrating these rules follows the list):
- All text segments that are enclosed in parentheses or similar characters ('(', '[', '{') and contain no whitespace inside are kept as a single token.
- Punctuation marks are always attached to a word.
- Multiple contiguous line breaks that do not contain any words between them are kept as a single token. For example: "\n\n", "\n\t\n", "\n \n", "\n \n \n"...
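For illustration only, the three rules above can be approximated with a short regular expression (this is a rough sketch, not our actual tokenization script, and it only reflects the rules listed here):

import re

# One token per run of contiguous line breaks (possibly separated by
# spaces or tabs); otherwise split on whitespace, which keeps punctuation
# attached to words and whitespace-free bracketed segments intact.
TOKEN = re.compile(r"\n(?:[ \t]*\n)*|\S+")

def tokenize(text):
    return [(m.group(), m.start(), m.end()) for m in TOKEN.finditer(text)]

print(tokenize("urinoma.\n \n Se realizó"))
# [('urinoma.', 0, 8), ('\n \n', 8, 11), ('Se', 12, 14), ('realizó', 15, 22)]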
Please let us know if you have any further questions or concerns.
Best regards,
Iker
Posted by: idelaiglesia004 @ March 30, 2023, 9:46 a.m.

Thank you very much for the explanation.
Looking forward to seeing the corrected dataset.
Have a nice day.
Best regards,
Mariia
Posted by: mchizhik @ March 30, 2023, 10:01 a.m.

Hello,
We have fixed the tokenization error and re-uploaded the training and development data. For this phase, we have disabled the offset checking in the scorer to prevent any issues with existing submissions. However, please note that it is still crucial to ensure that the number of tokens is the same.
In the final phase, offset checking will be enabled, so it's essential to ensure that the offsets are correct as well as the number of tokens.
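As a quick self-check before submitting (a sketch only; the function and file handling are mine, and this is not the official scorer), you can compare your file against the released tokenization:

import json

def check_submission(reference_path, submission_path):
    # Assumes both files share the released JSON layout: a list of
    # notes, each carrying a "boundary_annotation" list of tokens.
    with open(reference_path, encoding="utf-8") as f:
        reference = json.load(f)
    with open(submission_path, encoding="utf-8") as f:
        submission = json.load(f)
    for ref_note, sub_note in zip(reference, submission):
        ref_tokens = ref_note["boundary_annotation"]
        sub_tokens = sub_note["boundary_annotation"]
        # Current phase: the number of tokens must match.
        assert len(ref_tokens) == len(sub_tokens), "token count differs"
        for r, s in zip(ref_tokens, sub_tokens):
            # Final phase: offsets must match as well.
            assert r["start_offset"] == s["start_offset"], "start offset differs"
            assert r["end_offset"] == s["end_offset"], "end offset differs"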
I apologize for any inconvenience this may have caused and please don't hesitate to reach out if you have any further questions or concerns.
Best regards,
Iker
Posted by: idelaiglesia004 @ March 31, 2023, 5:08 p.m.

Greetings!
First of all, thank you very much for your effort, and sorry to bring this question up again, but I'm still having trouble getting my text split into tokens correctly so that the unmodified evaluation script works.
I see that you don't split expressions like "ng/ml", and in general tokens with a punctuation mark in the middle remain unseparated. Nevertheless, there are cases where digits with decimal separators form two tokens, as in the following example: "[...] 1,5 cm [...]"
{
    "span": "1,",
    "start_offset": 585,
    "end_offset": 587,
    "boundary": null
},
{
    "span": "5",
    "start_offset": 587,
    "end_offset": 588,
    "boundary": null
},
{
    "span": "cm",
    "start_offset": 589,
    "end_offset": 591,
    "boundary": null
}
(S0004-06142006000600010-1, dev set)
I'm also having trouble with multiple newline characters at the end of some documents: in many cases they appear as a single "\n" among the annotations.
Given all these difficulties, may I suggest that you release the tokenization code? For now, to get the tokens in the right order, I have been using the development tokens from the given "boundary_annotation" fields, but I suppose that in the final stage we will be given only the plain text reports.
Have a nice day and thank you in advance for your answer.
Best regards,
Mariia Chizhikova
Thank you for your message.
We want to ensure that all participants are evaluated equally during the test phase, so we will provide both the plain text notes and the tokenization. Therefore, there is no need to replicate the exact tokenization process, as we will provide the tokens. However, we will not provide the section annotations, and all tokens will have their boundaries set to "null". We encourage participants to use their own approaches to complete the task and then, as a final step, map their results onto the provided tokens.
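For instance, if a system predicts section starts as character offsets, that final mapping step could look roughly like this (a sketch only; the function name and the label value "B" are illustrative, not the official annotation scheme):

def map_to_tokens(tokens, predicted_starts):
    # Mark each provided token whose start offset coincides with a
    # predicted section start; all other tokens keep a null boundary.
    starts = set(predicted_starts)
    for token in tokens:
        token["boundary"] = "B" if token["start_offset"] in starts else None
    return tokens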
Regarding the first question, we tried to keep units like "ng/ml" as a single token. However, cases where dots or commas appear inside a token are harder to manage, since they may sometimes separate two sentences without any whitespace in between. Therefore, we decided to separate them in all cases. Please note that text segments between parentheses or similar characters that contain no whitespace are kept as a single token.
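To make this concrete, the extra splitting can be sketched on top of a whitespace split like this (a rough illustration of the rule, not our actual script; the names are chosen for the example):

import re

# A whole chunk wrapped in brackets with no whitespace inside stays intact.
BRACKETED = re.compile(r"^[([{].*[)\]}]$")

def split_chunk(chunk):
    if BRACKETED.match(chunk):
        return [chunk]
    # Otherwise, split after every '.' or ',', keeping the mark
    # attached to the preceding characters.
    return re.findall(r"[^.,]*[.,]|[^.,]+", chunk)

print(split_chunk("1,5"))      # ['1,', '5']
print(split_chunk("ng/ml"))    # ['ng/ml']
print(split_chunk("(1,5cm)"))  # ['(1,5cm)']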
We apologize for the inconvenience caused by the multiple newline characters. This error is the same as the previous one, but we overlooked this case. However, we have already uploaded the corrected files, and there should not be any more problems with multiple consecutive newline characters.
As for the tokenization script, we do not plan to release it since it is not required for the task. However, if you are interested in the details, please feel free to contact us, and we would be happy to share it.
We hope this clears up any questions you have regarding the tokenization. If you have any further concerns, please do not hesitate to let us know.
Best regards,
Iker
Posted by: idelaiglesia004 @ April 12, 2023, 12:38 p.m.

Good morning, Iker:
Thank you for your prompt and detailed reply. I appreciate your effort in providing both the plain text notes and the tokenization for the test set. This solves everything for me and makes submission preparation much easier.
I understand your rationale for separating dots and commas in all cases. I think it is a reasonable decision given the complexity of the text. I also appreciate your clarification on the newline characters and the section annotations.
I am not interested in the tokenization script at this moment, but I might contact you later if I have any questions about it. Thank you for offering to share it.
You have answered all my questions regarding the tokenization. I am very grateful for your support and guidance.
Best regards,
Mariia
Posted by: mchizhik @ April 13, 2023, 7:19 a.m.