I may have found a small error in the test set for the initial practice stage:
i=2
df_test.iloc[i]['text'], df_test.iloc[i]['target'], df_test.iloc[i]['IOB_true']
yields:
('Xavier Labandeira:\xa0“España es uno de los países más vulnerables al cambio climático”',
'cambio climático',
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-TARGET', 'O'])
A simple whitespace split on the text keeps the closing quotation mark attached to climático, so the target is not fully matched in the text, even though the ground truth contains the begin tag for it.
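The mismatch can be reproduced without the dataset. The snippet below is a minimal sketch that hard-codes the text and target from the example above and assumes, as the tag count suggests, that the tokens come from a split on a literal space:

```python
# Text and target copied from the example row (index 2 of df_test).
text = 'Xavier Labandeira:\xa0“España es uno de los países más vulnerables al cambio climático”'
target = 'cambio climático'

# Assumption: tokens are produced by splitting on a literal space, so the
# non-breaking space (\xa0) does not separate words and the closing quote
# stays attached to the last word.
tokens = text.split(' ')
target_tokens = target.split(' ')

last_token = tokens[-1]
print(len(tokens))                      # 12, the same length as IOB_true
print(repr(last_token))                 # 'climático”'
print(last_token == target_tokens[-1])  # False
```

The first target word matches token 10 exactly, which explains the B-TARGET tag, while the second word never matches because of the trailing quote.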
Posted by: lucasval @ March 14, 2023, 6:47 p.m.
I have found some more errors in the test set. Some of them are caused by an issue similar to the one above: extra characters such as . or , attached to words of the target prevent the automatic matching logic from fully matching the target.
target_len = df_test['target'].apply(lambda x:len(x.split())).values
IB_len = df_test['IOB_true'].apply(lambda x: len([k for k in x if k!='O'])).values
bad_idxs = []
for index, (first, second) in enumerate(zip(target_len, IB_len), start=0):
    if first != second:
        bad_idxs.append(index)
len(bad_idxs)
This shows 23 instances in the test set where the number of B and I tags does not match the target length.
The following code prints the error instances in detail. I have also found some instances of spurious matching of individual words in the target:
for i in bad_idxs:
    txt, tgt, tags = df_test.iloc[i]['text'], df_test.iloc[i]['target'], df_test.iloc[i]['IOB_true']
    print(f'text {txt}')
    print(f'target {tgt}')
    print(f'tags {tags}')
    print(' ')
Hi. Thanks for posting this issue. We will look into the problem and respond as soon as possible.
Kind regards,
Posted by: JoseAGD @ March 14, 2023, 8:19 p.m.
A possible new implementation of convert_IOB() that removes most of the errors would be:
import string
# Convert to IOB format
def convert_IOB2(text, target):
    result = []
    exclude = set(string.punctuation)
    exclude.add('“')
    text_clean = ''.join(ch for ch in text if ch not in exclude)
    text_clean = text_clean.replace('”','')
    target_clean = ''.join(ch for ch in target if ch not in exclude)
    target_clean = target_clean.replace('”','')
    target_list = target_clean.split(' ')
    occur = 0
    for token in text_clean.split(' '):
        if target == '':
            result.append('O')
        else:
            if token == target_list[occur]:
                if occur == 0:
                    result.append('B-TARGET')
                    occur += 1
                else:
                    result.append('I-TARGET')
                    occur += 1
                if occur == len(target_list):
                    occur = 0
            else:
                result.append('O')
    return result
This gets us down to a single error, possibly caused by the lack of capitalization in the target:
text Las lagunas legislativas sobre la intervención de las CCAA amenazan la eficacia de los Perte
target perte
tags ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
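One way to test the capitalization hypothesis is to compare lowercased, punctuation-free tokens. The following is only a sketch, not the organizers' evaluation code: normalize() and tag_target() are hypothetical names, and the choice to tag only the first contiguous occurrence of the target is an assumption.

```python
import string

# Characters dropped before comparing. The curly quotes are added by hand
# because string.punctuation only covers ASCII punctuation.
PUNCT = set(string.punctuation) | {'“', '”'}

def normalize(token):
    """Hypothetical helper: strip punctuation and lowercase a token."""
    return ''.join(ch for ch in token if ch not in PUNCT).lower()

def tag_target(text, target):
    """Hypothetical sketch: IOB-tag the first contiguous match of target."""
    tokens = text.split(' ')
    norm_tokens = [normalize(t) for t in tokens]
    # Ignore target pieces that are empty after normalization.
    norm_target = [normalize(t) for t in target.split(' ') if normalize(t)]
    tags = ['O'] * len(tokens)
    n = len(norm_target)
    if n == 0:
        return tags
    # Slide a window of n tokens over the text and compare whole spans,
    # so a single target word on its own is never tagged.
    for start in range(len(tokens) - n + 1):
        if norm_tokens[start:start + n] == norm_target:
            tags[start] = 'B-TARGET'
            for k in range(start + 1, start + n):
                tags[k] = 'I-TARGET'
            break  # assumption: only the first occurrence is tagged
    return tags

perte_text = 'Las lagunas legislativas sobre la intervención de las CCAA amenazan la eficacia de los Perte'
perte_tags = tag_target(perte_text, 'perte')
print(perte_tags[-1])  # B-TARGET

quote_text = 'Xavier Labandeira:\xa0“España es uno de los países más vulnerables al cambio climático”'
quote_tags = tag_target(quote_text, 'cambio climático')
print(quote_tags[-2:])  # ['B-TARGET', 'I-TARGET']
```

Because whole spans are compared, this variant should also avoid the spurious matches of individual target words mentioned earlier, at the cost of tagging only one occurrence per text.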
Dear Lucasval. Today we have uploaded the Google Colab and the scoring program to Codalab to solve this issue. In the new version, the evaluation script ignores punctuation marks (not only quotes) and capitalization. Please note that the datasets are unchanged (that is, they keep uppercase words and punctuation symbols), since words that start with capital letters could be informative for deciding what the target is.
Please do not hesitate to raise any other concerns.
Posted by: JoseAGD @ March 15, 2023, 11:06 a.m.