First of all, thanks for all the work that went into this shared task! Due to time constraints, I didn't get to try half the things I wanted to, but I had fun nonetheless.
That said, I have a few, mostly minor, issues with the data you provided:
- Validation and test data have apparently been pre-processed differently from the training data. For example, there are lots of paired single quotation marks (indicating italics) in the validation and test sets which do not feature in the training data at all (a quick check is sketched below the list).
- There are question marks in the training set where other Unicode characters should be (e.g. in sentence 627: "[???sa?kl??]"). In the original data from https://github.com/babaknaderi/TextComplexityDE, the text looks fine ("[ɹɪˈsaɪklɪŋ]"). Obviously, this can lead to tokenisation errors (see the second sketch below for a way to spot these).
- There's at least one sentence in the test set that breaks off in the middle (2124).
- The train-test split seems to be far from random, but that's probably due to the fact that the training data had already been published. Still, I think it's somewhat problematic that the sentences in the test set appear to originate from only a few different Wikipedia articles (from different domains than those in the training set); more variation would be needed to interpret the results meaningfully.
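To quantify the first issue, here's a minimal sketch that counts the wiki-style italics marker (`''`) per split. The file names and the `sentence` column are assumptions on my part, so adjust them to the actual layout of the released files:

```python
import csv

def count_marker(path, column="sentence", marker="''"):
    """Return (number of sentences containing the marker, total marker occurrences)."""
    with_marker, occurrences = 0, 0
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            n = row[column].count(marker)
            if n:
                with_marker += 1
                occurrences += n
    return with_marker, occurrences

# hypothetical file names
for split in ("training.csv", "validation.csv", "test.csv"):
    print(split, count_marker(split))
```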
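For the second issue, something like the following should flag most of the affected sentences: a literal '?' glued to a word character (or a run of several '?') is almost never a genuine question mark in German text. Again, the file name and column name are just placeholders:

```python
import csv
import re

# '?' directly followed by a letter/digit, or two or more '?' in a row
suspicious = re.compile(r"\?\w|\?{2,}")

with open("training.csv", newline="", encoding="utf-8") as f:  # hypothetical name
    for i, row in enumerate(csv.DictReader(f)):
        if suspicious.search(row["sentence"]):
            print(i, row["sentence"])  # row index and affected sentence
```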