Are the dev scores supposed to be generated by the models trained on the corresponding protocols, and then merged together just for submission?