REI - Regular Expression Inference Forum


> Validation set contamination

Hi,
I was analyzing the data, and it looks like the validation sets are proper subsets of the corresponding training sets. This is unusual (especially if the validation sets are meant to tune models), and not documented as far as I can tell.
Best regards,
Gabriel

Posted by: gbcolborne @ Nov. 10, 2023, 4:04 p.m.

Hi Gabriel,

Thanks for bringing this up.

We see how this is confusing; let us try to explain it better here:
As stated in the readme file in the starter kit, "dsX_train_public.csv" contains the training data for dataset X; that is, these contain *all* data that is disjoint from the corresponding test sets ("dsX_test_public.csv").

"dsX_dsY_train.csv" files were created by merging and shuffling the corresponding datasets X and Y (DS1+DS2, and DS3+DS4 respectively), and keeping 90% for training. The remaining 10% forms the validation data: "dsX_val.csv" and "dsY_val.csv" (per dataset) and "dsX_dsY_val.csv" (combined).
That is to say, the _val files do *not* correspond to the individual _train_public files, but to the merged and shuffled combined training data.

This is an artifact of how we train our ReGPT model. We always train on the combined datasets (DS1+DS2, or DS3+DS4), using the combined validation sets for early stopping during training. Then, we use the individual _val files to find the best ReGPT model per target dataset.
Our heuristic baselines do not require validation data in this fashion, so for them we always use the *full* datasets (dsX_train_public), which contain everything except test instances.
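To make the relationship between the files concrete, here is a minimal sketch of the split described above. The row contents, dataset sizes, and the exact 90/10 boundary handling are simplified assumptions, not the actual pipeline:

```python
import random

# Hypothetical stand-ins for the rows of dsX_train_public.csv and
# dsY_train_public.csv; real rows hold regex-inference examples.
ds1_rows = [("ds1", i) for i in range(100)]
ds2_rows = [("ds2", i) for i in range(100)]

# Merge the two datasets and shuffle (fixed seed only for reproducibility here).
combined = ds1_rows + ds2_rows
random.Random(0).shuffle(combined)

# Keep 90% for training (-> dsX_dsY_train), 10% for validation (-> dsX_dsY_val).
cut = int(0.9 * len(combined))
train, val = combined[:cut], combined[cut:]

# The per-dataset _val files are the combined val split, filtered by origin.
ds1_val = [r for r in val if r[0] == "ds1"]
ds2_val = [r for r in val if r[0] == "ds2"]

# This is the effect Gabriel observed: every row of ds1_val also appears in
# ds1_train_public, because the latter contains *all* non-test DS1 rows.
assert all(r in ds1_rows for r in ds1_val)
```

Note that under this construction dsX_val is disjoint from dsX_dsY_train, but not from dsX_train_public.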

We hope that clears it up, apologies for the confusion.

Best Regards,
REIC Team

Posted by: REIC @ Nov. 10, 2023, 5:51 p.m.

I understand, but I think it would have been better to first split individual datasets into train/val/test, then merge training and validation sets if required. I think it should also be mentioned in the README that if you are working on an individual dataset, then the validation set should not be used for tuning (unless you change the composition of the data).

Posted by: gbcolborne @ Nov. 10, 2023, 5:55 p.m.

As the saying goes, there are many ways to skin a cat.

We provide the data in this form mainly for reproducibility:
- Heuristic baselines use *all* available data for pattern extraction, except test data. This is contained in the _train_public files.
- Our ReGPT models are trained on combined data; this was merged, shuffled, and split into training and validation portions (dsX_dsY_train, dsX_dsY_val, and the individual dsX_val and dsY_val).

Where participants want to train models on combined data, we recommend using these splits.
Where participants want to train models that address a specific track, they can also use the merged training data by filtering for uniform (1-valued) or non-uniform cost functions.
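A minimal sketch of that per-track filtering, assuming each row carries its cost-function values in a `costs` field (the field name and row layout are assumptions; the real CSV columns may differ):

```python
# Hypothetical rows from a merged training file (dsX_dsY_train).
rows = [
    {"regex": "a+", "costs": [1, 1, 1, 1]},  # uniform: every operation costs 1
    {"regex": "b*", "costs": [2, 1, 3, 1]},  # non-uniform cost function
]

def is_uniform(costs):
    """A cost function is uniform when every cost value is 1."""
    return all(c == 1 for c in costs)

# Split the merged data back into the two tracks.
uniform_track = [r for r in rows if is_uniform(r["costs"])]
nonuniform_track = [r for r in rows if not is_uniform(r["costs"])]
```

Filtering this way recovers track-specific training data while keeping the published train/val split intact.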

Kind Regards,
REIC Team

Posted by: REIC @ Nov. 10, 2023, 6:06 p.m.

It's up to you, but if you want to keep the data as is, I think you should at least add those 2 tips at the end to the dataset documentation.
Best regards,
Gabriel

Posted by: gbcolborne @ Nov. 10, 2023, 6:14 p.m.

Thanks for the suggestions, we will update the readme to make it more explicit how the data was created and is intended to be used.

We might also update the data archive with names more reflective of intended use, and additional de-merged dsX_train files for individual tasks, corresponding to the dsX_val files.
Note, however, that DS1 and DS3 are quite small on their own (around 5.5k examples each in total, excluding only test instances), so adding at least the DS2 (respectively DS4) instances that also have uniform cost might be advisable.

Best,
REIC Team

Posted by: REIC @ Nov. 10, 2023, 6:28 p.m.

Sounds good, and thanks for that suggestion.
Best,
Gabriel

Posted by: gbcolborne @ Nov. 10, 2023, 6:32 p.m.