NeurIPS 2022 CausalML Challenge: Causal Insights for Learning Paths in Education Forum

> Task3&4: bug in preprocessing script (by sorting timestamp)

In `load_and_process_eedi_data` (`starting_kit/task_3_and_4/task_4_util.py`), the loaded training data is sorted row-wise by the columns ("UserId", "QuizSessionId", "Timestamp"):

```
loaded_data = (
    pd.read_csv(data_path, index_col=False)
    .sort_values(["UserId", "QuizSessionId", "Timestamp"], axis=0)
    .reset_index(drop=True)
)
```

However, sorting by "Timestamp" is not valid, since the column contains only minutes and seconds, not the full date and hour. For example:

In user 2's quiz session 7261, the timestamps of the questions as originally saved are '53:25.7', '53:53.8', ..., '59:54.7', '00:16.2', '02:22.9' (already in chronological order). After sorting on timestamp, however, the sequence in `loaded_data` becomes '00:16.2', '02:22.9', '53:25.7', ..., so the quiz appears to start with a checkout question.
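The reordering described above can be reproduced with a small toy frame. This is only a sketch: it uses the five truncated timestamps quoted in this post (the rows hidden behind "..." are left out), stored as plain strings, so the sort is lexicographic.

```python
import pandas as pd

# Toy frame built from the truncated timestamps quoted above;
# the rows elided by "..." in the post are not included.
df = pd.DataFrame(
    {
        "UserId": [2] * 5,
        "QuizSessionId": [7261] * 5,
        "Timestamp": ["53:25.7", "53:53.8", "59:54.7", "00:16.2", "02:22.9"],
    }
)

# Same sort key as in the starting kit snippet.
by_timestamp = (
    df.sort_values(["UserId", "QuizSessionId", "Timestamp"], axis=0)
    .reset_index(drop=True)
)

# The strings are compared character by character, so "00:..." sorts first
# even though it was the second-to-last event in the file.
first_after_sort = by_timestamp["Timestamp"].iloc[0]
print(by_timestamp["Timestamp"].tolist())
```

With minute:second strings, the rows that cross the hour boundary move to the front, which matches the behaviour reported here.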

A correct key might just be `.sort_values(["UserId", "QuizSessionId"], axis=0)`.

Posted by: overflow @ Sept. 12, 2022, 2:42 p.m.

The data we provided in `checkins_lessons_checkouts_training.csv` does include the date, not just the minutes and seconds. We realize that Excel automatically hides the date information when it opens the file, so we recommend using Pandas to handle this CSV file.
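As a rough sketch of what reading the file with Pandas could look like: the rows and the timestamp format below are made up for illustration (the exact format in the real file is not shown in this thread), and an in-memory string stands in for the CSV path.

```python
import io

import pandas as pd

# Made-up rows standing in for the real CSV; the actual timestamp
# format in the competition file may differ from this assumption.
csv_text = """UserId,QuizSessionId,Timestamp
2,7261,2020-01-15 10:59:54.7
2,7261,2020-01-15 11:00:16.2
2,7261,2020-01-15 10:53:25.7
"""

df = pd.read_csv(io.StringIO(csv_text), index_col=False)

# Convert the strings to real datetimes so that sorting is chronological
# rather than lexicographic.
df["Timestamp"] = pd.to_datetime(df["Timestamp"])

df = (
    df.sort_values(["UserId", "QuizSessionId", "Timestamp"], axis=0)
    .reset_index(drop=True)
)
print(df)
```

Once the column holds full datetimes, the event at 11:00:16 stays after the one at 10:59:54 instead of jumping to the front.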

Regarding the timestamp, participants should note that the original CSV file is not sorted by time, and that a session with a smaller ID does not necessarily happen before a session with a larger ID.
Another thing worth noting is that, because a student can resume a previous unfinished session, one session can sometimes take place inside another.
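One possible way to check both points on your own data is sketched below. The events are invented for illustration, and "inside" is read here as full containment of one session's time span in another's, which is only one interpretation.

```python
import pandas as pd

# Invented events: session 11 starts before session 10,
# and session 12 lies entirely within session 11.
events = pd.DataFrame(
    {
        "UserId": [1, 1, 1, 1, 1, 1],
        "QuizSessionId": [10, 10, 11, 11, 12, 12],
        "Timestamp": pd.to_datetime(
            [
                "2020-03-02 09:00:00", "2020-03-02 09:10:00",
                "2020-03-01 08:00:00", "2020-03-01 18:00:00",
                "2020-03-01 12:00:00", "2020-03-01 12:30:00",
            ]
        ),
    }
)

# First and last event of every session, ordered by start time, not by ID.
spans = (
    events.groupby(["UserId", "QuizSessionId"])["Timestamp"]
    .agg(start="min", end="max")
    .reset_index()
    .sort_values(["UserId", "start"])
    .reset_index(drop=True)
)
session_order = spans["QuizSessionId"].tolist()

# Pair every session with every other session of the same user and keep
# the pairs where the inner span is contained in the outer span.
pairs = spans.merge(spans, on="UserId", suffixes=("_inner", "_outer"))
nested = pairs[
    (pairs["QuizSessionId_inner"] != pairs["QuizSessionId_outer"])
    & (pairs["start_outer"] <= pairs["start_inner"])
    & (pairs["end_inner"] <= pairs["end_outer"])
]
nested_pairs = list(
    zip(
        nested["QuizSessionId_inner"].tolist(),
        nested["QuizSessionId_outer"].tolist(),
    )
)
print(session_order, nested_pairs)
```

The self-merge is quadratic in the number of sessions per user, so it is meant for inspection rather than as a preprocessing step.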

As for the starting kit, the strategy we adopted was chosen for simplicity, and it is far from an optimal way of processing the data.

Best

Posted by: WenboGong @ Sept. 15, 2022, 9:54 p.m.