I would push back pretty strongly on the premise here, because one of the best lessons to teach your students (not just for analysing experimental data like this, but for any data science role they might end up in) is this: don’t edit the raw data.
The more positive converse of that is that all data manipulations should be done in code, so that they are documented and allow for reproducibility. This also allows them to be reversed or changed.
If you are using R, then the first steps in a data analysis pipeline might look like this:
library(dplyr)  # provides the %>% pipe used below

dat = readr::read_csv('your_data_file.csv') %>%
  # select only variables of interest:
  dplyr::select(-psychopy_version, some_var:some_other_var) %>%
  # drop subject who was colour-blind:
  dplyr::filter(subject_id != '014')
Anyone reading this can see that only some variables are of interest, but can also see easily how and where to reinstate others if required. They can also see that a principled decision was made to drop one subject.
Students shouldn’t be mollycoddled into working only with simple datasets. They need to learn that the first step in any real data analysis is to tidy the data as needed and, just as importantly, to document those steps in their code. For example, rather than deleting the raw data of the invalid subject at source, we explicitly drop those records and note why.
If students learn that raw data files can be edited, what is to stop them dropping cases or observations at that step, or transforming variables? Because the .csv files don’t document those changes, they become both invisible and permanent.

This extends to variables that don’t immediately seem useful, like the PsychoPy version column. If it was indeed constant during your experiment, then it won’t play any role in your analysis, so it can simply be dropped at the analysis stage with a single function call. But it might be useful to someone else if it is retained in the raw data. Let’s say you release your edited data files publicly, and someone else finds that their results differ from yours. For example, PsychoPy Builder recently shifted from using the event module to record reaction times to the Keyboard class. The former would often have a 16.7 ms granularity, while the latter is sub-millisecond. Seeing that your data was recorded with a newer version of PsychoPy would help explain the discrepancy. And do you really know that the version was constant during the study? What if a helpful student or technician upgraded PsychoPy during the experiment, leading to inconsistencies in the data? This becomes impossible to check if the column has been removed from the files to be analysed.

We often see people saying that they aren’t interested in all of the various trial and block numbers that PsychoPy records, as they just want to calculate within-subject totals or means. That’s all fine, until a reviewer asks if there was a learning effect going on within blocks… if those columns have been dropped from the data files, it becomes impossible to answer the question.
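As a rough sketch of the kind of checks that stay possible when those columns are left in the raw files, something like the following could be used. The small data frame here is made up purely for illustration, and the column names (psychopy_version, block, trial, rt) are assumptions rather than the names your own files will necessarily use.

```r
library(dplyr)

# Made-up stand-in for data read from the untouched raw files.
# All column names and values here are hypothetical.
dat <- tibble(
  subject_id       = rep(c("001", "002"), each = 4),
  psychopy_version = rep(c("2021.1.0", "2021.2.0"), each = 4),
  block            = rep(c(1, 1, 2, 2), times = 2),
  trial            = rep(c(1, 2), times = 4),
  rt               = c(0.60, 0.50, 0.58, 0.49, 0.62, 0.53, 0.59, 0.51)
)

# How many subjects were recorded under each PsychoPy version?
version_check <- dat %>%
  distinct(subject_id, psychopy_version) %>%
  count(psychopy_version, name = "n_subjects")

if (nrow(version_check) > 1) {
  message("More than one PsychoPy version found in the raw data.")
}

# Mean reaction time by trial position within a block,
# as a first look at a possible learning effect.
learning <- dat %>%
  group_by(trial) %>%
  summarise(mean_rt = mean(rt), .groups = "drop")

print(version_check)
print(learning)
```

Neither check is possible once the version, block, or trial columns have been deleted from the source files.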
And if students learn that columns can be dropped from raw data files, they might also think it’s OK to transform the remaining columns, to save time in the future. If someone else then analyses those files, the transformation might be applied twice. And if cases are dropped, there is no way of knowing how many or why. All of these issues disappear if one applies a rigorous rule that all such data manipulation should happen within code, so that it is documented, and can be reversed if required.
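To make that concrete, here is a minimal sketch of doing the exclusion and the transformation in code instead of in the file. The data and the log transform are invented for illustration (only the colour-blind subject '014' comes from the example earlier), and log_rt is a hypothetical column name.

```r
library(dplyr)

# Made-up stand-in for the raw data; values are hypothetical.
raw <- tibble(
  subject_id = c("013", "013", "014", "014", "015", "015"),
  rt         = c(0.512, 0.478, 0.601, 0.655, 0.530, 0.495)
)

cleaned <- raw %>%
  # drop subject who was colour-blind:
  filter(subject_id != "014") %>%
  # keep the recorded rt and store the transformed values under a new name,
  # so it is always clear whether the transform has been applied:
  mutate(log_rt = log(rt))

# The number of dropped rows can be reported rather than guessed at.
n_dropped <- nrow(raw) - nrow(cleaned)
message(n_dropped, " of ", nrow(raw), " rows dropped")
```

Because raw is never modified, removing the filter() line or changing the transformation is a one-line edit, and anyone reading the script can see exactly what was done.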
Sorry, this was a bit long, but I’ve found time and time again that it is a useful principle to use and to teach. But also, it’s fundamentally easier to have a single select() function that drops any unneeded columns, rather than go through the hassle of processing your source csv files, and potentially having to re-do that process if your needs change.