---
title: Code & Data
subtitle: Experiences of Learning to Code
date: last-modified
format-links: false
number-sections: false
toc: false
metadata-files:
- metadata/authors-joe.yml
- metadata/meta.yml
abstract: |
  This page describes and locates the various code and data artifacts produced during the project.
---
### Available code & data
- [JSON file](https://github.com/ExpLrnCode-2024/jisc-surveys/tree/main/jisc_export/survey.json) with which the Jisc survey can be reconstructed.
- Survey data (to do)
- Code for analysing and visualising survey data (to do)
- [Python tool](https://github.com/ExpLrnCode-2024/teams-transcript-formatter) used to format the interview transcripts.
### Why not publish the interview transcripts?
Where research involves real human lives, there can be a tension between the transparency and reproducibility goals of the researcher, and the obligation to take every reasonable precaution to protect the privacy and security of the people involved.
We intended to strike a balance by publishing a collection of the most relevant sections of the interviews, rather than the interviews in their entirety, after redacting potentially identifiable or irrelevant information.
However, upon reflection we have decided not to publish this dataset, for the following reasons:
#### Ambiguity in the informed consent form
The [informed consent form](jisc-surveys/participant_info.html), signed by all participants prior to their interview, gives permission to publish _"sections of the interview"_ in _"research outputs and websites"_.
There are two main issues with this.
First, students might reasonably assume that "research outputs" means communication documents such as articles and websites; we should have explicitly included "dataset" in the list of potential outputs if that was our intention.
Second, a dataset that is essentially "the interview, minus sensitive or irrelevant information and false starts" is _technically_ "sections of the interview", but it does not feel reasonable to claim that the informed consent form covers it.
We could heavily cut down on the number and length of interview sections included in the dataset, but the value of this 'dataset' as a research object falls off very quickly as the context surrounding each section of the interview is stripped away.
#### Automated analyses
Our motivation for publishing the collection of interview sections was the hope that other researchers might perform their own analysis, uncovering any aspects we missed or insights that pertain to a different research question than ours.
However, even in the last year the research landscape has shifted in such a way that replacing qualitative analysis with Large-Language-Model summaries is considered not only acceptable but innovative, at least by some.
I do not share this view, but this is largely irrelevant since the participants did not consent to their data being processed in this way.
We suspect that a dataset such as this, made open-access in a convenient plain text form, would be attractive to individuals looking to do automated analysis --- probably far more so than it would be for researchers who prefer traditional methods.
#### Risk of identifiability
We made a significant effort to redact all sensitive or potentially identifiable information from the interview transcripts prior to carrying out our main analysis.
We are confident that a human would find it extremely difficult to identify an individual based on reading the redacted transcripts.
However, publishing the dataset has the unfortunate side effect of making it available to data ingestion engines. This increases the risk that an individual may be identified through correlations between the information in the transcript and other information online.
{{< include metadata/_endmatter.qmd >}}