---
title: "Overview of survey data linkage"
author:
  name: Pierre Walthéry, Iraklis Kyritsis, Christina Magder
institute: UK Data Service
date: "February 2026"
date-format: MMMM YYYY
brand: _brand.yml
include-after-body:
  - text: |
      <script type="text/javascript">
      window.addEventListener('load', function() {
        var logo = document.querySelector('.slide-logo');
        var url = 'https://ukdataservice.ac.uk';
        logo.style.cursor = 'pointer';
        logo.addEventListener('click', function() {
          window.open(url, '_blank');
        });
      });
      </script>
format:
  revealjs:
    auto-stretch: false
    fig-cap-location: top
    scrollable: true
    title-slide-attributes:
      data-background-image: pics/ukri.png
      data-background-size: 15%
      data-background-opacity: "1"
      data-background-position: "50px 835px"
    logo:
      path: "pics/UKDS_Logos_Col_Grey_300dpi.png"
      href: "https://ukdataservice.ac.uk"
      alt: "UK Data Service logo"
    css: ukds25.css
    embed-resources: true
    pdf-max-pages-per-slide: 2
filters:
  - reveal-header
---
# The UK Data Service
## Who we are
- Five partner universities
    - UK Data Archive, University of Essex (lead partner)
    - Cathie Marsh Institute, University of Manchester
    - Jisc, University of Manchester
    - EDINA, University of Edinburgh
    - University College London
- 90+ staff
- Since 2012 (UKDA $\rightarrow$ 1967); curates national data since 2003
::: footer
Custom footer
:::
## What we do
- The main single point of access for UK social science data
- Secondary data collection, curation and access
- Training and user support
- Communication and user engagement
- Impact
- ... Key part of the UK social science research infrastructure, funded by the UKRI/ESRC
## Our data...
- UK social survey microdata:
    - Cross-sectional: large government and academic surveys
    - Longitudinal: major studies following people over time
- International data: survey data, [aggregate databases](https://ukdataservice.ac.uk/help/data-types/international-macrodata/)
- [Census tables and individual data](https://ukdataservice.ac.uk/learning-hub/census/resources/get-census-microdata/) – current and historical
- Business microdata and administrative data
- [Qualitative data](https://discover.ukdataservice.ac.uk/QualiBank): multimedia files and interview transcripts
##

## Our training in practice
1. [Webinars and online workshops](https://ukdataservice.ac.uk/training-events)
2. User Conferences: four main user conferences each year
3. Drop-in sessions: Survey, Computational Social Science and SecureLab
4. Online learning materials: find key resources on our [Learning Hub](https://ukdataservice.ac.uk/learning-hub/)
5. [Helpdesk](https://beta.ukdataservice.ac.uk/help) for individual data queries
6. Check out our [YouTube channel](https://www.youtube.com/ukdataservice)
# 1. Why a webinar series on data linkage?
## The world is changing
- Once upon a time: mostly censuses and surveys (lots!)
- Emerging / new kinds of data in the last 20 years
    1. Administrative data
    2. Digital trace data (e.g. social media, web...)
    3. Smart data (e.g. flow, device-generated)
- 2 and 3: genuinely new data
- 1: Digitalisation of records $\longrightarrow$ greater availability
- Increased demand for personal insights: [the Monitored Self](https://link.springer.com/chapter/10.1007/978-3-031-69944-3_8)
- Potential for new research avenues at a lower cost...
## Growing role of non-survey data {.smaller}
- Why interest is growing
    - New and previously unavailable measurements
    - Large scale, high frequency, low marginal cost
    - Attractiveness of 'harder' types of data...
    - ... particularly in social epidemiology and socio-economic research
- But:
    - Collected for administrative purposes, not research
    - Selective coverage $\longrightarrow$ population exclusions
    - Measurement error, changing definitions, policy artefacts
    - Limited socio-demographic and subjective information
## The survey landscape {.smaller}
- Data collection challenges...
    - Rising costs under tighter budgets
    - Recruitment is harder (reach, refusals, panel fatigue)
    - Alleviation comes at a cost: larger samples, incentives
    - Growth of online and mixed-mode designs
- ... But surveys remain essential:
    - Only source for attitudes, beliefs, motivations, well-being
    - Rich socio-demographic information not captured elsewhere: detailed occupation, social class, ethnicity
    - Theory-driven, validated instruments (e.g. GHQ)
    - Only tested tool for population-representative data, hard-to-reach groups and subgroup analysis
<!-- ## Increasing number of actors -->
<!-- - Large academic data producers: CLS; ISER; ONS; NatCen -->
<!-- - Linkage intermediaries: LLC, ADR, SAIL; SDR -->
<!-- - Data curator: UKDS -->
<!-- - Other consortiums: CLOSER -->
<!-- - Government departments; health and education providers -->
<!-- - $\longrightarrow$ not always easy for researchers and analysts to navigate the landscape -->
## In a nutshell {.smaller}
- Budgetary pressures on survey data
- Wealth of cheaper, but narrowly focused, often unrepresentative new forms of data
- Data integration is a win-win: potential to improve both kinds of data through validation and enhancement (Benzeval 2020)
- At the same time:
    - Linkage is still a limited (but growing) practice and few linked datasets are available for secondary research
    - Increased complexity of the data provision landscape
    - Need to adapt skills training and capacity building
# 2. Exploring <br> integrated data
## Working definition {.smaller}
- Combining different sources of data, i.e.:
    - Survey data $\leftrightsquigarrow$ survey data
    - Survey data $\leftrightsquigarrow$ non-survey data
    - Non-survey data $\leftrightsquigarrow$ non-survey data
- That include a shared unit of observation (individual, household, area...)
- ... In a coherent way in order to:
    - Validate or
    - ... enhance the original data
- Bidirectional
- In this presentation: linkage = integration
## Validation example {.smaller}
- Whiffen et al (2020) [How effective are population health surveys for estimating prevalence of chronic conditions compared to anonymised clinical data?](https://doi.org/10.23889/ijpds.v5i1.1151)
- Reliability of population survey-based estimates of chronic diseases
- Data linkage to validate prevalence of selected chronic conditions:
    - Angina, myocardial infarction, heart failure, and asthma
- Linked 11,323 adults from the 2013 and 2014 Welsh Health Survey to clinical data
    - Secure Anonymised Information Linkage (SAIL) Databank
- Results: quality depends on condition:
    - Less agreement for cardiovascular conditions, better for asthma
- Potentially cheaper
- But not devoid of technical difficulties
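
To make the idea of validating survey estimates against linked records concrete, here is a minimal sketch in Python. It is not taken from Whiffen et al: the toy data, the function name `agreement` and the choice of Cohen's kappa as the agreement measure are all assumptions made for illustration.

```python
def agreement(survey, clinical):
    """Compare two 0/1 indicators recorded for the same linked respondents."""
    if len(survey) != len(clinical) or not survey:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(survey)
    both = sum(1 for s, c in zip(survey, clinical) if s == 1 and c == 1)
    neither = sum(1 for s, c in zip(survey, clinical) if s == 0 and c == 0)
    observed = (both + neither) / n          # share of respondents where sources agree
    p_survey = sum(survey) / n               # prevalence according to the survey
    p_clinical = sum(clinical) / n           # prevalence according to clinical records
    # Agreement expected by chance if the two sources were independent
    expected = p_survey * p_clinical + (1 - p_survey) * (1 - p_clinical)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return {
        "survey_prevalence": p_survey,
        "clinical_prevalence": p_clinical,
        "observed_agreement": observed,
        "kappa": kappa,
    }


# Hypothetical toy data: 1 = condition reported / recorded, 0 = not
survey_asthma = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
clinical_asthma = [1, 1, 0, 0, 0, 0, 0, 1, 1, 0]

result = agreement(survey_asthma, clinical_asthma)
print(result)
```

In this toy example both sources give the same prevalence even though they disagree on two respondents, which is why agreement is worth checking at the individual level and not only in aggregate.
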
# Kinds of data linked to survey data
## Administrative data {.smaller}
- Usually arising from the interaction between:
    - A public organisation or body...
    - ... and the unit for which records are produced (e.g. people)
- Examples:
    - Registry data: birth, death, marriage records
    - Health records, educational transcripts
    - Government records: benefits, earnings/income
    - Financial reports, e.g. credit ratings, mortgage applications
- In the UK: enabled by the Digital Economy Act 2017:
  "*... de-identified data from government service providers, excluding NHS data, as part of their day-to-day functions, may be shared for public good research*"
<!-- ## School inspection data -->
<!-- - OFSTED 'State of the nation': anonymised data on latest schools inspections outcomes of 22,000 open schools -->
<!-- - Linked with the MCS, currently covers years 2005 to 2019 -->
<!-- - Data on a wide range of topics i.e.: -->
<!-- - Quality of teaching, learning and assessment -->
<!-- - Effectiveness of leadership and management -->
<!-- - Pupils' achievement (aggregated) (2005-2015) -->
<!-- - Behaviour and safety of pupils (2005-2015) -->
<!-- ::: aside -->
<!-- *More information* [Peters A et al (2025)](https://doc.ukdataservice.ac.uk/doc/9436/mrdoc/pdf/mcs_ofsted_user_guide_v1.pdf) -->
<!-- ::: -->
<!-- ## NEST pension data -->
<!-- - National Employment Saving Trust ie ain occupational (i.e. employer led) pensions scheme for UK employees -->
<!-- - Covers 1,000,000 employers, 11 millions employees -->
<!-- - Data about: -->
<!-- - Employer and employee characteristics -->
<!-- - Current pension status -->
<!-- - Pension contributions characteristics -->
<!-- ::: aside -->
<!-- *More information:* [ISER (2023)](https://doc.ukdataservice.ac.uk/doc/9127/mrdoc/pdf/9127_user_guide.pdf) -->
<!-- ::: -->
## Linking survey with digital trace data
- Arising from our presence online (web or apps)
- Increasing usage of linked survey and social media data
- Typical example: asking survey respondents to have their social media behaviour tracked
    - May reduce the cost of the survey (fewer questions to ask)
    - Subject to user consent: representativeness issues
    - Consent depends on the app, gender, etc.
- See the [DIGISURVOR](https://digisurvor.github.io/main/) project for an example of current research
    - Linkage of existing survey data with online participation in political discussions
## Smart data {.smaller}
- Fuzzier definition
    - Digital records held by private sector organisations
    - Often but not always device-recorded data
    - Not traditionally associated with social research
    - Flow / quasi-real-time data
- Examples:
    - Data from fitness trackers, smart watches, and in-car smart tech $\longrightarrow$ health and mobility research
    - Financial transactions by businesses and individuals, loyalty cards, purchase records (from banks, supermarkets) $\longrightarrow$ financial behaviour and resilience
    - Energy network data (distribution & consumption); EV usage and charging; smart meter readings $\longrightarrow$ building/household energy consumption $\longrightarrow$ net zero targets
<!-- ## (Bio measurement data) -->
<!-- - Collected as part large scale longitudinal surveys (often via nurse visit): -->
<!-- - Blood sample -->
<!-- - Epigenetic data -->
<!-- - Cortisol levels -->
<!-- - Is this integrated data? -->
# 3. Which survey data are most commonly linked?
<!-- ## Linked survey data: in theory -->
<!-- - Depends on: -->
<!-- - The topic covered by the data linked i.e. does it match common topics studied in surveys? -->
<!-- - The survey itself (i.e. does it include the required linking information / user consent) -->
<!-- - ... Scope of the surveys i.e. is linkage part of the original data collection, or is it a subsequent project? -->
<!-- - Means available from the data producer -->
## ... In practice
- Major longitudinal studies:
    - Birth cohort studies
    - Next Steps and ELSA
    - Understanding Society
- A few large-scale cross-sectional surveys such as:
    - ASHE (Annual Survey of Hours and Earnings)
    - Family Resources Survey
    - Scottish Health Survey (project)
## Birth cohort studies
- Follow a sample of individuals from birth onwards
- Four so far: people born in 1958, 1970, 2000, 2026
- Millennium Cohort Study (MCS)
    - \~19,000 children (born between September 2000 and January 2002)
    - 8 'sweeps': at 9 months, then at ages 3, 5, 7, 11, 14, 17 and 23
    - Parent and child interviews
    - Focuses on education, skills and health, truancy, cognitive ability, biological measurements
    - ... plus traditional socio-economic and demographic data
## Other cohort studies
- Next Steps
    - AKA Longitudinal Study of Young People in England
    - 16,000 people in England born in 1989-90, followed from secondary school age (i.e. 13-14) onwards
    - Set up by DfE to study determinants of school outcomes
- ELSA (English Longitudinal Study of Ageing):
    - Follows a sample of 19,000 people aged over 50 to understand all aspects of ageing in England
    - Started in 2002, biennial waves
    - Data on physical and mental health (incl. well-being), financial circumstances, and attitudes about ageing
<!-- ## Understanding Society -->
<!-- - Largest longitudinal study of the UK population -->
<!-- - Initial sample size: 40K households, 100K individuals -->
<!-- - 14 waves so far: 2009-23. Includes BHPS data 1991-2009 -->
<!-- - Ethnic minority boost samples, innovation panel -->
<!-- - Very wide range of topics covered: -->
<!-- - Family, partnerships, caring responsibilities, -->
<!-- - Expenditure, consumption, deprivation -->
<!-- - Social attitudes, values, political opinions -->
<!-- - Transport, mobility, and commuting patterns -->
<!-- - Environmental behaviours, and related attitudes -->
## Annual Survey of Hours and Earnings (ASHE)
- Produced yearly by the Office for National Statistics
- Sample drawn from National Insurance (NI) records: typical n = 135,000-190,000
- Small number of variables
- Very precise source of information for pay components and working hours
- Can be linked to other business surveys, as well as PAYE and pensions data
- Some data available via [Administrative Data Research UK](https://www.adruk.org/data-access/flagship-datasets/annual-survey-of-hours-and-earnings-linked-to-his-majestys-revenue-and-customs-england-scotland-and-wales/)
# A sample of the integrated datasets curated by UKDS
## Next Steps: Student Loans Data
- Data on higher education loans for Next Steps participants who provided consent to linkage in the age 25 sweep
- Information about:
    - Full Next Steps dataset +
    - Applications for student finance
    - Payment transactions & repayment details (via respondents' accounts)
    - Overseas assessment
- Also hospitalisation episodes data (SN8681)
## MCS: National Pupil Database {.smaller}
- Data for children in England whose carer gave consent
- Linked to the National Pupil Database and the Pupil Level Annual School Census
    - Pupil-level school census data from N1 to year 11 (2016/17)
    - KS1, KS2, KS4 and KS5 results (Years 2, 6, 11, 12 and 13)
    - Absence data from year 1 to year 11
    - School characteristics and school changes: N1 to year 11
    - Anonymised school identifiers (URN) and anonymised Local Education Authorities (LEA)
- Also available for Next Steps and Understanding Society
- Also linked with Ofsted Reports data
## Vacancy Survey 2005-2025
- Statutory, monthly survey of ~6,000 GB businesses
- Single question:
    - How many job vacancies the business has for which it is actively seeking recruits from outside the organisation
- Sample drawn from the Inter-Departmental Business Register (IDBR; compiled from HMRC VAT and PAYE registers)
- Via linkage: SIC code (industrial activity classification), number of employees
- Additional linkage via the IDBR possible, including ASHE
## Hospital episodes data and the NCDS
- NHS data about all hospital admissions in England
- Four datasets:
    - Episodes of care in: Accident and Emergency, Admitted Patient Care, Adult Critical Care, Outpatients
- Mostly available for 2007/9-2023
- Data on diagnosis, maternity, mortality, mental health, length of treatment, deprivation etc.
- Available for the NCDS Birth Cohort
::: aside
*More information*: [Kerry-Barnard et al (2025)](https://doc.ukdataservice.ac.uk/doc/8697/mrdoc/pdf/ncds_hes_user_guide_2025_v3.pdf)
:::
# 4. Data integration and skills requirements
## Survey analysis skills
- Traditional deterministic matching (i.e. merging)
    - The simplest case: individual-level data matched to individual-level data via an unambiguous identifier
    - The same holds at the aggregate level (for example, matching smart sensor data at small-area level)
- Probabilistic matching
    - When the sources use separate IDs
    - When the data are not clean
- Statistical inference & non-random samples
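
The difference between the two matching approaches can be sketched in a few lines of Python. Everything below is illustrative: the records, the field names (`pid`, `name`, `dob`), the 0.6/0.4 weights and the 0.85 threshold are assumptions, and a simple weighted similarity score stands in for a formal probabilistic linkage model.

```python
from difflib import SequenceMatcher

# Hypothetical survey respondents and administrative records
survey = [
    {"pid": 101, "name": "Ann Smith", "dob": "1970-03-02"},
    {"pid": 102, "name": "Bob Jones", "dob": "1958-11-20"},
    {"pid": 103, "name": "Cara Patel", "dob": "2000-09-14"},
]
admin = [
    {"pid": 101, "name": "Ann Smith", "dob": "1970-03-02", "admissions": 2},
    {"pid": None, "name": "Bob Jonse", "dob": "1958-11-20", "admissions": 1},
    {"pid": None, "name": "Dan Brown", "dob": "2000-09-14", "admissions": 4},
]


def deterministic_link(left, right, key):
    """Exact merge: keep pairs that share the same non-missing identifier."""
    index = {rec[key]: rec for rec in right if rec[key] is not None}
    return [(rec, index[rec[key]]) for rec in left if rec[key] in index]


def match_score(a, b):
    """Crude similarity: weighted name similarity plus date-of-birth agreement."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    dob_same = 1.0 if a["dob"] == b["dob"] else 0.0
    return 0.6 * name_sim + 0.4 * dob_same


def probabilistic_link(left, right, threshold=0.85):
    """For each left record, keep its best candidate if the score is high enough."""
    links = []
    for rec in left:
        best = max(right, key=lambda cand: match_score(rec, cand))
        score = match_score(rec, best)
        if score >= threshold:
            links.append((rec, best, round(score, 2)))
    return links


exact = deterministic_link(survey, admin, "pid")
fuzzy = probabilistic_link(survey, admin)
print(len(exact), "exact link(s);", len(fuzzy), "score-based link(s)")
```

The exact merge only recovers the record with a shared identifier, while the score-based approach also picks up the misspelt name. The third respondent stays unlinked because a matching date of birth alone does not reach the threshold.
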
## Emerging skills {.smaller}
- Computational skills
    - Non-survey data cleaning (Pandas, Tidyverse)
    - Web scraping (Python/R)
    - API queries (social media apps: X, Reddit...)
    - Pattern detection, e.g. random forests
- Regulatory knowledge
    - Data protection & GDPR - prerequisite $\longrightarrow$ UKDS Safe Researcher Training
    - Departmental regulations for government data
- Institutional/procedural: i.e. how to engage with the data matching intermediaries
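
As an illustration of the cleaning step listed under computational skills, the sketch below standardises free-text names and postcodes before any matching is attempted. It uses only the Python standard library rather than Pandas or Tidyverse, and the function names and cleaning rules are assumptions for illustration, not an established standard.

```python
import re
import unicodedata


def clean_name(raw):
    """Lower-case, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(text.split())


def clean_postcode(raw):
    """Upper-case a UK-style postcode with one space before the last three characters."""
    compact = re.sub(r"\s+", "", raw).upper()
    if len(compact) < 5:
        return None  # too short to be a full postcode: treat as missing
    return compact[:-3] + " " + compact[-3:]


# Hypothetical messy inputs
print(clean_name("  Zoë  SMITH-Jones "))
print(clean_postcode("ab1  2cd"))
```

Applying the same rules to both sources before linkage reduces the number of records that fail to match purely because of formatting differences.
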
# 5. Who's who in the data integration landscape
## Data producers
- Involved in data matching (e.g. consent management)
- Data producers of the main longitudinal studies, i.e.
    - Understanding Society (ISER)
    - Main cohort studies (CLS)
- Government departments and the ONS
- Private sector organisations
## Administrative Data Research UK (ADR UK)
- Consortium of organisations, including the ONS, devolved governments and academic partners
- Mission:
    - Link and open up de-identified administrative data
    - Make it securely available to accredited researchers
- Point of access for new data linkage within the public sector and between the public sector and researchers
## UK Longitudinal Linkage Collaboration (UK LLC)
- Trusted Research Environment (TRE)
- Currently enables linkage between longitudinal studies and data from:
    - NHS England
    - Neighbourhood geographies
    - Address geographies
- In preparation: NHS Wales, Department for Work and Pensions, HM Revenue and Customs data
<!-- ## UK Data service -->
<!-- - Collection, curation and access to linked survey data -->
<!-- - Main gateway for secondary survey data analysis -->
<!-- - Curates some linked data -->
<!-- - Trusted Research environment -->
## Other intermediaries (not exhaustive) {.smaller}
- Smart Data Research UK
    - Gathers smart data from producers, e.g. financial data, energy data, smart devices...
    - Makes it available to the research community
    - Organised by kind of smart data
- Secure Anonymised Information Linkage (SAIL)
    - Wales-based but also provides access to UK data, (mostly) health-related
    - Trusted Research Environment (TRE)
- CeLSIUS
    - Census data linkage
# 6. Data linkage at UKDS: roles, routes, and researcher options
## The UK Data Service and data linkage: <br> core principles
- UKDS does not create linkages or integrate data
- Linked data are created by data owners or processors
- UKDS negotiates access to these data collections and makes them research-ready and safely accessible
- The type of linkage researchers can undertake depends on:
    - the access level (Open / Safeguarded / Controlled)
    - the presence or absence of identifiers
## What researchers can do in UKDS SecureLab {.smaller}
- Researchers can:
    - Access more granular variables
    - Create derived or contextual linkages, for example:
        - Environmental or pollution deciles based on postcode-derived measures
        - Area-level deprivation or service access indicators
    - Import external datasets subject to depositor approval
- Key considerations:
    - UKDS SecureLab does not host direct identifiers
    - Researchers cannot create linkage spines or perform identifier-based matching
    - All linkage activity must be explicitly approved as part of the project
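
A contextual linkage of the kind described above can be sketched as follows. This is a minimal illustration under stated assumptions: the area codes, pollution values and variable names are hypothetical, and it says nothing about the software or approval procedures that apply inside SecureLab. Respondents are linked to an area-level measure through an area code, not through a personal identifier.

```python
from bisect import bisect_left
from statistics import quantiles

# Hypothetical area-level pollution measure, keyed by area code
area_pollution = {
    "A01": 6.1, "A02": 7.4, "A03": 8.0, "A04": 9.3, "A05": 10.2,
    "A06": 11.8, "A07": 12.5, "A08": 13.9, "A09": 15.0, "A10": 17.6,
}

# Nine cut points splitting the areas into ten groups
cuts = quantiles(area_pollution.values(), n=10)

# Decile 1 = least polluted areas, decile 10 = most polluted
area_decile = {
    code: bisect_left(cuts, value) + 1 for code, value in area_pollution.items()
}

# Hypothetical survey respondents carrying only an area code
respondents = [
    {"pid": 1, "area": "A02", "wellbeing": 7},
    {"pid": 2, "area": "A10", "wellbeing": 5},
    {"pid": 3, "area": "Z99", "wellbeing": 8},  # area missing from the lookup
]

# Attach the decile; respondents in unknown areas get None
linked = [
    {**rec, "pollution_decile": area_decile.get(rec["area"])} for rec in respondents
]
for rec in linked:
    print(rec)
```

Attaching the decile rather than the raw measure keeps the contextual variable coarse, which is one common way to limit disclosure risk when working with area-level information.
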
# 7. How to access linked data at UKDS
## Conclusion so far
- Potential for exciting new research, some of which is already happening
- Dynamic landscape, changes in the role of actors likely to take place
- Need for investigating capacity building/skills training
- Kind of data left out for now: Census (longitudinal & administrative linkage)
- Please follow us for additional webinars on digital trace, administrative and smart data
## References {.smaller}
Millennium Cohort Study: Linked Education Administrative Datasets (National Pupil Database - KS1-KS5), England, 2003-2021: [Secure Access](https://datacatalogue.ukdataservice.ac.uk/series/series/2000031)

Next Steps: Linked Administrative Datasets (Student Loans Company Records), 2007-2021: [Secure Access](https://datacatalogue.ukdataservice.ac.uk/studies/study/8848)

Vacancy Survey, 2005-2025: [Secure Access](https://datacatalogue.ukdataservice.ac.uk/studies/study/7421)

Grant, P. (2024) [The Monitored Self](https://doi.org/10.1007/978-3-031-69944-3_8). In: The Virtual Hospital. Springer, Cham.

Kerry-Barnard, S., Mohamad Zaki, N.H., Gomes, D., Ploubidis, G., Sanchez-Galvez, A. (2025) [National Child Development Study: A guide to the linked health administrative datasets – Hospital Episode Statistics (HES)](https://doc.ukdataservice.ac.uk/doc/8697/mrdoc/pdf/ncds_hes_user_guide_2025_v3.pdf). User Guide (Version 3). London: UCL Centre for Longitudinal Studies.

Peters, A., Sanchez-Galvez, A., Fitzsimons, E., Gomes, D. (2025) [Millennium Cohort Study: Linked education administrative datasets - Ofsted User Guide (Version 1)](https://doc.ukdataservice.ac.uk/doc/9436/mrdoc/pdf/mcs_ofsted_user_guide_v1.pdf). London: UCL Centre for Longitudinal Studies.

Silber, H., Breuer, J., Beuthner, C., Gummer, T., Keusch, F., Siegers, P., ... Weiß, B. (2022) [Linking Surveys and Digital Trace Data: Insights From two Studies on Determinants of Data Sharing Behaviour](https://doi.org/10.1111/rssa.12954). Journal of the Royal Statistical Society, Series A (Statistics in Society), 185(Suppl. 2), 387-407.

Whiffen, T., Akbari, A., Paget, T., Lowe, S., Lyons, R. (2020) [How effective are population health surveys for estimating prevalence of chronic conditions compared to anonymised clinical data?](https://doi.org/10.23889/ijpds.v5i1.1151). International Journal of Population Data Science (IJPDS), 5(1).