---
title: Statistical inference using weights and survey design
subtitle: 3. R and Stata examples
author:
  name: Pierre Walthéry
institute: UK Data Service

date: "October 2025"
date-format: MMMM YYYY
brand: _brand.yml
include-after-body:
  - text: |
      <script type="text/javascript">
          window.addEventListener('load', function() {
            var logo = document.querySelector('.slide-logo');
            var url = 'https://ukdataservice.ac.uk';
            logo.style.cursor = 'pointer';
            logo.addEventListener('click', function() {
              window.open(url, '_blank');
          });
        });
      </script>

format:
  revealjs:
    scrollable: true
    logo:
      path: "pics/UKDS_Logos_Col_Grey_300dpi.png"
      href: "https://ukdataservice.ac.uk"
      alt: "UK Data Service logo"
    css: ukds25.css
    embed-resources: true
    smaller: true
filters:
  - reveal-header

execute:
  echo: true
#jupyter: nbstata
---
## Plan
 <!-- If rendering this script returns an error, -->
 <!-- run `quarto add shafayetShafee/reveal-header` -->
 <!-- in the project directory. -->
-   Three sessions:

  1. Survey design: a refresher

  2. Inference in theory and practice

  **3.  R and Stata examples**

## This session

1.  R vs Stata speak
2.  Means, proportions, CIs
3.  When survey design variables are not present

##  Data requirements

- An up-to-date copy of R (with the `dplyr`, `haven`, `survey` and `Hmisc` packages installed) or Stata
- An active account with UKDS (for downloading the datasets)
- We will  use data from:

  - The  [2017 British Social Attitudes Survey (BSA)](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=8450)

  - The [April-June 2022 Quarterly Labour Force Survey](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=8999#!/access-data)


# 1.  R vs Stata speak
## Survey design estimation in R
- The R `survey` package [@Lumley2023] provides a comprehensive set of functions for computing point and variance estimates from survey data.

- Overall logic:
  - Install the `survey` package and load it into memory
  - Declare the survey design, i.e. create a `svydesign` object from the data

```
         mydata.s<-svydesign(ids=~psu_var,
                             strata=~strata_var,
                             weights=~weight_var,
                             data=mydata)
```

  - Compute estimates with survey-specific functions: `svymean(~myvar,mydata.s)`, `svytable(~myvar,mydata.s)`, etc.
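
As a minimal, self-contained sketch of these steps, the example below uses the `apiclus1` demonstration data shipped with the `survey` package rather than the workshop datasets. In that dataset `dnum` (school district) identifies the clusters and `pw` holds the sampling weights; the object names `api.s` and `api.mean` are arbitrary.

```r
library(survey)

data(api)                                  # loads apiclus1 and other demo datasets

# Step 1: declare the design (one-stage cluster sample of school districts)
api.s <- svydesign(ids = ~dnum,            # cluster (PSU) identifier
                   weights = ~pw,          # sampling weights
                   data = apiclus1)

# Step 2: estimate with survey-specific functions
api.mean <- svymean(~api00, api.s)         # mean and design-based standard error
api.mean
svytable(~stype, api.s)                    # weighted frequencies by school type
```

The same two steps apply to any dataset once the PSU, strata and weight variables have been identified.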

## Command-based weighting in R

- R  does not provide a unified set of functions for computing command-based weighted estimates.
- Implementation of weighting may vary between packages, but algorithms are usually  described in detail in  package documentation.
- Base R offers few weight-aware functions for descriptive statistics, chiefly `weighted.mean()`
- The Hmisc package offers a more comprehensive set of weighted estimation functions:

  - `wtd.mean()`
  - `wtd.var()`
  - `wtd.quantile()`

- Confidence intervals and standard errors still have to be computed manually
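
A rough sketch of what 'computing manually' involves, using base R only. The values and weights below are made up for illustration, and the weighted variance is a hand-rolled formula (weights rescaled to sum to the sample size), not necessarily the one implemented in `Hmisc::wtd.var()`.

```r
# Toy data: values and weights are invented for illustration
x <- c(23, 35, 47, 52, 68)
w <- c(1.2, 0.8, 1.5, 0.6, 0.9)

m <- weighted.mean(x, w)                   # base R weighted mean

# Hand-rolled weighted variance, weights rescaled to sum to n
wn <- w * length(x) / sum(w)
v  <- sum(wn * (x - m)^2) / (sum(wn) - 1)

se <- sqrt(v / length(x))                  # naive SE: ignores the survey design
ci <- c(mean = m, lower = m - 1.96 * se, upper = m + 1.96 * se)
round(ci, 1)
```

The standard error obtained this way treats observations as independent draws, so it does not reflect clustering or stratification.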

## Survey design in Stata

- Stata provides comprehensive support for computing survey design-informed  estimates from survey data
- Implementation logic similar to R:

  - Declare the survey design using `svyset`

    - `svyset psu_var [pweight=weight_var], strata(strata_var)`

  - Use `svy:` - prefixed commands for estimation:

    - `svy:mean myvar`, `svy:tab myvar` etc...

## Stata command-based weighting  - 1

- Users may add sampling weights to most Stata estimation commands, or use survey-specific commands. The latter is recommended.

- Stata distinguishes between four kinds of weights:

  - frequency weights (`fweight`),
  - analytic weights (`aweight`),
  - importance weights (`iweight`) and
  - probability weights (`pweight`).

- These mostly differ in the way standard errors are computed


## Stata command-based weighting  - 2

- Survey weights should be treated as *probability weights* or `pw`.

- Key estimation commands, such as `summarize` or `tab`, do not allow `pw`: this is to nudge users to rely on the `svy:` commands instead.
- 'On the fly' weighting (i.e. not using survey design functions) in Stata consists of specifying the weighting variable between square brackets.

    - `stata_command myvar [pw=weight_var]`

- It is tempting to specify the wrong kind of weights (`fw` or `aw`) instead if one does not wish to use the survey design functions. You may get the correct point estimates, but your standard errors are likely to be incorrect. Do this at your own risk.

# 2.  Means, proportions, CIs


## Identifying the survey design

We first need to find out about the survey design that was used in the 2017 BSA, and the design variables that are made available in the dataset. Such information can usually be found in the documentation that comes together with the data under the `mrdoc/pdf` folder.

**Question 1**

- What is the design that was used in this survey (i.e. how many stages were there, and what were the units sampled)?
- What were the primary sampling units and the strata (if relevant)?


## Finding  the survey design variables

Now that we are a bit more familiar with the way the survey was designed, we need to identify the design variables available in the dataset. The information can usually be found in the user manual or the data dictionary available under `mrdoc/ukda_data_dictionaries.zip`. The file may need to be decompressed separately.

**Question 2**

- What survey design variables are available?
- Are there any that are missing -- if so which ones?
- What is the name of the weights variable?

## Preparing the data - R

```{r}
rm(list=ls())
library(dplyr)                                                              ### Data manipulation functions
library(haven)                                                              ### Importing stata/SPSS files
library(Hmisc)                                                              ### Extra statistical functions
library(survey)                                                             ### Survey design functions

setwd("~/OneDrive/trainings/Inference_workshops/Manchester/")               ### Edit as appropriate
datadir<-"~/Data/"                                                          ### Edit as appropriate
bsa17<-read_dta(paste0(datadir,"bsa/UKDA-8450-stata/bsa2017_for_ukda.dta")) ### Importing Stata format dataset
dim(bsa17)
```

- We can now specify the survey design using:

  - `Spoint` as Primary Sampling Unit ids,
  - `StratID` as strata ids,
  - `WtFactor` as weights.

##  Specifying the survey design in R

- We create a `svydesign` object, i.e. a survey design informed copy of the data, which will be used for  subsequent estimation.

```{r 5_2}
bsa17.s<-svydesign(ids=~Spoint,
                   strata=~StratID,
                   weights=~WtFactor,
                   data=bsa17)        ### Specifying the survey design
class(bsa17.s)                        ### Examining the svydesign object
```

- We can inspect the content of `bsa17.s`. Do this at your own risk:

```{r 5_21}
summary(bsa17.s)                      ### ... And looking at its content

```

## Mean age and its 95% CI
- We can now produce a first set of estimates using the  survey design information. We begin with the mean age of respondents.

- We will need to use `svymean()`
```{r mean}
svymean(~RAgeE,bsa17.s)
```
 By default  `svymean()` computes the standard error of the mean. We need to
 embed it within `confint()` in order to get a confidence interval.

```{r 5_3}
confint(                                  ### Computing the confidence interval...
  svymean(~RAgeE,bsa17.s)                 ### And the mean
  )

round(                                    ### Rounding the results to one decimal place
  c(
    svymean(~RAgeE,bsa17.s),              ### Computing the mean...
    confint(svymean(~RAgeE,bsa17.s))      ### And its 95% CI
    ),
  1)
```
## Question 3
- What would be the consequences of:

  - weighting but not accounting for the sample design;
  - neither using weights nor accounting for the sample design?

- When:
  - inferring the mean age in the population?
  - computing the uncertainty  of this  estimate?

## Answer 1: command-based weighting

- We need to compute means and CI separately:

```{r}
a.m<-wtd.mean(                               ### Weighted mean function from the
  bsa17$RAgeE,                               ### Hmisc package
  bsa17$WtFactor,
  normwt = T)                                ### Option specific to survey weights

### Computation of the standard error by hand...
a.se<-sqrt(
          wtd.var(bsa17$RAgeE,               ### ... using the weighted variance function from Hmisc
                  bsa17$WtFactor,
                  normwt = T)
          )/
       sqrt(
         nrow(bsa17)                         ### ... shortcut to sample size
         )

c(a.m,                                       ### Concatenating the mean..
  a.m-1.96*a.se,                             ### ... the lower bound of the CI
  a.m+1.96*a.se)                             ### ... and the upper bound
```


## Answer 1 (continued):  unweighted estimates


```{r}
ua.m<-mean(bsa17$RAgeE)                     ### mean() function from R Base

ua.se<-sd(bsa17$RAgeE)/                     ### ... standard error
      sqrt(nrow(bsa17))

c(ua.m,                                     ### and putting it all together
  ua.m-1.96*ua.se,
  ua.m+1.96*ua.se
  )

```

## Answer - 2

- Not using weights results in overestimating the mean age in the population (of those aged 18+) by about 4 years.
- This might be because older respondents are more likely to take part in surveys.
- Using 'on the fly' weighting does not alter the value of the estimated population mean when compared with survey design-informed estimates...
- ... but would lead us to overestimate the precision (i.e. underestimate the uncertainty) of our estimate, by about plus or minus 3 months.
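
The same pattern can be reproduced without access to the BSA. The sketch below uses the `apiclus1` demonstration data from the `survey` package (not the workshop data) and compares a design that declares the clusters with one that only declares the weights. All object names are arbitrary.

```r
library(survey)
data(api)

# Design-informed: clusters (school districts) and weights declared
api.clus  <- svydesign(ids = ~dnum, weights = ~pw, data = apiclus1)
# Weights only: observations treated as if sampled independently
api.wonly <- svydesign(ids = ~1, weights = ~pw, data = apiclus1)

m.clus  <- svymean(~api00, api.clus)
m.wonly <- svymean(~api00, api.wonly)

# Same point estimate, different standard errors
rbind(design  = c(mean = as.numeric(coef(m.clus)),  se = as.numeric(SE(m.clus))),
      weights = c(mean = as.numeric(coef(m.wonly)), se = as.numeric(SE(m.wonly))))
```

In this demonstration dataset, ignoring the clustering yields a smaller standard error, i.e. an overstated precision.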

## Stata version
- Opening the dataset and declaring the survey design (scroll down for full output)

```{*** stata_1,echo=F}

use ~/Data/bsa/UKDA-8450-stata/bsa2017_for_ukda.dta,clear

svyset Spoint [pw=WtFactor], strata(StratID)

* Computing the survey design-informed  version of the mean...

svy: mean RAgeE

* And the other two:

mean RAgeE [pw=WtFactor]
mean RAgeE
```

## Computing a proportion and its 95% confidence interval
- We can similarly estimate the distribution of a categorical variable in the population by estimating proportions (or percentages)
- Let's look at the proportion of people who declare that they are interested in politics.
- This is the `Politics` variable in the BSA.
- It has five categories ranging from 1: 'A great deal' to 5: 'Not at all'.
- We could combine categories 1 and 2, i.e. `A great deal` and `quite a lot`, into `Significantly`, but here we will select the relevant values on the fly, as R allows this.

## Let's explore the variable
- Phrasing of the question:

```{r 5_4_1}
attr(bsa17$Politics,"label")
```

- Value labels:

```{r 5_4_2}
attr(bsa17$Politics,"labels")
```

- Sample distribution

```{r 5_4_3}
table(
  droplevels(                            ### We are using droplevels() in order to hide categories
             as_factor(bsa17$Politics)   ### ... without any observations
             )
  )
```

##
- Neater output

```{r 5_5}
round(
  100*
    prop.table(
      svytable(~(Politics==1 | Politics==2),bsa17.s)
      ),
  1)
```

- Let us now estimate the confidence intervals for  these proportions.  Software like Stata or SPSS  usually doesn't  show us  what is happening under the bonnet.  R requires more coding, but also gives a better understanding of what we are actually estimating.

- Confidence intervals for the proportions of a categorical variable are in fact computed as a sequence of binomial (dichotomous) estimations, i.e. one for each category.

##
- In R we  specify this via `svyciprop()` and `I()`:

  - The former computes the proportion and its confidence interval (by default 95%)...
  - ... whereas the latter allows us to define the category of interest of a polytomous variable.
  - As before, we could have used a recoded dichotomic variable instead

```{r 5_6_1}
p<-svyciprop(
            ~I(Politics==1 | Politics==2),
            bsa17.s)
p
```

- A neater version:

```{r 5_6_2}
round(100*
        c("% Significantly interested"= p[1],              ### Extracts the point estimate
          "95% CI"=attr(p,"ci")       ### Extracts the CI
          ),1
      )
```

## Question 4
- What is the proportion of respondents aged 17-34 in the sample, as well as its 95% confidence interval?

  - You can use `RAgecat5`

## Answer
- The proportion of 17-34 year olds in the sample is:

```{r}
a<-svyciprop(~I(RAgecat5 == 1),
              bsa17.s)

round(
  100*a[1],
     1)
```

and its 95% confidence interval:

```{r}
round(
    100*
    confint(a),  ### Another way of extracting CI from svyciprop objects
    1)
```

## Stata
* Proportions and answer to question 4 (Scroll down the slide for full output)

```{*** Stata_3,echo=F}
use ~/Data/bsa/UKDA-8450-stata/bsa2017_for_ukda.dta,clear
quietly svyset Spoint [pw=WtFactor], strata(StratID)
** Creating a dummy variable for significant interest in politics
quietly recode Politics 1 2 =1 3/8=0,gen(Politics2)
** Survey-design informed frequencies...
svy:ta Politics2
** ... Proportions and CI
svy:ta Politics2, percent ci
** Same for age categories
svy:ta RAgecat5, percent ci
```

## Computing domain estimates
- Computing estimates for subpopulations adds a layer of complexity to what we have seen so far.
- Weights are usually designed to make the full sample representative of the population.
- If we computed estimates for a subpopulation only:

    - this would amount to using a fraction of these weights
    - and may alter the accuracy of our estimates.
- It is recommended to use commands that take into account the entire distribution of the weights instead.

    - The R command that does this is `svyby()`
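
A sketch of the difference, using the `apiclus1` demonstration data from the `survey` package as a stand-in for the workshop data. The first approach discards observations outside the subpopulation before declaring the design; the second keeps the full design and lets `svyby()` handle the domains. Object names are arbitrary.

```r
library(survey)
data(api)

# Not recommended: filter the data first, then declare the design
elem.only  <- svydesign(ids = ~dnum, weights = ~pw,
                        data = subset(apiclus1, stype == "E"))
m.filtered <- svymean(~api00, elem.only)

# Recommended: declare the design on the full sample, then estimate by domain
api.full <- svydesign(ids = ~dnum, weights = ~pw, data = apiclus1)
m.domain <- svyby(~api00, by = ~stype, design = api.full, FUN = svymean)

m.filtered                                 # elementary schools, filtered data
m.domain                                   # all school types, full design
```

The point estimates for elementary schools coincide; the standard errors need not, because only the second approach uses the full sample when estimating the variance.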

## Computing domain estimates in R

- Say we would like to compute the mean age of BSA respondents by Government Office Regions
- We need to specify:

  - The outcome variable whose estimate we want to compute: i.e. `RAgeE`
  - The grouping variable(s) `GOR_ID`
  - The estimation function we are going to use here: `svymean`
  - And the type of variance estimation we would like to see displayed, i.e. standard errors or confidence intervals


## Output

```{r 5.7}
d<-      svyby(~RAgeE,               ### Outcome variable
             by=~as_factor(GOR_ID),  ### Subpopulations
             svymean,                ### Estimation function
             design=bsa17.s,
             vartype = "ci")         ### CI or SE
round(d[-1],1)
```

We used `[-1]` above  to remove the column with region names from the results, so that we could round the estimates  without getting an error.

## Interpretation
- The population in London is among the youngest in the country, whereas that of the South West is among the oldest.
- This difference is likely to be statistically significant as their 95% CIs do not overlap.
- The same cannot be said of differences between London and the South East, as the CIs partially overlap.
- We can follow a similar approach for proportions, i.e. specify a category of interest of a variable, for example respondents who are significantly interested in politics, and replace `svymean` with `svyciprop`.

```{r 5.8}
c<-round(
      100*
      svyby(~I(Politics==1 | Politics==2),
            by=~as_factor(GOR_ID),
            svyciprop,
            design=bsa17.s,
            vartype = "ci")[-1],
            1)
```

## Question 5
- What is the 95% confidence interval for the proportion of people significantly interested in politics in the North East?
- Is the proportion likely to be different in London? In what way?
- What is the region of the UK for which the precision of the estimates is likely to be the smallest?

## R answer

- The 95% confidence interval for the proportion of people significantly   interested in politics in the North East is  `r as.numeric(c[1,2:3])`.
- By contrast, it is  `r as.numeric(c[7,2:3])` in London.
- The region with the lowest precision of estimates (i.e. the widest confidence interval) is Wales, with a difference of more than 20 percentage points between the upper and lower bounds of the confidence interval.

## Stata answer - not accounting for domain estimation

```{*** Stata_5.1}
#| echo: False
use ~/Data/bsa/UKDA-8450-stata/bsa2017_for_ukda.dta,clear
quietly recode Politics 1 2 =1 3/8=2,gen(Politics2)
quietly svyset Spoint [pw=WtFactor], strata(StratID)

svy:prop Politics2,over(GOR_ID) percent  cformat(%9.1f)
```

## ... And  accounting for domain estimation

```{*** Stata_5.2}
#| echo: False
use ~/Data/bsa/UKDA-8450-stata/bsa2017_for_ukda.dta,clear
quietly recode Politics 1 2 =1 3/8=0,gen(Politics2)
quietly svyset Spoint [pw=WtFactor], strata(StratID)

* Generating dummy variables for  regions
tab GOR_ID,gen(RegNum_)
* % interested in politics in the North East...
svy,subpop(RegNum_1):prop Politics2, percent  cformat(%9.1f)

* ... And in London
svy,subpop(RegNum_7):prop Politics2, percent  cformat(%9.1f)
```

## Question 6
- Using interest in politics as before, and the three-category age variable `RAgecat5`:

  - Produce a table showing the proportion of respondents significantly interested in politics by age group and gender
  - Assess whether the age difference in interest for politics is similar for each gender.
  - Is it fair to say that men aged under 35 are more likely to declare being  interested  in politics  than women aged 55 and above?

## Question 6 - R

```{r 6.1}
round(
      100*
        svyby(~I(Politics==1 | Politics==2),
              by=~as_factor(RAgecat5)+as_factor(Rsex),
              svyciprop,
              design=bsa17.s,
              vartype = "ci")[c(-8,-4),c(-2,-1)],1)
```
- Both men and women aged 55+ tend to be more interested in politics than those who are younger.

- Confidence intervals for the proportion of men under 35 and women above 55 interested in politics overlap; it is unlikely that they  differ significantly  in the population.

## Question 6 - Stata

```{*** Stata_6,echo=F}
use ~/Data/bsa/UKDA-8450-stata/bsa2017_for_ukda.dta,clear
quietly recode Politics 1 2 =1 3/8=0,gen(Politics2)
quietly svyset Spoint [pw=WtFactor], strata(StratID)
egen RagSex=group(RAgecat5 Rsex),label // Creates age groups by sex variable
ta RagSex,gen(RagSex)                  // ... and dummy vars by  category

svy: prop Politics2,over(RagSex ) percent   cformat(%9.1f) // Overview, not accounting for domain estimation


svy,subpop(RagSex1):prop Politics2 ,percent  cformat(%9.1f) // Men under 35
svy,subpop(RagSex6):prop Politics2, percent  cformat(%9.1f) // Women 55+
```


# 3. Dealing with  the absence of survey design variables

##  Estimating employment by region with  the LFS
- We are using the End User Licence Quarterly Labour Force Survey, April-June 2022.
- As a rule, EUL versions of the LFS do not include survey design variables.
- The LFS comes with two weighting variables:
  - `pwt22` for estimation with the whole sample
  - `piwt22` for earnings estimation of respondents in employment (also accounting for the high level of non-response for the earnings variables)

- Estimation that does not account for the sample design is likely to be biased and should be reported as such, with warnings, even if the nature (over- or underestimation) and size of the bias are not known.

##
- An alternative is to look for design factor tables published by the data producer, which can be used to correct for the bias.

- The Office for National Statistics regularly publishes design factor (Deft) tables for the LFS, but only for their headline statistics.

![](pics/lfs_vol1_SE.png)

## Regional employment rates using R

- Let's first produce uncorrected estimates of regional employment.
- We will still use the survey design functions, but declare a simple random sample (SRS) design

```{r  5.9}
lfs<-read_dta(
              (paste0(
                datadir,
                "lfs/UKDA-8999-stata/lfsp_aj22_eul_pwt22.dta"
                )
               )
              )%>%
     select(PWT22,PIWT22,GOVTOF2,URESMC,ILODEFR)

table(as_factor(lfs$GOVTOF2))
```

##
The ONS use a distinct category for Merseyside in their tables, but the `GOVTOF2` variable in our dataset does not. We will add it using another, more detailed region variable: `URESMC`.

```{r 5.10}
lfs<-lfs%>%
     mutate(
            govtof=ifelse(URESMC==15,3,GOVTOF2)
            )        # Identifying Merseyside using URESMC

lfs$govtof.f<-as.ordered(lfs$govtof)                           # Converting into factor
levels(lfs$govtof.f)<-c(names(attr((lfs$GOVTOF2),"labels")[3:4]),
                        "Merseyside",
                        names(attr((lfs$GOVTOF2),"labels")[5:14])) # Adding factor levels from existing labels
table(lfs$govtof.f)
```


##
Let us now examine  the confidence intervals for the percentage of persons in employment:

```{r 5.12}
lfs.s<-svydesign(ids=~1,weights=~PWT22,data=lfs)
d<-      svyby(~I(ILODEFR==1),
               by=~govtof.f,
               svyciprop,
               vartype="se",
               design=lfs.s)

df<-100*data.frame(d[-1])
names(df)<-c("Empl.","SE")
df["Low.1"]<-round(df$Empl.-(1.96*df$SE),1)
df["High.1"]<-round(df$Empl.+(1.96*df$SE),1)

```

##
We can now import the design factors from the LFS documentation. This has to be done by hand.


```{r 5.13}
df$deft<-c(0.8712,1.0857,1.3655,
           1.0051,0.9634,1.0382,
           0.8936,1.3272,0.9677,
           0.9137,1.0012,1.0437,
           0.7113)
df["Low.2"]<-round(df$Empl.-(1.96*df$SE*df$deft),1)
df["High.2"]<-round(df$Empl.+(1.96*df$SE*df$deft),1)

# Cleaning up the labels
#rownames(df)<-substr(rownames(df),9,nchar(rownames(df)))
df
```
In some regions, the CIs have widened (e.g. London) whereas they have narrowed in others (e.g. the North East).
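
The correction used above amounts to multiplying the simple random sample standard error by the design factor before building the interval. A worked sketch with hypothetical figures (none of the numbers below come from the LFS or from ONS tables):

```r
# Hypothetical figures, for illustration only
est  <- 75.0     # estimated employment rate (%)
se   <- 0.8      # standard error assuming simple random sampling (pct points)
deft <- 1.25     # design factor for this estimate

ci.srs  <- est + c(-1, 1) * 1.96 * se          # ignores the design
ci.deft <- est + c(-1, 1) * 1.96 * se * deft   # design factor applied

round(rbind(srs = ci.srs, deft = ci.deft), 2)
```

A design factor above 1 widens the interval, whereas a value below 1 narrows it.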

##

### Thank you for your attention

Comments, feedback and questions: pierre.walthery@manchester.ac.uk