---
title: XX
author: XX
tutorial:
  id: XX (should be the number (indicating the order in which we complete these) +
    title, everything in lower case, all spaces and weird characters replaced with
    dashes. See Instructions for details. Same as the directory in which this file
    will be located.)
output:
  learnr::tutorial:
    progressive: yes
    allow_skip: yes
runtime: shiny_prerendered
description: "Tutorial #XX for Preceptor's Primer (where XX is the same number as
  in the id)"
---
```{r setup, include = FALSE}
# XX: First packages are ones that students don't know about or load up in the
# Console. learnr and tutorial.helpers are required for the tutorial to work at
# all. gt is needed because we show a Preceptor Table to the students, which is
# built with this package, even though we don't show students how to do so.
library(learnr)
library(tutorial.helpers)
library(gt)
# XX: Any package from below is something that we want students to explicitly
# load up in the tutorial/Console/QMD. This serves two purposes. First, it provides an
# occasion for knowledge drops. Second, it reminds students that these packages
# must be loaded in the Console if they want the relevant code from the tutorial
# to work in the Console. The most common package to be added to this section,
# and which should also be loaded by students, is a data source like
# primer.data. It should be placed after library(tidyverse) since that is when
# it is loaded in the tutorial.
# Some models, like ordinal regressions, do not work with tidymodels. So, for
# those tutorials, we replace library(tidymodels) with library(MASS), or
# whichever package we need to make a model. Obviously, it is nice if fitted
# objects using that package work with our usual tools. In general,
# broom (or broom.mixed) works with everything and marginaleffects is
# also ecumenical. In either case, we only keep here (and load later) tidymodels
# if we actually use it.
library(tidyverse)
library(tidymodels) # Or some other modeling package like ordinal.
library(broom) # Or broom.mixed. Not sure if we ever need broom.helpers?
library(marginaleffects)
# easystats is just used for check_predictions(), which we only run in the
# Console. Is there a better approach for running a posterior predictive
# check?
library(easystats)
knitr::opts_chunk$set(echo = FALSE)
options(tutorial.exercise.timelimit = 600,
        tutorial.storage = "local")
# XX: Never include setup code that takes more than a few seconds to run. See
# https://ppbds.github.io/tutorial.helpers/articles/instructions.html#data for
# background.
# XX: You need to create a version of your model in this setup code chunk. We
# use an example here, fit_XX, which you should obviously replace and use your
# own name for, but always starting with `fit_`.
fit_XX <- linear_reg(engine = "lm") |>
  fit(att_end ~ sex + treatment + age, data = primer.data::trains)
# XX: If model creation, or any other set up code, takes more than a few seconds
# to run, then write/load the object, following this advice:
# https://ppbds.github.io/tutorial.helpers/articles/instructions.html#data
```
```{r copy-code-chunk, child = system.file("child_documents/copy_button.Rmd", package = "tutorial.helpers")}
```
```{r info-section, child = system.file("child_documents/info_section.Rmd", package = "tutorial.helpers")}
```
<!-- INSTRUCTIONS to Users of this Template -->
<!-- This is the template tutorial for creating any tutorial which uses the Cardinal Virtues to answer a question given a data set. Although its primary use is for the chapters after 3 in the Primer, it could be used for independent tutorials as well. The letters `XX` are used to indicate locations which require editing. Comments with instructions are interspersed. -->
<!-- Read The Cardinal Virtues vignette in the package for background: https://ppbds.github.io/primer.tutorials/articles/cardinal_virtues.html -->
<!-- Delete all these instructions as you go. Once you are done with the tutorial, none of them should remain. The only comments left should be ones that you wrote, comments specific to your tutorial --- like modeling approaches that you tried but did not use, other approaches that might be considered in the future and so on. -->
<!-- The tutorial is not done until you deal with, and remove, every XX. In general, XX will either mark a comment, which you should delete entirely once you have read it (and/or followed its instructions), or it will mark an object which you need to replace with the name that you have chosen. -->
<!-- Once you decide the appropriate replacement for `fit_XX` and `XX.qmd`, you can do a global replace to fix them all. -->
<!-- We sometimes connect XX to another word or phrase, as in [XX: unit] or `[XX: the tibble]`. In these cases, the XX indicates that this is something that you need to replace and the other words/symbols are there to guide you as to what the replacement should be. But you delete everything within the brackets. For example, you might replace [XX: unit] with "candidate" (with no quotation marks) or whatever the type of unit we have in this problem. Similarly, `[XX: the tibble]` would be replaced with `trains` or whatever tibble is used in this tutorial. In both cases, we provide the correct punctuation. The word "candidate" would not have any punctuation since it is just a word in a sentence. But a tibble like `trains` needs to be surrounded by backticks, like any other tibble. -->
<!-- Whenever creating an object which will be used in later questions, never have students do the assignment themselves. Instead, have a series of one or more questions which create the object, often by building a pipe line-by-line, with each step creating output which can be examined and discussed. Then, when the creation is done, have a last question which says, more or less, 'Behind the scenes, we have assigned the result of the pipe [or whatever function call was used] to the object `fit_obj`. To confirm, type `fit_obj` and hit "Run Code."' -->
<!-- Note that the questions are a mixture of our three types: code, written (with answer) and written (without answer). The last is only used for questions in which we ask the student to run a command like `show_file()`. Otherwise, we always provide an excellent written answer because students will generally look closely at our answer because they are concerned about whether or not their written answer matches ours. -->
<!-- A plot, especially of the outcome or key covariate, often makes for an excellent knowledge drop. Just have a code chunk with no code chunk label, just ```{r} ```. -->
<!-- Whenever you tell a student to make a change in the QMD, you should tell them to use `Cmd/Ctrl + Shift + K` to render the document. (This will also cause it to be saved.) This is good practice for catching bugs early. (Professionals do this.) Then, the last step in these exercises is often some version of `show_file()` and then CP/CR. -->
<!-- Make use of, e.g., `show_file("tutorial-6.qmd", start = -5)` to get just the last 5 lines of the QMD. We don't want students to copy/paste the whole document. We also don't need to ensure that we get whatever it is that was just changed. We never look! Instead, we are just plausibly threatening to look. -->
<!-- Make sure to uncomment the test code chunks below, once you have created the necessary objects. -->
## Introduction
###
This tutorial supports [*Preceptor’s Primer for Bayesian Data Science: Using the Cardinal Virtues for Inference*](https://ppbds.github.io/primer/) by [David Kane](https://davidkane.info/).
The world confronts us. Make decisions we must.
<!-- XX: Add an "Imagine" paragraph. The purpose of the paragraph is to describe a real person in the world. In general, this person has lots of data and lots of problems and lots of decisions to make. And they are a real person! This motivates the work that we will do below. Always start with "Imagine that you are . . ." And end with "There are many decisions to make." Examples: -->
<!-- XX: Example: Imagine you're in charge of ordering uniforms for next year's Marine Corps bootcamp recruits. There are many factors to consider: the cost of different designs, the number of male and female recruits, the distributions of heights and weights, and so on. There are many decisions to make. -->
<!-- XX: Example: Imagine that you are running for Governor of Texas in the next election. Seeking any political office, much less the governorship of a large state, is difficult. You have resources --- money, volunteers, surrogates, your own time. You have goals --- increase your name recognition, raise money, attack your opponent, persuade undecided voters, get your supporters to vote. There are many decisions to make. -->
Imagine that you are ... XX. There are many decisions to make.
### Exercise 1
What are the four [Cardinal Virtues](https://en.wikipedia.org/wiki/Cardinal_virtues), in order, which we use to guide our data science work?
```{r introduction-1}
question_text(NULL,
message = "Wisdom, Justice, Courage, and Temperance.",
answer(NULL, correct = TRUE),
allow_retry = FALSE,
incorrect = NULL,
rows = 2)
```
###
Why do we ask this, and a score of other questions, in each tutorial? Because the best way to (try to) ensure that students remember these concepts more than a few months after the course ends is [spaced repetition](https://en.wikipedia.org/wiki/Spaced_repetition), although we focus more on the repetition than on the spacing.
### Exercise 2
Create a GitHub repo called `XX`. Make sure to click the "Add a README file" check box.
Connect the repo to a project on your computer using `File -> New Folder from Git ...`. Make sure to select the "Open in a new window" box.
You need two Positron windows: this one for running the tutorial and the one you just created for writing your code and interacting with the Console.
Select `File -> New File -> Quarto Document ...`. Provide a title -- `"XX"` -- and an author (you). Render the document and save it as `XX.qmd`.
Create a `.gitignore` file with `XX_files` on the first line and then a blank line. Save and push.
In the Console, run:
```
show_file(".gitignore")
```
If that fails, it is probably because you have not yet loaded `library(tutorial.helpers)` in the Console.
CP/CR.
```{r introduction-2}
question_text(NULL,
answer(NULL, correct = TRUE),
allow_retry = TRUE,
try_again_button = "Edit Answer",
incorrect = NULL,
rows = 3)
```
###
Professionals keep their data science work in the cloud because laptops fail.
### Exercise 3
<!-- XX: Switch primer.data for whichever package you get your data from. -->
In your QMD, put `library(tidyverse)` and [XX: `library(primer.data)`] in a new code chunk. Render the file.
Notice that the file does not look good because the code is visible and there are annoying messages. To take care of this, add `#| message: false` to remove all the messages in this `setup` chunk. Also add the following to the YAML header to remove all code echos from the HTML:
```
execute:
  echo: false
```
In the Console, run:
```
show_file("XX.qmd", chunk = "Last")
```
CP/CR.
```{r introduction-3}
question_text(NULL,
answer(NULL, correct = TRUE),
allow_retry = TRUE,
try_again_button = "Edit Answer",
incorrect = NULL,
rows = 6)
```
###
Render again. Everything looks nice, albeit empty, because we have added code to make the file look better and more professional.
### Exercise 4
<!-- XX: The reason for this somewhat convoluted approach is that we want to give students lots of practice working in both the QMD World and the Console World. -->
Place your cursor in the QMD file on the `library(tidyverse)` line. Use `Cmd/Ctrl + Enter` to execute that line.
Note that this causes `library(tidyverse)` to be copied down to the Console and then executed.
CP/CR.
```{r introduction-4}
question_text(NULL,
answer(NULL, correct = TRUE),
allow_retry = TRUE,
try_again_button = "Edit Answer",
incorrect = NULL,
rows = 3)
```
###
<!-- XX: Report the source of the data, along with some background information about it. -->
### Exercise 5
<!-- DK: Load up data before Wisdom? -->
<!-- XX. Note that these instructions do not specify what the next library is. Feel free to specify it, if you like. If you are not adding another library, just delete this question. -->
Place your cursor in the QMD file on the next line. Use `Cmd/Ctrl + Enter` to execute that line.
This workflow --- writing things in the QMD so that you have a permanent copy and then executing them in the Console with `Cmd/Ctrl + Enter` --- is the most common approach to data science.
There is QMD World and Console World. It is your responsibility to keep them in sync.
CP/CR.
```{r introduction-5}
question_text(NULL,
answer(NULL, correct = TRUE),
allow_retry = TRUE,
try_again_button = "Edit Answer",
incorrect = NULL,
rows = 3)
```
###
A version of the data from XX is available in the `XX` tibble.
### Exercise 6
<!-- XX: This question only works if there is help available for the tibble you hope to use. Delete the question if there is not. -->
In the Console, type `?XX`, and paste the Description below.
```{r introduction-6}
question_text(NULL,
answer(NULL, correct = TRUE),
allow_retry = TRUE,
try_again_button = "Edit Answer",
incorrect = NULL,
rows = 8)
```
###
<!-- XX. More information about the data. One example would be a copy/paste of the abstract if the data is from a paper. Or perhaps a quote from the website on which the data can be found. -->
### Exercise 7
Define a causal effect.
```{r introduction-7}
question_text(NULL,
message = "A causal effect is the difference between two potential outcomes.",
answer(NULL, correct = TRUE),
allow_retry = FALSE,
incorrect = NULL,
rows = 6)
```
###
According to the Rubin Causal Model, there must be two (or more) potential outcomes for any discussion of causation to make sense. This is simplest to discuss when the treatment only has two different values, thereby generating only two potential outcomes.
### Exercise 8
<!-- IS: move to wisdom instead? -->
What is the fundamental problem of causal inference?
```{r introduction-8}
question_text(NULL,
message = "The fundamental problem of causal inference is that we can only observe one potential outcome.",
answer(NULL, correct = TRUE),
allow_retry = FALSE,
incorrect = NULL,
rows = 6)
```
###
<!-- DK: Not a great knowledge drop for this question. -->
If the treatment variable is continuous (like a lottery payment), then there are lots and lots of potential outcomes, one for each possible value of the treatment variable.
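
The fundamental problem from this exercise can be sketched with a small table. This is only an illustration: the numbers are made up, and the column names `y_treated` and `y_control` are hypothetical rather than taken from any data set used in this tutorial. Each unit reveals only one potential outcome, so the unit-level causal effect cannot be computed directly.

```r
# Made-up potential outcomes for three units. In real data, we never
# observe both columns for the same unit.
po <- data.frame(
  unit      = c("A", "B", "C"),
  treated   = c(TRUE, FALSE, TRUE),
  y_treated = c(10, NA, 7),   # observed only for treated units
  y_control = c(NA, 8, NA)    # observed only for control units
)

# The unit-level causal effect is treatment minus control. Because one
# potential outcome is always missing, every effect comes back as NA.
po$effect <- po$y_treated - po$y_control
po
```

Filling in those `NA` values, at least approximately, is what a causal model is for.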
<!-- DK: Why those definitions? Why not others? -->
### Exercise 9
XX is the broad topic of this tutorial. Given that topic, which variable in `XX` should we use as our outcome variable?
```{r introduction-9}
question_text(NULL,
message = "XX: A sentence about the outcome variable which we will be using.",
answer(NULL, correct = TRUE),
allow_retry = FALSE,
incorrect = NULL,
rows = 2)
```
###
We will use `XX` as our outcome variable.
<!-- XX: Create a simple univariate plot of the outcome variable. Or a bivariate plot in which the outcome variable, on the y-axis, is compared to a covariate, but not to the important covariate. Do not use any code chunk labels for this code chunk. The subtitle should highlight some aspect of the data. Plot does not need to be fancy but it should be competent, with axis labels, nice formatting and so on. We don't show students the code. You can add more knowledge below the plot, if you like. Looking closely at the outcome variable guides us as to the statistical model to use. -->
### Exercise 10
<!-- DK: Should discussion of predictive models come first? -->
Let's imagine a brand new variable which **does not exist** in the data. This variable should be binary, meaning that it only takes on one of two values. It should also, at least in theory, be manipulable. In other words, if the value of the variable is "3," or whatever, then it generates one potential outcome and if it is "9," or whatever, it generates another potential outcome.
Describe this imaginary variable and how we might manipulate its value.
<!-- XX: Include these two sentences if this is a causal model: -->
<!-- For now, ignore the actual treatment variable `XX` which we will be using later in the analysis. The point of this exercise is to reinforce our understanding of the [Rubin Causal Model](https://ppbds.github.io/primer/rubin-causal-model.html). -->
```{r introduction-10}
# XX: In your answer, and for the next few questions, always treat this
# imaginary variable as real by putting backticks around the name. For example,
# with nhanes data, we might imagine a variable called `vitamin` for which `1`
# means that the individual ate vitamins growing up and `0` means they did not.
# Using the words "treatment group" and "control group" as part of your answer
# is often helpful since it reinforces the fact that we are using the Rubin
# Causal Model.
question_text(NULL,
message = "XX: (This is an example answer.) Imagine a variable called `phone_call` which has a value of `1` if the person received a phone call urging them to vote and `0` if they did not receive such a phone call. We, meaning the organization in charge of making such phone calls, can manipulate this variable by deciding, either randomly or otherwise, whether or not to call a specific individual.",
answer(NULL, correct = TRUE),
allow_retry = FALSE,
incorrect = NULL,
rows = 6)
```
###
Any data set can be used to construct a causal model as long as there is at least one covariate that we can, at least in theory, manipulate. It does not matter whether or not anyone did, in fact, manipulate it.
### Exercise 11
Given our (imaginary) treatment variable `XX`, how many potential outcomes are there for each [XX: unit]? Explain why.
```{r introduction-11}
question_text(NULL,
message = "There are two potential outcomes because the treatment variable `XX` takes on two possible values: XX-list-the-values-here, i.e., exposure to Spanish-speakers on a train platform versus no such exposure.",
answer(NULL, correct = TRUE),
allow_retry = FALSE,
incorrect = NULL,
rows = 6)
```
###
The same data set can be used to create, separately, lots and lots of different models, both causal and predictive. We can just use different outcome variables and/or specify different treatment variables. This is a *conceptual* framework we apply to the data. It is never inherent in the data itself.
### Exercise 12
<!-- DK: Make this two questions? -->
In a few sentences, specify the two different values for the imaginary treatment variable `XX`, for a single unit, guess at the potential outcomes which would result, and then determine the causal effect for that unit given those guesses.
```{r introduction-12}
# XX: Replace [XX: unit] with a better word below given the actual data set we
# are using. Replace all the XX terms as appropriate.
# XX: For a given individual, assume that the value of the treatment variable
# might be 'exposure to Spanish-speakers' or 'no exposure'. If the individual
# gets 'exposure to Spanish-speakers', then her attitude toward immigration
# would be 10. If the individual gets 'no exposure', then her attitude would be
# 8. The causal effect on the outcome of a treatment of exposure to
# Spanish-speakers versus no exposure is 10 - 8 --- i.e., the difference between
# two potential outcomes --- which equals 2, which is the causal effect.
# XX: If the outcome is a character variable, like Strongly Approve, then there
# is no simple metric on which we can pinpoint the causal effect. That is, the
# causal effect is still defined --- as, in this example, the difference between
# Strongly Approve and Neutral --- but can not be expressed as a number, at
# least without further work.
question_text(NULL,
message = "For a given [XX: unit], assume that the value of the treatment variable might be [XX: treatment] or [XX: control]. If the [XX: unit] gets [XX: treatment], then [XX: the outcome] would be [XX: a number/character]. If the [XX: unit] gets [XX: control], then [XX: the outcome] would be [XX: a different number or character]. The causal effect on the outcome of a treatment of [XX: treatment] versus [XX: control] is [XX: a number] - [XX: a different number] --- i.e., the difference between two potential outcomes --- which equals [XX: the causal effect], which is the causal effect for this [XX: unit].",
answer(NULL, correct = TRUE),
allow_retry = FALSE,
incorrect = NULL,
rows = 6)
```
###
A causal effect is defined as the *difference* between two potential outcomes. Keep two things in mind.
First, *difference* does not necessarily mean *subtraction*. Many potential outcomes are not numbers. For example, it makes no sense to subtract a potential outcome, like who you would vote for if you saw a Facebook ad, from another potential outcome, like who you would vote for if you did not see the ad.
Second, even in the case of numeric outcomes, you can’t simply say the effect is 10 without specifying the order of subtraction, although there is, perhaps, a default sense in which the causal effect is defined as potential outcome under treatment minus potential outcome under control.
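
A minimal sketch of why the order of subtraction matters, assuming a single unit with numeric potential outcomes. The values and the object names `y_treated` and `y_control` are made up for illustration.

```r
# Hypothetical potential outcomes for one unit.
y_treated <- 10  # outcome if the unit receives the treatment
y_control <- 8   # outcome if the unit receives the control

# The default convention: treatment minus control.
effect_default <- y_treated - y_control

# Reversing the order flips the sign of the causal effect.
effect_reversed <- y_control - y_treated

c(default = effect_default, reversed = effect_reversed)
```

Both numbers describe the same pair of potential outcomes, so a reported effect is only interpretable once the order is stated.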
### Exercise 13
<!-- XX: Replace stuff like `XX: the tibble` with just the name of the tibble, i.e., `trains`. -->
Let's consider a *predictive* model. Which variable in `XX: the tibble` do you think might have an important connection to `XX: the outcome variable`?
```{r introduction-13}
question_text(NULL,
message = "XX: Describe one of the key covariates whose connection to the outcome variable we might want to explore.",
answer(NULL, correct = TRUE),
allow_retry = FALSE,
incorrect = NULL,
rows = 2)
```
###
With a predictive model, each individual unit has only one observed outcome. There are not two potential outcomes because none of the covariates are treated as treatment variables. Instead, all covariates are assumed to be "fixed."
Predictive models have no "treatments" --- only covariates.
<!-- DK: Shouldn't this come after we specify the covariate in which we are interested? -->
### Exercise 14
Specify two different groups of [XX: units] which have different values for [XX: covariate] and which might have different average values for the [XX: outcome].
```{r introduction-14}
question_text(NULL,
message = "XX: Consider two groups, the first with a value for [XX: covariate] of [XX: a value] and the second with value [XX: a different value]. Those two groups might have different average values for the outcome variable.",
answer(NULL, correct = TRUE),
allow_retry = FALSE,
incorrect = NULL,
rows = 6)
```
###
In predictive models, do not use words like "cause," "influence," "impact," or anything else which suggests causation. The best phrasing is in terms of "differences" between groups of units with different values for a covariate of interest.
Any causal connection means exploring the *within row* difference between two potential outcomes. There's no need to consider other rows.
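
The between-group comparison used for predictive models can be sketched as follows. The data frame `toy` and its columns are invented for this illustration, and base R is used only to keep the example self-contained.

```r
# Made-up data: a numeric outcome and one covariate with two values.
toy <- data.frame(
  group   = c("a", "a", "a", "b", "b", "b"),
  outcome = c(4, 6, 5, 9, 7, 8)
)

# Average outcome within each group of units.
group_means <- tapply(toy$outcome, toy$group, mean)

# A predictive comparison is a difference between groups of units. It
# compares different rows, so it is not a causal effect.
diff_b_minus_a <- unname(group_means["b"] - group_means["a"])
diff_b_minus_a
```

Note that the comparison runs *across* rows, in contrast to the within-row difference that defines a causal effect.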
### Exercise 15
<!-- XX: You can make this question more complex by specifying more than one covariate. This is often useful when you expect heterogeneity, either in the causal effects or in the predictive model. -->
Write a [XX: choose causal or predictive] question which connects the outcome variable `XX` to `XXZ`, the covariate of interest.
```{r introduction-15}
# XX: If it is causal, you should use key causal language in the question, like
# "What is the causal effect of the treatment on the outcome?" Example: "What is
# the average causal effect of exposure to Spanish-speakers on attitudes toward
# immigration?" If the model is predictive, the question should clearly compare
# two groups of units. "What is the difference in the outcome variable between
# two groups of units?" Example: "What is the difference in immigration
# attitudes between Democrats and Republicans?" In both cases, the word
# "average" is implicit in the question.
question_text(NULL,
message = "XX: Give your question.",
answer(NULL, correct = TRUE),
allow_retry = FALSE,
incorrect = NULL,
rows = 6)
```
###
This is the first version of the question. We will now create a Preceptor Table to answer the question. We may then revise the question given complexities discovered in the data. We then update the question and the Preceptor Table. And so on.
## Wisdom
###
<!-- XX: We finished the Introduction with a broad area of interest, a relevant dataset and some reminders about the Rubin Causal Model. Now we need to iterate, moving back and forth between the data and the Preceptor Table until we have ones that seem to work well together. -->
<!-- XX: Pick one. -->
*A prudent question is one half of wisdom.* - Francis Bacon
*The power to question is the basis of all human progress.* - Indira Gandhi
*The important thing is not to stop questioning.* - Albert Einstein
*It is not the answer that enlightens, but the question.* - Eugene Ionesco
<!-- XX: Write a few sentences, but no more than a paragraph, which connects your larger problem, as discussed above, to a question/questions which you might be able to answer given the data set we seem to have. That answer won't solve all your other problems! But it should make it more likely that you will deal with at least some of your problems more intelligently, that your decisions will be better than they otherwise would have been if, counterfactually, you had not completed this data science project. -->
<!-- XX: Example: -->
<!-- You have a campaign budget. Your goal is to win the election. Winning the election involves convincing people to vote for you *and* getting your supporters to vote. Should you send postcards to registered voters? What should those postcards say? Does the effect of the postcards vary for different types of voters? -->
<!-- XX: It is probably good if there are a couple of questions/topics here. We have not made any final choices yet. We are just refining the problem down from the broad generality of the "Imagine" paragraph from the Introduction. By the end of the Wisdom section, we need a very specific question. -->
### Exercise 1
In your own words, describe the key components of Wisdom when working on a data science problem.
```{r wisdom-1}
question_text(NULL,
message = "Wisdom begins with a question and then moves on to the creation of a Preceptor Table and an examination of our data.",
answer(NULL, correct = TRUE),
allow_retry = FALSE,
incorrect = NULL,
rows = 3)
```
###
*The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.* -- John W. Tukey
### Exercise 2
Define a Preceptor Table.
```{r wisdom-2}
question_text(NULL,
message = "A Preceptor Table is the smallest possible table of data with rows and columns such that, if there is no missing data, we can easily calculate the quantity of interest.",
answer(NULL, correct = TRUE),
allow_retry = FALSE,
incorrect = NULL,
rows = 6)
```
###
The Preceptor Table does not include all the covariates which you will eventually include in your model. It only includes covariates which you need to answer your question.
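
As a rough sketch, here is what a Preceptor Table for a causal question might look like if no data were missing. Every value and column name below is hypothetical. In practice, at least one potential outcome in each row is unknown.

```r
# A hypothetical Preceptor Table: one row per unit, one column per
# potential outcome, plus the treatment covariate.
preceptor <- data.frame(
  unit      = 1:4,
  treatment = c("treated", "control", "treated", "control"),
  y_treated = c(10, 9, 12, 11),
  y_control = c(8, 9, 9, 10)
)

# With no missing data, the quantity of interest is simple arithmetic.
# Here it is the average of the row-by-row causal effects.
ace <- mean(preceptor$y_treated - preceptor$y_control)
ace
```

The table holds only what the question requires: units, potential outcomes, and the treatment.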
<!-- XX: Insert at least two questions which explore the data. Provide knowledge drops which highlight important aspects. Don't forget to `tutorial.helpers::check_current_tutorial()` when you are done so that all the subsequent exercises are renumbered correctly. -->
### Exercise 3
Describe the key components of Preceptor Tables in general, without worrying about this specific problem. Use words like "units," "outcomes," and "covariates."
```{r wisdom-3}
question_text(NULL,
message = "The rows of the Preceptor Table are the units. The outcome is at least one of the columns. If the problem is causal, there will be at least two (potential) outcome columns. The other columns are covariates. If the problem is causal, at least one of the covariates will be considered a treatment.",
answer(NULL, correct = TRUE),
allow_retry = FALSE,
incorrect = NULL,
rows = 6)
```
###
<!-- XX: Pick one, and edit, as appropriate: -->
<!-- XX: This problem is causal so one of the covariates is a treatment. In our problem, the treatment is XX. There is a potential outcome for each of the XX possible values of the treatment. -->
<!-- XX: This problem is predictive so [insert something about comparing the outcomes for two different groups] -->
### Exercise 4
What are the units for this problem?
```{r wisdom-4}
question_text(NULL,
message = "XX",
answer(NULL, correct = TRUE),
allow_retry = FALSE,
incorrect = NULL,
rows = 6)
```
###
Specifying the Preceptor Table forces us to think clearly about the units and outcomes implied by the question. The resulting discussion sometimes leads us to modify the question with which we started. No data science project follows a single direction. We always backtrack. There is always dialogue.
We model units, but we only really care about aggregates.
### Exercise 5
What is the outcome variable for this problem?
```{r wisdom-5}
question_text(NULL,
message = "Keep track of two 'outcome' variables: the one in our Preceptor Table and the one in our data. In this case, XX . . .",
answer(NULL, correct = TRUE),
allow_retry = FALSE,
incorrect = NULL,
rows = 6)
```
###
The outcome variable that we really care about is often not the outcome variable which our data includes. This compromise --- working with what we *have* rather than what we really *want* --- is a part of most data science work in the real world.
### Exercise 6
What is a covariate which you think might be useful for this problem, regardless of whether or not it might be included in the data?
```{r wisdom-6}
question_text(NULL,
message = "XX. Answer should be a sensible variable which is plausibly connected to the outcome.",
answer(NULL, correct = TRUE),
allow_retry = FALSE,
incorrect = NULL,
rows = 6)
```
###
The term "covariates" is used in at least three ways in data science. First, it is all the variables which might be useful, regardless of whether or not we have the data. Second, it is all the variables for which we have data. Third, it is the set of variables in the data which we end up using in the model.
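The three senses can be kept apart with a small sketch. The variable names below are hypothetical, invented purely for illustration, and are not taken from this problem.

```r
# Hypothetical variable names, invented purely for illustration.
useful   <- c("age", "income", "education", "party")  # 1. might be useful
in_data  <- c("age", "income", "region")              # 2. present in the data
in_model <- c("age")                                  # 3. used in the model

# Useful variables which we cannot use because we lack the data.
missing_from_data <- setdiff(useful, in_data)
missing_from_data

# The model can only draw on variables which are in the data.
all(in_model %in% in_data)
```

Each set answers a different question, so it helps to say which sense of "covariates" you mean.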
### Exercise 7
What are the treatments, if any, for this problem?
```{r wisdom-7}
question_text(NULL,
message = "XX",
answer(NULL, correct = TRUE),
allow_retry = FALSE,
incorrect = NULL,
rows = 6)
```
###
Remember that a treatment is just another covariate which, for the purposes of this specific problem, we are assuming can be manipulated, thereby creating two or more different potential outcomes for each unit.
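A minimal sketch may make potential outcomes more concrete. The tibble, its column names, and its values are all hypothetical. In real data we would observe at most one of the two potential outcomes for each unit.

```r
library(tibble)
library(dplyr)

# Hypothetical units with both potential outcomes filled in.
# Real data never shows both columns for the same unit.
potential <- tibble(
  id        = c(1, 2, 3),
  y_treated = c(1, 1, 0),
  y_control = c(1, 0, 0)
)

# The causal effect for each unit is the difference between its
# two potential outcomes.
effects <- potential |>
  mutate(causal_effect = y_treated - y_control)

effects
```

With more than two possible values of the treatment, there would be one potential outcome column for each value.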
### Exercise 8
What moment in time does the Preceptor Table refer to?
```{r wisdom-8}
question_text(NULL,
message = "XX",
answer(NULL, correct = TRUE),
allow_retry = FALSE,
incorrect = NULL,
rows = 6)
```
###
A Preceptor Table can never really refer to an exact instant in time since nothing is instantaneous in this fallen world.
In almost all practical problems, the data was gathered at a time other than that to which the Preceptor Table refers.
```{r}
# XX: Make a nice looking plot which shows the outcome variable on the y-axis
# and the most important/interesting covariate/treatment on the x-axis. This
# doesn't really go here, but we want to show the plot after discussing
# covariates and treatments above. We don't show the code. Plot does not need
# to be fancy, but it should be competent, with a title, a subtitle which
# highlights the main takeaway from the plot, axis labels and so on.
```
> *You can never look at the data too much. -- Mark Engerman*
### Exercise 9
Describe in words the Preceptor Table for this problem.
```{r wisdom-9}
question_text(NULL,
message = "XX. Make sure your words give an excellent description of the Preceptor Table which you are about to show the student. Mention these words: rows, units, outcome, covariates and, maybe, treatment.",
answer(NULL, correct = TRUE),
allow_retry = FALSE,
incorrect = NULL,
rows = 6)
```
###
The Preceptor Table for this problem looks something like this:
<!-- XX: Here are two examples which might help you to create your own. In general, your question will only specify the outcome and one or two covariates, so the Preceptor Table is actually fairly small. Making a good Preceptor Table is one of the more difficult aspects of a tutorial. -->
<!-- First, for a predictive model: -->
<!-- ```{r} -->
<!-- tibble(ID = c("1", "2", "...", "10", "11", "...", "103,754,865"), -->
<!-- vote = c("Democrat", "Third Party", "...", "Republican", "Democrat", "...", "Republican"), -->
<!-- sex = c("M", "F", "...", "F", "F", "...", "M")) |> -->
<!-- gt() |> -->
<!-- tab_header(title = "Preceptor Table") |> -->
<!-- cols_label(ID = md("ID"), -->
<!-- vote = md("Vote"), -->
<!-- sex = md("Sex")) |> -->
<!-- tab_style(cell_borders(sides = "right"), -->
<!--             locations = cells_body(columns = c(ID))) |> -->
<!-- tab_style(style = cell_text(align = "left", v_align = "middle", size = "large"), -->
<!-- locations = cells_column_labels(columns = c(ID))) |> -->
<!-- cols_align(align = "center", columns = everything()) |> -->
<!-- cols_align(align = "left", columns = c(ID)) |> -->
<!-- fmt_markdown(columns = everything()) |> -->
<!-- tab_spanner(label = "Outcome", columns = c(vote)) |> -->
<!-- tab_spanner(label = "Covariate", columns = c(sex)) -->
<!-- ``` -->
<!-- Second, for a causal model: -->
<!-- ```{r} -->
<!-- tibble(ID = c("1", "2", "...", "10", "11", "...", "N"), -->
<!-- voting_after_treated = c("1", "1", "...", "1", "0", "...", "1"), -->
<!-- voting_after_control = c("1", "0", "...", "1", "1", "...", "0"), -->
<!-- treatment = c("Yes", "No", "...", "Yes", "Yes", "...", "No"), -->
<!-- engagement = c("1", "3", "...", "6", "2", "...", "2")) |> -->
<!-- gt() |> -->
<!-- tab_header(title = "Preceptor Table") |> -->
<!-- cols_label(ID = md("ID"), -->
<!-- voting_after_treated = md("Voting After Treatment"), -->
<!-- voting_after_control = md("Voting After Control"), -->
<!-- treatment = md("Treatment"), -->
<!-- engagement = md("Engagement")) |> -->
<!-- tab_style(cell_borders(sides = "right"), -->
<!--             locations = cells_body(columns = c(ID))) |> -->
<!-- tab_style(style = cell_text(align = "left", v_align = "middle", size = "large"), -->
<!-- locations = cells_column_labels(columns = c(ID))) |> -->
<!-- cols_align(align = "center", columns = everything()) |> -->
<!-- cols_align(align = "left", columns = c(ID)) |> -->
<!-- fmt_markdown(columns = everything()) |> -->
<!-- tab_spanner(label = "Covariates", columns = c(treatment, engagement)) |> -->
<!-- tab_spanner(label = "Outcomes", columns = c(voting_after_control, voting_after_treated)) -->
<!-- ``` -->
Like all aspects of a data science problem, the Preceptor Table evolves as we continue our work.
<!-- XX: If necessary, provide code exercises which, line-by-line, create the pipeline which creates the cleaned data that will be used in modeling. Name the new object `x`. For many tutorials, this is unnecessary since we can just use the raw tibble that is available in whatever package. But we sometimes need some code like
nes |>
filter(year == 1992) |>
drop_na()
We have three code exercises, each adding one line to the pipeline, explaining what we are doing and why. It is nice that, for each exercise, something is spat out.
-->
<!-- XX: If such a pipeline was built, there is one QMD question which requires that you add a new code chunk to the QMD, copy/paste the pipeline and assign the result to the new object `x`:
x <- nes |>
filter(year == 1992) |>
drop_na()
`Cmd/Ctrl + Shift + K` follows, perhaps with a show_file("XX.qmd", chunk = "Last")
-->
### Exercise 10
What is the narrow, specific question we will try to answer?
```{r wisdom-10}
question_text(NULL,
message = "XX: ",
answer(NULL, correct = TRUE),
allow_retry = FALSE,
incorrect = NULL,
rows = 6)
```
###
The answer to this question is your "Quantity of Interest." It is OK if your question differs from ours. Many similar questions lead to the creation of the same model. For the purpose of this tutorial, let's use our question.
Our Quantity of Interest might appear too specific, too narrow to capture the full complexity of the topic. There are many, many numbers in which we are interested, many numbers that we want to know. But we don't need to list them all here! We just need to choose one of them, since our goal is simply to have a specific question which helps to guide us in the creation of the Preceptor Table and, soon, the model.
### Exercise 11
Over the course of this tutorial, we will be creating a summary paragraph. The purpose of this exercise is to write the first two sentences of that paragraph.
The first sentence is a general statement about the overall topic, mentioning both the general class of the outcome variable and of at least one of the covariates. It is **not** connected to the initial "Imagine that you are XX" which set the stage for this project. That sentence can be rhetorical. It can be trite, or even a platitude. The purpose of the sentence is to let the reader know, gently, about our topic.
The second sentence does two things. First, it introduces the data source. Second, it introduces the specific question. The sentence can't be that long. Important aspects of the data include when/where it was gathered, how many observations it includes and the organization (if famous) which collected it.
Type your two sentences below.
<!-- XX: Example: -->
<!-- XX: Sending postcards and other mailings to registered voters is a traditional part of US political campaigns. Using data from a 2006 experiment in Michigan, we seek to explore the likely causal effects of sending postcards to voters in the current gubernatorial campaign in Texas. -->
```{r wisdom-11}
question_text(NULL,
message = "XX; Make your own answer excellent!",
answer(NULL, correct = TRUE),
allow_retry = FALSE,
incorrect = NULL,
rows = 6)
```
###
Read our answer. It will not be the same as yours. You can, if you want, change your answer to incorporate some of our ideas. Do not copy/paste our answer exactly. Add your two sentences, edited or otherwise, to your QMD, `Cmd/Ctrl + Shift + K`, and then commit/push.
<!-- DK: In most academic writing, the "official" Preceptor Table is much closer to the data than it is in applied work. An academic might just be interested in the causal effect of postcards on voting in an election 20 years ago. That is the only claim she is making. But any decision-maker in the real world really cares about, not what happened decades ago, but what postcards might accomplish today. -->
## Justice
###
<!-- XX: Choose one. -->
*Justice is truth in action.* - Benjamin Disraeli
*The arc of the moral universe is long, but it bends toward justice.* - Theodore Parker
*Justice delayed is justice denied.* - William E. Gladstone
*It is in justice that the ordering of society is centered.* - Aristotle
*Charity is no substitute for justice withheld.* - Saint Augustine
### Exercise 1
In your own words, name the five key components of Justice when working on a data science problem.
```{r justice-1}
question_text(NULL,
message = "Justice concerns the Population Table and the four key assumptions which underlie it: validity, stability, representativeness, and unconfoundedness.",
answer(NULL, correct = TRUE),
allow_retry = FALSE,
incorrect = NULL,
rows = 6)
```
###
Justice is about concerns that you (or your critics) might have, reasons why the model you create might not work as well as you hope.
### Exercise 2
In your own words, define "validity" as we use the term.
```{r justice-2}
question_text(NULL,
message = "Validity is the consistency, or lack thereof, between the columns of the data set and the corresponding columns in the Preceptor Table.",
answer(NULL, correct = TRUE),
allow_retry = FALSE,
incorrect = NULL,
rows = 6)
```
###
Validity is always about the *columns* in the Preceptor Table and the data. Just because columns from these two different tables have the same *name* does not mean that they are the same *thing*.
### Exercise 3
<!-- XX: For the validity questions, specifics matter. There is always a reason why the outcome column in the data is not the same as the outcome column in the Preceptor Table, even in the case of simple sampling. For example, consider a historical question connecting sex with presidential vote. Our data is a subset of our Preceptor Table. We have information on a few thousand voters and want to draw inferences about millions of other voters in the same election. But, even in this case, the outcome columns are different. The data records whom people told a survey they voted for. The Preceptor Table records whom people actually voted for. Those are not the same things. If they are, in your view, different enough, then validity is violated. -->
Provide one reason why the assumption of validity might not hold for the outcome variable `XX` or for one of the covariates. Use the words "column" or "columns" in your answer.
```{r justice-3}
question_text(NULL,
message = "XX: Answers to validity questions should always use the word 'column(s)'. You should make your answer longer than the one we expect from students, ideally mentioning potential problems with both the outcome and a covariate.",
answer(NULL, correct = TRUE),
allow_retry = FALSE,
incorrect = NULL,
rows = 6)
```
###
In order to consider the Preceptor Table and the data to be drawn from the same population, the columns from one must have a *valid correspondence* with the columns in the other. Validity, if true (or at least reasonable), allows us to construct the Population Table, which is the first step in Justice.
Because we control the Preceptor Table and, to a lesser extent, the original question, we can adjust those variables to be “closer” to the data that we actually have. This is another example of the iterative nature of data science. If the data is not close enough to the question, then we check with our boss/colleague/customer to see if we can modify the question in order to make the match between the data and the Preceptor Table close enough for validity to hold.
Despite these potential problems, we will assume that validity holds since it, mostly (?), does.
### Exercise 4
In your own words, define a Population Table.
```{r justice-4}
question_text(NULL,
message = "The Population Table includes a row for each unit/time combination in the underlying population from which both the Preceptor Table and the data are drawn.",
answer(NULL, correct = TRUE),
allow_retry = FALSE,
incorrect = NULL,
rows = 6)
```
###
The Population Table is almost always much bigger than the combination of the Preceptor Table and the data. If we can really assume that both are part of the same population, then that population must cover a broad universe of time and units, since the Preceptor Table and the data are, themselves, often far apart from each other.
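A rough sketch shows how quickly the rows multiply. The units are hypothetical, and the years 2006 and 2026 are assumed here only as example dates for the data and the Preceptor Table.

```r
library(tidyr)

# Hypothetical units, invented for illustration.
units <- c("A", "B", "C")

# Assumed span: data gathered in 2006, Preceptor Table refers to 2026.
years <- 2006:2026

# One row for each unit/time combination in the population.
population <- expand_grid(unit = units, year = years)

# Total rows in the Population Table.
n_population <- nrow(population)

# Rows which belong to either the data or the Preceptor Table.
n_observed_or_wanted <- sum(population$year %in% c(2006, 2026))

c(population = n_population, data_or_preceptor = n_observed_or_wanted)
```

Most rows belong to neither the data nor the Preceptor Table, even in this tiny example.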
### Exercise 5
Specify the unit/time combinations which define each row in this Population Table.
```{r justice-5}
question_text(NULL,
message = "XX: Your answer here.",
answer(NULL, correct = TRUE),
allow_retry = FALSE,
incorrect = NULL,
rows = 6)
```
###
The exact time period used --- whether hour, day, month, year, or whatever --- is relatively arbitrary. The important thing to note is that the Population Table, unlike the Preceptor Table, covers a period of time over which things may change.
<!-- XX: Give an example if possible. -->
<!-- DK: Add example (steal from Preceptor Table above) which students can then modify. Also available in Primer. Done.-->
<!-- First, for a predictive model: -->
<!-- #| echo: false
tibble(source = c("PT/Data", "PT/Data", "PT", "PT", "PT", "PT", "...", "PT/Data", "PT/Data", "PT", "PT", "...", "PT/Data"),
ID = c("1", "2", "3", "4", "5", "6", "...", "10", "11", "12", "13", "...", "103,754,865"),
vote = c("Democrat", "Third Party", "Republican", "Democrat", "Democrat", "Democrat", "...", "Republican", "Democrat", "Democrat", "Republican", "...", "Republican"),
sex = c("M", "F", "M", "F", "F", "M", "...", "F", "F", "...", "F", "...", "M")) |>
gt() |>
tab_header(title = "Population Table") |>
cols_label(source = md("Source"),
ID = md("ID"),
vote = md("Vote"),
sex = md("Sex")) |>
tab_style(cell_borders(sides = "right"),
locations = cells_body(columns = c(ID))) |>
tab_style(style = cell_text(align = "left", v_align = "middle", size = "large"),
locations = cells_column_labels(columns = c(ID))) |>
cols_align(align = "center", columns = everything()) |>
cols_align(align = "left", columns = c(ID)) |>
fmt_markdown(columns = everything()) |>
tab_spanner(label = "Outcome", columns = c(vote)) |>
tab_spanner(label = "Covariate", columns = c(sex)) -->
<!-- Second, for a causal model: -->
<!-- ```{r}
#| echo: false
# First, we create a tibble with the values we want for the table
tibble(source = c("...", "...", "...",
"Data", "Data", "...",
"...", "...", "...",
"Preceptor Table", "Preceptor Table", "...",
"...", "..."),
gender = c("?", "?", "...",
"Male", "Female", "...",
"?", "?", "...",
"Female", "Female", "...",
"?", "?"),
year = c("1990", "1995", "...",
"2006", "2006", "...",
"2010", "2012", "...",
"2026", "2026", "...",
"2046", "2050"),
state = c("?", "?", "...",
"Michigan", "Michigan", "...",
"?", "?", "...",
"Texas", "Texas", "...",
"?", "?"),
ytreat = c("?", "?", "...",
"Did not vote", "?", "...",
"?", "?", "...",
"?", "?", "...",
"?", "?"),
ycontrol = c("?", "?", "...",
"?", "Voted", "...",
"?", "?", "...",
"?", "?", "...",
"?", "?")) |>
# Then, we use the gt function to make it pretty
gt() |>
cols_label(source = md("Source"),
gender = md("Sex"),
year = md("Year"),
state = md("State"),
ytreat = md("Treatment"),
ycontrol = md("Control")) |>
tab_style(cell_borders(sides = "right"),
locations = cells_body(columns = c(source))) |>
tab_style(style = cell_text(align = "left", v_align = "middle", size = "large"),
locations = cells_column_labels(columns = c(source))) |>
cols_align(align = "center", columns = everything()) |>
cols_align(align = "left", columns = c(source)) |>
fmt_markdown(columns = everything()) |>
tab_spanner(label = "Outcomes", c(ytreat, ycontrol))
``` -->