declaration-diagnosis-redesign/crafting-data-strategy.qmd (2 additions & 2 deletions)
@@ -298,9 +298,9 @@ Measurement is the part of the data strategy in which variables are collected ab
Descriptive inference is threatened whenever measurements differ from the quantities they are meant to measure. For example, when we want to measure "latent variables" such as fear, support for a political candidate, or economic well-being, we use a measurement technology to imperfectly observe them. We might represent that measurement technology as the function $Q$ that yields the observed outcome $Y^{\mathrm obs}$: $Q(Y^*) = Y^{\mathrm obs}$. Our measurement strategy is a set of such functions, one for each variable we measure.
- Some measurement strategies exhibit little to no measurement error. It's easy enough to measure some plain matters of fact, like whether a country is a member of the European Union (though clerical errors could still crop up). In the social sciences, most measurement strategies are threatened by the possibility of measurement errors due to any number of biases (e.g., recall bias, observer bias, Hawthorn effects, demand effects, sensitivity bias, response substitution, among many others).
+ Some measurement strategies exhibit little to no measurement error. It's easy enough to measure some plain matters of fact, like whether a country is a member of the European Union (though clerical errors could still crop up). In the social sciences, most measurement strategies are threatened by the possibility of measurement errors due to any number of biases (e.g., recall bias, observer bias, Hawthorne effects, demand effects, sensitivity bias, response substitution, among many others).
- We often describe measurement error in two ways, measurement *validity*, and measurement *reliability*. Validity is the difference between the observed and latent outcome, $Y^{\mathrm obs} - Y^*$. Reliability is the consistency of the measurements we would obtain if we were to repeat the measurement many times, which we can operationalize as low variance of the measurements:, $\mathbb{V}(Y_1^{\mathrm obs}, Y_2^{\mathrm obs}, \ldots, Y_k^{\mathrm obs})$. We would of course like to always select valid, reliable measurement strategies. When no perfect measure is available, choices among alternative measurement strategies typically reduce to tradeoffs between their validity and reliability.
+ We often describe measurement error in two ways, measurement *validity*, and measurement *reliability*. Validity is the difference between the observed and latent outcome, $Y^{\mathrm obs} - Y^*$. Reliability is the consistency of the measurements we would obtain if we were to repeat the measurement many times, which we can operationalize as low variance of the measurements, $\mathbb{V}(Y_1^{\mathrm obs}, Y_2^{\mathrm obs}, \ldots, Y_k^{\mathrm obs})$. We would of course like to always select valid, reliable measurement strategies. When no perfect measure is available, choices among alternative measurement strategies typically reduce to tradeoffs between their validity and reliability.
To make these choices, we depend on methodological research whose main inquiries are the reliability and validity of particular measurement procedures. Sometimes measurement studies are presented as "validation" studies that compare a proposed measure to a "ground truth." But even "ground truths" must be measured, usually with an expensive or otherwise infeasible approach (otherwise there would be no need for the alternative measurement). Further, neither measurement is known to be exactly $Y^*$, so ultimately validation studies are comparisons of multiple techniques, each with their own advantages and disadvantages. This fact does not make these studies useless, but rather underlines that they rely on our faith in ground truths.
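As a gloss on the validity and reliability definitions above, here is a minimal base-R sketch (not from the chapter; the latent variable, the measurement function `Q`, and all parameter values are invented for illustration) that approximates validity as the average gap between $Y^{\mathrm obs}$ and $Y^*$ and reliability as the within-unit variance across repeated measurements:

```r
# Sketch: a latent outcome Y_star observed through an imperfect measurement function Q()
set.seed(123)
n <- 1000
Y_star <- rnorm(n)                                            # latent "true" outcome
Q <- function(y) 0.9 * y + 0.2 + rnorm(length(y), sd = 0.3)   # biased, noisy measurement

# Validity: how far the observed measure sits from the latent quantity, on average
Y_obs <- Q(Y_star)
mean(Y_obs - Y_star)

# Reliability: variance across k repeated measurements of the same units (lower is better)
k <- 20
repeats <- replicate(k, Q(Y_star))      # n x k matrix of repeated measurements
mean(apply(repeats, 1, var))
```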
library/complex.qmd (1 addition & 1 deletion)
@@ -192,7 +192,7 @@ The payoffs from structural estimation can be great: we can operationalize our t
-@Samii2016 enumerates some examples of structural estimation in economics and predicts future political scientists will take to structural estimation in earnest.
- -@francois2015power provides a structural model to explain how leaders allocate cabinet positions to bolseter coalitions; the analysis compares the performance of a preferred theory to rival theories.
+ -@francois2015power provides a structural model to explain how leaders allocate cabinet positions to bolster coalitions; the analysis compares the performance of a preferred theory to rival theories.
-@Frey2022 estimates a structural model of party competition and coalitions on the basis of a regression discontinuity design. With the model parameters estimated, the authors simulate counterfactual scenarios without party coalitions.
lifecycle/integration.qmd (6 additions & 6 deletions)
@@ -19,15 +19,15 @@ All three of these activities depend on an accurate understanding of the study d
## Communicating
- The findings from studies are communicated to other scholars through academic publications. But some of the most important audiences -- policymakers, businesses, journalists, and the public at large -- do not read academic journals. These audiences learn about the study in other in other ways, through op-eds, blog posts, and policy reports that translate research for nonspecialist audiences.
+ The findings from studies are communicated to other scholars through academic publications. But some of the most important audiences -- policymakers, businesses, journalists, and the public at large -- do not read academic journals. These audiences learn about the study in other ways, through op-eds, blog posts, and policy reports that translate research for nonspecialist audiences.
Too often, a casualty of translating the study from academic to other audiences is the design information. Emphasis gets placed on the study results, not on the reasons why the results of the study are to be believed. In sharing the research with nonspecialist audiences, we revert to saying *that* we think the findings are true and not *why* we think the findings are true. Explaining why requires explaining the research design, which in our view ought to be part of any public-facing communication about research.
- Looking at recent studies published in *The New York Times* Well section on health and fitness, we found that two dimensions of design quality were commonly ignored. First, experimental studies on new fitness regimens with very small samples, sometimes fewer than 10 units, are commonly highlighted. When both academic journals and reporters promote tiny studies, the likely result is that the published (and public) record contains many statistical flukes results reflecting noise rather than new discoveries. Second, very large studies that draw observational comparisons between large samples of dieters and non-dieters with millions of observations receive outsize attention. These designs are prone to bias from confounding, but these concerns are too often not described or discussed.
+ Looking at recent studies published in *The New York Times* Well section on health and fitness, we found that two dimensions of design quality were commonly ignored. First, experimental studies on new fitness regimens with very small samples, sometimes fewer than 10 units, are commonly highlighted. When both academic journals and reporters promote tiny studies, the likely result is that the published (and public) record contains many statistical fluke results reflecting noise rather than new discoveries. Second, very large studies that draw observational comparisons between large samples of dieters and non-dieters with millions of observations receive outsize attention. These designs are prone to bias from confounding, but these concerns are too often not described or discussed.
How can we improve scientific communication so that we better convey the credibility of findings? The market incentives for both journalists and authors reward striking and surprising findings, so any real solution to the problem likely requires addressing those incentives. Short of that, we recommend that authors who wish to communicate the high quality of their designs to the media do so by providing the design information in *M*, *I*, *D*, and *A* in lay terms. Science communicators can state the research question (*I*) and explain why applying the data and answer strategies is likely to yield a good answer to the question. The actual result is, of course, also important to communicate, but *why* it is a credible answer to the research question is just as important to share---specifically what has to be believed about *M* for the results to be on target (@exm-designtoshare: *Design to share*).
- How can we as researchers communicate about other scholars' work? Citations can't covey the entirety of *MIDA* in one sentence, but they can give an inkling. Here's an example of how we could cite a (hypothetical) study in a way that conveys at least some design information. "<span style="color:#81AFEF">Using a randomized experiment</span>, the researchers (Authors, Year) found that <span style="color:#DC5D86">donating to a campaign causes a large increase in the number of subsequent donation requests from other candidates</span>, which is consistent with <span style="color:#E6C560">theories of party behavior that predict</span><span style="color:#8DBA4C"> intra-party cooperation."</span>.
+ How can we as researchers communicate about other scholars' work? Citations can't convey the entirety of *MIDA* in one sentence, but they can give an inkling. Here's an example of how we could cite a (hypothetical) study in a way that conveys at least some design information. "<span style="color:#81AFEF">Using a randomized experiment</span>, the researchers (Authors, Year) found that <span style="color:#DC5D86">donating to a campaign causes a large increase in the number of subsequent donation requests from other candidates</span>, which is consistent with <span style="color:#E6C560">theories of party behavior that predict</span><span style="color:#8DBA4C"> intra-party cooperation."</span>.
The citation explains that the <span style="color:#81AFEF">data strategy</span> included some kind of randomized experiment (we don't know how many treatment arms or subjects, among other details), and that the <span style="color:#DC5D86">answer strategy</span> probably compared the counts of donation requests from any campaign (email requests, or phone, we don't know) among the groups of subjects that were assigned to donate to a particular campaign. The citation mentions the <span style="color:#E6C560">models</span> described in an unspecified area of the scientific literature on party politics, which all predict cooperation like the sharing of donor lists. We can reason that, if the <span style="color:#8DBA4C">inquiry</span>, "Is the population average treatment effect of donating to one campaign on the number of donation requests from other campaigns positive?" were put to each of these theories, they would all respond "Yes." The citation serves as a useful shorthand for the reader of what the claim of the paper is and why they should think it's credible. By contrast, a citation like "The researchers found that party members cooperate (Author, Year)." doesn't communicate any design information at all.
@@ -110,7 +110,7 @@ est |>
  kable(booktabs = TRUE,
        align = "c",
        digits = 3,
-       caption = "Analysis and reanalysis estiamtes")
+       caption = "Analysis and reanalysis estimates")
```
@@ -136,7 +136,7 @@ Reanalysis diagnosis.
What we see in the diagnosis below is that `A_prime` is only preferred if we know for sure that $X$ is measured pretreatment. In design 3, where $X$ is measured posttreatment, `A` is preferred, because controlling for $X$ leads to posttreatment bias. This diagnosis indicates that the reanalyst needs to justify their beliefs about the causal ordering of $X$ and $Z$ to claim that `A_prime` is preferred to `A`. The reanalyst should not conclude, on the basis of the realized estimates alone, that their answer strategy is preferred.
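To see the posttreatment-bias logic in miniature, here is a small simulation sketch (the coefficients and variable names are invented; this is not the declared design from the chapter) in which adjusting for an $X$ that is itself affected by $Z$ pulls the estimate away from the total effect:

```r
# Sketch: X is affected by treatment Z, so conditioning on X distorts the effect of Z on Y
set.seed(42)
n <- 10000
Z <- rbinom(n, 1, 0.5)                 # randomized treatment
X <- 0.5 * Z + rnorm(n)                # X measured posttreatment: depends on Z
Y <- 0.5 * Z + 0.8 * X + rnorm(n)      # total effect of Z on Y is 0.5 + 0.8 * 0.5 = 0.9

coef(lm(Y ~ Z))["Z"]       # analogue of A: unadjusted, close to the total effect (~0.9)
coef(lm(Y ~ Z + X))["Z"]   # analogue of A_prime: adjusting for posttreatment X
                           # recovers only the direct effect (~0.5)
```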
@@ -186,7 +186,7 @@ Here we have an original study design of size 1,000. The original study design's
@fig-ch23num2 shows that no matter how big we make the replication, we conclude that the difference-in-SATEs is nonzero only about 10% of the time. Similarly, the replication estimate is rarely outside of the original confidence interval, because it's rare to be more extreme than a wide confidence interval. The relatively high variance of the original study means it is so uncertain that it's tough to distinguish its estimate from any particular number.
- Turning to the third metric (is the original outside the 95% confidence interval of the replication estimate), we that we become more and more likely to conclude that the original study fails to replicate as the quality replication study goes up. At very large sample sizes, the replication confidence intervals become extremely small, so in the limit, it will always exclude the original study estimate.
+ Turning to the third metric (is the original outside the 95% confidence interval of the replication estimate), we see that we become more and more likely to conclude that the original study fails to replicate as the quality replication study goes up. At very large sample sizes, the replication confidence intervals become extremely small, so in the limit, it will always exclude the original study estimate.
The last metric, equivalence testing, has the nice property that, as the sample size grows, we get closer to the correct answer -- the true SATEs are indeed within 0.2 standard units of each other. However, again because the original study is so noisy, it is difficult to affirm its equivalence with anything, even when the replication study is quite large.
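For concreteness, here is a rough sketch of how the confidence-interval and equivalence metrics could be computed for a single original/replication pair; the estimates, standard errors, and the 0.2 bound below are placeholders, not diagnosands from the declared design:

```r
# Hypothetical original and replication results (estimate, standard error)
orig_est <- 0.25; orig_se <- 0.15
rep_est  <- 0.10; rep_se  <- 0.03

# Metric: does the original estimate fall outside the replication's 95% CI?
rep_ci <- rep_est + c(-1, 1) * qnorm(0.975) * rep_se
orig_est < rep_ci[1] | orig_est > rep_ci[2]

# Equivalence (TOST-style): is the difference in estimates reliably within +/- 0.2?
bound   <- 0.2
diff    <- rep_est - orig_est
se_diff <- sqrt(orig_se^2 + rep_se^2)
p_lower <- pnorm((diff + bound) / se_diff, lower.tail = FALSE)  # H0: diff <= -0.2
p_upper <- pnorm((diff - bound) / se_diff)                      # H0: diff >= +0.2
max(p_lower, p_upper) < 0.05                                    # TRUE = affirm equivalence
```

With a noisy original estimate, `se_diff` stays large no matter how precise the replication is, which is why equivalence is hard to affirm here even when the replication sample is very large.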
lifecycle/planning.qmd (2 additions & 2 deletions)
@@ -79,7 +79,7 @@ A researcher might use this graph together with the partner to jointly select th
{#fig-ch21num2}
- Choosing the proportion treated is one example of integrating partner constraints into research designs. A second common problem is that there are a set of units that must be treated or that must not be treated for ethical or political reasons (e.g., the home district of a government partner must receive the treatment). If these constraints are discovered after treatment assignment, they lead to noncompliance, which may substantially complicate the analysis of the experiment and even prevent providing an answer to the original inquiry. @Gerber2012 recommend, before randomizing treatment, exploring possible treatment assignments with the partner organization and using this exercise to elicit the set of units that must or cannot be treated. @king2007politically describe a "politically-robust" design, which uses pair-matched block randomization. In this design, when any unit is dropped due to political constraints, the whole pair is dropped from the study.^[This procedure is prone to at risk of bias for the average treatment effect among the "politically feasible" units if within some pairs, one unit is treatable but the other is not.]
+ Choosing the proportion treated is one example of integrating partner constraints into research designs. A second common problem is that there are a set of units that must be treated or that must not be treated for ethical or political reasons (e.g., the home district of a government partner must receive the treatment). If these constraints are discovered after treatment assignment, they lead to noncompliance, which may substantially complicate the analysis of the experiment and even prevent providing an answer to the original inquiry. @Gerber2012 recommend, before randomizing treatment, exploring possible treatment assignments with the partner organization and using this exercise to elicit the set of units that must or cannot be treated. @king2007politically describe a "politically-robust" design, which uses pair-matched block randomization. In this design, when any unit is dropped due to political constraints, the whole pair is dropped from the study.^[This procedure is at risk of bias for the average treatment effect among the "politically feasible" units if within some pairs, one unit is treatable but the other is not.]
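A minimal sketch of the pair-matched logic described just above (the unit labels and the infeasibility constraint are hypothetical): randomize one unit per pair to treatment, then drop the whole pair whenever a political constraint forces a unit out.

```r
# Sketch: pair-matched randomization with whole-pair dropping under political constraints
set.seed(7)
n_pairs <- 10
design <- data.frame(pair = rep(1:n_pairs, each = 2), unit = 1:(2 * n_pairs))
design$Z <- unlist(lapply(1:n_pairs, function(p) sample(c(0, 1))))  # one treated per pair

# Suppose units 3 and 12 cannot be treated for political reasons
cannot_treat <- c(3, 12)
dropped_pairs <- unique(design$pair[design$unit %in% cannot_treat & design$Z == 1])
analysis_sample <- subset(design, !pair %in% dropped_pairs)  # drop whole pairs, not units
```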
A major benefit of working with partners is their deep knowledge of the substantive area. For this reason, we recommend involving them in the design declaration and diagnosis process. How can we develop intuitions about the means, variances, and covariances of the variables to be measured? Ask your partner for their best guesses, which may be far more educated than your own. For experimental studies, solicit your partner's beliefs about the magnitude of the treatment effect on each outcome variable, subgroup by subgroup if possible. Engaging partners in the declaration process improves design -- and it very quickly sharpens the discussion of key design details. Pro-tip: Share your design diagnoses and mock analyses *before* the study is launched to quickly build consensus around the study's goals.
@@ -121,7 +121,7 @@ Like main studies, pilot studies can be declared and diagnosed -- but importantl
{#fig-ch21num3}
- Suppose we have prior beliefs about the effect size that can be summarized as a normal distribution centered at 0.3 with a standard deviation of 0.1, as in the bottom panel of @fig-ch21num2. We could choose a design that corresponds to this best guess, the average of our prior belief distribution. If the true effect size is 0.3, then a study with 350 subjects will have 80% power.
+ Suppose we have prior beliefs about the effect size that can be summarized as a normal distribution centered at 0.3 with a standard deviation of 0.1, as in the bottom panel of @fig-ch21num3. We could choose a design that corresponds to this best guess, the average of our prior belief distribution. If the true effect size is 0.3, then a study with 350 subjects will have 80% power.
However, redesigning the study to optimize for the "best guess" is risky because the true effect could be much smaller than 0.3. Suppose we adopt the redesign heuristic of powering the study for an effect size at the 10th percentile of our prior belief distribution, which works out here to be an effect size of 0.17. Following this rule, we would select a design with 1,100 subjects.
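The sample sizes quoted in this passage can be checked with stock power calculations; the sketch below assumes a two-arm comparison of means with outcomes in standard units (our reading of the setup, not stated explicitly in the excerpt):

```r
# 10th percentile of the prior over effect sizes, Normal(mean = 0.3, sd = 0.1)
qnorm(0.1, mean = 0.3, sd = 0.1)                   # ~0.17

# Two-group power calculations (outcome sd = 1, alpha = 0.05, power = 0.80)
power.t.test(delta = 0.30, sd = 1, power = 0.80)   # ~175 per arm, ~350 subjects total
power.t.test(delta = 0.17, sd = 1, power = 0.80)   # ~545 per arm, ~1,090 subjects total
```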
0 commit comments