d-morrison · Apr 10, 2026 · Nov 14, 2025
diff --git a/.gitignore b/.gitignore
@@ -28,6 +28,8 @@ _freeze/
 *.pdf
 rsconnect
 *.md
+
+**/*.quarto_ipynb
 !.github/copilot-instructions.md
 
 **/*.quarto_ipynb

diff --git a/_quarto-book.yml b/_quarto-book.yml
@@ -38,6 +38,7 @@ book:
     - appendices-are-prereqs.qmd
     - math-prereqs.qmd
     - probability.qmd
+    - classification.qmd
     - estimation.qmd
     - inference.qmd
     - intro-MLEs.qmd

diff --git a/_quarto-website.yml b/_quarto-website.yml
@@ -18,6 +18,7 @@ project:
     - appendices-are-prereqs.qmd
     - math-prereqs.qmd
     - probability.qmd
+    - classification.qmd
     - estimation.qmd
     - inference.qmd
     - intro-MLEs.qmd
@@ -82,6 +83,8 @@ website:
             href: math-prereqs.qmd
           - text: "Probability"
             href: probability.qmd
+          - text: "Classification"
+            href: classification.qmd
           - text: "Estimation"
             href: estimation.qmd
           - text: "Inference"

diff --git a/classification.qmd b/classification.qmd
@@ -0,0 +1,212 @@
+---
+title: "Classification"
+format:
+  html: default
+  revealjs:
+    output-file: classification-slides.html
+  pdf:
+    output-file: classification-handout.pdf
+---
+
+{{< include shared-config.qmd >}}
+
+---
+
+Classification is a core problem in statistics and machine learning:
+we seek to assign individuals or observations to one of several discrete categories
+based on available data.
+In medicine and epidemiology,
+classification problems arise constantly: for example,
+determining whether a patient has a disease based on test results,
+biomarkers, or clinical signs.
+
+::: {#def-classification}
+
+#### Classification
+
+A **classification problem** is a statistical problem in which
+we seek to assign observations to one of two or more discrete categories (classes)
+based on observed features or predictors.
+In the binary case, we assign each observation to one of two classes,
+often labeled as "positive" or "negative", "diseased" or "healthy", etc.
+
+:::
+
+---
+
+A central challenge in medical classification is interpreting test results correctly.
+A test may appear highly accurate in isolation,
+yet its predictive value for an individual patient depends heavily on the
+prevalence of the condition in the population being tested.
+Understanding this interplay requires tools from probability theory, in particular
+Bayes' theorem and the law of total probability.
+
+In the sections below, we define the key performance measures of a diagnostic test
+and work through a concrete example using COVID-19 testing.
+
+## Diagnostic test characteristics
+
+When evaluating a diagnostic test, we consider several key performance measures:
+
+:::{#def-sensitivity}
+
+#### Sensitivity
+
+The probability that the test is positive given that the person has the disease, denoted $\pmf{\text{positive} \mid \text{disease}}$.
+
+:::
+
+:::{#def-specificity}
+
+#### Specificity
+
+The probability that the test is negative given that the person does not have the disease, denoted $\pmf{\text{negative} \mid \text{no disease}}$.
+
+:::
+
+:::{#def-ppv}
+
+#### Positive Predictive Value (PPV)
+
+The probability that a person has the disease given that their test is positive, denoted $\pmf{\text{disease} \mid \text{positive}}$.
+
+:::
+
+:::{#def-npv}
+
+#### Negative Predictive Value (NPV)
+
+The probability that a person does not have the disease given that their test is negative, denoted $\pmf{\text{no disease} \mid \text{negative}}$.
+
+:::
+
+---
+
+## Example: COVID-19 testing
+
+Suppose we have a COVID-19 test with the following characteristics:
+
+- **99% sensitive**: If a person has COVID-19, the test will be positive 99% of the time
+- **99% specific**: If a person does not have COVID-19, the test will be negative 99% of the time
+
+---
+
+Let's define our events:
+
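+- Let $D$ denote the event "person has COVID-19", and $\neg D$ its complement "person does not have COVID-19"
+- Let $+$ denote the event "test is positive", and $-$ its complement "test is negative"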
+
+Then our test characteristics can be written as:
+
+$$
+\pmf{+ \mid D} = 0.99 \quad \text{(sensitivity)}
+$$
+
+$$
+\pmf{- \mid \neg D} = 0.99 \quad \text{(specificity)}
+$$
+
+---
+
+Note that if specificity is 0.99, then the false positive rate is:
+$$
+\pmf{+ \mid \neg D} = 1 - 0.99 = 0.01
+$$
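+
+By the same complement rule, a sensitivity of 0.99 implies a false negative rate of:
+$$
+\pmf{- \mid D} = 1 - 0.99 = 0.01
+$$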
+
+Suppose the **prevalence** of COVID-19 in the population is 7%:
+
+$$
+\pmf{D} = 0.07
+$$
+
+$$
+\pmf{\neg D} = 0.93
+$$
+
+---
+
+## Calculating positive predictive value
+
+The key question we want to answer is: **If someone tests positive, what is the probability they actually have COVID-19?**
+
+This is the positive predictive value:
+$$
+\pmf{D \mid +} = \, ?
+$$
+
+---
+
+We can use **Bayes' theorem** to calculate this:
+
+$$
+\pmf{D \mid +} = \frac{\pmf{+ \mid D} \cd \pmf{D}}{\pmf{+}}
+$$
+
+To find $\pmf{+}$, we use the **law of total probability**:
+
+$$
+\pmf{+} = \pmf{+ \mid D} \cd \pmf{D} + \pmf{+ \mid \neg D} \cd \pmf{\neg D}
+$$
+
+---
+
+Now we can calculate each component:
+
+**Joint probability of testing positive and having the disease (true positive):**
+$$
+\pmf{+ \mid D} \cd \pmf{D} = 0.99 \times 0.07 = 0.0693
+$$
+
+**Joint probability of testing positive and not having the disease (false positive):**
+$$
+\pmf{+ \mid \neg D} \cd \pmf{\neg D} = 0.01 \times 0.93 = 0.0093
+$$
+
+---
+
+**Total probability of positive test:**
+$$
+\pmf{+} = 0.0693 + 0.0093 = 0.0786
+$$
+
+**Positive predictive value:**
+$$
+\pmf{D \mid +} = \frac{0.0693}{0.0786} \approx 0.88
+$$
+
+---
+
+Therefore, even with a highly accurate test (99% sensitive and 99% specific), only about 88% of people who test positive actually have COVID-19.
+This is because the disease prevalence is relatively low (7%), so false positives make up a meaningful fraction of all positive tests.
+
+::: notes
+This counterintuitive result demonstrates the importance of considering disease prevalence when interpreting test results.
+Even highly accurate tests can have relatively low positive predictive values when the disease is rare.
+:::
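+
+---
+
+The same two tools also give the negative predictive value (@def-npv).
+As a sketch using the numbers assumed in this example,
+with $-$ denoting a negative test and a false negative rate of $\pmf{- \mid D} = 1 - 0.99 = 0.01$:
+
+$$
+\begin{aligned}
+\pmf{\neg D \mid -} &= \frac{\pmf{- \mid \neg D} \cd \pmf{\neg D}}{\pmf{- \mid \neg D} \cd \pmf{\neg D} + \pmf{- \mid D} \cd \pmf{D}} \\
+&= \frac{0.99 \times 0.93}{0.99 \times 0.93 + 0.01 \times 0.07} \\
+&= \frac{0.9207}{0.9214} \approx 0.999
+\end{aligned}
+$$
+
+So at 7% prevalence, a negative result is far more conclusive than a positive one.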
+
+---
+
+## Alternative formulation
+
+We can rearrange Bayes' theorem to express the positive predictive value in terms of the sensitivity, specificity, and disease prevalence:
+
+$$
+\begin{aligned}
+\pmf{D \mid +} &= \frac{\pmf{+ \mid D} \cd \pmf{D}}{\pmf{+}} \\
+&= \frac{\pmf{+ \mid D} \cd \pmf{D}}{\pmf{+ \mid D} \cd \pmf{D} + \pmf{+ \mid \neg D} \cd \pmf{\neg D}} \\
+&= \frac{\pmf{D}}{\pmf{D} + \frac{\pmf{+ \mid \neg D}}{\pmf{+ \mid D}} \cd \pmf{\neg D}} \\
+&= \frac{1}{1 + \frac{\pmf{+ \mid \neg D}}{\pmf{+ \mid D}} \cd \frac{\pmf{\neg D}}{\pmf{D}}} \\
+&= \frac{1}{1 + \frac{1 - \text{spec}}{\text{sens}} \cd \frac{1 - \text{prev}}{\text{prev}}}
+\end{aligned}
+$$
+
+---
+
+This final form shows that the positive predictive value is driven by the ratio of the false positive rate to the sensitivity, multiplied by the ratio of non-diseased to diseased individuals in the population (the odds against disease).
+It shows that even with a very high sensitivity and specificity, the positive predictive value depends strongly on disease prevalence.
+
+::: notes
+This algebraic form is useful for understanding how the different parameters interact.
+Notice how the odds against disease, $\pmf{\neg D}/\pmf{D}$, appear explicitly in the denominator.
+When the disease is rare, this ratio is large, which reduces the positive predictive value.
+:::
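+
+---
+
+As a check, substituting the example's values into the final form should reproduce the earlier result.
+The second line uses a hypothetical prevalence of 1%, which is an assumption made here for illustration and not a value from the example:
+
+$$
+\begin{aligned}
+\text{prev} = 0.07: \quad \pmf{D \mid +} &= \frac{1}{1 + \frac{0.01}{0.99} \cd \frac{0.93}{0.07}} \approx \frac{1}{1 + 0.134} \approx 0.88 \\
+\text{prev} = 0.01: \quad \pmf{D \mid +} &= \frac{1}{1 + \frac{0.01}{0.99} \cd \frac{0.99}{0.01}} = \frac{1}{1 + 1} = 0.5
+\end{aligned}
+$$
+
+With the same test, a positive result at 1% prevalence would be right only half the time.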
diff --git a/inst/WORDLIST b/inst/WORDLIST
@@ -1,4 +1,5 @@
 Biostat
+biomarkers
 CLT
 Epi
 Github

diff --git a/latex-macros b/latex-macros
diff --git a/probability.qmd b/probability.qmd
@@ -532,7 +532,6 @@ $\dsn{X}$.
 
 {{< include sec-CLT.qmd >}}
 
-
 # Additional resources
 
 - @problifesaver