From 2ac9ca1aceaf664f2c5266e7d067c060f63f310f Mon Sep 17 00:00:00 2001
From: kate-marine <katemarine22@gmail.com>
Date: Tue, 25 Nov 2025 01:00:14 -0500
Subject: [PATCH] Created unit 6 ridge regression exercise

Created a 4-part practice problem involving ridge regression and correlated predictors, with solutions. Written as a LaTeX document, trying to keep the style as similar as possible to the current notes/exercise PDFs.
---
 ridge_regress_practice_problem | 126 +++++++++++++++++++++++++++++++++
 1 file changed, 126 insertions(+)
 create mode 100644 ridge_regress_practice_problem

diff --git a/ridge_regress_practice_problem b/ridge_regress_practice_problem
new file mode 100644
index 0000000..16b5cf0
--- /dev/null
+++ b/ridge_regress_practice_problem
@@ -0,0 +1,126 @@
+\documentclass{article}
+\usepackage{graphicx} % Required for inserting images
+
+\title{Extra Unit 6 Practice Problems}
+% \author{katherine.a.marine.28 }
+
+\usepackage[a4paper, margin=1in]{geometry} % uniform margins
+\begin{document}
+
+\maketitle
+
+% \section{Introduction}
+\subsubsection{Exercise 1: Ridge with Correlated Predictors}
+
+A sports scientist is modeling a player's injury risk score $Y$ using three predictors measured before each match:
+\begin{itemize}
+    \item \(X_1:\) \textit{Cumulative minutes played in last 4 weeks}
+    \item \(X_2:\) \textit{Number of matches started in last 4 weeks}
+    \item \(X_3:\) \textit{Average high-speed distance per match (km)}
+\end{itemize}
+\vspace{10pt}
+The predictors $X_1$ and $X_2$ are highly correlated: players who play more minutes usually also start more matches. The researcher fits three models using the same training data:
+\begin{enumerate}
+    \item Ordinary least squares (OLS),
+    \item Ridge regression with $\lambda = 0.1$,
+    \item Ridge regression with $\lambda = 5$.
+\end{enumerate}
+\vspace{10pt}
+All models include an intercept. The estimated coefficients (excluding the intercept) are:
+\vspace{6pt}
+\begin{center}
+\begin{tabular}{lccc}
+\hline
+Model & $\hat{\beta}_1$ (for $X_1$) & $\hat{\beta}_2$ (for $X_2$) & $\hat{\beta}_3$ (for $X_3$) \\
+\hline
+OLS & $4.5$ & $-0.3$ & $0.05$ \\
+Ridge ($\lambda = 0.1$) & $3.0$ & $1.2$ & $0.08$ \\
+Ridge ($\lambda = 5$)   & $1.4$ & $1.3$ & $0.02$ \\
+\hline
+\end{tabular}
+\end{center}
+\vspace{10pt}
+\noindent On an independent test set, the mean squared error (MSE) for ridge regression at different values of $\lambda$ is:
+
+\begin{center}
+\begin{tabular}{cc}
+\hline
+$\lambda$ & Test MSE \\
+\hline
+$0$ (OLS) & $12.5$ \\
+$0.1$     & $10.2$ \\
+$1$       & $8.8$ \\
+$5$       & $9.0$ \\
+$20$      & $11.7$ \\
+\hline
+\end{tabular}
+\end{center}
+\vspace{15pt}
+\noindent\underline{Questions}:
+\begin{enumerate}
+    \item Based on the coefficient table, how do the ridge estimates differ from the OLS estimates? Comment specifically on the sizes and signs of $\hat{\beta}_1$ and $\hat{\beta}_2$.
+    \vspace{10pt}
+    \item Explain why multicollinearity between $X_1$ and $X_2$ can cause the OLS pattern (large $\hat{\beta}_1 = 4.5$ and slightly negative $\hat{\beta}_2 = -0.3$), even if both variables are in reality positively related to injury risk.
+    \vspace{10pt}
+    \item Ridge regression minimizes
+    \[
+    \sum_{j=1}^N (Y_j - \hat{y}(X_j,D))^2 \;+\; \lambda \sum_{k=1}^3 \beta_k^2.
+    \]
+    Intuitively explain how this penalty term leads to:
+    \begin{itemize}
+        \item smaller coefficient magnitudes, and
+        \item a more ``balanced'' sharing of the effect between $X_1$ and $X_2$.
+    \end{itemize}
+    \vspace{10pt}
+    \item Using the MSE table, sketch (by hand) the test MSE as a function of $\lambda$. For which value of $\lambda$ does the model perform best on the test set? Briefly interpret this behavior in terms of the bias--variance tradeoff.
+    
+\end{enumerate}
+
+\vspace{15pt}
+\noindent\underline{Solution}:
+\begin{enumerate}
+
+\item From the coefficient table:
+\begin{itemize}
+    \item OLS: $\hat{\beta}_1 = 4.5$ (large positive), $\hat{\beta}_2 = -0.3$ (slightly negative), $\hat{\beta}_3 = 0.05$ (small positive).
+    \item Ridge with $\lambda = 0.1$: $\hat{\beta}_1 = 3.0$, $\hat{\beta}_2 = 1.2$, $\hat{\beta}_3 = 0.08$.
+    \item Ridge with $\lambda = 5$: $\hat{\beta}_1 = 1.4$, $\hat{\beta}_2 = 1.3$, $\hat{\beta}_3 = 0.02$.
+\end{itemize}
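+As $\lambda$ increases, the coefficient vector as a whole shrinks toward $0$, driven mainly by $\hat{\beta}_1$ dropping from $4.5$ to $3.0$ to $1.4$. Not every individual coefficient shrinks, however: $\hat{\beta}_2$ changes sign and grows in magnitude, from $-0.3$ to $1.2$ to $1.3$. The ridge estimates also become more ``balanced'' for $X_1$ and $X_2$: for $\lambda = 5$, $\hat{\beta}_1$ and $\hat{\beta}_2$ are very similar (1.4 and 1.3), in contrast to the OLS fit where most of the effect is placed on $X_1$ and $\hat{\beta}_2$ is slightly negative.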
+
+\item When $X_1$ and $X_2$ are highly correlated, the matrix $X^TX$ (where $X$ is the design matrix) is nearly singular, so the OLS estimator is very unstable: small changes in the data can lead to large changes in the individual coefficients. Because the two predictors carry almost the same information, the data pin down their combined effect well but not how that effect is split between them. OLS can still produce reasonable predictions for $Y$, but it may do so using a combination of a very large positive coefficient on one predictor and a small or even negative coefficient on the other. Thus we can see a pattern like $\hat{\beta}_1 = 4.5$ and $\hat{\beta}_2 = -0.3$ even if, in reality, both $X_1$ and $X_2$ are positively related to injury risk.
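+
+The instability can be made concrete with a short sketch. It rests on assumptions not stated in the exercise: the predictors and the response are centered (so the unpenalized intercept drops out), $X$ is the $N \times 3$ matrix of centered predictors, and $y$ is the centered response vector. Under these assumptions the two estimators have the closed forms
+\[
+\hat{\beta}^{\mathrm{OLS}} = (X^TX)^{-1}X^Ty, \qquad \hat{\beta}^{\mathrm{ridge}}_{\lambda} = (X^TX + \lambda I)^{-1}X^Ty .
+\]
+If $X^TX$ has eigenvalues $d_1 \ge d_2 \ge d_3 > 0$, then $X^TX + \lambda I$ has eigenvalues $d_i + \lambda$. Strong correlation between $X_1$ and $X_2$ makes $d_3$ very small relative to $d_1$, so the factor $1/d_3$ appearing in the OLS inverse is very large, while the corresponding ridge factor $1/(d_3 + \lambda)$ can never exceed $1/\lambda$.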
+
+\item Ridge minimizes
+\[
+\sum_{j=1}^N (Y_j - \hat{y}(X_j,D))^2 + \lambda \sum_{k=1}^3 \beta_k^2.
+\]
+The penalty term $\lambda \sum_k \beta_k^2$ discourages large coefficients: any increase in $|\beta_k|$ increases the penalty, so a larger coefficient is only worthwhile if it reduces the residual sum of squares by more than the penalty it adds. As a result, ridge solutions have:
+\begin{itemize}
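+    \item \emph{Smaller coefficient magnitudes}: the coefficient vector is shrunk toward $0$ relative to OLS, in the sense that $\sum_k \hat{\beta}_k^2$ decreases as $\lambda$ grows. An individual coefficient can still grow, as $\hat{\beta}_2$ does here.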
+    \item \emph{More balanced sharing between $X_1$ and $X_2$}: for highly correlated predictors, it is ``cheaper'' (in terms of the penalty) to assign moderate coefficients to both variables than to let one become very large and the other negative. This leads to ridge spreading the effect across $X_1$ and $X_2$ rather than letting one coefficient explode.
+\end{itemize}
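+
+As a rough numerical check of both points, the penalty sum $\sum_k \hat{\beta}_k^2$ can be computed directly from the coefficient table:
+\[
+\begin{array}{lcl}
+\mathrm{OLS:} & 4.5^2 + (-0.3)^2 + 0.05^2 & \approx 20.34, \\
+\lambda = 0.1: & 3.0^2 + 1.2^2 + 0.08^2 & \approx 10.45, \\
+\lambda = 5: & 1.4^2 + 1.3^2 + 0.02^2 & \approx 3.65.
+\end{array}
+\]
+For the ``balancing'' effect, consider a purely hypothetical limiting case that the exercise does not assume: suppose $X_1$ and $X_2$ were measured on the same scale and were almost identical, so that the fit depended essentially only on $\beta_1 + \beta_2$. The OLS pair $(4.5, -0.3)$ and the even split $(2.1, 2.1)$ both sum to $4.2$, but their penalties differ sharply:
+\[
+4.5^2 + (-0.3)^2 = 20.34 \qquad \mbox{versus} \qquad 2.1^2 + 2.1^2 = 8.82 .
+\]
+In that idealized setting ridge would therefore prefer the even split, which is the direction in which the tabulated ridge estimates move.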
+
+\item From the table:
+
+\begin{center}
+\begin{tabular}{cc}
+\hline
+$\lambda$ & Test MSE \\
+\hline
+$0$ (OLS) & $12.5$ \\
+$0.1$     & $10.2$ \\
+$1$       & $8.8$ \\
+$5$       & $9.0$ \\
+$20$      & $11.7$ \\
+\hline
+\end{tabular}
+\end{center}
+
+If we plot test MSE vs.\ $\lambda$, it decreases from $\lambda = 0$ to $\lambda = 1$, reaches its lowest tabulated value at $\lambda = 1$ (MSE $= 8.8$), then increases again for larger $\lambda$.
+
+Thus, among the five values tested, the best test performance occurs at $\lambda = 1$; with such a coarse grid, the true optimum presumably lies somewhere between $0.1$ and $5$. For small to moderate $\lambda$, ridge adds a bit of bias but substantially reduces variance, improving generalization. For very large $\lambda$, the coefficients are shrunk too strongly toward zero, increasing bias and causing the test MSE to rise again. This illustrates the usual bias--variance tradeoff.
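+
+The tradeoff can be written out under a standard modeling assumption that is not part of the exercise: at a fixed test input $x$, suppose $Y = f(x) + \varepsilon$ with $E[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, where $f$ and $\sigma^2$ are notation introduced here for illustration. Taking the expectation over both the training data $D$ and the noise, the expected squared prediction error decomposes as
+\[
+E\left[(Y - \hat{y}(x,D))^2\right] = \left(E_D[\hat{y}(x,D)] - f(x)\right)^2 + \mathrm{Var}_D\left(\hat{y}(x,D)\right) + \sigma^2 .
+\]
+The first term is the squared bias, which grows as $\lambda$ increases; the second is the variance, which falls as $\lambda$ increases; the third does not depend on $\lambda$. A U-shaped test MSE curve like the one in the table is what we would expect when the variance reduction dominates at first and the growth in bias dominates later.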
+
+\end{enumerate}
+
+
+\end{document}