From 2ac9ca1aceaf664f2c5266e7d067c060f63f310f Mon Sep 17 00:00:00 2001
From: kate-marine
Date: Tue, 25 Nov 2025 01:00:14 -0500
Subject: [PATCH] Created unit 6 ridge regression exercise

Created 4-part practice problem involving ridge regression and correlated
predictors. Also added solutions. Created as a LaTeX document, trying to keep
the style as similar as possible to the current notes/exercise PDFs.
---
 ridge_regress_practice_problem | 126 +++++++++++++++++++++++++++++++++
 1 file changed, 126 insertions(+)
 create mode 100644 ridge_regress_practice_problem

diff --git a/ridge_regress_practice_problem b/ridge_regress_practice_problem
new file mode 100644
index 0000000..16b5cf0
--- /dev/null
+++ b/ridge_regress_practice_problem
@@ -0,0 +1,126 @@
+\documentclass{article}
+\usepackage{graphicx} % Required for inserting images
+
+\title{Extra Unit 6 Practice Problems}
+% \author{katherine.a.marine.28 }
+
+\usepackage[a4paper, margin=1in]{geometry} % uniform margins
+\begin{document}
+
+\maketitle
+
+% \section{Introduction}
+\subsubsection{Exercise 1: Ridge with Correlated Predictors}
+
+A sports scientist is modeling a player's injury risk score $Y$ using three predictors measured before each match:
+\begin{itemize}
+    \item \(X_1\): \textit{Cumulative minutes played in last 4 weeks}
+    \item \(X_2\): \textit{Number of matches started in last 4 weeks}
+    \item \(X_3\): \textit{Average high-speed distance per match (km)}
+\end{itemize}
+\vspace{10pt}
+The predictors $X_1$ and $X_2$ are highly correlated: players who play more minutes usually also start more matches. The researcher fits three models using the same training data:
+\begin{enumerate}
+    \item Ordinary least squares (OLS),
+    \item Ridge regression with $\lambda = 0.1$,
+    \item Ridge regression with $\lambda = 5$.
+\end{enumerate}
+\vspace{10pt}
+All models include an intercept.
+The estimated coefficients (excluding the intercept) are:
+\vspace{6pt}
+\begin{center}
+\begin{tabular}{lccc}
+\hline
+Model & $\hat{\beta}_1$ (for $X_1$) & $\hat{\beta}_2$ (for $X_2$) & $\hat{\beta}_3$ (for $X_3$) \\
+\hline
+OLS & $4.5$ & $-0.3$ & $0.05$ \\
+Ridge ($\lambda = 0.1$) & $3.0$ & $1.2$ & $0.08$ \\
+Ridge ($\lambda = 5$) & $1.4$ & $1.3$ & $0.02$ \\
+\hline
+\end{tabular}
+\end{center}
+\vspace{10pt}
+\noindent On an independent test set, the mean squared error (MSE) for ridge regression at different values of $\lambda$ is:
+
+\begin{center}
+\begin{tabular}{cc}
+\hline
+$\lambda$ & Test MSE \\
+\hline
+$0$ (OLS) & $12.5$ \\
+$0.1$ & $10.2$ \\
+$1$ & $8.8$ \\
+$5$ & $9.0$ \\
+$20$ & $11.7$ \\
+\hline
+\end{tabular}
+\end{center}
+\vspace{15pt}
+\noindent\underline{Questions}:
+\begin{enumerate}
+    \item Based on the coefficient table, how do the ridge estimates differ from the OLS estimates? Comment specifically on the magnitudes and signs of $\hat{\beta}_1$ and $\hat{\beta}_2$.
+    \vspace{10pt}
+    \item Explain why multicollinearity between $X_1$ and $X_2$ can cause the OLS pattern (large $\hat{\beta}_1 = 4.5$ and slightly negative $\hat{\beta}_2 = -0.3$), even if both variables are, in reality, positively related to injury risk.
+    \vspace{10pt}
+    \item Ridge regression minimizes
+    \[
+    \sum_{j=1}^N (Y_j - \hat{y}(X_j,D))^2 \;+\; \lambda \sum_{k=1}^3 \beta_k^2.
+    \]
+    Intuitively explain how this penalty term leads to:
+    \begin{itemize}
+        \item smaller coefficient magnitudes overall, and
+        \item a more ``balanced'' sharing of the effect between $X_1$ and $X_2$.
+    \end{itemize}
+    \vspace{10pt}
+    \item Using the MSE table, sketch (by hand) the test MSE as a function of $\lambda$. For which value of $\lambda$ does the model perform best on the test set? Briefly interpret this behavior in terms of the bias--variance tradeoff.
+
+\end{enumerate}
+
+\vspace{15pt}
+\noindent\underline{Solution}:
+\begin{enumerate}
+
+\item From the coefficient table:
+\begin{itemize}
+    \item OLS: $\hat{\beta}_1 = 4.5$ (large positive), $\hat{\beta}_2 = -0.3$ (slightly negative), $\hat{\beta}_3 = 0.05$ (small positive).
+    \item Ridge with $\lambda = 0.1$: $\hat{\beta}_1 = 3.0$, $\hat{\beta}_2 = 1.2$, $\hat{\beta}_3 = 0.08$.
+    \item Ridge with $\lambda = 5$: $\hat{\beta}_1 = 1.4$, $\hat{\beta}_2 = 1.3$, $\hat{\beta}_3 = 0.02$.
+\end{itemize}
+As $\lambda$ increases, the overall size of the coefficient vector shrinks: $\sum_k \hat{\beta}_k^2$ falls from about $20.3$ (OLS) to about $10.4$ ($\lambda = 0.1$) to about $3.7$ ($\lambda = 5$). Not every individual coefficient shrinks, however: $\hat{\beta}_1$ drops from $4.5$ to $1.4$, while $\hat{\beta}_2$ changes sign and grows from $-0.3$ to $1.3$. The ridge estimates therefore become more ``balanced'' for $X_1$ and $X_2$: for $\lambda = 5$, $\hat{\beta}_1$ and $\hat{\beta}_2$ are very similar ($1.4$ and $1.3$), in contrast to the OLS fit, where most of the effect is placed on $X_1$ and $\hat{\beta}_2$ is slightly negative.
+
+\item When $X_1$ and $X_2$ are highly correlated, the matrix $X^TX$ (formed from the design matrix $X$) is nearly singular, so the OLS estimator has very high variance: small changes in the data can lead to large changes in the individual coefficients. The data mainly determine the combined effect of $X_1$ and $X_2$, not how that effect is split between them. OLS can still produce reasonable predictions for $Y$, but it may do so using a very large positive coefficient on one predictor and a small or even negative coefficient on the other. Thus we can see a pattern like $\hat{\beta}_1 = 4.5$ and $\hat{\beta}_2 = -0.3$ even if, in reality, both $X_1$ and $X_2$ are positively related to injury risk.
+
+\item Ridge minimizes
+\[
+\sum_{j=1}^N (Y_j - \hat{y}(X_j,D))^2 + \lambda \sum_{k=1}^3 \beta_k^2.
+\]
+The penalty term $\lambda \sum_k \beta_k^2$ discourages large coefficients: any increase in $|\beta_k|$ increases the penalty, so a larger coefficient is worthwhile only if it reduces the residual sum of squares by more than the added penalty. As a result, ridge solutions have:
+\begin{itemize}
+    \item \emph{Smaller coefficient magnitudes}: the coefficient vector as a whole is shrunk toward $0$ relative to OLS (its squared norm $\sum_k \hat{\beta}_k^2$ is smaller), even though an individual coefficient such as $\hat{\beta}_2$ can grow in magnitude.
+    \item \emph{More balanced sharing between $X_1$ and $X_2$}: for highly correlated predictors, it is ``cheaper'' (in terms of the penalty) to assign moderate coefficients to both variables than to let one become very large and the other negative. Because the penalty is quadratic, two moderate coefficients cost less than one large one: if the two predictors were on the same scale and interchangeable, for instance, $2^2 + 2^2 = 8$ whereas $4^2 + 0^2 = 16$. This leads ridge to spread the effect across $X_1$ and $X_2$ rather than letting one coefficient become very large.
+\end{itemize}
+
+\item From the table:
+
+\begin{center}
+\begin{tabular}{cc}
+\hline
+$\lambda$ & Test MSE \\
+\hline
+$0$ (OLS) & $12.5$ \\
+$0.1$ & $10.2$ \\
+$1$ & $8.8$ \\
+$5$ & $9.0$ \\
+$20$ & $11.7$ \\
+\hline
+\end{tabular}
+\end{center}
+
+If we plot test MSE vs.\ $\lambda$, it decreases from $12.5$ at $\lambda = 0$ to $8.8$ at $\lambda = 1$, the lowest value in the table, then increases again to $9.0$ at $\lambda = 5$ and $11.7$ at $\lambda = 20$.
+
+Thus, among the values tried, the best test performance occurs at $\lambda = 1$ (the true minimizer may lie anywhere between $0.1$ and $5$). As $\lambda$ increases from $0$, ridge adds a bit of bias but substantially reduces variance, improving generalization. For very large $\lambda$, the coefficients are shrunk too strongly toward zero, so the added bias outweighs the variance reduction and the test MSE rises again. This illustrates the usual bias--variance tradeoff.
+
+\end{enumerate}
+
+
+\end{document}