126 changes: 126 additions & 0 deletions ridge_regress_practice_problem
@@ -0,0 +1,126 @@
\documentclass{article}
\usepackage{graphicx} % Required for inserting images

\title{Extra Unit 6 Practice Problems}
% \author{katherine.a.marine.28 }

\usepackage[a4paper, margin=1in]{geometry} % uniform margins
\begin{document}

\maketitle

% \section{Introduction}
\subsubsection{Exercise 1: Ridge with Correlated Predictors}

A sports scientist is modeling a player's injury risk score $Y$ using three predictors measured before each match:
\begin{itemize}
\item $X_1$: \textit{Cumulative minutes played in last 4 weeks}
\item $X_2$: \textit{Number of matches started in last 4 weeks}
\item $X_3$: \textit{Average high-speed distance per match (km)}
\end{itemize}
\vspace{10pt}
The predictors $X_1$ and $X_2$ are highly correlated: players who play more minutes usually also start more matches. The researcher fits three models using the same training data:
\begin{enumerate}
\item Ordinary least squares (OLS),
\item Ridge regression with $\lambda = 0.1$,
\item Ridge regression with $\lambda = 5$.
\end{enumerate}
\vspace{10pt}
All models include an intercept. The estimated coefficients (excluding the intercept) are:
\vspace{6pt}
\begin{center}
\begin{tabular}{lccc}
\hline
Model & $\hat{\beta}_1$ (for $X_1$) & $\hat{\beta}_2$ (for $X_2$) & $\hat{\beta}_3$ (for $X_3$) \\
\hline
OLS & $4.5$ & $-0.3$ & $0.05$ \\
Ridge ($\lambda = 0.1$) & $3.0$ & $1.2$ & $0.08$ \\
Ridge ($\lambda = 5$) & $1.4$ & $1.3$ & $0.02$ \\
\hline
\end{tabular}
\end{center}
\vspace{10pt}
\noindent On an independent test set, the mean squared error (MSE) for ridge regression at different values of $\lambda$ is:

\begin{center}
\begin{tabular}{cc}
\hline
$\lambda$ & Test MSE \\
\hline
$0$ (OLS) & $12.5$ \\
$0.1$ & $10.2$ \\
$1$ & $8.8$ \\
$5$ & $9.0$ \\
$20$ & $11.7$ \\
\hline
\end{tabular}
\end{center}
\vspace{15pt}
\noindent\underline{Questions}:
\begin{enumerate}
\item Based on the coefficient table, how do the ridge estimates differ from the OLS estimates? Comment specifically on the sizes and signs of $\hat{\beta}_1$ and $\hat{\beta}_2$.
\vspace{10pt}
\item Explain why multicollinearity between $X_1$ and $X_2$ can cause the OLS pattern (large $\hat{\beta}_1 = 4.5$ and slightly negative $\hat{\beta}_2 = -0.3$), even if both variables are in reality positively related to injury risk.
\vspace{10pt}
\item Ridge regression minimizes
\[
\sum_{j=1}^N (Y_j - \hat{y}(X_j,D))^2 \;+\; \lambda \sum_{k=1}^3 \beta_k^2.
\]
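Here $Y_j$ and $\hat{y}(X_j,D)$ are the observed and fitted values for the $j$-th training observation, and the intercept is not penalized. Explain intuitively how the penalty term leads to: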
\begin{itemize}
\item smaller coefficient magnitudes, and
\item a more ``balanced'' sharing of the effect between $X_1$ and $X_2$.
\end{itemize}
\vspace{10pt}
\item Using the MSE table, sketch (by hand) the test MSE as a function of $\lambda$. For which value of $\lambda$ does the model perform best on the test set? Briefly interpret this behavior in terms of the bias--variance tradeoff.

\end{enumerate}

\vspace{15pt}
\noindent\underline{Solution}:
\begin{enumerate}

\item From the coefficient table:
\begin{itemize}
\item OLS: $\hat{\beta}_1 = 4.5$ (large positive), $\hat{\beta}_2 = -0.3$ (slightly negative), $\hat{\beta}_3 = 0.05$ (small positive).
\item Ridge with $\lambda = 0.1$: $\hat{\beta}_1 = 3.0$, $\hat{\beta}_2 = 1.2$, $\hat{\beta}_3 = 0.08$.
\item Ridge with $\lambda = 5$: $\hat{\beta}_1 = 1.4$, $\hat{\beta}_2 = 1.3$, $\hat{\beta}_3 = 0.02$.
\end{itemize}
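As $\lambda$ increases, the coefficient vector as a whole shrinks toward $0$, driven mainly by $\hat{\beta}_1$, which drops from $4.5$ to $3.0$ to $1.4$. Individual coefficients do not all shrink, however: $\hat{\beta}_2$ changes sign and grows from $-0.3$ to $1.2$ and $1.3$, and $\hat{\beta}_3$ rises slightly at $\lambda = 0.1$ before falling at $\lambda = 5$. The ridge estimates also become more ``balanced'' for $X_1$ and $X_2$: for $\lambda = 5$, $\hat{\beta}_1$ and $\hat{\beta}_2$ are very similar (1.4 and 1.3), in contrast to the OLS fit where most of the effect is placed on $X_1$ and $\hat{\beta}_2$ is slightly negative.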
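
\vspace{6pt}
As a quick check on the overall shrinkage, the penalized quantity $\sum_{k=1}^3 \hat{\beta}_k^2$ can be computed directly from the coefficient table (results rounded to two decimals):
\begin{center}
\begin{tabular}{lc}
\hline
Model & $\hat{\beta}_1^2 + \hat{\beta}_2^2 + \hat{\beta}_3^2$ \\
\hline
OLS & $20.25 + 0.09 + 0.0025 \approx 20.34$ \\
Ridge ($\lambda = 0.1$) & $9.00 + 1.44 + 0.0064 \approx 10.45$ \\
Ridge ($\lambda = 5$) & $1.96 + 1.69 + 0.0004 \approx 3.65$ \\
\hline
\end{tabular}
\end{center}
The squared length of the coefficient vector roughly halves at $\lambda = 0.1$ and falls to less than a fifth of its OLS value at $\lambda = 5$.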

\item When $X_1$ and $X_2$ are highly correlated, the matrix $X^TX$ (formed from the design matrix $X$) is nearly singular, so the OLS estimator is very unstable: small changes in the data can lead to large changes in the individual coefficients. The data pin down the combined effect of $X_1$ and $X_2$ well, but not how that effect is split between them. OLS can therefore still produce reasonable predictions for $Y$ while using a very large positive coefficient on one predictor and a small or even negative coefficient on the other. Thus we can see a pattern like $\hat{\beta}_1 = 4.5$ and $\hat{\beta}_2 = -0.3$ even if, in reality, both $X_1$ and $X_2$ are positively related to injury risk.
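
\vspace{6pt}
To make the instability concrete, here is a small idealized calculation. It assumes a model with only the two correlated predictors, each centered and scaled to unit variance, with sample correlation $\rho$ and noise variance $\sigma^2$. These symbols, and the value $\rho = 0.95$ used below, are illustrative assumptions and are not given in the exercise. Under these assumptions,
\[
X^TX = N \left( \begin{array}{cc} 1 & \rho \\ \rho & 1 \end{array} \right),
\qquad
\mathrm{Cov}(\hat{\beta}^{OLS}) = \sigma^2 (X^TX)^{-1}
= \frac{\sigma^2}{N(1-\rho^2)} \left( \begin{array}{cc} 1 & -\rho \\ -\rho & 1 \end{array} \right).
\]
For $\rho = 0.95$ the factor $1/(1-\rho^2) = 1/0.0975 \approx 10.3$, so each coefficient's variance is roughly ten times larger than it would be with uncorrelated predictors. The negative off-diagonal entry means the two estimation errors tend to have opposite signs: an overestimate of $\beta_1$ tends to come with an underestimate of $\beta_2$, which is consistent with a pattern like $4.5$ and $-0.3$.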

\item Ridge minimizes
\[
\sum_{j=1}^N (Y_j - \hat{y}(X_j,D))^2 + \lambda \sum_{k=1}^3 \beta_k^2.
\]
The penalty term $\lambda \sum_k \beta_k^2$ discourages large coefficients: any increase in $|\beta_k|$ increases the penalty, so it is only worthwhile if it lowers the residual sum of squares by more than the penalty it adds. As a result, ridge solutions have:
\begin{itemize}
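\item \emph{Smaller coefficient magnitudes}: the coefficient vector as a whole is shrunk toward $0$ relative to OLS, in the sense that $\sum_k \hat{\beta}_k^2$ is smaller, even though an individual coefficient such as $\hat{\beta}_2$ can grow.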
\item \emph{More balanced sharing between $X_1$ and $X_2$}: for highly correlated predictors, it is ``cheaper'' (in terms of the penalty) to assign moderate coefficients to both variables than to let one become very large and the other negative. This leads to ridge spreading the effect across $X_1$ and $X_2$ rather than letting one coefficient explode.
\end{itemize}
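
\vspace{6pt}
The ``cheaper'' argument can be illustrated with a deliberately idealized case. Suppose $X_1$ and $X_2$ were measured on the same scale and perfectly correlated, so that the fitted values depend only on the sum $c = \beta_1 + \beta_2$. This is an assumption made for illustration only; in the exercise the two predictors are on different scales and are highly, not perfectly, correlated. Every split with the same $c$ then gives the same residual sum of squares, and the penalty alone decides between them:
\[
\beta_1^2 + \beta_2^2 = \frac{c^2}{2} + \frac{(\beta_1 - \beta_2)^2}{2},
\]
which is smallest when $\beta_1 = \beta_2 = c/2$. Taking $c = 4.5 + (-0.3) = 4.2$ from the OLS row purely as an example, the unbalanced split costs $4.5^2 + (-0.3)^2 = 20.34$, while the equal split costs $2 \times 2.1^2 = 8.82$. For any $\lambda > 0$ the balanced split therefore has the lower penalized loss.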

\item From the table:

\begin{center}
\begin{tabular}{cc}
\hline
$\lambda$ & Test MSE \\
\hline
$0$ (OLS) & $12.5$ \\
$0.1$ & $10.2$ \\
$1$ & $8.8$ \\
$5$ & $9.0$ \\
$20$ & $11.7$ \\
\hline
\end{tabular}
\end{center}

If we plot test MSE vs.\ $\lambda$, it decreases from $\lambda = 0$ to $\lambda = 1$, reaches a minimum at $\lambda = 1$ (MSE $= 8.8$), then increases again for larger $\lambda$.

Thus, among the five values tried, the best test performance occurs at $\lambda = 1$. For small to moderate $\lambda$, ridge adds a bit of bias but substantially reduces variance, improving generalization. For very large $\lambda$, the coefficients are shrunk too strongly toward zero, so the added bias outweighs the remaining variance reduction and the test MSE rises again. This illustrates the usual bias--variance tradeoff.
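
\vspace{6pt}
In symbols, the tradeoff can be summarized by the standard decomposition of the expected squared prediction error at a new point. Here $\hat{y}_\lambda$ denotes the ridge prediction at penalty $\lambda$ and $\sigma^2$ the irreducible noise variance; both symbols are introduced here for illustration and are not part of the exercise:
\[
\mathrm{E}\left[ (Y - \hat{y}_\lambda)^2 \right]
= \mathrm{Bias}(\hat{y}_\lambda)^2 + \mathrm{Var}(\hat{y}_\lambda) + \sigma^2 .
\]
As $\lambda$ grows, the variance term typically falls while the squared bias typically rises, which produces the U-shaped curve. From the MSE table, moving from OLS to $\lambda = 1$ lowers the test MSE by $(12.5 - 8.8)/12.5 \approx 0.30$, that is, by about $30\%$. The value at $\lambda = 5$ is only $0.2$ higher than at $\lambda = 1$ (about $2\%$), so the curve is fairly flat near its minimum.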

\end{enumerate}


\end{document}