This project implements a Multiple Linear Regression model from scratch using the Normal Equation to predict housing prices. It performs statistical analysis by running the model multiple times on random train/test splits to evaluate model stability and performance.
We model the relationship between a dependent variable (the house price) and a set of independent variables (the features) as a linear function.

Instead of using gradient descent, this project serves as an example of finding the optimal parameters analytically with the Normal Equation:

$$ \beta = (X^T X)^{-1} X^T y $$
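As a minimal sketch of the closed-form solution (toy data and the function name `fit_normal_equation` are illustrative, not taken from the project's scripts):

```python
import numpy as np

def fit_normal_equation(X, y):
    """Solve beta = (X^T X)^{-1} X^T y for a design matrix X plus a bias column."""
    Xb = np.column_stack([np.ones(len(X)), X])   # prepend intercept column
    # np.linalg.solve avoids explicitly inverting X^T X
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# toy data: y = 2 + 3*x, recovered exactly
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([2.0, 5.0, 8.0, 11.0])
print(fit_normal_equation(X, y))  # ≈ [2., 3.]
```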
To ensure the results aren't biased by a single random data split, the script `run_multiple_part_2.py`:
- Performs 100 independent runs with random 80/20 train/test splits.
- Calculates the mean and standard deviation for:
  - Root Mean Squared Error (RMSE)
  - Coefficients ($\beta$) for each feature
- Generates visualizations to analyze the distribution of errors and feature importance.
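The repeated-split procedure above can be sketched as follows (a hypothetical NumPy version, not the actual script, which loads the real dataset; synthetic data stands in here):

```python
import numpy as np

rng = np.random.default_rng(0)

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def run_once(X, y, test_frac=0.2):
    """One random 80/20 split: fit via the Normal Equation, return test RMSE."""
    n = len(y)
    idx = rng.permutation(n)
    n_test = int(n * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    Xtr = np.column_stack([np.ones(len(train)), X[train]])
    Xte = np.column_stack([np.ones(len(test)), X[test]])
    beta = np.linalg.solve(Xtr.T @ Xtr, Xtr.T @ y[train])
    return rmse(y[test], Xte @ beta)

# synthetic stand-in data for illustration
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + 10 + rng.normal(scale=0.1, size=200)

errors = np.array([run_once(X, y) for _ in range(100)])
print(f"RMSE: mean={errors.mean():.3f}, std={errors.std():.3f}")
```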
This project uses uv, an extremely fast Python package and project manager.
If you don't have uv installed, you can install it via the official installer script:
Linux / macOS:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Windows:

```powershell
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
```

Navigate to the project directory and sync the environment. uv will automatically read `pyproject.toml`, create a virtual environment, and install all required locked dependencies.

```bash
uv sync
```

To run the analysis and generate the plots, simply execute the main script using `uv run`. This ensures it runs within the correct environment.

```bash
uv run run_multiple_part_2.py
```

This will:
- Load the dataset (`Problem2_Dataset.csv`).
- Run the regression 100 times.
- Print statistical summaries to the console.
- Save the visualizations shown below to the current directory.
This histogram shows the distribution of the Root Mean Squared Error over 100 runs. A narrower distribution indicates a more stable model.
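For reference, the error reported in each run is the standard RMSE:

$$ \mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2 } $$

where $\hat{y}_i$ are the model's predictions on the $m$ test samples.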

This plot displays the average coefficient value for each feature. Error bars represent the standard deviation across runs, showing how much the importance of a feature varies depending on the data split.

The scatter plot below compares the Predicted Prices vs Actual Prices for the best performing run (lowest RMSE). The red dashed line represents the ideal scenario where Predicted = Actual.

The `extra/` directory contains two alternative implementations for solving the linear regression problem: Gradient Descent and Singular Value Decomposition (SVD).
Unlike the analytical Normal Equation, Gradient Descent is an iterative optimization algorithm.
- Normalization: Features are normalized (z-score) so the cost surface is better conditioned, allowing faster convergence.
- Update Rule: Iteratively updates the weights $\beta$ to minimize the Mean Squared Error (MSE):

$$ \beta := \beta - \alpha \frac{1}{m} X^T (X\beta - y) $$

where $\alpha$ is the learning rate and $m$ is the number of training samples.
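The update rule above can be sketched in a few lines of NumPy (a hypothetical illustration with toy data, not the code in `extra/`):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent on MSE; X should already include a bias column."""
    m, n = X.shape
    beta = np.zeros(n)
    for _ in range(n_iters):
        grad = (X.T @ (X @ beta - y)) / m  # gradient of the MSE cost
        beta -= alpha * grad               # beta := beta - alpha * (1/m) X^T (X beta - y)
    return beta

# toy data: y = 2 + 3*x (features already on a comparable scale)
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 2 + 3 * x
print(gradient_descent(X, y))  # approaches [2., 3.]
```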
Visualizations:
This method solves for $\beta$ using the Moore-Penrose pseudoinverse obtained from the Singular Value Decomposition $X = U \Sigma V^T$:

$$ \beta = X^{+} y = V \Sigma^{+} U^T y $$
- Stability: This is numerically more stable than the Normal Equation, especially when matrices are singular or near-singular (multicollinearity).
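A minimal sketch of the SVD approach using NumPy's `pinv` (which computes $V \Sigma^{+} U^T$ internally; the toy data and function name are illustrative):

```python
import numpy as np

def fit_svd(X, y):
    """Least-squares fit via the SVD pseudoinverse; stable even when X^T X is singular."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return np.linalg.pinv(Xb) @ y  # minimum-norm least-squares solution

# the duplicated column makes X^T X singular, so the Normal Equation would fail,
# but the pseudoinverse still returns a valid (minimum-norm) solution
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([2.0, 5.0, 8.0, 11.0])
print(fit_svd(X, y))  # weight 3 is split evenly across the duplicate columns
```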
Visualizations:
| Feature | Normal Equation | Gradient Descent | SVD (Pseudoinverse) |
|---|---|---|---|
| Approach | Analytical (Exact Solution) | Iterative (Approximation) | Analytical (Exact Solution) |
| Computational Cost | $O(n^3)$ to invert $X^T X$ | $O(k \cdot m \cdot n)$ for $k$ iterations | $O(m \cdot n^2)$ for the SVD |
| Scalability | Good for small/medium datasets. Slow for large feature counts. | Excellent. Scales well with large datasets (used in Deep Learning). | Good for small/medium datasets. |
| Stability | Can be unstable if $X^T X$ is singular or ill-conditioned. | Stable if the learning rate is chosen correctly. | Most stable. Handles singular matrices gracefully. |
| Feature Scaling | Not required. | Critical. Requires normalization. | Not required. |



