🚀 Sparse Factor Model Estimation: factorlasso

The factorlasso package implements sign-constrained LASSO, prior-centred regularisation, and Hierarchical Clustering Group LASSO (HCGL) for sparse multi-output factor model estimation, with integrated factor covariance assembly.

The Problem

In many applications, such as portfolio construction, genomics, and macro-econometrics, you need to estimate a factor model of the form

$$Y_t = \alpha + \beta X_t + \varepsilon_t$$

where $Y_t \in \mathbb{R}^{N}$ are response variables (asset returns, gene expressions), $X_t \in \mathbb{R}^{M}$ are factors, $\beta \in \mathbb{R}^{N \times M}$ are sparse factor loadings, $\alpha \in \mathbb{R}^{N}$ is the intercept, and $\varepsilon_t \in \mathbb{R}^{N}$ is the residual.

In practice, you face five challenges that standard LASSO packages do not handle:

1. Domain knowledge constrains coefficient signs: equity assets should have non-negative equity beta, and government bonds should not load on commodity factors. Standard LASSO ignores this.
2. You have prior estimates and want to shrink toward them, not toward zero: the penalty should be $\|\beta - \beta_0\|_1$, not $\|\beta\|_1$.
3. Variables have different history lengths: some assets start trading later than others. Dropping rows with any NaN discards valid data for all other variables.
4. You need a consistent covariance matrix: the covariance $\Sigma_y = \beta \Sigma_x \beta^\top + D$ must use the same $\beta$ from estimation, not a separate estimate.
5. Data is non-stationary: recent observations should carry more weight (EWMA weighting).

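factorlasso addresses all five: sign constraints, prior-centred shrinkage, NaN-aware estimation, and EWMA weighting are handled inside a single fit() call, and the covariance matrix is then assembled from the fitted $\beta$. The implementation follows scikit-learn conventions (fit / predict / score / coef_ / intercept_).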
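
The estimation requirements can be collected into a single optimisation problem. What follows is a sketch under assumptions, not the package's exact objective: the EWMA weights $w_t$, the validity mask $m_{t,n} \in \{0, 1\}$ (zero where $Y_{t,n}$ is missing), and the index sets $S^{+}$, $S^{-}$, $S^{0}$ of sign-constrained entries are notation introduced here for illustration, and the normalisation and intercept handling in the implementation may differ.

$$\min_{\alpha,\,\beta}\ \sum_{t=1}^{T} w_t \sum_{n=1}^{N} m_{t,n}\,\bigl(Y_{t,n} - \alpha_n - \beta_{n\cdot} X_t\bigr)^2 + \lambda\,\|\beta - \beta_0\|_1$$

$$\text{subject to}\quad \beta_{nm} \ge 0 \ \ \forall (n,m) \in S^{+}, \qquad \beta_{nm} \le 0 \ \ \forall (n,m) \in S^{-}, \qquad \beta_{nm} = 0 \ \ \forall (n,m) \in S^{0}.$$

Here $\beta_{n\cdot}$ denotes row $n$ of $\beta$. Setting $\beta_0 = 0$, $w_t = 1$, $m_{t,n} = 1$ and leaving all three sets empty recovers an ordinary multi-output LASSO.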

The methodology is based on the Hierarchical Clustering Group LASSO (HCGL) framework introduced in:

Sepp A., Ossa I., Kastenholz M. (2026), "Robust Optimization of Strategic and Tactical Asset Allocation for Multi-Asset Portfolios", The Journal of Portfolio Management, 52(4), 86–120.

and the Capital Market Assumptions framework in the companion paper:

Sepp A., Hansen E., Kastenholz M. (2026), "Capital Market Assumptions and Strategic Asset Allocation Using Multi-Asset Tradable Factors", Under revision at the Journal of Portfolio Management.

Installation

Install using

pip install factorlasso

Upgrade using

pip install --upgrade factorlasso

Clone using

git clone https://github.com/ArturSepp/factorlasso.git

Core dependencies: numpy, pandas, scipy, cvxpy, openpyxl

Quick Start

import numpy as np
import pandas as pd
from factorlasso import LassoModel, LassoModelType

# Simulate Y_t = β X_t + noise  (code uses row-major: Y = X @ β' + noise)
np.random.seed(42)
T, M, N = 200, 3, 5
X = pd.DataFrame(np.random.randn(T, M), columns=['f0', 'f1', 'f2'])
beta_true = np.array([[1, 0, .5], [0, 1, 0], [.3, 0, 0], [0, .8, .2], [1, .5, 0]])
Y = pd.DataFrame(X.values @ beta_true.T + .1*np.random.randn(T, N),
                  columns=[f'y{i}' for i in range(N)])

# Fit sparse factor model
model = LassoModel(model_type=LassoModelType.LASSO, reg_lambda=1e-4)
model.fit(x=X, y=Y)
print(model.coef_.round(2))       # β (N × M)
print(model.intercept_.round(4))  # α (N,)

# Predict and score (scikit-learn compatible)
y_hat = model.predict(X)  # Ŷ_t = α + β X_t  (code: X @ β' + α)
r2 = model.score(X, Y)    # mean R² across response variables
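
To make "mean R² across response variables" concrete, here is a minimal NumPy sketch of that quantity. The helper name `mean_r2` is hypothetical and is not part of factorlasso; how the package treats NaNs or observation weights inside `score` is not shown here and may differ.

```python
import numpy as np

def mean_r2(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Average the per-column R² of a (T, N) prediction."""
    ss_res = ((y_true - y_pred) ** 2).sum(axis=0)                  # residual sum of squares per column
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)     # total sum of squares per column
    return float(np.mean(1.0 - ss_res / ss_tot))

rng = np.random.default_rng(0)
y_true = rng.standard_normal((100, 3))

# A perfect prediction scores 1; predicting each column's mean scores 0.
perfect = mean_r2(y_true, y_true)
baseline = mean_r2(y_true, np.tile(y_true.mean(axis=0), (100, 1)))
```

Averaging per-column R² gives every response variable equal weight regardless of its variance.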

Convention: Paper vs Code

The factor model in the paper uses column vectors:

$$Y_t = \alpha + \beta X_t + \varepsilon_t, \qquad \beta \in \mathbb{R}^{N \times M}$$

where $Y_t \in \mathbb{R}^{N \times 1}$ and $X_t \in \mathbb{R}^{M \times 1}$.

In Python, pandas DataFrames store observations as rows. The code works with the row-major equivalent:

| Symbol | Paper (column-vector) | Code (row-major, pandas) |
|---|---|---|
| $Y$ | $(N \times T)$ | `y`: DataFrame $(T \times N)$ |
| $X$ | $(M \times T)$ | `x`: DataFrame $(T \times M)$ |
| $\beta$ | $(N \times M)$ | `coef_`: DataFrame $(N \times M)$, same as paper |
| $\alpha$ | $(N \times 1)$ | `intercept_`: Series $(N,)$ |

The coefficient matrix coef_ is stored in the paper convention $(N \times M)$. The prediction Y = X @ β' + α in code is the row-major form of the paper's Y_t = α + β X_t.
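
The equivalence of the two conventions can be checked numerically. This is a plain NumPy sketch with made-up shapes and random values; it does not call factorlasso.

```python
import numpy as np

rng = np.random.default_rng(1)
T, M, N = 6, 3, 4
X = rng.standard_normal((T, M))       # code layout: one observation per row
beta = rng.standard_normal((N, M))    # paper layout: (N, M)
alpha = rng.standard_normal(N)

# Row-major form used in code: all T predictions at once.
Y_rows = X @ beta.T + alpha           # shape (T, N)

# Paper form: one column vector per time step, Y_t = alpha + beta X_t.
Y_cols = np.column_stack([alpha + beta @ X[t] for t in range(T)])  # shape (N, T)

# The two layouts are transposes of each other.
same = np.allclose(Y_rows, Y_cols.T)
```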

Sign Constraints

Enforce domain knowledge on coefficient signs using a constraint matrix where 1 = non-negative, -1 = non-positive, 0 = constrained to zero, NaN = free:

signs = pd.DataFrame([[1, np.nan, 1], [np.nan, 1, 0], [1, 0, np.nan],
                       [np.nan, 1, 1], [1, 1, np.nan]],
                      index=Y.columns, columns=X.columns)

model = LassoModel(reg_lambda=1e-4, factors_beta_loading_signs=signs)
model.fit(x=X, y=Y)
# All constrained coefficients satisfy their sign requirements by construction
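
The sign code can also be checked after the fact against any coefficient matrix. The helper below is a hypothetical NumPy sketch (the name `violates_signs` and the tolerance are assumptions, not part of factorlasso); it flags entries that break the 1 / -1 / 0 / NaN convention described above.

```python
import numpy as np

def violates_signs(coef: np.ndarray, signs: np.ndarray, tol: float = 1e-10) -> np.ndarray:
    """Boolean mask of entries that break the 1 / -1 / 0 / NaN sign code."""
    bad_pos = (signs == 1) & (coef < -tol)            # coded 1: must be >= 0
    bad_neg = (signs == -1) & (coef > tol)            # coded -1: must be <= 0
    bad_zero = (signs == 0) & (np.abs(coef) > tol)    # coded 0: must be 0
    return bad_pos | bad_neg | bad_zero               # NaN entries are free, never flagged

signs = np.array([[1.0, np.nan, 0.0],
                  [-1.0, 1.0, np.nan]])
ok_coef = np.array([[0.7, -2.0, 0.0],
                    [-0.1, 0.0, 3.0]])
bad_coef = np.array([[-0.2, 1.0, 0.5],
                     [0.3, 0.4, -1.0]])

ok_mask = violates_signs(ok_coef, signs)     # no violations
bad_mask = violates_signs(bad_coef, signs)   # three violations
```

With a real fit, passing `model.coef_.to_numpy()` and `signs.to_numpy()` to such a check should return an all-False mask if the constraints hold.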

Prior-Centred Regularisation

Shrink toward a non-zero prior instead of zero. When you have prior estimates $\beta_0$ (e.g., from a previous estimation period or a theoretical model), the penalty becomes $\|\beta - \beta_0\|_1$ instead of $\|\beta\|_1$:

beta_prior = pd.DataFrame(beta_true, index=Y.columns, columns=X.columns)
model = LassoModel(reg_lambda=1e-2, factors_beta_prior=beta_prior)
model.fit(x=X, y=Y)  # shrinks toward beta_prior instead of zero
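
The effect of centring the penalty is easiest to see in the scalar case, where the minimiser of $\tfrac12 (b - z)^2 + \lambda |b - b_0|$ is $b_0$ plus the soft-thresholded gap $z - b_0$. The sketch below illustrates that closed form with made-up numbers; it is not how factorlasso solves the full problem (the package lists cvxpy as a dependency), and the function names are hypothetical.

```python
import numpy as np

def soft_threshold(v, lam):
    """Elementwise soft-thresholding: shrink v toward zero by lam."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prior_centred_shrink(z, prior, lam):
    """Elementwise minimiser of 0.5*(b - z)**2 + lam*|b - prior|."""
    return prior + soft_threshold(z - prior, lam)

z = np.array([1.00, 0.45, -0.30])      # unpenalised estimates
prior = np.array([0.80, 0.50, 0.00])   # prior loadings beta_0
lam = 0.1

toward_zero = soft_threshold(z, lam)                  # approx. [0.9, 0.35, -0.2]
toward_prior = prior_centred_shrink(z, prior, lam)    # approx. [0.9, 0.5, -0.2]
```

The second coefficient is within `lam` of its prior, so it snaps to the prior value 0.5 rather than being pulled toward zero.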

Hierarchical Clustering Group LASSO (HCGL)

Automatically discover group structure among response variables via hierarchical clustering on their correlation matrix (Ward's method), then apply Group LASSO with group-adaptive penalties:

model = LassoModel(
    model_type=LassoModelType.GROUP_LASSO_CLUSTERS,
    reg_lambda=1e-5, span=52,
)
model.fit(x=X, y=Y)
print(model.clusters_)  # auto-discovered groups
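
The clustering step itself can be sketched with SciPy. The distance transform $\sqrt{(1 - \rho)/2}$, the fixed number of clusters, and the simulated data are assumptions made for illustration; how factorlasso converts correlations to distances and chooses the number of groups is not specified here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
T = 500
f1, f2 = rng.standard_normal((2, T))
# Six response series: the first three follow f1, the last three follow f2.
Y = np.column_stack([f1 + 0.1 * rng.standard_normal(T) for _ in range(3)]
                    + [f2 + 0.1 * rng.standard_normal(T) for _ in range(3)])

corr = np.corrcoef(Y, rowvar=False)
dist = np.sqrt(0.5 * np.clip(1.0 - corr, 0.0, None))   # correlation distance in [0, 1]
np.fill_diagonal(dist, 0.0)

# Ward linkage on the condensed distance matrix, then cut into two clusters.
Z = linkage(squareform(dist, checks=False), method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
```

The resulting `labels` array plays the role of the group labels that a Group LASSO penalty would then use.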

NaN-Aware Estimation

Variables with different history lengths are handled naturally. Instead of dropping any row containing a NaN (which discards valid observations for all other variables), factorlasso applies a binary validity mask that zeros out the contribution of missing observations per variable while preserving all available data:

Y_with_gaps = Y.copy()
Y_with_gaps.iloc[:50, 3] = np.nan   # variable y3 starts 50 periods later
Y_with_gaps.iloc[:100, 4] = np.nan  # variable y4 starts 100 periods later

model = LassoModel(reg_lambda=1e-4)
model.fit(x=X, y=Y_with_gaps)
# All 5 variables estimated using their full available history
# No data discarded for y0, y1, y2 despite gaps in y3, y4
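
The difference between dropping rows and masking per variable can be seen in a small NumPy sketch. It uses separate unpenalised least-squares fits per variable purely for illustration; factorlasso applies its mask inside a joint penalised objective, so this is not the package's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
T, M, N = 200, 2, 3
X = rng.standard_normal((T, M))
beta_true = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]])
Y = X @ beta_true.T + 0.05 * rng.standard_normal((T, N))
Y[:120, 2] = np.nan                              # third variable starts 120 periods late

mask = ~np.isnan(Y)                              # (T, N) validity mask
rows_if_dropped = int(mask.all(axis=1).sum())    # rows surviving a global dropna
rows_per_variable = mask.sum(axis=0)             # rows available with per-variable masking

# Per-variable least squares using only that variable's valid rows.
beta_hat = np.vstack([
    np.linalg.lstsq(X[mask[:, n]], Y[mask[:, n], n], rcond=None)[0]
    for n in range(N)
])
```

A global row drop would leave 80 observations for every variable, whereas masking keeps all 200 for the two complete series.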

Factor Covariance Assembly

After estimation, assemble the model-implied covariance $\Sigma_y = \beta \Sigma_x \beta^\top + D$, where $\Sigma_x$ is the factor covariance, $D$ holds the residual variances, and $\beta$ is the same matrix produced by the LASSO estimation, so the covariance is consistent with the fitted loadings:

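from factorlasso import CurrentFactorCovarData, VarianceColumns

# Placeholders to define before running this snippet:
#   factor_covariance: Σ_x, an (M × M) factor covariance matrix
#   diagnostics_df: residual variances of the N response variables
sigma_y = CurrentFactorCovarData(
    x_covar=factor_covariance,   # Σ_x (M × M)
    y_betas=model.coef_,          # β (N × M) from estimation
    y_variances=diagnostics_df,   # residual variances D
).get_y_covar()
# sigma_y is (N × N) and positive semi-definite whenever Σ_x is PSD and D is non-negative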
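
The decomposition is a few lines of linear algebra. The NumPy sketch below uses random inputs and does not call `CurrentFactorCovarData`; it only illustrates the formula $\Sigma_y = \beta \Sigma_x \beta^\top + D$ and why the result is positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 5, 3
beta = rng.standard_normal((N, M))              # loadings (N, M)
A = rng.standard_normal((M, M))
sigma_x = A @ A.T                               # a valid (PSD) factor covariance (M, M)
resid_var = rng.uniform(0.01, 0.05, size=N)     # residual variances, the diagonal of D

# Systematic part plus diagonal idiosyncratic part.
sigma_y = beta @ sigma_x @ beta.T + np.diag(resid_var)   # (N, N)

# beta @ sigma_x @ beta.T is PSD and D is positive, so all eigenvalues are positive.
min_eig = np.linalg.eigvalsh(sigma_y).min()
```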

API Summary

The API follows scikit-learn conventions: fit / predict / score.

| Method | Description |
|---|---|
| `model.fit(x, y)` | Estimate α and β; returns `self` |
| `model.predict(x)` | Return Ŷ_t = α + β X_t (row-major: `X @ β' + α`) |
| `model.score(x, y)` | Return mean R² |

| Fitted attribute | Shape | Description |
|---|---|---|
| `coef_` | (N, M) | Factor loadings β |
| `intercept_` | (N,) | Intercept α |
| `estimated_betas` | (N, M) | Alias for `coef_` (backward compatibility) |
| `clusters_` | (N,) | HCGL cluster labels |
| `estimation_result_` | n/a | Full diagnostics (r2, ss_res, ss_total) |

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_type` | `LassoModelType` | `LASSO` | Estimation method |
| `reg_lambda` | `float` | `1e-5` | Regularisation strength |
| `span` | `int` | `None` | EWMA span for observation weighting |
| `factors_beta_loading_signs` | `DataFrame` | `None` | Sign constraint matrix (N × M) |
| `factors_beta_prior` | `DataFrame` | `None` | Prior β₀ matrix (N × M) |
| `group_data` | `Series` | `None` | Group labels (required for `GROUP_LASSO`) |
| `demean` | `bool` | `True` | Subtract (rolling) mean before estimation |
| `solver` | `str` | `'CLARABEL'` | CVXPY solver name |
| `warmup_period` | `int` | `12` | Minimum observations before including a variable |
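
To give a feel for what an EWMA `span` implies, the sketch below builds observation weights under the pandas convention $\alpha = 2 / (\text{span} + 1)$. Whether factorlasso uses exactly this convention is an assumption, and the helper names are hypothetical.

```python
import numpy as np

def ewma_weights(n_obs: int, span: float) -> np.ndarray:
    """Normalised EWMA weights, oldest observation first, with alpha = 2 / (span + 1)."""
    alpha = 2.0 / (span + 1.0)
    ages = np.arange(n_obs - 1, -1, -1)     # the oldest observation has the largest age
    w = (1.0 - alpha) ** ages
    return w / w.sum()

def half_life(span: float) -> float:
    """Number of periods after which an observation's weight has halved."""
    alpha = 2.0 / (span + 1.0)
    return float(np.log(0.5) / np.log(1.0 - alpha))

w = ewma_weights(n_obs=200, span=52)
hl = half_life(52)
```

Under this convention a span of 52 corresponds to a half-life of about 18 periods, so span and half-life should not be read as the same quantity.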

Estimation Methods

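| Method | `LassoModelType` | Penalty |
|---|---|---|
| LASSO | `LASSO` | $\lambda \lVert \beta - \beta_0 \rVert_1$ |
| Group LASSO | `GROUP_LASSO` | $\sum_g \lambda_g \lVert \beta_g - \beta_{0,g} \rVert_2$, with group-adaptive $\lambda_g$ scaled from $\lambda$ |
| HCGL | `GROUP_LASSO_CLUSTERS` | Same as Group LASSO with auto-clustering |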

All methods support sign constraints, prior-centred shrinkage, EWMA weighting, and NaN-aware estimation.
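
The two penalty families differ in how they aggregate deviations from the prior. The NumPy sketch below evaluates both on a toy matrix. It assumes that groups are sets of response variables (rows of β) and that each group's block of rows is penalised as a whole; the package may instead apply the norm per factor column within a group, and its group weights are not reproduced here. All function names are hypothetical.

```python
import numpy as np

def l1_penalty(beta, prior, lam):
    """LASSO penalty: lam times the sum of absolute deviations from the prior."""
    return lam * np.abs(beta - prior).sum()

def group_l2_penalty(beta, prior, groups, lam_by_group):
    """Group penalty: sum over groups of lam_g times the L2 norm of that group's rows."""
    diff = beta - prior
    return sum(lam_by_group[g] * np.linalg.norm(diff[groups == g])
               for g in np.unique(groups))

beta = np.array([[3.0, 4.0],
                 [0.0, 0.0],
                 [1.0, 0.0]])
prior = np.zeros_like(beta)
groups = np.array(["a", "a", "b"])       # row n belongs to groups[n]

p1 = l1_penalty(beta, prior, lam=0.1)                               # 0.1 * (3 + 4 + 1)
p2 = group_l2_penalty(beta, prior, groups, {"a": 0.1, "b": 0.1})    # 0.1 * 5 + 0.1 * 1
```

Because the L2 norm is not squared, the group penalty tends to zero out whole groups at once, while the L1 penalty acts entry by entry.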

Applications

The methodology is domain-agnostic. Examples are provided for:

examples/finance_factor_model.py — Multi-asset factor models with sign-constrained betas and consistent covariance estimation
examples/genomics_factor_model.py — Gene expression driven by pathway activity factors with biological sign priors

The same estimation problem (sparse factor loadings with sign priors and consistent covariance) appears in macro-econometrics, signal processing, and multi-task learning.

Illustration: multi-asset factor model with HCGL

from factorlasso import LassoModel, LassoModelType

model = LassoModel(
    model_type=LassoModelType.GROUP_LASSO_CLUSTERS,
    reg_lambda=1e-5,
    span=52,                                 # EWMA span of 52 weeks (about 1 year of weekly data)
    factors_beta_loading_signs=sign_matrix,   # domain-knowledge constraints
    factors_beta_prior=prior_betas,           # shrink toward prior, not zero
)
model.fit(x=factor_returns, y=asset_returns)

# Inspect results
print(model.coef_)           # sparse factor loadings (N × M)
print(model.intercept_)      # intercept α (N,)
print(model.clusters_)       # auto-discovered asset groups
print(model.score(factor_returns, asset_returns))  # mean R²

Related Packages

| Package | Key Difference |
|---|---|
| scikit-learn `Lasso` | No sign constraints, no multi-output Group LASSO |
| skglm | No sign constraints, no prior-centred shrinkage |
| abess | Best-subset selection (L0), not L1/Group L2 |
| group-lasso | No sign constraints, no EWMA, no prior-centred shrinkage |

Of the packages compared here, factorlasso is the only one that combines sign-constrained penalised regression, prior-centred shrinkage, HCGL clustering, NaN-aware estimation, and integrated factor covariance assembly.

References

Sepp A., Ossa I., Kastenholz M. (2026), "Robust Optimization of Strategic and Tactical Asset Allocation for Multi-Asset Portfolios", The Journal of Portfolio Management, 52(4), 86–120.
Sepp A., Hansen E., Kastenholz M. (2026), "Capital Market Assumptions and Strategic Asset Allocation Using Multi-Asset Tradable Factors", Under revision at the Journal of Portfolio Management.

Citation

If you use factorlasso in your research, please cite the software and the underlying papers:

@software{sepp2026factorlasso,
  author = {Sepp, Artur},
  title = {factorlasso: Sparse Factor Model Estimation with Constrained LASSO in Python},
  year = {2026},
  url = {https://github.com/ArturSepp/factorlasso}
}

@article{seppossa2026,
  author = {Sepp, Artur and Ossa, Ivan and Kastenholz, Mika},
  title = {Robust Optimization of Strategic and Tactical Asset Allocation for Multi-Asset Portfolios},
  journal = {The Journal of Portfolio Management},
  volume = {52},
  number = {4},
  pages = {86--120},
  year = {2026}
}

@article{sepphansen2026,
  author = {Sepp, Artur and Hansen, Emilie and Kastenholz, Mika},
  title = {Capital Market Assumptions and Strategic Asset Allocation Using Multi-Asset Tradable Factors},
  journal = {Under revision at the Journal of Portfolio Management},
  year = {2026}
}

Disclaimer

The factorlasso package is distributed free of charge and without any warranty under the MIT License.

See LICENSE for details.

Please report any bugs or suggestions by opening an issue.
