Log space outliers#291
Conversation
TabPFNUnsupervisedModel.outliers() previously combined per-feature densities via exp(sum(log(p))) using a 1/pdf trick to keep intermediates small. On wider datasets this still overflows when densities are >> 1 (flagged with an existing TODO in the source).
- outliers_single_permutation_() now returns log p directly via -criterion.forward(...), dropping the 1/pdf hack and the explicit log(pred) / clamp steps.
- outliers() combines permutations via the log-sum-exp identity: log(mean(p_k)) = logsumexp(log p_k) - log(K).
- Score semantics: lower scores indicate more likely outliers; callable signatures unchanged.
- experiments.py demo adapted to the new score semantics: column p -> scores, percentile comparisons flipped (< instead of >), and quantile computation handles -inf rows.
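The log-sum-exp identity from the commit message can be sketched in plain NumPy (the actual implementation operates on torch tensors inside `outliers()`; this standalone version is illustrative):

```python
import numpy as np

def log_mean_density(log_p: np.ndarray) -> np.ndarray:
    """log(mean_k p_k) = logsumexp(log p_k) - log K, computed without
    ever exponentiating the raw log-densities (shape: K x n_samples)."""
    k = log_p.shape[0]
    l_max = log_p.max(axis=0)
    # Shift by the per-sample maximum so the largest input to exp is 0.
    return l_max + np.log(np.exp(log_p - l_max).sum(axis=0)) - np.log(k)

# Log-densities far beyond float64's exp range still combine safely.
log_p = np.array([[1000.0, -1000.0],
                  [1000.0, -1000.0]])
log_mean_density(log_p)   # array([1000., -1000.]) -- no overflow
```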
The values returned by outliers()/outliers_pdf()/outliers_pmf() are log-densities, not unitless scores. Rename the experiment attribute, DataFrame columns, and local variables to match what the values are.
- experiments.py: Experiment.scores -> Experiment.log_p; DataFrame columns "scores"/"score_rank" -> "log_p"/"log_p_rank"; run() return dict key "outlier_scores" -> "log_p".
- unsupervised.py: locals scores_pdf/scores_pmf -> log_pdf/log_pmf inside outliers_pdf()/outliers_pmf(); docstrings updated.
Public method names (outliers, outliers_pdf, outliers_pmf) are unchanged.
Code Review
This pull request refactors the unsupervised outlier detection logic to operate entirely in log-probability space, utilizing the log-sum-exp trick for numerical stability when averaging densities across permutations. Plotting functions in the experiment module have been updated to reflect these changes, including a shift from JointGrid to standard scatter plots. Review feedback identifies a critical bug in the feature masking logic within outliers_pdf and outliers_pmf that incorrectly indexes dimensions, and notes a breaking API change in the return type of the plot_two method.
Hi @ClementBourt - thanks for the PR, could you please merge the base branch in before review?
Just merged with base.
Heyho! Looking into this now as well; I was assigned to the related issue. The proposed change seems reasonable to me and is a valid alternative for aggregating outlier detection. While I am not an expert in the field, it should be fine to move to the more stable version of the code (i.e., with the new aggregation). This will be a breaking change, but it likely affects only a specific use case in the extensions. So I would be fine to merge as long as the tests pass and the example runs fine! Small note: please make sure to run ruff on the changes to get the linting correct.
The ruff checks should pass now.
Do you have an overview of the differences between the two approaches (e.g., the original vs. your proposal)? What could the implications be for users? In general, all of these approaches produce some form of outlier detection; it is unclear whether one is better or worse, but that may matter less than stability here.
Core explanation

The approach itself doesn't change from a theoretical standpoint: you still use the chain rule to transform a likelihood estimation into a series of model fittings. This works because TabPFN organically produces calibrated probabilities. For stability reasons, different feature permutations are averaged in the density estimation.

Let π ∈ S_d denote a permutation of the d features (so S_d is the symmetric group), and let p̂_π(x) denote the chain-rule density estimate produced when features are conditioned in the order π. Sampling K permutations π_1, …, π_K uniformly from S_d and averaging:

p̄(x) = (1/K) Σ_{k=1}^{K} p̂_{π_k}(x)

The previous implementation computed each individual p̂_π in log space and exponentiated back into linear space. Whenever numerical features were part of the mix, overflow could (and did) happen. The workaround in the code was to flip each density into its reciprocal, 1/p̂_π(x). This resolves the overflow for large pdfs but introduces a new one for very diffuse pdfs (the situation I encountered). This form also comes with another issue: when both pmf and pdf are small (both indicating an outlier), they now cancel each other out rather than compounding.

Long story short from here

Overflow can't be avoided in linear space; we need to stay in log space. The log-sum-exp identity lets us compute log p̄(x) without ever leaving log space, eliminating the overflow risk. Killing two birds with one stone: categorical and numerical features can now again be mixed together, as we can restore the direct combination of pmf and pdf (their logs add, so small values compound instead of cancelling).

More details

log-sum-exp trick: Let ℓ_k := log p̂_{π_k}(x) for the k-th sampled permutation π_k. Then:

log p̄(x) = logsumexp(ℓ_1, …, ℓ_K) − log K = ℓ_max + log( Σ_{k=1}^{K} exp(ℓ_k − ℓ_max) ) − log K

The right-hand identity is what makes the computation numerically stable: subtracting ℓ_max from every term before exponentiating guarantees the largest input to exp is exactly zero, so no overflow can occur.

Failed attempt: I first considered staying in log space by averaging the log-likelihoods log p̂_π directly:

(1/K) Σ_{k=1}^{K} ℓ_k
But this is equivalent to taking the geometric mean (GM) of the per-permutation densities in linear space, not the arithmetic mean (AM):

exp( (1/K) Σ_{k=1}^{K} ℓ_k ) = ( Π_{k=1}^{K} p̂_{π_k}(x) )^{1/K}

A given ordering of estimated densities under AM for a dataset is not guaranteed to be preserved under GM. Thus, depending on the method, specific points could be flagged as outliers or not.

Final nail in the coffin for GM: averaging in log space means we are making the assumption that

E_π[ E_{x∼p_true}[ log p̂_π(x) ] ] = E_{x∼p_true}[ log p_true(x) ]    (1)

This is equivalent to p̂_π ≡ p_true — in words, every chain-rule decomposition of the model agrees with the true joint density.

Proof. The right-hand side of (1) is the negative entropy of p_true:

E_{x∼p_true}[ log p_true(x) ] = −H(p_true)    (2)

The left-hand side can be written as:

E_π[ E_{x∼p_true}[ log p̂_π(x) ] ]    (3)

The inner expectation is the negative cross-entropy of p_true with respect to p̂_π:

E_{x∼p_true}[ log p̂_π(x) ] = −H(p_true, p̂_π)    (4)

Standard cross-entropy decomposition:

H(p_true, p̂_π) = H(p_true) + KL(p_true ‖ p̂_π)    (5)

Substituting (2), (3), (4), (5) into (1):

−E_π[ H(p_true) + KL(p_true ‖ p̂_π) ] = −H(p_true)  ⟺  E_π[ KL(p_true ‖ p̂_π) ] = 0

Since KL divergence is non-negative and vanishes only when its two arguments coincide, this forces KL(p_true ‖ p̂_π) = 0 for every π, i.e. p̂_π ≡ p_true: the assumption only holds if the model's density estimate is already exact under every feature ordering, which is far too strong. ∎
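The AM-vs-GM ordering disagreement is easy to exhibit numerically. A small sketch with made-up per-permutation density estimates for two points (K = 2 permutations):

```python
import numpy as np

# Hypothetical per-permutation density estimates p_hat_pi(x) for
# two points; the values are invented purely for illustration.
p_a = np.array([0.5, 0.5])    # both orderings agree: moderately likely
p_b = np.array([0.01, 1.2])   # orderings disagree sharply

am = lambda p: p.mean()                      # arithmetic mean
gm = lambda p: np.exp(np.log(p).mean())      # geometric mean = exp(mean of logs)

# AM ranks b above a, while GM ranks a above b:
am(p_a), am(p_b)   # 0.5 vs 0.605
gm(p_a), gm(p_b)   # 0.5 vs ~0.11
```

So under AM, point b looks more likely than a; under GM it looks less likely, and the outlier ordering flips.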
A summary for future reference:
Which sounds good to me to merge. Thank you for your contribution and help!
Disclaimer
This document was drafted by AI and reviewed by a human.
Linked issue
Closes #289 — Overflow in diffuse density.
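A minimal repro of the linked failure mode (synthetic per-feature densities, not the actual TabPFN pipeline): the linear-space product `exp(sum(log(p)))` overflows float64 as soon as the summed log-density exceeds ~709, while the log-density itself stays perfectly finite.

```python
import numpy as np

# Per-feature densities for one sample: a couple of very peaked
# numerical pdfs (values >> 1) push the product past float64's range.
densities = np.array([1e200, 1e200, 1e-30])

naive = np.exp(np.sum(np.log(densities)))   # overflows to inf
stable = np.sum(np.log(densities))          # finite log-density (~851.9)
```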
What changes
Replace `outliers()`'s combiner with the log-of-the-Arithmetic-Mean (AM)-via-logsumexp identity. Three coordinated edits in `unsupervised.py`, plus a matching consistency update in `experiments.py`.

`unsupervised.py`

- `outliers_single_permutation_()` drops the `1.0 /` inversion; the regressor branch reads the log-density directly: `log_pred = -pred["criterion"].forward(logits, y_predict).to(log_p.device)`. The resolved L641–L644 TODO block is removed.
- `outliers_single_permutation_` returns `log_p` only.
- `outliers()` averages densities via the log-sum-exp identity: no more `nan_to_num` clamps; log-densities don't blow up the way `exp(log_p)` did.
- Docstrings for `outliers()`, `outliers_pdf`, `outliers_pmf`, and the module-level usage example are updated to reflect log-density semantics.
- `outliers_pdf` and `outliers_pmf` rename their local variables `pdf`/`pmf` → `log_pdf`/`log_pmf` to reflect the log-density semantics.

`experiments.py`

- `OutlierDetectionUnsupervisedExperiment` is updated to match the new score semantics: `self.p` → `self.log_p`, DataFrame column `"p"` → `"log_p"`, rank column `"p_rank"` → `"log_p_rank"`.
- `plot_two()`: threshold tests inverted from `x > thresh` to `x < thresh`. Under the old `1/pdf` combiner, large output meant low joint pdf meant outlier; under `log(AM(pdf))`, low output means outlier — the opposite direction. Default percentiles are inverted to match: `outlier_thresh_p` 0.98 → 0.02, `outlier_thresh_p_1` 0.9 → 0.1, with oversampling fractions and legend labels updated accordingly.
- The `[self.data["p"] > 0]` quantile filter is replaced with a `-inf` clamp. The old filter was a defensive measure against the old code's `nan_to_num(nan=0.0)` artifacts; under log-density semantics it would silently exclude all rows where the AM density is < 1, which is the dominant regime for diffuse predictive PDFs. The replacement clamps `-inf` rows (zero-pmf categories) to the finite minimum for the quantile computation only, since `np.quantile` interpolation across `-inf` produces `NaN` (`-inf - -inf`). The original `log_p` series is preserved for bucketing, where `x < thresh` still classifies `-inf` rows as Low correctly.
- `JointGrid` (joint scatter + marginal histograms) is replaced with a plain `scatter` plot.

Breaking changes

The return type of `outliers()`, `outliers_pdf()`, and `outliers_pmf()` changes from densities to log-densities. Per the project's semver policy in CONTRIBUTING.md, this is a MAJOR version bump. We recommend the strictly breaking change because the old return type encoded the bug; consumers can call `torch.exp(score)` to recover the AM density if they want it.

Public attribute `OutlierDetectionUnsupervisedExperiment.p` is renamed to `log_p`, and the DataFrame columns `"p"` and `"p_rank"` (in the same class's `self.data`) are renamed to `"log_p"` and `"log_p_rank"`. The threshold-test polarity in `plot_two()` is flipped (`x > thresh` → `x < thresh`); any consumer that was thresholding the old `p` column directly must invert their comparison in addition to renaming. Same MAJOR-bump concern.

Empirical evidence
Synthetic dataset designed to reproduce overflow (1000 rows, 2 categorical features × 9 classes, 5 numerical features as `sign(±) * 10^uniform(-15, +15)`, `n_permutations=10`, `seed=42`). Metrics tracked before and after the fix: `min(scores)`, `max(scores)`, `unique_values`, `frac_at_ceiling (≥ 1e29)`, `flag_rate @ 5th-pct rule`.

The post-fix `flag_rate=0.05` confirms the 5%-percentile contract is honored; `unique_values ≈ N` confirms full per-row resolution (no ties from clamp saturation).

Test plan

- `pytest tests/test_unsupervised.py` — all 5 pass (2 existing + 3 new)
- `pytest tests/` — full vendor suite passes
- `ruff check src/tabpfn_extensions/unsupervised/` — clean
- `mypy src/tabpfn_extensions/unsupervised/ --python-version=3.10` — clean
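For downstream code, a hedged migration sketch across the breaking change: column names follow this PR, while `df`, the sample values, and the 5%-flagging rule are illustrative only.

```python
import numpy as np
import pandas as pd

# Old: the "p" column held 1/pdf-style scores, HIGH meant outlier.
# New: the "log_p" column holds log-densities, LOW means outlier.
df = pd.DataFrame({"log_p": [-np.inf, -40.0, -3.0, -1.0, -0.5]})

# np.quantile interpolating across -inf can yield nan, so clamp -inf
# rows (zero-pmf categories) to the finite minimum for the threshold
# computation only; the original series is kept for bucketing.
finite_min = df.loc[np.isfinite(df["log_p"]), "log_p"].min()
thresh = np.quantile(df["log_p"].clip(lower=finite_min), 0.05)

# Comparison direction is flipped relative to the old API: < not >.
df["is_outlier"] = df["log_p"] < thresh

# Recover the arithmetic-mean density if linear-space values are needed.
df["density"] = np.exp(df["log_p"])
```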