
Commit 2bdc1ab

add getting started page for documentation
1 parent 3e0b185 commit 2bdc1ab

2 files changed

Lines changed: 102 additions & 54 deletions

File tree

docs/source/getting_started.rst
docs/source/modules/protocols.rst

Lines changed: 101 additions & 54 deletions

@@ -1,98 +1,145 @@

Getting Started
===============

**mlquantify** is a comprehensive Python toolkit for **Quantification** (also known as *Class Prevalence Estimation*, *Class Prior Estimation*, or *Shift Estimation*).
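
To make the task concrete, here is a tiny illustration (plain Python, no library code) of the quantity being estimated:

.. code-block:: python

    # Quantification targets the class distribution of a whole sample,
    # not the label of each individual instance.
    y_sample = [0, 0, 0, 1]  # labels in a test sample
    prevalence = [y_sample.count(c) / len(y_sample) for c in (0, 1)]
    print(prevalence)  # [0.75, 0.25]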

Installation
------------

You can install ``mlquantify`` using pip:

.. code-block:: bash

    pip install mlquantify

Or install the latest development version from source:

.. code-block:: bash

    git clone https://github.com/luizfernandolj/mlquantify.git
    cd mlquantify
    pip install .

Basic Usage
-----------

Most quantifiers in ``mlquantify`` behave like scikit-learn estimators. They implement ``fit(X, y)`` and ``predict(X)`` methods.

.. code-block:: python

    from sklearn.linear_model import LogisticRegression
    from mlquantify.adjust_counting import CC  # Classify & Count

    # 1. Initialize a base classifier
    estimator = LogisticRegression()

    # 2. Wrap it with an aggregative quantifier (e.g., CC)
    quantifier = CC(estimator)

    # 3. Fit on labeled training data (X_train/y_train: your training split)
    quantifier.fit(X_train, y_train)

    # 4. Predict class prevalences on new data
    prevalences = quantifier.predict(X_test)

    print(prevalences)
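
The result is a vector with one prevalence per class that sums to one; on an imbalanced binary dataset the output might look like ``[0.79, 0.21]``.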

The ``fit`` Parameters
----------------------

All aggregative quantifiers in ``mlquantify`` support a consistent set of parameters in their ``fit`` method to control how the underlying classifier is trained or used (see the sketch after this list):

* ``X``: The training input samples (array-like or sparse matrix).
* ``y``: The target values (class labels).
* ``learner_fitted`` (bool): If ``True``, assumes the provided estimator is already trained. If ``False`` (default), trains the estimator on the provided ``X`` and ``y``.
* ``cv`` (int, cross-validation generator, or iterable): Determines the cross-validation splitting strategy for generating internal predictions (used by methods like ACC and PACC).
* ``stratified`` (bool): If ``True``, uses stratified folds for cross-validation.
* ``shuffle`` (bool): Whether to shuffle the data before splitting in cross-validation.
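
A minimal sketch of these options, assuming the parameter names above (``X_train``/``y_train`` are your training split):

.. code-block:: python

    from sklearn.linear_model import LogisticRegression
    from mlquantify.adjust_counting import CC

    # Reuse a classifier that was already trained elsewhere
    clf = LogisticRegression().fit(X_train, y_train)
    quantifier = CC(clf)
    quantifier.fit(X_train, y_train, learner_fitted=True)

    # Or let the quantifier train internally, with 10 stratified, shuffled folds
    quantifier = CC(LogisticRegression())
    quantifier.fit(X_train, y_train, cv=10, stratified=True, shuffle=True)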

Aggregative Quantifiers & ``aggregate``
---------------------------------------

Aggregative methods (like CC, ACC, PCC) estimate prevalence by aggregating predictions from individual instances.

Unlike standard estimators, they offer an additional ``aggregate`` method. This allows you to perform quantification **without re-predicting** if you already have the classifier's outputs (labels or probabilities) for your test set.

.. code-block:: python

    # Assume we already have the classifier's predictions for the test set
    predictions = estimator.predict(X_test)

    # Use 'aggregate' directly - no need to re-predict on X_test
    estimated_prevalence = quantifier.aggregate(predictions)
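
For probabilistic aggregative methods, the aggregated inputs would be posterior probabilities rather than crisp labels. A hypothetical sketch, assuming ``PCC`` lives in the same module as ``CC``:

.. code-block:: python

    from mlquantify.adjust_counting import PCC  # assumption: same module as CC

    pcc = PCC(estimator)
    pcc.fit(X_train, y_train)

    # Probabilistic methods aggregate posteriors instead of predicted labels
    posteriors = estimator.predict_proba(X_test)
    estimated_prevalence = pcc.aggregate(posteriors)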

Model evaluation
----------------

Fitting a model to some data does not entail that it will predict well on unseen data. This needs to be evaluated directly. We typically use a ``train_test_split`` to split a dataset into train and test sets, and then use specific metrics to compare the predicted prevalences against the true prevalences.

``mlquantify`` provides many tools for model evaluation in the :mod:`mlquantify.metrics` module.

.. code-block:: python

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from mlquantify.adjust_counting import CC
    from mlquantify.metrics import MAE
    from mlquantify.utils import get_prev_from_labels

    # Generate synthetic data
    X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

    # Initialize and fit
    quantifier = CC(LogisticRegression())
    quantifier.fit(X_train, y_train)

    # Predict prevalences
    y_pred = quantifier.predict(X_test)

    # Calculate Mean Absolute Error (MAE)
    # y_test contains true labels; convert them to prevalences for comparison
    true_prev = get_prev_from_labels(y_test)
    error = MAE(true_prev, y_pred)
    print(f"Mean Absolute Error: {error:.4f}")
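
Continuing the snippet above, the metric itself is easy to verify by hand (assuming ``true_prev`` and ``y_pred`` are NumPy arrays of per-class prevalences):

.. code-block:: python

    import numpy as np

    # MAE over prevalence vectors is the mean of |true - predicted|
    manual_mae = np.mean(np.abs(true_prev - y_pred))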

Quantification Protocols
~~~~~~~~~~~~~~~~~~~~~~~~

In quantification, a single test set is often insufficient because we want to evaluate performance across *different* class distributions (shifts).

**Protocols** like the **Artificial Prevalence Protocol (APP)** allow you to generate many test samples with varying prevalences from a single dataset.

.. code-block:: python

    from mlquantify.protocols import APP
    from mlquantify.utils import get_prev_from_labels

    # Create an APP generator:
    # - n_prevalences=21: generate samples with prevalences from 0.0 to 1.0 (step 0.05)
    # - repeats=10: generate 10 different samples for each prevalence
    protocol = APP(batch_size=100, n_prevalences=21, repeats=10, random_state=42)

    errors = []

    # APP.split() yields indices for each test sample
    for test_index in protocol.split(X_test, y_test):
        X_sample, y_sample = X_test[test_index], y_test[test_index]

        # Predict prevalence on this specific sample
        pred_prev = quantifier.predict(X_sample)

        # Calculate error for this sample (true labels -> prevalence vector)
        errors.append(MAE(get_prev_from_labels(y_sample), pred_prev))

    print(f"Mean Absolute Error across {len(errors)} samples: {sum(errors)/len(errors):.4f}")

See :ref:`quantification_protocols` for more details on APP, NPP, and other protocols.
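
The **Natural Prevalence Protocol (NPP)** instead draws samples that follow the dataset's natural class distribution. A hypothetical sketch, assuming ``NPP`` mirrors APP's interface:

.. code-block:: python

    from mlquantify.protocols import NPP  # assumption: mirrors APP's interface

    protocol = NPP(batch_size=100, repeats=100, random_state=42)
    for test_index in protocol.split(X_test, y_test):
        pred_prev = quantifier.predict(X_test[test_index])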

Next steps
----------

We have briefly covered estimator fitting and predicting, aggregative methods, and model evaluation. This guide should give you an overview of some of the main features of the library, but there is much more to ``mlquantify``!

Please refer to our :ref:`user-guide` for details on all the tools that we provide, including **Non-Aggregative Methods**, **Meta Quantification**, and **Confidence Intervals**. You can also find an exhaustive list of the public API in the :ref:`api`.

You can also look at our numerous :ref:`examples` that illustrate the use of ``mlquantify`` in many different contexts.

docs/source/modules/protocols.rst

Lines changed: 1 addition & 0 deletions

@@ -75,6 +75,7 @@

The :class:`UPP` is a variant of the APP that ensures uniform sampling of class prevalences.

- Particularly beneficial in multiclass quantification tasks (less computationally intensive).

**Example**

.. code-block:: python

    from mlquantify.protocols import UPP
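
A hypothetical continuation of this example, assuming ``UPP`` shares APP's constructor and ``split`` interface:

.. code-block:: python

    # Hypothetical usage, assuming UPP mirrors APP's interface
    protocol = UPP(batch_size=100, repeats=10, random_state=42)
    for test_index in protocol.split(X_test, y_test):
        ...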
