Getting Started
===============

**mlquantify** is a comprehensive Python toolkit for **Quantification** (also known as *Class Prevalence Estimation*, *Class Prior Estimation*, or *Shift Estimation*). This guide walks through the main features of the library; basic familiarity with machine learning practice (fitting, evaluation, etc.) is assumed.
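
To make the task concrete: a "prevalence" is simply the class distribution of a dataset. The sketch below (plain numpy, independent of ``mlquantify``) computes the true prevalence of a toy label vector; quantifiers estimate this quantity for *unlabeled* data.

```python
import numpy as np

# Toy label vector: quantification asks "what fraction of the data
# belongs to each class?", not "which class is each instance?"
y = np.array([0, 0, 0, 1, 0, 1, 0, 0, 0, 0])

# True prevalence: the proportion of each class
classes, counts = np.unique(y, return_counts=True)
prevalence = counts / counts.sum()
print(prevalence)  # [0.8 0.2]
```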

Installation
------------

You can install ``mlquantify`` using pip:

.. code-block:: bash

    pip install mlquantify

Or install the latest development version from source:

.. code-block:: bash

    git clone https://github.com/luizfernandolj/mlquantify.git
    cd mlquantify
    pip install .

Basic Usage
-----------

Most quantifiers in ``mlquantify`` behave like scikit-learn estimators. They implement ``fit(X, y)`` and ``predict(X)`` methods.

.. code-block:: python

    from sklearn.linear_model import LogisticRegression
    from mlquantify.adjust_counting import CC  # Classify & Count

    # (assumes X_train, y_train, and X_test are already defined)

    # 1. Initialize a base classifier
    estimator = LogisticRegression()

    # 2. Wrap it with an aggregative quantifier (e.g., CC)
    quantifier = CC(estimator)

    # 3. Fit on labeled training data
    quantifier.fit(X_train, y_train)

    # 4. Predict class prevalences on new data
    prevalences = quantifier.predict(X_test)

    print(prevalences)

The ``fit`` Parameters
----------------------

All aggregative quantifiers in ``mlquantify`` support a consistent set of parameters in their ``fit`` method to control how the underlying classifier is trained or used:

* ``X``: The training input samples (array-like, sparse matrix).
* ``y``: The target values (class labels).
* ``learner_fitted`` (bool): If ``True``, assumes the provided estimator is already trained. If ``False`` (default), trains the estimator on the provided ``X`` and ``y``.
* ``cv`` (int, cross-validation generator, or iterable): Determines the cross-validation splitting strategy for generating internal predictions (used by methods like ACC, PACC).
* ``stratified`` (bool): If ``True``, uses stratified folds for cross-validation.
* ``shuffle`` (bool): Whether to shuffle the data before splitting in cross-validation.
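
For intuition about ``cv``, ``stratified``, and ``shuffle``: adjustment-based methods such as ACC need classifier predictions for the training data that come from models which did not see those points. This is the standard out-of-fold prediction scheme, sketched here with plain scikit-learn (an illustration of the idea, not mlquantify's internal code):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(n_samples=500, random_state=0)

# Stratified, shuffled 5-fold CV: every training point is predicted by a
# model fitted on the other folds (what cv/stratified/shuffle control)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
oof_pred = cross_val_predict(LogisticRegression(), X, y, cv=cv)

print(oof_pred.shape)  # (500,)
```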

Aggregative Quantifiers & ``aggregate``
---------------------------------------

Aggregative methods (like CC, ACC, PCC) estimate prevalence by aggregating predictions from individual instances.

Unlike standard estimators, they offer an additional ``aggregate`` method. This allows you to perform quantification **without re-predicting** if you already have the classifier's outputs (labels or probabilities) for your test set.

.. code-block:: python

    # Assume we already have predictions for the test set
    predictions = classifier.predict(X_test)

    # Use 'aggregate' directly - no need for X_test
    estimated_prevalence = quantifier.aggregate(predictions)
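
For intuition, the aggregation step of CC is just counting: the estimated prevalence of a class is the fraction of instances the classifier assigned to it. A minimal numpy sketch of this idea (not the library's implementation):

```python
import numpy as np

# Hypothetical hard predictions from a classifier on 10 test instances
predictions = np.array([0, 1, 0, 0, 1, 0, 0, 0, 1, 0])

# Classify & Count: prevalence = share of each predicted label
classes = np.array([0, 1])
prevalence = np.array([(predictions == c).mean() for c in classes])
print(prevalence)  # [0.7 0.3]
```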

Model evaluation
----------------

Fitting a model to some data does not entail that it will predict well on unseen data. This needs to be directly evaluated. We typically use a ``train_test_split`` to split a dataset into train and test sets, and then use specific metrics to compare the predicted prevalences against the true prevalences.

``mlquantify`` provides many tools for model evaluation in the :mod:`mlquantify.metrics` module.

.. code-block:: python

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from mlquantify.adjust_counting import CC
    from mlquantify.metrics import MAE
    from mlquantify.utils import get_prev_from_labels

    # Generate synthetic data
    X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

    # Initialize and fit
    quantifier = CC(LogisticRegression())
    quantifier.fit(X_train, y_train)

    # Predict prevalences
    y_pred = quantifier.predict(X_test)

    # Calculate Mean Absolute Error (MAE)
    # y_test contains true labels; we convert them to prevalences for comparison
    error = MAE(get_prev_from_labels(y_test), y_pred)
    print(f"Mean Absolute Error: {error:.4f}")
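
The metric itself is simple: the mean absolute error between two prevalence vectors averages the per-class absolute differences. A quick numpy sketch (not ``mlquantify.metrics.MAE`` itself):

```python
import numpy as np

true_prev = np.array([0.8, 0.2])    # true class distribution
pred_prev = np.array([0.75, 0.25])  # estimated class distribution

# MAE over classes: mean of |true - predicted|
mae = np.abs(true_prev - pred_prev).mean()
print(round(mae, 4))  # 0.05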

Quantification Protocols
~~~~~~~~~~~~~~~~~~~~~~~~

In quantification, a single test set is often insufficient because we want to evaluate performance across *different* class distributions (shifts).

**Protocols** like the **Artificial Prevalence Protocol (APP)** allow you to generate many test samples with varying prevalences from a single dataset.

.. code-block:: python

    from mlquantify.protocols import APP
    from mlquantify.utils import get_prev_from_labels

    # Create an APP generator:
    # - n_prevalences=21: generate samples with prevalences from 0.0 to 1.0 (step 0.05)
    # - repeats=10: generate 10 different samples for each prevalence
    protocol = APP(batch_size=100, n_prevalences=21, repeats=10, random_state=42)

    errors = []

    # APP.split() yields indices for each test sample
    for test_index in protocol.split(X_test, y_test):
        X_sample, y_sample = X_test[test_index], y_test[test_index]

        # Predict prevalence on this specific sample
        pred_prev = quantifier.predict(X_sample)

        # Calculate error for this sample against its true prevalence
        errors.append(MAE(get_prev_from_labels(y_sample), pred_prev))

    print(f"Mean Absolute Error across {len(errors)} samples: {sum(errors)/len(errors):.4f}")

See :ref:`quantification_protocols` for more details on APP, NPP, and other protocols.
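
To make the protocol concrete, here is a simplified numpy sketch of APP-style sampling: for each target prevalence, draw a fixed-size sample whose class distribution matches it. This only illustrates the idea, not mlquantify's implementation:

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=1000)      # labels of a pool to resample from
pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)

batch_size = 100
for target_prev in [0.0, 0.25, 0.5, 0.75, 1.0]:
    n_pos = int(round(target_prev * batch_size))
    # Draw n_pos positives and (batch_size - n_pos) negatives with replacement
    sample = np.concatenate([
        rng.choice(pos_idx, n_pos, replace=True),
        rng.choice(neg_idx, batch_size - n_pos, replace=True),
    ])
    # The sample's observed prevalence equals the requested one
    print(target_prev, (y[sample] == 1).mean())
```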

Next steps
----------

We have briefly covered estimator fitting and predicting, aggregative methods, and model evaluation. This guide should give you an overview of some of the main features of the library, but there is much more to ``mlquantify``!

Please refer to our :ref:`user-guide` for details on all the tools that we provide, including **Non-Aggregative Methods**, **Meta Quantification**, and **Confidence Intervals**. You can also find an exhaustive list of the public API in the :ref:`api`.

You can also look at our numerous :ref:`examples` that illustrate the use of ``mlquantify`` in many different contexts.