Add selection with fdr and associate test#361
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #361 +/- ##
==========================================
+ Coverage 97.87% 99.43% +1.56%
==========================================
Files 22 22
Lines 1223 1247 +24
==========================================
+ Hits 1197 1240 +43
+ Misses 26 7 -19 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
bthirion
left a comment
There was a problem hiding this comment.
fdr control should be based on p-values or e-values only.
LGTM otherwise.
bthirion
left a comment
There was a problem hiding this comment.
oops, I had a few comments not pushed yet. HTH.
Co-authored-by: bthirion <bertrand.thirion@inria.fr>
Co-authored-by: bthirion <bertrand.thirion@inria.fr>
Co-authored-by: bthirion <bertrand.thirion@inria.fr>
Co-authored-by: Joseph Paillard <joseph.paillard@inria.fr>
bthirion
left a comment
There was a problem hiding this comment.
Thx for taking care of that.
I have a few simplications suggestions.
| Selects features with importance scores above the specified threshold. | ||
| threshold_pvalue : float, optional, default=None | ||
| Selects features with p-values below the specified threshold. | ||
| threshold_max : float, default=None |
There was a problem hiding this comment.
I'm not sure whether this argument really makes sense ?
I think I would have a unique threshold argument for this function.
There was a problem hiding this comment.
It's because sometimes, we want to have the maximum or the minimum.
| Selects features based on a specified percentile of p-values. | ||
| threshold_max : float, default=0.05 | ||
| Selects features with p-values below the specified maximum threshold (0 to 1). | ||
| threshold_min : float, default=None |
There was a problem hiding this comment.
similarly, I don't see any use case for threshold_min here.
There was a problem hiding this comment.
The first idea is to propose a generic way of selection.
My first idea of using it is to have a selecting the feature to discard.
| Selects features with p-values below the specified maximum threshold (0 to 1). | ||
| threshold_min : float, default=None | ||
| Selects features with p-values above the specified minimum threshold (0 to 1). | ||
| alternative_hypothesis : bool, default=False |
There was a problem hiding this comment.
I don't see the use case for alternative hypothesis.
There was a problem hiding this comment.
This was present in the EnCluDL, I add the option for keeping the same possibilities.
| reshaping_function: callable or None, default=None | ||
| Optional reshaping function for FDR control methods. | ||
| If None, defaults to sum of reciprocals for 'bhy'. | ||
| alternative_hippothesis: bool or None, default=False |
There was a problem hiding this comment.
Same thing here, I don't see any reason to consider an alternative hypothesis. This is because importance tests are all one-sided tests that test whether importance is greater 0 (=significantly different from 0, in that case).
There was a problem hiding this comment.
This was present in the EnCluDL, I add the option for keeping the same possibilities.
There was a problem hiding this comment.
Yes, but there are good reasons for that: EncluDL yields a signed statistic, not dCRT.
| random_state=None, | ||
| reuse_screening_model=True, | ||
| k_best=None, | ||
| k_lowest=None, |
There was a problem hiding this comment.
k_lowest is hard to interpret: it only makes sense because we're considering p-values.
There was a problem hiding this comment.
Users won't use it if they can't interpret.
There was a problem hiding this comment.
For DCRT, only pvalue is considered.
jpaillard
left a comment
There was a problem hiding this comment.
Looks almost ready.
- I agree with the comments regarding simplifying the signature of selection functions.
- I suggest simplifying the smoke test: one test per function with multiple asserts to explore branching seems enough to me and would cut duplicated code.
- It would be good to add an example illustrating how to use the new functions. No need to do it here, but could you open an issue for that?
| [0, 2], | ||
| ids=["default_seed", "another seed"], | ||
| ) | ||
| class TestSelection: |
There was a problem hiding this comment.
The tests in this class are smoke tests and have a lot of duplicated code. I think it would be ok to gather all the smoke tests that explore the different selections in one test, or maybe 2 to separate importance_selection and p_value_selection.
There was a problem hiding this comment.
There are not smoke tests because they test the result directly, the values into the array and not only the shape.
It's better to have only one assertion by test. This type of test is a call unit test and they shouldn't be gathered. To group them, I use classes for it.
I don't see the duplication of the code. Each test, test one specific parameter.
There was a problem hiding this comment.
I think it would also be good to have a "behaviour test" with simulated data.
Ideally, in high dimensions, with a method that is not computationally costly, to show that it reduces the number of false discoveries.
There was a problem hiding this comment.
In issue #375, the "behaviour tests", also call system test/user acceptance test, were not defined for the moment.
I will open an issue in regard to it.
Add the method for FDR and the test associate test for these two methods.
Fix bug in selection