[API 2]: Model X knockoff#367
Conversation
Originally posted by @AngelReyero in #361 (comment)
Originally posted by @jpaillard in #361 (comment) |
6038de1 to
edeab86
Compare
jpaillard
left a comment
There was a problem hiding this comment.
Looks overall good. I mostly have comments for naming and organisation.
I would suggest to set by default statistical_test='lcd', rename lasso_statistic_with_sampling --> ModelXKnockoff.lasso_coefficient_difference and make it an internal method of the class ModelXKnockoff. IMO the method should be attached to the class, and use the attribute self.estimator (see below).
To the same extend, the class should have a parameter estimator in order to expose the Lasso model instead of keeping it under the hood. Especially since it uses a particularly high default max_iter=200000,
| n_bootstraps = 25 | ||
| # number of jobs for repetition of the method | ||
| n_jobs = 2 | ||
| n_jobs = 1 |
There was a problem hiding this comment.
- I think it makes sense to illustrate in the example where users should use parallelization and to do so using n_jobs>1
- If we decide not to show it I would remove n_jobs and rely on the default
There was a problem hiding this comment.
n_jobs=2 (or even 4) is a better option IMHO.
There was a problem hiding this comment.
I get some issue when I was running with n_jobs=2.
There was a problem hiding this comment.
For running the example, I got some errors in the last three commits. This was the solution to my problem.
There was a problem hiding this comment.
The creation of the class makes a track of some states, which uses too much memory and create an error.
There was a problem hiding this comment.
May be related to #356.
We still have the problem of nested parallel loops in the example generation. I would still argue that showcasing parallelization in the example is valuable.
|
|
||
| def preconfigure_LassoCV(estimator, X, X_tilde, y, n_alphas=20): | ||
| """ | ||
| Configure the estimator for Model-X knockoffs. |
There was a problem hiding this comment.
I think the docstring should focus more on L43-44: the regularization path is defined in a data-dependent way.
The paragraph in Notes should be the central part of the docstring. Is there a reference?
| return estimator | ||
|
|
||
|
|
||
| def lasso_statistic_with_sampling( |
There was a problem hiding this comment.
Similar to D0CRT (regression_test, logistic_test), IMO lasso_statistic should be a method of the knockoff class. As its signature suggests, it is KO-specific. it would be more intuitive to place it in the same file, under the class.
There was a problem hiding this comment.
From what I understand, the actual implementation of the test is the original one but there are propositions for modifying.
I want it to let the possibility for users to modify it.
There was a problem hiding this comment.
Similar to D0CRT, we can consider extending to other test statistics, but they should still be methods of the class.
- These methods will not be used anywhere else
- Would be simplified if the class state were used
There was a problem hiding this comment.
I wanted it to separate the configuration of Lasso from knockoff because I don't like to add the parameter preconfigure.
I put everything to a knockoff.
bthirion
left a comment
There was a problem hiding this comment.
Only minor things remaining, thx !
| n_bootstraps = 25 | ||
| # number of jobs for repetition of the method | ||
| n_jobs = 2 | ||
| n_jobs = 1 |
There was a problem hiding this comment.
n_jobs=2 (or even 4) is a better option IMHO.
The lasso is present at the user API level. I use the default value of the code. I can modify it in order to remove all these default parameters. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #367 +/- ##
==========================================
- Coverage 99.31% 99.19% -0.12%
==========================================
Files 24 24
Lines 1309 1364 +55
==========================================
+ Hits 1300 1353 +53
- Misses 9 11 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Please chekc the logic. This fixes the error reported in #520
| # Number of variables | ||
| n_features = 150 | ||
| # Correlation parameter | ||
| n = 300 |
There was a problem hiding this comment.
If you can add the description of the parameter as before, it should be better.
There was a problem hiding this comment.
I prefer n_samples, n_features, which are self-explaining.
| max_iter=1000, | ||
| ), | ||
| random_state=0, | ||
| n_repeats=1, |
There was a problem hiding this comment.
For example, it's sometimes better to declare some parameters even if the value is the same as the default value.
There was a problem hiding this comment.
Agreed, but in that case, we just want to show the vanilla knockoff, not the aggregation.
So the user can simply ignore the existence of this parameter.
| return test_statistic | ||
|
|
||
| @staticmethod | ||
| def knockoff_threshold(test_score, fdr=0.1): |
There was a problem hiding this comment.
This is a provide method.
There was a problem hiding this comment.
Why do you don't add the other methods (_empirical_knockoff_pval and _empirical_knockoff_eval) into the class?
There was a problem hiding this comment.
I guess you mean private? If so, I am not sure that it should be a private method. As shown in the example, having access to the knockoff threshold can be quite useful for visualization / understanding the data.
For the second comment I agree, they can all be class methods
There was a problem hiding this comment.
yes, I meant private.
There was a problem hiding this comment.
What is the point of using a staticmethod here ? it could be a standard class method since it needs to access test_scores anyhow ?
There was a problem hiding this comment.
Same question for the next 2 static methods.
Note that it may be simply a poor understanding on my side.
| k_star = 1 | ||
| # The for loop over all e-values could be optimized by considering a descending list | ||
| # and stopping when the condition is not satisfied anymore. | ||
| for k, e_k in enumerate(evals_sorted, start=1): |
There was a problem hiding this comment.
Add comments because it is quite unusual in python that the enumeration starts at 1.
Moreover, if you add the link to the equation in the paper, it will be better.
| for i in range(n_features - 1, -1, -1): | ||
| if evals_sorted[i] >= n_features / (fdr * (i + 1)): | ||
| selected_index = i | ||
| break |
There was a problem hiding this comment.
This way of finding the maximum is more optimising than your new implementation.
There was a problem hiding this comment.
I know, but it's also broken. I replaced it with a non-optimized version that is easy to read and clearly reflects Equation 5 of Wang and Ramdas (2022).
This is not a critical increase in computation; it is simply a for loop with an if condition at each step of the loop. I still added a comment mentioning that it could be optimized.
| for k, e_k in enumerate(evals_sorted, start=1): | ||
| if k * e_k >= n_features / fdr: | ||
| k_star = k | ||
| if k_star <= n_features: |
There was a problem hiding this comment.
| if k_star <= n_features: | |
| if k_star-1 <= n_features: |
There was a problem hiding this comment.
See issue #520.
By the way, it will be better to do a PR only on this modification.
There was a problem hiding this comment.
I think it should be k_star otherwise you can end up with k_star = n_features + 1 which is problematic.
Actually, I think this test is not necessary; the condition is always fulfilled because the possible range of values k_star can take in the for loop is always met.
bthirion
left a comment
There was a problem hiding this comment.
This looks great. I only have relatively minor comments left.
| # Number of variables | ||
| n_features = 150 | ||
| # Correlation parameter | ||
| n = 300 |
There was a problem hiding this comment.
I prefer n_samples, n_features, which are self-explaining.
| return test_statistic | ||
|
|
||
| @staticmethod | ||
| def knockoff_threshold(test_score, fdr=0.1): |
There was a problem hiding this comment.
What is the point of using a staticmethod here ? it could be a standard class method since it needs to access test_scores anyhow ?
| return test_statistic | ||
|
|
||
| @staticmethod | ||
| def knockoff_threshold(test_score, fdr=0.1): |
There was a problem hiding this comment.
Same question for the next 2 static methods.
Note that it may be simply a poor understanding on my side.
Co-authored-by: bthirion <bertrand.thirion@inria.fr>
|
To transform them to standard class methods, we would need to move the for loops from the |
OK, I see. I think we need to brainstorm a little bit on this ---i.e. we're likely going to break the API in the future. But Let's merge the PR as is. |
This PR is based on the PR :
