feat: Add support for third-party API models#164
feat: Add support for third-party API models#164marcbal77 merged 15 commits intobio-learn:masterfrom
Conversation
2a46602 to
4be828e
Compare
0c2c5ef to
ad8b326
Compare
947bf6d to
c3d75fe
Compare
This commit introduces a minimal framework for integrating third-party API models into biolearn, with Hurdle Bio's Inflammage model as the first implementation. Key changes: - Add HurdleAPIModel class to model.py for API-based predictions - Integrate HurdleAPIModel into ModelGallery for consistent usage - Implement privacy-first approach with explicit user consent - Add automatic imputation for missing CpG sites - Include comprehensive error handling with user-friendly messages Security & Privacy: - API keys managed via environment variables or direct input - Explicit consent required before sending data to external servers - Updated .gitignore to exclude credential files Documentation: - Add user guide in docs/hurdle_api_guide.md - Include working example in examples/hurdle_api_example.py - Placeholder for Hurdle CpG sites list Testing: - Add test suite for HurdleAPIModel functionality - Mock API calls to avoid requiring credentials in tests This implementation minimizes changes to the existing codebase while providing a robust foundation for future API model integrations.
- Add actual CpG sites to Hurdle_CpGs.csv.example file (first 100 sites) - Fix test_consent_denied decorator issue (missing mock_input parameter) - Skip HurdleAPIModel in main test_model.py suite (requires API credentials) - Update test data to include proper CpG site indices for Hurdle tests - Fix file path to use Hurdle_CpGs.csv.example instead of .csv
- Format biolearn/test/test_hurdle_model.py - Format biolearn/test/test_model.py
- Add real CpG sites to example file, improve error handling and validation - Simplify consent mechanism, reduce test redundancy, add type hints - Move documentation to doc/ folder, clean up examples
- Update HurdleAPIModel implementation to match latest codebase - Fix test imports and references after rebase - Maintain compatibility with existing model architecture
- Add self._consent_given flag to track consent state - Only ask for consent if not already given - Remove duplicate base_url initialization - Fixes failing test test_consent_only_asked_once
- Format base_url assignment as multi-line for consistency - Fixes CI formatting check failure
…ntation - Rename HurdleInflammage to HurdleInflammAge - Switch to production API (use_production=True) - Update registration URL to https://dashboard.hurdle.bio/register - Add non-commercial use disclaimer to documentation and docstring - Rename Hurdle_CpGs.csv.example to Hurdle_CpGs.csv - Document 0.5 imputation for missing CpG sites - Update example script with better error handling - Update all references in tests and documentation
- Model now errors if any required CpG sites are missing - Provides informative error with count and examples of missing sites - Directs users to use ModelGallery imputation methods - Updated documentation to explain missing data handling - Added test for missing CpG error - All tests passing (143 passed, 4 skipped)
c3d75fe to
418a91b
Compare
|
This PR is ready for review (and to be merged) I rebased on latest master, all CI passing. Tested with production API key - smoke test successful (5 samples returned valid InflammAge predictions). There ended up being a couple fixes: corrected sex mapping (0=female per biolearn standard that we previously merged and I hadn't updated to prior), added methylation_sites() for imputation support, removed empty row in CpG file that somehow I missed prior. @sarudak |
sarudak
left a comment
There was a problem hiding this comment.
Looks good. Just a few minor concerns
.gitignore
Outdated
| # API credentials and sensitive files | ||
| *_api_key* | ||
| *_credentials* | ||
| HurdleTesting.ipynb |
There was a problem hiding this comment.
Nit: This probably doesn't need to be here. We already have an excluded folder for notebooks
| def __init__( | ||
| self, | ||
| api_key: Optional[str] = None, | ||
| use_production: bool = False, |
There was a problem hiding this comment.
Shouldn't production be the default for users of the library?
There was a problem hiding this comment.
The design is intentional - ModelGallery.get("HurdleInflammAge") already uses production via the model definition (use_production: True). The init default of False only affects direct class instantiation, which protects developers experimenting with the class from accidentally hitting production. Does that address your concern, or would you prefer a different approach?
There was a problem hiding this comment.
Ah that makes a lot more sense. Let's move forward with this for now then.
biolearn/model.py
Outdated
| # Biolearn standard: 0=female, 1=male (metadata_standard.rst) | ||
| sample_meta["sex"] = ( | ||
| "f" | ||
| if sex_value in [0, "f", "F", "female"] |
There was a problem hiding this comment.
I feel a bit uncomfortable with this kind of logic. Perhaps we should only accept biolearn standard and if for some reason their data has values other than 0 and 1 we throw an exception. We could add a validate_sex function to the geodata. That way it won't proceed silently on an incorrect assumption about what the sex data is.
| from biolearn.model_gallery import ModelGallery | ||
|
|
||
|
|
||
| class TestHurdleAPIModel: |
There was a problem hiding this comment.
Nit: Would be nice to have at least one integration test that can be run if you have the key
There was a problem hiding this comment.
I guess the example kinda counts as an integration test. Required some effort to actually run though
|
@marcbal77 Were you going to change the sex value code or should we merge? We can fix it in another PR and not have this one hanging around. |
- Validate sex values strictly: only accept 0 (female) or 1 (male) - Raise clear error for invalid sex values - Add tests for sex validation - Remove redundant HurdleTesting.ipynb from gitignore
Yes, trying to fix, took a little longer than expected. Cleaned up comments 1, 3, and 4. |
|
I think it's ready to go now |
Description
This PR introduces a minimal framework for integrating third-party API models into biolearn, with Hurdle Bio's Inflammage model as the first implementation.
Key Changes
Security & Privacy Features
Documentation
Testing
Usage Example