@ajude2s ajude2s commented Jan 22, 2026

No description provided.

mali-git and others added 13 commits November 18, 2025 11:31
- Added `PairedThresholdFilter` class to filter text JSONL based on scores from paired JSONL files.
- Introduced `ThresholdFilterPipelineBuilder` to facilitate the construction of filtering pipelines with support for local and Slurm execution.
- Created tests for `PairedThresholdFilter` to ensure correct filtering behavior, handling of mismatches, and error conditions.
- Developed `NumWordsFilter` for additional filtering based on word count, with corresponding tests.
- Implemented `ScoresParser` for parsing scores JSONL files with threshold filtering capabilities, including tests for various scenarios.
- Updated documentation and added example configurations for the new filtering pipeline.
- Updated configuration to support multiple input directories for text, scores, and optional domains.
- Added parameters for handling paired alignment errors and word-count filtering.
- Introduced a local dummy configuration for testing with paired data.
- Deprecated the old domain filtering class in favor of a unified approach.
- Modified the JSONL writer to include score and domain metadata in the output.
- Adjusted tests to reflect changes in the pipeline structure and expected outputs.
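
For readers without the diff at hand, the paired filtering described in the commits above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `paired_threshold_filter`, the field names `id`, `text`, and `score`, and the choice to raise on misalignment are all assumptions made for the example. The actual `PairedThresholdFilter` class and its configuration options may differ.

```python
import json
from typing import Iterable, Iterator


def paired_threshold_filter(
    text_lines: Iterable[str],
    score_lines: Iterable[str],
    threshold: float,
    min_words: int = 0,
    id_key: str = "id",        # assumed field name
    text_key: str = "text",    # assumed field name
    score_key: str = "score",  # assumed field name
) -> Iterator[dict]:
    """Yield text records whose paired score is >= threshold.

    Hypothetical sketch: line i of the text JSONL is paired with line i
    of the scores JSONL. A length or id mismatch raises ValueError here;
    the PR mentions a parameter for handling alignment errors, which this
    sketch does not model.
    """
    text_iter, score_iter = iter(text_lines), iter(score_lines)
    missing = object()  # sentinel marking an exhausted input
    line_no = 0
    while True:
        text_line = next(text_iter, missing)
        score_line = next(score_iter, missing)
        if text_line is missing and score_line is missing:
            return  # both inputs ended together
        line_no += 1
        if text_line is missing or score_line is missing:
            raise ValueError(f"inputs differ in length at line {line_no}")

        doc = json.loads(text_line)
        scored = json.loads(score_line)
        # Records without an id on both sides compare as None == None.
        if doc.get(id_key) != scored.get(id_key):
            raise ValueError(f"id mismatch at line {line_no}")

        score = scored[score_key]
        if score < threshold:
            continue  # below the score threshold
        if len(doc[text_key].split()) < min_words:
            continue  # too few whitespace-separated words

        # Carry the score along as metadata on the kept record.
        metadata = {**doc.get("metadata", {}), score_key: score}
        yield {**doc, "metadata": metadata}


texts = [
    '{"id": "a", "text": "hello big world"}',
    '{"id": "b", "text": "another long document"}',
    '{"id": "c", "text": "short"}',
]
scores = [
    '{"id": "a", "score": 0.9}',
    '{"id": "b", "score": 0.2}',
    '{"id": "c", "score": 0.8}',
]
kept = list(paired_threshold_filter(texts, scores, threshold=0.5, min_words=2))
```

With these inputs, only record `a` is kept: `b` falls below the threshold and `c` has fewer than two words. Reading both files in lockstep keeps memory use flat, at the cost of requiring that the two files share the same order.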

4 participants