[Eval] Add chromatin accessibility benchmark to the zero-shot eval suite #18

@BDBGenomics

Hey, I went through the eval suite carefully. It's a solid set of zero-shot tasks covering VEP, sequence recovery, perturbation sensitivity, and long-context retrieval. One notable gap: there's no chromatin accessibility eval, despite the model being trained on eukaryotic regulatory sequence where chromatin state is a primary functional signal.

A natural addition would be a zero-shot ATAC-seq peak discrimination task: given two sequences from the same genomic region, one from an open chromatin peak (ENCODE ATAC-seq narrowPeak) and one from flanking closed chromatin, does the model assign higher log-likelihood to the open one? This follows the same pairwise discrimination pattern already used in the perturbation tasks (mean(LL(real) > LL(perturbed))), so it slots cleanly into the existing eval structure.
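
A rough sketch of how the pairwise metric could be computed, assuming the per-sequence log-likelihoods have already been obtained from the model. The function name `pairwise_accuracy` and its argument names are hypothetical, not existing code in the eval suite.

```python
from typing import Sequence


def pairwise_accuracy(ll_open: Sequence[float], ll_closed: Sequence[float]) -> float:
    """Fraction of pairs where the open-chromatin sequence scores strictly higher.

    ll_open[i] and ll_closed[i] are assumed to be log-likelihoods for the
    open and closed sequence of the same pair. Ties count as misses.
    """
    if len(ll_open) != len(ll_closed):
        raise ValueError("expected one closed-chromatin score per open-chromatin score")
    if len(ll_open) == 0:
        raise ValueError("need at least one pair")
    wins = sum(o > c for o, c in zip(ll_open, ll_closed))
    return wins / len(ll_open)
```

A score of 0.5 would correspond to chance, the same baseline as in the perturbation tasks.
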

A minimal implementation would use ENCODE ATAC-seq peak calls (e.g. GM12878 or K562, already publicly available) as positive examples, with GC-content-matched flanking regions as negatives. Window size could mirror the VEP setup (8 kb centered on the peak summit).
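
A sketch of the two data-preparation steps, under stated assumptions: the narrowPeak `peak` column gives the summit as a 0-based offset from `chromStart`, candidate flanking sequences have already been extracted and filtered to exclude peaks, and a GC tolerance of 0.02 is an arbitrary placeholder. All function names here are hypothetical.

```python
from typing import List, Optional, Tuple


def summit_window(chrom_start: int, summit_offset: int, window: int = 8000) -> Tuple[int, int]:
    """0-based half-open window of `window` bp centered on the peak summit.

    The window is shifted right if it would start before position 0;
    clipping at the chromosome end is left to the caller.
    """
    summit = chrom_start + summit_offset
    start = max(0, summit - window // 2)
    return start, start + window


def gc_fraction(seq: str) -> float:
    """GC fraction over unambiguous bases (N and other codes are ignored)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def pick_gc_matched(positive: str, candidates: List[str], tolerance: float = 0.02) -> Optional[str]:
    """Return the candidate closest in GC content to `positive`, or None
    if no candidate falls within `tolerance` (assumed value)."""
    target = gc_fraction(positive)
    best = min(candidates, key=lambda s: abs(gc_fraction(s) - target), default=None)
    if best is None or abs(gc_fraction(best) - target) > tolerance:
        return None
    return best
```

Peaks with no acceptable negative would simply be dropped, so the reported accuracy is only over pairs that could be matched.
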

I work on ATAC-seq pipelines and would be happy to put together a PR for evaluation/atac_eval.py if this direction looks right to the team.
