Open Collider · Research

Built by Cédric Lion · @oparine_ai
Methodology and dataset behind the Open Collider study · Read the launch story

Detailed methodology and shipped dataset for the Open Collider experiment: a 12-project panel testing whether structurally distant domain collisions move LLM-generated ideas away from default-prompt baselines, and whether blind LLM judges prefer them on originality.

What is Open Collider?

Open Collider (OC) is an open-source method that operationalizes Koestler's bisociation theory of human creativity (1964) for LLM idea generation. Rather than telling a language model to "be more creative", OC injects structurally distant knowledge domains into the prompt and forces explicit collisions between the brief and those domains. The hypothesis: outputs land further from the default-prompt distribution (the Artificial Hivemind, per Jiang et al. 2025), in zones the model wouldn't reach on its own.

This repo holds the empirical artifacts that test that hypothesis. The engine itself, with skill-mode and API-mode pipelines, lives at github.com/CL-ML/open-collider.
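One way to make "further from the default-prompt distribution" measurable is a nearest-neighbor distance into the baseline cloud. The sketch below is an illustration only: it assumes cosine distance and a leave-one-out rule when the baseline is compared with itself, and the function names and toy 2-D vectors are hypothetical. The definition actually used for the study's `nn_in_B` metric is the one in methodology-and-results.md.

```python
import math


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_nn_distance(queries, cloud, exclude_self=False):
    """Mean distance from each query to its nearest neighbor in `cloud`.

    With exclude_self=True, queries[i] is not allowed to match cloud[i];
    this is the leave-one-out case used when a cloud is compared with itself.
    """
    total = 0.0
    for i, query in enumerate(queries):
        candidates = [
            cosine_distance(query, point)
            for j, point in enumerate(cloud)
            if not (exclude_self and i == j)
        ]
        total += min(candidates)
    return total / len(queries)


# Toy 2-D "embeddings" (hypothetical values, not taken from the dataset).
baseline_b = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]]
condition_a = [[0.0, 1.0], [0.1, 0.9]]

a_to_b = mean_nn_distance(condition_a, baseline_b)
b_to_b = mean_nn_distance(baseline_b, baseline_b, exclude_self=True)
print(a_to_b > b_to_b)  # True for these toy vectors
```

On real embeddings the same comparison would normally be vectorized; the pure-Python form is kept here so the logic is easy to follow.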

Headline result

Across 12 real-world projects, 4 conditions, ~23k generated ideas, and 4,320 blind LLM-judge verdicts:

Distance. On the nn_in_B distance, condition A (OC bisociation) lies farther from the baseline cloud B than B lies from itself. This holds in 12/12 projects (p = .0002) on both BGE-large and e5-large-v2 embeddings.
Judges. Three blind judges (Claude Opus 4.6, GPT-4o, Gemini 2.5) prefer A on originality across 10/12 projects (mean A_share 62%, p = .019).
Falsifiers held. Instruction-only "be original" (C) and length-controlled deep brief (D) move the output noticeably less than A. C's effect on BGE is roughly 13× smaller than A's, D's roughly 4× smaller.
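As a quick plausibility check on the two project-level p-values above, the sketch below computes a one-sided exact sign test (binomial with success probability 0.5) for 12/12 and 10/12 projects. The helper name `sign_test_p` is hypothetical. The results round to the reported .0002 and .019, but that match does not establish which test the study used; methodology-and-results.md is authoritative on that.

```python
from math import comb


def sign_test_p(successes, n):
    """One-sided exact sign test: P(X >= successes) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(successes, n + 1)) / 2 ** n


p_distance = sign_test_p(12, 12)  # 1/4096
p_judges = sign_test_p(10, 12)    # (66 + 12 + 1)/4096 = 79/4096

print(f"{p_distance:.4f} {p_judges:.3f}")  # 0.0002 0.019
```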

Full write-up: BLOG_POST.md. Full methodology and statistics: methodology-and-results.md.

Run the integrity check

pip install -r requirements.txt
python3 script/reproduce_results.py --check

Expected: Panel complete (12 projects), 216/216 cells at n=20, no leaks.

This re-derives every number cited in the methodology document from the shipped artifacts (cached embeddings + per-pair judge winner labels), with no API calls and no GPU.
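The counts quoted in this README are mutually consistent, which the short sketch below spells out: 12 projects × 3 contrasts × 2 axes × 3 judges gives 216 cells (the factorization listed for `judge_results/` in the repo map), and 216 cells at n=20 gives the 4,320 verdicts in the headline. The labels are placeholders, since the real project, contrast, axis, and judge identifiers live in the shipped artifacts. This is not what `reproduce_results.py --check` runs; it only illustrates the arithmetic.

```python
from itertools import product

# Placeholder labels (assumptions); only the counts come from the README.
projects = [f"project_{i:02d}" for i in range(1, 13)]
contrasts = ["contrast_1", "contrast_2", "contrast_3"]
axes = ["axis_1", "axis_2"]
judges = ["judge_1", "judge_2", "judge_3"]
verdicts_per_cell = 20  # "n=20" in the expected check output

# One cell per (project, contrast, axis, judge) combination.
cells = list(product(projects, contrasts, axes, judges))
total_verdicts = len(cells) * verdicts_per_cell

print(len(cells), total_verdicts)  # 216 4320
```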

Repo map

Path	Role
`BLOG_POST.md`	Public-facing write-up: theory, results, examples
`methodology-and-results.md`	Protocol, statistical results, limitations
`OC_data/`	Condition A (OC bisociation) source generations, per project × batch
`conditions_baseline/`	Conditions B, C, D (controls) source generations
`curation/`	Curator prompts, raw responses, top-curated lists per project × condition
`judge_results/`	216 blind forced-choice judge cells (12 projects × 3 contrasts × 2 axes × 3 judges)
`embeddings_cache_bge/`	Cached BGE-large embeddings, one `.npz` per project
`embeddings_cache_e5/`	Cached e5-large-v2 embeddings, cross-embedding sensitivity
`script/`	Reproduction (`reproduce_results.py`), regeneration, judging, visualization
`assets/diagrams/`	Conceptual diagrams used in the blog
`assets/results/`	Statistical figures (forest plot, judge heatmap)

Data availability

Three of the 12 projects (mood_journal, mood_journal_promotion, online_store) cover client-confidential briefs that cannot be released. For these, the source briefs and generated ideas are removed; only the cached embeddings, the per-pair judge winner labels, and the curated index lists are shipped, with all free-text fields replaced by [REDACTED].

The integrity check still passes on the full 12-project panel because it reads only winner labels and embedding vectors, not the redacted text. End-to-end re-encoding from raw text is reproducible for the other 9 projects only.

Nature of the corpus

Every idea in this dataset is a raw LLM output (Claude Sonnet 4, see methodology §7). Citations embedded inside ideas (case-law references, statute numbers, dates, named studies, monetary penalties) were generated by the model and have not been verified. The corpus is published as an artifact of LLM output for methodological study, not as a body of vetted knowledge. Do not use any specific idea as legal, financial, medical, or professional advice.

Pipeline

The Open Collider pipeline that generated condition A is open source: https://github.com/CL-ML/open-collider.

License

Code (script/): MIT
Content (data, docs, figures, ideas, judge artifacts): CC BY 4.0

About

Open Collider is a method developed at Oparine, a research practice on the limits of artificial creativity. The engine, the methodology, and the 12-project benchmark are open source. Consulting inquiries: hello@oparine.ai.

Citation

Cédric Lion (2026). Open Collider: Methodology and Dataset for
Bisociation-Based Idea Generation.
https://github.com/CL-ML/open-collider-research
