CL-ML/open-collider-research

Open Collider · Research

Built by Cédric Lion · @oparine_ai
Methodology and dataset behind the Open Collider study · Read the launch story

Detailed methodology and shipped dataset for the Open Collider experiment. A 12-project panel testing whether structurally distant domain collisions move LLM-generated ideas away from default-prompt baselines, and whether blind LLM judges prefer them on originality.

What is Open Collider?

Open Collider (OC) is an open-source method that operationalizes Koestler's bisociation theory of human creativity (1964) for LLM idea generation. Rather than telling a language model to "be more creative", OC injects structurally distant knowledge domains into the prompt and forces explicit collisions between the brief and those domains. The hypothesis: outputs land further from the default-prompt distribution (the Artificial Hivemind, per Jiang et al. 2025), in zones the model wouldn't reach on its own.

This repo holds the empirical artifacts that test that hypothesis. The engine itself, with skill-mode and API-mode pipelines, lives at github.com/CL-ML/open-collider.

Headline result

Across 12 real-world projects, 4 conditions, ~23k generated ideas, and 4,320 blind LLM-judge verdicts:

  • Distance. Condition A (OC bisociation) sits farther from the baseline cloud B than B sits from itself, measured by nn_in_B distance. This holds in 12/12 projects (p = .0002) on both BGE-large and e5-large-v2 embeddings.
  • Judges. Three blind judges (Claude Opus 4.6, GPT-4o, Gemini 2.5) prefer A on originality across 10/12 projects (mean A_share 62%, p = .019).
  • Falsifiers held. Instruction-only "be original" (C) and length-controlled deep brief (D) move the output noticeably less than A. C's effect on BGE is roughly 13× smaller than A's, D's roughly 4× smaller.
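As a quick sanity check on the two p-values above: both are numerically consistent with a one-sided exact sign test over the 12 projects (12 of 12 for distance, 10 of 12 for judges). That is an observation about the arithmetic only. Whether the study actually uses a sign test is an assumption here, and methodology-and-results.md remains the authority on the tests applied. The function name `sign_test_one_sided` is illustrative and not taken from the repo.

```python
from math import comb


def sign_test_one_sided(successes: int, n: int) -> float:
    """P(X >= successes) for X ~ Binomial(n, 0.5): exact one-sided sign test."""
    return sum(comb(n, k) for k in range(successes, n + 1)) / 2 ** n


# 12 of 12 projects in the predicted direction (distance result).
p_distance = sign_test_one_sided(12, 12)  # 1/4096
# 10 of 12 projects in the predicted direction (judge result).
p_judges = sign_test_one_sided(10, 12)    # 79/4096

print(f"{p_distance:.4f} {p_judges:.3f}")  # 0.0002 0.019
```

With only 12 projects, 1/4096 is the smallest p-value a one-sided sign test can return, so the distance result sits at the floor of what this panel size can show under that test.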

Full write-up: BLOG_POST.md. Full methodology and statistics: methodology-and-results.md.

Run the integrity check

pip install -r requirements.txt
python3 script/reproduce_results.py --check

Expected: Panel complete (12 projects), 216/216 cells at n=20, no leaks.

This re-derives every number cited in the methodology document from the shipped artifacts (cached embeddings + per-pair judge winner labels), with no API calls and no GPU.
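To give a feel for what a distance statistic computed from cached embeddings might look like, here is a minimal sketch of a nearest-neighbour-in-B comparison. It rests on several assumptions that this README does not confirm: that nn_in_B means the distance from each idea to its nearest neighbour in cloud B, that the metric is cosine distance, that B is compared to itself leave-one-out, and that per-idea distances are averaged. The array layout inside the shipped .npz files is not described here, so the example uses synthetic vectors instead of loading them. The real computation lives in script/reproduce_results.py.

```python
import numpy as np


def nn_dist_to_cloud(queries, cloud, exclude_self=False):
    """Cosine distance from each query row to its nearest neighbour in `cloud`.

    Set exclude_self=True when `queries` and `cloud` are the same array, so a
    point is never matched with itself (leave-one-out). Assumed metric.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = cloud / np.linalg.norm(cloud, axis=1, keepdims=True)
    dist = 1.0 - q @ c.T  # pairwise cosine distances, shape (n_queries, n_cloud)
    if exclude_self:
        np.fill_diagonal(dist, np.inf)
    return dist.min(axis=1)


# Synthetic stand-ins for two embedding clouds (NOT the shipped data).
rng = np.random.default_rng(0)
center_b = np.zeros(16)
center_b[0] = 5.0
center_a = np.zeros(16)
center_a[1] = 5.0
B = center_b + 0.3 * rng.normal(size=(50, 16))  # tight baseline cloud
A = center_a + 0.3 * rng.normal(size=(50, 16))  # cloud in a different direction

a_to_b = nn_dist_to_cloud(A, B).mean()
b_to_b = nn_dist_to_cloud(B, B, exclude_self=True).mean()
print(a_to_b > b_to_b)
```

The leave-one-out step matters: without it every point in B would match itself at distance zero, and the B-to-B reference would be trivially small.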

Repo map

| Path | Role |
| --- | --- |
| BLOG_POST.md | Public-facing write-up: theory, results, examples |
| methodology-and-results.md | Protocol, statistical results, limitations |
| OC_data/ | Condition A (OC bisociation) source generations, per project × batch |
| conditions_baseline/ | Conditions B, C, D (controls) source generations |
| curation/ | Curator prompts, raw responses, top-curated lists per project × condition |
| judge_results/ | 216 blind forced-choice judge cells (12 projects × 3 contrasts × 2 axes × 3 judges) |
| embeddings_cache_bge/ | Cached BGE-large embeddings, one .npz per project |
| embeddings_cache_e5/ | Cached e5-large-v2 embeddings, cross-embedding sensitivity |
| script/ | Reproduction (reproduce_results.py), regeneration, judging, visualization |
| assets/diagrams/ | Conceptual diagrams used in the blog |
| assets/results/ | Statistical figures (forest plot, judge heatmap) |
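The judge_results/ cell count can be checked by enumerating the design. The sketch below assumes that "n=20" in the integrity-check output means 20 forced-choice pairs per cell; under that reading the total matches the 4,320 verdicts quoted in the headline result. The contrast labels and the second axis name are placeholders, since this README names only originality.

```python
from itertools import product

N_PROJECTS = 12
CONTRASTS = ("contrast_1", "contrast_2", "contrast_3")  # placeholder labels
AXES = ("originality", "axis_2")                        # second axis name is a placeholder
JUDGES = ("Claude Opus 4.6", "GPT-4o", "Gemini 2.5")
PAIRS_PER_CELL = 20                                     # assumed meaning of "n=20"

# One cell per (project, contrast, axis, judge) combination.
cells = list(product(range(N_PROJECTS), CONTRASTS, AXES, JUDGES))
total_verdicts = len(cells) * PAIRS_PER_CELL

print(len(cells), total_verdicts)  # 216 4320
```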

Data availability

Three of the 12 projects (mood_journal, mood_journal_promotion, online_store) cover client-confidential briefs that cannot be released. For these, the source briefs and generated ideas are removed; only the cached embeddings, the per-pair judge winner labels, and the curated index lists are shipped, with all free-text fields replaced by [REDACTED].

The integrity check still passes on the full 12-project panel because it reads only winner labels and embedding vectors, not the redacted text. End-to-end re-encoding from raw text is reproducible for the other 9 projects only.
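Because the check needs only labels and vectors, a judge-side statistic can be recomputed for the redacted projects without any idea text. A hypothetical sketch, assuming each pair carries a winner label of "A" or "B" and that A_share is the fraction of pairs won by A. The actual label format in judge_results/ may differ, and the 13-to-7 split below is made-up illustration data.

```python
from collections import Counter


def a_share(winners):
    """Fraction of forced-choice pairs won by condition A (assumed definition)."""
    counts = Counter(winners)
    decided = counts["A"] + counts["B"]
    if decided == 0:
        raise ValueError("no A/B winner labels found")
    return counts["A"] / decided


# Illustrative cell of 20 pairs; these labels are invented, not shipped data.
cell = ["A"] * 13 + ["B"] * 7
print(a_share(cell))  # 0.65
```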

Nature of the corpus

Every idea in this dataset is a raw LLM output (Claude Sonnet 4, see methodology §7). Citations embedded inside ideas (case-law references, statute numbers, dates, named studies, monetary penalties) were generated by the model and have not been verified. The corpus is published as an artifact of LLM output for methodological study, not as a body of vetted knowledge. Do not use any specific idea as legal, financial, medical, or professional advice.

Pipeline

The Open Collider pipeline that generated condition A is open source: https://github.com/CL-ML/open-collider.

License

  • Code (script/): MIT
  • Content (data, docs, figures, ideas, judge artifacts): CC BY 4.0

About

Open Collider is a method developed at Oparine, a research practice on the limits of artificial creativity. The engine, the methodology, and the 12-project benchmark are open source. Consulting inquiries: hello@oparine.ai.

Citation

Cédric Lion (2026). Open Collider: Methodology and Dataset for
Bisociation-Based Idea Generation.
https://github.com/CL-ML/open-collider-research
