Human vision occupies an astonishing 30-40% of the cerebral cortex's total surface area. Visual information flows through a hierarchical processing pipeline:
Retina → LGN (Thalamus) → V1 → V2 → V4 → IT (Inferotemporal) → Object Recognition
| Area | Function | Complexity |
|---|---|---|
| V1 | Edge detection, orientation, motion | Simple features |
| V2 | Contours, illusory edges, border ownership | Early grouping |
| V4 | Angles, curvatures, color | Shape fragments |
| IT | Whole objects, faces, categories | Object identity |
Key Insight: The brain doesn't just detect features—it assigns meaning at every level. Even V2 (very early) computes "border ownership"—which side of an edge belongs to the object.
Sources:
- How does the brain solve visual object recognition? - PMC
- Visual cortical processing—From image to object representation
The brain splits visual processing into two parallel pathways:
| Stream | Path | Function | Damage Causes |
|---|---|---|---|
| Ventral ("What") | V1 → V2 → V4 → IT | Object recognition | Can't recognize objects |
| Dorsal ("How/Where") | V1 → V2 → MT → Parietal | Spatial awareness, action | Can't reach for objects |
Critical Finding: These streams are NOT independent. The dorsal stream needs to know WHAT an object is to know HOW to interact with it.
"The dorsal stream never acts alone, nor does the ventral stream. Both streams need to work together in an integrated way."
For ARC: This maps to our kitchen metaphor:
- Ventral = What objects are in the grid (perception)
- Dorsal = What actions are possible (affordances)
In the 1920s, Gestalt psychologists discovered that the brain automatically groups visual elements using specific rules:
| Principle | Rule | Example |
|---|---|---|
| Proximity | Close things group together | ●● ●● → two pairs |
| Similarity | Similar things group together | ●●○○ → two groups |
| Continuity | We see smooth, continuous lines | Crossing lines stay separate |
| Closure | We complete incomplete shapes | ◗ → we see a circle |
| Figure-Ground | We separate object from background | Vase/faces illusion |
| Prägnanz | We see the simplest interpretation | "Law of simplicity" |
"The whole is something else than the sum of its parts."
For ARC: The brain AUTOMATICALLY does what we're trying to teach LLMs:
- Group pixels into objects (proximity, similarity)
- See patterns (continuity)
- Complete missing parts (closure)
- Find the simplest rule (prägnanz)
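A minimal sketch of what "proximity + similarity" grouping might look like for an ARC-style grid: flood fill over same-colored, orthogonally adjacent cells. The grid representation (list of lists of color ints, 0 as background) is an assumption, not an established API.

```python
from collections import deque

def group_objects(grid, background=0):
    """Group same-colored, orthogonally adjacent cells into objects
    (Gestalt proximity + similarity, via flood fill)."""
    rows, cols = len(grid), len(grid[0])
    seen = set()
    objects = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == background or (r, c) in seen:
                continue
            color, cells, queue = grid[r][c], [], deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and (ny, nx) not in seen
                            and grid[ny][nx] == color):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            objects.append({"color": color, "cells": cells})
    return objects
```

This covers only proximity and similarity; continuity and closure would need shape-level reasoning on top of these groups.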
Karl Friston's Free Energy Principle revolutionized our understanding of perception:
"The brain is not a passive receiver of sensory data—it actively PREDICTS what it will see, then updates based on prediction errors."
┌─────────────────────────────────────────┐
│ HIGHER CORTICAL AREAS │
│ │
│ Generate PREDICTIONS about input │
│ ↓ (top-down) │
├─────────────────────────────────────────┤
│ LOWER CORTICAL AREAS │
│ │
│ Compute PREDICTION ERROR │
│ ↑ (bottom-up) │
├─────────────────────────────────────────┤
│ SENSORY INPUT │
└─────────────────────────────────────────┘
Key Points:
- Top-down signals PREDICT what lower areas should see
- Bottom-up signals carry PREDICTION ERRORS (surprises)
- Perception = minimize prediction error
- Learning = update internal model to predict better
For ARC: This is exactly what iterative solving does:
- Hypothesis = prediction of rule
- Verification = compute prediction error
- Feedback = update the model
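That hypothesize → verify → update loop can be sketched as follows. `propose_rule` stands in for an LLM call (an assumption), and prediction error is simply the count of mismatched output cells.

```python
def solve_iteratively(train_pairs, propose_rule, max_iters=10):
    """Predictive-coding-style loop: predict, measure error, update.
    `propose_rule(feedback)` is a placeholder for an LLM call that
    returns a candidate transformation (a function grid -> grid)."""
    feedback = None
    best_rule, best_error = None, float("inf")
    for _ in range(max_iters):
        rule = propose_rule(feedback)                 # top-down prediction
        error = sum(                                  # bottom-up prediction error
            out_cell != pred_cell
            for inp, out in train_pairs
            for out_row, pred_row in zip(out, rule(inp))
            for out_cell, pred_cell in zip(out_row, pred_row)
        )
        if error < best_error:
            best_rule, best_error = rule, error
        if error == 0:                                # prediction matches: done
            return rule
        feedback = f"{error} cells wrong"             # surprise drives the update
    return best_rule
```

The key correspondence: the feedback string plays the role of the bottom-up error signal, and the next proposal is the updated top-down prediction.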
Sources:
- Predictive coding under the free-energy principle - PMC
- The free-energy principle: a unified brain theory? - Nature
The brain can't process everything—it uses selective attention to focus:
"Selective visual attention describes the tendency of visual processing to be confined largely to stimuli that are relevant to behavior."
Spotlight Model:
- Attention acts like a spotlight, enhancing processing at attended locations
- fMRI shows attention literally increases activity in corresponding V1 regions
- Unattended stimuli are suppressed
Competition Model:
- Objects in the visual field COMPETE for processing
- Bottom-up: salient things grab attention (bright, moving, unique)
- Top-down: goals direct attention (looking for red, looking for faces)
For ARC: This is why perception_v10 failed:
- 400+ objects competing for attention
- LLM didn't know which to attend to
- Noise overwhelmed signal
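One way to implement the competition model is a salience score plus a hard cap on how many objects survive. The weights (color rarity, area coverage) and the object fields are illustrative assumptions.

```python
from collections import Counter

def attend(objects, grid_area, top_k=5):
    """Competition model: objects compete, only the top_k survive.
    Each object is a dict with 'color' and a list of (row, col) 'cells'."""
    color_counts = Counter(obj["color"] for obj in objects)

    def salience(obj):
        rarity = 1.0 / color_counts[obj["color"]]   # unique colors pop out
        coverage = len(obj["cells"]) / grid_area    # large objects dominate
        return rarity + coverage

    # bottom-up competition: the most salient objects grab attention
    return sorted(objects, key=salience, reverse=True)[:top_k]
```

A top-down term (e.g. boosting objects matching the current hypothesis) could be added to the score, mirroring goal-directed attention.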
Sources:
- On the role of selective attention in visual perception - PNAS
- A physiological correlate of the 'spotlight' of visual attention - Nature
James J. Gibson (1979) proposed a radical idea:
"The affordances of the environment are what it offers the animal, what it provides or furnishes, either for good or ill."
Key Properties:
- Affordances are relational - between organism and environment
- They are directly perceived - not computed through reasoning
- They are action possibilities - what you CAN DO with something
| Object | Affordances |
|---|---|
| Chair | Sit-on-able |
| Ladder | Climb-on-able |
| Apple | Grasp-able, eat-able |
| Knife | Cut-with-able |
Gibson's Radical Claim: We don't see objects then infer actions. We DIRECTLY PERCEIVE what actions are possible.
"An affordance is neither an objective property nor a subjective property; or it is both. It is equally a fact of the environment and a fact of behavior."
Donald Norman adapted Gibson's theory for human-computer interaction:
| Element | Affordance |
|---|---|
| Button | Push-able |
| Handle | Pull-able |
| Slider | Slide-able |
| Text field | Type-into-able |
Good design makes affordances visible. You shouldn't need instructions—the object tells you what it does.
This maps directly onto our kitchen metaphor:
| Grid Element | State | Affordance |
|---|---|---|
| Color-8 rectangle | MARKER | Reference point for extraction |
| Full-span line | DIVIDER | Split grid here |
| Output-sized region | EXTRACTABLE | Can be cropped |
| Pattern with anomalies | REPAIRABLE | Fix by majority |
| Empty region | FILLABLE | Destination for pattern |
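A minimal affordance detector for one row of the table above: the DIVIDER case, a full-span line of a single color. The grid representation (list of lists of ints, 0 as background) is an assumption.

```python
def find_dividers(grid):
    """Label full-span, single-color rows/columns as DIVIDER affordances
    ("split grid here"). Returns (axis, index, color) tuples."""
    dividers = []
    for r, row in enumerate(grid):
        if len(set(row)) == 1 and row[0] != 0:          # full-width row, one color
            dividers.append(("row", r, row[0]))
    for c in range(len(grid[0])):
        col = [row[c] for row in grid]
        if len(set(col)) == 1 and col[0] != 0:          # full-height column
            dividers.append(("col", c, col[0]))
    return dividers
```

The other affordances (MARKER, EXTRACTABLE, REPAIRABLE, FILLABLE) would each get a similar cheap detector, so the LLM receives labels instead of raw pixels.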
The Key Insight:
LLMs fail on ARC not because they can't reason, but because they don't know which actions are LEGAL at each step.
Humans look at a grid and IMMEDIATELY perceive:
- "That rectangle can be extracted"
- "That line divides the grid"
- "That pattern needs repair"
LLMs see pixels and try to COMPUTE the transformation, missing the direct perception of action possibilities.
A typical convolutional network's layers, by depth:
| Layer | What it detects |
|---|---|
| Conv1 | Edges, blobs, colors |
| Conv2-3 | Textures, patterns |
| Conv4-5 | Parts (eyes, wheels) |
| FC | Objects, categories |
This LOOKS like the visual hierarchy (V1→V2→V4→IT), but there are critical differences.
| Aspect | Human Vision | Deep Learning |
|---|---|---|
| Training | Few examples | Millions of images |
| Generalization | Abstract rules | Statistical patterns |
| Robustness | Handles novel transforms | Fails on OOD |
| Grouping | Automatic (Gestalt) | Not built-in |
| Affordances | Direct perception | Not represented |
| Top-down | Strong predictions | Weak/none |
| Attention | Dynamic, goal-driven | Fixed (mostly) |
"Deep neural networks are highly valuable scientific tools but should only be regarded as promising—but not yet adequate—computational models of human core object recognition behavior."
Why DNNs fail on ARC:
- No few-shot learning - Need millions of examples
- No abstract rules - Learn statistical correlations
- No compositionality - Can't combine primitives
- No affordances - Don't perceive action possibilities
Sources:
- Are Deep Neural Networks Adequate Behavioral Models of Human Visual Perception?
- Computer vision: Why it's hard to compare AI and human perception
Mapping these biological principles onto ARC solving:
| Human Process | ARC Equivalent | LLM Status |
|---|---|---|
| Gestalt grouping | See objects, not pixels | Partial (raw grid) |
| Figure-ground | Separate objects from background | Partial |
| Affordance perception | Know what actions are possible | MISSING |
| Predictive coding | Hypothesize → verify → update | Yes (iterative) |
| Selective attention | Focus on relevant parts | MISSING |
| Prägnanz | Find simplest rule | Partial (MDL prompt) |
Based on biology, perception should give the LLM:
- Object segmentation (Gestalt grouping)
  - Not 400 objects, but 2-5 meaningful ones
  - Figure-ground separation
- Affordance labels (Gibson)
  - What each object CAN DO
  - What actions are LEGAL
- Attention guidance (Spotlight)
  - What to focus on
  - What to ignore
- Pattern hints (Predictive coding)
  - Likely rule families
  - Hypothesis space reduction
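These four ingredients can be bundled into one small structure handed to the LLM. The field names are illustrative assumptions, not an existing API.

```python
from dataclasses import dataclass, field

@dataclass
class Percept:
    """What perception hands the LLM: few objects, legal actions,
    an attention focus, and candidate rule families (all illustrative)."""
    objects: list          # 2-5 grouped objects, not 400
    affordances: list      # e.g. ["EXTRACTABLE", "DIVIDER"]
    focus: list            # cells/regions the LLM should attend to
    rule_hints: list = field(default_factory=list)  # likely rule families
```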
Where perception_v10 violated these principles:
| Problem | Cause | Solution |
|---|---|---|
| Too many objects (400+) | No attention filtering | Show only key objects |
| Verbose labels | Equal weight to all | Prioritize by relevance |
| Token dilution | Noise overwhelms signal | Compress to essentials |
| SOURCE FOUND obsession | Only helps 3.3% | Drop it |
A slim perception format, feature by feature:
| Feature | Biological Basis | Tokens |
|---|---|---|
| Task type | Prägnanz (simplest interpretation) | ~20 |
| Diff map | Visual attention focus | ~100 |
| Key transitions | Change detection | ~30 |
| One affordance | Action possibility | ~20 |
Total: ~150 tokens with signal, not noise.
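Rendering that budget into a prompt fragment might look like this; the section names mirror the table above, and the whole function is a sketch.

```python
def render_slim_perception(task_type, diff_cells, transitions, affordance):
    """Compress perception into a ~150-token prompt fragment:
    task type, diff map, key transitions, one affordance."""
    lines = [
        f"TASK TYPE: {task_type}",
        "CHANGED CELLS: " + "; ".join(
            f"({r},{c}) {a}->{b}" for r, c, a, b in diff_cells),
        "KEY TRANSITIONS: " + ", ".join(transitions),
        f"AFFORDANCE: {affordance}",
    ]
    return "\n".join(lines)
```

Everything else (full object lists, verbose labels) is deliberately dropped, per the attention principle.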
The hybrid solver draws each component from a different tradition:
| Component | From | Purpose |
|---|---|---|
| Perception | Biology (Gestalt, Affordances) | Structure the problem |
| Diversity | AI (Ensemble) | Explore hypothesis space |
| Reasoning | LLM (Extended thinking) | Derive the rule |
| Verification | Predictive coding | Check predictions |
┌─────────────────────────────────────────────────────────────────┐
│ HYBRID SOLVER │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ │
│ │ PERCEPTION │ ← Gestalt grouping │
│ │ (Slim) │ ← Affordance labels │
│ │ │ ← Attention guidance │
│ │ ~150 tokens │ ← Pattern hints │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ 8 PARALLEL EXPERTS │ │
│ │ │ │
│ │ Each with different seed (diversity) │ │
│ │ Each with extended thinking (reasoning) │ │
│ │ Each iterating 10 times (predictive coding) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ VOTING │ ← Competition (attention) │
│ │ │ ← Best hypothesis wins │
│ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
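The diagram's expert-plus-voting stage reduces to "run N independent solvers, pick the majority answer." Here `solve_one(seed, grid)` is a placeholder for one full expert pipeline (perception + iteration); the seeds supply the diversity.

```python
from collections import Counter

def ensemble_solve(solve_one, test_input, n_experts=8):
    """Run n_experts independent solvers (different seeds for diversity)
    and let their answers compete: the most common output wins."""
    answers = [solve_one(seed, test_input) for seed in range(n_experts)]
    # grids are lists of lists; make them hashable for voting
    keyed = Counter(tuple(map(tuple, ans)) for ans in answers)
    winner, _ = keyed.most_common(1)[0]
    return [list(row) for row in winner]
```

Majority voting is the simplest competition rule; weighting each expert's vote by its training-set error would bring this closer to the attention analogy.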
| Biological Principle | Implementation | Expected Benefit |
|---|---|---|
| Gestalt grouping | Slim perception | Better object focus |
| Affordances | Task type + hint | Know legal actions |
| Attention | Key objects only | Less noise |
| Predictive coding | Iteration + feedback | Rule refinement |
| Competition | 8 experts + voting | Best hypothesis wins |
| Prägnanz | Simplest rule prompt | Avoid overfit |
Key Takeaways:
- Vision is hierarchical - but meaning is assigned at every level
- What and How are parallel - you need both for action
- Grouping is automatic - Gestalt principles are built-in
- Affordances are direct - we see action possibilities, not compute them
- Attention is selective - we focus on what matters
- Perception is prediction - we hypothesize and verify
What LLMs lack:
- Built-in grouping - they see tokens, not objects
- Affordance perception - they don't know legal actions
- Selective attention - they process everything equally
- Few-shot generalization - they need many examples
| Without Perception | With Perception |
|---|---|
| Raw pixels | Grouped objects |
| All details equal | Key features highlighted |
| Unknown actions | Legal actions suggested |
| Compute everything | Focus on relevant |
Success = Perception (structure) + Diversity (exploration) + Reasoning (depth)
- Perception alone: 30% (can't explore enough)
- Diversity alone: 55% (no structure)
- Perception + Diversity: ??? (untested hypothesis)
Sources:
- Understanding how visual information is processed in the brain - NIH
- Visual Cognitive Neuroscience - MIT
- How does the brain solve visual object recognition? - PMC
- Gibson's Ecological Approach - Brown CS PDF
- Affordance - Wikipedia
- The History and Philosophy of Ecological Psychology - PMC
- Predictive coding under the free-energy principle - PMC
- The free-energy principle: a unified brain theory? - Nature