Human vision occupies an astonishing 30-40% of the cerebral cortex's total surface area. Visual information flows through a hierarchical processing pipeline:
Retina → LGN (Thalamus) → V1 → V2 → V4 → IT (Inferotemporal) → Object Recognition
| Area | Function | Complexity |
|---|---|---|
| V1 | Edge detection, orientation, motion | Simple features |
| V2 | Contours, illusory edges, border ownership | Early grouping |
| V4 | Angles, curvatures, color | Shape fragments |
| IT | Whole objects, faces, categories | Object identity |
Key Insight: The brain doesn't just detect features—it assigns meaning at every level. Even V2 (very early) computes "border ownership"—which side of an edge belongs to the object.
Sources:
- How does the brain solve visual object recognition? - PMC
- Visual cortical processing—From image to object representation
The brain splits visual processing into two parallel pathways:
| Stream | Path | Function | Damage Causes |
|---|---|---|---|
| Ventral ("What") | V1 → V2 → V4 → IT | Object recognition | Can't recognize objects |
| Dorsal ("How/Where") | V1 → V2 → MT → Parietal | Spatial awareness, action | Can't reach for objects |
Critical Finding: These streams are NOT independent. The dorsal stream needs to know WHAT an object is to know HOW to interact with it.
"The dorsal stream never acts alone, nor does the ventral stream. Both streams need to work together in an integrated way."
For ARC: This maps to our kitchen metaphor:
- Ventral = What objects are in the grid (perception)
- Dorsal = What actions are possible (affordances)
In the 1920s, Gestalt psychologists discovered that the brain automatically groups visual elements using specific rules:
| Principle | Rule | Example |
|---|---|---|
| Proximity | Close things group together | ●● ●● → two pairs |
| Similarity | Similar things group together | ●●○○ → two groups |
| Continuity | We see smooth, continuous lines | Crossing lines stay separate |
| Closure | We complete incomplete shapes | ◗ → we see a circle |
| Figure-Ground | We separate object from background | Vase/faces illusion |
| Prägnanz | We see the simplest interpretation | "Law of simplicity" |
"The whole is something else than the sum of its parts."
For ARC: The brain AUTOMATICALLY does what we're trying to teach LLMs:
- Group pixels into objects (proximity, similarity)
- See patterns (continuity)
- Complete missing parts (closure)
- Find the simplest rule (prägnanz)
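A minimal sketch of what "proximity + similarity" grouping might look like for an ARC-style grid: flood fill over same-colored, orthogonally adjacent cells. The grid representation (list of lists of color ints, 0 as background) is an assumption, not an established API.

```python
from collections import deque

def group_objects(grid, background=0):
    """Group same-colored, orthogonally adjacent cells into objects
    (Gestalt proximity + similarity, via flood fill)."""
    rows, cols = len(grid), len(grid[0])
    seen = set()
    objects = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == background or (r, c) in seen:
                continue
            color, cells, queue = grid[r][c], [], deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and (ny, nx) not in seen
                            and grid[ny][nx] == color):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            objects.append({"color": color, "cells": cells})
    return objects
```

This covers only proximity and similarity; continuity and closure would need shape-level reasoning on top of these groups.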
Karl Friston's Free Energy Principle revolutionized our understanding of perception:
"The brain is not a passive receiver of sensory data—it actively PREDICTS what it will see, then updates based on prediction errors."
┌─────────────────────────────────────────┐
│ HIGHER CORTICAL AREAS │
│ │
│ Generate PREDICTIONS about input │
│ ↓ (top-down) │
├─────────────────────────────────────────┤
│ LOWER CORTICAL AREAS │
│ │
│ Compute PREDICTION ERROR │
│ ↑ (bottom-up) │
├─────────────────────────────────────────┤
│ SENSORY INPUT │
└─────────────────────────────────────────┘
Key Points:
- Top-down signals PREDICT what lower areas should see
- Bottom-up signals carry PREDICTION ERRORS (surprises)
- Perception = minimize prediction error
- Learning = update internal model to predict better
For ARC: This is exactly what iterative solving does:
- Hypothesis = prediction of rule
- Verification = compute prediction error
- Feedback = update the model
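That hypothesize → verify → update loop can be sketched as follows. `propose_rule` stands in for an LLM call (an assumption), and prediction error is simply the count of mismatched output cells.

```python
def solve_iteratively(train_pairs, propose_rule, max_iters=10):
    """Predictive-coding-style loop: predict, measure error, update.
    `propose_rule(feedback)` is a placeholder for an LLM call that
    returns a candidate transformation (a function grid -> grid)."""
    feedback = None
    best_rule, best_error = None, float("inf")
    for _ in range(max_iters):
        rule = propose_rule(feedback)                 # top-down prediction
        error = sum(                                  # bottom-up prediction error
            out_cell != pred_cell
            for inp, out in train_pairs
            for out_row, pred_row in zip(out, rule(inp))
            for out_cell, pred_cell in zip(out_row, pred_row)
        )
        if error < best_error:
            best_rule, best_error = rule, error
        if error == 0:                                # prediction matches: done
            return rule
        feedback = f"{error} cells wrong"             # surprise drives the update
    return best_rule
```

The key correspondence: the feedback string plays the role of the bottom-up error signal, and the next proposal is the updated top-down prediction.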
Sources:
- Predictive coding under the free-energy principle - PMC
- The free-energy principle: a unified brain theory? - Nature
The brain can't process everything—it uses selective attention to focus:
"Selective visual attention describes the tendency of visual processing to be confined largely to stimuli that are relevant to behavior."
Spotlight Model:
- Attention acts like a spotlight, enhancing processing at attended locations
- fMRI shows attention literally increases activity in corresponding V1 regions
- Unattended stimuli are suppressed
Competition Model:
- Objects in the visual field COMPETE for processing
- Bottom-up: salient things grab attention (bright, moving, unique)
- Top-down: goals direct attention (looking for red, looking for faces)
For ARC: This is why perception_v10 failed:
- 400+ objects competing for attention
- LLM didn't know which to attend to
- Noise overwhelmed signal
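One way to implement the competition model is a salience score plus a hard cap on how many objects survive. The weights (color rarity, area coverage) and the object fields are illustrative assumptions.

```python
from collections import Counter

def attend(objects, grid_area, top_k=5):
    """Competition model: objects compete, only the top_k survive.
    Each object is a dict with 'color' and a list of (row, col) 'cells'."""
    color_counts = Counter(obj["color"] for obj in objects)

    def salience(obj):
        rarity = 1.0 / color_counts[obj["color"]]   # unique colors pop out
        coverage = len(obj["cells"]) / grid_area    # large objects dominate
        return rarity + coverage

    # bottom-up competition: the most salient objects grab attention
    return sorted(objects, key=salience, reverse=True)[:top_k]
```

A top-down term (e.g. boosting objects matching the current hypothesis) could be added to the score, mirroring goal-directed attention.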
Sources:
- On the role of selective attention in visual perception - PNAS
- A physiological correlate of the 'spotlight' of visual attention - Nature
James J. Gibson (1979) proposed a radical idea:
"The affordances of the environment are what it offers the animal, what it provides or furnishes, either for good or ill."
Key Properties:
- Affordances are relational - between organism and environment
- They are directly perceived - not computed through reasoning
- They are action possibilities - what you CAN DO with something
| Object | Affordances |
|---|---|
| Chair | Sit-on-able |
| Ladder | Climb-on-able |
| Apple | Grasp-able, eat-able |
| Knife | Cut-with-able |
Gibson's Radical Claim: We don't see objects then infer actions. We DIRECTLY PERCEIVE what actions are possible.
"An affordance is neither an objective property nor a subjective property; or it is both. It is equally a fact of the environment and a fact of behavior."
Donald Norman adapted Gibson's theory for human-computer interaction:
| Element | Affordance |
|---|---|
| Button | Push-able |
| Handle | Pull-able |
| Slider | Slide-able |
| Text field | Type-into-able |
Good design makes affordances visible. You shouldn't need instructions—the object tells you what it does.
This maps directly onto our kitchen metaphor:
| Grid Element | State | Affordance |
|---|---|---|
| Color-8 rectangle | MARKER | Reference point for extraction |
| Full-span line | DIVIDER | Split grid here |
| Output-sized region | EXTRACTABLE | Can be cropped |
| Pattern with anomalies | REPAIRABLE | Fix by majority |
| Empty region | FILLABLE | Destination for pattern |
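A minimal affordance detector for one row of the table above: the DIVIDER case, a full-span line of a single color. The grid representation (list of lists of ints, 0 as background) is an assumption.

```python
def find_dividers(grid):
    """Label full-span, single-color rows/columns as DIVIDER affordances
    ("split grid here"). Returns (axis, index, color) tuples."""
    dividers = []
    for r, row in enumerate(grid):
        if len(set(row)) == 1 and row[0] != 0:          # full-width row, one color
            dividers.append(("row", r, row[0]))
    for c in range(len(grid[0])):
        col = [row[c] for row in grid]
        if len(set(col)) == 1 and col[0] != 0:          # full-height column
            dividers.append(("col", c, col[0]))
    return dividers
```

The other affordances (MARKER, EXTRACTABLE, REPAIRABLE, FILLABLE) would each get a similar cheap detector, so the LLM receives labels instead of raw pixels.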
The Key Insight:
LLMs fail on ARC not because they can't reason, but because they don't know which actions are LEGAL at each step.
Humans look at a grid and IMMEDIATELY perceive:
- "That rectangle can be extracted"
- "That line divides the grid"
- "That pattern needs repair"
LLMs see pixels and try to COMPUTE the transformation, missing the direct perception of action possibilities.
A typical convolutional network's layers, by depth:
| Layer | What it detects |
|---|---|
| Conv1 | Edges, blobs, colors |
| Conv2-3 | Textures, patterns |
| Conv4-5 | Parts (eyes, wheels) |
| FC | Objects, categories |
This LOOKS like the visual hierarchy (V1→V2→V4→IT), but there are critical differences.
| Aspect | Human Vision | Deep Learning |
|---|---|---|
| Training | Few examples | Millions of images |
| Generalization | Abstract rules | Statistical patterns |
| Robustness | Handles novel transforms | Fails on OOD |
| Grouping | Automatic (Gestalt) | Not built-in |
| Affordances | Direct perception | Not represented |
| Top-down | Strong predictions | Weak/none |
| Attention | Dynamic, goal-driven | Fixed (mostly) |
"Deep neural networks are highly valuable scientific tools but should only be regarded as promising—but not yet adequate—computational models of human core object recognition behavior."
Why DNNs fail on ARC:
- No few-shot learning - Need millions of examples
- No abstract rules - Learn statistical correlations
- No compositionality - Can't combine primitives
- No affordances - Don't perceive action possibilities
Sources:
- Are Deep Neural Networks Adequate Behavioral Models of Human Visual Perception?
- Computer vision: Why it's hard to compare AI and human perception
Mapping these biological principles onto ARC solving:
| Human Process | ARC Equivalent | LLM Status |
|---|---|---|
| Gestalt grouping | See objects, not pixels | Partial (raw grid) |
| Figure-ground | Separate objects from background | Partial |
| Affordance perception | Know what actions are possible | MISSING |
| Predictive coding | Hypothesize → verify → update | Yes (iterative) |
| Selective attention | Focus on relevant parts | MISSING |
| Prägnanz | Find simplest rule | Partial (MDL prompt) |
Based on biology, perception should give the LLM:
- Object segmentation (Gestalt grouping)
  - Not 400 objects, but 2-5 meaningful ones
  - Figure-ground separation
- Affordance labels (Gibson)
  - What each object CAN DO
  - What actions are LEGAL
- Attention guidance (Spotlight)
  - What to focus on
  - What to ignore
- Pattern hints (Predictive coding)
  - Likely rule families
  - Hypothesis space reduction
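These four ingredients can be bundled into one small structure handed to the LLM. The field names are illustrative assumptions, not an existing API.

```python
from dataclasses import dataclass, field

@dataclass
class Percept:
    """What perception hands the LLM: few objects, legal actions,
    an attention focus, and candidate rule families (all illustrative)."""
    objects: list          # 2-5 grouped objects, not 400
    affordances: list      # e.g. ["EXTRACTABLE", "DIVIDER"]
    focus: list            # cells/regions the LLM should attend to
    rule_hints: list = field(default_factory=list)  # likely rule families
```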
Where perception_v10 violated these principles:
| Problem | Cause | Solution |
|---|---|---|
| Too many objects (400+) | No attention filtering | Show only key objects |
| Verbose labels | Equal weight to all | Prioritize by relevance |
| Token dilution | Noise overwhelms signal | Compress to essentials |
| SOURCE FOUND obsession | Only helps 3.3% | Drop it |
A slim perception format, feature by feature:
| Feature | Biological Basis | Tokens |
|---|---|---|
| Task type | Prägnanz (simplest interpretation) | ~20 |
| Diff map | Visual attention focus | ~100 |
| Key transitions | Change detection | ~30 |
| One affordance | Action possibility | ~20 |
Total: ~150 tokens with signal, not noise.
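Rendering that budget into a prompt fragment might look like this; the section names mirror the table above, and the whole function is a sketch.

```python
def render_slim_perception(task_type, diff_cells, transitions, affordance):
    """Compress perception into a ~150-token prompt fragment:
    task type, diff map, key transitions, one affordance."""
    lines = [
        f"TASK TYPE: {task_type}",
        "CHANGED CELLS: " + "; ".join(
            f"({r},{c}) {a}->{b}" for r, c, a, b in diff_cells),
        "KEY TRANSITIONS: " + ", ".join(transitions),
        f"AFFORDANCE: {affordance}",
    ]
    return "\n".join(lines)
```

Everything else (full object lists, verbose labels) is deliberately dropped, per the attention principle.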
The hybrid solver draws each component from a different tradition:
| Component | From | Purpose |
|---|---|---|
| Perception | Biology (Gestalt, Affordances) | Structure the problem |
| Diversity | AI (Ensemble) | Explore hypothesis space |
| Reasoning | LLM (Extended thinking) | Derive the rule |
| Verification | Predictive coding | Check predictions |
┌─────────────────────────────────────────────────────────────────┐
│ HYBRID SOLVER │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ │
│ │ PERCEPTION │ ← Gestalt grouping │
│ │ (Slim) │ ← Affordance labels │
│ │ │ ← Attention guidance │
│ │ ~150 tokens │ ← Pattern hints │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ 8 PARALLEL EXPERTS │ │
│ │ │ │
│ │ Each with different seed (diversity) │ │
│ │ Each with extended thinking (reasoning) │ │
│ │ Each iterating 10 times (predictive coding) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ VOTING │ ← Competition (attention) │
│ │ │ ← Best hypothesis wins │
│ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
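The diagram's expert-plus-voting stage reduces to "run N independent solvers, pick the majority answer." Here `solve_one(seed, grid)` is a placeholder for one full expert pipeline (perception + iteration); the seeds supply the diversity.

```python
from collections import Counter

def ensemble_solve(solve_one, test_input, n_experts=8):
    """Run n_experts independent solvers (different seeds for diversity)
    and let their answers compete: the most common output wins."""
    answers = [solve_one(seed, test_input) for seed in range(n_experts)]
    # grids are lists of lists; make them hashable for voting
    keyed = Counter(tuple(map(tuple, ans)) for ans in answers)
    winner, _ = keyed.most_common(1)[0]
    return [list(row) for row in winner]
```

Majority voting is the simplest competition rule; weighting each expert's vote by its training-set error would bring this closer to the attention analogy.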
| Biological Principle | Implementation | Expected Benefit |
|---|---|---|
| Gestalt grouping | Slim perception | Better object focus |
| Affordances | Task type + hint | Know legal actions |
| Attention | Key objects only | Less noise |
| Predictive coding | Iteration + feedback | Rule refinement |
| Competition | 8 experts + voting | Best hypothesis wins |
| Prägnanz | Simplest rule prompt | Avoid overfit |
Key Takeaways:
- Vision is hierarchical - but meaning is assigned at every level
- What and How are parallel - you need both for action
- Grouping is automatic - Gestalt principles are built-in
- Affordances are direct - we see action possibilities, not compute them
- Attention is selective - we focus on what matters
- Perception is prediction - we hypothesize and verify
What LLMs lack:
- Built-in grouping - they see tokens, not objects
- Affordance perception - they don't know legal actions
- Selective attention - they process everything equally
- Few-shot generalization - they need many examples
| Without Perception | With Perception |
|---|---|
| Raw pixels | Grouped objects |
| All details equal | Key features highlighted |
| Unknown actions | Legal actions suggested |
| Compute everything | Focus on relevant |
Success = Perception (structure) + Diversity (exploration) + Reasoning (depth)
- Perception alone: 30% (can't explore enough)
- Diversity alone: 55% (no structure)
- Perception + Diversity: ??? (untested hypothesis)
Sources:
- Understanding how visual information is processed in the brain - NIH
- Visual Cognitive Neuroscience - MIT
- How does the brain solve visual object recognition? - PMC
- Gibson's Ecological Approach - Brown CS PDF
- Affordance - Wikipedia
- The History and Philosophy of Ecological Psychology - PMC
- Predictive coding under the free-energy principle - PMC
- The free-energy principle: a unified brain theory? - Nature