Per-Head Tracking Rationale

Visualizing what each attention head learns to track

Example input: "The model processes each token"

Head categories: Positional, Syntactic, Semantic, Copy/Induction