Per-Head Tracking Rationale
Visualizing what each attention head learns to track
The model processes each token
Tracking categories: Positional, Syntactic, Semantic, Copy/Induction