Abstract
This site frames the project as an academic discussion artifact: it states the adaptation problem, proposes a design stance, and lists evaluation lenses that can be expanded into experiments.
Evaluation should measure the full interaction trace, including focus decisions, pane reuse, interruption recovery, and the stability of intermediate artifacts.
Research Question
Which measurements reveal whether EvoCUA is using tmux as a durable workspace rather than a passive command transport?
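One concrete measurement in this direction is a pane-reuse rate: an agent treating tmux as a passive transport tends to open a fresh pane per command, while a workspace-oriented agent returns to established panes. The sketch below assumes a trace of `(command, pane_id)` pairs; both names are illustrative, not part of any EvoCUA API.

```python
def pane_reuse_rate(trace):
    """Fraction of commands sent to a pane that was already used earlier.

    `trace` is a list of (command, pane_id) pairs -- an assumed schema
    for illustration, not a harness format.
    """
    seen = set()
    reused = 0
    for _command, pane in trace:
        if pane in seen:
            reused += 1
        seen.add(pane)
    return reused / len(trace) if trace else 0.0

# A transport-style agent: every command lands in a new pane (rate 0.0).
transport = [("ls", "p0"), ("make", "p1"), ("pytest", "p2")]
# A workspace-style agent: later commands revisit earlier panes (rate 0.5).
workspace = [("ls", "p0"), ("make", "p1"), ("pytest", "p1"), ("git diff", "p0")]
```

A threshold on this rate is not itself evidence of understanding, but a near-zero value across long sessions is a strong signal that the terminal is being used as a transport.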
Adaptation Notes
- Tasks are grouped by session length and required recovery depth.
- Ablations remove pane memory, replay buffers, and focus metadata independently.
- Metrics distinguish successful task completion from reproducible terminal behavior.
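The independent-ablation design above can be sketched as a configuration object with one flag per removable component; the field names are illustrative stand-ins for pane memory, replay buffers, and focus metadata, not an EvoCUA interface.

```python
from dataclasses import dataclass, fields, replace

@dataclass(frozen=True)
class AgentConfig:
    # Field names are assumed for illustration; they mirror the three
    # components the ablations remove, not a real EvoCUA config.
    pane_memory: bool = True
    replay_buffer: bool = True
    focus_metadata: bool = True

def single_factor_ablations(base):
    """Yield (name, config) pairs, each disabling exactly one component."""
    for f in fields(base):
        yield f.name, replace(base, **{f.name: False})

ablations = dict(single_factor_ablations(AgentConfig()))
```

Disabling one factor at a time keeps the attribution clean: any performance drop in a given run can be credited to the single component that run removed.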
Evaluation Lens
- Long-horizon task completion
- Recovery latency after injected faults
- Agreement between trace replay and final artifacts
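The second lens can be operationalized as a latency extraction over a timestamped event log. The `(t, kind)` schema below is an assumption for illustration: each injected fault is paired with the next recovery event, and unrecovered faults are simply dropped.

```python
def recovery_latencies(events):
    """Pair each injected fault with the next recovery and return the gaps.

    `events` is a list of (t, kind) tuples with kind in {"fault",
    "recovered"} -- an assumed log format, not a harness interface.
    Faults with no subsequent recovery contribute no latency.
    """
    latencies = []
    fault_t = None
    for t, kind in events:
        if kind == "fault":
            fault_t = t
        elif kind == "recovered" and fault_t is not None:
            latencies.append(t - fault_t)
            fault_t = None
    return latencies

events = [(0.0, "fault"), (2.5, "recovered"), (10.0, "fault"), (11.0, "recovered")]
# recovery_latencies(events) → [2.5, 1.0]
```

Reporting the full latency distribution, rather than a single mean, separates agents that recover slowly but reliably from agents that recover quickly only on easy faults.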
Open Discussion
The central methodological risk is mistaking terminal completion for agent understanding. The project therefore treats tmux as both infrastructure and evidence: pane state, focus movement, command output, and recovery behavior all become part of the argument.
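Treating tmux state as evidence requires recording it in a replayable form. The record below is a minimal sketch of one trace entry, assuming a JSON-lines log; every field name is hypothetical and no EvoCUA trace schema is implied.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class TraceEvent:
    # All field names are assumed for illustration only.
    t: float                # seconds since session start
    pane_id: str            # tmux pane the command ran in
    focused: bool           # whether the agent moved focus to this pane
    command: str            # the command as issued
    output_sha256: str      # digest of captured pane output, not raw text
    recovery: bool = False  # True when the step is part of fault recovery

def to_jsonl(events):
    """Serialize a trace as JSON lines so replay can be diffed against it."""
    return "\n".join(json.dumps(asdict(e)) for e in events)
```

Hashing output instead of storing it keeps the log compact while still letting a replay harness verify, line by line, that regenerated pane output matches what the agent originally saw.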
Future work can connect this static discussion to executable harnesses, trace viewers, and standardized task suites for cross-agent comparison.