Abstract
This site frames the project as an academic discussion artifact: it states the adaptation problem, proposes a design stance, and lists evaluation lenses that can be expanded into experiments.
Evaluation should measure the full interaction trace, including focus decisions, pane reuse, interruption recovery, and the stability of intermediate artifacts.
Research Question
Which measurements reveal whether EvoCUA is using tmux as a durable workspace rather than a passive command transport?
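One concrete measurement in this direction is a pane-reuse rate: an agent treating tmux as a passive transport tends to open a fresh pane per command, while a workspace-oriented agent returns to established panes. The sketch below assumes a trace of `(command, pane_id)` pairs; both names are illustrative, not part of any EvoCUA API.

```python
def pane_reuse_rate(trace):
    """Fraction of commands sent to a pane that was already used earlier.

    `trace` is a list of (command, pane_id) pairs -- an assumed schema
    for illustration, not a harness format.
    """
    seen = set()
    reused = 0
    for _command, pane in trace:
        if pane in seen:
            reused += 1
        seen.add(pane)
    return reused / len(trace) if trace else 0.0

# A transport-style agent: every command lands in a new pane (rate 0.0).
transport = [("ls", "p0"), ("make", "p1"), ("pytest", "p2")]
# A workspace-style agent: later commands revisit earlier panes (rate 0.5).
workspace = [("ls", "p0"), ("make", "p1"), ("pytest", "p1"), ("git diff", "p0")]
```

A threshold on this rate is not itself evidence of understanding, but a near-zero value across long sessions is a strong signal that the terminal is being used as a transport.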
Adaptation Notes
- Tasks are grouped by session length and required recovery depth.
- Ablations remove pane memory, replay buffers, and focus metadata independently.
- Metrics distinguish successful task completion from reproducible terminal behavior.
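The independent-ablation design above can be sketched as a configuration object with one flag per removable component; the field names are illustrative stand-ins for pane memory, replay buffers, and focus metadata, not an EvoCUA interface.

```python
from dataclasses import dataclass, fields, replace

@dataclass(frozen=True)
class AgentConfig:
    # Field names are assumed for illustration; they mirror the three
    # components the ablations remove, not a real EvoCUA config.
    pane_memory: bool = True
    replay_buffer: bool = True
    focus_metadata: bool = True

def single_factor_ablations(base):
    """Yield (name, config) pairs, each disabling exactly one component."""
    for f in fields(base):
        yield f.name, replace(base, **{f.name: False})

ablations = dict(single_factor_ablations(AgentConfig()))
```

Disabling one factor at a time keeps the attribution clean: any performance drop in a given run can be credited to the single component that run removed.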
Evaluation Lens
- Long-horizon task completion
- Recovery latency after injected faults
- Agreement between trace replay and final artifacts
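The second lens can be operationalized as a latency extraction over a timestamped event log. The `(t, kind)` schema below is an assumption for illustration: each injected fault is paired with the next recovery event, and unrecovered faults are simply dropped.

```python
def recovery_latencies(events):
    """Pair each injected fault with the next recovery and return the gaps.

    `events` is a list of (t, kind) tuples with kind in {"fault",
    "recovered"} -- an assumed log format, not a harness interface.
    Faults with no subsequent recovery contribute no latency.
    """
    latencies = []
    fault_t = None
    for t, kind in events:
        if kind == "fault":
            fault_t = t
        elif kind == "recovered" and fault_t is not None:
            latencies.append(t - fault_t)
            fault_t = None
    return latencies

events = [(0.0, "fault"), (2.5, "recovered"), (10.0, "fault"), (11.0, "recovered")]
# recovery_latencies(events) → [2.5, 1.0]
```

Reporting the full latency distribution, rather than a single mean, separates agents that recover slowly but reliably from agents that recover quickly only on easy faults.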
Open Discussion
The central methodological risk is mistaking terminal completion for agent understanding. The project therefore treats tmux as both infrastructure and evidence: pane state, focus movement, command output, and recovery behavior all become part of the argument.
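Treating tmux state as evidence requires recording it in a replayable form. The record below is a minimal sketch of one trace entry, assuming a JSON-lines log; every field name is hypothetical and no EvoCUA trace schema is implied.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class TraceEvent:
    # All field names are assumed for illustration only.
    t: float                # seconds since session start
    pane_id: str            # tmux pane the command ran in
    focused: bool           # whether the agent moved focus to this pane
    command: str            # the command as issued
    output_sha256: str      # digest of captured pane output, not raw text
    recovery: bool = False  # True when the step is part of fault recovery

def to_jsonl(events):
    """Serialize a trace as JSON lines so replay can be diffed against it."""
    return "\n".join(json.dumps(asdict(e)) for e in events)
```

Hashing output instead of storing it keeps the log compact while still letting a replay harness verify, line by line, that regenerated pane output matches what the agent originally saw.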
Future work can connect this static discussion to executable harnesses, trace viewers, and standardized task suites for cross-agent comparison.