Abstract

The evaluation compares three settings (pane-aware, pane-blind, and replay-assisted) to isolate what terminal structure contributes to agent performance.

This site frames the project as an academic discussion artifact: it states the adaptation problem, proposes a design stance, and lists evaluation lenses that can be expanded into experiments.

Research Question

Which controlled comparisons show whether tmux support improves UITARS15_v2 beyond simple terminal automation?

Adaptation Notes

  • Tasks include pane navigation, concurrent logs, interrupted builds, and delayed output.
  • Ablations remove segmentation, focus history, and replay memory.
  • Analysis separates action selection, focus selection, and verification.
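
The ablation bullet above can be read as a small configuration grid. The following is a minimal sketch, not the project's harness: the class `AblationConfig`, the helper names, and the component identifiers are all hypothetical labels derived from the three components the notes mention.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical identifiers for the three components named in the notes.
COMPONENTS = ("segmentation", "focus_history", "replay_memory")


@dataclass(frozen=True)
class AblationConfig:
    """One experimental condition; True means the component is enabled."""
    segmentation: bool
    focus_history: bool
    replay_memory: bool

    @property
    def removed(self):
        """Names of the components that this condition disables."""
        return tuple(name for name in COMPONENTS if not getattr(self, name))


def single_ablations():
    """Return the full configuration followed by one leave-one-out condition each."""
    configs = [AblationConfig(True, True, True)]
    for name in COMPONENTS:
        flags = {other: other != name for other in COMPONENTS}
        configs.append(AblationConfig(**flags))
    return configs


def full_grid():
    """Return every on/off combination (2**3 = 8 conditions)."""
    return [AblationConfig(*flags)
            for flags in product((True, False), repeat=len(COMPONENTS))]
```

Leave-one-out conditions keep the run count small; the full grid is only worth its cost if interactions between components (for example, replay memory compensating for missing focus history) are part of the question.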

Evaluation Lens

  • Ablation sensitivity
  • Concurrent-output resilience
  • Verification accuracy after delayed feedback
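
One way these three lenses could be turned into numbers is sketched below. The definitions are assumptions made for illustration, not metrics the project has fixed: sensitivity is taken as the drop in success rate relative to a baseline condition, resilience as the success-rate difference between episodes with and without concurrent output, and verification accuracy as the fraction of episodes in which the agent's own verdict matches the ground-truth outcome. The `Episode` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """Hypothetical per-episode record; all field names are assumptions."""
    config: str                  # condition label, e.g. "full" or "no_segmentation"
    concurrent: bool             # True if other panes produced output during the task
    succeeded: bool              # ground-truth task outcome
    agent_claimed_success: bool  # the agent's own verdict after verification


def success_rate(episodes):
    """Fraction of episodes that succeeded; 0.0 for an empty collection."""
    episodes = list(episodes)
    if not episodes:
        return 0.0
    return sum(e.succeeded for e in episodes) / len(episodes)


def ablation_sensitivity(episodes, baseline="full"):
    """Success-rate drop of each condition relative to the baseline condition."""
    by_config = {}
    for e in episodes:
        by_config.setdefault(e.config, []).append(e)
    base = success_rate(by_config.get(baseline, []))
    return {config: base - success_rate(eps)
            for config, eps in by_config.items() if config != baseline}


def concurrent_resilience(episodes):
    """Success rate with concurrent output minus success rate without it."""
    episodes = list(episodes)
    noisy = [e for e in episodes if e.concurrent]
    quiet = [e for e in episodes if not e.concurrent]
    return success_rate(noisy) - success_rate(quiet)


def verification_accuracy(episodes):
    """Fraction of episodes where the agent's verdict matches the real outcome."""
    episodes = list(episodes)
    if not episodes:
        return 0.0
    return sum(e.agent_claimed_success == e.succeeded for e in episodes) / len(episodes)
```

Keeping `succeeded` and `agent_claimed_success` as separate fields lets an analysis distinguish an agent that fails a task from one that fails and still reports success.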

Open Discussion

The central methodological risk is mistaking task completion in the terminal for agent understanding: a command can finish successfully even when the agent acted in the wrong pane or never checked the result. The project therefore treats tmux as both infrastructure and evidence: pane state, focus movement, command output, and recovery behavior all become part of the argument.
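
As an illustration of how pane state and focus movement might be recorded as evidence, the sketch below snapshots panes with the standard `tmux list-panes -F` and `tmux capture-pane -p` commands. The `PaneState` record and the helper functions are hypothetical, and the parser assumes a space-separated format string in which the free-text command field comes last.

```python
import subprocess
from dataclasses import dataclass

# tmux format variables; the command goes last because it is free text.
PANE_FORMAT = "#{pane_id} #{pane_active} #{pane_current_command}"


@dataclass(frozen=True)
class PaneState:
    """Hypothetical snapshot of one pane at one moment."""
    pane_id: str   # tmux pane identifier such as "%0"
    active: bool   # True if this pane has focus in its window
    command: str   # foreground command reported by tmux


def parse_panes(listing):
    """Parse `tmux list-panes -F PANE_FORMAT` output into PaneState records."""
    panes = []
    for line in listing.splitlines():
        parts = line.split(" ", 2)
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        command = parts[2] if len(parts) > 2 else ""
        panes.append(PaneState(parts[0], parts[1] == "1", command))
    return panes


def snapshot_panes():
    """Query the current window's panes; requires a running tmux server."""
    result = subprocess.run(["tmux", "list-panes", "-F", PANE_FORMAT],
                            check=True, capture_output=True, text=True)
    return parse_panes(result.stdout)


def capture_text(pane_id):
    """Return the visible contents of one pane as text."""
    result = subprocess.run(["tmux", "capture-pane", "-p", "-t", pane_id],
                            check=True, capture_output=True, text=True)
    return result.stdout


def focus_moved(before, after):
    """True if the active pane differs between two snapshots."""
    def active_id(panes):
        return next((p.pane_id for p in panes if p.active), None)
    return active_id(before) != active_id(after)
```

Taking a snapshot before and after each agent action yields a focus trace that can be checked independently of whether the command itself succeeded.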

Future work can connect this static discussion to executable harnesses, trace viewers, and standardized task suites for cross-agent comparison.