Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents