VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training