# Optimizing LLM Routers: How to Choose the Best Model-Prompt Combo for Your App
## Choosing the Right Model-Prompt Combination for LLM Routers

Developers are increasingly integrating large language models (LLMs) into applications to enhance user experiences and streamline workflows. One crucial aspect of LLM integration is the router, which directs user inputs to the appropriate component within the application, whether that means handling web research, adding budget items, or exporting data. High accuracy in these routers is essential: misrouted requests frustrate users and make the app feel broken. This post explores how to use OpenAI Evals to select the optimal model-prompt combination for a budget chatbot.

## Understanding Evals

Evals are task-oriented, iterative processes for evaluating and improving the performance of LLM integrations. They help developers test various prompts and models over a set of inputs and compare the results against expected outcomes. Despite the lack of comprehensive documentation and the complexity of the OpenAI Evals SDK, which uses string-based dispatch, Evals are a powerful tool for optimizing LLM-based applications.

## Use Case: Routing Budget Requests with LLMs

The specific use case discussed here is a budget chatbot named "Better Get Done!" that aims to make the arduous task of creating departmental budgets more engaging. The chatbot interfaces with users through a chat-like environment, helping with web research, adding budget items, and compiling spreadsheets. The router's role is to interpret user inputs and direct them to the correct handler: structured responses, chat interactions, or file downloads.

For example, a user might input: "Create a budget item for 10 Mac M1s. They cost $3,500 each." The system should recognize this as a request for a structured response and route it accordingly. Other inputs might call for a chat interaction or a file download.
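To make the routing contract concrete, here is a minimal stand-in sketch. The function `route_query` and its keyword heuristic are illustrative assumptions, not part of the actual app, which delegates this decision to an LLM; the sketch only shows the input/output contract the router must satisfy.

```python
# Illustrative stand-in for the router contract: map a user query to one
# of the three route labels the real LLM-backed router must return.
ROUTES = ("chat_response", "structured_response", "download_file")

def route_query(query: str) -> str:
    """Toy keyword heuristic; the real app asks an LLM to make this choice."""
    q = query.lower()
    if "export" in q or "download" in q or "spreadsheet" in q:
        return "download_file"
    if "budget item" in q or "create a budget" in q:
        return "structured_response"
    return "chat_response"

print(route_query("Create a budget item for 10 Mac M1s. They cost $3,500 each."))
# structured_response
```

Whatever produces the label, the contract is the same: exactly one of the three route strings, and nothing else, so the caller can dispatch on it directly.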
Proper routing is critical for maintaining user trust and ensuring the app works as intended.

## Data Preparation

To test the router, a dataset of 100 user inputs and their corresponding ideal responses was created. The dataset covers various types of user queries and the expected actions the router should take. It is uploaded to OpenAI using the API:

```bash
curl https://api.openai.com/v1/files \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F purpose="evals" \
  -F file="@llm-as-router-IT-budget-requests.jsonl"
```

The response provides a unique file ID, which is used in subsequent steps to reference the dataset.

## Setting Up the Eval

An Eval configuration holds the settings that are shared across multiple runs. It consists of two main components:

- **Data source configuration**: defines the schema of the data used in the eval.
- **Testing criteria**: specifies how to determine whether the integration is working correctly for each record.

The data source configuration is defined as follows (here `Record` is a Pydantic model describing one dataset record):

```python
data_source_config = {
    "type": "custom",
    "item_schema": Record.model_json_schema(),
    "include_sample_schema": True,
}
```

The testing criteria ensure that the model's output exactly matches the ideal response from the dataset:

```python
router_grader = {
    "name": "Router Grader",
    "type": "string_check",
    "input": "{{sample.output_text}}",
    "reference": "{{item.ideal}}",
    "operation": "eq",
}
```

## Creating the Eval

The Eval is registered on OpenAI's platform using the `openai.evals.create` method.
This step creates a template that can be reused for different datasets and scenarios:

```python
eval_create_result = openai.evals.create(
    name="Route check Eval",
    metadata={
        "description": "This eval tests several prompts and models to find the best performing combination."
    },
    data_source_config=data_source_config,
    testing_criteria=[router_grader],
)
eval_id = eval_create_result.id
print(eval_id)
```

## Prompts of Increasing Specificity

To test the router's effectiveness, several prompts of increasing specificity are created:

1. **Prefix only**: a basic prompt that instructs the model to determine the appropriate route.
2. **Basic**: adds more detailed instructions and constraints.
3. **With samples**: includes sample user queries and ideal responses.
4. **With examples**: provides additional examples to guide the model.

Here are snippets of two of the prompts.

Prefix-only prompt:

```python
PROMPT_PREFIX = """You are a router. Your only job is to determine which route best matches the user's query."""
```

Basic prompt:

```python
PROMPT_VARIATION_BASIC = f"""{PROMPT_PREFIX}

Your choices are:
- chat_response
- structured_response
- download_file

You should return just the route and nothing else.

Instructions:
- Always review the entire user input message before responding.
- You will choose the appropriate type of response based on the user's query.
- If unsure, default to 'chat_response'.
- For multiple requests, use 'chat_response' unless clearly separate budget items.
- Return 'download_file' only if a reasonable person would infer this.
"""
```

## Models to Test

Three models are selected for testing:

- gpt-4o-mini
- gpt-4.1-mini
- gpt-4.1-nano

## Creating Runs

Runs are created to test each prompt against every model. The `openai.evals.runs.create` method is used within a nested loop to automate this process.
Each run configuration specifies the prompt, model, and dataset file ID:

```python
tasks = []
for prompt_name, prompt in prompts:
    for model in models:
        run_data_source = {
            "type": "completions",
            "input_messages": {
                "type": "template",
                "template": [
                    {"role": "developer", "content": prompt},
                    {"role": "user", "content": "{{item.input}}"},
                ],
            },
            "model": model,
            "source": {"type": "file_id", "id": dataset_file_id},
        }
        tasks.append(
            client.evals.runs.create(
                eval_id=eval_id,
                name=f"{prompt_name}_{model}",
                data_source=run_data_source,
            )
        )
result = await asyncio.gather(*tasks)
```

## View Results

After the runs complete, the results are analyzed to determine the best-performing model-prompt combination. The notebook includes code to plot the results, although the same information is available in OpenAI's Evals dashboard. The analysis reveals that the score correlates more strongly with the prompt's specificity than with the model, although both factors are significant.

## Conclusion

The evaluation process led to the selection of the most elaborate prompt (with examples) paired with the gpt-4o-mini model, which is relatively cost-effective at $0.15 per million tokens. This combination achieved the highest accuracy in correctly interpreting user inputs. Routers must be highly accurate to ensure a seamless user experience, and this simple eval can be rerun periodically, or whenever new routes or models are introduced, to maintain optimal performance.

## Industry Insight and Company Profile

Industry insiders emphasize the importance of systematic evaluation in LLM integration projects. While it's tempting to rely on intuition or ad-hoc testing, a structured approach like OpenAI Evals helps ensure that the chosen model-prompt combination is robust and reliable. Companies like OpenAI continue to refine their tools and documentation to support developers in implementing AI responsibly and effectively. Better Get Done!
is currently in development, combining practical budgeting with a gamified user experience. The founder, inspired by OpenAI's cookbooks, is open to feedback and expressions of interest in the product. Such approaches highlight the potential for AI to turn mundane tasks into engaging, efficient processes.
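To close the loop on the eval workflow above, here is a small sketch of the comparison step: turning per-run pass counts into an accuracy ranking. The run names and tallies below are hypothetical placeholders, not the post's actual numbers; they merely mirror the finding that the with-examples prompt on gpt-4o-mini ranked best.

```python
# Hypothetical per-run tallies (placeholders, not the post's real results).
runs = [
    {"name": "prefix_only_gpt-4.1-nano", "passed": 71, "total": 100},
    {"name": "basic_gpt-4.1-mini", "passed": 84, "total": 100},
    {"name": "with_examples_gpt-4o-mini", "passed": 97, "total": 100},
]

def rank_runs(runs):
    """Sort runs by accuracy (passed / total), best first."""
    return sorted(runs, key=lambda r: r["passed"] / r["total"], reverse=True)

best = rank_runs(runs)[0]
print(f"{best['name']}: {best['passed'] / best['total']:.0%}")
# with_examples_gpt-4o-mini: 97%
```

The same tallies are what the Evals dashboard shows per run; the point of scripting the ranking is only that it can be rerun automatically whenever new routes or models are introduced.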
