HyperAI

BlackBox Optimizers

In 2024, Carnegie Mellon University (CMU) proposed a new black-box optimization strategy that uses a large language model to automatically refine natural-language prompts, improving the performance of vision-language models (VLMs) on downstream tasks such as text-to-image generation and visual recognition. The method requires no access to the model's internal parameters, and it greatly improves the flexibility and speed of optimization, allowing even users without a technical background to improve model performance. The results are presented in "Language Models as Black-Box Optimizers for Vision-Language Models", which was accepted at CVPR 2024.

Figure 1: Prompting Vision-Language Models (VLMs) using chat-based Large Language Models (LLMs). Just as human prompt engineers iteratively test and refine prompts, the researchers use ChatGPT to continuously optimize the prompts of VLMs. Their iterative approach evaluates the performance of ChatGPT-generated prompts on few-shot datasets (highlighted in blue) and provides feedback to ChatGPT through simple conversation (marked in purple), as shown in the example figure. This straightforward approach achieves state-of-the-art one-shot image classification results across 11 datasets using CLIP, and it operates in a black-box manner without access to model weights, feature embeddings, or output logits. The study shows that providing both positive (green) and negative (red) prompts improves efficiency. Notably, in this extremely low-shot regime, the method outperforms white-box approaches such as gradient-based continuous prompting (CoOp) and manually designed prompts. The figure shows a typical conversation via the ChatGPT web interface; the study's code implementation interacts with the ChatGPT API in the same manner.

Specifically, the researchers optimize VLMs purely through natural-language prompts, avoiding any access to model parameters, feature embeddings, or output logits. They employ chat-based large language models (LLMs) to search for the best text prompts for VLMs via an automatic "hill-climbing" procedure, which lets the prompts converge to an effective state over the course of the conversation without human intervention.
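The hill-climbing loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `propose_prompts` and `score_prompt` are hypothetical stand-ins for, respectively, a conversational call to a chat LLM (e.g. via the ChatGPT API) fed with the best and worst prompts so far, and a few-shot accuracy evaluation of a VLM such as CLIP.

```python
import random

def score_prompt(prompt: str) -> float:
    """Toy scorer standing in for few-shot VLM accuracy.
    A real implementation would classify a few labeled images with
    CLIP using this prompt template and return the accuracy."""
    return prompt.count("photo") + 0.1 * len(prompt.split())

def propose_prompts(positives, negatives, n=3):
    """Toy stand-in for the chat LLM. A real implementation would
    send the positive (high-scoring) and negative (low-scoring)
    prompts in a conversation and parse new candidates from the
    reply; here we just sample from fixed templates."""
    templates = [
        "a photo of a {}",
        "a cropped photo of the {}",
        "a blurry photo of a {}",
        "an image of the {}",
    ]
    return random.sample(templates, n)

def hill_climb(initial_prompt: str, iterations: int = 5, keep: int = 3):
    """Black-box hill climbing over natural-language prompts:
    keep a scored history, feed the best/worst prompts back to the
    proposer, and retain whatever candidate scores highest."""
    history = [(score_prompt(initial_prompt), initial_prompt)]
    for _ in range(iterations):
        ranked = sorted(history, reverse=True)
        positives = [p for _, p in ranked[:keep]]   # best prompts so far
        negatives = [p for _, p in ranked[-keep:]]  # worst prompts so far
        for candidate in propose_prompts(positives, negatives):
            history.append((score_prompt(candidate), candidate))
    return max(history)  # (best score, best prompt)
```

Because the score can only improve as the history grows, the loop never returns a prompt worse than the initial one, which is the essence of hill climbing in a black-box setting.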

In the challenging one-shot image classification setting, this simple method is tested on 11 datasets including ImageNet, where it outperforms the white-box continuous prompting method (CoOp) by 1.5% on average, and also beats manually designed prompts and prompts generated by LLMs. The study further highlights the benefit of conversational feedback that includes both positive and negative prompts: LLMs can exploit the implicit "gradient" direction in text feedback to search more efficiently. In addition, the text prompts produced by this strategy are not only more interpretable but also transfer well between different VLM architectures in a black-box manner.

Finally, the framework is applied to optimize a state-of-the-art black-box VLM (DALL-E 3) for text-to-image generation, prompt inversion, and personalization.