Black-Box Optimizers
In 2024, researchers at Carnegie Mellon University (CMU) proposed a black-box optimization strategy that uses a large language model to automatically refine natural language prompts, improving the performance of vision-language models (VLMs) on downstream tasks such as text-to-image generation and visual recognition. The method requires no access to the model's internal parameters, which makes optimization far more flexible and faster and lets even users without a technical background improve model performance. The work, "Language Models as Black-Box Optimizers for Vision-Language Models", was accepted to CVPR 2024.

Specifically, the researchers optimize VLMs purely through natural language prompts, avoiding any access to model parameters, feature embeddings, or output log-probabilities. They use chat-based large language models (LLMs) to search for the best text prompts for the VLM via an automatic "hill-climbing" procedure, which lets the prompts converge to an effective state over the course of the conversation without human intervention.
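A minimal Python sketch of such a hill-climbing loop is shown below. The helper names `query_llm` and `score_prompt` are hypothetical placeholders, not the authors' implementation: the former stands in for a call to a chat LLM that proposes new candidate prompts, the latter for a black-box evaluation of a candidate prompt (e.g., one-shot classification accuracy).

```python
import random

def query_llm(messages: list[dict]) -> list[str]:
    """Placeholder for a chat-LLM call that returns new candidate prompts."""
    return [f"a photo of a {{}}, variant {random.randint(0, 999)}"]

def score_prompt(prompt: str) -> float:
    """Placeholder for a black-box evaluation (e.g., one-shot VLM accuracy)."""
    return random.random()

def hill_climb(seed_prompts: list[str], iterations: int = 20, pool_size: int = 10) -> str:
    # Score the initial pool; only prompt text and a scalar score are ever used,
    # never the VLM's weights, embeddings, or log-probabilities.
    pool = sorted(((score_prompt(p), p) for p in seed_prompts), reverse=True)
    for _ in range(iterations):
        half = len(pool) // 2
        best = [p for _, p in pool[:half]]    # well-scoring prompts (positive examples)
        worst = [p for _, p in pool[half:]]   # poorly-scoring prompts (negative examples)
        messages = [{
            "role": "user",
            "content": ("These prompts scored well:\n" + "\n".join(best)
                        + "\nThese prompts scored poorly:\n" + "\n".join(worst)
                        + "\nPropose improved prompts for image classification."),
        }]
        for candidate in query_llm(messages):
            pool.append((score_prompt(candidate), candidate))
        pool = sorted(pool, reverse=True)[:pool_size]  # keep only the best: hill climbing
    return pool[0][1]

print(hill_climb(["a photo of a {}", "an image of a {}", "a blurry photo of a {}", "a {}"]))
```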
In the challenging one-shot image classification setting, this simple method is tested on 11 datasets including ImageNet, where it outperforms the white-box continuous prompting method CoOp by 1.5% on average, and also beats both human-engineered prompts and prompts generated by LLMs. The study further highlights the advantage of conversational feedback that includes both positive and negative prompts, since the LLM can exploit the implicit "gradient" direction in text feedback for a more efficient search. In addition, the text prompts produced by this strategy are not only more interpretable but also transfer well across different VLM architectures in a black-box manner.
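To make the black-box evaluation concrete, here is a minimal sketch of what `score_prompt` could look like in the one-shot classification setting, assuming the Hugging Face `transformers` library and the public "openai/clip-vit-base-patch32" checkpoint (both are illustrative choices, not necessarily those used in the paper). Because only the prompt text and the resulting accuracy cross the interface, the same discovered template can be plugged into a different CLIP backbone to test black-box transfer.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score_prompt(prompt_template: str, images: list[Image.Image],
                 labels: list[int], class_names: list[str]) -> float:
    """Accuracy of the VLM when `prompt_template` (e.g. 'a photo of a {}') is
    filled with each class name; the optimizer sees only this scalar, not the
    model's gradients, embeddings, or log-probabilities."""
    texts = [prompt_template.format(c) for c in class_names]
    inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image   # shape: (num_images, num_classes)
    preds = logits.argmax(dim=-1)
    return (preds == torch.tensor(labels)).float().mean().item()
```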
Finally, the framework is applied to optimize a state-of-the-art black-box VLM (DALL-E 3) for text-to-image generation, prompt inversion, and personalization.
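The same loop carries over to these generative tasks by swapping the fitness signal. The sketch below is an assumption-laden illustration: `generate_image` is a hypothetical stub wrapping whatever black-box image generator is available, and a candidate prompt is scored by CLIP similarity between the generated image and a reference image, which is one plausible criterion for prompt inversion.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def generate_image(prompt: str) -> Image.Image:
    """Hypothetical stub: replace with a call to a black-box text-to-image model."""
    return Image.new("RGB", (256, 256))

def score_t2i_prompt(prompt: str, reference: Image.Image) -> float:
    """Higher is better: the candidate prompt should reproduce the reference image
    (a prompt-inversion criterion; other tasks would use a different score)."""
    inputs = processor(images=[generate_image(prompt), reference], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)   # cosine similarity of CLIP embeddings
    return float(feats[0] @ feats[1])
```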