Run Cambrian-1 Demo Online


Cambrian-1 is a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While powerful language models can enhance multimodal capabilities, the design choices for the visual components are often underexplored and disconnected from visual representation learning research.
Cambrian-1 is built around five key pillars, each of which provides important insights into the design space of MLLMs:
- Visual Representation: The research team explored various visual encoders and their combinations.
- Connector Design: The research team designed a new dynamic and spatially aware connector that integrates visual features from several models while reducing the number of tokens.
- Instruction Tuning Data: The research team curated high-quality visual instruction tuning data from public resources, emphasizing the importance of a balanced data distribution.
- Instruction Tuning Cookbook: The research team discussed instruction tuning strategies and practices.
- Benchmarks: The research team examined existing MLLM benchmarks and introduced a new vision-centric benchmark, "CV-Bench".
Cambrian-1 project website: https://cambrian-mllm.github.io/#visual-representation
Model performance
| Model | # Vis. Tok. | MMB | SQA-I | MathVistaM | ChartQA | MMVP |
|---|---|---|---|---|---|---|
| GPT-4V | UNK | 75.8 | – | 49.9 | 78.5 | 50.0 |
| Gemini-1.0 Pro | UNK | 73.6 | – | 45.2 | – | – |
| Gemini-1.5 Pro | UNK | – | – | 52.1 | 81.3 | – |
| Grok-1.5 | UNK | – | – | 52.8 | 76.1 | – |
| MM-1-8B | 144 | 72.3 | 72.6 | 35.9 | – | – |
| MM-1-30B | 144 | 75.1 | 81.0 | 39.4 | – | – |
| *Base LLM: LLaMA3-8B-Instruct* | | | | | | |
| Mini-Gemini-HD-8B | 2880 | 72.7 | 75.1 | 37.0 | 59.1 | 18.7 |
| LLaVA-NeXT-8B | 2880 | 72.1 | 72.8 | 36.3 | 69.5 | 38.7 |
| Cambrian-1-8B | 576 | 75.9 | 80.4 | 49.0 | 73.3 | 51.3 |
| *Base LLM: Vicuna1.5-13B* | | | | | | |
| Mini-Gemini-HD-13B | 2880 | 68.6 | 71.9 | 37.0 | 56.6 | 19.3 |
| LLaVA-NeXT-13B | 2880 | 70.0 | 73.5 | 35.1 | 62.2 | 36.0 |
| Cambrian-1-13B | 576 | 75.7 | 79.3 | 48.0 | 73.8 | 41.3 |
| *Base LLM: Hermes2-Yi-34B* | | | | | | |
| Mini-Gemini-HD-34B | 2880 | 80.6 | 77.7 | 43.4 | 67.6 | 37.3 |
| LLaVA-NeXT-34B | 2880 | 79.3 | 81.8 | 46.5 | 68.7 | 47.3 |
| Cambrian-1-34B | 576 | 81.4 | 85.6 | 53.2 | 75.6 | 52.7 |
Deploy the demo for inference
The model and environment are already deployed for this tutorial, so you can use the model for inference and dialogue directly by following the steps below:
1. Initial Setup
1. Open the workspace once resource configuration is complete

2. Open the terminal and enter the command `bash setup.sh`


3. After the system outputs `Environment variable added to .bashrc`, enter the command `source ~/.bashrc`
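Steps 2 and 3 amount to appending an `export` line to a shell rc file and re-sourcing it so the variable takes effect in the current session. A minimal illustration of that mechanism, using a temporary file and a hypothetical variable name (the real `setup.sh` sets its own):

```shell
# Stand-in for ~/.bashrc so we don't touch the real one here.
rc="$(mktemp)"

# What setup.sh effectively does: persist an environment variable.
# CAMBRIAN_HOME is a hypothetical name for illustration only.
echo 'export CAMBRIAN_HOME="$HOME/cambrian"' >> "$rc"

# What `source ~/.bashrc` does: load that variable into this shell.
source "$rc"
echo "$CAMBRIAN_HOME"
```

Without the `source` step, the variable would only be available in shells opened after the rc file was modified.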

2. Start the Controller
4. After initialization is complete, enter the command `bash control.sh` in the terminal

3. Open the Interface
5. Wait about 15 seconds, open a new terminal, and enter the command `bash gradio.sh`. Then click the link generated on the page to open the model interface

6. At this point there is no model to choose from in the interface, because the model has not been configured yet. Proceed to step 4 below.

4. Model Configuration
7. Open another new terminal and enter the command `bash model.sh`
When `Uvicorn running on...` appears, return to the Gradio web page you opened earlier. After refreshing, you will see that the model has been deployed. You can then upload images and enter prompts to converse with the model.
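The three helper scripts wrap a controller / web-server / model-worker trio. A sketch of what they likely run, assuming the LLaVA-style serving layout that the Cambrian-1 codebase follows (the module paths, ports, and model path here are assumptions; check the scripts themselves for the exact commands):

```shell
# control.sh: the controller that routes requests to workers (assumed layout).
python -m cambrian.serve.controller --host 0.0.0.0 --port 10000 &

# gradio.sh: the web UI, registered against the controller (assumed layout).
python -m cambrian.serve.gradio_web_server \
    --controller http://localhost:10000 --share &

# model.sh: the worker that loads the weights; the model only appears in the
# UI once this registers with the controller (model path is an assumption).
python -m cambrian.serve.model_worker \
    --controller http://localhost:10000 \
    --port 40000 --worker http://localhost:40000 \
    --model-path nyu-visionx/cambrian-8b
```

This layering explains the observed behavior in step 6: the UI comes up before any worker exists, so the model list stays empty until `model.sh` finishes loading and registers.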


The interface also exposes several generation parameters that the user can adjust:
- Temperature controls the randomness, and hence the creativity, of the output.
- Top p restricts sampling to the smallest set of candidate tokens whose cumulative probability reaches p, trading off quality against diversity.
- Max output tokens caps the length of the generated response.
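These knobs map onto standard sampling: temperature rescales the logits before the softmax, and top-p (nucleus) filtering truncates the candidate set before drawing. A minimal, dependency-free sketch of that logic (the function name and toy logits are illustrative, not the demo's internals):

```python
import math
import random

def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Draw a token index using temperature scaling and top-p filtering."""
    # Temperature rescales the logits: <1.0 sharpens, >1.0 flattens.
    scaled = [l / temperature for l in logits]
    # Softmax (shifted by the max for numerical stability).
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-p: keep the smallest set of tokens, in descending probability,
    # whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalize over the kept set and draw.
    mass = sum(probs[i] for i in kept)
    r = rng.random() * mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

With a very small top_p the kept set collapses to the single most likely token, which is why low top-p settings make the demo's answers more deterministic.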
