HyperAI

Selected for ACL 2024! Zhejiang University Launches the First Ocean Language Model OceanGPT, Making Underwater Embodied Intelligence a Reality


AI tools, including large language models (LLMs), are gradually changing the scientific paradigm, and were listed by Nature as one of the scientific events worth watching in 2024. As a core tool for text data mining, large language models can extract key scientific information, patterns, and trends from massive amounts of text, deepening the understanding of different disciplines and providing strong support and insights for scientific research, decision-making, and complex problem solving.

In biomedicine, for example, Microsoft trained the language model BioGPT on millions of relevant scientific papers from the PubMed database. The model excels at understanding professional terminology, gene names, protein sequences, and other complex concepts; compared with non-specialist models, BioGPT can quickly and accurately answer biomedical questions and complete tasks such as text mining, lab report writing, molecular design, and literature review writing.

Likewise, in marine science, using large language models to analyze massive amounts of marine science text and to understand theories and methods related to ocean characteristics, patterns of change, and resource development and utilization is crucial to global climate regulation, weather pattern formation, biodiversity maintenance, and humanity's future economic development.

However, multi-dimensional, multi-scale ocean data is large in volume and rich in type, and traditional data processing methods struggle to cope with it. At the same time, marine science spans many fields and disciplines, each with its own data characteristics and patterns, which demands a richer reserve of professional knowledge from an LLM. Current mainstream LLMs still cannot fully meet the specific needs of oceanographers.

To address this, the team led by Zhang Ningyu and Chen Huajun from the School of Computer Science and Technology at Zhejiang University proposed OceanGPT, the first large language model for the ocean domain. The model excels at a range of ocean science tasks and can answer questions following oceanographers' instructions. Evaluated on the oceanography benchmark OCEANBENCH, OceanGPT not only demonstrated a high level of expertise on ocean science tasks but also exhibited preliminary embodied intelligence capabilities in ocean engineering.
OceanGPT project address:

http://oceangpt.zjukg.cn/

In addition, to alleviate the difficulty of obtaining ocean data, the researchers proposed DoInstruct, a marine science instruction generation framework based on multi-agent collaboration. Each agent is treated as an expert in a specific field (such as science and research, resources and development, or ecology and environment) and is responsible for generating data in that field.

The research, titled "OceanGPT: A Large Language Model for Ocean Science Tasks", was recently accepted as a main conference paper at ACL 2024 (a CCF-A conference and top venue in natural language processing).

Research highlights:
* Compared with existing open-source large language models, OceanGPT, a large language model for the ocean domain, can handle more specialized ocean tasks.

* The ocean science instruction generation framework DoInstruct is highly flexible and can be adapted and applied to other scientific fields (such as astronomy).

Paper address:

https://arxiv.org/abs/2310.02031

The open source project "awesome-ai4s" brings together more than 100 AI4S paper interpretations and provides massive data sets and tools:

https://github.com/hyperai/awesome-ai4s

Dataset: quality first, built from 67,633 marine science papers

The researchers collected 67,633 recent articles in the field of marine science as the raw corpus, and also selected some historically significant documents to help the LLM understand the history of the field's development. To ensure diversity, the articles come from different sources and cover a variety of research perspectives and methods.

To ensure data quality and consistency, the researchers used regular expressions to filter out figures, tables, headers, footers, page numbers, URLs, and references; removed extra spaces, line breaks, and other non-text characters; and replaced or deleted special characters, emojis, and garbled text. The processed documents cover fields across marine science, such as marine physics, marine chemistry, marine biology, geology, and hydrology.
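A minimal sketch of this kind of regex-based cleaning, with illustrative patterns (the authors' actual expressions are not published, so every pattern below is an assumption):

```python
import re

def clean_document(text: str) -> str:
    """Sketch of a corpus-cleaning pass: strip URLs, reference markers,
    page-number lines, and non-text characters, then normalize whitespace.
    The concrete patterns are illustrative, not the authors' originals."""
    text = re.sub(r"https?://\S+", " ", text)            # URLs
    text = re.sub(r"\[\d+(,\s*\d+)*\]", " ", text)       # bracketed reference markers
    text = re.sub(r"(?m)^\s*Page\s+\d+\s*$", " ", text)  # page-number lines
    text = re.sub(r"[^\x20-\x7E\n]+", " ", text)         # emojis / garbled non-ASCII
    text = re.sub(r"\s+", " ", text)                     # collapse spaces and line breaks
    return text.strip()

sample = "Deep-sea mining  \u2764 [12] is studied.\nPage 3\nSee https://example.org/x"
print(clean_document(sample))  # → Deep-sea mining is studied. See
```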

The researchers then used a hash algorithm to deduplicate the data, which reduces the risk of overfitting during pre-training and improves the model's generalization ability.
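The deduplication step can be sketched with a content hash; the paper does not name the hash function, so SHA-256 and the normalization below are assumptions:

```python
import hashlib

def deduplicate(docs):
    """Remove exact duplicates by hashing each normalized document.
    Sketch of hash-based deduplication; SHA-256 is an assumed choice."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["Ocean currents transport heat.",
          "ocean currents transport heat.",   # duplicate after normalization
          "Plankton drive the carbon pump."]
print(len(deduplicate(corpus)))  # → 2
```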

Since the marine science corpus spans multiple fields and topics, each with its own data characteristics and patterns, the researchers proposed the domain instruction generation framework DoInstruct to effectively model and obtain these data.
* Ocean themes: based on the expertise of oceanographers, the ocean science data are manually divided into five relatively independent ocean themes, namely science and research, resources and development, ecology and environment, technology and engineering, and life, culture, and others.

High-quality, professional, and diverse: DoInstruct generates marine instruction data

The domain instruction generation framework DoInstruct is based on multi-agent collaboration and can effectively realize ocean data generation.

DoInstruct Framework

As shown in the figure above, the DoInstruct framework defines three agent roles: an evolving agent as the generator, a fine-tuned agent as the literature extractor, and an inspector agent with rule constraints. Each agent is treated as an expert on a specific field (topic) and is responsible for generating the corresponding data.

Evolving Agent as the Generator

To build the seed data set, the researchers hired dozens of annotators with rich backgrounds in marine science, each of whom was responsible for several topics and manually wrote some representative examples for each marine topic.

The researchers then used a large language model to imitate existing data and generate a large number of similar samples, all of which were manually checked by annotators. The final seed instruction dataset includes 5 main categories, more than 500 subcategories, and more than 10,000 data samples.

Left: Evolutionary Data Synthesis Agent

After obtaining the seed instruction dataset, the researchers selected samples from it and called an agent (gpt-3.5-turbo) to evolve the selected samples.

As shown in the figure on the left, the agent supplements and expands the background knowledge of the seed samples and performs refined analysis, enhancement, and improvement of the knowledge points they contain. Through multiple rounds of iteration, the researchers can quickly expand the existing seed dataset in both the breadth and the depth of its information.
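The iteration described above can be sketched as a loop; `call_llm` is a stub standing in for the gpt-3.5-turbo API call, and the prompt wording is an assumption rather than the authors' actual template:

```python
import random

def call_llm(prompt: str) -> str:
    """Placeholder for the gpt-3.5-turbo call; returns a canned answer here."""
    return "Evolved sample derived from: " + prompt.splitlines()[-1]

def evolve(seed_samples, rounds=2, per_round=2):
    """Sketch of the evolving generator: each round picks seed samples,
    asks the LLM to enrich and expand them, and feeds results back as seeds."""
    pool = list(seed_samples)
    for _ in range(rounds):
        for sample in random.sample(pool, min(per_round, len(pool))):
            prompt = ("Expand the background knowledge and refine the "
                      "knowledge points of this ocean-science sample:\n" + sample)
            pool.append(call_llm(prompt))
    return pool

seeds = ["Q: What drives thermohaline circulation? A: Density gradients."]
evolved = evolve(seeds)
print(len(evolved))  # → 4 (1 seed + 1 from round one + 2 from round two)
```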

Fine-Tuned Agent as the Literature Extractor

Fine-tuned Literature Reading Agent

The researchers collected an expert-annotated corpus and used the BM25 algorithm to retrieve high-quality sentences from the larger ocean corpus, treating both as high-quality candidate samples. They then fine-tuned gpt-3.5-turbo on the seed instruction dataset and used the fine-tuned agent as a literature extractor that pulls high-quality text from the massive ocean corpus.
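BM25 ranks candidate sentences by term overlap with the query, weighted by inverse document frequency and length normalization. A compact stdlib-only sketch; the authors' retrieval setup is not detailed, so the parameters k1 and b below are conventional defaults, not their values:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)
    n = len(tokenized)
    df = Counter()                       # document frequency per term
    for d in tokenized:
        df.update(set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)                  # term frequency in this document
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

docs = ["Radionuclides accumulate in marine sediments.",
        "Coral reefs host diverse fish species.",
        "Sediment cores record radionuclide deposition."]
scores = bm25_scores("radionuclide sediment", docs)
print(max(range(len(docs)), key=scores.__getitem__))  # → 2
```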

Agent as the Inspector with Rule Constraints

Audit Agent to ensure data quality

For the large number of generated instructions, the researchers used grammar, semantics, and basic definitions from the ocean domain as rule constraints, built inspector agents through prompts, and filtered the data to ensure that the generated ocean instruction data was of high quality.
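A rule-constrained filter of this kind can be sketched as predicate checks. The concrete rules below (a length minimum, required domain vocabulary, a non-empty answer) are illustrative assumptions, since the paper describes the constraints only at a high level; in the actual framework the inspector is an LLM agent driven by a rule-listing prompt, not hard-coded checks:

```python
OCEAN_TERMS = {"ocean", "marine", "sea", "current", "sediment", "reef"}

def passes_inspection(sample: dict) -> bool:
    """Apply rule constraints to one generated instruction sample:
    basic well-formedness, ocean-domain vocabulary, and a non-empty answer."""
    instruction = sample.get("instruction", "")
    output = sample.get("output", "")
    words = instruction.lower().split()
    if len(words) < 4 or not output.strip():     # grammatical minimum
        return False
    if not OCEAN_TERMS.intersection(words):      # domain relevance
        return False
    return True

generated = [
    {"instruction": "Explain how ocean currents move heat.",
     "output": "They redistribute heat from the equator poleward."},
    {"instruction": "Tell me a joke.", "output": "..."},
]
kept = [s for s in generated if passes_inspection(s)]
print(len(kept))  # → 1
```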

To further ensure data quality, the researchers randomly selected 10% of the samples from the generated instruction dataset and asked trained domain-expert volunteers to check them for potential errors. The final data reached an inter-annotator agreement (IAA) score of 0.82, sufficient for the research purpose.
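The paper reports the IAA of 0.82 without naming the coefficient; Cohen's kappa for two annotators is one common choice, computed as κ = (p_o − p_e) / (1 − p_e), where p_o is observed and p_e chance agreement:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Inter-annotator agreement between two annotators over the same items."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n   # observed agreement
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)              # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Toy labels: 1 = "sample is correct", 0 = "sample has an error"
a = [1, 1, 1, 0, 1, 0, 1, 1, 1, 0]
b = [1, 1, 1, 0, 1, 1, 1, 1, 1, 0]
print(round(cohens_kappa(a, b), 2))  # → 0.74
```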

As shown in the figure below, the DoInstruct framework can use multiple agents to quickly build marine science datasets, scaling to more than 150,000 instructions (data evolving plus data extracting) while preserving the professionalism and accuracy of the data.

Statistics of the final instruction dataset

As shown in the figure below, researchers measured the data generation effect of DoInstruct from the perspectives of knowledge quality, expertise, and diversity.

Performance analysis of different agents

It can be seen that the evolving generator agent effectively enhances the richness of the ocean data, the extraction agent improves the professionalism of the content, and the inspector agent improves the quality of the generated data. In short, multi-agent collaboration is effective for ocean instruction generation.

Based on LLaMA-2, OceanGPT performs better in ocean tasks

After obtaining the instruction data, the researchers pre-trained OceanGPT on the basis of LLaMA-2 for 7 days using 6 NVIDIA A800 GPUs.

The overall framework of the OceanGPT model

After obtaining the pre-trained model, the researchers fine-tuned OceanGPT with the LoRA method. To evaluate its ability on oceanographic tasks, they selected three models for comparison: LLaMA-2 (Llama-2-7b-chat-hf), Vicuna-1.5, and ChatGLM2-6B.
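LoRA keeps the pretrained weight matrix W frozen and learns only a low-rank update ΔW = (α/r)·B·A, which is added at the forward pass. A minimal numeric sketch with toy shapes (illustration of the method only, not OceanGPT's training code):

```python
def matvec(M, v):
    """Matrix-vector product over plain lists (illustration only)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, alpha, r):
    """h = W x + (alpha / r) * B (A x): W (d_out x d_in) stays frozen;
    only the low-rank factors A (r x d_in) and B (d_out x r) are trained."""
    scale = alpha / r
    low_rank = matvec(B, matvec(A, x))
    base = matvec(W, x)
    return [h + scale * l for h, l in zip(base, low_rank)]

# Toy shapes: d_in = 3, d_out = 2, rank r = 1
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[0.1, 0.1, 0.1]]            # 1 x 3
B = [[0.0], [0.2]]               # 2 x 1
print(lora_forward([1.0, 2.0, 3.0], W, A, B, alpha=2, r=1))
```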

Before the comparison, the researchers designed the benchmark OCEANBENCH. As shown in the figure below, it includes 15 ocean-related task types, such as analysis and judgment.

OCEANBENCH Detailed Statistics

As shown in the figure below, the researchers compared OceanGPT with the three baseline models on 15 subtasks in the ocean domain at the task level. The results show that OceanGPT outperforms the other models in both automatic evaluation and human evaluation.

Ocean task-level results. Left: GPT-4 automatic evaluation; right: human evaluation

As shown in the figure above, the researchers presented OceanGPT's evaluation results on the OCEANBENCH ocean science tasks and found that OceanGPT outperforms the other baseline language models on the vast majority of tasks.

Evaluation results of OceanGPT on the OCEANBENCH ocean science tasks

From nuclear pollution to underwater robots: OceanGPT's double win in the marine field

In order to prove the application potential of OceanGPT in the ocean field, researchers tested OceanGPT from the perspectives of ocean science and ocean engineering.

A new tool for radionuclide research: OceanGPT offers greater depth of professional knowledge

For ocean science, the researchers focused on nuclear contamination of the marine environment and compared the performance of OceanGPT and Vicuna-7b-1.5 on this task.

Marine science task case study: how to study the surface and interface chemistry and toxicological effects of key radionuclides

As shown in the figure above, OceanGPT shows a higher level of knowledge in describing the content of radionuclide research. Its text content is not only clearly structured and well organized, but also covers all aspects of radionuclide research, such as experimental design, data analysis, risk assessment, and processing guidelines.

In contrast, although Vicuna-7b-1.5 is clearly expressed and logical, it lacks the deeper, more specific content related to radionuclides.

In summary, OceanGPT has advantages in terms of knowledge expertise, quality, and richness.

Intelligent marine engineering: OceanGPT achieves precise control of underwater robots

Marine engineering is critical to the sustainability and safety of offshore operations. To enable OceanGPT to interact with the outside world, the researchers synthesized robot code data and integrated these machine code instructions into the training data, evaluating the model's capabilities through code or console commands.

OceanGPT controls underwater robots

As shown in the figure above, OceanGPT can issue instructions to underwater robots through code or console commands, enabling them to perform complex tasks based on human instructions. This shows that OceanGPT has acquired preliminary embodied intelligence, paving the way for advanced ocean models to perform complex robot control and planning tasks.
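As an illustration only: OceanGPT's actual command set and robot interface are not public, so every command and action name in this hypothetical dispatcher for model-emitted console commands is an assumption:

```python
def dispatch(command: str) -> str:
    """Map a hypothetical model-emitted console command to a robot action."""
    verb, *args = command.split()
    actions = {
        "dive":    lambda depth: f"descending to {depth} m",
        "forward": lambda dist:  f"moving forward {dist} m",
        "surface": lambda:       "returning to surface",
    }
    if verb not in actions:
        return "unknown command"
    return actions[verb](*args)

# A model might emit a short command script for the vehicle:
for cmd in ["dive 50", "forward 10", "surface"]:
    print(dispatch(cmd))
```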

OceanGPT "evolves" again, and marine science ushers in the era of intelligence

Led by Professors Zhang Ningyu and Chen Huajun of Zhejiang University, the research team, which includes Bi Zhen, Xue Yida, Ou Yixin, Ji Daxiong, Zheng Guozhou, and others, built OceanGPT, the first large language model in the ocean field, marking a key step toward intelligence in marine research and an important milestone for the field.

Development of OceanGPT did not stop there. As the research deepened and the technology improved, OceanGPT underwent a new round of optimization and upgrades.

According to a recent report by the Zhejiang University Knowledge Engine Laboratory (ZJUKG), the paper's first author, Bi Zhen, announced a series of major advances in OceanGPT:

* First, two new versions were officially launched: OceanGPT-14B and OceanGPT-2B;

* Second, a version of OceanGPT based on the Chinese Qwen2 base model was added, enabling efficient interaction in both Chinese and English;

* The team also open-sourced OceanInstruct, a 20K-scale ocean-model instruction dataset, providing valuable resource support for marine science researchers;

OceanInstruct dataset download address:

https://go.hyper.ai/3QuLq

* Finally, the multimodal version OceanGPT-V was released. It supports multimodal ocean information such as sonar data and scientific images, and an online demo of OceanGPT-V is available, opening up new perspectives and possibilities for ocean science exploration. The model is reportedly to be open-sourced soon.

To analyze how the model's capabilities changed after the update, the researchers took OceanGPT-14B as an example and gave it a Chinese question, "Please generate a construction plan for submarine cables in the East China Sea", as shown in the figure below:

The results show that the content generated by OceanGPT is richer, covers more levels, and has a stronger ability to understand and generate marine scientific knowledge.

At the same time, to verify OceanGPT's English generation capabilities, the researchers gave the English input "Please describe the seafloor topography and geomorphology characteristics of the East China Sea", as shown in the figure below:

The results show that the descriptions generated by OceanGPT are relatively good in terms of detail, comprehensiveness, professionalism and regional division, and can provide more accurate and in-depth information on the seafloor topography and geomorphology.

In addition, Bi Zhen also gave the development plan of OceanGPT, as shown in the figure below:

OceanGPT Planning

Between August and December 2024, a bilingual, multimodal version, OceanGPT-V+, is expected to be launched. Building on the large-scale corpus, the team will continue training OceanGPT with larger models (such as 30B and 70B) and will maintain it by adding new data and new tasks to explore more of the unknown world of ocean science.

We look forward to OceanGPT bringing more surprises and breakthroughs, and opening a new chapter in marine science research!

References:
https://blog.csdn.net/gitblog_00055/article/details/138176998
https://mp.weixin.qq.com/s/TZuVvZfr1DsRGUXsxc3cGQ

Call to action

HyperAI (hyper.ai) is China's largest search engine in the field of data science. It has long focused on the latest research results of AI for Science and has interpreted more than 100 academic papers in top journals.

Research groups and teams that are conducting research and exploration around AI for Science are welcome to contact us to share their latest research results, contribute in-depth interpretation articles, and participate in the Meet AI4S live broadcast column. More ways to promote AI4S are waiting for us to explore together!