1. Tutorial Introduction

This tutorial uses a sample model and dataset and runs on a single RTX 4090 GPU. To train a larger model or use a larger dataset, choose a more powerful GPU.

VenusFactory was developed in 2025 by a joint team from Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory, and East China University of Science and Technology. The associated paper is "VenusFactory: A Unified Platform for Protein Engineering Data Retrieval and Language Model Fine-Tuning".

VenusFactory is a unified platform for protein engineering, designed to integrate biological data retrieval, standardized task benchmarking, and modular fine-tuning of pre-trained protein language models (PLMs). It supports both command-line execution and a Gradio-based no-code interface, and it integrates more than 40 protein-related datasets and more than 40 popular PLMs, making it accessible to researchers in both computer science and biology.

This tutorial is a quick-start guide to the demo. It introduces the main functions of VenusFactory and walks through fine-tuning, evaluation, and prediction on a demo dataset for protein solubility prediction.

2. Operation Steps

All data is stored in the /openbayes/home/VenusFactory directory.

1. Start the container

After the container starts, click the API address to open the web interface. Because the model is large, the WebUI takes about 1 minute to load; if you open the address before it is ready, the page shows "Bad Gateway".
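If you would rather not refresh the page by hand, the wait can be scripted. The following is a minimal sketch, not part of VenusFactory: it polls an address until it answers with HTTP 200 and treats any other response (such as the 502 behind "Bad Gateway") as "not ready yet". The function names, the 120-second limit, and the 5-second interval are assumptions chosen for illustration; the URL in the usage comment is a placeholder for the API address shown for your container.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for url, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered with an error code, e.g. 502 while the WebUI loads.
        return err.code
    except (urllib.error.URLError, OSError):
        # No answer at all (connection refused, DNS failure, timeout).
        return 0


def wait_until_ready(url, max_wait=120.0, interval=5.0,
                     fetch=fetch_status, sleep=time.sleep):
    """Poll url until it returns 200 or max_wait seconds of waiting have passed."""
    waited = 0.0
    while True:
        if fetch(url) == 200:
            return True
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval


# Usage (placeholder address, replace with your container's API address):
# ready = wait_until_ready("https://your-api-address.example")
# print("WebUI is ready" if ready else "WebUI is still loading")
```

The `fetch` and `sleep` parameters exist only so the loop can be tried out without a network connection or real delays.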

2. Use the documentation

Click Manual and select a language to view the detailed usage guide for each module. This tutorial covers four modules: Training, Evaluation, Prediction, and Download.

3. Brief usage examples

3.1 Training

Click the Training module, select the model to train under Protein Language Model, and configure the training data under Dataset Configuration.

To use your own dataset, choose the Use Custom Dataset option and fill in the path to your dataset (see the Manual for details).

Set the save path for the trained model and click Start to begin training.

Once training is running, you can see the training parameters and the loss curve.

3.2 Evaluation

Click the Evaluation module, set the path of the model saved during training and select the model that was trained, set up the evaluation data, adjust the hyperparameters, and start the evaluation.

3.3 Prediction

Click the Prediction module, set the path of the model saved during training and select the model that was trained, enter the protein sequence you want to predict, and click Predict.

Protein sequence example: MKTWFGHVLQ
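Before pasting a sequence into the Prediction module, it can help to check that it contains only valid residue letters. The sketch below is an illustration and not part of VenusFactory: it assumes the input should use the 20 standard one-letter amino acid codes, and whether the platform also accepts ambiguity codes such as X is not covered here. The helper names are hypothetical.

```python
# The 20 standard amino acids in one-letter code.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw):
    """Remove all whitespace (spaces, newlines) and upper-case the sequence."""
    return "".join(raw.split()).upper()


def invalid_residues(sequence):
    """Return (1-based position, character) for each non-standard residue."""
    return [
        (position, residue)
        for position, residue in enumerate(sequence, start=1)
        if residue not in STANDARD_AMINO_ACIDS
    ]


# Usage with the example sequence from this tutorial:
sequence = clean_sequence("MKTWFGHVLQ")
problems = invalid_residues(sequence)
if problems:
    print("Non-standard residues found:", problems)
else:
    print("Sequence looks valid:", sequence)
```

Reporting positions rather than a plain true/false answer makes it easier to locate a stray character in a long sequence.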

3.4 Download

Click the Download module to download protein data from this interface.
