HyperAI超神経

Fine-Tuning Qwen 2.5 VL to Extract Handwritten Text from Norwegian Phenology Dataset

2 days ago

Fine-Tuning Visual Large Language Models for Document Understanding

In this article, Eivind Kjosbakken and Lars Aurdal explore how to fine-tune Visual Large Language Models (VLMs) such as Qwen 2.5 VL 7B to accurately extract handwritten text from images. Their work, conducted at Findable, aims to digitize and share a valuable Norwegian phenology dataset that could have significant implications for climate research. The authors also presented this topic at the Data & Draft event hosted by Netlight.

Motivation and Goal

The primary objective is to demonstrate how to fine-tune a VLM to optimize its performance at extracting handwritten text. The dataset in question is a collection of around 82,000 images containing tabular handwritten data from the Norwegian phenology dataset. This dataset is crucial for climate research, capturing long-term changes in plant flowering and other ecological events.

Why Use VLMs?

While traditional Optical Character Recognition (OCR) engines like Tesseract, DocTR, and EasyOCR are effective for standardized text, they often fail to handle the complexities of handwritten text. VLMs such as Qwen 2.5 VL excel in this area due to their advanced training methods and their ability to understand context, which is vital for distinguishing between similar-looking characters (e.g., "1" and "7").

Advantages of VLMs

- Superior OCR performance: VLMs are trained on diverse datasets, including OCR-specific ones, which makes them more adept at recognizing handwritten text.
- Contextual understanding: Unlike traditional OCR engines, VLMs can infer context from images, reducing errors in ambiguous cases.
- Customizable instructions: You can give a VLM detailed instructions on how to interpret and extract the text, a feature unavailable with conventional OCR models.

Challenges with Handwritten Text

Handwritten text poses unique challenges due to its variability.
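As a sketch of what such customizable instructions can look like, the snippet below builds a Qwen-style chat message that pairs an image with an OCR instruction. The prompt wording and the helper function are our illustration, not the authors' actual prompt; Qwen-style chat models accept a list of role/content messages in which image and text inputs are mixed inside the user turn.

```python
# Hypothetical instruction prompt for a VLM-based OCR pass over one table cell.
OCR_INSTRUCTION = (
    "Extract the handwritten text in this table cell. "
    "The digits 1 and 7 can look alike; ignore cell borders and "
    "background dots. If the cell is blank, answer with an empty string."
)

def build_vlm_messages(image_path: str) -> list:
    """Build a Qwen-2.5-VL-style chat message combining image + instruction."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": OCR_INSTRUCTION},
            ],
        }
    ]

# Usage: this message list would be passed to the model's chat template.
messages = build_vlm_messages("cell_0001.png")
```

The key point is that the instruction travels with every image, so dataset-specific guidance (blank cells, border noise) steers the model without retraining.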
For instance, the digit "1" can look similar to "7", and cell borders in the images can be misinterpreted as characters. The authors manually inspected the dataset to identify these issues, noting that:

- "1" and "7" are often hard to distinguish.
- Dots and noise in the background can interfere with text recognition.
- Cell borders can be mistaken for actual characters.
- Parentheses and brackets can look similar to each other.
- Some text is faint and difficult to read.

Annotation and Fine-Tuning Process

To fine-tune the Qwen model, the authors followed a three-step iterative process:

1. Predict: Use the base model to make initial predictions on a few hundred images.
2. Review & correct: Manually review and correct the model's mistakes, ensuring high label accuracy.
3. Retrain: Fine-tune the model on the corrected labels and repeat until performance stabilizes.

Predict

The first step runs the base Qwen model on a subset of images to generate initial labels. This helps in quickly producing a large number of labeled samples.

Review & Correct

The second step is crucial for maintaining label accuracy. The authors set up a Jupyter notebook environment that displays images alongside their corresponding labels, making it easy to review and correct any mistakes. They noted that even a small percentage of incorrect labels can significantly degrade model performance.

Retrain

Finally, the model is fine-tuned on the corrected labels. The authors used the Unsloth package, which provides a convenient notebook for fine-tuning. They iteratively repeated the predict, review, and retrain cycle, monitoring model performance on a separate test set.

Supervised Fine-Tuning (SFT) Technical Details

SFT adjusts the model's weights to perform better on the specific dataset. Key considerations include:

- Label correctness: Ensuring that nearly all labels are accurate is paramount; even 0.5% label errors can reduce model performance.
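The predict → review & correct → retrain cycle can be sketched as a small loop. The model class and the review function below are hypothetical stand-ins for illustration, not the authors' code:

```python
class StubVLM:
    """Hypothetical stand-in for a fine-tunable OCR model."""
    def __init__(self):
        self.samples_trained_on = 0

    def predict(self, image):
        # A real model would run inference; this stub always reads "7".
        return "7"

    def finetune(self, labeled):
        self.samples_trained_on += len(labeled)

def labeling_round(model, images, review):
    # 1. Predict: generate candidate labels with the current model
    preds = {img: model.predict(img) for img in images}
    # 2. Review & correct: a human fixes the model's mistakes
    labels = {img: review(img, pred) for img, pred in preds.items()}
    # 3. Retrain: fine-tune on the corrected labels
    model.finetune(labels)
    return labels

model = StubVLM()
# Hypothetical manual correction: "a.png" actually contains a "1".
labels = labeling_round(
    model,
    ["a.png", "b.png"],
    review=lambda img, pred: "1" if img == "a.png" else pred,
)
print(labels)  # {'a.png': '1', 'b.png': '7'}
```

Each round the corrected labels feed the next fine-tune, so the model's own improving predictions reduce the manual review effort over time.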
- Data balancing: The dataset contains a large proportion of blank images (around 70%). To focus on meaningful text extraction, the authors balanced the dataset so that blank images make up at most 30% of the training set.
- Layer selection: Ideally all layers would be fine-tuned, but compute constraints may necessitate tuning only the vision and vision-language adapter layers.
- Hyperparameter search: A hyperparameter search helped find the optimal fine-tuning parameters. Given the small image size and the 7B model, training took only 10-20 minutes per cycle, making hyperparameter optimization feasible.

Results and Plots

After several iterations of the fine-tuning process, the authors achieved significant improvements in model performance. They tested the fine-tuned model on four test sets, each containing 278 samples. The results showed:

- EasyOCR performed poorly, with accuracy ranging from 30-50%.
- The base Qwen 2.5 VL 7B model achieved 93-99% accuracy.
- The fine-tuned Qwen model outperformed the base model, with accuracy ranging from 94-99.5%.

The fine-tuning process successfully addressed the initial challenges, particularly distinguishing "1" from "7" and handling faint text. The extracted data, including tree line numbers, was plotted on a map of Norway using Uber's H3. The visualization revealed patterns consistent with expected climate trends: lower tree lines near the coast and in northern regions, and higher tree lines inland.

All the code and data used in this project are available on GitHub and Hugging Face, respectively. The extracted phenology data, complete with geographical coordinates, can be found in a Parquet file on Hugging Face.

Evaluation and Company Profile

Industry insiders praise the approach taken by Kjosbakken and Aurdal, highlighting the practical application of VLMs in real-world data extraction tasks.
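A minimal way to score OCR outputs like those above is exact-match accuracy over a test set. The article does not spell out its metric, so this is a sketch of the simplest plausible choice (function name is ours):

```python
def exact_match_accuracy(predictions, labels):
    """Fraction of samples where the predicted string equals the
    ground-truth label exactly, after stripping surrounding whitespace."""
    assert len(predictions) == len(labels)
    hits = sum(p.strip() == t.strip() for p, t in zip(predictions, labels))
    return hits / len(labels)

# Usage: one wrong cell out of four gives 75% accuracy.
score = exact_match_accuracy(["17", "", "3 ", "7"], ["17", "", "3", "1"])
print(score)  # 0.75
```

Exact match is a strict metric for handwriting: a single confused "1"/"7" counts the whole cell as wrong, which is exactly the failure mode the fine-tuning targets.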
The ability to fine-tune models for specific challenges is a game-changer in the field of OCR, particularly for historical and handwritten datasets.

Findable is a data science company focused on digitizing and analyzing valuable datasets. The company emphasizes the importance of thorough data inspection and high-quality annotation, critical steps in machine learning projects that are often overlooked. Their work on the Norwegian phenology dataset exemplifies their commitment to leveraging cutting-edge technology to advance scientific research.
