PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer

Handwritten Mathematical Expression Recognition (HMER) has wide applications in human-machine interaction scenarios, such as digitized education and automated offices. Recently, sequence-based models with encoder-decoder architectures have been commonly adopted to address this task by directly predicting LaTeX sequences from expression images. However, these methods only implicitly learn the syntax rules provided by LaTeX and may fail to describe the positional and hierarchical relationships between symbols under complex structural relations and diverse handwriting styles. To overcome this challenge, we propose a Position Forest Transformer (PosFormer) for HMER, which jointly optimizes two tasks, expression recognition and position recognition, to explicitly enable position-aware symbol feature representation learning. Specifically, we first design a position forest that models a mathematical expression as a forest structure and parses the relative position relationships between symbols. Without requiring extra annotations, each symbol is assigned a position identifier in the forest to denote its relative spatial position. Second, we propose an implicit attention correction module to accurately capture attention for HMER in the sequence-based decoder architecture. Extensive experiments validate the superiority of PosFormer, which consistently outperforms state-of-the-art methods with gains of 2.03%/1.22%/2.00%, 1.83%, and 4.62% on the single-line CROHME 2014/2016/2019, multi-line M2E, and complex MNE datasets, respectively, with no additional latency or computational cost. Code is available at https://github.com/SJTU-DeepVisionLab/PosFormer.
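To make the position-forest idea concrete, the following is a minimal, hypothetical sketch of how each symbol in a tokenized LaTeX expression could be assigned a position identifier encoding its nested spatial relations (superscript, subscript, fraction numerator/denominator). The function name `position_forest` and the relation labels are illustrative assumptions; the paper's actual identifier scheme and forest construction are defined in the full method section, not here.

```python
# Illustrative sketch: assign each symbol token a path of spatial
# relations derived from LaTeX structure tokens. This is NOT the
# paper's exact encoding, only a toy demonstration of the concept.

def position_forest(tokens):
    """Map each symbol token to a tuple of nested spatial relations."""
    path = []     # current stack of active relations, e.g. ["sup"]
    pending = []  # relations waiting for their opening "{" group
    result = []
    for tok in tokens:
        if tok == "^":
            pending.append("sup")            # next group is a superscript
        elif tok == "_":
            pending.append("sub")            # next group is a subscript
        elif tok == "\\frac":
            pending.extend(["den", "num"])   # numerator group comes first
        elif tok == "{":
            path.append(pending.pop())       # enter the nested region
        elif tok == "}":
            path.pop()                       # leave the nested region
        else:
            # An actual symbol: record it with its current position path.
            result.append((tok, tuple(path)))
    return result
```

For example, the tokens of `x^{2} + \frac{a}{b}` yield `x` and `+` on the main line (empty path), `2` under `("sup",)`, `a` under `("num",)`, and `b` under `("den",)`, so symbols with identical appearance but different spatial roles receive distinct identifiers.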