HyperAIHyperAI

Command Palette

Search for a command to run...

a month ago
Transformer

Model retrieves cross-script names via contrastive learning

A new lightweight transformer model has been developed to solve the critical problem of cross-script name retrieval, addressing the silent failure mode where systems cannot match names written in different alphabets. Traditional methods like edit distance or phonetic hashing fail when script boundaries are crossed, such as matching the Russian "Владимир Путин" against the Latin "Vladimir Putin." Researchers trained a compact transformer encoder from scratch using raw UTF-8 bytes, eliminating the need for tokenizers, pre-trained backbones, or script detection. The model consists of only six layers and roughly four million parameters. By training the network to map byte sequences from different scripts into similar vector spaces, the system achieves a Mean Reciprocal Rank (MRR) of 0.775 and a Recall@10 of 0.897 across eight non-Latin scripts. This reduces the performance gap between Latin and non-Latin queries by ten times compared to the best classical baselines. The training pipeline generated a massive dataset of 4.67 million positive name pairs through a four-stage process. First, 119,040 entities were stratified from Wikidata to ensure balanced script coverage. Second, an LLM generated phonetic variants of English names to simulate real-world misspellings. Third, another LLM transliterated these variants into Arabic, Russian, Chinese, Japanese, Hebrew, Hindi, Greek, and Korean. Finally, the data was merged and tagged for training. Training utilized InfoNCE loss combined with Approximate Nearest Neighbor Contrastive Estimation (ANCE). To address the challenge of distinguishing phonetically similar names, the system employed hard negative mining. Early in training, random negatives were used to build basic structure. Subsequently, a FAISS index was rebuilt periodically to mine hard negatives—names that the model currently confused—sharpening the embedding space. Evaluation showed the byte-level encoder significantly outperformed classical baselines. While traditional methods scored near zero on cross-script queries, the new model maintained high performance across all query types. It reduced the script gap to just 0.096 and improved Latin-only retrieval from 0.944 to 0.983. Performance varied slightly by script, with Arabic and Russian reaching over 0.95 R@10, while Chinese and Korean lagged at 0.666 and 0.728. This discrepancy stems from inherent romanization ambiguity in Chinese and Korean, where a single character maps to multiple Latin spellings. The study highlights that byte-level tokenization is a robust solution for multilingual tasks where surface form matters more than semantics. It also demonstrates that LLMs can serve as effective data engines for generating training pairs in low-resource scenarios. The researchers note a limitation: the training data relies on LLM-generated transliterations from Latin, meaning the model has not seen native-script spelling variations within scripts like Chinese or Arabic. Future work could add a stage to generate native-script variants to further close this gap. The full code, dataset pipeline, and evaluation scripts are available on GitHub. The findings suggest that for enterprise applications involving immigration databases or financial compliance, this approach offers a scalable, accurate alternative to existing fragmented tools.

Related Links

Model retrieves cross-script names via contrastive learning | Trending Stories | HyperAI