ManyTypes4TypeScript: A Comprehensive TypeScript Dataset for Sequence-Based Type Inference
{Premkumar T. Devanbu, Kevin Jesse}
Abstract
In this paper, we present ManyTypes4TypeScript, a very large corpus for training and evaluating machine-learning models for sequence-based type inference in TypeScript. The dataset includes over 9 million type annotations, across 13,953 projects and 539,571 files. The dataset is approximately 10x larger than analogous type inference datasets for Python, and is the largest available for TypeScript. We also provide API access to the dataset, which can be integrated into any tokenizer and used with any state-of-the-art sequence-based model. Finally, we provide analysis and performance results for state-of-the-art code-specific models, for baselining. ManyTypes4TypeScript is available on Huggingface, Zenodo, and CodeXGLUE.
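To illustrate what sequence-based type inference means in practice, the following is a minimal sketch of how a type-annotated snippet could be framed as a token-labeling problem. The `TypedSequence` shape, its field names, and the convention of attaching a function's return type to its name are assumptions made for illustration only; they are not taken from the dataset's actual schema or API.

```typescript
// Hypothetical record shape (an assumption, not the dataset's schema):
// one label per token, with null where no type annotation applies.
interface TypedSequence {
  tokens: string[];
  labels: (string | null)[];
}

// Original source:
//   function add(a: number, b: number): number { return a + b; }
// The annotations are removed from the token stream and kept as labels.
// Labeling the function name with its return type is an illustrative choice.
const example: TypedSequence = {
  tokens: ["function", "add", "(", "a", ",", "b", ")", "{", "return", "a", "+", "b", ";", "}"],
  labels: [null, "number", null, "number", null, "number", null, null, null, null, null, null, null, null],
};

// Collect the (token, type) pairs a sequence model would be trained to predict.
function annotatedTokens(seq: TypedSequence): Array<[string, string]> {
  if (seq.tokens.length !== seq.labels.length) {
    throw new Error("tokens and labels must have the same length");
  }
  const pairs: Array<[string, string]> = [];
  seq.tokens.forEach((token, i) => {
    const label = seq.labels[i];
    if (typeof label === "string") {
      pairs.push([token, label]);
    }
  });
  return pairs;
}

const pairs = annotatedTokens(example);
console.log(pairs);
```

Under this framing, a model receives the unannotated token sequence and predicts a type label at each annotated position, which is why any tokenizer that preserves the token-to-label alignment can be used.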