CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

Pre-trained models for Natural Language (NL) like BERT and GPT have recently been shown to transfer well to Programming Languages (PL) and largely benefit a broad set of code-related tasks. Despite their success, most current methods either rely on an encoder-only (or decoder-only) pre-training that is suboptimal for generation (resp. understanding) tasks, or process the code snippet in the same way as NL, neglecting the special characteristics of PL such as token types. We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed by the developer-assigned identifiers. Our model employs a unified framework to seamlessly support both code understanding and generation tasks and allows for multi-task learning. In addition, we propose a novel identifier-aware pre-training task that enables the model to distinguish which code tokens are identifiers and to recover them when they are masked. Furthermore, we propose to exploit the user-written code comments with a bimodal dual generation task for better NL-PL alignment. Comprehensive experiments show that CodeT5 significantly outperforms prior methods on understanding tasks such as code defect detection and clone detection, and on generation tasks across various directions including PL-NL, NL-PL, and PL-PL. Further analysis reveals that our model can better capture semantic information from code. Our code and pre-trained models are released at https://github.com/salesforce/CodeT5 .
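As a minimal illustration of using the released models, the sketch below loads a CodeT5 checkpoint with the Hugging Face `transformers` library and asks it to fill in a masked span in a code snippet, mirroring the identifier-aware span prediction described above. The checkpoint name `Salesforce/codet5-base` and the example snippet are assumptions for illustration; see the repository above for the officially supported usage.

```python
# Sketch: load a CodeT5 checkpoint and predict a masked span in code.
# Assumes the released checkpoint is available as "Salesforce/codet5-base".
from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

# A code snippet with one masked span (T5-style sentinel token).
text = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# Generate a completion for the masked span; the model is expected to
# propose a plausible identifier-based expression for the hole.
generated_ids = model.generate(input_ids, max_length=10)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```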