Learning and Evaluating Contextual Embedding of Source Code

Jul 12, 2020



Recent research has achieved impressive results on understanding and improving source code by building up on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come with the development of pre-trained con-textual embeddings such as BERT. These can be fine-tuned for downstream tasks with less labeled data and training budget, while achieving better accuracies. However, there is no attempt yet to obtain a high-quality contextual embedding of source code, and to evaluate it on multiple tasks simultaneously. In this paper, we alleviate this gap by curating a code-understanding benchmark and evaluating a learned contextual embedding of source code. More specifically, we curate a massive, deduplicated corpus of Python code from GitHub and train a BERT model, which we call B4C. We also create a benchmark comprising five classification tasks and one program-repair task, akin to code-understanding tasks proposed in the literature before. For comparison, we train different variants of Word2Vec token embeddings, and BiLSTM and Transformer models. For the repair task, we also compare to SOTA models. We show that fine-tuned B4C models give better results, even with shorter training or fewer examples. Future work on source-code embedding could benefit from reusing our benchmark and comparing against B4C as a strong baseline.



About ICML 2020

The International Conference on Machine Learning (ICML) is the premier gathering of professionals dedicated to the advancement of the branch of artificial intelligence known as machine learning. ICML is globally renowned for presenting and publishing cutting-edge research on all aspects of machine learning used in closely related areas like artificial intelligence, statistics and data science, as well as important application areas such as machine vision, computational biology, speech recognition, and robotics. ICML is one of the fastest growing artificial intelligence conferences in the world. Participants at ICML span a wide range of backgrounds, from academic and industrial researchers, to entrepreneurs and engineers, to graduate students and postdocs.

Store presentation

Should this presentation be stored for 1000 years?

How do we store presentations

Total of 0 viewers voted for saving the presentation to eternal vault which is 0.0%


Recommended Videos

Presentations on similar topic, category or speaker