POSIT: Simultaneously Tagging Natural and Programming Languages
Partachi, Profir-Petru, Dash, Santanu, Treude, Christoph and Barr, Earl T. (2019) POSIT: Simultaneously Tagging Natural and Programming Languages. In: International Conference on Software Engineering (ICSE), May 23-29, 2020, Seoul, South Korea.
Text: icse-main-984.pdf - Accepted version manuscript (927kB)
Abstract
Software developers use a mix of source code and natural language text to communicate with each other: Stack Overflow and developer mailing lists abound with this mixed text. Tagging this mixed text is essential for making progress on two seminal software engineering problems: traceability, and reuse via precise extraction of code snippets from mixed text. In this paper, we borrow code-switching techniques from Natural Language Processing and adapt them to mixed text to solve two problems: language identification and token tagging. Our technique, POSIT, simultaneously provides abstract syntax tree tags for source code tokens and part-of-speech tags for natural language words, and predicts the source language of each token in mixed text. To realize POSIT, we trained a biLSTM network with a Conditional Random Field output layer using abstract syntax tree tags from the CLANG compiler and part-of-speech tags from the Standard Stanford part-of-speech tagger. POSIT improves on the state of the art in language identification by 10.6% and in PoS/AST tagging by 23.7% in accuracy.
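To illustrate what a Conditional Random Field output layer contributes on top of per-token scores, the following is a minimal sketch of Viterbi decoding for a generic linear-chain CRF. It is not the authors' implementation: the function name `viterbi_decode`, the tag names in `TAGS`, and all scores are hypothetical, chosen only to show how transition scores can keep neighbouring tokens in the same language.

```python
from typing import List, Sequence


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
) -> List[int]:
    """Return the highest-scoring tag index sequence for a linear-chain CRF.

    emissions[t][k]   : score of tag k at position t (e.g. produced by a biLSTM)
    transitions[j][k] : score of moving from tag j to tag k
    """
    if not emissions:
        return []
    n_tags = len(emissions[0])
    # best[k] = best score of any path that ends in tag k at the current position
    best = list(emissions[0])
    backpointers: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_best = []
        pointers = []
        for k in range(n_tags):
            # Choose the previous tag that maximises the score of arriving at tag k.
            prev = max(range(n_tags), key=lambda j: best[j] + transitions[j][k])
            new_best.append(best[prev] + transitions[prev][k] + emissions[t][k])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final tag to recover the full path.
    last = max(range(n_tags), key=lambda k: best[k])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical joint tags: each tag encodes both the language and the PoS/AST label.
TAGS = ["NL:verb", "NL:noun", "CODE:identifier"]

# Made-up scores for the three tokens "use", "parse_args", "function".
emissions = [
    [2.0, 1.0, 0.5],
    [0.2, 0.8, 3.0],
    [0.1, 1.5, 1.4],
]
# Assumed small penalty for switching between natural language and code.
transitions = [
    [0.0, 0.0, -0.5],
    [0.0, 0.0, -0.5],
    [-0.5, -0.5, 0.0],
]

decoded = [TAGS[i] for i in viterbi_decode(emissions, transitions)]
```

With these toy numbers, `decoded` is `["NL:verb", "CODE:identifier", "CODE:identifier"]`: the last token would be tagged `NL:noun` if each position were decoded independently, but the switching penalty makes staying in code the higher-scoring sequence.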
Item Type: | Conference or Workshop Item (Conference Paper) |
---|---|
Divisions: | Faculty of Engineering and Physical Sciences > Computer Science |
Authors: | Partachi, Profir-Petru; Dash, Santanu; Treude, Christoph; Barr, Earl T. |
Date: | 8 December 2019 |
Funders: | EPSRC |
Grant Title: | EPSRC Grant |
Copyright Disclaimer: | © 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM. |
Uncontrolled Keywords: | Part-of-Speech Tagging, Mixed-Code, Code-Switching, Language Identification |
Depositing User: | James Marshall |
Date Deposited: | 02 Mar 2020 10:44 |
Last Modified: | 02 Mar 2020 10:44 |
URI: | http://epubs.surrey.ac.uk/id/eprint/853851 |