University of Surrey

POSIT: Simultaneously Tagging Natural and Programming Languages

Partachi, Profir-Petru, Dash, Santanu, Treude, Christoph and Barr, Earl T. (2019) POSIT: Simultaneously Tagging Natural and Programming Languages. In: International Conference on Software Engineering (ICSE), May 23-29, 2020, Seoul, South Korea.

Text
icse-main-984.pdf - Accepted version Manuscript

Abstract

Software developers use a mix of source code and natural language text to communicate with each other: Stack Overflow and developer mailing lists abound with this mixed text. Tagging this mixed text is essential for making progress on two seminal software engineering problems: traceability, and reuse via precise extraction of code snippets from mixed text. In this paper, we borrow code-switching techniques from Natural Language Processing and adapt them to mixed text to solve two problems: language identification and token tagging. Our technique, POSIT, simultaneously provides abstract syntax tree tags for source code tokens and part-of-speech tags for natural language words, and predicts the source language of each token in mixed text. To realize POSIT, we trained a biLSTM network with a Conditional Random Field output layer using abstract syntax tree tags from the Clang compiler and part-of-speech tags from the standard Stanford part-of-speech tagger. POSIT improves on the state of the art in language identification by 10.6% and in PoS/AST tagging by 23.7% in accuracy.
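
To make the role of a Conditional Random Field output layer more concrete, the following is a minimal sketch of Viterbi decoding, the standard way to pick the best tag sequence from per-token scores and tag-to-tag transition scores. It is not the authors' implementation. The joint `LANG:TAG` label encoding, the tag names, the example tokens, and all scores are hypothetical, and the per-token scores stand in for what a trained biLSTM would produce.

```python
from typing import Dict, List, Tuple


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float],
            tags: List[str]) -> List[str]:
    """Return the highest-scoring tag sequence (missing scores count as 0)."""
    if not emissions:
        return []
    # best[t][tag]: best score of any tag path that ends in `tag` at token t
    best = [{tag: emissions[0].get(tag, 0.0) for tag in tags}]
    back: List[Dict[str, str]] = []  # back[t-1][tag]: best predecessor of tag
    for scores in emissions[1:]:
        row: Dict[str, float] = {}
        ptr: Dict[str, str] = {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p]
                       + transitions.get((p, cur), 0.0))
            row[cur] = (best[-1][prev] + transitions.get((prev, cur), 0.0)
                        + scores.get(cur, 0.0))
            ptr[cur] = prev
        best.append(row)
        back.append(ptr)
    # Follow the back-pointers from the best final tag to recover the path.
    path = [max(tags, key=lambda t: best[-1][t])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Hypothetical joint labels: language ("NL" or "CODE") plus a PoS or AST tag.
TAGS = ["NL:VERB", "NL:ADV", "CODE:call_expr"]

# Made-up scores for the mixed-text tokens: "call", "foo()", "first".
emissions = [
    {"NL:VERB": 2.0, "CODE:call_expr": 1.0},
    {"CODE:call_expr": 3.0, "NL:VERB": 0.5},
    {"NL:ADV": 2.0, "NL:VERB": 0.5},
]
transitions = {("NL:VERB", "CODE:call_expr"): 0.5}

path = viterbi(emissions, transitions, TAGS)
# Split each joint label into (language, tag) for the two outputs.
decoded = [tuple(label.split(":", 1)) for label in path]
print(decoded)
```

With these toy scores the decoder labels the middle token as code and the surrounding words as natural language, which is the kind of per-token language and tag output the abstract describes.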

Item Type: Conference or Workshop Item (Conference Paper)
Divisions : Faculty of Engineering and Physical Sciences > Computer Science
Authors :
Name | Email | ORCID
Partachi, Profir-Petru | |
Dash, Santanu | s.dash@surrey.ac.uk |
Treude, Christoph | |
Barr, Earl T. | |
Date : 8 December 2019
Funders : EPSRC
Grant Title : EPSRC Grant
Copyright Disclaimer : © 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM.
Uncontrolled Keywords : Part-of-Speech Tagging, Mixed-Code, Code-Switching, Language Identification
Depositing User : James Marshall
Date Deposited : 02 Mar 2020 10:44
Last Modified : 02 Mar 2020 10:44
URI: http://epubs.surrey.ac.uk/id/eprint/853851
