University of Surrey

Automatic term extraction and text categorisation.

Manomaisupat, Pensiri. (2006) Automatic term extraction and text categorisation. Doctoral thesis, University of Surrey (United Kingdom).

Full text is not currently available. Please contact sriopenaccess@surrey.ac.uk, should you require it.

Abstract

Automatic text categorisation is a major challenge for information retrieval, information extraction, and semantic web projects. The categorisation of texts depends on the 'meaning' of the individual texts: texts sharing the same meaning should be categorised together, and those with different meanings should be categorised separately. This is an intelligent task that requires knowledge of a given domain and expertise in text categorisation. In a specialist domain, meaning is expressed through keywords; these keywords change over time as new keywords are added and old ones removed. I present a method in which keywords are extracted automatically from a large collection of texts and then used to train neural computing systems, so that in a limited way the systems 'learn' to categorise. A number of different techniques for extracting keywords are presented. The keywords were extracted using the traditional tf-idf metric and weirdness, a technique used in contrastive corpus linguistics; multi-word compounds have also been used to create vectors for text collections. For the 'learning' algorithms, we used unsupervised self-organising maps and supervised support vector machines; both have been used for the purposes of dimension reduction, that is, mapping from a high-dimensional feature space to a lower-dimensional one without much loss of accuracy. The performance of such systems is evaluated through classification accuracy and average quantisation error. Three large text collections were used for training and testing: the TREC-AP newswire, the Reuters RCV1 and streaming news from Reuters Financial. The focus of the experimentation was on financial news. An archetype was developed that incorporates text analysis, terminology management, neural computing and feature vector generation systems. A novel evaluation scheme is reported in which a vector of randomly selected words from a text collection is used as a baseline. The other comparisons are between systems trained by different techniques and with different learning algorithms. The key results include the finding that classification accuracy is highest when automatically extracted compound terms are chosen for creating vectors; however, when a terminology-based method was used to create vectors, the single words from this method appear to be a better vector for training. The results of the experiments are encouraging. With further research, improvements in quantitative performance can be expected.
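
The two term-scoring measures named in the abstract, tf-idf and weirdness, can be illustrated with a minimal sketch. This is not the implementation used in the thesis, whose full text is not available here. It assumes the common definitions: weirdness as the ratio of a word's relative frequency in a specialist corpus to its relative frequency in a general corpus, and tf-idf as term frequency multiplied by the log of inverse document frequency. The function names, the add-one smoothing, and the toy corpora are all illustrative assumptions.

```python
import math
from collections import Counter


def weirdness(special_tokens, general_tokens):
    """Relative frequency in the specialist corpus divided by relative
    frequency in the general corpus (both corpora assumed non-empty)."""
    fs = Counter(special_tokens)
    fg = Counter(general_tokens)
    ns, ng = len(special_tokens), len(general_tokens)
    # Assumption: add-one smoothing on the general-corpus count, so that
    # words absent from the general corpus do not cause division by zero.
    return {w: (c / ns) / ((fg[w] + 1) / ng) for w, c in fs.items()}


def tfidf(docs):
    """docs is a list of token lists; returns one {term: score} dict per doc."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return scores


# Toy corpora (hypothetical, for illustration only).
special = "bond yield rises as bond prices fall".split()
general = "the cat sat on the mat as the rain fell".split()

w = weirdness(special, general)
# Candidate keywords: the words with the highest weirdness scores.
candidates = sorted(w, key=w.get, reverse=True)[:3]
```

In this sketch, a word such as "bond", frequent in the specialist text and absent from the general text, scores higher than a function word such as "as", which appears in both.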

Item Type: Thesis (Doctoral)
Divisions : Theses
Authors :
Name | Email | ORCID
Manomaisupat, Pensiri. | UNSPECIFIED | UNSPECIFIED
Date : 2006
Contributors :
Contribution | Name | Email | ORCID
http://www.loc.gov/loc.terms/relators/THS | UNSPECIFIED | UNSPECIFIED | UNSPECIFIED
Depositing User : EPrints Services
Date Deposited : 09 Nov 2017 12:16
Last Modified : 09 Nov 2017 14:45
URI: http://epubs.surrey.ac.uk/id/eprint/844130

© The University of Surrey, Guildford, Surrey, GU2 7XH, United Kingdom.
+44 (0)1483 300800