University of Surrey


Machine Learning and Data Validation.

Pantziarka, P. (2005) Machine Learning and Data Validation. Doctoral thesis, University of Surrey (United Kingdom).

27733181.pdf
Available under License Creative Commons Attribution Non-commercial Share Alike.


Abstract

Data validation describes the process of checking the internal consistency, correctness and quality of a dataset. The role of data validation in the broader context of data quality and data cleansing is described. In particular, problems related to syntactic and semantic errors are defined, and the concept of a validation model is introduced. The role of machine learning in the building of validation models is described and a range of machine learning techniques is surveyed. A novel machine learning strategy that combines genetic algorithms and association rules to generate data validation models is proposed. An algorithm is developed to discover validation rules from numeric datasets and is implemented as a Java toolset called eaVal. A series of experiments using eaVal for data validation is carried out, and it is shown that eaVal can successfully discover validation rules that identify records within a dataset that have a high probability of containing errors. A method of post-processing the results from eaVal is proposed. This utilises Bayesian Networks, which are derived directly from the validation rules discovered by eaVal, to identify which fields within an invalid record set have the highest probability of being invalid. Experimental evidence of the efficacy of the technique is shown. The post-processing phase is shown to be a major step towards semantic data validation. A case study is also described that uses the tools and techniques described in this work to perform a data validation exercise on a clinical dataset. The case study indicates that the methods developed can provide useful information to a data analyst when validating numerical datasets. Furthermore, it is shown that the discovery of validation rules is a useful mechanism for identifying records that are interesting or unusual. Finally, current limitations and future directions of this work are discussed.
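As a rough illustration of what a validation rule over numeric data might look like, the sketch below treats a rule as an association rule whose antecedent and consequent are numeric ranges, measures its support and confidence, and flags the records that match the antecedent but break the consequent. This is not eaVal's code or its rule format: the classes `Interval` and `Rule`, the function `support_and_confidence`, the field names and the sample records are all hypothetical, and the rule is written by hand rather than discovered by a genetic algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Hypothetical condition: a numeric field must lie in [low, high]."""
    field: str
    low: float
    high: float

    def holds(self, record: dict) -> bool:
        return self.low <= record[self.field] <= self.high


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: if every antecedent interval holds, the consequent should hold."""
    antecedent: tuple
    consequent: Interval

    def applies(self, record: dict) -> bool:
        return all(cond.holds(record) for cond in self.antecedent)

    def violated_by(self, record: dict) -> bool:
        # A record is suspect when the rule applies but the consequent fails.
        return self.applies(record) and not self.consequent.holds(record)


def support_and_confidence(rule: Rule, records: list) -> tuple:
    """Return (support, confidence) of the rule over the records."""
    covered = [r for r in records if rule.applies(r)]
    if not covered:
        return 0.0, 0.0
    satisfied = sum(1 for r in covered if rule.consequent.holds(r))
    return satisfied / len(records), satisfied / len(covered)


# Invented sample data; the fourth record has an implausible reading.
records = [
    {"age": 34, "systolic": 120},
    {"age": 41, "systolic": 128},
    {"age": 38, "systolic": 118},
    {"age": 45, "systolic": 12},
    {"age": 72, "systolic": 150},
]

# Hand-written rule: age in [30, 50] implies systolic in [90, 140].
rule = Rule((Interval("age", 30, 50),), Interval("systolic", 90, 140))

support, confidence = support_and_confidence(rule, records)
suspects = [r for r in records if rule.violated_by(r)]
print(support, confidence, suspects)
```

A rule with high confidence holds for most of the records it covers, so the few records that violate it are reasonable candidates for closer inspection. The last record is outside the antecedent range, so the rule says nothing about it.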

Item Type: Thesis (Doctoral)
Divisions : Theses
Authors : Pantziarka, P.
Date : 2005
Additional Information : Thesis (Ph.D.)--University of Surrey (United Kingdom), 2005.
Depositing User : EPrints Services
Date Deposited : 06 May 2020 14:23
Last Modified : 06 May 2020 14:30
URI: http://epubs.surrey.ac.uk/id/eprint/856182

© The University of Surrey, Guildford, Surrey, GU2 7XH, United Kingdom.
+44 (0)1483 300800