University of Surrey

Interpretation and mining of statistical machine learning (Q)SAR models for toxicity prediction.

Webb, Samuel J. (2015) Interpretation and mining of statistical machine learning (Q)SAR models for toxicity prediction. Doctoral thesis, University of Surrey.

This is the latest version of this item.

Text (Final thesis version, PDF)
SamWebb_thesis_corrected_for_electronic_submission.pdf - Thesis (version of record)
Restricted to Repository staff only
Available under License Creative Commons Attribution Non-commercial Share Alike.

Download (9MB)
Text (author deposit agreement)
2014_08_13_Author_Deposit_Agreement_sjw.docx - Supplemental Material
Restricted to Repository staff only
Available under License Creative Commons Attribution Non-commercial Share Alike.

Download (42kB)

Abstract

Structure Activity Relationship (SAR) modelling capitalises on techniques developed within the computer science community, particularly in the fields of machine learning and data mining. These machine learning approaches are often developed to optimise model accuracy, which can come at the expense of the interpretability of the prediction. Highly predictive models should be the goal of any modeller; however, the intended users of the model and all factors relating to its usage should also be considered. One such factor is the clarity, understanding and explanation of the prediction. In some cases, black-box models that do not provide an interpretation can be disregarded regardless of their predictive accuracy. In this thesis, the problem of model interpretation has been tackled in the context of models that predict the toxicity of drug-like molecules. Firstly, a novel algorithm has been developed for the interpretation of binary classification models where the endpoint meets defined criteria: activity is caused by the presence of a feature, and inactivity by the lack of an activating feature or the deactivation of all such activating features. This algorithm has been shown to provide a meaningful interpretation of the model’s cause(s) of both active and inactive predictions for two toxicological endpoints: mutagenicity and skin irritation. The algorithm shows benefits over other interpretation algorithms in its ability not only to identify the causes of activity mapped to fragments and physicochemical descriptors but also to account for combinatorial effects of the descriptors. The interpretation is presented to the user in the form of the impact of features and can be visualised as a concise summary or as a hierarchical network detailing the full elucidation of the model’s behaviour for a particular query compound. The interpretation output has been capitalised on and incorporated into a knowledge mining strategy.
The knowledge mining is able to extract the learned structure activity relationship trends from a model such as a Random Forest, decision tree, k Nearest Neighbour or support vector machine. These trends can be presented to the user organised around the feature responsible for an assessment such as ACTIVATING or DEACTIVATING. Supporting examples are provided along with an estimate of the model’s predictive performance for a given SAR trend. Both the interpretation and the knowledge mining have been applied to models built for the prediction of Ames mutagenicity and skin irritation. The performance of the developed models is strong and comparable to that of both academic and commercial predictors for these two toxicological activities.

Item Type: Thesis (Doctoral)
Divisions : Theses
Authors :
Authors | Email | ORCID
Webb, Samuel J. | emailsjwebb@gmail.com | UNSPECIFIED
Date : 31 March 2015
Funders : Lhasa Limited, Technology Strategy Board
Contributors :
Contribution | Name | Email | ORCID
Thesis supervisor | Krause, Paul J. | p.krause@surrey.ac.uk | UNSPECIFIED
Thesis supervisor | Howlin, Brendan | b.howlin@surrey.ac.uk | UNSPECIFIED
Depositing User : Samuel Webb
Date Deposited : 27 Apr 2015 08:13
Last Modified : 18 Dec 2015 18:44
URI: http://epubs.surrey.ac.uk/id/eprint/807269

Available Versions of this Item

  • Interpretation and mining of statistical machine learning (Q)SAR models for toxicity prediction. (deposited 27 Apr 2015 08:13) [Currently Displayed]

© The University of Surrey, Guildford, Surrey, GU2 7XH, United Kingdom.
+44 (0)1483 300800