Logical Analysis of Biological Data
Machine learning methods and models have become indispensable for analyzing and interpreting large data sets in the biomedical field. This work focuses on the machine learning method Logical Analysis of Data (LAD), which combines concepts from combinatorics, Boolean functions and optimization. LAD is based on the generation of patterns. These patterns are used to communicate relevant information and to form theories, which serve as classifiers for predicting new data points. This thesis contributes to the practice, theory and applications of LAD. With regard to LAD practice, we present the design and development of our freely available software package AnswerSetLAD. The implementation uses the declarative programming paradigm Answer Set Programming (ASP), which is oriented towards difficult combinatorial search and optimization problems and therefore provides a well-suited framework for the LAD functionalities. We substantiate this claim with an empirical study that compares the running time of our ASP approach with that of a state-of-the-art Mixed-Integer Linear Programming (MILP) approach for the generation of maximal patterns, a specific type of LAD pattern. We present two theoretical advancements of LAD concerning prime patterns, a pattern type that plays a key role in the LAD method. Firstly, we propose an algorithm for enumerating all prime patterns of a data set. The algorithm is preferable to classical methods when the maximal Hamming distance between the two classes of data points is small. Secondly, we investigate theories composed of prime patterns. Since the number of such prime theories for a data set is large in general, we define a statistical measure that can be used to rank prime patterns and, based on this ranking, to select the more significant prime patterns to form a theory. Finally, we illustrate two biological applications of LAD.
In the first application, we use prime patterns to successfully identify protein interactions in a cell signaling network based on perturbation measurement data. The second application lies in the field of synthetic biology. Here we outline our approach to generating Boolean classifiers from miRNA data. These classifiers can be used to assemble in-vitro synthetic cell circuits that distinguish healthy from cancerous tissue.
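
To make the central notions concrete, the following minimal Python sketch illustrates positive patterns, prime patterns and the maximal Hamming distance between the two classes on a toy binary data set. It relies on the standard LAD definitions (a positive pattern is a conjunction of literals that covers at least one positive and no negative observation; it is prime if no literal can be dropped without losing this property). The data set and all function names are hypothetical, and the brute-force enumeration is only an illustration; it is neither the enumeration algorithm proposed in the thesis nor the AnswerSetLAD implementation.

```python
from itertools import combinations, product

# Hypothetical toy data: each observation is a tuple of binary features.
POSITIVE = [(1, 1, 0), (1, 0, 1)]
NEGATIVE = [(0, 1, 0), (0, 0, 1)]


def covers(term, point):
    """A term maps feature index -> required value; it covers a point
    if the point agrees with every literal of the term."""
    return all(point[i] == v for i, v in term.items())


def is_positive_pattern(term, positive, negative):
    """True if the term covers at least one positive and no negative point."""
    return (any(covers(term, p) for p in positive)
            and not any(covers(term, n) for n in negative))


def is_prime(term, positive, negative):
    """True if the term is a positive pattern and dropping any single
    literal yields a term that is no longer a positive pattern."""
    if not is_positive_pattern(term, positive, negative):
        return False
    for i in term:
        reduced = {j: v for j, v in term.items() if j != i}
        if is_positive_pattern(reduced, positive, negative):
            return False
    return True


def enumerate_prime_patterns(positive, negative):
    """Brute-force enumeration over all terms (exponential in the number
    of features); shown only to illustrate the definition."""
    n = len(positive[0])
    primes = []
    for size in range(1, n + 1):
        for idx in combinations(range(n), size):
            for vals in product((0, 1), repeat=size):
                term = dict(zip(idx, vals))
                if is_prime(term, positive, negative):
                    primes.append(term)
    return primes


def max_class_hamming_distance(positive, negative):
    """Largest Hamming distance between a positive and a negative point."""
    return max(sum(a != b for a, b in zip(p, n))
               for p in positive for n in negative)
```

In this toy data set the two classes differ systematically only in the first feature, so calling `enumerate_prime_patterns(POSITIVE, NEGATIVE)` finds the single-literal term "feature 0 equals 1" as the only prime positive pattern; longer terms containing that literal are patterns but not prime.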