PhD in Mathematics/Bioinformatics
Methods and models of machine learning have become indispensable for the analysis and interpretation of large data sets in the biomedical field. This work focuses on the machine learning method Logical Analysis of Data (LAD), which combines concepts from combinatorics, Boolean functions, and optimization. LAD is based on the generation of patterns. These patterns are used to communicate relevant information and to form theories, which serve as classifiers for the prediction of new data points. This thesis makes contributions to the practice, theory, and applications of LAD. With regard to LAD practice, we present the design and development of our freely available software package AnswerSetLAD. In the implementation we make use of the declarative programming paradigm Answer Set Programming (ASP), which is oriented towards difficult combinatorial search and optimization problems and therefore provides a well-suited framework for the LAD functionalities. We substantiate this statement with an empirical study comparing the running times of our ASP approach and of a state-of-the-art Mixed-Integer Linear Programming (MILP) approach for the generation of maximal patterns, a specific type of LAD pattern. We present two theoretical advancements of LAD concerning prime patterns, a pattern type that plays a key role in the LAD method. Firstly, we propose an algorithm for the enumeration of all prime patterns of a data set. The algorithm is preferable to classical methods when the data set has a small maximal Hamming distance between the two classes of data points. Secondly, we investigate theories formed of prime patterns. Since the number of such prime theories for a data set is in general large, we define a statistical measure that can be used to rank prime patterns and, based on this, to select those prime patterns that are more significant than others to form a theory. Finally, we illustrate two biological applications of LAD.
In the first application, we use prime patterns to successfully identify protein interactions in a cell signaling network based on perturbation measurement data. The second application lies in the field of synthetic biology. Here we outline our approach for generating Boolean classifiers from miRNA data. These classifiers can be used to assemble in vitro synthetic cell circuits that distinguish healthy from cancerous tissue.
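The pattern notions used above can be illustrated with a minimal sketch (toy data, not the AnswerSetLAD implementation): a LAD pattern is a conjunction of literals over binarized features, it is positive if it covers some positive observation and no negative one, and it is prime if no literal can be dropped without losing that purity.

```python
# Toy sketch of LAD patterns (illustrative data, not the thesis software).

def covers(pattern, point):
    """pattern: dict feature -> required Boolean value; covers iff all literals hold."""
    return all(point[f] == v for f, v in pattern.items())

def is_positive_pattern(pattern, positives, negatives):
    """Positive pattern: covers >= 1 positive point and no negative point."""
    return (any(covers(pattern, p) for p in positives)
            and not any(covers(pattern, n) for n in negatives))

def is_prime(pattern, positives, negatives):
    """Prime pattern: dropping any single literal destroys purity."""
    if not is_positive_pattern(pattern, positives, negatives):
        return False
    for f in pattern:
        sub = {g: v for g, v in pattern.items() if g != f}
        if is_positive_pattern(sub, positives, negatives):
            return False
    return True

positives = [{"a": 1, "b": 1, "c": 0}, {"a": 1, "b": 0, "c": 0}]
negatives = [{"a": 0, "b": 1, "c": 0}, {"a": 0, "b": 0, "c": 1}]

print(is_prime({"a": 1}, positives, negatives))          # True: "a = 1" alone separates
print(is_prime({"a": 1, "b": 1}, positives, negatives))  # False: "b = 1" is redundant
```

Prime patterns are exactly those reported by the enumeration algorithm described above; the Hamming-distance condition concerns how far apart the two classes lie in this Boolean feature space.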
Metabolic and gene regulatory networks are two classic models of systems biology. Biologically, gene regulatory networks are the control system of protein expression, while metabolic networks, especially genome-scale reconstructions, consist of thousands of enzymatic reactions breaking down nutrients into precursors and energy to support cellular survival. Metabolic-genetic networks additionally include the translational processes, integrating classical metabolic networks with the gene expression machinery. Conversely, genetic regulation is also affected by the metabolic activities, which provide feedback and precursors to the regulatory system. Thus, the two systems interact strongly and depend on each other. Various efforts have been made to bridge the two network types, yet the dynamic integration of metabolic networks and genetic regulation remains a major challenge in computational systems biology. This PhD thesis is a contribution to mathematical modeling approaches for studying metabolic-regulatory systems. Inspired by regulatory flux balance analysis (rFBA), we first propose an analytic pipeline to explore the optimal solution space of rFBA. Then, our efforts focus on the dynamic combination of metabolic networks with enzyme production costs and genetic regulation. For this purpose, we first explore the intuitive idea of incorporating Boolean regulatory rules while iterating resource balance analysis. With this iterative strategy, however, the gene expression states are only updated at discrete time steps. Furthermore, formalizing the metabolic-regulatory networks (MRNs) by hybrid automata provides a new mathematical framework that allows the quantitative integration of the metabolic-genetic network with genetic regulation in a hybrid discrete-continuous system.
For the application of this theoretical formalization, we develop a constraint-based approach, regulatory dynamic enzyme-cost flux balance analysis (r-deFBA), as an optimal control strategy for the hybrid automata representing MRNs. This allows predicting optimal regulatory state transitions, the dynamics of metabolism, and a resource allocation that achieves maximal biomass production over a time interval. Finally, this PhD project ends with a chapter on perspectives; we apply the theory of product automata to model dynamics at the population level, integrating continuous metabolism and discrete regulatory states.
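The iterative strategy mentioned above can be caricatured in a few lines (everything here is a hypothetical stub, not the thesis model): Boolean gene states gate which uptake reactions are active, a stubbed metabolic step is solved under that gating, and the environment readout updates the genes at the next discrete time point.

```python
# Sketch of an rFBA-style iteration (hypothetical rules and rates).

def regulatory_fba_loop(glucose, steps, dt=1.0):
    """Per discrete step: (1) update Boolean gene states from the environment,
    (2) solve a (stubbed) metabolic step restricted to active reactions."""
    genes = {"glc_uptake": True, "ace_uptake": False}
    trace = []
    for _ in range(steps):
        # (1) Boolean regulatory rules: uptake systems follow substrate availability
        genes["glc_uptake"] = glucose > 0
        genes["ace_uptake"] = glucose <= 0
        # (2) stub for the FBA/RBA solve with the current gating
        if genes["glc_uptake"]:
            growth = 1.0
            glucose = max(glucose - 2.0 * dt, 0.0)
        else:
            growth = 0.4  # slower growth on the alternative substrate
        trace.append((round(glucose, 2), growth))
    return trace

print(regulatory_fba_loop(glucose=5.0, steps=4))
# [(3.0, 1.0), (1.0, 1.0), (0.0, 1.0), (0.0, 0.4)]
```

The limitation criticized in the text is visible here: the regulatory state can only switch at the discrete step boundaries, which the hybrid-automata formalization replaces with genuine discrete-continuous dynamics.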
Systems biology is an interdisciplinary field of research that combines mathematics, computer science, and engineering to analyse biological processes. It has become increasingly important over the last two decades, in particular because of successful applications in human health and biotechnology. It aims at simulating biological systems with mathematical models in order to support time- and cost-intensive laboratory research. To do so, researchers create formalisms, algorithms, and techniques that can be widely used. One technique to obtain data describing biological entities is genome sequencing. Using modern high-throughput sequencing, increasing knowledge is gained about genomes, which can then be used to reconstruct metabolic processes and networks of the organisms. Success does not come for free, however, and data gathered with modern techniques are often too large to be analysed by hand. Therefore, methods that extract relevant information from data are in great demand. In this thesis, we introduce different methods that reduce given data in metabolic networks in a meaningful way. We present a technique that computes minimal metabolic subnetworks which are still able to satisfy predefined functionalities. We also develop a method to compute minimum sets of elementary flux modes that compose the network, whose size is significantly reduced compared to the whole set of elementary flux modes. Furthermore, we provide procedures that reduce the number of variables in a given problem in order to accelerate (already existing) algorithms by using information given by the data. Moreover, we develop a novel procedure to compute minimal cut sets on a projected network. This enables us to compute minimal cut sets of larger cardinality than before and to analyse larger networks. This projection of metabolic networks also gives rise to other applications, such as computing minimal metabolic behaviours.
Even though we apply and adapt our methods to real metabolic systems, this thesis focuses on the mathematical methods. To create and prove the new techniques, we make use of (mixed-integer) linear optimisation, polyhedral cones, linear algebra, and oriented matroids.
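The minimal cut set concept from the abstract above admits a compact toy illustration (hypothetical network data, brute force rather than the thesis' projection-based procedure): a minimal cut set is a minimal set of reactions whose removal disables every elementary flux mode supporting the target function, i.e. a minimal hitting set of the EFM supports.

```python
from itertools import combinations

# Brute-force minimal cut sets on a toy network (illustrative only; the
# thesis computes them on a projected network for scalability).

def minimal_cut_sets(efm_supports, reactions):
    """efm_supports: sets of reactions used by the target-producing EFMs."""
    hits = lambda cut: all(set(cut) & em for em in efm_supports)
    minimal = []
    for k in range(1, len(reactions) + 1):
        for cut in combinations(reactions, k):
            # keep only cuts that hit every EFM and contain no smaller cut
            if hits(cut) and not any(set(m) <= set(cut) for m in minimal):
                minimal.append(cut)
    return minimal

# Supports of three target-producing EFMs of a hypothetical network
efms = [{"r1", "r2"}, {"r1", "r3"}, {"r2", "r4"}]
print(minimal_cut_sets(efms, ["r1", "r2", "r3", "r4"]))
# [('r1', 'r2'), ('r1', 'r4'), ('r2', 'r3')]
```

The combinatorial explosion of this enumeration with network size is exactly why projecting the flux space, as described above, enables cut sets of larger cardinality on larger networks.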
A key topic in systems biology is to understand the intricate relations between molecular network structures, dynamic properties, and biological function. In this context, gene regulatory networks (GRNs), describing the regulatory interactions between genes and their products, are of crucial importance. The general goal of this thesis is to explore the relationship between the structure and dynamics of GRNs. This is done in a discrete modelling framework using the Thomas formalism. A GRN is represented by a discrete model which includes an interaction graph (IG) and a logical parameter function that characterise the regulatory interactions. The dynamics of a GRN is modelled by an asynchronous state transition graph (ASTG), in which the states of the system can only be changed by asynchronous and unitary updates. In 2011, T. Lorenz proposed two reverse engineering algorithms for inferring, from a given ASTG, models satisfying specific properties. In the first part of the thesis, the focus is on the explanation, implementation, and generalisation of the Lorenz algorithms. In order to handle general inputs, three necessary and sufficient conditions are presented that characterise ASTGs among all graphs on a given state space. Furthermore, a fourth condition is derived which is necessary and sufficient for an ASTG to admit a realistic model. These four ASTG conditions provide the basis for a generalisation of the Lorenz algorithms and several applications. Multistationarity and homeostasis are two important dynamical properties of high biological relevance, which can be represented by attractors in the ASTG. In the second part of the thesis, two discrete modelling workflows are developed for exploring all those GRNs that are able to realise a given functionality. The forward modelling workflow includes enumerating all possible models and searching for those models whose ASTG exhibits the desired properties.
The reverse engineering workflow starts by enumerating all graphs on the state space satisfying the dynamic properties and then infers all models of these graphs using the generalised Lorenz algorithms. To analyse the resulting functional IGs, a logical analysis method is developed, which encodes IGs by Boolean expressions and then uses Boolean function minimisation to obtain a compact representation. The same logical analysis method can also be applied to the logical parameters. In the last part of the thesis, the discrete modelling workflows are applied to explore the space of GRNs realising some typical dynamic behaviours of biological interest. Three case studies are presented. The first one concerns homeostasis in a simplified MAPK cascade, the second one multistationarity in cell differentiation, and the third one single-stripe formation in the embryogenesis of the fruit fly Drosophila melanogaster.
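The ASTG construction central to both workflows can be sketched for a Boolean toy network (the update functions below are a hypothetical toggle switch, not one of the case-study models): each transition changes exactly one component, in line with the asynchronous, unitary update rule described above.

```python
from itertools import product

# Asynchronous state transition graph (ASTG) of a toy Boolean network.

def astg(functions, n):
    """functions[i](state) -> target value of component i; one asynchronous,
    unitary update per edge: exactly one component moves toward its target."""
    edges = set()
    for state in product((0, 1), repeat=n):
        for i, f in enumerate(functions):
            target = f(state)
            if target != state[i]:
                succ = list(state)
                succ[i] = target
                edges.add((state, tuple(succ)))
    return edges

# Two genes repressing each other (a toggle switch)
fns = [lambda s: 1 - s[1], lambda s: 1 - s[0]]
edges = astg(fns, 2)
print(sorted(edges))
```

States without outgoing edges are steady states; here (0,1) and (1,0) are both steady, a minimal instance of the multistationarity discussed above.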
Reimers, Alexandra-Mirela: Understanding metabolic regulation and cellular resource allocation through optimization
This thesis is a contribution to the field of systems biology, where mathematical and computational models are used to study large biological networks such as the metabolism or the signaling pathways of living organisms. These models are simplified representations of the studied biological systems and come in different granularities and abstraction levels, depending on the size of the networks and on the modeling formalisms. One of the largest networks studied within systems biology is the metabolism, which comprises all the biochemical reactions happening inside a cell. Until recently, such large metabolic networks have been studied mainly in isolation and under stationary conditions, without considering the environment dynamics or the enzymatic resources needed to catalyze all the biochemical reactions. This has mainly been done using constraint-based analysis and optimization. While it has proven very successful in predicting cellular behavior in some cases, this approach is not suited for microorganisms living in changing environments. Two examples are cyanobacteria, whose metabolism is adapted to the daily changes in sunlight availability, and yeasts living in large bioreactors, which thus move in an environment governed by local heterogeneities. This thesis builds on recent developments in dynamic resource allocation formalisms for metabolism, which use tools from dynamic optimization and optimal control. We focus on modeling and understanding resource allocation in large (sometimes genome-scale) metabolic models. After giving an overview of existing tools for the study of metabolic resource allocation, the thesis presents a new mathematical derivation of the widely used steady-state assumption for metabolic networks and shows how it can be used to provide upper bounds on dynamic resource allocation solutions.
In preparation for the case studies, we present a guide for generating a dynamic resource allocation model using information from online databases, as well as guidelines and useful problem transformations. All the theory developed so far is then applied in two case studies. The first investigates the cyanobacterium Synechococcus elongatus PCC 7942 and is the first genome-scale dynamic resource allocation study. It gives insight into the temporal organization of enzyme synthesis processes following light availability and shows that the linear pattern of glycogen accumulation throughout the day period is an optimal behavior that arises as a tradeoff between several conflicting resource allocation objectives. The second case study concerns the yeast Saccharomyces cerevisiae. We aim to understand which mechanisms enable some of the cells to survive environmental transitions. We show that overflow metabolism and diauxie, phenomena widespread in nature, are optimal behaviors from a resource allocation perspective. Moreover, we investigate how resource allocation models can be used to understand how yeast adapts to shifts in oxygen and nutrient availability. We end with a perspectives chapter, which provides some preliminary results on using time courses from dynamic resource allocation models to infer the regulatory structures that implement these optimal behaviors.
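The steady-state assumption discussed above states that internal metabolite concentrations are constant, so production balances consumption: S·v = 0 for the stoichiometric matrix S and flux vector v. A minimal check on toy numbers (the network below is hypothetical):

```python
# Steady-state check S·v = 0 for a toy metabolic chain (illustrative data).

def is_steady_state(S, v, tol=1e-9):
    """True iff every internal metabolite is balanced: each row of S·v is ~0."""
    return all(abs(sum(row[j] * v[j] for j in range(len(v)))) <= tol
               for row in S)

# Toy chain:  ->(r1) A ->(r2) B ->(r3)
S = [
    [1, -1,  0],   # metabolite A: produced by r1, consumed by r2
    [0,  1, -1],   # metabolite B: produced by r2, consumed by r3
]
print(is_steady_state(S, [2.0, 2.0, 2.0]))  # True: balanced flux
print(is_steady_state(S, [2.0, 1.0, 1.0]))  # False: A accumulates
```

Dynamic resource allocation relaxes exactly this constraint for biomass components, which is why the derivation above can bound its solutions with the steady-state case.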
Two cellular subsystems are the metabolic network and the gene regulatory network. In systems biology, they have mostly been modelled in isolation, with ordinary differential equations (ODEs) or with tailored formalisms such as constraint-based methods for metabolism or logical networks for gene regulation. In reality, the two systems are strongly interdependent. For mathematical modelling, their integration is a challenge, and a variety of different approaches has been proposed. Long-term alterations in metabolism result from changes in gene expression, which determines the production of enzymes. This transcriptional control can adjust the metabolic network to changes in the environment or the requirements of the cell. In fact, the cell cycle is connected to cyclic changes in metabolism, so-called metabolic cycling, but alterations are also observed in non-proliferating cells in a constant environment. A mathematical model to describe and explain alterations in metabolism is proposed here. At first, a resource allocation model for the enzymes in a metabolic network is developed and integrated into a constraint-based model of metabolism in Chap. 3. The reaction rates are bounded depending on the availability of enzymes, which in turn is determined by the overall distribution of the limited resources. In Chap. 4, this model is used to test the hypothesis that metabolic alterations are a means for the cell to achieve the required production of metabolic output most efficiently. First a toy model is analysed, and then the method is applied to a core metabolic network of the central carbon metabolism. The tasks of this metabolic network are the production of biomass precursors as well as the constant provision of a minimum of energy and anti-oxidants. The mathematical model yields a mixed-integer linear optimisation problem with a few quadratic constraints and a quadratic objective function.
Instead of searching for a single flux distribution, a feasible solution here corresponds to a sequence of several flux distributions together with the time spent in each of them. The consecutive usage of these flux distributions during the associated time spans yields the required output. The objective is the minimisation of the total time needed. The computations demonstrate that switching between several flux distributions allows producing the output in a significantly shorter time span than an optimal single flux distribution. In a toy model, we could identify the relationship between the model parameters and the results concerning the efficiency of static versus sequential flux distributions. Such a comprehensive analysis is not possible for the large number of parameters in our core metabolic network. To make sure that the confirmation of the hypothesis is not restricted to a minor region in the parameter space of the resource allocation model, we perturbed the parameters randomly and repeated all computations. This empirical analysis showed that the significant gain in performance is a robust feature of the model. From the mathematical point of view, the proposed resource allocation model defines for each gene expression state a flux space from which a flux distribution can be chosen. This flux space is in general neither linear nor convex, which turns out to depend on the space of all possible gene expression states. In our model, the genes regulate the enzyme concentrations in an on-off manner, only determining the active and inactive parts of metabolism. Furthermore, certain groups of genes are regulated together as functional units. As a consequence, the enzyme concentrations cannot be perfectly adjusted to a given flux distribution in this model, and it is for this reason that switching can increase efficiency. A simpler model of resource allocation, which is based solely on molecular crowding, has been proposed before in the literature.
It allows distributing the resources to match any given flux distribution perfectly, and switching is then not necessary to obtain the minimal production time. In contrast to such a resource allocation model, our modelling assumptions and computational results suggest a design principle where the optimal adjustment to given conditions and requirements is achieved not by fine-tuning enzyme concentrations, but by switching between different flux distributions, which are only roughly determined by transcriptional control and do not perfectly match one particular condition or requirement. In terms of geometry, the difference lies in the convexity of the flux space: if it is convex, minimal production time can always be achieved with a single flux distribution. To characterise a set of flux distributions sufficient to constitute an optimal sequence, the flux space of the network without the resource allocation model is considered in Chap. 3. The corresponding polytope allows characterising a finite subset of the flux space in terms of decomposability, a notion which is closely related to elementary modes. For any output requirements, an optimal sequence can be constituted from this finite set of flux distributions. In practice, solving the optimisation problem that was derived from the modelling approach, as well as computing the sufficient finite subset, is not tractable for large networks. Divide-and-conquer strategies are also not promising for obtaining optimal solutions in general; a counterexample is given in Chap. 6. Alternative computational methods to obtain optimal or approximately optimal solutions are then presented. The gene regulatory network behind the metabolic genes is not fully considered in the resource allocation model of Chap. 3. Only some constraints are added in the application to the core metabolic network in order to exclude unrealistic patterns of gene expression.
Incorporating more information about the gene regulation into the computational model in fact improves tractability, because the search space is reduced. A sufficiently small search space of gene expression sequences makes it possible to perform a more precise and extensive analysis using an alternative computational approach. In Chap. 5, the perturbations of model parameters, as applied to the core metabolic network to verify robustness, are considered in general. From the mathematical point of view, the linear constraints that bound the flux space are perturbed. The consequences for the geometry of the flux space and for the objective value of an optimisation problem over this flux space are analysed, and an effect is discovered which is surprising at first sight: if the bounds on the reaction rates are perturbed individually, without a bias for increase or decrease, the objective value of a given linear optimisation problem decreases in expectation. This effect emerges from the representation of the flux space; in particular, redundancy of the constraints plays a crucial role. The modelling and analysis of the dynamics of gene regulatory networks with so-called logical networks is a common discrete approach. Logical networks are often represented by logical functions, which have the advantage of being mathematical objects that can be given in a natural and easily understandable format, namely Boolean expressions. In Chap. 7, a method is presented to obtain a short and well readable representation of a given logical function. It is based on the minimisation of Boolean expressions, but is designed in particular for multi-valued logical functions. All possible dynamics of a logical network can be represented in the so-called state transition graph.
Simply by assigning rates to all edges, which represent the transitions between different states, this directed graph becomes a continuous-time Markov chain (CTMC), which we call a stochastic logical network. This modelling approach opens new possibilities for the analysis of quantitative dynamical properties, as shown in Chap. 8. In contrast to this abstract model, detailed mechanistic and stochastic models of biochemical reaction systems can be formulated with the chemical master equation, which also defines a CTMC. In fact, these two formalisms can be combined, so that distinct components of the biological system are modelled in much detail by the master equation and other parts on a higher abstraction level as a stochastic logical network. The combined model can focus on certain aspects, capturing related quantitative and stochastic effects, while keeping the overall complexity to a minimum. Finally, Chap. 9 discusses the feedback regulation from metabolism to gene regulation. In an integrated dynamic model of gene regulation and metabolism, this aspect should not be missing. Since constraint-based models neglect the concentrations of metabolites, it is difficult to determine the regulatory feedback to the genes. This problem can be circumvented by only inferring metabolically mediated interactions between genes, in the sense that a switch in gene expression leads to an alteration in the metabolic network, which in turn gives a new regulatory input to the gene network. To this end, a constraint-based approach is proposed and compared to a method from the literature, which is based on metabolic sensitivity analysis. Furthermore, a strategy to derive concentration changes from changes in flux rates and enzyme activities is briefly presented.
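The step from a state transition graph to a CTMC described above amounts to assembling a generator matrix Q: each rated edge contributes an off-diagonal rate, and diagonals are set so that rows sum to zero. A minimal sketch with illustrative states and rates (not a model from the thesis):

```python
# Generator matrix of a "stochastic logical network": a state transition
# graph whose edges carry (here illustrative) transition rates.

def generator_matrix(states, rated_edges):
    """rated_edges: {(src, dst): rate}. Returns the CTMC generator Q."""
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    Q = [[0.0] * n for _ in range(n)]
    for (src, dst), rate in rated_edges.items():
        Q[idx[src]][idx[dst]] = rate
        Q[idx[src]][idx[src]] -= rate   # rows of a generator sum to zero
    return Q

states = ["00", "01", "10"]
edges = {("00", "01"): 2.0, ("00", "10"): 1.0, ("01", "00"): 0.5}
Q = generator_matrix(states, edges)
print(Q)  # [[-3.0, 2.0, 1.0], [0.5, -0.5, 0.0], [0.0, 0.0, 0.0]]
```

The all-zero row for state "10" makes it absorbing; quantitative properties such as hitting times or stationary distributions can then be read off Q with standard CTMC machinery.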
Constraint-based analysis of genome-scale metabolic networks has become increasingly important for describing and predicting the cellular behavior of living organisms. The steady-state constraints provide a reliable framework without the need for additional kinetic details of the system. As the number of metabolic network reconstructions and their level of detail continually increase, many computational tools for their analysis become impractical to use. This calls for more efficient algorithms and tools for the analysis of metabolic networks. We have a two-fold aim in this work: first, to design new algorithms that improve the efficiency of some of the existing methods; second, to create additional tools that fill gaps in our toolbox for metabolic network analysis. These methods provide additional insight into the structure of metabolic networks and ultimately broaden our understanding of cellular systems. In the first part of the thesis we focus on improving flux coupling analysis (FCA). We prove that solving certain linear programs to feasibility is sufficient to correctly deduce most coupling information. The FFCA algorithm improves the efficiency of existing algorithms in the literature and is further developed by proving that all fully coupled reactions can be computed algebraically. Utilizing the transitive nature of couplings and reusing existing LP solutions, we can dramatically decrease the running time. Using these improvements we design the F2C2 algorithm. We extend the concepts of FCA to the constrained flux space, where any number of additional linear constraints can be imposed on the reactions. Constrained flux coupling analysis (CFCA) is proven to reveal coupling information that is only visible under special environmental conditions. We study the relationship between the FCA and CFCA relations and present an efficient algorithm to compute the latter.
In our next effort we study whether relations similar to FCA can also be applied to metabolites. We introduce the concept of metabolic activity coupling (MAC), which finds implicative relations between the momentary production and consumption of different metabolites. Elementary flux modes (EMs) are a canonical representation of the steady-state flux space. Our novel method for finding EMs containing several predefined reactions can be used to identify pathways that synthesize a desired target from one or more source metabolites. While the problem solved belongs to the class of NP-hard problems, we show that the method is applicable in practice for current-generation networks. In the last part of the thesis, we show that the study of isolated subnetworks may result in undesired artifacts and, as an alternative solution, we present a method that projects the flux space onto a lower-dimensional space while preserving key features of the network. We show that in certain cases our method can be applied where an exhaustive EM enumeration would otherwise fail.
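The coupling relations discussed above can be illustrated on toy data. This sketch classifies a pair of reactions from a hand-listed set of elementary modes rather than by the LP-based FFCA/F2C2 algorithms, and it collapses the full FCA taxonomy (it omits partial coupling): r_i is directionally coupled to r_j if every mode using r_i also uses r_j, and the two are fully coupled if in addition their flux ratio is fixed.

```python
# Simplified flux coupling classification from a toy set of elementary
# modes (illustrative; the thesis algorithms work on the LP level and
# distinguish more coupling classes).

def coupling(ems, i, j):
    """Classify the relation 'v_i != 0 implies v_j != 0' over the given modes."""
    using_i = [v for v in ems if v[i] != 0]
    if not all(v[j] != 0 for v in using_i):
        return "uncoupled"
    ratios = {round(v[i] / v[j], 9) for v in using_i}
    return "fully coupled" if len(ratios) == 1 else "directionally coupled"

# Toy elementary modes (entries: fluxes of r0, r1, r2)
ems = [(1.0, 2.0, 0.0), (2.0, 4.0, 1.0), (0.0, 1.0, 3.0)]
print(coupling(ems, 0, 1))  # ratio is always 1/2: "fully coupled"
print(coupling(ems, 1, 0))  # r1 runs without r0 in the last mode: "uncoupled"
```

The asymmetry of the two calls shows why coupling is a directed relation; exploiting its transitivity is one of the running-time improvements the abstract mentions.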
This thesis addresses three challenges in modeling regulatory and signal transduction networks. The starting point is the generalized logical formalism as introduced by R. Thomas and further developed by D. Thieffry, E. H. Snoussi, and M. Kaufman. We introduce the fundamental concepts that make up such models, the interaction graph and the state transition graph, as well as model checking, a computer science technique for deciding whether a finite transition system satisfies a given temporal specification. The first problem we turn to is whether a given model is consistent with time series data. To answer it, we introduce query patterns that can be automatically derived from discretized data. Time series data, being such an abundant source of information for reverse engineering, have previously been used in the context of logical models, but only under the synchronous, transition-based notion of consistency. The arguably more realistic asynchronous transition relation has so far been excluded from such data-driven reverse engineering, probably because the corresponding non-determinism in the transition system introduces additional obstacles to an already hard problem. Our contribution here is a path-based notion of consistency between model and data that works for any transition relation. In particular, we discuss linear-time properties like monotony and branching-time properties like robustness. The result is several query patterns, similar to, but more complex than, the ones proposed by P. T. Monteiro et al. A toolbox, called TemporalLogicTimeSeries, for the automated construction of queries from data is also presented. The second problem we turn to concerns the two types of long-term behaviors that logical models are capable of producing: steady states, in which the activity levels of all network components are kept at a fixed value, and cyclic attractors, in which some components are unsteady and produce sustained oscillations.
We attempt to understand the emergence of these behaviors by searching for symbolic steady states as defined by H. Siebert. Our main contribution is the introduction of the prime implicant graph, which describes all minimal conditions under which components may change their activities, and an optimization-based algorithm for the enumeration of all maximal and minimal symbolic steady states. Essentially, we generalize the canalizing effects and forcing structure that were first introduced and studied by S. Kauffman and F. Fogelman in the context of random Boolean networks. The chapter includes a theorem that relates symbolic steady states to the existence of positive feedback circuits in the interaction graph. A toolbox, called BoolNetFixpoints that implements our algorithm is also described. The theme of the last chapter is how to deal with uncertainties that inevitably appear during the modeling of biological systems. One is often forced to resolve them since most types of analysis require a single, fully specified model. The knowledge gap is usually filled by making simplifications or by introducing additional assumptions that are hard to justify and therefore somewhat arbitrary. The alternative is to work with and analyze sets of alternative models, rather than single models. This idea entails additional theoretical and practical challenges: With which language should we describe our partial knowledge about a system? How can predictions be made given that each model in the set may behave differently? How can hypotheses and additional data be added to the current knowledge in a systematic manner? It seems that there are in principle two different approaches. The first one is constraint-based and studied by F. Corblin et al. It translates the partial knowledge and modeling formalism into facts and rules of a logic program. Common solvers can then deduce additional properties or test the validity of given queries across all models. 
In contrast, we propose to study the pros and cons of an explicit approach that enumerates all models that agree with a given partial specification. During the first step, models are enumerated and stored in a database. During a second step, models are annotated with additional information that is obtained from custom algorithms. The relationships between the annotations are then analyzed in a third step. The chapter is based on the prototype implementation LogicModelClassifier that performs the discussed steps. Throughout, we apply our results to two previously published models of biological systems. The first one is a small model of the galactose switch which regulates the transcription of genes that are involved in the metabolism of yeast. We address questions that arise during the construction of the model, for example the number of involved components and their interactions, as well as issues related to model validation and model revision with time series data. The case study also discusses different approaches to data discretization. The second one is a medium-sized model of the MAPK network studied by D. Thieffry et al. that is used to predict the cell fate response to different stimuli involving the growth factors EGF, TGFB, FGF and DNA damage. With the methods developed in this thesis we can prove that the model is capable of 18 different asymptotic behaviors, 12 of them steady states and 6 cyclic attractors. The question of which attractor is reached from which initial state is answered, and we can show that the response in terms of proliferation or growth arrest and apoptosis is fully determined by the input stimulus.
The prevalence of nonalcoholic fatty liver disease (NAFLD), including steatosis and nonalcoholic steatohepatitis (NASH), continues to grow and has become a worldwide public health issue. To understand liver disease progression, one needs to investigate the complex interactions occurring within biological systems. Systems biology tries to understand the interactions within biological systems by means of mathematical models. Exploiting this approach, I describe interactions of genes, proteins, and metabolites that are involved in nonalcoholic fatty liver disease. Molecular data from liver tissue samples of three mouse strains (A/J, C57Bl6, and PWD) under two different conditions, 3,5-diethoxycarbonyl-1,4-dihydrocollidine (DDC)-treated and untreated (control), were analyzed. Each of these mouse strains shows a different degree of the disease under DDC treatment, displaying high, low, and no steatohepatitis-like phenotypes for A/J, C57Bl6, and PWD, respectively. In this work, I performed pathway analysis using gene expression data of mouse liver samples and identified the metabolism of histidine, beta-alanine, and purine, along with the glycolysis and gluconeogenesis pathways, as top candidates that may be involved in liver dysfunction. Furthermore, gene expression and metabolite data of the arachidonic acid metabolism were found to be deregulated, and this pathway was used for kinetic modeling. Genes and metabolites of S-adenosylmethionine (SAMe) metabolism were found to be perturbed under DDC treatment. In addition, I have developed a novel enrichment analysis approach that may be used for the identification of the most relevant fully coupled modules in a disease context. The approach includes three steps: 1) obtain fully coupled reactions, which represent a module, 2) use gene expression data of a disease context to obtain marked correlated modules, and 3) select modules in which at least one gene is differentially expressed between normal and disease conditions.
These steps are used to identify liver disease specific modules, such as modules of the pentose phosphate pathway and hepatic SAMe metabolism, which are linked to oxidative stress. Furthermore, I also identified a module of cholesterol metabolism, which is linked to apoptosis, along with a module of pyrimidine catabolism, for which experimentally measured genes and metabolites were also found to be deregulated. The goal was to identify modules whose genes and metabolites are perturbed under DDC-treatment. The identified modules may be involved in liver disease, and they can be used to build kinetic models for a better understanding of liver disease progression. In addition to the enrichment analysis of fully coupled modules, I developed an approach based on elementary flux modes (EFMs). In this approach, metabolites that are differentially regulated due to DDC-treatment were identified first. Then, reactions that can produce these differentially regulated metabolites were used as a target set. For each reaction in the target set, 50 EFMs containing the reaction were identified. After that, gene expression data of mouse liver samples were used to select important EFMs that may be involved in liver disease progression. I identified two EFMs: one comprises the differentially regulated metabolites L-arginine, ornithine and putrescine, and the other comprises the differentially regulated metabolites D-glucose, L-glutamine and L-asparagine. I also introduced the mGX-FBA method, a modified version of the previously published GX-FBA method. Differences in metabolic flux levels among mouse strains may provide a better understanding of the causes of NAFLD. To address this, I performed an in silico flux-based analysis using E-Flux and the modified mGX-FBA method, incorporating gene expression data of the mouse model.
Furthermore, during the course of my thesis I compared the results of both methods, E-Flux and mGX-FBA, and validated the results with experimental data. A change of flux through metabolic pathways may change metabolite concentrations. Due to the absence of experimental flux data for mouse liver samples, it is difficult to assess in silico predicted flux regulation. However, in silico flux regulation may give a hint about the regulation of metabolite concentrations. Hence, to observe the flux regulation, I colored metabolic maps with the deviation of fluxes between DDC-treated and control conditions. Different degrees of flux regulation through cholesterol biosynthesis were identified among the three mouse strains. The concentration of desmosterol, a downstream metabolite of cholesterol biosynthesis, was found to be regulated to different degrees. The regulation of desmosterol concentration is in line with the flux regulation of cholesterol biosynthesis among all three strains. In addition, to understand whether one can speculate about flux regulation in metabolic pathways based on gene regulation, I used the bile acid synthesis and cholesterol biosynthesis pathways. For these pathways, I observed that it is difficult to estimate flux regulation from gene expression data alone, but that the integration of gene expression data using E-Flux improves the prediction of flux regulation. I introduced novel objective functions for measuring the readout of steatosis and steatohepatitis. To construct an objective function for steatosis, I used metabolites involved in the formation of lipid droplets (LDs), while for steatohepatitis an objective function integrating both metabolites involved in LD formation and metabolites involved in oxidative stress was used. Gene expression data of liver samples of all three mouse strains were incorporated into a mouse metabolic model, and the obtained objective values were validated with the strains' phenotypic data.
I also constructed an objective function with metabolites that may be involved in cell proliferation. Cell proliferation is used as a readout of hepatocellular carcinoma (HCC). Applying these functions, I performed an in silico drug target analysis in which potential drug candidates for steatosis, steatohepatitis and HCC were identified. Cholesterol metabolism and triacylglycerol synthesis were found to be the top hits containing the largest number of potential drug target candidates. Out of 78 identified potential drug candidates, 7 were found to be approved by the Food and Drug Administration (FDA) as anticancer drugs. Metabolite concentrations can be viewed as end points of perturbations occurring at the gene level, so that changes in gene expression might explain changes in metabolite concentrations. I proposed a novel hypothesis to predict changes in metabolite concentrations between two conditions based on gene expression data. To address this, I have developed a Petri net-based method (MPN) and used it to simulate the arachidonic acid model. As an alternative to MPN, Monte Carlo ODE-based simulation was used, but neither method can predict the metabolite concentrations in the range of the experimental data. To overcome this, I have developed a fitted detailed kinetic model of the arachidonic acid metabolism that comprises metabolites markedly deregulated due to DDC-treatment in all three mouse strains.
Biological experiments are time-consuming and expensive. Hence, computer-aided experimental design is used more and more often to select those experiments that are likely to yield new insights. In this work I consider variants of the constraint-based method Flux Balance Analysis. Constraints are used to exclude biologically unrealistic phenotypes. Many constraints (for example flow conservation) can be formulated using linear inequalities, which allows efficient analysis using linear programming. Thermodynamic constraints are used to also include energetic aspects. However, thermodynamic constraints are not linear and induce a non-closed solution space. In this work I show that this non-linearity of thermodynamic constraints often leads to NP-hard decision problems, and I show the consequences this has for the reliability of sampling methods. However, I also present solution approaches that, in practice, allow us to efficiently solve optimization problems and apply qualitative analysis methods, like flux coupling analysis, with thermodynamic constraints. I then use these insights to develop a bi-level optimization method to analyze the growth behavior of the green alga Chlamydomonas reinhardtii. Another area covered by this work is flux modules. Based on work by Kelk et al., I give a mathematical definition of flux modules and show that this definition satisfies the desired properties. The definition also allows me to prove several decomposition theorems, which simplify the analysis of metabolic networks. Using matroid theory, I show how modules can be computed efficiently. With the definition of k-modules, I also prove a decomposition theorem for general polyhedra using the concept of matroid branch-width. With flux modules and algorithmic approaches that also include complex constraints, I present methods in this thesis that simplify and speed up the analysis of metabolic networks.
This allows us to gain biological insights faster and to develop better methods for the production of biofuels in bioengineering and for cancer therapies in medicine.
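The linear-programming core of Flux Balance Analysis mentioned above can be illustrated on a toy network; the network, bounds and use of `scipy.optimize.linprog` are illustrative assumptions, not the thesis implementation.

```python
import numpy as np
from scipy.optimize import linprog

# Toy network (hypothetical): R1: -> A, R2: A -> B, R3: B -> (biomass drain).
# Rows = metabolites [A, B], columns = reactions [R1, R2, R3].
S = np.array([[1, -1,  0],
              [0,  1, -1]])
bounds = [(0, 10), (0, None), (0, None)]   # uptake via R1 capped at 10

# FBA: maximize v_R3 subject to S v = 0 (flow conservation at steady state)
# and the flux bounds. linprog minimizes, so the objective is negated.
res = linprog(c=[0, 0, -1], A_eq=S, b_eq=np.zeros(2), bounds=bounds,
              method="highs")
print(res.x)       # optimal flux distribution
print(-res.fun)    # → 10.0 (maximal biomass flux, limited by the uptake bound)
```

Thermodynamic constraints, as discussed in the thesis, would add non-linear conditions on top of this linear core, which is exactly what breaks the efficiency of plain linear programming.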
Jamshidi, Shahrad: Comparing discrete, continuous and hybrid modelling approaches of gene regulatory networks
Mathematical modelling of biological networks can help us understand the complex mechanisms behind cell proliferation, differentiation and other cellular processes. From these models, we are able to replicate and predict system behaviour, which can help in the design of experiments in the systems biology context. Multiple formalisms capture the evolution or dynamics of a system as implied by the network. Ordinary differential equation (ODE) models provide a precise representation of the system, where the concentrations of network components evolve based on chemical kinetics, e.g. mass action kinetics. However, the kinetic parameters required to generate the dynamics accurately are often lacking, which has led to the development of more qualitative or discrete modelling methods. Discrete formalisms, like the well-known Thomas formalism, provide a very coarse representation of the system's dynamics, whilst still highlighting fundamental features of the network structure. When modelling a given system, it can occur that the different approaches yield contrary dynamics. From a modelling perspective, this is highly impractical, as we expect the system to behave uniquely irrespective of the modelling approach used. By mathematically relating different formalisms, we can analyse their dynamics and determine conditions under which the dynamics are common or contrary between formalisms. Hybrid modelling approaches, that is, formalisms that combine discrete and continuous methods, help in relating the purely discrete Thomas formalism to the purely continuous ODE formalism. Approximating the ODEs, we obtain piecewise affine differential equations (PADEs), which have well-defined dynamics that can be discretised to reflect features of the Thomas formalism.
Incorporating the hybrid formalism of PADEs into our analysis, we can break up the otherwise rough transformation between the ODE and Thomas formalisms in order to specify the conditions under which contrary dynamics occur between formalisms. Our main result compares the qualitative approach of PADEs with the Thomas formalism. In particular, we show that even though the qualitative parameter information of the PADEs is inherent in the Thomas formalism and vice versa, the two models can still yield contrary dynamics. However, with the well-defined correspondences between the transition systems implied by the two approaches, we can prove which paths and terminal strongly connected components are common to both formalisms. With our analysis, we bridge the gap between discrete and continuous modelling methods. More specifically, we establish the dynamics that are common regardless of the choice of formalism and the dynamics that can be seen as artefacts of the formalism. From this analysis, therefore, we obtain a more rigorous modelling framework that allows us to model and predict biological systems with greater accuracy.
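The discretisation step that links continuous trajectories to Thomas-style qualitative states can be sketched as a simple threshold mapping; the thresholds and trajectory values below are invented for illustration.

```python
import bisect

def discretize(x, thresholds):
    """Map a continuous concentration to its qualitative level
    (0 .. len(thresholds)), as in Thomas-style discretisation."""
    return bisect.bisect_right(sorted(thresholds), x)

trajectory = [0.05, 0.4, 0.9, 1.6, 0.7]   # continuous concentrations (invented)
thresholds = [0.3, 1.2]                   # theta_1 < theta_2 (invented)

print([discretize(x, thresholds) for x in trajectory])   # → [0, 1, 1, 2, 1]
```

In the PADE setting, such thresholds partition state space into rectangular domains in which the dynamics are affine, which is what makes a well-defined correspondence to the discrete transition system possible.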
This thesis aims to describe metabolic networks and reaction pathways in the cell using an algebraic structure. On this basis we want to establish analytical methods that can be used for a variety of approaches in qualitative modeling. To analyze metabolism, the steady-state flux cone is presently the most commonly used mathematical model. Concepts such as elementary flux modes, flux coupling analysis (FCA) and minimal cut sets have been introduced based on this flux cone. This thesis identifies modeling approaches and their applications based on the developed abstract algebraic structure, thus introducing a general framework for the qualitative analysis of metabolism. Reaction sets that can be simultaneously active in a cell are represented as the elements of a join (semi)lattice. A qualitative description of elementary flux modes is given, and the concepts used are translated into lattice theory. Partially based on existing FCA algorithms, a method is presented that can rapidly identify the maximum of a lattice. This maximum is then used to define an abstract notion of flux coupling that can be applied to a diversity of qualitative models. The implemented algorithm is applied to simulate a complete reaction double knockout (EFCA) in different cell models. These models are based on a steady-state assumption that can be combined with further thermodynamic constraints. Additionally, a concept of target prediction is introduced, utilizing the calculation of lattice maxima, and related to (minimal) cut sets. Taken as a whole, the best-known analytical approaches for metabolism correspond to distinguished elements of join lattices. We are able to apply lattice theory to the standard model based on the flux cone, to pathways through hypergraphs, and to a model using logic formulas. It becomes apparent that this logic model is a relaxation of the standard model.
Depending on the application, the elementary flux modes are shown to be either the minimal or the irreducible elements of a lattice. The Starbucks lemma proves that both these sets can be easily worked with in many qualitative models. It further proves that our FCA algorithm is correct for all kinds of qualitative models. We show that our implementation, even though it is much easier to adapt to other models, has a running time comparable to well-established FCA algorithms. In fact, we will see that further improvements in speed are most likely attainable only through more efficient programming, not through more effective theoretical approaches. Using lattice theory, many open questions about the cell's metabolism become easier to answer. One reason for this is that it becomes easier to apply known results from certain models to other ones. Hence, the general framework of lattice theory is well suited for qualitative modeling and for developing and optimizing analytical methods.
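A minimal sketch of the lattice view, assuming toy feasible reaction sets (not from the thesis): the lattice maximum is the join (union) of all elements, and an abstract notion of flux coupling can be read off directly from the elements.

```python
from functools import reduce

# Hypothetical feasible reaction sets (supports of steady-state flux vectors);
# in the lattice framework these are the elements of a join (semi)lattice.
feasible = [frozenset(), frozenset({"R1", "R2"}), frozenset({"R3"}),
            frozenset({"R1", "R2", "R3"})]

# The lattice maximum is the join (union) of all elements: the set of
# reactions that are not blocked.
maximum = reduce(frozenset.union, feasible, frozenset())
print(sorted(maximum))                   # → ['R1', 'R2', 'R3']

def coupled(a, b, sets):
    """Abstract flux coupling: every feasible set using a also uses b."""
    return all(b in s for s in sets if a in s)

print(coupled("R1", "R2", feasible))     # → True  (R1 never active without R2)
print(coupled("R3", "R1", feasible))     # → False (R3 can be active alone)
```

The point of the abstraction is that this definition makes no reference to the flux cone itself, so the same computation applies to any qualitative model whose feasible reaction sets form a join lattice.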
Schulz, Marvin: Identification of potential drug targets in kinetic networks described by ordinary differential equations
Over the last decade the productivity of the pharmaceutical industry has been constantly declining. Fewer and fewer drugs against new diseases are admitted to the market each year. This is mainly due to the fact that an increasing number of drug candidates fail in clinical trials for lack of in vivo activity or for their toxicity. In order to reduce this failure rate, the targets against which new drugs are developed have to be chosen more carefully. This can be done with the help of methods from Systems Biology, with which the dynamical effects of hypothetical drugs can be modelled in silico. The combination of mathematical models with experimental data will improve target selection and make the resulting drugs less likely to fail in clinical trials. Within this work I have developed a framework for the application of kinetic models in the drug development process. Furthermore, I have developed methods and tools that support researchers in pursuing the framework. This includes methods for the automated retrieval of mathematical models that describe processes relevant to an investigated disease, methods for the integration of knowledge stored in these models, and the investigation of the combined information for potential drug targets. For the prioritization of drug targets I propose different objectives and methods. Depending on the disease, one can either choose to consider only the efficacy of drugs against potential targets, or one can decide to incorporate information on potential side-effects in the considered or in alternative models. These objectives can then be used in exhaustive searches for optimal combinations of hypothetical drugs. Apart from this identification of optimal treatments, I introduce methods that allow for the discovery of treatment alternatives, which can be useful when drugs against a selected target are hard or even impossible to create.
Furthermore, I discuss methods for the investigation of synergisms and antagonisms amongst hypothetical drugs. Knowledge about these drug combination effects can be exploited to create treatments with fewer side-effects, or treatments against which resistances are less likely to develop. To demonstrate the relevance of the investigated methods, they are applied to two example systems: glycolysis in Trypanosoma brucei, the pathogen causing African sleeping sickness, and the arachidonic acid pathway in different human cells. The obtained results generally agree with the knowledge available in the literature but extend the understanding of drug effects on these networks.
In this thesis a new probabilistic model in predictive microbiology, the NPMPM, is presented. It is based on a new approach for including variability and uncertainty. The NPMPM was developed for risk assessment of bacterial contaminations in the food supply chain and is introduced by the example of a contamination of the milk supply chain with Listeria monocytogenes. Human illness resulting from the consumption of contaminated food may be caused by bacteria and their toxins. Hence, assessment of the growth and tenacity of bacteria during food production processes and storage is vital for food safety. Predictive modelling is used to forecast the development of microorganisms in food, depending on different influencing factors. Obtaining the desired information solely by means of laboratory experiments is time- and cost-intensive. Biological processes like growth and death are highly variable. Existing approaches that take variability and uncertainty into account either assume parameters to follow probability distributions, or model a biological process as a stochastic process. The NPMPM follows a newly developed approach: it calculates samples of possible bacterial concentrations by means of deterministic models, and such a sample is used to estimate the probability distribution of the corresponding population. This thesis is interdisciplinary. In the first chapter the research approach is motivated, and an overview of the organisation of this thesis is given. The second chapter describes causes and impact of foodborne diseases. Chapter three discusses factors influencing the growth and survival kinetics of bacterial populations. In chapter four dairy manufacturing in Germany is summarised. The fifth chapter presents existing models in predictive microbiology. Chapter six gives an overview of the data and methods used, and a discussion of model assumptions. In chapter seven the NPMPM is introduced. The model is validated in chapter eight, and results are discussed.
Finally, in chapter nine the contributions to the field of predictive microbiology are summarised, and some future work is suggested.
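The core NPMPM idea described above, running a deterministic model on sampled inputs and estimating the population's distribution from the resulting sample, can be sketched as follows; the exponential growth model and all parameter values are invented for illustration.

```python
import math
import random

random.seed(0)

def deterministic_growth(n0, mu, hours):
    """Simple deterministic (exponential) growth model, log10 CFU/ml."""
    return n0 + mu * hours / math.log(10)

# NPMPM-style sketch: draw variable model inputs, run the deterministic
# model for each draw, and use the resulting sample to estimate the
# distribution of the bacterial concentration.
sample = []
for _ in range(10_000):
    mu = random.gauss(0.3, 0.05)   # growth rate [1/h], hypothetical
    n0 = random.gauss(1.0, 0.2)    # initial contamination, log10 CFU/ml
    sample.append(deterministic_growth(n0, mu, hours=24))

sample.sort()
mean = sum(sample) / len(sample)
p95 = sample[int(0.95 * len(sample))]
print(round(mean, 2), round(p95, 2))   # mean and 95th percentile of the sample
```

The variability thus enters through the sampled inputs rather than through a stochastic process model, which is the distinction the abstract draws against existing approaches.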
Constraint-based methods (CBMs) are promising tools for the analysis of metabolic networks, as they do not require detailed knowledge of the biochemical reactions. Some of these methods only need information about the stoichiometric coefficients of the reactions and their reversibility types, i.e., constraints for steady-state conditions. Nevertheless, CBMs have their own limitations. For example, these methods may be sensitive to missing information in the models. Additionally, they may be slow for the analysis of genome-scale metabolic models. As a result, some studies prefer to consider substructures of networks instead of complete models, while other studies have focused on better implementations of the CBMs. In Chapter 2, the sensitivity of flux coupling analysis (FCA) to missing reactions is studied. Genome-scale metabolic reconstructions are comprehensive, yet incomplete, models of real-world metabolic networks. While FCA has proved to be an appropriate method for analyzing metabolic relationships and for detecting functionally related reactions in such models, little is known about the impact of missing reactions on the accuracy of FCA. Note that having missing reactions is equivalent to deleting reactions, or to deleting columns from the stoichiometric matrix. Based on an alternative characterization of flux coupling relations using elementary flux modes, we study the changes that flux coupling relations may undergo due to missing reactions. In particular, we show that two uncoupled reactions in a metabolic network may be detected as directionally, partially or fully coupled in an incomplete version of the same network. Even a single missing reaction can cause significant changes in flux coupling relations. In the case of two consecutive E. coli genome-scale networks, many fully coupled reaction pairs in the incomplete network become directionally coupled or even uncoupled in the more complete reconstruction.
In this context, we found gene expression correlation values to be significantly higher for the pairs that remained fully coupled than for the uncoupled or directionally coupled pairs. Our study clearly suggests that FCA results are indeed sensitive to missing reactions. Since the currently available genome-scale metabolic models are incomplete, we advise using FCA results with care. In Chapter 3, a different but related problem is considered. Due to the large size of genome-scale metabolic networks, some studies suggest analyzing subsystems instead of the original genome-scale models. Note that the analysis of a subsystem is equivalent to the deletion of some rows from the stoichiometric matrix, or identically, to assuming some internal metabolites to be external. We show mathematically that analyzing a subsystem instead of the original model can cause flux coupling relations to change. In particular, a pair of (fully, partially or directionally) coupled reactions may be detected as uncoupled in the chosen subsystem. Interestingly, this behavior is the opposite of the flux coupling changes that may happen due to missing reactions, or equivalently, the deletion of reactions. We also show that the analysis of organelle subsystems has relatively little influence on the results of FCA, and therefore many of these subsystems may be studied independently of the rest of the network. In Chapter 4, we introduce a rapid FCA method that is appropriate for genome-scale networks. Previously, several approaches for FCA have been proposed in the literature, namely the flux coupling finder algorithm, FCA based on minimal metabolic behaviors, and FCA based on elementary flux patterns. To the best of our knowledge, none of these methods is available as freely available software. Here, we introduce a new FCA algorithm, FFCA (Feasibility-based Flux Coupling Analysis). This method is based on checking the feasibility of a system of linear inequalities.
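The feasibility check underlying FFCA can be sketched on a toy network using linear programming; the network and the use of `scipy.optimize.linprog` are illustrative assumptions, not the published implementation.

```python
import numpy as np
from scipy.optimize import linprog

# Toy irreversible network (hypothetical): R1: -> A, R2: A -> B,
# R3: B -> out, R4: A -> out.  Rows = [A, B], columns = [R1..R4].
S = np.array([[1, -1,  0, -1],
              [0,  1, -1,  0]])
n = S.shape[1]

def can_carry_flux(active, blocked):
    """Feasibility check in the spirit of FFCA: is there a steady-state
    flux with v[active] = 1 while v[blocked] = 0?"""
    bounds = [(0, None)] * n
    bounds[active] = (1, 1)
    bounds[blocked] = (0, 0)
    res = linprog(c=np.zeros(n), A_eq=S, b_eq=np.zeros(2),
                  bounds=bounds, method="highs")
    return res.status == 0   # status 0: a feasible solution was found

# R3 is directionally coupled to R1: no flux through R3 without R1 ...
print(can_carry_flux(active=2, blocked=0))   # → False
# ... but R1 can carry flux without R3 (mass can exit via R4).
print(can_carry_flux(active=0, blocked=2))   # → True
```

Each coupling query thus reduces to one feasibility test of a linear system, which is what makes this style of FCA fast enough for genome-scale networks.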
We show on a set of benchmarks that for genome-scale networks FFCA is faster than other existing FCA methods. Using FFCA, flux coupling analysis of the genome-scale networks of S. cerevisiae and E. coli can be performed in a few hours on a normal PC. A corresponding software tool is freely available for non-commercial use. In Chapter 5, we introduce a new concept that can be useful in the analysis of fluxes in network substructures. The analysis of elementary modes (EMs) has proven to be a powerful CBM in the study of metabolic networks. However, the enumeration of EMs is a hard computational task. Additionally, due to their large numbers, one cannot simply use them as an input for subsequent analyses. One possibility is to restrict the analysis to a subset of interesting reactions, rather than the whole network. However, the analysis of an isolated subnetwork can result in finding incorrect EMs, i.e., ones that are not part of any steady-state flux distribution in the original network. The ideal set of vectors to describe the usage of reactions in a subnetwork would be the set of all EMs projected onto the subset of interesting reactions. Recently, the concept of "elementary flux patterns" (EFPs) has been proposed. Each EFP is a subset of the support (i.e., the non-zero elements) of at least one EM. In the present work, we introduce the concept of ProCEMs (Projected Cone Elementary Modes). The ProCEM set can be computed by projecting the flux cone onto the lower-dimensional subspace and enumerating the extreme rays of the projected cone. In contrast to EFPs, ProCEMs are not merely sets of reactions; from the mathematical point of view, they are projected EMs. We additionally prove that the set of EFPs is included in the set of ProCEM supports. Finally, ProCEMs and EFPs are compared in the study of substructures in biological networks.
The phase problem is the major problem in the field of X-ray crystallography. In the context of direct methods, which use mathematical techniques to compute an electron density map from the diffraction data without any further experiments, binary integer programming models for solving the phase problem have been developed. Based on descriptions of topological properties of 2-dimensional binary pictures known from the field of discrete tomography, these models have been extended to the 3-dimensional case. As these formulations are in general not sufficient to describe the more complex properties of the shape of proteins, binary integer programs have been derived to describe different additional topological properties. In general, the binary integer program for solving the phase problem leads to a set of different optimal solutions. The additional constraints increase the quality of the solution set. The main property considered is one restricting the number of connected components in the resulting solution. Using graph-theoretical methods and a separation algorithm, a model describing this property has been found and implemented. Computational results have been presented and evaluated. It has been shown that the added topological constraints significantly increase the quality of the solution set. In the last chapter, a method to find all solutions at once, based on singular value decomposition and methods for finding integer points in ellipsoids, has been developed. In further work, the efficiency of this method for the phase problem should be evaluated, and the method could be implemented and tested. In order to further increase the quality of the solutions, more additional constraints could be formulated and added. If the running time of the solving algorithm could be decreased, a refinement of the model would be possible. Bigger grids could be considered, showing more details of the reconstructed protein. More than just four phase values could be introduced.
A restriction of the electron density distribution to a finite number of states, instead of just the two binary ones, would be a possible extension. Based on the promising results presented here, there is much room for further work extending and refining the developed approaches.
Constraint-based approaches have proved successful in analyzing complex metabolic networks. They restrict the range of all possible behaviors that a metabolic system can display under governing constraints. The set of all possible flux distributions over a metabolic network at steady state defines a polyhedral cone, the steady-state flux cone. This cone can be analyzed using an inner description, based on sets of generating vectors such as elementary modes or extreme pathways. We present a new constraint-based approach to metabolic network analysis, characterizing a metabolic network by its minimal metabolic behaviors and the reversible metabolic space. Our method uses an outer description of the flux cone, based on sets of non-negativity constraints. The resulting description is minimal and unique. We then study the relationship between inner and outer descriptions of the cone. We give a generic procedure to show how inner descriptions can be computed from the outer one. We use this procedure to explain why the size of the inner descriptions may be several orders of magnitude larger than that of the outer description. Our approach suggests a refined classification of reactions according to their reversibility type (irreversible, pseudo-irreversible, and fully reversible). Using these concepts, we improve an existing algorithm for identifying blocked and coupled reactions and devise a new algorithm for flux coupling analysis. We extend this analysis by introducing minimal direction cuts (MDCs) which prevent a target reaction from being performed in an undesired direction. We develop an algorithm which allows not only for computing MDCs in a metabolic network, but also for a direct calculation of minimal cut sets (MCSs). Based on our refined classification of reactions, we also provide a constraint-based approach for analyzing the changes in the overall capabilities of a metabolic network following a gene deletion. 
Flux coupling and gene deletion analysis help to identify important reactions in metabolic networks. Alternatively, the essentiality of reactions can be assessed using control-effective flux (CEF) analysis, which is based on elementary modes. We compare CEF analysis with the use of a minimal generating set of the flux cone to elucidate crucial reactions.