draft programme evobio 2012

please note that the final order of presentations may change

Thursday 12 April

Poster Session (09:30-11:00)

A Biologically Informed Method For Detecting Associations With Rare Variants

Carrie Buchanan, John Wallace, Eric Torstenson, Sarah Pendergrass, Marylyn Ritchie
With the recent flood of genome sequence data, there has been increasing interest in rare variants and methods to detect their association to disease. Many of these methods are collapsing strategies which bin rare variants based on allele frequency and functional predictions; but at this point, most have been limited to candidate gene studies with a small number of candidate genes. We propose a novel method to collapse rare variants based on incorporating biological information from the public domain. This paper introduces the functionality of BioBin, a biologically informed method to collapse rare variants and detect associations with a particular phenotype. We tested BioBin using low coverage data from the 1000 Genomes Project and discovered appropriate binning characteristics based on what one might expect given the size of the gene. We also tested BioBin using the pilot targeted exome data from 1000 Genomes Project. We used biologically-informed binning and differences in minor allele frequencies as a means to distinguish between two ancestral populations. Although BioBin is still in developmental stages, it will be a useful tool in analyzing sequence data and uncovering novel associations with complex disease.

Improving Phylogenetic Tree Interpretability by means of Evolutionary Algorithms

Francesco Cerutti, Luigi Bertolotti, Tony L. Goldberg, Mario Giacobini
A recent research article, entitled ‘Taxon ordering in phylogenetic trees: a workbench test’ presented the application of an evolutionary algorithm to order taxa in a phylogenetic tree, according to a given distance matrix. In previous articles, the authors introduced the first approaches to study the influence of algorithm parameters on the efficacy of finding the tree with the shortest distance among taxa, based on genetic distances. In the considered work, the authors tested the algorithm using both genetic and geographic distances, and a combination of the two, on three phylogenetic trees of different viruses. The results were interesting, especially when applying geographic distances, allowing a new reading direction, orthogonal to the classical root-to-taxa one.

Artificial Immune Systems Perform Valuable Work When Detecting Epistasis in Human Genetic Datasets

Delaney Granizo-Mackenzie, Jason H. Moore
We implement an Artificial Immune System (AIS) for epistasis detection in human genetic datasets. Our AIS outperforms previous attempts to solve the same problem by Penrod et al. by a factor of over 2.4 and performs at 81% of the power of the field standard exhaustive search, Multifactor Dimensionality Reduction (MDR). We show that the immune system performs best when ‘paring down’ large antibodies to more specific and accurate classifiers. This is promising as it shows that the AIS is doing valuable work, and needs not rely on a near-perfect antibody showing up by chance. We perform a receiver operator characteristic (ROC) analysis to further examine this property.

A NSGA-II Algorithm for the Residue-residue Contact Prediction

Alfonso E. Márquez-Chamorro, Federico Divina, Jesús S. Aguilar-Ruiz, Jaume Bacardit, Gualberto Asencio-Cortés, Cosme E. Santiesteban-Toca
We present a multi-objective evolutionary approach to predict protein contact maps. The algorithm provides a set of rules, inferring whether there is contact between a pair of residues or not. Such rules are based on a set of specific amino acid properties. These properties determine the particular features of each amino acid represented in the rules. In order to test the validity of our proposal, we have compared results obtained by our method with results obtained by other classification methods. The algorithm shows better accuracy and coverage rates than other contact map predictor algorithms. A statistical analysis of the resulting rules was also performed in order to extract conclusions of the protein folding problem.

Short-range interactions and decision tree-based protein contact map predictor

Cosme E. Santiesteban-Toca, Alfonso E. Márquez-Chamorro, Gualberto Asencio-Cortés , Jesús S. Aguilar-Ruiz
In this paper, we focus on protein contact map prediction, one of the most important intermediate steps of the protein folding problem. The objective of this research is to know how short-range interactions can contribute to a system based on decision trees to learn about the correlation among the covalent structures of a protein residues. We propose a solution to predict protein contact maps that combines the use of decision trees with a new input codification for short-range interactions. The method’s performance was very satisfactory, improving the accuracy instead using all information of the protein sequence. For a globulin data set the method can predict contacts with a maximal accuracy of 43\%. The presented predictive model illustrates that short-range interactions play the predominant role in determining protein structure.

Measuring gene expression noise in early Drosophila embryos: the highly dynamic compartmentalized micro-environment of the blastoderm is one of the main sources of noise

Alexander Spirov, Nina Golyandina, David Holloway, Theodore Alexandrov, Ekaterina Spirova, Francisco Lopes
Fluorescence imaging has become a widely used technique for quantitatively measuring mRNA or protein expression. The first measurements were on gene expression noise in bacteria and yeast. The relative biological and physicochemical simplicity of these single cells encouraged a number of groups to try similar approaches in multicellular organisms. Such work has been primarily on whole Drosophila embryos, where the genes forming the body plan are very well understood. The numerous sources of noise in complex embryonic tissues are a major challenge for characterizing gene expression noise. Here, we present our approach for first separating experimental from biological noise, followed by distinguishing sources of biological noise. We decompose raw signal into trend and residual noise using Singular Spectrum Analysis. We demonstrate our statistical techniques on the Drosophila Hunchback protein pattern. We show that the ‘texture noise’, arising from the pre-cellular compartmentalization of the embryo surface, which is highly dynamic in time, is a major component of total biological noise, and can exceed gene transcription/translation noise.

Feature Selection for Lung Cancer Detection using SVM based Recursive Feature Elimination Method

Kesav Kancherla, Srinivas Mukkala
Cancer is the uncontrolled growth of abnormal cells, which do not carry out the functions of normal cells. Lung cancer is the leading cause of death due to cancer in the world. The survival rate of cancer is about 15%. In order to improve the survival rate, we need an early detection method. In this study, we propose a new method for early detection of lung cancer using Tetrakis Carboxy Phenyl Porphine (TCPP) and well-known machine learning techniques. Tetrakis Carboxy Phenyl Porphine (TCPP) is a porphyrin that is able to label cancer cells due to the increased numbers of low density lipoproteins coating the surface of cancer cells and the porous nature of the cancer cell membrane. In our previous work we studied the performance of well know ma-chine learning techniques in the context of classification accuracy on Biomoda internal study. We used 79 features related to shape, intensity, and texture. We obtained an accuracy of 80% using the current feature set. In order to improve the accuracy of our method, we performed feature selection on these 79 features. We used Support Vector Machine (SVM) based Recursive feature Elimination (RFE) method in our experiments. We obtained an accuracy of 87.5% using reduced 19 feature set.

Complex detection in protein-protein interaction networks: a compact overview for researchers and practitioners

Clara Pizzuti, Simona E. Rombo, Elena Marchiori
The availability of large volumes of protein-protein interaction data has allowed the study of biological networks to unveil the complex structure and organization in the cell. It has been recognized by biologists that proteins interacting with each other often participate in the same biological processes, and that protein modules may be often associated with specific biological functions. Thus the detection of protein complexes is an important research problem in systems biology. In this review, recent graph-based approaches to clustering protein interaction networks are described and classified with respect to common peculiarities. The goal is that of providing a useful guide and reference for both computer scientists and biologists.

Thursday 12 April

Best paper nominees (11:20-13:00)

Chair: Mario Giacobini

EvoBIO: past, present and future (Invited talk to celebrate the 10th anniversary of EvoBIO.)

Jason H. Moore
EvoBIO was started as a workshop ten years ago to explore the interface between the emerging area of bioinformatics and the mature disciplines of evolutionary computing, data mining and machine learning. There was a sense at the time that computational intelligence had a very important role to play in biological and biomedical research given the complexity of the problems and the coming tidal wave of ‘big data’. The workshop was a success and quickly transitioned to a full conference with high-quality papers on cutting edge topics in bioinformatics. The goal of this presentation is review the ten-year history of the conference and to highlight some of the emerging topics that are likely to dominate the next ten years.

Automatic Task Decomposition for the NeuroEvolution of Augmenting Topologies (NEAT) Algorithm

Timmy Manning, Paul Walsh
Neuroevolution, the process of creating artificial neural networks through simulated evolution, can become impractical for arbitrarily complex problems requiring large or intricate neural network architectures. The modular feed forward neural network (MFFN) architecture decomposes a problem among a number of independent task specific neural networks, and is suggested here as a means of managing neuroevolution for complex problems. We present an algorithm for evolving MFFN architectures based on the NeuroEvolution of Augmenting Topologies (NEAT) algorithm. The algorithm proposed here, denoted MFF-NEAT, outlines an approach to automatically evolving, attributing fitness values and combining the task specific networks in a principled manner.

Optimizing the Edge Weights in Optimal Assignment Methods for Virtual Screening with Particle Swarm Optimization

Lars Rosenbaum, Andreas Jahn, Andreas Zell
Ligand-based virtual screening experiments are an important task in the early drug discovery stage. In such an experiment, a chemical database is searched for molecules with similar properties to a given query molecule. The optimal assignment approach for chemical graphs has proven to be a successful method for various cheminformatic tasks, such as virtual screening. The optimal assignment approach assumes all atoms of a query molecule to have the same importance. This assumption is not realistic in a virtual screening for ligands against a specific protein target. In this study, we propose an extension of the optimal assignment approach that allows for assigning different importance to the atoms of a query molecule by weighting the edges of the optimal assignment. Then, we show that particle swarm optimization is able to optimize these edge weights for optimal virtual screening performance. We compared the optimal assignment with optimized edge weights to the original version with equal weights on various benchmark data sets using sophisticated virtual screening performance metrics. The results show that the optimal assignment with optimized edge weights achieved a considerably better performance. Thus, the proposed extension in combination with particle swarm optimization is a valuable approach for ligand-based virtual screening experiments.

Inferring Phylogenetic Trees Using a Multiobjective Artificial Bee Colony Algorithm

Sergio Santander-Jiménez, Miguel A. Vega-Rodríguez, Juan A. Gómez-Pulido, Juan M. Sánchez-Pérez
Phylogenetic Inference is considered as one of the most important research topics in the field of Bioinformatics. A variety of methods based on different optimality measures has been proposed in order to build and evaluate the trees which describe the evolution of species. A major problem that arises with this kind of techniques is the possibility of inferring discordant topologies from a same dataset. Another question to be resolved is how to manage the tree search process. As the space of possible topologies increases exponentially with the number of species in the input dataset, exhaustive methods cannot be applied. In this paper we propose a multiobjective adaptation of a well-known Swarm Intelligence algorithm, the Artificial Bee Colony, to reconstruct phylogenetic trees according to two criteria: maximum parsimony and maximum likelihood. Our approach shows a significant improvement in the quality of the inferred trees compared to other multiobjective proposals.

Thursday 12 April

Complex Biological Systems (14:30-16:10)

Chair: Leonardo Vanneschi

Inferring disease-related metabolite dependencies with a Bayesian Optimization Algorithm

Holger Franken, Alexander Seitz, Rainer Lehmann, Hans-Ulrich Häring, Norbert Stefan, Andreas Zell
Understanding disease-related metabolite interactions is a key issue in computational biology. We apply a modified Bayesian Optimization Algorithm to targeted metabolomics data from plasma samples of insulin-sensitive and -resistant subjects both suffering from non-alcoholic fatty liver disease. In addition to improving the classification accuracy by selecting relevant features, we extract the information that led to their selection and reconstruct networks from detected feature dependencies. We compare the influence of a variety of classifiers and different scoring metrics and examine whether the reconstructed networks represent physiological metabolite interconnections. We find that the presented method is capable of significantly improving the classification accuracy of otherwise hardly classifiable metabolomics data and that the detected metabolite dependencies can be mapped to physiological pathways, which in turn were affirmed by literature from the domain.

A GPU-based Multi-Swarm PSO Method for Parameter Estimation in Stochastic Biological Systems Exploiting Discrete-Time Target Series

Marco Nobile, Daniela Besozzi, Paolo Cazzaniga, Giancarlo Mauri, Dario Pescini
Parameter estimation (PE) of biological systems is one of the most challenging problems in Systems Biology. Here we present a PE method that integrates particle swarm optimization (PSO) to estimate the value of kinetic constants, and a stochastic simulation algorithm to reconstruct the dynamics of the system. The fitness of candidate solutions, corresponding to vectors of reaction constants, is defined as the point-to-point distance between a simulated dynamics and a set of experimental measures, carried out using discrete-time sampling and various initial conditions. A multi-swarm PSO topology with different modalities of particles migration is used to account for the different laboratory conditions in which the experimental data are usually sampled. The whole method has been specifically designed and entirely executed on the GPU to provide a reduction of computational costs. We show the effectiveness of our method and discuss its performances on an enzymatic kinetics and a prokaryotic gene expression network.

Tracking the evolution of cooperation in complex networked populations

Flávio Pinheiro, Francisco Santos, Jorge Pacheco
Social networks affect in such a fundamental way the dynamics of the population they support that the global, population-wide behavior that one observes often bears no relation to the agent processes it stems from. Up to now, linking the global networked dynamics to such agent mechanisms has remained elusive. Here we define an observable dynamic and use it to track the self-organization of cooperators when co-evolving with defectors in networked populations interacting via a Prisoner’s Dilemma. Computations on homogeneous networks evolve towards the coexistence between cooperator and defector agents, while computations in heterogeneous networks lead to the coordination between them. We show how the global dynamics co-evolves with the motifs of cooperator agents in the population, the overall emergence of cooperation depending sensitively on this co-evolution.

Prediction of mitochondrial matrix protein structures based on feature selection and fragment assembly

Gualberto Asencio-Cortés, Jesús S. Aguilar-Ruiz, Alfonso E. Márquez-Chamorro, Roberto Ruiz, Cosme E. Santiesteban-Toca
Protein structure prediction consists in determining the three-dimensional conformation of a protein based only on its amino acid sequence. This is currently a difficult and significant challenge in structural bioinformatics because these structures are necessary for drug designing. This work proposes a method that reconstructs protein structures from protein fragments assembled according to their physico-chemical similarities, using information extracted from known protein structures. Our prediction system produces distance maps to represent protein structures, which provides more information than contact maps, which are predicted by many proposals in the literature. Most commonly used amino acid physico-chemical properties are hydrophobicity, polarity and charge. In our method, we performed a feature selection on the 544 properties of the AAindex repository, resulting in 16 properties which were used to predictions. We tested our proposal on 74 mitochondrial matrix proteins with a maximum sequence identity of 30% obtained from the Protein Data Bank. We achieved a recall of 0.80 and a precision of 0.79 with an 8-angstrom cut-off and a minimum sequence separation of 7 amino acids. Finally, we compared our system with other relevant proposal on the same benchmark and we achieved a recall improvement of 50.82%. Therefore, for the studied proteins, our method provides a notable improvement in terms of recall.

Thursday 12 April

eurogp and evobio joint special session (16:30-18:10)

Chair: TBA

Evolutionary Reaction Systems

Luca Manzoni, Mauro Castelli, Leonardo Vanneschi
In the recent years many bio-inspired computational methods were defined and successfully applied to real life problems. Examples of those methods are particle swarm optimization, ant colony, evolutionary algorithms, and many others. At the same time, computational formalisms inspired by natural systems were defined and their suitability to represent different functions efficiently was studied. One of those is a formalism known as reaction systems. The aim of this work is to establish, for the first time, a relationship between evolutionary algorithms and reaction systems, by proposing an evolutionary version of reaction systems. In this paper we show that the resulting new genetic programming system has better, or at least comparable performances to a set of well known machine learning methods on a set of problems, also including real-life applications. Furthermore, we discuss the expressiveness of the solutions evolved by the presented evolutionary reaction systems.

Lévy-flight Genetic Programming: towards a new mutation paradigm

Christian Darabos, Mario Giacobini, Ting Hu, Jason H. Moore
Lévy flights are a class of random walks inspired directly by observing animal foraging habits, in which the stride length is drawn from a power-law distribution. This implies that the vast majority of the strides will be short. However, on rare occasions, the stride are gigantic. We use this technique to self-adapt the mutation rate used in Linear Genetic Programming. We apply this original approach to three different classes of problems: Boolean regression, quadratic polynomial regression, and surface reconstruction. We find that in all cases, our method outperforms the generic, commonly used constant mutation rate of 1 over the size of the genotype. We compare different common values of the power-law exponent to the regular spectrum of constant values used habitually. We conclude that our novel method is a viable alternative to constant mutation rate, especially because it tends to reduce the number of parameters of genetic programing.

Understanding zooplankton long term variability through Genetic Programming

Simone Marini, Alessandra Conversi
Zooplankton are considered good indicators for understanding how oceans are affected by climate change. While climate influence on zooplankton abundance variability is currently accepted, its mechanisms are not understood, and prediction is not yet possible. This paper utilizes the Genetic Programming approach to identify which environmental variables, and at which extent, can be used to express zooplankton abundance dynamics. The zooplankton Copepod long term time (since 1988) series from the L4 station in the Western English Channel, has been used as test case together with local environmental parameters and large scale climate indices. The performed simulations identify a set of relevant ecological drivers and highlight the non linear dynamics of the Copepod variability. The results indicate GP as a promising approach for understanding the long term variability of marine populations.

GeNet: A Graph-Based Genetic Programming Framework for the Reverse Engineering of Gene Regulatory Networks

Leonardo Vanneschi, Matteo Mondini, Martino Bertoni, Alberto Ronchi, Mattia Stefano
A standard tree-based genetic programming system, called GRNGen, for the reverse engineering of gene regulatory networks starting from time series datasets, was proposed in EvoBIO 2011. Despite the interesting results obtained on the simple IRMA network, GRNGen has some important limitations. For instance, in order to reconstruct a network with GRNGen, one single regression problem has to be solved by GP for each gene. This entails a clear limitation on the size of the networks that it can reconstruct, and this limitation is crucial, given that real genetic networks generally contain large numbers of genes. In this paper we present a new system, called GeNet, which aims at overcoming the main limitations of GRNGen, by directly evolving entire networks using graph-based genetic programming. We show that GeNet finds results that are comparable, and in some cases even better, than GRNGen on the small IRMA network, but, even more importantly (and contrarily to GRNGen), it can be applied also to larger networks. Last but not least, we show that the time series datasets found in literature do not contain a sufficient amount of information to describe the IRMA network in detail.

Friday 12 April

Genomics (11:20-13:00)

Chair: William S. Bush

Finding Motifs in DNA Sequences Applying a Multiobjective Artificial Bee Colony (MOABC) Algorithm

David L. González-Álvarez, Miguel A. Vega-Rodríguez, Juan A. Gómez-Pulido, Juan M. Sánchez-Pérez
Multiobjective optimization is successfully applied in many biological problems. Currently, most biological problems require to optimize more than one single objective at the same time, resulting in Multiobjective Optimization Problems (MOP). In the last years, multiple metaheuristics have been successfully used to solve optimization problems. However, many of them are designed to solve problems with only one objective function. In this work, we study several multiobjective adaptations to solve one of the most important biological problems, the Motif Discovery Problem (MDP). MDP aims to discover novel Transcription Factor Binding Sites (TFBS) in DNA sequences, maximizing three conflicting objectives: motif length, support, and similarity. For this purpose, we have used the Artificial Bee Colony algorithm, a novel Swarm Intelligence algorithm based on the intelligent behavior of honey bees. As we will see, the use of one or another multiobjective adaptation causes significant differences in the results.

In Silico Infection of the Human Genome

W. B. Langdon, J. Arno
The human genetic sequence database contains DNA sequences very like those of mycoplasma bacteria. It appears such bacteria infect not only molecular Biology laboratories but their genes were picked up from contaminated samples and inserted into GenBank as if they were homo sapiens. At least one mouldy EST (Expressed Sequence Tag) has transferred from online public databases on the Internet to commercial tools (Affymetrix HG-U133 plus 2.0 microarrays). We report a second example (DA466599) and suggest there is a need to clean up genomic databases but fear current tools will be inadequate to catch genes which have jumped the silicon barrier.

The Role of Mutations in Whole Genome Duplication

Qinxin Pan, Christian Darabos, Jason H. Moore
Genetic mutation is an essential factor in the evolution of biological organisms and a driving force of phenotypical innovation. On rare occasions, nature takes a major evolutionary leap during which an organism’s gene repertoire suddenly doubled. Genetic mutation affects both the whole genome duplication as it happens, and also during all the subsequent evolutionary steps. We develop a Boolean model of gene regulatory networks that simulates the duplication event and subsequent Darwinian evolution using an evolutionary algorithm. We analyze the role of these two different types of mutations on synthetic systems. Our results show that high duplication mutation rate triggers the development of new phenotypes, advantageous in a changing environment, to the detriment of environmental robustness. Additionally, our research highlights the necessity of a low evolutionary mutation rate for the survival of duplicated individuals within a mixed population, ensuring the spreading novel phenotype. We conclude that both types of mutations play complementary roles in determining the successful propagation of organisms with duplicated genomes.

Comparison of methods for meta-dimensional data analysis using in silico and biological data sets

Emily Holzinger, Scott Dudek, Alex Frase, Brooke Fridley, Prabhakar Chalise, Marylyn Ritchie
Recent technological innovations have catalyzed the generation of a massive amount of data at various levels of biological regulation, including DNA, RNA and protein. Due to the complex nature of biology, the underlying model may only be discovered by integrating different types of high-throughput data to perform a ìmeta-dimensionalî analysis. For this study, we used simulated gene expression and genotype data to compare three methods that show potential for integrating different types of data in order to generate models that predict a given phenotype: the Analysis Tool for Heritable and En-vironmental Network Associations (ATHENA), Random Jungle (RJ), and Lasso. Based on our results, we applied RJ and ATHENA sequentially to a biological data set that consisted of genome-wide genotypes and gene expression levels from lymphoblastoid cell lines (LCLs) to predict cytotoxicity. The best model consisted of two SNPs and two gene expression variables with an r-squared value of 0.32.