Talks

Thursday 28 April 09:30-11:00, room 3: Leonardo, Genomics I

Chair: Marylyn Ritchie

Effect of Using Varying Negative Examples in Transcription Factor Binding Site Predictions (Best Paper Nomination)
Faisal Rezwan, Yi Sun, Neil Davey, Alisatir G. Rust, Mark Robinson

Identifying transcription factor binding sites computationally is a hard problem as it produces many false predictions. Combining the predictions from existing predictors can improve the overall predictions by using classification methods like Support Vector Machines(SVM). But conventional negative examples (that is,example of non-binding sites in this type of problem are highly unreliable. In this study, we have used different types of negative examples. One class of the negative examples has been taken from far away from the promoter regions, where the occurrence of binding sites is very low, and another one has been produced by randomization. Thus we observed the effect of using different negative examples in predicting transcription factor binding sites in mouse. We have also devised a novel cross-validation technique for this type of biological problem.

A New Evolutionary Gene Regulatory Network Reverse Engineering Tool (Best Paper Nomination)
Antonella Farinaccio, Leonardo Vanneschi, Paolo Provero, Giancarlo Mauri, Mario Giacobini

We present a new reverse-engineering framework for gene regulatory network reconstruction. It works on temporal series of gene activation data and, using genetic programming, it extracts the activation functions of the different genes from those data. Successively, the gene regulatory network is reconstructed exploiting the automatic feature selection performed by genetic programming and its dynamics can be simulated using the previously extracted activation functions. The framework was tested on the well-known IRMA gene regulatory network, a simple network composed by five genes in the yeast Saccharomyces cerevisiae, defined in~2009 as a simplified biological model to benchmark reverse-engineering approaches. We show that the performances of the proposed framework on this benchmark network are encouraging.

ML-Consensus: A General Consensus Model for Variable-Length Transcription Factor Binding Sites
Saad Quader, Nathan Snyder, Kevin Su, Ericka Mochan, Chun-Hsi Huang

Many DNA motif finding algorithms that use Consensus (or any of its variants) in its motif model implicitly impose some restrictive assumptions over transcription factor (TF) binding sites (TFBS). Examples include all binding sites being of equal length, or having exactly one core region with fixed format, etc. In this paper, we have constructed a generalized consensus model (called Mixed-Length-Consensus, or ML-Consensus) without such constraints through multiple sequence alignment of known TFBS. We have extended this model with Information Content (IC) and Pairwise nucleotide correlation Score (PS), and have experimented with using multiple ML-Consensus for a set of binding sites. We have performed leave-one-out cross validation for training and testing of this algorithm over real binding site data of human, mouse, fruit fly, and yeast. We have produced ROC curves (True Positive Rate against False Positive Rate) for these experiments, and have used Wilcoxon Matched-Pair Signed Ranks Test to determine their statistical significance. Our results show that using IC and PS together with ML-Consensus consistently leads to better performance. We have experimented with various scopes for PS, and have found that scope values of 3-5 yield significantly better performance for different configurations. We have also found that using multiple ML-Consensus for one TF significantly improves recognition performance, but single ML-Consensus does better in yeast than in human data. Finally, we have found that using different multiple sequence alignment strategies for ML-Consensus yields varied performance across different species; a naive sorting based multiple sequence alignment outperformed CLUSTAL W2 alignment in yeast data.

Thursday 28 April 11:30-13:00, room 3: Leonardo, Genomics II

Chair: Elena Marchiori

Applying linear models to learn regulation programs in a transcription regulatory module network

Jianlong Qi, Tom Michoel, Gregory Butler

The module network method has been widely used to infer transcriptional regulatory network from gene expression data. A common strategy of module network learning algorithms is to apply regression trees to infer the regulation program of a module. In this work we propose to apply linear models to fulfill this task. The novelty of our method is to extract the contrast in which a module’s genes are most significantly differentially expressed. Consequently, the process of learning the regulation program for the module becomes one of identifying transcription factors that are also differentially expressed in this contrast. The effectiveness of our algorithm is demonstrated by the experiments in a yeast benchmark dataset.

ATHENA Optimization: The Effect of Initial Parameter Settings Across Different Genetic Models

Emily R. Holzinger, Scott M. Dudek, Eric C. Torstenson, Marylyn D. Ritchie

Rapidly advancing technology has allowed for the generation of massive amounts data assessing variation across the human genome. One analysis method for this type of data is the genome-wide association study (GWAS) where each variation is assessed individually for association to disease. While these studies have elucidated novel etiology, much of the variation due to genetics remains unexplained. One hypothesis is that some of the variation lies in gene-gene interactions. An impediment to testing for interactions is the infeasibility of exhaustively searching all multi-locus models. Novel methods are being developed that perform a non-exhaustive search. Because these methods are new to genetic studies, rigorous parameter optimization is necessary. Here, we assess genotype encodings, function sets, and cross-over in two algorithms which use grammatical evolution to optimize neural networks or symbolic regression formulas in the ATHENA software package. Our results show that the effect of these parameters is highly dependent on the underlying disease model.

Validating a Threshold-based Boolean Model of Regulatory Networks on a Biological Organism

Christian Darabos, Ferdinando Di Cunto, Marco Tomassini, Jason H. Moore, Paolo Provero, Mario Giacobini

Boolean models of regulatory networks are very attractive due to their simplicity and flexibility to integrate new development. We use the signaling network of a plant, along with the Boolean update functions attached to each element, to validate a previously proposed threshold-based additive update function. To do that, we determine the dynamical regime of the original system, then setup the parameters of the Boolean function to match this regime. Results show that there is a higher degree of overlap between the original function and the additive function than with random update function in the specific case at hand. This confirm a previous conjecture that the contribution of different transcription factors to the regulation of a target gene treated additively can explain a significant part of the variation in gene expression.

Thursday 28 April 14:30-16:00, room 3: Leonardo, Proteomics I

Chair: Mario Giacobini

A nearest neighbour-based approach for viral protein structure prediction
Gualberto Asencio-Cortés, Jesús S. Aguilar-Ruiz, Alfonso E. Márquez-Chamorro

Protein tertiary structure prediction consists of determining the three-dimensional conformation of a protein based solely on its amino acid sequence. This study proposes a method in which protein fragments are assembled according to their physicochemical similarities, using information extracted from known protein structures. Several existing protein tertiary structure prediction methods produce contact maps as their output. Our proposed method produces a distance map, which provides more information about the structure of a protein than a contact map. In addition, many existing approaches use the physicochemical properties of amino acids, generally hydrophobicity, polarity and charge, to predict structure. In our method, we used three different physicochemical properties of amino acids obtained from the literature. Using this method, we performed tertiary structure predictions on 63 viral capsid proteins with a maximum identity of 30\% obtained from the Protein Data Bank. We achieved a precision of 0.75 with an 8-angstrom cut-off and a minimum sequence separation of 7 amino acids. Thus, for the studied proteins, our results provide a notable improvement over those of other methods.

Annotated Stochastic Context Free Grammars for Analysis and Synthesis of Proteins
Eva Sciacca, Salvatore Spinella, Dino Ienco, Paola Giannini

An important step to understand the main functions of a specific family of proteins is the detection of protein features that could reveal how protein chains are constituted. To achieve this aim we treated amino acid sequences of proteins as a formal language, building a Context-Free Grammar annotated using an n-gram Bayesian classifier. This formalism is able to analyze the connection between protein chains and protein functions. In order to design new protein chains with the properties of the considered family we performed a rule clustering of the grammar to build an Annotated Stochastic Context Free Grammar. Our methodology was applied to a class of Antimicrobial Peptides (AmPs): the Frog antimicrobial peptides family. Through this case study, our approach pointed out some important aspects regarding the relationship between sequences and functional domains of proteins and how protein domain motifs are preserved by natural evolution in to the amino acid sequences. Moreover our results suggest that the synthesis of new proteins with a given domain architecture can be one of the fields where application of Annotated Stochastic Context Free Grammars can be useful.

Finding Motifs in DNA Sequences Applying a Multiobjective Artificial Bee Colony (MOABC) Algorithm
David L. González-Álvarez, Miguel A. Vega-Rodríguez, Juan A. Gómez-Pulido, Juan M. Sánchez-Pérez

In this work we propose the application of a Swarm Intelligence (SI) algorithm to solve the Motif Discovery Problem (MDP), applied to the specific task of discovering novel Transcription Factor Binding Sites (TFBS) in DNA sequences. In the last years there have appeared many new evolutionary algorithms based on the collective intelligence. Finding TFBS is crucial for understanding the gene regulatory relationship but, motifs are weakly conserved, and motif discovery is an NP-hard problem. Therefore, the use of such algorithms can be a good way to obtain quality results. The chosen algorithm is the Artificial Bee Colony (ABC), it is an optimization algorithm based on the intelligent foraging behaviour of honey bee swarm. To solve the MDP we have applied multiobjective optimization and consequently, we have adapted the ABC to multiobjective problems, defining the Multiobjective Artificial Bee Colony (MOABC) algorithm. New results have been obtained, that significantly improve those published in previous researches.

Thursday 28 April 16:20-17:50, room 3: Leonardo, Proteomics II

Chair: Clara Pizzuti

An Evolutionary Approach for Protein Contact Map Prediction
Alfonso E. M´arquez-Chamorro, Federico Divina, Jes´us S. Aguilar-Ruiz, Gualberto Asencio-Cort´es

In this study, we present a residue-residue contact prediction approach based on evolutionary computation. Some amino acid properties are employed according to their importance in the folding process: hydrophobicity, polarity, charge and residue size. Our evolutionary algorithm provides a set of rules which determine different cases where two amino acids are in contact. A rule represents two windows of three amino acids. Each amino acid is characterized by these four properties. We also include a statistical study for the propensities of contacts between each pair of amino acids, according to their types, hydrophobicity and polarity. Different experiments were also performed to determine the best selection of properties for the structure prediction among the cited properties.

Multi-neighborhood search for discrimination of signal peptides and transmembrane segments
Sami Laroum, Béatrice Duval, Dominique Tessier, Jin-Kao Hao

A key step in study of biosynthesis of membrane proteins is to look for the code that could be used to explain and predict which proteins would eventually be inserted in the membrane and which proteins would be secreted into the ER lumen when they cross the translocon channel. The aim of this work is to present an improvement of a previous method based on a local search approach. The proposed method relies on new in-depth biological observations to design a new search space for the local search algorithm. Experiments conducted on a dedicated dataset show that our new approach leads to improved outcomes in terms of prediction rates.

Approximation of Graph Kernel Similarities for Chemical Graphs by Kernel Principal Component Analysis
Georg Hinselmann, Andreas Jahn, Nikolas Fechner, Lars Rosenbaum, Andreas Zell

Graph kernels have been successfully applied on chemical graphs on small to medium sized machine learning problems. However, graph kernels often require a graph transformation before the computation can be applied. Furthermore, the kernel calculation can have a polynomial complexity of degree three and higher. Therefore, they cannot be applied in large instance-based machine learning problems. By using kernel principal component analysis, we mapped the compounds to the principal components, obtaining q-dimensional real-valued vectors. The goal of this study is to investigate the correlation between the graph kernel similarities and the similarities between the vectors. In the experiments we compared the similarities on various data sets, covering a wide range of typical chemical data mining problems. The similarity matrix between the vectorial projection was computed with the Jaccard and Cosine similarity coefficient and was correlated with the similarity matrix of the original graph kernel. The main result is that there is a strong correlation between the similarities of the vectors and the original graph kernel regarding rank correlation and linear correlation. The method seems to be robust and independent of the choice of the reference subset with observed standard deviations below 5%. An important application of the approach are instance-based data mining and machine learning tasks where the computation of the original graph kernel would be prohibitive.

Posters

Wednesday 27 April 18:15-19:00 in the Atrium

Experimental Approach for Bacterial Strains Characterization
Fabien Chhel, Adrien Göeffon, Frédéric Lardeux, Frédéric Saubion, Gilles Hunault, Tristan Boureau

In plant biology, data acquisition is no longer necessarily a major problem but nevertheless the treatment and the use of these data are still difficult. In this work, we are particularly interested by the characterization of strains of phytopathogenic bacterias, which is an important issue in the study of plant diseases. We study and compare several methods computing the smallest possible characterizations. These experiments have allowed us to characterize specific strains and diagnosis tests have been produced and used.

Do Diseases Spreading on Bipartite Networks Have Some Evolutionary Advantage?
Luca Ferreri, Ezio Venturino, Mario Giacobini

In this work we analyze the complexity of a disease that spreads among two populations and in which the transmission routes are available only throught individuals of the two different families. This peculiar route is typical of diseases such as sexual transmitted diseases on heterosexual populations or vector-host diseases such as tick-borne encephalitis or Lyme borreliosis. In such epidemiological scenarios, the contact network is naturally represented by a bipartite graphs. In this article we determine that a pathogen agent spreading on a bipartite network can have some evolutionary benefits with respect to diffusing on standard unipartite networks.

Genetic Algorithm Optimization of Force Field Parameters. Application to a Coarse-Grained Model of RNA
Filip Leonarski, Fabio Trovato, Valentina Tozzini, Joanna Trylska

Determining force field parameters for molecular dynamics simulations of reduced models of biomolecules is a long, troublesome, and exhaustive process that is often performed manually. To improve this parametrization procedure we apply a continuous-space Genetic Algorithm (GA). GA is implemented to optimize parameters of a coarse-grained potential energy function of ribonucleic acid (RNA) molecules. The parameters obtained using GA are correctly reproducing the dynamical behavior of an RNA helix and other RNA tertiary motifs. Therefore, GA can be a useful tool for force field parametrization of the effective potentials in coarse-grained molecular models.

A decision tree-based method for protein contact map prediction
Cosme Ernesto Santiesteban Toca, Alfonso E. Márquez-Chamorro, Gualberto Asencio-Cortés, Jesús S. Aguilar-Ruiz

In this paper, we focus on protein contact map prediction. We describe a method where contact maps are predicted using decision tree-based model. The algorithm includes the subsequence information between the couple of analyzed amino acids. In order to evaluate the method generalization capabilities, we carry out an experiment using 173 non-homologous proteins of known structures. Our results indicate that the method can assign protein contacts with an average accuracy of 0.34, superior to the 0.25 obtained by the FNETCSS method. This shows that our algorithm improves the accuracy with respect to the methods compared, especially with the increase of protein length.

A Comparison of Machine Learning Methods for the Prediction of Breast Cancer
Sara Silva, Orlando Anunciao, Marco Lotz

In this work we perform a comparison of machine learning methods in an association study with the goal of finding reliable classifiers that predict the presence or absence of breast cancer based on single nucleotide polymorphisms from the BRCA1, BRCA2 and TP53 genes. We emphasize how misleading some common statistical measures can be when evaluating classifiers whose learning was biased by an unbalanced dataset, as in our case. Then we compare and discuss the format of different solutions from the interpretability point of view, revealing a correlation between size and performance of the solutions, and also identify a small set of preferred features that agree with previously published work. We designate CART regression trees as the best classifiers, both in terms of performance and interpretability, and discuss how to improve the results reported here.

An Automatic Identification and Resolution System for Protein-related Abbreviations in Scientific Papers
Daniele Toti, Paolo Atzeni, Fabio Polticelli

We propose a methodology to identify and resolve protein-related abbreviations found in the full texts of scientific papers, as part of a semi-automatic process implemented in our PRAISED framework. The identification of biological acronyms is carried out via an effective syntactical approach, by taking advantage of lexical clues and using mostly domain-independent metrics, resulting in considerably high levels of recall as well as extremely low execution time. The subsequent abbreviation resolution uses both syntactical and semantic criteria in order to match an abbreviation with its potential explanation, as discovered among a number of contiguous words proportional to the abbreviation’s length. We have tested our system against the Medstract Gold Standard corpus and a relevant set of manually annotated PubMed papers, obtaining significant results and high performance levels, while at the same time allowing for great customization, lightness and scalability.

Protein Complex Discovery from Protein Interaction Network with High False-Positive Rate
Yunku Yeu, Jaegyoon Ahn, Youngmi Yoon, Sanghyun Park

Finding protein complexes and their functions is essential work for understanding biological process. However, one of the difficulties in inferring protein complexes from protein-protein interaction(PPI) network originates from the fact that protein interactions suffer from high false positive rate. We propose a complex finding algorithm which is not strongly dependent on topological traits of the protein interaction network. Our method exploits a new measure, GECSS(Gene Expression Condition Set Similarity) which considers mRNA expression data for a set of PPI. The complexes we found exhibit a higher match with reference complexes than the existing methods. Also we found several novel protein complexes, which are significantly enriched on Gene Ontology database.

evobio