IEEE/ACM Transactions on Computational Biology and Bioinformatics
https://www.computer.org/csdl/trans/tb/index.html
The IEEE/ACM Transactions on Computational Biology and Bioinformatics is a new quarterly that will publish archival research results related to the algorithmic, mathematical, statistical, and computational methods that are central in bioinformatics and computational biology; the development and testing of effective computer programs in bioinformatics; the development and optimization of biological databases; and important biological results that are obtained from the use of these methods, programs, and databases.
IEEE Computer Society Digital Library
List of 100 recently published journal articles.
https://www.computer.org/csdl
REPA: Applying Pathway Analysis to Genome-Wide Transcription Factor Binding Data
https://www.computer.org/csdl/trans/tb/2018/04/07152849-abs.html
Pathway analysis has been extensively applied to aid in the interpretation of the results of genome-wide transcription profiling studies, and has been shown to successfully find associations between the biological phenomena under study and biological pathways. There are two widely used approaches to pathway analysis: over-representation analysis and gene set analysis. Recently, genome-wide transcription factor binding data has become widely available, allowing pathway analysis to be applied to this type of data. In this work, we developed regulatory enrichment pathway analysis (REPA) to apply gene set analysis to genome-wide transcription factor binding data and infer associations between transcription factors and biological pathways. We used the transcription factor binding data generated by the ENCODE project, and gene sets from the Molecular Signatures and KEGG databases. Our results showed that 54 percent of the predictions examined have literature support and that REPA’s recall is roughly 54 percent. This level of precision is promising, as several of REPA’s predictions are expected to be novel and can be used to guide new research avenues. In addition, the results of our case studies showed that REPA enhances the interpretation of genome-wide transcription profiling studies by suggesting putative regulators behind the observed transcriptional responses.
08/07/2018 2:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2015.2453948
Proteomic Evidence for In-Frame and Out-of-Frame Alternatively Spliced Isoforms in Human and Mouse
https://www.computer.org/csdl/trans/tb/2018/04/07272055-abs.html
In order to find evidence for translation of alternatively spliced transcripts, especially those that result in a change in reading frame, we collected exon-skipping cases previously found by RNA-seq and applied a computational approach to screen millions of mass spectra. These spectra came from seven human and six mouse tissues, five of which are the same between the two organisms: liver, kidney, lung, heart, and brain. Overall, we detected 4 percent of all exon-skipping events found in RNA-seq data, regardless of their effect on reading frame. The fraction of alternative isoforms detected did not differ between out-of-frame and in-frame events. Moreover, the fractions of identified alternative exon-exon junctions and constitutive junctions were similar. Together, our results suggest that both in-frame and out-of-frame translation may be actively used to regulate protein activity or localization.
08/07/2018 2:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2015.2480068
A Framework for Identifying Genotypic Information from Clinical Records: Exploiting Integrated Ontology Structures to Transfer Annotations between ICD Codes and Gene Ontologies
https://www.computer.org/csdl/trans/tb/2018/04/07272066-abs.html
Although some methods have been proposed for automatic ontology generation, none of them addresses the issue of integrating large-scale heterogeneous biomedical ontologies. We propose a novel approach for efficiently integrating various types of ontologies and apply it to integrate the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD9CM) and Gene Ontologies. This approach is one of the early attempts to quantify the associations among clinical terms (e.g., ICD9 codes) based on their corresponding genomic relationships. We reconstructed a merged tree for a partial set of GO and ICD9 codes and measured the performance of this tree in terms of the relevance of its associations by comparing them with two well-known disease-gene datasets (i.e., MalaCards and Disease Ontology). Furthermore, we compared the genomic-based ICD9 associations to temporal relationships between the codes derived from electronic health records. Our analysis shows promising associations supported by both comparisons, suggesting high reliability. We also manually analyzed several significant associations and found promising support in the literature.
08/07/2018 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2015.2480056
IAS: Interaction Specific GO Term Associations for Predicting Protein-Protein Interaction Networks
https://www.computer.org/csdl/trans/tb/2018/04/07276997-abs.html
Proteins carry out their function in a cell through interactions with other proteins. A large-scale protein-protein interaction (PPI) network of an organism provides a static yet essential structure of interactions, which is a valuable clue for understanding the functions of proteins and pathways. PPIs are determined primarily by experimental methods; however, computational PPI prediction methods can supplement or verify PPIs identified by experiment. Here, we developed a novel scoring method for predicting PPIs from Gene Ontology (GO) annotations of proteins. Unlike existing methods that consider functional similarity as an indication of interaction between proteins, the new score, named the protein-protein Interaction Association Score (IAS), was computed from GO term associations of known interacting protein pairs in 49 organisms. IAS was evaluated on PPI data of six organisms and found to outperform existing GO term-based scoring methods. Moreover, consensus scoring methods that combine different scores further improved the performance of PPI prediction.
08/07/2018 2:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2015.2476809
Nucleosome Positioning of Intronless Genes in the Human Genome
https://www.computer.org/csdl/trans/tb/2018/04/07277021-abs.html
Nucleosomes, the basic units of chromatin, are involved in transcription regulation and DNA replication. Intronless genes, which constitute 3 percent of the human genome, differ from intron-containing genes in evolution and function. Our analysis reveals that nucleosome positioning shows distinct patterns in intronless and intron-containing genes. The nucleosome occupancy upstream of transcription start sites of intronless genes is lower than that of intron-containing genes. In contrast, high occupancy and well-positioned nucleosomes are observed along the gene body of intronless genes, which is consistent with the barrier nucleosome model. Intronless genes have a significantly lower expression level than intron-containing genes, and most of them are not expressed in CD4+ T cell lines and GM12878 cell lines, which results from their tissue specificity. However, highly expressed genes reach similar expression levels in both types of genes. The highly expressed intronless genes require a higher density of RNA Pol II in an elongating state to compensate for the lack of introns. Additionally, the 5’ and 3’ nucleosome-depleted regions of highly expressed intronless genes are deeper than those of highly expressed intron-containing genes.
08/07/2018 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2015.2476811
Discovering Gene Regulatory Elements Using Coverage-Based Heuristics
https://www.computer.org/csdl/trans/tb/2018/04/07312946-abs.html
Data mining algorithms and sequencing methods (such as RNA-seq and ChIP-seq) are being combined to discover genomic regulatory motifs that relate to a variety of phenotypes. However, motif discovery algorithms often produce very long lists of putative transcription factor binding sites, hindering the discovery of phenotype-related regulatory elements by making it difficult to select a manageable set of candidate motifs for experimental validation. To address this issue, the authors introduce the motif selection problem and provide coverage-based search heuristics for its solution. Analysis of 203 ChIP-seq experiments from the ENCyclopedia of DNA Elements project shows that our algorithms produce motifs that have high sensitivity and specificity and reveals new insights about the regulatory code of the human genome. The greedy algorithm performs the best, selecting a median of two motifs per ChIP-seq transcription factor group while achieving a median sensitivity of 77 percent.
08/07/2018 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2015.2496261
PyMut: A Web Tool for Overlapping Gene Loss-of-Function Mutation Design
https://www.computer.org/csdl/trans/tb/2018/04/07346423-abs.html
Loss-of-function studies are an effective approach to researching gene function. However, most such studies currently ignore an important problem (in this paper, we call it the “off-target” problem): if the target gene is an overlapping gene (a gene whose expressible nucleotides overlap those of another gene), a loss-of-function mutation that deletes the complete open reading frame (ORF) may also cause the overlapped gene to lose function, resulting in a phenotype that may be rather different from that of a single-gene deletion. Therefore, such loss-of-function mutations should be carefully designed to guarantee that only the function of the target gene is abolished. In this paper, we present PyMut, an easy-to-use web tool for biologists to design such mutations. To the best of our knowledge, PyMut is the first tool that aims to solve the “off-target” problem for overlapping genes. Our web server is freely available at <uri>http://www.bioinfo.tsinghua.edu.cn/∼liuke/PyMut/index.html</uri>.
08/07/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2015.2505290
Towards Extracting Supporting Information About Predicted Protein-Protein Interactions
https://www.computer.org/csdl/trans/tb/2018/04/07348677-abs.html
One of the goals of relation extraction is to identify protein-protein interactions (PPIs) in biomedical literature. Current systems capture binary relations, as well as the direction and type of an interaction. However, beyond assisting in the curation of PPIs into databases, there has been little real-world application of these algorithms. We describe UPSITE, a text mining tool for extracting evidence in support of a hypothesized interaction. Given a predicted PPI, UPSITE uses a binary relation detector to check whether the PPI is found in abstracts in PubMed. If it is not found, UPSITE retrieves documents relevant to each of the two proteins separately, extracts contextual information about biological events surrounding each protein, and calculates the semantic similarity of the two proteins to provide evidential support for the predicted PPI. In evaluations, relation extraction achieved an F-score of 0.88 on the HPRD50 corpus, and semantic similarity measured with angular distance was found to be statistically significant. With the development of PPI prediction algorithms, the burden of interpreting the validity and relevance of novel PPIs falls on biologists. We suggest that presenting annotations of the two proteins in a PPI side-by-side, together with a score that quantifies their similarity, lessens this burden to some extent.
08/07/2018 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2015.2505278
Gene Regulatory Network Inference from Perturbed Time-Series Expression Data via Ordered Dynamical Expansion of Non-Steady State Actors
https://www.computer.org/csdl/trans/tb/2018/04/07360164-abs.html
The reconstruction of gene regulatory networks from gene expression data has been the subject of intense research activity. A variety of models and methods have been developed to address different aspects of this important problem. However, these techniques are narrowly focused on particular biological and experimental platforms, and require experimental data that are typically unavailable and difficult to ascertain. The more recent availability of higher-throughput sequencing platforms, combined with more precise modes of genetic perturbation, presents an opportunity to formulate more robust and comprehensive approaches to gene network inference. Here, we propose a step-wise framework for identifying gene-gene regulatory interactions that expands from a known point of genetic or chemical perturbation, using time-series gene expression data. This novel approach sequentially identifies non-steady state genes post-perturbation and incorporates them into a growing series of low-complexity optimization problems. The governing ordinary differential equations of this model are rooted in the biophysics of the stochastic molecular events that underlie gene regulation, delineating roles for both protein- and RNA-mediated gene regulation. We show the successful application of our core algorithms for network inference using simulated and real datasets.
08/07/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2015.2509992
Encoding Data Using Biological Principles: The Multisample Variant Format for Phylogenomics and Population Genomics
https://www.computer.org/csdl/trans/tb/2018/04/07360178-abs.html
Rapid progress in the fields of phylogenomics and population genomics has driven increases in both the size of multi-genomic datasets and the number and complexity of genome-wide analyses. We present the Multisample Variant Format (MVF), specifically designed to store multiple sequence alignments for phylogenomic and population genomic analysis. The signature feature of MVF is a distinctive encoding of aligned sites with specific biological information content (e.g., invariant, low-coverage). This biological pattern-based encoding of sequence data allows for rapid filtering and quality control of data and speeds up computation for many analyses. Similar to other modern formats, MVF has a simple data structure and a flexible header structure to accommodate project metadata, allowing it to also serve as an effective data publication and sharing format. We also propose several variants of the MVF format to accommodate protein and codon alignments, quality scores, and a mix of <italic>de novo</italic> and reference-aligned data. Using the MVFtools package, MVF files can be converted from other common sequence formats. MVFtools completes tasks ranging from simple transformation and filtering operations to complex genome-wide visualizations in only a few minutes, even on large datasets. In addition to presenting MVF and MVFtools, we also discuss the broader concept of using biological principles and patterns to inform sequence data encoding, both in MVF and in other existing data formats.
08/07/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2015.2509997
Querying of Disparate Association and Interaction Data in Biomedical Applications
https://www.computer.org/csdl/trans/tb/2018/04/07778238-abs.html
In biomedical applications, network models are commonly used to represent interactions and higher-level associations among biological entities. Integrated analyses of these interaction and association data have proven useful in extracting knowledge and generating novel hypotheses for biomedical research. However, since most datasets provide their own schema and query interface, opportunities for exploratory and integrative querying of disparate data are currently limited. In this study, we utilize RDF-based representations of biomedical interaction and association data to develop a querying framework that enables flexible specification and efficient processing of graph template matching queries. The proposed framework enables integrative querying of biomedical databases to discover complex patterns of associations among a diverse range of biological entities, including biomolecules, biological processes, organisms, and phenotypes. Our experimental results on the UniProt dataset show that the proposed framework can be used to efficiently process complex queries and identify biologically relevant patterns of associations that cannot be readily obtained by querying each dataset independently.
08/07/2018 2:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2637344
Incorporation of Solvent Effect into Multi-Objective Evolutionary Algorithm for Improved Protein Structure Prediction
https://www.computer.org/csdl/trans/tb/2018/04/07930531-abs.html
The problem of predicting the three-dimensional (3-D) structure of a protein from its one-dimensional sequence has been called the “holy grail of molecular biology”, and it has become an important part of structural genomics projects. Despite the rapid developments in computer technology and computational intelligence, it remains challenging and fascinating. In this paper, we propose a multi-objective evolutionary algorithm to solve it. We decompose the Chemistry at HARvard Macromolecular Mechanics (CHARMM) force-field energy function into bond and non-bond energies as the first and second objectives. Considering the effect of solvent, we innovatively adopt the solvent-accessible surface area as the third objective. We use 66 benchmark proteins to verify the proposed method and obtain better or competitive results in comparison with existing methods. The results suggest the necessity of incorporating the effect of solvent into a multi-objective evolutionary algorithm to improve protein structure prediction in terms of accuracy and efficiency.
08/07/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2705094
Controllability Analysis and Control Synthesis for the Ribosome Flow Model
https://www.computer.org/csdl/trans/tb/2018/04/07932937-abs.html
The ribosomal density along different parts of the coding regions of the mRNA molecule affects various fundamental intracellular phenomena including: protein production rates, global ribosome allocation and organismal fitness, ribosomal drop off, co-translational protein folding, mRNA degradation, and more. Thus, regulating translation in order to obtain a <italic>desired</italic> ribosomal profile along the mRNA molecule is an important biological problem. We study this problem by using a dynamical model for mRNA translation, called the ribosome flow model (RFM). In the RFM, the mRNA molecule is modeled as an ordered chain of $n$ sites. The RFM includes $n$ state-variables describing the ribosomal density profile along the mRNA molecule, and the transition rates from each site to the next are controlled by $n+1$ positive constants. To study the problem of controlling the density profile, we consider some or all of the transition rates as time-varying controls. We consider the following problem: given an initial and a desired ribosomal density profile in the RFM, determine the time-varying values of the transition rates that steer the system to the desired density profile, if they exist. More specifically, we consider two control problems. In the first, all transition rates can be regulated separately, and the goal is to steer the ribosomal density profile and the protein production rate from a given initial value to a desired value. 
In the second problem, one or more transition rates are jointly regulated by a single scalar control, and the goal is to steer the production rate to a desired value within a certain set of feasible values. In the first case, we show that the system is controllable, i.e., the control is powerful enough to steer the system to any desired value in finite time, and provide simple closed-form expressions for <italic>constant</italic> positive control functions (or transition rates) that asymptotically steer the system to the desired value. In the second case, we show that the system is controllable, and provide a simple algorithm for determining the <italic>constant</italic> positive control value that asymptotically steers the system to the desired value. We discuss some of the biological implications of these results.
08/07/2018 2:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2707420
Analysis of the Genome Sequence and Prediction of B-Cell Epitopes of the Envelope Protein of Middle East Respiratory Syndrome-Coronavirus
https://www.computer.org/csdl/trans/tb/2018/04/07935343-abs.html
The outbreak of Middle East respiratory syndrome-coronavirus (MERS-CoV) in South Korea in April 2015 led to 186 infections and 37 deaths by the end of October 2015. MERS-CoV was isolated from the imported patient in China. The envelope (E) protein, a small structural protein of MERS-CoV, plays an important role in host recognition and infection. To identify conserved epitopes of the E protein, sequence analysis was performed by comparing the E proteins from 42 MERS-CoV strains that triggered severe epidemics and infected humans in the past. To predict potential B-cell epitopes of the E protein, three of the most effective online epitope prediction programs were used: ABCpred, Bepipred, and Protean from the LaserGene software. All nucleotide and amino acid sequences were obtained from the NCBI Database. One potential epitope of suitable length (amino acids 58-82) was confirmed and predicted to be highly antigenic. This epitope had scores of >0.80 in the ABCpred program and 0.35 in the Bepipred program. Due to the lack of an X-ray crystal structure of the E protein in the PDB database, a simulated 3D structure of the E protein was also predicted using the PHYRE 2 and Pymol programs. In conclusion, using bioinformatics methods, we analyzed the genome sequence of MERS-CoV and identified a potential B-cell epitope of the E protein, which might significantly improve current MERS vaccine development strategies.
08/07/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2702588
An Inference Attack on Genomic Data Using Kinship, Complex Correlations, and Phenotype Information
https://www.computer.org/csdl/trans/tb/2018/04/07935519-abs.html
Individuals (and their family members) share (partial) genomic data on public platforms. However, using special characteristics of genomic data, background knowledge that can be obtained from the Web, and family relationships between individuals, it is possible to infer the hidden parts of shared (and unshared) genomes. Existing work in this field considers simple correlations in the genome (as well as Mendel’s law and partial genomes of a victim and his family members). In this paper, we improve the existing work on inference attacks on genomic privacy. We mainly consider complex correlations in the genome by using an observable Markov model and a recombination model between the haplotypes. We also utilize phenotype information about the victims. We propose an efficient message passing algorithm that considers all of the aforementioned background information for the inference. We show that the proposed framework improves inference with significantly less information compared to existing work.
08/07/2018 2:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2709740
Construction of Signaling Pathways with RNAi Data and Multiple Reference Networks
https://www.computer.org/csdl/trans/tb/2018/04/07936619-abs.html
Signaling networks are involved in almost all major diseases, such as cancer. As a result, understanding how signaling networks function is vital for finding new treatments for many diseases. Using gene knockdown assays such as RNA interference (RNAi) technology, many genes involved in these networks can be identified. However, determining the interactions between these genes in signaling networks using only experimental techniques is very challenging, as performing extensive experiments is very expensive and sometimes even impractical. Construction of signaling networks from RNAi data using computational techniques has been proposed as an alternative way to solve this challenging problem. However, earlier approaches are either not scalable to large-scale networks, or their accuracy levels are not satisfactory. In this study, we integrate RNAi data given on a target network with multiple reference signaling networks and phylogenetic trees to construct the topology of the target signaling network. In our work, network construction is formulated as finding the minimum number of edit operations on the given reference networks, whose contributions are weighted by their phylogenetic distances to the target network. The edit operations on the reference networks lead to a target network that satisfies the RNAi knockdown observations. Here, we propose two new reference-based signaling network construction methods that provide optimal results and scale well to large signaling networks of hundreds of components. We compare the performance of these approaches to the state-of-the-art reference-based network construction method SiNeC on synthetic, semi-synthetic, and real datasets. Our analyses show that the proposed methods outperform the SiNeC method in terms of accuracy. Furthermore, we show that our methods function well even if evolutionarily distant reference networks are used. 
Application of our methods to the Apoptosis and Wnt signaling pathways recovers the known protein-protein interactions and suggests additional relevant interactions that can be tested experimentally.
08/07/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2710129
A Self-Training Subspace Clustering Algorithm under Low-Rank Representation for Cancer Classification on Gene Expression Data
https://www.computer.org/csdl/trans/tb/2018/04/07940088-abs.html
Accurate identification of cancer types is essential to cancer diagnosis and treatment. Since cancer tissue and normal tissue have different gene expression, gene expression data can be used as an efficient feature source for cancer classification. However, accurate cancer classification directly from original gene expression profiles remains challenging due to the intrinsically high-dimensional features and the small size of the data samples. We propose a new self-training subspace clustering algorithm under low-rank representation, called SSC-LRR, for cancer classification on gene expression data. Low-rank representation (LRR) is first applied to extract discriminative features from the high-dimensional gene expression data; the self-training subspace clustering (SSC) method is then used to generate the cancer classification predictions. SSC-LRR was tested on two separate benchmark datasets in comparison with four state-of-the-art classification methods. It generated cancer classification predictions with an overall accuracy of 89.7 percent and a general correlation of 0.920, which are 18.9 and 24.4 percent higher, respectively, than those of the best control method. In addition, several genes (RNF114, HLA-DRB5, USP9Y, and PTPN20) were identified by SSC-LRR as new cancer identifiers that deserve further clinical investigation. Overall, the study demonstrates a sensitive new avenue for recognizing cancer classes from large-scale gene expression data.
08/07/2018 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2712607
A Two-Stage Biomedical Event Trigger Detection Method Integrating Feature Selection and Word Embeddings
https://www.computer.org/csdl/trans/tb/2018/04/07947109-abs.html
Extracting biomedical events from biomedical literature plays an important role in the field of biomedical text mining, and trigger detection is a key step in biomedical event extraction. We propose a two-stage method for trigger detection, which divides trigger detection into a recognition stage and a classification stage, with different features selected in each stage. In the first stage, we select the features that are more suitable for recognition, and in the second stage, the features that are more helpful for classification are adopted. Furthermore, we integrate word embeddings to represent words semantically and syntactically. On the multi-level event extraction (MLEE) corpus test dataset, our method achieves an F-score of 79.75 percent, which outperforms the state-of-the-art systems.
08/07/2018 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2715016
NewGOA: Predicting New GO Annotations of Proteins by Bi-Random Walks on a Hybrid Graph
https://www.computer.org/csdl/trans/tb/2018/04/07949034-abs.html
A remaining key challenge of modern biology is annotating the functional roles of proteins. Various computational models have been proposed for this challenge. Most of them assume that the annotations of annotated proteins are complete, but in fact many of these annotations are incomplete. We propose a method called NewGOA to predict new Gene Ontology (GO) annotations for incompletely annotated proteins and for completely un-annotated ones. NewGOA employs a hybrid graph, composed of two types of nodes (proteins and GO terms), to encode interactions between proteins, hierarchical relationships between terms, and available annotations of proteins. To account for the structural difference between the GO terms subgraph and the proteins subgraph, NewGOA applies a bi-random walks algorithm, which executes asynchronous random walks on the hybrid graph, to predict new GO annotations of proteins. An experimental study on archived GO annotations of two model species (H. sapiens and S. cerevisiae) shows that NewGOA can more accurately and efficiently predict new annotations of proteins than other related methods. Experimental results also indicate that the bi-random walks algorithm can explore and further exploit the structural difference between the GO terms subgraph and the proteins subgraph. The supplementary files and codes of NewGOA are available at: <uri>http://mlda.swu.edu.cn/codes.php?name=NewGOA</uri>.
08/07/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2715842
A Model-Based Tool for the Analysis and Design of Gene Regulatory Networks
https://www.computer.org/csdl/trans/tb/2018/04/07953529-abs.html
Computational and mathematical models have significantly contributed to the rapid progress in the study of gene regulatory networks (GRN), but researchers still lack a reliable model-based framework for computer-aided analysis and design. Such a tool should both reveal the relation between network structure and dynamics and find parameter values and/or constraints that enable the simulated dynamics to reproduce specific behaviors. This paper addresses these issues and proposes a computational framework that facilitates network analysis and design. It follows a modeling cycle that alternates phases of hypothesis testing with parameter space refinement to ultimately propose a network that exhibits specified behaviors with the highest probability. Hypothesis testing is performed via qualitative simulation of GRNs modeled by a class of nonlinear and temporally multiscale ODEs, where regulation functions are expressed by steep sigmoid functions and incompletely known parameter values by order relations only. Parameter space refinement, grounded in a method that captures the intrinsic stochasticity of regulation by expressing network uncertainty as fluctuations in parameter values only, optimizes stochastic parameter values initialized by probability distributions with large variances. The power and ease of use of our framework are demonstrated by working out a benchmark synthetic network to obtain a synthetic oscillator.
08/07/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2716942
Modeling Methylation Patterns with Long Read Sequencing Data
https://www.computer.org/csdl/trans/tb/2018/04/07964739-abs.html
Variation in cytosine methylation at CpG dinucleotides is often observed in genomic regions, and analysis typically focuses on estimating the proportion of methylated sites observed in a given region and comparing these levels across samples to determine association with conditions of interest. While sites are tacitly treated as independent, when observed at the level of individual molecules methylation patterns exhibit strong evidence of local spatial dependence. We previously developed a neighboring sites model to account for correlation and clustering behavior observed in two tandem repeat regions in a collection of ovarian carcinomas. We now introduce extensions of the model that account for the effect of distance between sites as well as asymmetric correlation in <italic>de novo</italic> methylation and demethylation rates. We apply our models to published data from a whole genome bisulfite sequencing experiment using long reads, estimating model parameters for a selection of CpG-dense regions spanning between 21 and 67 sites. Our methods detect evidence of local spatial correlation as a function of site-to-site distance and demonstrate the added value of employing long read sequencing data in epigenetic research.08/07/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2721943Reduction of Qualitative Models of Biological Networks for Transient Dynamics Analysis
https://www.computer.org/csdl/trans/tb/2018/04/08026154-abs.html
Qualitative models of the dynamics of signalling pathways and gene regulatory networks capture temporal properties of biological networks while requiring few parameters. However, these discrete models typically suffer from the so-called state space explosion problem, which makes the formal assessment of their potential behaviors very challenging. In this paper, we describe a method to reduce a qualitative model to enhance the tractability of analyzing transient reachability properties. The reduction does not change the dimension of the model, but instead limits its degree of freedom, thereby reducing the set of states and transitions to consider. We rely on a transition-centered specification of qualitative models by means of automata networks. Our framework encompasses the usual asynchronous Boolean and multi-valued networks, as well as 1-bounded Petri nets. Applied to different large-scale biological networks from the literature, we show that the reduction can lead to a drastic improvement in the scalability of verification methods.08/07/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2749225PREMER: A Tool to Infer Biological Networks
https://www.computer.org/csdl/trans/tb/2018/04/08057785-abs.html
Inferring the structure of unknown cellular networks is a main challenge in computational biology. Data-driven approaches based on information theory can determine the existence of interactions among network nodes automatically. However, the elucidation of certain features—such as distinguishing between direct and indirect interactions or determining the direction of a causal link—requires estimating information-theoretic quantities in a multidimensional space. This can be a computationally demanding task, which acts as a bottleneck for the application of elaborate algorithms to large-scale network inference problems. The computational cost of such calculations can be alleviated by the use of compiled programs and parallelization. To this end, we have developed PREMER (Parallel Reverse Engineering with Mutual information & Entropy Reduction), a software toolbox that can run in parallel and sequential environments. It uses information theoretic criteria to recover network topology and determine the strength and causality of interactions, and allows incorporating prior knowledge, imputing missing data, and correcting outliers. PREMER is a free, open source software tool that does not require any commercial software. Its core algorithms are programmed in FORTRAN 90 and implement OpenMP directives. It has user interfaces in Python and MATLAB/Octave, and runs on Windows, Linux, and OSX (<uri>https://sites.google.com/site/premertoolbox/</uri>).08/07/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2758786ASSA-PBN: A Toolbox for Probabilistic Boolean Networks
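The information-theoretic network inference that PREMER performs can be illustrated with a minimal sketch: estimate pairwise mutual information, keep strong edges, and prune likely indirect links. The histogram MI estimator, threshold, and triangle-pruning heuristic (in the spirit of data-processing-inequality filters such as ARACNE) are illustrative choices, not PREMER's actual entropy-reduction algorithm:

```python
import numpy as np
from itertools import combinations

def mutual_info(x, y, bins=4):
    # Histogram-based mutual information estimate (in nats).
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])).sum())

def infer_network(data, threshold=0.1):
    """data: samples x variables. Keep undirected edges (i, j) whose
    pairwise MI exceeds the threshold, then drop the weakest edge of
    every fully connected triple -- a data-processing-inequality
    heuristic for removing indirect interactions."""
    n = data.shape[1]
    mi = np.zeros((n, n))
    for i, j in combinations(range(n), 2):
        mi[i, j] = mi[j, i] = mutual_info(data[:, i], data[:, j])
    edges = {(i, j) for i, j in combinations(range(n), 2) if mi[i, j] > threshold}
    for i, j, k in combinations(range(n), 3):
        tri = [(i, j), (i, k), (j, k)]
        if all(e in edges for e in tri):
            edges.discard(min(tri, key=lambda e: mi[e]))
    return edges
```

Multidimensional versions of these estimates (for direction and strength of causal links) are what make the full problem computationally demanding and motivate PREMER's compiled, parallel core.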
https://www.computer.org/csdl/trans/tb/2018/04/08107541-abs.html
As a well-established computational framework, probabilistic Boolean networks (PBNs) are widely used for modelling, simulation, and analysis of biological systems. Analyzing the steady-state dynamics of PBNs is of crucial importance for exploring the characteristics of biological systems. However, the analysis of large PBNs, which often arise in systems biology, is prone to the infamous state-space explosion problem. Therefore, the employment of statistical methods often remains the only feasible solution. We present <inline-formula><tex-math notation="LaTeX"> ${\mathsf{ASSA-PBN}}$</tex-math><alternatives><inline-graphic xlink:href="mizera-ieq1-2773477.gif"/></alternatives> </inline-formula>, a software toolbox for modelling, simulation, and analysis of PBNs. <inline-formula> <tex-math notation="LaTeX">${\mathsf{ASSA-PBN}}$</tex-math><alternatives> <inline-graphic xlink:href="mizera-ieq2-2773477.gif"/></alternatives></inline-formula> provides efficient statistical methods with three parallel techniques to speed up the computation of steady-state probabilities. Moreover, particle swarm optimisation (PSO) and differential evolution (DE) are implemented for the estimation of PBN parameters. Additionally, we implement in-depth analyses of PBNs, including long-run influence analysis, long-run sensitivity analysis, and the computation and visualization of one-parameter profile likelihoods. 
A PBN model of apoptosis is used as a case study to illustrate the main functionalities of <inline-formula><tex-math notation="LaTeX">${\mathsf{ASSA-PBN}}$</tex-math><alternatives> <inline-graphic xlink:href="mizera-ieq3-2773477.gif"/></alternatives></inline-formula> and to demonstrate the capabilities of <inline-formula><tex-math notation="LaTeX">${\mathsf{ASSA-PBN}}$</tex-math><alternatives> <inline-graphic xlink:href="mizera-ieq4-2773477.gif"/></alternatives></inline-formula> to effectively analyse biological systems modelled as PBNs.08/07/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2773477Moment-Based Parameter Estimation for Stochastic Reaction Networks in Equilibrium
https://www.computer.org/csdl/trans/tb/2018/04/08115212-abs.html
Calibrating parameters is a crucial problem within quantitative modeling approaches to reaction networks. Existing methods for stochastic models either rely on statistical sampling or can only be applied to small systems. Here, we present an inference procedure for stochastic models in equilibrium that is based on a moment matching scheme with optimal weighting and that can be used with high-throughput data such as that collected by flow cytometry. Our method does not require an approximation of the underlying equilibrium probability distribution and, if reaction rate constants have to be learned, the optimal values can be computed by solving a linear system of equations. We discuss important practical issues such as the selection of the moments and evaluate the effectiveness of the proposed approach on three case studies.08/07/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2775219A Sparse Regression Method for Group-Wise Feature Selection with False Discovery Rate Control
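For intuition, here is a minimal instance of equilibrium moment matching on the birth-death network 0 -> X (rate k), X -> 0 (rate gamma*x): the first stationary-moment equation k - gamma*E[X] = 0 is linear in k, so matching the sample mean recovers the rate constant without approximating the equilibrium distribution. The paper's method matches several moments with optimal weighting; this one-moment sketch only conveys the idea:

```python
import numpy as np

def estimate_birth_rate(samples, gamma):
    """Moment-based estimate for the birth-death network
    0 -> X (rate k), X -> 0 (rate gamma * x). At equilibrium the
    first-moment equation k - gamma * E[X] = 0 is linear in k, so
    matching the sample mean of equilibrium snapshots gives k."""
    return gamma * np.mean(samples)
```

The equilibrium distribution of this network is Poisson with mean k/gamma, which gives a convenient way to sanity-check the estimator on synthetic snapshot data.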
https://www.computer.org/csdl/trans/tb/2018/04/08166765-abs.html
The method of <italic>Sorted L-One Penalized Estimation</italic>, or <italic>SLOPE</italic>, is a sparse regression method recently introduced by Bogdan et al. <xref ref-type="bibr" rid="ref1">[1]</xref>. It can be used to identify significant predictor variables in a linear model that may have more unknown parameters than observations. When the correlations between predictor variables are small, the SLOPE method has been shown to successfully control the false discovery rate (the expected proportion of irrelevant predictors among those selected) at a user-specified level. However, the requirement for nearly uncorrelated predictors is too restrictive for genomic data, as demonstrated in our recent study <xref ref-type="bibr" rid="ref2">[2]</xref> by an application of SLOPE to realistic simulated DNA sequence data. A possible solution is to divide the predictor variables into nearly uncorrelated groups, and to modify the procedure to select entire groups with an overall significant group effect, rather than individual predictors. Following this motivation, we extend SLOPE in the spirit of Group LASSO to <italic>Group SLOPE</italic>, a method that can handle group structures between the predictor variables, which are ubiquitous in real genomic data. Our theoretical results show that Group SLOPE controls the group-wise false discovery rate (gFDR) when groups are orthogonal to each other. For use in non-orthogonal settings, we propose two types of Monte Carlo based heuristics, which lead to gFDR control with Group SLOPE in simulations based on real SNP data. As an illustration of the merits of this method, an application of Group SLOPE to a dataset from the Framingham Heart Study results in the identification of some known DNA sequence regions associated with bone health, as well as some new candidate regions. 
The novel methods are implemented in the R package <monospace>grpSLOPEMC </monospace>, which is publicly available at <uri>https://github.com/agisga/grpSLOPEMC</uri>.08/07/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2780106Structural Target Controllability of Linear Networks
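The sorted-L1 penalty underlying SLOPE, and its group-wise extension, can be written in a few lines. This sketch only evaluates the penalties (the solvers in grpSLOPEMC are far more involved), and the lambda sequences below are arbitrary examples:

```python
import numpy as np

def slope_penalty(beta, lam):
    """Sorted-L1 (SLOPE) penalty: pair the largest |beta| with the
    largest lambda. lam is assumed non-increasing."""
    beta = np.asarray(beta, dtype=float)
    return float(np.sort(np.abs(beta))[::-1] @ np.asarray(lam, dtype=float))

def group_slope_penalty(beta, groups, lam):
    """Group SLOPE idea: apply the sorted-L1 penalty to the Euclidean
    norms of coefficient groups rather than to individual coefficients.
    `groups` maps each coefficient index to a group id."""
    beta = np.asarray(beta, dtype=float)
    gids = sorted(set(groups))
    norms = np.array([np.linalg.norm(beta[[i for i, g in enumerate(groups) if g == gid]])
                      for gid in gids])
    return float(np.sort(norms)[::-1] @ np.asarray(lam[:len(gids)], dtype=float))
```

Because whole groups enter through their norms, a group is selected or rejected as a unit, which is what makes group-wise FDR control meaningful for correlated SNP blocks.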
https://www.computer.org/csdl/trans/tb/2018/04/08267489-abs.html
Computational analysis of the structure of intra-cellular molecular interaction networks can suggest novel therapeutic approaches for systemic diseases like cancer. Recent research in the area of network science has shown that network control theory can be a powerful tool in the understanding and manipulation of such bio-medical networks. In 2011, Liu et al. developed a polynomial-time algorithm for computing the size of the minimal set of nodes controlling a linear network. In 2014, Gao et al. generalized the problem for target control, minimizing the set of nodes controlling a target within a linear network. The authors developed a greedy approximation algorithm while leaving open the complexity of the optimization problem. We prove here that the target controllability problem is NP-hard in all practical setups, i.e., when the control power of any individual input is bounded by some constant. We also show that the algorithm provided by Gao et al. fails to provide a valid solution in some special cases, and an additional validation step is required. We fix and improve their algorithm using several heuristics, obtaining in the end an up to 10-fold decrease in running time and also a decrease in the size of solutions.08/07/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2797271Organisation-Oriented Coarse Graining and Refinement of Stochastic Reaction Networks
https://www.computer.org/csdl/trans/tb/2018/04/08288662-abs.html
Chemical organisation theory is a framework developed to simplify the analysis of long-term behaviour of chemical systems. In this work, we build on these ideas to develop novel techniques for formal quantitative analysis of chemical reaction networks, using discrete stochastic models represented as continuous-time Markov chains. We propose methods to identify organisations, and to study quantitative properties regarding movements between these organisations. We then construct and formalise a coarse-grained Markov chain model of hierarchic organisations for a given reaction network, which can be used to approximate the behaviour of the original reaction network. As an application of the coarse-grained model, we predict the behaviour of the reaction network systems over time via the master equation. Experiments show that our predictions can mimic the main pattern of the concrete behaviour in the long run, but the precision varies for different models and reaction rule rates. Finally, we propose an algorithm to selectively refine the coarse-grained models and show experiments demonstrating that the precision of the prediction has been improved.08/07/2018 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2804395Influence Networks Compared with Reaction Networks: Semantics, Expressivity and Attractors
https://www.computer.org/csdl/trans/tb/2018/04/08290943-abs.html
Biochemical reaction networks are one of the most widely used formalisms in systems biology to describe the molecular mechanisms of high-level cell processes. However, modellers also reason with influence diagrams to represent the positive and negative influences between molecular species and may find an influence network useful in the process of building a reaction network. In this paper, we introduce a formalism of influence networks with forces, and equip it with a hierarchy of Boolean, Petri net, stochastic and differential semantics, similarly to reaction networks with rates. We show that the expressive power of influence networks is the same as that of reaction networks under the differential semantics, but weaker under the discrete semantics. Furthermore, the hierarchy of semantics leads us to consider a (positive) Boolean semantics that cannot test the absence of a species, that we compare with the (negative) Boolean semantics with test for absence of a species in gene regulatory networks <italic>à la</italic> Thomas. We study the monotonicity properties of the positive semantics and derive from them an algorithm to compute attractors in both the positive and negative Boolean semantics. We illustrate our results on models of the literature about the p53/Mdm2 DNA damage repair system, the circadian clock, and the influence of MAPK signaling on cell-fate decision in urinary bladder cancer.08/07/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2805686Local Traces: An Over-Approximation of the Behavior of the Proteins in Rule-Based Models
https://www.computer.org/csdl/trans/tb/2018/04/08306662-abs.html
Thanks to rule-based modelling languages, we can assemble large sets of mechanistic protein-protein interactions within integrated models. Our goal would be to understand how the behavior of these systems emerges from these low-level interactions. Yet this is a long-term challenge, and it is desirable to offer intermediate levels of abstraction, so as to get a better understanding of the models and to increase our confidence in our mechanistic assumptions. To this end, static analysis can be used to derive various abstractions of the semantics, each of them offering new perspectives on the models. We propose an abstract interpretation of the behavior of each protein in isolation. Given a model written in Kappa, this abstraction computes for each kind of protein a transition system that describes which conformations this protein may take and how a protein may pass from one conformation to another. Then, we use simplicial complexes to abstract away the interleaving order of the transformations between conformations that commute. As a result, we get a compact summary of the potential behavior of each protein in the model.08/07/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2812195A Distributed Classifier for MicroRNA Target Prediction with Validation Through TCGA Expression Data
https://www.computer.org/csdl/trans/tb/2018/04/08341495-abs.html
<italic>Background</italic>: MicroRNAs (miRNAs) are approximately 22-nucleotide-long regulatory RNAs that mediate RNA interference by binding to cognate mRNA target regions. Here, we present a distributed kernel SVM-based binary classification scheme to predict miRNA targets. It captures the spatial profile of miRNA-mRNA interactions via smooth B-spline curves. This is accomplished separately for various input features, such as thermodynamic and sequence-based features. Further, we use a principled approach to uniformly model both canonical and non-canonical seed matches, using a novel seed enrichment metric. Finally, we verify our miRNA-mRNA pairings using an Elastic Net-based regression model on TCGA expression data for four cancer types to estimate the miRNAs that together regulate any given mRNA. <italic>Results</italic>: We present a suite of algorithms for miRNA target prediction, under the banner <italic>Avishkar</italic>, with prediction performance superior to competing methods. Specifically, our final kernel SVM model, with an Apache Spark backend, achieves an average true positive rate (TPR) of more than 75 percent, at a false positive rate of 20 percent, for non-canonical human miRNA target sites. This is an improvement of over 150 percent in TPR for non-canonical sites over the best-in-class algorithm. We are able to achieve such superior performance by representing the thermodynamic and sequence profiles of miRNA-mRNA interactions as curves, devising a novel seed enrichment metric, and learning an ensemble of miRNA family-specific kernel SVM classifiers. We provide an easy-to-use system for large-scale interactive analysis and prediction of miRNA targets. All operations in our system, namely candidate set generation, feature generation and transformation, training, prediction, and computing performance metrics, are fully distributed and scalable. 
<italic>Conclusions</italic>: We have developed an efficient SVM-based model for miRNA target prediction using recent CLIP-seq data, demonstrating superior performance, evaluated using ROC curves for different species (human or mouse), or different target types (canonical or non-canonical). We analyzed the agreement between the target pairings using CLIP-seq data and using expression data from four cancer types. To the best of our knowledge, we provide the first distributed framework for miRNA target prediction based on Apache Hadoop and Spark. <italic>Availability</italic>: All source code and sample data are publicly available at <uri>https://bitbucket.org/cellsandmachines/avishkar</uri>. Our scalable implementation of kernel SVM using Apache Spark, which can be used to solve large-scale non-linear binary classification problems, is available at <uri>https://bitbucket.org/cellsandmachines/kernelsvmspark</uri>.08/07/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2828305Guest Editorial for Special Section on the Sixth National Conference on Bioinformatics and System Biology of China
https://www.computer.org/csdl/trans/tb/2018/04/08428577-abs.html
08/07/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2838498Great Lakes Bioinformatics Conference (GLBIO) 2015 Special Section Editorial
https://www.computer.org/csdl/trans/tb/2018/04/08428581-abs.html
08/07/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2849800Guest Editors’ Introduction to the Special Section on the 14th International Conference on Computational Methods in Systems Biology (CMSB 2016)
https://www.computer.org/csdl/trans/tb/2018/04/08428603-abs.html
08/07/2018 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2816979Guest Editorial
https://www.computer.org/csdl/trans/tb/2018/04/08428604-abs.html
08/07/2018 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2856658A Global Network Alignment Method Using Discrete Particle Swarm Optimization
https://www.computer.org/csdl/trans/tb/2018/03/07592892-abs.html
Molecular interaction data are increasing exponentially with advances in biotechnology. This makes it possible and necessary to comparatively analyze these data at the network level. Global network alignment is an important network comparison approach to identify conserved subnetworks and gain insight into evolutionary relationships across species. Network alignment, which is analogous to subgraph isomorphism, is known to be NP-hard. In this paper, we introduce a novel heuristic Particle-Swarm-Optimization based Network Aligner (PSONA), which optimizes a weighted global alignment model considering both protein sequence similarity and interaction conservation. The particle statuses and status updating rules are redefined in a discrete form by using permutations. A seed-and-extend strategy is employed to guide the search for a superior alignment. The proposed initialization method "seeds" matches with high sequence similarity into the alignment, which guarantees the functional coherence of the mapped nodes. A greedy local search method is designed as the "extension" procedure to iteratively optimize the edge conservation. PSONA is compared with several state-of-the-art methods on ten network pairs formed from five species. The experimental results demonstrate that the proposed aligner can map the proteins with high functional coherence and can be used as a booster to effectively refine the well-studied aligners.06/05/2018 4:48 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2618380GPU-Based Point Cloud Superpositioning for Structural Comparisons of Protein Binding Sites
https://www.computer.org/csdl/trans/tb/2018/03/07737072-abs.html
In this paper, we present a novel approach to solve the labeled point cloud superpositioning problem for performing structural comparisons of protein binding sites. The solution is based on a parallel evolution strategy that operates on large populations and runs on GPU hardware. The proposed evolution strategy reduces the likelihood of getting stuck in a local optimum of the multimodal real-valued optimization problem represented by labeled point cloud superpositioning. The performance of the GPU-based parallel evolution strategy is compared to a previously proposed CPU-based sequential approach for labeled point cloud superpositioning, indicating that the GPU-based parallel evolution strategy leads to qualitatively better results and significantly shorter runtimes, with speed improvements of up to a factor of 1,500 for large populations. Binary classification tests based on the ATP, NADH, and FAD protein subsets of CavBase, a database containing putative binding sites, show average classification rate improvements from about 92 percent (CPU) to 96 percent (GPU). Further experiments indicate that the proposed GPU-based labeled point cloud superpositioning approach can be superior to traditional protein comparison approaches based on sequence alignments.06/05/2018 4:48 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2625793From Optimization to Mapping: An Evolutionary Algorithm for Protein Energy Landscapes
https://www.computer.org/csdl/trans/tb/2018/03/07744601-abs.html
Stochastic search is often the only viable option to address complex optimization problems. Recently, evolutionary algorithms have been shown to handle challenging continuous optimization problems related to protein structure modeling. Building on recent work in our laboratories, we propose an evolutionary algorithm for efficiently mapping the multi-basin energy landscapes of dynamic proteins that switch between thermodynamically stable or semi-stable structural states to regulate their biological activity in the cell. The proposed algorithm balances computational resources between exploration and exploitation of the nonlinear, multimodal landscapes that characterize multi-state proteins via a novel combination of global and local search to generate a dynamically-updated, information-rich map of a protein's energy landscape. This new mapping-oriented EA is applied to several dynamic proteins and their disease-implicated variants to illustrate its ability to map complex energy landscapes in a computationally feasible manner. We further show that, given the availability of such maps, comparison between the maps of wildtype and variants of a protein allows for the formulation of a structural and thermodynamic basis for the impact of sequence mutations on dysfunction that may prove useful in guiding further wet-laboratory investigations of dysfunction and molecular interventions.06/05/2018 4:47 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2628745Conformational Sampling of a Biomolecular Rugged Energy Landscape
https://www.computer.org/csdl/trans/tb/2018/03/07762847-abs.html
Protein structure refinement using conformational sampling is an important problem in protein studies. In this paper, we examined protein structure refinement by means of potential energy minimization using immune computing as a method of sampling conformations. The method was tested on the x-ray structure and 30 decoys of the mutant of [Leu]Enkephalin, a paradigmatic example of the biomolecular multiple-minima problem. In order to score the refined conformations, we used a standard potential energy function with the OPLSAA force field. The effectiveness of the search was assessed using a variety of methods. The robustness of sampling was checked by the energy yield function, which quantitatively measures the number of peptide decoys residing in an energetic funnel. Furthermore, potential energy-dependent Pareto fronts were calculated to elucidate dissimilarities between peptide conformations and the native state as observed by x-ray crystallography. Our results showed that the probed potential energy landscape of [Leu]Enkephalin is self-similar on different metric scales and that the local potential energy minima of the peptide decoys are metastable, so they can be refined to conformations whose potential energy is lowered by approximately 250 kJ/mol.06/05/2018 4:47 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2634008A Memetic Algorithm for 3D Protein Structure Prediction Problem
https://www.computer.org/csdl/trans/tb/2018/03/07765097-abs.html
Memetic Algorithms are population-based metaheuristics intrinsically concerned with exploiting all available knowledge about the problem under study. The incorporation of problem domain knowledge is not an optional mechanism, but a fundamental feature of Memetic Algorithms. In this paper, we present a Memetic Algorithm to tackle the three-dimensional protein structure prediction problem. The method uses a structured population and incorporates a Simulated Annealing algorithm as a local search strategy, as well as ad hoc crossover and mutation operators to deal with the problem. It takes advantage of structural knowledge stored in the <italic>Protein Data Bank</italic> by using an Angle Probability List that helps to reduce the search space and to guide the search strategy. The proposed algorithm was tested on 19 protein sequences, and the results show the ability of the algorithm to find native-like protein structures. Experimental results have revealed that the proposed algorithm can find good solutions regarding <italic>root-mean-square deviation</italic> and <italic>global distance test total score</italic> in comparison with the experimental protein structures. We also show that our results are comparable in terms of folding organization with state-of-the-art prediction methods, corroborating the effectiveness of our proposal.06/05/2018 4:47 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2635143Prediction of HIV Drug Resistance by Combining Sequence and Structural Properties
https://www.computer.org/csdl/trans/tb/2018/03/07782418-abs.html
Drug resistance is a major obstacle faced by therapists in treating HIV-infected patients. The reason behind this phenomenon is either protein mutation or changes in gene expression levels that induce resistance to drug treatments. These mutations affect drug binding activity, resulting in treatment failure. Therefore, it is necessary to conduct resistance testing in order to carry out effective HIV therapy. This study combines both sequence and structural features for predicting HIV resistance by applying SVM and Random Forest classifiers. The model was tested on mutants of HIV-1 protease and reverse transcriptase. Among the features used in our method, total contact energies among multiple mutations have a strong impact on predicting resistance, as they are crucial in understanding the interactions of HIV mutants. The combination of sequence and structural features offers high accuracy with support vector machines as compared to the Random Forest classifier. Both single mutations and the acquisition of multiple mutations are important in predicting HIV resistance to certain drug treatments. We have demonstrated the practicality of these features; hence, they can be used in the future to predict resistance for other complex diseases.06/05/2018 4:47 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2638821Efficient Quartet Representations of Trees and Applications to Supertree and Summary Methods
https://www.computer.org/csdl/trans/tb/2018/03/07782719-abs.html
Quartet trees displayed by larger phylogenetic trees have long been used as inputs for species tree and supertree reconstruction. Computational constraints prevent the use of all displayed quartets in many practical problems with large numbers of taxa. We introduce the notion of an Efficient Quartet System (EQS) to represent a phylogenetic tree with a subset of the quartets displayed by the tree. We show mathematically that the set of quartets obtained from a tree via an EQS contains all of the combinatorial information of the tree itself. Using performance tests on simulated datasets, we also demonstrate that using an EQS to reduce the number of quartets in both summary method pipelines for species tree inference as well as methods for supertree inference results in only small reductions in accuracy.06/05/2018 4:48 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2638911The Discovery of Mutated Driver Pathways in Cancer: Models and Algorithms
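The quartets displayed by a tree, which an EQS subsamples, can be resolved from additive (path-length) distances via the four-point condition: for taxa {a,b,c,d}, the displayed split is the pairing with the smallest sum of within-pair distances. A minimal sketch, assuming the tree is given through its pairwise distance matrix:

```python
from itertools import combinations

def quartet_topology(dist, a, b, c, d):
    """Resolve the quartet on taxa {a,b,c,d} from additive tree
    distances: the pairing minimizing the sum of within-pair
    distances is the split displayed by the tree."""
    sums = {
        ((a, b), (c, d)): dist[a][b] + dist[c][d],
        ((a, c), (b, d)): dist[a][c] + dist[b][d],
        ((a, d), (b, c)): dist[a][d] + dist[b][c],
    }
    return min(sums, key=sums.get)

def displayed_quartets(dist, taxa):
    # All quartets displayed by the tree behind the distance matrix;
    # an EQS would keep only a carefully chosen subset of these.
    return {quartet_topology(dist, *q) for q in combinations(taxa, 4)}
```

Enumerating all C(n, 4) quartets this way is exactly what becomes infeasible for large taxon sets, which is the computational constraint motivating the EQS construction.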
https://www.computer.org/csdl/trans/tb/2018/03/07784744-abs.html
The pathogenesis of cancer in humans is still poorly understood. With the rapid development of high-throughput sequencing technologies, huge volumes of cancer genomics data have been generated. Deciphering these data poses great opportunities and challenges for computational biologists. One such key challenge is to distinguish driver mutations, genes, and pathways from passenger ones. Mutual exclusivity of gene mutations (each patient has no more than one mutation in the gene set) has been observed in various cancer types and has thus been used as an important property of a driver gene set or pathway. In this article, we review the recent development of computational models and algorithms for discovering driver pathways or modules in cancer, with a focus on mutual exclusivity-based ones.06/05/2018 4:46 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2640963Network-Regularized Sparse Logistic Regression Models for Clinical Risk Prediction and Biomarker Discovery
https://www.computer.org/csdl/trans/tb/2018/03/07784786-abs.html
Molecular profiling data (e.g., gene expression) have been used for clinical risk prediction and biomarker discovery. However, it is necessary to integrate other prior knowledge like biological pathways or gene interaction networks to improve the predictive ability and biological interpretability of biomarkers. Here, we first introduce a general regularized Logistic Regression (LR) framework with the regularization term <inline-formula><tex-math notation="LaTeX"> $\lambda \Vert \boldsymbol {w}\Vert _1 + \eta \boldsymbol {w}^T\boldsymbol {M}\boldsymbol {w}$</tex-math><alternatives> <inline-graphic xlink:href="zhang-ieq1-2640303.gif"/></alternatives></inline-formula>, which can reduce to different penalties, including Lasso, elastic net, and network-regularized terms with different <inline-formula> <tex-math notation="LaTeX">$\boldsymbol {M}$</tex-math><alternatives> <inline-graphic xlink:href="zhang-ieq2-2640303.gif"/></alternatives></inline-formula>. This framework can be easily solved in a unified manner by a cyclic coordinate descent algorithm, which avoids matrix inversion and accelerates computation. However, if the estimated <inline-formula><tex-math notation="LaTeX">$\boldsymbol {w}_i$</tex-math><alternatives><inline-graphic xlink:href="zhang-ieq3-2640303.gif"/></alternatives></inline-formula> and <inline-formula><tex-math notation="LaTeX">$\boldsymbol {w}_j$</tex-math><alternatives> <inline-graphic xlink:href="zhang-ieq4-2640303.gif"/></alternatives></inline-formula> have opposite signs, then the traditional network-regularized penalty may not perform well. 
To address this, we introduce a novel network-regularized sparse LR model with a new penalty <inline-formula><tex-math notation="LaTeX">$\lambda \Vert \boldsymbol {w}\Vert _1 + \eta |\boldsymbol {w}|^T\boldsymbol {M}|\boldsymbol {w}|$</tex-math><alternatives> <inline-graphic xlink:href="zhang-ieq5-2640303.gif"/></alternatives></inline-formula> that considers the difference between the absolute values of the coefficients. We develop two efficient algorithms to solve it. Finally, we test our methods and compare them with related ones on simulated and real data to show their efficiency.06/05/2018 4:47 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2640303Discovering Perturbation of Modular Structure in HIV Progression by Integrating Multiple Data Sources Through Non-Negative Matrix Factorization
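To illustrate the difference between the two penalties in the abstract above, here is a minimal numerical sketch (a hypothetical toy setup, not the authors' code): with M taken as the Laplacian of a two-gene network, the classical term w^T M w heavily penalizes linked genes whose coefficients have opposite signs, while the absolute-value variant |w|^T M |w| penalizes only differences in coefficient magnitude.

```python
import numpy as np

def network_penalty(w, M, lam, eta, use_abs=False):
    """lam * ||w||_1 + eta * v^T M v, where v = |w| if use_abs else v = w.

    Toy illustration of the two penalties contrasted in the abstract;
    M is assumed to be a graph Laplacian encoding gene-gene links.
    """
    v = np.abs(w) if use_abs else w
    return lam * np.sum(np.abs(w)) + eta * (v @ M @ v)

# Laplacian of a two-node graph with a single edge 1--2.
M = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

# Two linked genes with equally strong but opposite-sign coefficients.
w = np.array([1.0, -1.0])

# Classical penalty: w^T M w = (w1 - w2)^2 = 4, heavily penalized.
print(network_penalty(w, M, lam=0.0, eta=1.0))                # 4.0
# Absolute-value penalty: (|w1| - |w2|)^2 = 0, magnitudes are smooth.
print(network_penalty(w, M, lam=0.0, eta=1.0, use_abs=True))  # 0.0
```

This is why the abstract notes that the traditional penalty "may not perform well" for opposite-sign coefficients: it forces linked coefficients toward equal signed values, not merely equal importance.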
https://www.computer.org/csdl/trans/tb/2018/03/07792127-abs.html
Detecting perturbation in modular structure during HIV-1 disease progression is an important step toward understanding the stage-specific infection pattern of the HIV-1 virus in human cells. In this article, we propose a novel methodology that integrates multiple sources of biological information to identify such disruption in human gene modules during different stages of HIV-1 infection. We integrate three types of biological information (gene expression, protein-protein interaction, and gene ontology) into single gene meta-modules through non-negative matrix factorization (NMF). Because the identified meta-modules inherit this information, detecting their perturbation reflects changes in the expression pattern, the PPI structure, and the functional similarity of genes during infection progression. NMF-based clustering is utilized here to integrate modules from the different data sources into strong meta-modules. Perturbation in the meta-modular structure is identified by investigating topological and intramodular properties and ranking the meta-modules using a rank aggregation algorithm. We have also analyzed the preservation structure of significant GO terms in which the human proteins of the meta-modules participate. Moreover, we have performed an analysis showing the change in the co-regulation pattern of identified transcription factors (TFs) over the HIV progression stages.06/05/2018 4:47 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2642184Evolutionary Graph Clustering for Protein Complex Identification
https://www.computer.org/csdl/trans/tb/2018/03/07792218-abs.html
This paper presents a graph clustering algorithm, called EGCPI, to discover protein complexes in protein-protein interaction (PPI) networks. EGCPI takes into consideration both network topology and the attributes of interacting proteins, both of which have been shown to be important for protein complex discovery. EGCPI formulates the problem as an optimization problem and tackles it with evolutionary clustering. Given a PPI network, EGCPI first annotates each protein with the corresponding attributes provided in the Gene Ontology database. It then adopts a similarity measure to evaluate how similar the connected proteins are, taking the network topology into consideration. Given this measure, EGCPI discovers a number of graph clusters within which proteins are densely connected, based on an evolutionary strategy. Finally, EGCPI identifies protein complexes in each discovered cluster based on the attribute homogeneity of protein pairs. EGCPI has been tested on several real datasets, and the experimental results show that EGCPI is very effective for protein complex discovery and that evolutionary clustering helps identify protein complexes in PPI networks. The software of EGCPI can be downloaded via: <uri> https://github.com/hetiantian1985/EGCPI</uri>.06/05/2018 4:46 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2642107Dynamics in Epistasis Analysis
https://www.computer.org/csdl/trans/tb/2018/03/07817774-abs.html
Finding regulatory relationships between genes, including the direction and nature of influence between them, is a fundamental challenge in the field of molecular genetics. One classical approach to this problem is epistasis analysis. Broadly speaking, epistasis analysis infers the regulatory relationships between a pair of genes in a genetic pathway by considering the patterns of change in an observable trait resulting from single and double deletion of genes. While classical epistasis analysis has yielded deep insights on numerous genetic pathways, it is not without limitations. Here, we explore the possibility of dynamic epistasis analysis, in which, in addition to performing genetic perturbations of a pathway, we drive the pathway by a time-varying upstream signal. We explore the theoretical power of dynamic epistasis analysis by conducting an identifiability analysis of Boolean models of genetic pathways, comparing static and dynamic approaches. We find that even relatively simple input dynamics greatly increase the power of epistasis analysis to discriminate alternative network structures. Further, we explore the question of experiment design, and show that a subset of short time-varying signals, which we call dynamic primitives, allows maximum discriminative power with a reduced number of experiments.06/05/2018 4:48 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2653110A Just-in-Time Learning Based Monitoring and Classification Method for Hyper/Hypocalcemia Diagnosis
https://www.computer.org/csdl/trans/tb/2018/03/07828026-abs.html
This study focuses on the classification and pathological status monitoring of hyper/hypocalcemia in the calcium regulatory system. Utilizing the Independent Component Analysis (ICA) mixture model, samples from healthy patients are collected, diagnosed, and subsequently classified according to their underlying behaviors, characteristics, and mechanisms. Then, Just-in-Time Learning (JITL) is employed to estimate the disease status dynamically. Within JITL, to construct an appropriate similarity index for identifying relevant datasets, a novel similarity index based on the ICA mixture model is proposed in this paper to improve online model quality. The validity and effectiveness of the proposed approach are demonstrated by applying it to the calcium regulatory system under various hypocalcemic and hypercalcemic diseased conditions.06/05/2018 4:47 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2655522Regularized Non-Negative Matrix Factorization for Identifying Differentially Expressed Genes and Clustering Samples: A Survey
https://www.computer.org/csdl/trans/tb/2018/03/07845582-abs.html
Non-negative Matrix Factorization (NMF), a classical method for dimensionality reduction, has been applied in many fields. It is based on the idea that negative numbers are physically meaningless in various data-processing tasks. Apart from its contribution to conventional data analysis, the recent overwhelming interest in NMF is due to its newly discovered ability to solve challenging data mining and machine learning problems, especially in relation to gene expression data. This survey paper mainly focuses on research examining the application of NMF to identify differentially expressed genes and to cluster samples, and the main NMF models, properties, principles, and algorithms, along with their various generalizations, extensions, and modifications, are summarized. The experimental results demonstrate the performance of the various NMF algorithms in identifying differentially expressed genes and clustering samples.06/05/2018 4:48 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2665557A Combined PLS and Negative Binomial Regression Model for Inferring Association Networks from Next-Generation Sequencing Count Data
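For readers unfamiliar with the basic NMF model surveyed above, here is a minimal sketch of the classical Lee-Seung multiplicative-update algorithm, which factors a non-negative matrix X into non-negative factors W and H by minimizing the Frobenius reconstruction error. This is a generic textbook baseline, not any specific algorithm from the survey; real gene-expression applications add regularizers and careful initialization.

```python
import numpy as np

def nmf(X, k, n_iter=500, seed=0):
    """Basic NMF via Lee-Seung multiplicative updates for ||X - WH||_F^2.

    The element-wise multiplicative form keeps W and H non-negative
    throughout, matching NMF's premise that negative values are
    physically meaningless for this kind of data.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + 1e-6
    H = rng.random((k, m)) + 1e-6
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + 1e-12)  # update H, stays >= 0
        W *= (X @ H.T) / (W @ H @ H.T + 1e-12)  # update W, stays >= 0
    return W, H

# A matrix that is exactly rank 3 and non-negative should be
# reconstructed almost perfectly.
rng = np.random.default_rng(1)
X = rng.random((20, 3)) @ rng.random((3, 15))
W, H = nmf(X, k=3)
print(np.linalg.norm(X - W @ H) / np.linalg.norm(X))  # small relative error
```

The multiplicative update is a popular choice because it needs no step-size tuning, although coordinate-descent and alternating-least-squares variants often converge faster.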
https://www.computer.org/csdl/trans/tb/2018/03/07845626-abs.html
A major challenge of genomics data is to detect interactions displaying functional associations from large-scale observations. In this study, a new algorithm, cPLS, combining the partial least squares approach with negative binomial regression, is proposed to reconstruct a genomic association network from high-dimensional next-generation sequencing count data. The suggested approach is applicable to raw count data, without requiring any further pre-processing steps. In the settings investigated, the cPLS algorithm outperformed the two widely used comparison methods, graphical lasso and weighted correlation network analysis. In addition, cPLS is able to estimate the full network for thousands of genes without major computational load. Finally, we demonstrate that cPLS is capable of finding biologically meaningful associations by analyzing an example dataset from a previously published study examining the molecular anatomy of craniofacial development.06/05/2018 4:47 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2665495An Organelle Correlation-Guided Feature Selection Approach for Classifying Multi-Label Subcellular Bio-Images
https://www.computer.org/csdl/trans/tb/2018/03/07870637-abs.html
Nowadays, with advances in microscopic imaging, accurate classification of bioimage-based protein subcellular location patterns has attracted increasing attention. One of the basic challenges is how to select the useful feature components among thousands of potential features to describe the images. This is not an easy task, especially considering the high proportion of multi-location proteins. Existing feature selection methods seldom take the correlation among different cellular compartments into consideration, and thus may miss features that are co-important for several subcellular locations. To deal with this problem, we make use of the important structural correlation among different cellular compartments and propose an organelle-structural-correlation-regularized feature selection method, CSF (Common-Sets of Features), in this paper. We formulate the multi-label classification problem by adopting a group-sparsity regularizer to select common subsets of relevant features from different cellular compartments. In addition, we add a cell-structural-correlation-regularized Laplacian term, which utilizes prior biological structural information to capture the intrinsic dependency among different cellular compartments. CSF provides a new feature selection strategy for multi-label bio-image subcellular pattern classification, and the experimental results show its superiority compared with several existing algorithms.06/05/2018 4:48 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2677907Constructing DNA Barcode Sets Based on Particle Swarm Optimization
https://www.computer.org/csdl/trans/tb/2018/03/07873270-abs.html
Following the completion of the human genome project, a large amount of high-throughput bio-data was generated. To analyze these data, massively parallel sequencing, namely next-generation sequencing, was rapidly developed. DNA barcodes, attached at the beginning or end of sequencing reads, are used to identify which sample each sequence belongs to. Constructing DNA barcode sets provides the candidate DNA barcodes for this application. To increase the accuracy of DNA barcode sets, a particle swarm optimization (PSO) algorithm is modified and used to construct the DNA barcode sets in this paper. Compared with extant results, some lower bounds on the sizes of DNA barcode sets are improved. The results show that the proposed algorithm is effective in constructing DNA barcode sets.06/05/2018 4:48 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2679004PerPAS: Topology-Based Single Sample Pathway Analysis Method
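To make the barcode-set construction problem above concrete, here is a simple greedy baseline (a sketch under assumed constraints, not the paper's PSO): it collects length-n barcodes over the DNA alphabet whose pairwise Hamming distance is at least d, which is the core property that lets barcodes tolerate sequencing errors while still identifying samples. The paper's PSO explores the same search space more effectively to push the achievable set sizes (lower bounds) higher.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def greedy_barcode_set(n, d):
    """Greedily build a set of length-n DNA barcodes with pairwise
    Hamming distance >= d, scanning candidates in lexicographic order.

    A naive baseline: accept a candidate only if it is far enough
    from every barcode already chosen.
    """
    codes = []
    for cand in product("ACGT", repeat=n):
        word = "".join(cand)
        if all(hamming(word, c) >= d for c in codes):
            codes.append(word)
    return codes

codes = greedy_barcode_set(n=4, d=3)
print(len(codes))  # size of the greedy set; PSO aims to do at least as well
```

Every pair in the returned set is guaranteed to differ in at least d positions by construction; the open question the PSO addresses is how large such a set can be made.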
https://www.computer.org/csdl/trans/tb/2018/03/07874138-abs.html
Identification of intracellular pathways that play key roles in cancer progression and drug resistance is a prerequisite for developing targeted cancer treatments. The era of personalized medicine calls for computational methods that can function with one sample or a very small set of samples. Developing such methods is challenging because standard statistical approaches rely on several limiting assumptions, such as a sufficient number of samples, that prevent their application when <inline-formula><tex-math notation="LaTeX">$n$</tex-math><alternatives> <inline-graphic xlink:href="liu-ieq1-2679745.gif"/></alternatives></inline-formula> approaches one. We have developed a novel pathway analysis method called PerPAS to estimate pathway activity at the single-sample level by integrating pathway topology and transcriptomics data. In addition, PerPAS is able to identify altered pathways between cancer and control samples, as well as key nodes that contribute to the pathway activity. In our case study using breast cancer data, we show that PerPAS can identify highly altered pathways that are associated with patient survival. PerPAS identified four pathways that were associated with patient survival and were successfully validated in three independent breast cancer cohorts. In comparison to two other pathway analysis methods that function at the single-sample level, PerPAS had superior performance on both synthetic and breast cancer expression datasets. PerPAS is a free R package (<uri>http://csbi.ltdk.helsinki.fi/pub/czliu/perpas/</uri>).06/05/2018 4:48 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2679745A Comparative Study for Identifying the Chromosome-Wide Spatial Clusters from High-Throughput Chromatin Conformation Capture Data
https://www.computer.org/csdl/trans/tb/2018/03/07882716-abs.html
In the past years, high-throughput sequencing technologies have enabled massive insights into genomic annotations. In contrast, the full-scale three-dimensional arrangements of genomic regions remain relatively unknown. Thanks to recent breakthroughs in High-throughput Chromosome Conformation Capture (Hi-C) techniques, non-negative matrix factorization (NMF) has been adopted to identify local spatial clusters of genomic regions from Hi-C data. However, such non-negative matrix factorization entails a high-dimensional non-convex objective function to be optimized under non-negativity constraints. We propose and compare more than ten optimization algorithms to improve the identification of local spatial clusters via NMF. To optimize this high-dimensional, non-convex, and constrained objective function, we draw inspiration from nature and perform <italic>in silico</italic> evolution. The proposed algorithms evolve a population of candidates, with NMF acting as a local search during the evolution. The population-based optimization algorithm coordinates and guides the non-negative matrix factorization toward global optima. Experimental results show that the proposed algorithms can improve the quality of non-negative matrix factorization over recent state-of-the-art methods. The effectiveness and robustness of the proposed algorithms are supported by comprehensive performance benchmarking on chromosome-wide Hi-C contact maps of yeast and human. In addition, time complexity analysis, convergence analysis, parameter analysis, biological case studies, and gene ontology similarity analysis are conducted to demonstrate the robustness of the proposed methods from different perspectives.06/05/2018 4:46 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2684800Sparse Pathway-Induced Dynamic Network Biomarker Discovery for Early Warning Signal Detection in Complex Diseases
https://www.computer.org/csdl/trans/tb/2018/03/07887681-abs.html
In many complex diseases, the transition from the healthy stage to the catastrophic stage does not occur gradually. Recent studies indicate that the initiation and progression of such diseases comprise three stages: a healthy stage, a pre-disease stage, and a disease stage. It has been demonstrated that a certain set of trajectories can be observed in the genetic signatures at the molecular level, which might be used to detect the pre-disease stage and to take necessary medical interventions. In this paper, we propose two optimization-based algorithms for extracting the dynamic network biomarkers responsible for the catastrophic transition into the disease stage, opening new horizons for reversing disease progression at an early stage by pinpointing molecular signatures provided by high-throughput microarray data. The first algorithm relies on meta-heuristic intelligent search to characterize dynamic network biomarkers represented as a complete graph. The second algorithm induces sparsity on the adjacency matrix of the genes by taking into account the biological signaling and metabolic pathways, since not all the genes in the interactome are biologically linked. Comprehensive numerical and meta-analytical experiments verify the effectiveness of the results of the proposed approaches in terms of network size, biological meaningfulness, and verifiability.06/05/2018 4:48 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2687925BRANE Clust: Cluster-Assisted Gene Regulatory Network Inference Refinement
https://www.computer.org/csdl/trans/tb/2018/03/07888506-abs.html
Discovering meaningful gene interactions is crucial for the identification of novel regulatory processes in cells. Accurately building the related graphs remains challenging due to the large number of possible solutions given the available data. Nonetheless, enforcing <italic>a priori</italic> constraints on the graph structure, such as modularity, may reduce network indeterminacy issues. BRANE Clust (Biologically-Related A priori Network Enhancement with Clustering) refines gene regulatory network (GRN) inference using cluster information. It works as a post-processing tool for inference methods (e.g., CLR, GENIE3). In BRANE Clust, the clustering is based on the inversion of a system of linear equations involving a graph-Laplacian matrix promoting a modular structure. Our approach is validated on the DREAM4 and DREAM5 datasets with objective measures, showing significant comparative improvements. We provide additional insights on the discovery of novel regulatory or co-expressed links in the inferred <italic>Escherichia coli</italic> network, evaluated using the STRING database. The comparative pertinence of clustering is discussed computationally (SIMoNe, WGCNA, X-means) and biologically (RegulonDB). BRANE Clust software is available at: <uri> http://www-syscom.univ-mlv.fr/~pirayre/Codes-GRN-BRANE-clust.html</uri>.06/05/2018 4:46 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2688355Assessment of Semantic Similarity between Proteins Using Information Content and Topological Properties of the Gene Ontology Graph
https://www.computer.org/csdl/trans/tb/2018/03/07890833-abs.html
The semantic similarity between two interacting proteins can be estimated by combining the similarity scores of the GO terms associated with the proteins. A greater number of similar GO annotations between two proteins indicates greater interaction affinity. Existing semantic similarity measures make use of the GO graph structure, the information content of GO terms, or a combination of both. In this paper, we present a hybrid approach which utilizes both the topological features of the GO graph and the information content of the GO terms. More specifically, we 1) consider a fuzzy clustering of the GO graph based on the level of association of the GO terms, 2) estimate the GO term memberships to each cluster center based on the respective shortest path lengths, and 3) assign weights to GO term pairs on the basis of their dissimilarity with respect to the cluster centers. We test the performance of our semantic similarity measure against seven other previously published similarity measures using benchmark protein-protein interaction datasets of <italic>Homo sapiens</italic> and <italic>Saccharomyces cerevisiae</italic> based on sequence similarity, Pfam similarity, area under the ROC curve, and the <inline-formula><tex-math notation="LaTeX">$F_1$</tex-math> <alternatives> <inline-graphic xlink:href="basu-ieq1-2689762.gif" xlink:type="simple"/></alternatives></inline-formula> measure.06/07/2018 2:01 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2689762Optimization of Gene Set Annotations Using Robust Trace-Norm Multitask Learning
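The information-content ingredient of the hybrid measure above has a standard definition worth spelling out: IC(t) = -log p(t), where p(t) is the fraction of annotations that use GO term t (or any of its descendants), so rare, specific terms carry more information than broad ones. The sketch below uses hypothetical term names and counts purely for illustration.

```python
import math

def information_content(term_counts, total):
    """IC(t) = -log p(t): rarer GO terms are more informative.

    term_counts maps a GO term to its annotation count (assumed to
    already include descendant terms); total is the corpus size.
    """
    return {t: -math.log(c / total) for t, c in term_counts.items()}

# Hypothetical corpus: a root-like term annotating everything and a
# rare leaf term annotating 5 percent of proteins.
ic = information_content({"GO:root": 100, "GO:leaf": 5}, total=100)
print(ic["GO:root"])  # 0.0 -- annotating everything conveys no information
print(ic["GO:leaf"])  # ~3.0 -- the rare term is highly informative
```

Pairwise term similarities built on IC (e.g., Resnik-style measures use the IC of the most informative common ancestor) are then aggregated over the two proteins' annotation sets to produce a protein-level score.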
https://www.computer.org/csdl/trans/tb/2018/03/07891559-abs.html
Gene set enrichment (GSE) is a useful tool for analyzing and interpreting large molecular datasets generated by modern biomedical science. The accuracy and reproducibility of GSE analysis are heavily affected by the quality and integrity of gene set annotations. In this paper, we propose a novel method, robust trace-norm multitask learning, to solve the optimization problem of gene set annotations. Inspired by the binary nature of annotations, we convert the optimization of gene set annotations into a weakly supervised classification problem and use discriminative logistic regression to fit these datasets. The output of the logistic regression can then be used to measure the probability that an annotation exists. In addition, the optimization of each row of the annotation matrix can be treated as an independent weakly supervised classification task, and we use the multitask learning approach with trace-norm regularization to optimize all rows of the annotation matrix simultaneously. Finally, experiments on simulated and real data demonstrate the effectiveness and good performance of the proposed method.06/05/2018 4:48 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2690427A Spatial-Temporal Method to Detect Global Influenza Epidemics Using Heterogeneous Data Collected from the Internet
https://www.computer.org/csdl/trans/tb/2018/03/07891963-abs.html
The 2009 influenza pandemic showed how fast the influenza virus can spread globally within a short period of time. To address the challenge of timely global influenza surveillance, this paper presents a spatial-temporal method that incorporates heterogeneous data collected from the Internet to detect influenza epidemics in real time. Specifically, the influenza morbidity data, the influenza-related Google query and news data, and the international air transportation data are integrated in a multivariate hidden Markov model, designed to describe the intrinsic temporal-geographical correlation of influenza transmission for surveillance purposes. Separate models are built for 106 countries and regions in the world. Although WHO morbidity data are not always available for most countries, the proposed method achieves 90.26 to 97.10 percent accuracy on average for real-time detection of global influenza epidemics during the period from January 2005 to December 2015. Moreover, experiments show that the proposed method can even predict an influenza epidemic before it occurs, with 89.20 percent accuracy on average. Timely international surveillance results may help the authorities to prevent and control influenza at the early stage of a global pandemic.06/05/2018 4:48 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2690631Comparison of Machine Learning Approaches for Prediction of Advanced Liver Fibrosis in Chronic Hepatitis C Patients
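The hidden Markov machinery behind the surveillance model above can be sketched with the standard forward algorithm, which computes the likelihood of an observation sequence by summing over all hidden-state paths. This is a deliberately tiny univariate toy (the paper's model is multivariate and fuses several data sources); state labels, rates, and probabilities here are all hypothetical.

```python
def hmm_forward(pi, A, B, obs):
    """Forward algorithm: P(observation sequence) under a discrete HMM.

    pi: initial state probabilities; A: state transition matrix;
    B: emission matrix (B[s][o] = P(observe o | state s)); obs: symbols.
    """
    n = len(pi)
    # Initialize with the first observation.
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    # Recurse: sum over predecessor states, then emit.
    for o in obs[1:]:
        alpha = [sum(alpha[s] * A[s][t] for s in range(n)) * B[t][o]
                 for t in range(n)]
    return sum(alpha)

# Toy surveillance chain: two hidden states (0 = "no epidemic",
# 1 = "epidemic"), two observed activity levels (0 = low, 1 = high).
pi = [0.8, 0.2]
A = [[0.9, 0.1], [0.2, 0.8]]   # epidemics tend to persist once started
B = [[0.7, 0.3], [0.1, 0.9]]   # the epidemic state mostly emits high activity
print(hmm_forward(pi, A, B, [0, 0, 1]))  # likelihood of low, low, high
```

Detection then amounts to asking, via the companion forward-backward recursions, which hidden state is most probable at the current time step given the observations so far.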
https://www.computer.org/csdl/trans/tb/2018/03/07891989-abs.html
<italic>Background/Aim:</italic> Machine learning approaches have recently been used as non-invasive alternatives for staging chronic liver disease, avoiding the drawbacks of biopsy. This study aims to evaluate different machine learning techniques for predicting advanced fibrosis by combining serum biomarkers and clinical information to develop classification models. <italic>Methods:</italic> A prospective cohort of 39,567 patients with chronic hepatitis C was divided into two sets—one categorized as mild to moderate fibrosis (F0-F2), and the other categorized as advanced fibrosis (F3-F4) according to the METAVIR score. Decision tree, genetic algorithm, particle swarm optimization, and multi-linear regression models for advanced fibrosis risk prediction were developed. Receiver operating characteristic curve analysis was performed to evaluate the performance of the proposed models. <italic>Results:</italic> Age, platelet count, AST, and albumin were found to be statistically significant predictors of advanced fibrosis. The machine learning algorithms under study were able to predict advanced fibrosis in patients with chronic hepatitis C with AUROC ranging between 0.73 and 0.76 and accuracy between 66.3 and 84.4 percent. <italic>Conclusions:</italic> Machine learning approaches could be used as alternative methods for predicting the risk of advanced liver fibrosis due to chronic hepatitis C.06/05/2018 4:46 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2690848LMMO: A Large Margin Approach for Refining Regulatory Motifs
https://www.computer.org/csdl/trans/tb/2018/03/07892855-abs.html
Although discriminative motif discovery (DMD) methods are promising for eliciting motifs from high-throughput experimental data, they usually have to sacrifice accuracy and may fail to fully leverage the potential of large datasets. Recently, it has been demonstrated that the motifs identified by DMDs can be significantly improved by maximizing the area under the receiver operating characteristic curve (AUC), a metric widely used in the literature to rank the performance of elicited motifs. However, existing approaches for motif refinement choose to directly maximize the non-convex and discontinuous AUC itself, which is known to be difficult and may lead to suboptimal solutions. In this paper, we propose the Large Margin Motif Optimizer (LMMO), a large-margin-type algorithm for refining regulatory motifs. By relaxing the AUC cost function with the surrogate convex hinge loss, we show that the resulting learning problem can be cast as an instance of difference-of-convex (DC) programming, and solve it iteratively using the constrained concave-convex procedure (CCCP). To further save computational time, we combine LMMO with existing techniques for improving the scalability of large-margin-type algorithms, such as the cutting-plane method. Experimental evaluations on synthetic and real data illustrate the performance of the proposed approach. The code of LMMO is freely available at: <uri>https://github.com/ekffar/LMMO</uri>.06/05/2018 4:46 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2691325Multiple-Swarm Ensembles: Improving the Predictive Power and Robustness of Predictive Models and Its Use in Computational Biology
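The hinge relaxation mentioned above can be made concrete with a toy example (scores and margin are hypothetical, not from LMMO): empirical AUC counts, over all (positive, negative) score pairs, the fraction ranked correctly, which is a discontinuous step function of the scores; the surrogate replaces each step with the convex pairwise hinge max(0, margin - (s_pos - s_neg)), which upper-bounds the ranking error and is amenable to large-margin optimization.

```python
def auc(scores_pos, scores_neg):
    """Empirical AUC: fraction of positive-negative pairs ranked correctly
    (ties count as half)."""
    pairs = [(p, n) for p in scores_pos for n in scores_neg]
    return sum((p > n) + 0.5 * (p == n) for p, n in pairs) / len(pairs)

def hinge_auc_loss(scores_pos, scores_neg, margin=1.0):
    """Convex pairwise hinge surrogate for 1 - AUC.

    Each correctly ranked pair still incurs loss until its score gap
    exceeds the margin, which is what drives large-margin refinement.
    """
    pairs = [(p, n) for p in scores_pos for n in scores_neg]
    return sum(max(0.0, margin - (p - n)) for p, n in pairs) / len(pairs)

pos, neg = [1.2, 0.4], [0.1, -0.3]
print(auc(pos, neg))             # 1.0: every positive outscores every negative
print(hinge_auc_loss(pos, neg))  # ~0.25: margin violations remain at perfect AUC
```

Note the asymmetry the example exposes: the AUC is already perfect, yet the hinge loss is still positive, so the surrogate keeps pushing score gaps past the margin, giving the refinement a well-behaved convex objective to follow inside each CCCP iteration.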
https://www.computer.org/csdl/trans/tb/2018/03/07892919-abs.html
Machine learning is an integral part of computational biology and has already shown its use in various applications, such as prognostic tests. In the last few years in the non-biological machine learning community, ensembling techniques have shown their power in data mining competitions such as the Netflix challenge; however, such methods have not found wide use in computational biology. In this work, we endeavor to show how ensembling techniques can be applied to practical problems, including problems in the field of bioinformatics, and how they often outperform other machine learning techniques in both predictive power and robustness. Furthermore, we develop an ensembling methodology, the Multi-Swarm Ensemble (MSWE), which uses multiple particle swarm optimizations, and demonstrate its ability to further enhance the performance of ensembles.06/05/2018 4:47 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2691329Investigating Correlation between Protein Sequence Similarity and Semantic Similarity Using Gene Ontology Annotations
https://www.computer.org/csdl/trans/tb/2018/03/07903601-abs.html
Sequence similarity is a commonly used measure to compare proteins. With the increasing use of ontologies, semantic (functional) similarity is gaining importance. The correlation between these measures has been applied in the evaluation of new semantic similarity methods and in protein function prediction. In this research, we investigate the relationship between the two similarity methods. The results suggest the absence of a strong correlation between sequence and semantic similarities: there is a large number of proteins with low sequence similarity and high semantic similarity. We observe that Pearson's correlation coefficient is not sufficient to explain the nature of this relationship. Interestingly, term semantic similarity values above 0 and below 1 do not seem to play a role in improving the correlation; that is, the correlation coefficient depends only on the number of common GO terms in the proteins under comparison, and the semantic similarity measurement method does not influence it. Semantic similarity and sequence similarity thus behave distinctly. These findings have significant implications for future work on protein comparison and will help to better understand the semantic similarity between proteins.06/05/2018 2:01 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2695542Novelty Indicator for Enhanced Prioritization of Predicted Gene Ontology Annotations
https://www.computer.org/csdl/trans/tb/2018/03/07903615-abs.html
Biomolecular controlled annotations have become pivotal in computational biology, because they allow scientists to analyze large amounts of biological data to better understand test results and to infer new knowledge. Yet, biomolecular annotation databases are incomplete by definition, like our knowledge of biology, and may contain errors and inconsistent information. In this context, machine learning algorithms able to predict and prioritize new annotations are both effective and efficient, especially compared with time-consuming trials of biological validation. To limit the possibility that these techniques predict obvious and trivial high-level features, and to help prioritize their results, we introduce a new element that can improve the accuracy and relevance of the results of an annotation prediction and prioritization pipeline. We propose a novelty indicator able to state the level of “originality” of the annotations predicted for a specific gene to Gene Ontology (GO) terms. This indicator, combined with our previously introduced prediction steps, helps prioritize the most <italic>novel</italic> and <italic>interesting</italic> annotations predicted. We performed an accurate biological functional analysis of the prioritized annotations predicted with high accuracy by our indicator and previously proposed methods. The relevance of our biological findings proves the effectiveness and trustworthiness of our indicator and of its prioritization of predicted annotations.06/05/2018 4:46 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2695459A Bipartite Network and Resource Transfer-Based Approach to Infer lncRNA-Environmental Factor Associations
https://www.computer.org/csdl/trans/tb/2018/03/07903695-abs.html
Phenotypes and diseases are often determined by complex interactions between genetic factors and environmental factors (EFs). However, compared with protein-coding genes and microRNAs, there is a paucity of computational methods for understanding the associations between long non-coding RNAs (lncRNAs) and EFs. In this study, we focus on the associations between lncRNAs and EFs. Using the common miRNA partners of each lncRNA-EF pair, the competing endogenous RNA (ceRNA) hypothesis, and a resource-transfer technique within the experimentally supported lncRNA-miRNA and miRNA-EF bipartite association networks, we propose an algorithm for predicting new lncRNA-EF associations. Results show that, compared with another recently proposed method, our approach predicts more credible lncRNA-EF associations. These results support the validity of our approach for predicting biologically significant associations, which could lead to a better understanding of the underlying molecular processes.06/05/2018 4:48 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2695187DNA Assembly with De Bruijn Graphs Using an FPGA Platform
https://www.computer.org/csdl/trans/tb/2018/03/07906528-abs.html
This paper presents an FPGA implementation of a DNA assembly algorithm, called Ray, initially developed to run on parallel CPUs. The OpenCL language is used and the focus is placed on modifying and optimizing the original algorithm to better suit the new parallelization tool and the radically different hardware architecture. The results show that the execution time is roughly one fourth that of the CPU, and that factoring in energy consumption yields a tenfold saving.06/05/2018 4:48 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2696522Aggregation for Computing Multi-Modal Stationary Distributions in 1-D Gene Regulatory Networks
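The de Bruijn graph at the heart of assemblers such as Ray can be sketched in a few lines. This plain-Python construction (nodes are (k-1)-mers, edges come from overlapping k-mers in the reads) is only a conceptual baseline, not the OpenCL/FPGA implementation described above:

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a de Bruijn graph: each k-mer contributes an edge from its
    (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph
```

Contigs then correspond to unambiguous paths through this graph, which is the part the paper parallelizes on FPGA hardware.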
https://www.computer.org/csdl/trans/tb/2018/03/07913654-abs.html
This paper proposes aggregation-based, three-stage algorithms to overcome the numerical problems encountered in computing stationary distributions and mean first passage times for multi-modal birth-death processes with large state spaces. The birth-death processes considered, which are defined by Chemical Master Equations, are used in modeling the stochastic behavior of gene regulatory networks. Computing stationary probabilities for a multi-modal distribution from Chemical Master Equations is prone to numerical problems because, as the size of the state space increases, the probability values run out of the representation range of standard programming languages. Aggregation is shown to provide a solution to this problem by first analyzing reduced-size subsystems in isolation and then considering the transitions between these subsystems. The proposed algorithms are applied to study the bimodal behavior of the lac operon of <italic>E. coli</italic> described with a one-dimensional birth-death model. Thus, the entire parameter range of bimodality for the stochastic model of the lac operon is determined.06/05/2018 4:48 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2699177NAHAL-Flex: A Numerical and Alphabetical Hinge Detection Algorithm for Flexible Protein Structure Alignment
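The representation-range problem the aggregation addresses is visible in the standard detailed-balance recurrence for a birth-death chain. The sketch below computes the stationary distribution in log space, a common workaround for underflow; it is not the authors' three-stage aggregation algorithm:

```python
import math

def stationary_birth_death(birth, death, n):
    """Stationary distribution of a birth-death chain on states 0..n via
    detailed balance pi[i+1]*death[i+1] = pi[i]*birth[i], in log space."""
    logpi = [0.0]
    for i in range(n):
        logpi.append(logpi[-1] + math.log(birth[i]) - math.log(death[i + 1]))
    m = max(logpi)                             # shift before exponentiating
    w = [math.exp(x - m) for x in logpi]
    z = sum(w)
    return [x / z for x in w]
```

For multi-modal distributions over very large state spaces even the log-space ratios can span extreme magnitudes, which is the regime where analyzing aggregated subsystems separately pays off.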
https://www.computer.org/csdl/trans/tb/2018/03/07930520-abs.html
Flexible proteins are proteins that undergo conformational changes in their structures. Protein flexibility analysis is critical for classifying and understanding protein functionality. For that analysis, the hinge areas where proteins show flexibility must be detected. To detect the location of the hinges, previous methods have utilized the three-dimensional (3D) structure of proteins, which is computationally expensive. To reduce the computational complexity, this study proposes a novel text-based method using structural alphabets (SAs) for detecting the hinge position, called NAHAL-Flex. Protein structures were encoded to a particular type of SA called the protein folding shape code (PFSC), which remains unaffected by location, scale, and rotation. The flexible regions of the proteins are the only places in which letter sequences can be distorted. With this knowledge, it is possible to find the longest alignment path of two letter sequences using a dynamic programming (DP) algorithm. Then, the proposed method looks for regions where the alphabet sequence is distorted to find the most probable hinge positions. In order to reduce the number of hinge positions, a genetic algorithm (GA) was utilized to find the best candidate hinge points. To evaluate the method's effectiveness, four different flexible and rigid protein databases, including two small datasets and two large datasets, were utilized. For the small datasets, the NAHAL-Flex method was comparable to state-of-the-art structural flexible alignment methods. The results for the large datasets show that NAHAL-Flex outperforms some well-known alignment methods, e.g., DaliLite, Matt, DeepAlign, and TM-align; NAHAL-Flex was both faster and more accurate than the other methods.06/05/2018 4:47 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2705080Advances in the Application and Development of Non-Linear Global Optimization Techniques in Computational Structural Biology
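The DP alignment step on structural-alphabet strings can be sketched with a generic Needleman-Wunsch scorer. The match/mismatch/gap scores below are placeholders, not the PFSC-specific scoring used by NAHAL-Flex:

```python
def align_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of two letter sequences (Needleman-Wunsch)."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + s,   # (mis)match
                           dp[i - 1][j] + gap,      # gap in b
                           dp[i][j - 1] + gap)      # gap in a
    return dp[n][m]
```

Regions where the optimal path accumulates mismatches or gaps are the distorted stretches the method inspects as candidate hinges.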
https://www.computer.org/csdl/trans/tb/2018/03/08371204-abs.html
06/05/2018 4:47 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2817267Correction to “Detecting Essential Proteins Based on Network Topology, Gene Expression Data, and Gene Ontology Information”
https://www.computer.org/csdl/trans/tb/2018/03/08371727-abs.html
06/04/2018 2:01 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2813918A Copula-Based Granger Causality Measure for the Analysis of Neural Spike Train Data
https://www.computer.org/csdl/trans/tb/2018/02/07001041-abs.html
In systems neuroscience, it is becoming increasingly common to record the activity of hundreds of neurons simultaneously via electrode arrays. The ability to accurately measure the causal interactions among multiple neurons in the brain is crucial to understanding how neurons work in concert to generate specific brain functions. The development of new statistical methods for assessing causal influence between spike trains is still an active field of neuroscience research. Here, we suggest a copula-based Granger causality measure for the analysis of neural spike train data. This method is built upon our recent work on copula Granger causality for the analysis of continuous-valued time series by extending it to point-process neural spike train data. The proposed method is therefore able to reveal nonlinear and high-order causality in spike trains while retaining all the computational advantages of Granger causality, such as model-free and efficient estimation and variability assessment. The performance of our algorithm can be further boosted with time-reversed data. Our method performed well on extensive simulations, and was then demonstrated on neural activity simultaneously recorded from the primary visual cortex of a monkey performing a contour detection task.04/03/2018 3:49 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2014.2388311Modeling the Geometry and Dynamics of the Endoplasmic Reticulum Network
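For intuition, the linear Granger-causality baseline that the copula method generalizes asks whether adding one series' past reduces the prediction error of the other. A minimal order-1 least-squares version is sketched below; it is illustrative only, not the point-process copula estimator of the paper:

```python
import numpy as np

def granger_variance_ratio(x, y):
    """Ratio of residual variances for predicting y with vs. without x's past
    (order-1 models). A ratio well above 1 suggests x Granger-causes y."""
    Yt, Yp, Xp = y[1:], y[:-1], x[:-1]
    # restricted model: y_t ~ y_{t-1}
    A = np.column_stack([Yp, np.ones_like(Yp)])
    r_res = Yt - A @ np.linalg.lstsq(A, Yt, rcond=None)[0]
    # full model: y_t ~ y_{t-1} + x_{t-1}
    B = np.column_stack([Yp, Xp, np.ones_like(Yp)])
    r_full = Yt - B @ np.linalg.lstsq(B, Yt, rcond=None)[0]
    return np.var(r_res) / np.var(r_full)
```

The copula formulation replaces these linear regressions with copula-based conditional models, which is what lets it capture nonlinear and higher-order dependence in spike trains.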
https://www.computer.org/csdl/trans/tb/2018/02/07006715-abs.html
The endoplasmic reticulum (ER) is an intricate network that pervades the entire cortex of plant cells, and its geometric shape undergoes drastic changes. This paper proposes a mathematical model to reconstruct geometric network dynamics by combining the node movements within the network and the topological changes engendered by these nodes. The network topology in the model is determined by a modified optimization procedure from earlier work (Lemarchand et al., 2014) which minimizes the total length taking into account both degree and angle constraints, beyond the conditions of connectedness and planarity. A novel feature for solving our optimization problem is the use of “lifted” angle constraints, which allows one to considerably reduce the solution runtimes. Using this optimization technique and a Langevin approach for the branching node movement, the simulated network dynamics reproduce the ER network dynamics observed under the latrunculin B-treated condition and recapture features such as the appearance/disappearance of loops within the ER under the native condition. The proposed modeling approach allows quantitative comparison of networks between the model and experimental data based on topological changes induced by node dynamics. An increased temporal resolution of experimental data will allow a more detailed comparison of network dynamics using this modeling approach.04/03/2018 3:50 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2015.2389226Optimal Fault Detection and Diagnosis in Transcriptional Circuits Using Next-Generation Sequencing
https://www.computer.org/csdl/trans/tb/2018/02/07045509-abs.html
We propose a methodology for model-based fault detection and diagnosis for stochastic Boolean dynamical systems indirectly observed through a single time series of transcriptomic measurements using Next Generation Sequencing (NGS) data. The fault detection consists of an innovations filter followed by a fault certification step, and requires no knowledge about the possible system faults. The innovations filter uses the optimal Boolean state estimator, called the Boolean Kalman Filter (BKF). In the presence of knowledge about the possible system faults, we propose an additional step of fault diagnosis based on a multiple model adaptive estimation (MMAE) method consisting of a bank of BKFs running in parallel. Performance is assessed by means of false detection and misdiagnosis rates, as well as average times until correct detection and diagnosis. The efficacy of the proposed methodology is demonstrated via numerical experiments using a <italic>p53-MDM2</italic> negative feedback loop Boolean network with stuck-at faults that model molecular events commonly found in cancer.04/03/2018 3:48 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2015.2404819MeTDiff: A Novel Differential RNA Methylation Analysis for MeRIP-Seq Data
https://www.computer.org/csdl/trans/tb/2018/02/07052329-abs.html
N6-Methyladenosine (m<sup>6</sup>A) transcriptome methylation is an exciting new research area that has just captured the attention of the research community. In this paper, we present MeTDiff, a novel computational tool for predicting differential m<sup>6</sup>A methylation sites from Methylated RNA immunoprecipitation sequencing (MeRIP-Seq) data. Compared with the existing algorithm exomePeak, the advantages of MeTDiff are that it explicitly models the read variation in the data and also devises a more powerful likelihood ratio test for differential methylation site prediction. Comprehensive evaluation of MeTDiff's performance using both simulated and real datasets showed that MeTDiff is much more robust and achieves much higher sensitivity and specificity than exomePeak. The R package “MeTDiff” and additional details are available at: <uri>https://github.com/compgenomics/MeTDiff</uri>.04/03/2018 3:50 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2015.2403355Computational Prediction of Pathogenic Network Modules in Fusarium verticillioides
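The likelihood-ratio-test idea can be sketched for a single site with binomial read counts (IP reads out of total reads) in two conditions. MeTDiff's actual model additionally accounts for read-count variation, so this is only the bare LRT skeleton:

```python
import math

def loglik(k, n, p):
    """Binomial log-likelihood up to a constant (0 < p < 1 assumed)."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

def lrt_stat(k1, n1, k2, n2):
    """2*(logL_alt - logL_null) for equal vs. separate methylation rates;
    approximately chi-square with 1 d.o.f. under the null."""
    p0 = (k1 + k2) / (n1 + n2)                      # null: common proportion
    l0 = loglik(k1, n1, p0) + loglik(k2, n2, p0)
    l1 = loglik(k1, n1, k1 / n1) + loglik(k2, n2, k2 / n2)
    return 2 * (l1 - l0)
```

Sites with large statistics are the candidate differential methylation sites; a variance-aware model like MeTDiff's guards against calling sites whose counts merely fluctuate between replicates.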
https://www.computer.org/csdl/trans/tb/2018/02/07118172-abs.html
<italic>Fusarium verticillioides</italic> is a fungal pathogen that triggers stalk rots and ear rots in maize. In this study, we performed a comparative analysis of wild type and loss-of-virulence mutant <italic>F. verticillioides </italic> co-expression networks to identify subnetwork modules that are associated with its pathogenicity. We constructed the <italic>F. verticillioides</italic> co-expression networks from RNA-Seq data and searched through these networks to identify subnetwork modules that are differentially activated between the wild type and mutant <italic>F. verticillioides</italic>, which considerably differ in terms of pathogenic potential. A greedy seed-and-extend approach was utilized in our search, where we also used an efficient branch-out technique for reliable prediction of functional subnetwork modules in the fungus. Through our analysis, we identified four potential pathogenicity-associated subnetwork modules, each of which consists of interacting genes with coordinated expression patterns, but whose activation level is significantly different in the wild type and the mutant. The predicted modules were composed of functionally coherent genes and were topologically cohesive. Furthermore, they contained several orthologs of known pathogenic genes in other fungi, which may play important roles in fungal pathogenesis.04/03/2018 3:48 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2015.2440232Bayesian Multiresolution Variable Selection for Ultra-High Dimensional Neuroimaging Data
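A generic greedy seed-and-extend search of the kind described above can be sketched as follows. The graph and the per-gene differential-activation scores are hypothetical, and the paper's branch-out technique is omitted:

```python
def seed_and_extend(graph, scores, seed):
    """Grow a connected module from a seed gene, greedily adding the
    neighbor with the highest score while it improves the module score."""
    module, frontier = {seed}, set(graph[seed])
    total = scores[seed]
    while frontier:
        cand = max(frontier, key=lambda g: scores[g])
        if scores[cand] <= 0:          # best candidate no longer improves
            break
        module.add(cand)
        total += scores[cand]
        frontier = (frontier | set(graph[cand])) - module
    return module, total
```

Running this from many seeds and keeping high-scoring, non-redundant modules is the usual pattern; scores here would be differential-activation statistics between wild type and mutant.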
https://www.computer.org/csdl/trans/tb/2018/02/07126946-abs.html
Ultra-high dimensional variable selection has become increasingly important in analysis of neuroimaging data. For example, in the Autism Brain Imaging Data Exchange (ABIDE) study, neuroscientists are interested in identifying important biomarkers for early detection of the autism spectrum disorder (ASD) using high resolution brain images that include hundreds of thousands of voxels. However, most existing methods are not feasible for solving this problem due to their extensive computational costs. In this work, we propose a novel multiresolution variable selection procedure under a Bayesian probit regression framework. It recursively uses posterior samples for coarser-scale variable selection to guide the posterior inference on finer-scale variable selection, leading to very efficient Markov chain Monte Carlo (MCMC) algorithms. The proposed algorithms are computationally feasible for ultra-high dimensional data. Also, our model incorporates two levels of structural information into variable selection using Ising priors: the spatial dependence between voxels and the functional connectivity between anatomical brain regions. Applied to the resting state functional magnetic resonance imaging (R-fMRI) data in the ABIDE study, our methods identify voxel-level imaging biomarkers highly predictive of the ASD, which are biologically meaningful and interpretable. Extensive simulations also show that our methods achieve better performance in variable selection compared to existing methods.04/03/2018 3:48 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2015.2440244Examining De Novo Transcriptome Assemblies via a Quality Assessment Pipeline
https://www.computer.org/csdl/trans/tb/2018/02/07126949-abs.html
New <italic>de novo</italic> transcriptome assembly and annotation methods provide an incredible opportunity to study the transcriptome of organisms that lack an assembled and annotated genome. There are currently a number of <italic>de novo</italic> transcriptome assembly methods, but it has been difficult to evaluate the quality of these assemblies. In order to assess the quality of the transcriptome assemblies, we composed a workflow of multiple quality check measurements that in combination provide a clear evaluation of the assembly performance. We presented novel transcriptome assemblies and functional annotations for Pacific Whiteleg Shrimp (<italic>Litopenaeus vannamei</italic> ), a mariculture species with great national and international interest, and no solid transcriptome/genome reference. We examined Pacific Whiteleg transcriptome assemblies via multiple metrics, and provide an improved gene annotation. Our investigations show that assessing the quality of an assembly purely based on the assembler's statistical measurements can be misleading; we propose a hybrid approach that consists of statistical quality checks and further biological-based evaluations.04/03/2018 3:48 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2015.2446478Gender Identification of Human Brain Image with A Novel 3D Descriptor
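One of the purely statistical assembler measurements that such workflows combine with biological checks is N50: the contig length at which contigs of that length or longer contain at least half of the total assembled bases. A minimal implementation (and exactly the kind of statistic the abstract warns can mislead when used alone):

```python
def n50(contig_lengths):
    """N50 of an assembly: largest L such that contigs >= L cover
    at least half of the total assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
```

A high N50 can be produced by aggressive (and wrong) joining of contigs, which is why the proposed pipeline pairs such statistics with biological evaluations like annotation-based checks.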
https://www.computer.org/csdl/trans/tb/2018/02/07130613-abs.html
Determining gender by examining the human brain is not a simple task because the spatial structure of the human brain is complex, and no obvious differences can be seen by the naked eye. In this paper, we propose a novel three-dimensional feature descriptor, the three-dimensional weighted histogram of gradient orientation (3D WHGO), to describe this complex spatial structure. The descriptor combines local information for signal intensity and global three-dimensional spatial information for the whole brain. We also improve a framework to address the classification of three-dimensional images based on MRI. This framework, the three-dimensional spatial pyramid, uses additional information regarding the spatial relationship between features. The proposed method can be used to distinguish gender at the individual level. We examine our method by using the gender identification of individual magnetic resonance imaging (MRI) scans of a large sample of healthy adults across four research sites, resulting in up to individual-level accuracies under the optimized parameters for distinguishing between females and males. Compared with previous methods, the proposed method obtains higher accuracy, which suggests that this technology has higher discriminative power. With its improved performance in gender identification, the proposed method may have the potential to inform clinical practice and aid in research on neurological and psychiatric disorders.04/03/2018 3:48 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2015.2448081Cortical Thinning and Cognitive Impairment in Parkinson's Disease without Dementia
https://www.computer.org/csdl/trans/tb/2018/02/07214255-abs.html
Parkinson's disease (PD) is a progressive neurodegenerative disorder characterized clinically by motor dysfunction (bradykinesia, rigidity, tremor, and postural instability), and pathologically by the loss of dopaminergic neurons in the substantia nigra of the basal ganglia. Growing literature supports that cognitive deficits may also be present in PD, even in non-demented patients. Gray matter (GM) atrophy has been reported in PD and may be related to cognitive decline. This study investigated cortical thickness in non-demented PD subjects and elucidated its relationship to cognitive impairment using high-resolution T1-weighted brain MRI and comprehensive cognitive function scores from 71 non-demented PD and 48 control subjects matched for age, gender, and education. Cortical thickness was compared between groups using a flexible hierarchical multivariate Bayesian model, which accounts for correlations between brain regions. Correlation analyses were performed among brain areas and cognitive domains as well, which showed significant group differences in the PD population. Compared to Controls, PD subjects demonstrated significant age-adjusted cortical thinning predominantly in inferior and superior parietal areas and extended to superior frontal, superior temporal, and precuneus areas (posterior probability >0.9). Cortical thinning was also found in the left precentral and lateral occipital, and right postcentral, middle frontal, and fusiform regions (posterior probability >0.9). PD patients showed significantly reduced cognitive performance in executive function, including set shifting (p = 0.005) and spontaneous flexibility (p = 0.02), which were associated with the above cortical thinning regions (p < 0.05).04/03/2018 3:49 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2015.2465951A GRASP-Based Heuristic for the Sorting by Length-Weighted Inversions Problem
https://www.computer.org/csdl/trans/tb/2018/02/07229299-abs.html
Genome rearrangements are large-scale mutational events that affect genomes during the evolutionary process; these mutations thus differ from point mutations. They can move genes from one place to another, change the orientation of some genes, or even change the number of chromosomes. In this work, we deal with inversion events, which occur when a segment of a DNA sequence in the genome is reversed. In our model, each inversion costs the number of elements in the reversed segment. We present a new algorithm for this problem based on the metaheuristic called Greedy Randomized Adaptive Search Procedure (GRASP), which has been routinely used to find solutions for combinatorial optimization problems. In essence, we implemented an iterative process in which each iteration receives a feasible solution whose neighborhood is investigated. Our analysis shows that we outperform any other approach by a significant margin. We also use our algorithm to build phylogenetic trees for a subset of species in the <italic>Yersinia</italic> genus, and we compare our trees to other results in the literature.04/03/2018 3:48 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2015.2474400Complex Network Measures in Autism Spectrum Disorders
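The length-weighted cost model, plus a simple greedy baseline sorter, can be sketched as below. This is only a deterministic baseline (place the smallest misplaced element each step); the GRASP heuristic itself randomizes such constructions, restarts them, and applies local search, which is omitted here:

```python
def reverse(perm, i, j):
    """Apply the inversion of segment perm[i..j] (inclusive)."""
    return perm[:i] + perm[i:j + 1][::-1] + perm[j + 1:]

def greedy_sort_cost(perm):
    """Sort a permutation of 1..n by inversions, placing 1, 2, ... in turn;
    each inversion costs the number of elements in the reversed segment."""
    perm, cost = list(perm), 0
    for v in range(1, len(perm) + 1):
        pos = perm.index(v)
        if pos != v - 1:
            perm = reverse(perm, v - 1, pos)
            cost += pos - (v - 1) + 1      # segment length
    return cost
```

Because short inversions are cheap, a good heuristic prefers many short reversals over few long ones, which is the trade-off the GRASP neighborhood search explores.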
https://www.computer.org/csdl/trans/tb/2018/02/07239574-abs.html
Recent studies have suggested abnormal brain network organization in subjects with Autism Spectrum Disorders (ASD). Here we applied spectral clustering algorithm, diverse centrality measures (betweenness (BC), clustering (CC), eigenvector (EC), and degree (DC)), and also the network entropy (NE) to identify brain sub-systems associated with ASD. We have found that BC increases in the following ASD clusters: in the somatomotor, default-mode, cerebellar, and fronto-parietal. On the other hand, CC, EC, and DC decrease in the somatomotor, default-mode, and cerebellar clusters. Additionally, NE decreases in ASD in the cerebellar cluster. These findings reinforce the hypothesis of under-connectivity in ASD and suggest that the difference in the network organization is more prominent in the cerebellar system. The cerebellar cluster presents reduced NE in ASD, which relates to a more regular organization of the networks. These results might be important to improve current understanding about the etiological processes and the development of potential tools supporting diagnosis and therapeutic interventions.04/03/2018 3:48 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2015.2476787Detecting Multivariate Gene Interactions in RNA-Seq Data Using Optimal Bayesian Classification
https://www.computer.org/csdl/trans/tb/2018/02/07286795-abs.html
Differential gene expression testing is an analysis commonly applied to RNA-Seq data. These statistical tests identify genes that are significantly different across phenotypes. We extend this testing paradigm to multivariate gene interactions from a classification perspective, with the goal of detecting novel gene interactions for the phenotypes of interest. This is achieved through our novel computational framework, comprised of a hierarchical statistical model of the RNA-Seq processing pipeline and the corresponding optimal Bayesian classifier (OBC). Through Markov Chain Monte Carlo sampling and Monte Carlo integration, we compute quantities for which no analytical formulation exists. The performance is then illustrated on an expression dataset from a dietary intervention study, where we identify gene pairs that have low classification error yet were not identified as differentially expressed. Additionally, we have released an open-source software package to perform OBC classification on RNA-Seq data, available at <uri> http://bit.ly/obc_package</uri>.04/03/2018 3:48 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2015.2485223SparseNCA: Sparse Network Component Analysis for Recovering Transcription Factor Activities with Incomplete Prior Information
https://www.computer.org/csdl/trans/tb/2018/02/07308032-abs.html
Network component analysis (NCA) is an important method for inferring transcriptional regulatory networks (TRNs) and recovering transcription factor activities (TFAs) using gene expression data and prior information about the connectivity matrix. The algorithms currently available crucially depend on the completeness of this prior information. However, inaccuracies in the measurement process may introduce incompleteness into the available knowledge about the connectivity matrix. Hence, computationally efficient algorithms are needed to overcome the possible incompleteness in the available data. We present a sparse network component analysis algorithm (sparseNCA), which incorporates the effect of incompleteness in the estimation of TRNs by imposing an additional sparsity constraint using the <inline-formula><tex-math notation="LaTeX">$\ell _1$</tex-math><alternatives> <inline-graphic xlink:href="serpedin-ieq1-2495224.gif"/></alternatives></inline-formula> norm, which results in greater estimation accuracy. In order to improve the computational efficiency, an iterative re-weighted <inline-formula><tex-math notation="LaTeX">$\ell _2$</tex-math><alternatives> <inline-graphic xlink:href="serpedin-ieq2-2495224.gif"/></alternatives></inline-formula> method is proposed for the NCA problem which not only promotes sparsity but is hundreds of times faster than the <inline-formula> <tex-math notation="LaTeX">$\ell _1$</tex-math><alternatives><inline-graphic xlink:href="serpedin-ieq3-2495224.gif"/> </alternatives></inline-formula> norm based solution. The performance of sparseNCA is rigorously compared to that of FastNCA and NINCA using synthetic data as well as real data. It is shown that sparseNCA outperforms the existing state-of-the-art algorithms both in terms of estimation accuracy and consistency, with the added advantage of low computational complexity. 
The performance of sparseNCA compared to its predecessors is particularly pronounced in the case of incomplete prior information about the sparsity of the network. Subnetwork analysis is performed on the <italic>E. coli</italic> data, which reiterates the superior consistency of the proposed algorithm.04/03/2018 3:49 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2015.2495224hc-OTU: A Fast and Accurate Method for Clustering Operational Taxonomic Units Based on Homopolymer Compaction
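The iterative re-weighted l2 idea can be sketched for a generic sparse linear system: each pass solves a weighted least-squares problem whose weights shrink small coefficients, approximating the l1 penalty. This is the textbook IRLS scheme, not the authors' NCA-specific formulation:

```python
import numpy as np

def irls_sparse(A, b, iters=50, eps=1e-8):
    """Approximate a sparse solution of the underdetermined system Ax = b
    by iteratively re-weighted l2 minimization."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]   # start: min-norm solution
    for _ in range(iters):
        W = np.diag(np.abs(x) + eps)           # weights from current estimate
        # closed form for: minimize sum x_i^2 / w_i  subject to  Ax = b
        x = W @ A.T @ np.linalg.solve(A @ W @ A.T, b)
    return x
```

Each iteration is a plain linear solve, which is why re-weighted l2 schemes can be orders of magnitude faster than generic l1 solvers, the speedup the abstract reports.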
https://www.computer.org/csdl/trans/tb/2018/02/07420667-abs.html
To assess the genetic diversity of an environmental sample in metagenomics studies, the amplicon sequences of 16s rRNA genes need to be clustered into operational taxonomic units (OTUs). Many existing tools for OTU clustering trade off between accuracy and computational efficiency. We propose a novel OTU clustering algorithm, hc-OTU, which achieves high accuracy and fast runtime by exploiting homopolymer compaction and k-mer profiling to significantly reduce the computing time for pairwise distances of amplicon sequences. We compare the proposed method with other widely used methods, including UCLUST, CD-HIT, MOTHUR, ESPRIT, ESPRIT-TREE, and CLUSTOM, comprehensively, using nine different experimental datasets and many evaluation metrics, such as normalized mutual information, adjusted Rand index, measure of concordance, and F-score. Our evaluation reveals that the proposed method achieves a level of accuracy comparable to the respective accuracy levels of MOTHUR and ESPRIT-TREE, two widely used OTU clustering methods, while delivering orders-of-magnitude speedups.04/03/2018 3:49 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2535326Predicting the Absorption Potential of Chemical Compounds Through a Deep Learning Approach
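The two preprocessing ideas named above, homopolymer compaction and k-mer profiling, can be sketched directly. The L1 profile distance below is an illustrative choice, not necessarily the exact distance hc-OTU computes:

```python
from itertools import groupby

def compact(seq):
    """Homopolymer compaction: collapse runs of identical bases
    (e.g., AAACCGT -> ACGT), masking pyrosequencing homopolymer errors."""
    return "".join(base for base, _ in groupby(seq))

def kmer_profile(seq, k=3):
    prof = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        prof[kmer] = prof.get(kmer, 0) + 1
    return prof

def profile_distance(a, b, k=3):
    """L1 distance between k-mer profiles of the compacted sequences."""
    pa, pb = kmer_profile(compact(a), k), kmer_profile(compact(b), k)
    keys = set(pa) | set(pb)
    return sum(abs(pa.get(x, 0) - pb.get(x, 0)) for x in keys)
```

Comparing compact k-mer profiles is far cheaper than full pairwise alignment, which is where the reported speedups in pairwise distance computation come from.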
https://www.computer.org/csdl/trans/tb/2018/02/07420679-abs.html
The human colorectal carcinoma cell line (Caco-2) is a commonly used in-vitro test that predicts the absorption potential of orally administered drugs. In-silico prediction methods, based on the Caco-2 assay data, may increase the effectiveness of the high-throughput screening of new drug candidates. However, previously developed in-silico models that predict the Caco-2 cellular permeability of chemical compounds use handcrafted features that may be dataset-specific and induce over-fitting problems. A Deep Neural Network (DNN) generates high-level features based on non-linear transformations of raw features, which provides high discriminant power and, therefore, creates a well-generalized model. We present a DNN-based binary Caco-2 permeability classifier. Our model was constructed based on 663 chemical compounds with in-vitro Caco-2 apparent permeability data. Two hundred nine molecular descriptors are used for generating the high-level features during DNN model generation. Dropout regularization is applied to address the over-fitting problem, and the non-linear activation function Rectified Linear Unit (ReLU) is adopted to reduce the vanishing gradient problem. The results demonstrate that the high-level features generated by the DNN are more robust than handcrafted features for predicting the cellular permeability of structurally diverse chemical compounds in Caco-2 cell lines.04/03/2018 3:48 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.25352333D Genome Reconstruction with ShRec3D+ and Hi-C Data
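The ingredients named above, ReLU activations and dropout regularization, appear in the forward pass of any small fully connected network. Below is a NumPy sketch with illustrative layer sizes, not the paper's 209-descriptor architecture:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, weights, biases, drop_rate=0.5, train=False, rng=None):
    """Forward pass of a fully connected net with ReLU hidden layers,
    inverted dropout during training, and a sigmoid binary output."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)
        if train:  # inverted dropout: zero units, rescale the survivors
            mask = (rng.random(h.shape) >= drop_rate) / (1.0 - drop_rate)
            h = h * mask
    logits = h @ weights[-1] + biases[-1]
    return 1.0 / (1.0 + np.exp(-logits))
```

ReLU keeps gradients from shrinking multiplicatively through layers (the vanishing-gradient issue), while dropout forces the learned high-level features not to co-adapt, addressing the over-fitting the abstract attributes to handcrafted features.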
https://www.computer.org/csdl/trans/tb/2018/02/07422026-abs.html
Hi-C technology, a chromosome conformation capture (3C) based method, has been developed to capture genome-wide interactions at a given resolution. The next challenge is to computationally reconstruct the 3D structure of a genome from the 3C-derived data. Several existing methods have been proposed to obtain a consensus structure or ensemble structures. These methods can be categorized as probabilistic models or restraint-based models. In this paper, we propose a method, named ShRec3D+, to infer a consensus 3D structure of a genome from Hi-C data. The method is a two-step algorithm based on the ChromSDE and ShRec3D methods. First, it corrects the conversion factor via golden section search to convert interaction frequency data into a distance-weighted graph. Second, it applies a shortest-path algorithm and a multi-dimensional scaling (MDS) algorithm to compute the 3D coordinates of a set of genomic loci from the distance graph. We validate ShRec3D+'s accuracy on both simulated data and publicly available Hi-C data. Our test results indicate that our method successfully corrects the parameter at a given resolution, is more accurate than ShRec3D, and is more efficient and robust than ChromSDE.04/02/2018 2:01 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2535372Autumn Algorithm: Computation of Hybridization Networks for Realistic Phylogenetic Trees
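The second step's shortest-path-plus-MDS skeleton can be sketched with NumPy. The golden-section search for the conversion factor is omitted here, and the input is assumed to be a distance matrix already converted from interaction frequencies:

```python
import numpy as np

def shortest_paths(D):
    """Complete a partial distance matrix with all-pairs shortest paths
    (Floyd-Warshall); missing links should be set to large values."""
    D = D.copy()
    for k in range(len(D)):
        D = np.minimum(D, D[:, k:k + 1] + D[k:k + 1, :])
    return D

def classical_mds(D, dim=3):
    """Embed a distance matrix into `dim` coordinates via classical MDS."""
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J            # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:dim]     # keep the top eigenpairs
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))
```

Shortest paths fill in distances between loci with weak or missing Hi-C signal, after which classical MDS gives a consensus embedding in one eigendecomposition.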
https://www.computer.org/csdl/trans/tb/2018/02/07423752-abs.html
A minimum hybridization network is a rooted phylogenetic network that displays two given rooted phylogenetic trees using a minimum number of reticulations. Previous mathematical work on their calculation has usually assumed the input trees to be bifurcating, correctly rooted, or that they both contain the same taxa. These assumptions do not hold in biological studies and “realistic” trees have multifurcations, are difficult to root, and rarely contain the same taxa. We present a new algorithm for computing minimum hybridization networks for a given pair of “realistic” rooted phylogenetic trees. We also describe how the algorithm might be used to improve the rooting of the input trees. We introduce the concept of “autumn trees”, a nice framework for the formulation of algorithms based on the mathematics of “maximum acyclic agreement forests”. While the main computational problem is hard, the run-time depends mainly on how different the given input trees are. In biological studies, where the trees are reasonably similar, our parallel implementation performs well in practice. The algorithm is available in our open source program Dendroscope 3, providing a platform for biologists to explore rooted phylogenetic networks. We demonstrate the utility of the algorithm using several previously studied data sets.04/03/2018 3:48 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2537326DTL-RnB: Algorithms and Tools for Summarizing the Space of DTL Reconciliations
https://www.computer.org/csdl/trans/tb/2018/02/07423775-abs.html
Phylogenetic tree reconciliation is an important technique for reconstructing the evolutionary histories of species, genes, and other dependent entities. Reconciliation is typically performed in a maximum parsimony framework, and the number of optimal reconciliations can grow exponentially with the size of the trees, making it difficult to understand the solution space. This paper demonstrates how a small number of reconciliations can be found that collectively contain the most highly supported events in the solution space. While we show that the formal problem is NP-complete, we give a <inline-formula><tex-math notation="LaTeX">$1-\frac{1}{e}$</tex-math><alternatives> <inline-graphic xlink:href="libeskindhadas-ieq1-2537319.gif"/></alternatives></inline-formula> approximation algorithm, experimental results that indicate its effectiveness, and the new DTL-RnB software tool that uses our algorithms to summarize the space of optimal reconciliations (<monospace><uri>www.cs.hmc.edu/dtlrnb</uri></monospace>).04/03/2018 3:50 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2537319Codon Context Optimization in Synthetic Gene Design
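The 1 - 1/e guarantee comes from the classic greedy routine for maximum coverage: repeatedly pick the set that covers the most not-yet-covered elements. A generic sketch follows; DTL-RnB applies this idea with reconciliations as the "sets" and highly supported events as the "elements":

```python
def greedy_max_coverage(sets, k):
    """Pick at most k sets greedily to maximize the number of covered
    elements; classic (1 - 1/e)-approximation for maximum coverage."""
    covered, chosen = set(), []
    for _ in range(k):
        best = max(range(len(sets)), key=lambda i: len(sets[i] - covered))
        if not sets[best] - covered:   # nothing new can be covered
            break
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered
```

Because coverage is monotone and submodular, each greedy pick is guaranteed to close at least a 1/k fraction of the remaining gap to the optimum, which compounds to the 1 - 1/e bound.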
https://www.computer.org/csdl/trans/tb/2018/02/07439780-abs.html
Advances in de novo synthesis of DNA and computational gene design methods make possible the customization of genes by direct manipulation of features such as codon bias and mRNA secondary structure. Codon context is another feature significantly affecting mRNA translational efficiency, but existing methods and tools for evaluating and designing novel optimized protein coding sequences rely on untested heuristics and do not provide quantifiable guarantees on design quality. In this study, we examine the statistical properties of codon context measures in an effort to better understand the phenomenon. We analyze the computational complexity of codon context optimization and design exact and efficient heuristic gene recoding algorithms under reasonable constraint models. We also present a web-based tool for evaluating codon context bias.04/03/2018 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2542808Algorithms for Pedigree Comparison
https://www.computer.org/csdl/trans/tb/2018/02/07447733-abs.html
Reconstruction of ancestral relationships among genera, species, and populations is a core task in evolutionary biology. At the population level, pedigrees have been commonly used. Reconstruction of pedigrees is required in practice for legal or medical reasons. Pedigrees are very important to geneticists for inferring haplotype segments, recombination, and allele sharing status, with which disease loci can be identified. Evaluating reconstruction methods requires comparing the inferred pedigrees with the known pedigrees. Moreover, comparison of pedigrees is required in studying relationships among crops such as maize, wheat, and barley. In this paper, we discuss three models for pedigree comparison: the maximum pedigree isomorphism problem, the maximum paternal-path-preserved mapping problem, and the minimum edge-cutting mapping problem. For the maximum pedigree isomorphism problem, we prove that the problem is NP-hard and give a fixed-parameter algorithm for the problem. For the maximum paternal-path-preserved mapping problem, we give a dynamic-programming algorithm to find the mapping that preserves the maximum number of paternal paths between the two input pedigrees. For the minimum edge-cutting mapping problem, we prove that the problem is NP-hard and give a fixed-parameter algorithm with running time <inline-formula> <tex-math notation="LaTeX">$O(n(1+\sqrt{2})^k)$</tex-math><alternatives> <inline-graphic xlink:href="wang-ieq1-2550434.gif"/></alternatives></inline-formula>, where <inline-formula> <tex-math notation="LaTeX">$n$</tex-math><alternatives><inline-graphic xlink:href="wang-ieq2-2550434.gif"/> </alternatives></inline-formula> is the number of vertices in the two input pedigrees and <inline-formula> <tex-math notation="LaTeX">$k$</tex-math><alternatives><inline-graphic xlink:href="wang-ieq3-2550434.gif"/> </alternatives></inline-formula> is the number of edges to be cut.
This algorithm is useful in practice when comparing two similar pedigrees.04/02/2018 2:01 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2550434Algorithmic Mapping and Characterization of the Drug-Induced Phenotypic-Response Space of Parasites Causing Schistosomiasis
https://www.computer.org/csdl/trans/tb/2018/02/07448842-abs.html
Neglected tropical diseases, especially those caused by helminths, constitute some of the most common infections of the world's poorest people. Amongst these, schistosomiasis (bilharzia or ‘snail fever’), caused by blood flukes of the genus Schistosoma, ranks second only to malaria in terms of human impact: 200 million people are infected and close to 800 million are at risk of infection. Drug screening against helminths poses unique challenges: the parasite cannot be cloned and is difficult to target using gene knockouts or RNAi. Consequently, both lead identification and validation involve phenotypic screening, where parasites are exposed to compounds whose effects are determined through the analysis of the ensuing phenotypic responses. The efficacy of leads thus identified may derive from one or more, possibly unknown, molecular mechanisms of action. The two most immediate and significant challenges that confront the state-of-the-art in this area are the development of automated and quantitative phenotypic screening techniques and the mapping and quantitative characterization of the totality of phenotypic responses of the parasite. In this paper, we investigate and propose solutions for the latter problem in terms of the following: (1) a mathematical formulation and algorithms that allow rigorous representation of the phenotypic response space of the parasite, (2) application of graph-theoretic and network analysis techniques for quantitative modeling and characterization of the phenotypic space, and (3) application of the aforementioned methodology to analyze the phenotypic space of <italic>S. mansoni</italic> – one of the etiological agents of schistosomiasis – induced by compounds that target its polo-like kinase 1 (PLK 1) gene, a recently validated drug target. In our approach, first, bio-image analysis algorithms are used to quantify the phenotypic responses to different drugs. 
Next, these responses are linearly mapped into a low-dimensional space using Principal Component Analysis (PCA). The phenotype space is then modeled using neighborhood graphs, which represent the similarity among the phenotypes. These graphs are characterized and explored using network analysis algorithms. We present a number of results related both to the nature of the phenotypic space of the <italic>S. mansoni</italic> parasite and to algorithmic issues encountered in constructing and analyzing the phenotypic-response space. In particular, the phenotype distribution of the parasite was found to have a distinct shape and topology. We have also quantitatively characterized the phenotypic space by varying critical model parameters. Finally, these maps of the phenotype space allow visualization of, and reasoning about, complex relationships between putative drugs and their system-wide effects, and can serve as a highly efficient paradigm for assimilating and unifying information from phenotypic screens during both lead identification and lead optimization.04/05/2018 2:01 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2550444Fuzzy-Rough Entropy Measure and Histogram Based Patient Selection for miRNA Ranking in Cancer
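The pipeline described above — image-derived feature vectors, a linear PCA projection, and a neighborhood graph over the projected points — can be sketched in a few lines. This is a minimal illustration of the general idea, not the authors' implementation; the feature matrix, component count, and neighborhood size below are hypothetical.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)            # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T    # low-dimensional coordinates

def knn_graph(Y, k=3):
    """Adjacency sets of a k-nearest-neighbour graph on the rows of Y."""
    d = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)        # a point is not its own neighbour
    return {i: set(np.argsort(d[i])[:k]) for i in range(len(Y))}

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))          # hypothetical: 30 phenotypes x 10 image features
Y = pca_project(X, n_components=2)     # step 1: linear map to low dimension
G = knn_graph(Y, k=3)                  # step 2: similarity (neighborhood) graph
```

The graph `G` is what network-analysis algorithms would then characterize (degree distributions, components, topology of the phenotype space).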
https://www.computer.org/csdl/trans/tb/2018/02/07727962-abs.html
MicroRNAs (miRNAs) are known to be important indicators of cancer. The presence of cancer can be detected by identifying the responsible miRNAs. A fuzzy-rough entropy measure (FREM) is developed that can rank the miRNAs and thereby identify the relevant ones. FREM is used to determine the relevance of a miRNA in terms of the separability between normal and cancer classes. While computing the FREM for a miRNA, fuzziness takes care of the overlap between normal and cancer expressions, whereas the rough lower approximation determines their class sizes. MiRNAs are sorted in order of decreasing relevance (i.e., capability of class separation), and a percentage of the top-ranked ones is selected. FREM is also used to determine the redundancy between two miRNAs, and redundant ones are removed from the selected set as necessary. A histogram-based patient selection method is also developed, which helps reduce the number of patients to be dealt with during the computation of FREM while sacrificing very little of the performance of the selected miRNAs for most of the data sets. The superiority of FREM over some existing methods is demonstrated extensively on six data sets in terms of sensitivity, specificity, and <inline-formula><tex-math notation="LaTeX">$F$</tex-math><alternatives> <inline-graphic xlink:href="pal-ieq1-2623605.gif"/></alternatives></inline-formula> score. While for these data sets the <inline-formula><tex-math notation="LaTeX">$F$</tex-math><alternatives> <inline-graphic xlink:href="pal-ieq2-2623605.gif"/></alternatives></inline-formula> score of the miRNAs selected by our method varies from 0.70 to 0.91 using SVM, the corresponding results vary from 0.37 to 0.90 for some other methods. Moreover, all the selected miRNAs are corroborated by the findings of biological investigations or pathway analysis tools. 
The source code of FREM is available at <uri>http://www.jayanta.droppages.com/FREM.html</uri>.04/03/2018 3:48 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2623605Extracting Stage-Specific and Dynamic Modules Through Analyzing Multiple Networks Associated with Cancer Progression
https://www.computer.org/csdl/trans/tb/2018/02/07737047-abs.html
Determining the dynamics of pathways associated with cancer progression is critical for understanding the etiology of diseases. Advances in biological technology have facilitated the simultaneous genomic profiling of multiple patients at different clinical stages, thus generating dynamic genomic data for cancers. Such data enable investigation of the dynamics of related pathways. However, methods for integrative analysis of dynamic genomic data are inadequate. In this study, we develop a novel nonnegative matrix factorization algorithm for dynamic modules (<italic>NMF-DM</italic>), which simultaneously analyzes multiple networks for the identification of stage-specific and dynamic modules. NMF-DM applies a temporal smoothness framework that balances the networks at the current stage and the previous stage. Experimental results indicate that the NMF-DM algorithm is more accurate than state-of-the-art methods on artificial dynamic networks. In breast cancer networks, NMF-DM reveals dynamic modules that are important for cancer stage transitions. Furthermore, the stage-specific and dynamic modules have distinct topological and biochemical properties. Finally, we demonstrate that the stage-specific modules significantly improve the accuracy of cancer stage prediction. The proposed algorithm provides an effective way to explore time-dependent cancer genomic data.04/03/2018 3:49 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2625791Enumerating Substituted Benzene Isomers of Tree-Like Chemical Graphs
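The temporal-smoothness idea — fitting the current stage's network while keeping the factors close to those of the previous stage — can be illustrated with a toy symmetric factorization. This is a hedged sketch of the general principle only, not the NMF-DM algorithm itself; the warm start, penalty weight, and projected-gradient updates are our own simplifications.

```python
import numpy as np

def smooth_symnmf(A, H_prev, lam=0.5, lr=1e-4, iters=3000):
    """Minimize ||A - H H^T||_F^2 + lam * ||H - H_prev||_F^2 over H >= 0
    by projected gradient descent, warm-started from the previous stage's
    factor H_prev. The lam-term is the temporal-smoothness penalty that
    balances the current-stage fit against the previous stage."""
    H = H_prev.astype(float).copy()           # warm start at the previous stage
    for _ in range(iters):
        R = H @ H.T - A                       # residual of the current fit
        grad = 4 * R @ H + 2 * lam * (H - H_prev)
        H = np.maximum(H - lr * grad, 0.0)    # project onto the nonnegative orthant
    return H
```

Larger `lam` yields modules that change little between stages; `lam = 0` recovers an ordinary per-stage factorization.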
https://www.computer.org/csdl/trans/tb/2018/02/07744506-abs.html
Enumeration of chemical structures is useful for drug design, which is one of the main targets of computational biology and bioinformatics. A chemical graph <inline-formula><tex-math notation="LaTeX">$G$</tex-math><alternatives> <inline-graphic xlink:href="akutsu-ieq1-2628888.gif"/></alternatives></inline-formula> with no other cycles than benzene rings is called <italic>tree-like</italic>, and becomes a tree <inline-formula><tex-math notation="LaTeX">$T$ </tex-math><alternatives><inline-graphic xlink:href="akutsu-ieq2-2628888.gif"/></alternatives></inline-formula> possibly with multiple edges if we contract each benzene ring into a single virtual atom of valence 6. All tree-like chemical graphs with a given tree representation <inline-formula><tex-math notation="LaTeX">$T$</tex-math> <alternatives><inline-graphic xlink:href="akutsu-ieq3-2628888.gif"/></alternatives></inline-formula> are called the <italic>substituted benzene isomers</italic> of <inline-formula><tex-math notation="LaTeX">$T$</tex-math><alternatives> <inline-graphic xlink:href="akutsu-ieq4-2628888.gif"/></alternatives></inline-formula>. When we replace each virtual atom in <inline-formula><tex-math notation="LaTeX">$T$</tex-math><alternatives> <inline-graphic xlink:href="akutsu-ieq5-2628888.gif"/></alternatives></inline-formula> with a benzene ring to obtain a substituted benzene isomer, distinct isomers of <inline-formula><tex-math notation="LaTeX">$T$</tex-math><alternatives> <inline-graphic xlink:href="akutsu-ieq6-2628888.gif"/></alternatives></inline-formula> are caused by the difference in arrangements of atom groups around a benzene ring. In this paper, we propose an efficient algorithm that enumerates all substituted benzene isomers of a given tree representation <inline-formula><tex-math notation="LaTeX">$T$ </tex-math><alternatives><inline-graphic xlink:href="akutsu-ieq7-2628888.gif"/></alternatives></inline-formula>. 
Our algorithm first counts the number <inline-formula><tex-math notation="LaTeX">$f$</tex-math><alternatives> <inline-graphic xlink:href="akutsu-ieq8-2628888.gif"/></alternatives></inline-formula> of all the isomers of the tree representation by a dynamic programming method. To enumerate all the isomers, for each <inline-formula> <tex-math notation="LaTeX">$k=1,2,\ldots, f$</tex-math><alternatives> <inline-graphic xlink:href="akutsu-ieq9-2628888.gif"/></alternatives></inline-formula>, our algorithm then generates the <inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="akutsu-ieq10-2628888.gif"/></alternatives></inline-formula>th isomer by backtracking the counting phase of the dynamic programming. We also implemented our algorithm for computational experiments.04/03/2018 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2628888A Review on Methods for Detecting SNP Interactions in High-Dimensional Genomic Data
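The count-then-generate pattern used here — a dynamic program that counts the objects, then backtracking through the same table to emit the k-th one — is a standard "unranking" technique. The sketch below applies it to a deliberately simple family (binary strings with no two adjacent 1s) rather than to benzene isomers; it is meant only to make the pattern concrete.

```python
def count_no_adjacent_ones(n):
    """c0[i] / c1[i]: number of valid length-i suffixes when the previously
    placed character was '0' (or nothing) / was '1'."""
    c0, c1 = [1] * (n + 1), [1] * (n + 1)
    for i in range(1, n + 1):
        c0[i] = c0[i - 1] + c1[i - 1]  # next char may be '0' or '1'
        c1[i] = c0[i - 1]              # next char is forced to be '0'
    return c0, c1

def kth_string(n, k):
    """Return the k-th (1-based, lexicographic) binary string of length n
    with no two adjacent 1s, by backtracking through the counting DP."""
    c0, _ = count_no_adjacent_ones(n)
    out = []
    for i in range(n, 0, -1):
        zeros = c0[i - 1]          # completions if we place '0' at this position
        if k <= zeros:
            out.append('0')
        else:                      # skip past the '0'-block; place '1'
            k -= zeros
            out.append('1')
            # after a '1', the invariant k <= c1[i-1] = c0[i-2] guarantees
            # the next choice falls in the '0' branch, so '11' never occurs
    return ''.join(out)
```

The total count is `c0[n]`, and iterating `k = 1, ..., c0[n]` enumerates every member exactly once — the same scheme the abstract describes for isomers.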
https://www.computer.org/csdl/trans/tb/2018/02/07765022-abs.html
In this era of genome-wide association studies (GWAS), the quest to understand the genetic architecture of complex diseases is intensifying more than ever before. The development of high-throughput genotyping and next-generation sequencing technologies enables genetic epidemiological analysis of large-scale data. These advances have led to the identification of a number of single nucleotide polymorphisms (SNPs) responsible for disease susceptibility. The interactions between SNPs associated with complex diseases are increasingly being explored in the current literature. These interaction studies are mathematically challenging and computationally complex, and the challenges have been addressed by a number of data mining and machine learning approaches. This paper reviews the current methods, and the related software packages, for detecting SNP interactions that contribute to diseases. The issues that need to be considered when developing these models are addressed in this review. The paper also reviews the achievements in data simulation for evaluating the performance of these models. Further, it discusses the future of SNP interaction analysis.04/03/2018 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2635125Classification of Alzheimer's Disease Using Whole Brain Hierarchical Network
https://www.computer.org/csdl/trans/tb/2018/02/07765123-abs.html
Regions of interest (ROIs) based classification has been widely investigated for the analysis of brain magnetic resonance imaging (MRI) images to assist the diagnosis of Alzheimer's disease (AD), including its early warning and developing stages, e.g., mild cognitive impairment (MCI), which comprises MCI converted to AD (MCIc) and MCI not converted to AD (MCInc). Since an ROI representation of brain structures is obtained either by pre-definition or by adaptive parcellation, the corresponding ROIs in different brains can be measured. However, due to noise and the small sample size of MRI images, representations generated from single or multiple ROIs may not be sufficient to reveal the underlying anatomical differences between the groups of disease-affected patients and healthy controls (HC). In this paper, we employ a whole brain hierarchical network (WBHN) to represent each subject. The whole brain of each subject is divided into 90, 54, 14, and 1 regions based on the Automated Anatomical Labeling (AAL) atlas. The connectivity between each pair of regions is computed in terms of Pearson's correlation coefficient and used as a classification feature. Then, to reduce the dimensionality of the features, we select the features with the highest <inline-formula> <tex-math notation="LaTeX">$F-$</tex-math><alternatives><inline-graphic xlink:href="wang-ieq1-2635144.gif"/> </alternatives></inline-formula> scores. Finally, we use the multiple kernel boosting (MKBoost) algorithm to perform the classification. Our proposed method is evaluated on MRI images of 710 subjects (200 AD, 120 MCIc, 160 MCInc, and 230 HC) from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. 
The experimental results show that our proposed method achieves an accuracy of 94.65 percent and an area under the receiver operating characteristic (ROC) curve (AUC) of 0.954 for AD/HC classification, an accuracy of 89.63 percent and an AUC of 0.907 for AD/MCI classification, an accuracy of 85.79 percent and an AUC of 0.826 for MCI/HC classification, and an accuracy of 72.08 percent and an AUC of 0.716 for MCIc/MCInc classification, respectively. Our results demonstrate that our proposed method is efficient and promising for clinical applications for the diagnosis of AD via MRI images.04/03/2018 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2635144Application of Fractal Theory on Motifs Counting in Biological Networks
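The feature pipeline described above — Pearson correlations between region pairs used as features, then ranking by per-feature F-scores — can be sketched directly. This is a hypothetical illustration on synthetic data, not the WBHN/MKBoost pipeline; the Fisher-score formula below is one common form of the F-score.

```python
import numpy as np

def connectivity_features(signals):
    """Upper-triangular Pearson correlations between regional signals.
    signals: (n_regions, n_timepoints) array for one subject."""
    C = np.corrcoef(signals)                  # rows are treated as variables
    iu = np.triu_indices_from(C, k=1)
    return C[iu]                              # flatten the connectivity matrix

def fisher_scores(X_pos, X_neg, eps=1e-12):
    """Per-feature Fisher score: between-class separation over
    within-class scatter (one common form of the F-score)."""
    m_p, m_n = X_pos.mean(0), X_neg.mean(0)
    v_p, v_n = X_pos.var(0), X_neg.var(0)
    return (m_p - m_n) ** 2 / (v_p + v_n + eps)

def top_k_features(X_pos, X_neg, k):
    """Indices of the k highest-scoring connectivity features."""
    return np.argsort(fisher_scores(X_pos, X_neg))[::-1][:k]
```

The retained feature columns would then be fed to the classifier; a 90-region parcellation yields 90·89/2 = 4005 such correlation features per subject, which is why the ranking step matters.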
https://www.computer.org/csdl/trans/tb/2018/02/07775030-abs.html
Motifs in complex biological, technological, social, and other networks are connected patterns that occur at significantly higher frequencies than in comparable random networks. Finding motifs helps scientists learn more about a network's structure and function, and this goal cannot be achieved without efficient algorithms. Existing methods for counting network motifs are extremely costly in CPU time and memory consumption and, in addition, are impractical for larger motifs. In this paper, a new algorithm called FraMo is presented, based on ‘fractal theory’. The method consists of three phases: first, a complex network is converted to a multifractal network; then, distribution parameters for the multifractal network are estimated using maximum likelihood estimation; finally, the approximate number of network motifs is calculated. Experimental results on several benchmark data sets show that our algorithm can efficiently approximate the number of motifs of any size in undirected networks and that it performs favorably compared with similar existing algorithms in terms of CPU time and memory usage.04/03/2018 3:49 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2636215Integrating Multiple Data Sources for Combinatorial Marker Discovery: A Study in Tumorigenesis
https://www.computer.org/csdl/trans/tb/2018/02/07775073-abs.html
Identification of combinatorial markers from multiple data sources is a challenging task in bioinformatics. Here, we propose a novel computational framework for identifying significant combinatorial markers (<inline-formula> <tex-math notation="LaTeX">$SCM$</tex-math><alternatives><inline-graphic xlink:href="bandyopadhyay-ieq1-2636207.gif"/> </alternatives></inline-formula>s) using both gene expression and methylation data. The gene expression and methylation data are integrated into a single continuous data set as well as a (post-discretized) Boolean data set based on their intrinsic (i.e., inverse) relationship. A novel combined score of methylation and expression data (viz., <inline-formula><tex-math notation="LaTeX">$CoMEx$</tex-math><alternatives> <inline-graphic xlink:href="bandyopadhyay-ieq2-2636207.gif"/></alternatives></inline-formula>) is introduced, which is computed on the integrated continuous data to identify an initial non-redundant set of genes. Thereafter, (maximal) frequent closed homogeneous genesets are identified using a well-known biclustering algorithm applied to the integrated Boolean data of the determined non-redundant set of genes. A novel sample-based weighted support (<inline-formula><tex-math notation="LaTeX">$WS$</tex-math><alternatives> <inline-graphic xlink:href="bandyopadhyay-ieq3-2636207.gif"/></alternatives></inline-formula>) is then proposed, which is calculated on the integrated Boolean data of the determined non-redundant set of genes in order to identify the non-redundant significant genesets. The top few resulting genesets are identified as potential <inline-formula><tex-math notation="LaTeX">$SCM$</tex-math><alternatives> <inline-graphic xlink:href="bandyopadhyay-ieq4-2636207.gif"/></alternatives></inline-formula>s. Since our proposed method generates fewer significant non-redundant genesets than other popular methods, it is also much faster. 
Application of the proposed technique to expression and methylation data for uterine tumor and prostate carcinoma produces sets of significant combinations of markers. We expect that such combinations of markers will produce fewer false positives than individual markers.04/02/2018 2:01 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2636207A New Efficient Algorithm for the Frequent Gene Team Problem
https://www.computer.org/csdl/trans/tb/2018/02/07778237-abs.html
The focus of this paper is the frequent gene team problem. Given a quorum parameter μ and a set of <italic>m</italic> genomes, the problem is to find gene teams that occur in at least μ of the given genomes. Previous solutions are efficient only when μ is small. In this paper, a new algorithm is presented that, unlike previous solutions, does not rely on examining every combination of μ genomes; its time complexity is independent of μ. Under some realistic assumptions, the practical running time is estimated to be <inline-formula> <tex-math notation="LaTeX">$O(m^{2}n^{2}\; {\mathrm{lg}}\;n)$</tex-math><alternatives> <inline-graphic xlink:href="wang-ieq1-2637346.gif"/></alternatives></inline-formula>, where <italic>n</italic> is the maximum length of the input genomes. Experiments showed that the presented algorithm is extremely efficient: for any μ, it takes less than 1 second to process 100 bacterial genomes and only 10 minutes to process 2,000 genomes. The presented algorithm can be used as an effective tool for large-scale genome analyses.04/03/2018 3:50 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2637346Optimal Block-Based Trimming for Next Generation Sequencing
https://www.computer.org/csdl/trans/tb/2018/02/07906525-abs.html
Read trimming is a fundamental first step of the analysis of next generation sequencing (NGS) data. Traditionally, it is performed heuristically, and algorithmic work in this area has been neglected. Here, we address this topic and formulate three optimization problems for block-based trimming (truncating the same low-quality positions at both ends for all reads and removing low-quality truncated reads). We find that all problems are NP-hard. Hence, we investigate the approximability of the problems. Two of them are NP-hard to approximate. However, the non-random distribution of quality scores in NGS data sets makes it tempting to speculate that quality constraints for read positions are typically satisfied by fulfilling quality constraints for reads. Thus, we propose three relaxed problems and develop efficient polynomial-time algorithms for them including heuristic speed-up techniques and parallelizations. We apply these <italic>optimized block trimming</italic> algorithms to 12 data sets from three species, four sequencers, and read lengths ranging from 36 to 101 bp and find that (i) the omitted constraints are indeed almost always satisfied, (ii) the optimized read trimming algorithms typically yield a higher number of untrimmed bases than traditional heuristics, and (iii) these results can be generalized to alternative objective functions beyond counting the number of untrimmed bases.04/03/2018 3:48 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2696525
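To make the block-trimming objective concrete, here is a brute-force reference sketch: it tries every uniform (left, right) truncation, drops truncated reads whose mean quality still falls below a threshold, and keeps the cut maximizing the total number of retained bases. The mean-quality constraint is a simplification of our own choosing, and this cubic-time toy is only meant to state the objective — the paper's algorithms are polynomial-time and far more efficient.

```python
def best_block_trim(quals, min_mean_q=20):
    """Brute-force block trimming.
    quals: list of equal-length per-base quality score lists, one per read.
    Returns (retained bases, (left, right)) for the best uniform cut, where
    positions [left, right) are kept for every read and a truncated read is
    discarded if its mean quality is still below min_mean_q."""
    n = len(quals[0])
    best = (0, (0, n))
    for left in range(n + 1):
        for right in range(left, n + 1):
            width = right - left
            kept = 0
            for q in quals:
                window = q[left:right]
                # a read survives only if its truncated mean quality passes
                if width and sum(window) / width >= min_mean_q:
                    kept += width
            if kept > best[0]:
                best = (kept, (left, right))
    return best
```

For example, with one uniformly good read, one read with weak ends, and one uniformly bad read, the optimum keeps the full window and sacrifices only the bad read.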