IEEE/ACM Transactions on Computational Biology and Bioinformatics
https://www.computer.org/csdl/trans/tb/index.html
The IEEE/ACM Transactions on Computational Biology and Bioinformatics is a quarterly that publishes archival research results related to the algorithmic, mathematical, statistical, and computational methods that are central in bioinformatics and computational biology; the development and testing of effective computer programs in bioinformatics; the development and optimization of biological databases; and important biological results that are obtained from the use of these methods, programs, and databases.
List of 100 recently published journal articles.
https://www.computer.org/csdl
Clustering-Based Compression for Population DNA Sequences
https://www.computer.org/csdl/trans/tb/2019/01/08066379-abs.html
Due to advances in DNA sequencing techniques, the number of sequenced individual genomes has grown exponentially, so effective compression of such sequences is highly desirable. In this work, we present a novel compression algorithm called the Reference-based Compression algorithm using the concept of Clustering (RCC). RCC is motivated by the observation that substructures exist within population sequences. To exploit these substructures, <inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives><inline-graphic xlink:href="cheng-ieq1-2762302.gif"/></alternatives></inline-formula>-means clustering is employed to partition the sequences into clusters for better compression. A reference sequence is then constructed for each cluster so that the sequences in that cluster can be compressed by referring to it. The reference sequence of each cluster is in turn compressed with reference to a sequence derived from all the reference sequences. Experiments show that RCC can reduce the compressed size by up to 91.0 percent compared with state-of-the-art compression approaches. There is a trade-off between compressed size and processing time: the current Matlab implementation is thousands of times slower than existing algorithms implemented in C/C++, and further work is required to improve the processing time.
02/05/2019 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2762302
Classification of Single-Cell Gene Expression Trajectories from Incomplete and Noisy Data
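A minimal sketch of the cluster-then-reference idea in the RCC abstract above (not the authors' Matlab implementation; the base-count encoding, majority-vote reference construction, and (position, base) delta format are illustrative assumptions):

```python
from collections import Counter

def kmer_vector(seq):
    """Encode a sequence as base counts -- a simple numeric feature
    for clustering (an assumed encoding, not the paper's)."""
    counts = Counter(seq)
    return [counts[b] for b in "ACGT"]

def kmeans(vectors, k, iters=20):
    """Plain Lloyd's k-means on small count vectors."""
    centers = [list(v) for v in vectors[:k]]
    assign = [0] * len(vectors)
    for _ in range(iters):
        assign = [min(range(k),
                      key=lambda c: sum((x - y) ** 2
                                        for x, y in zip(vec, centers[c])))
                  for vec in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def consensus(seqs):
    """Per-position majority vote: the cluster's reference sequence."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

def delta_encode(seq, ref):
    """Store a sequence as its (position, base) mismatches against the reference."""
    return [(i, b) for i, (b, r) in enumerate(zip(seq, ref)) if b != r]

seqs = ["ACGTACGT", "ACGTACGA", "TTTTACGT", "TTTTACGA"]
assign = kmeans([kmer_vector(s) for s in seqs], k=2)
clusters = {}
for s, c in zip(seqs, assign):
    clusters.setdefault(c, []).append(s)
refs = {c: consensus(members) for c, members in clusters.items()}
encoded = [(c, delta_encode(s, refs[c])) for s, c in zip(seqs, assign)]
```

Each sequence is then stored as a cluster id plus its mismatches against the cluster reference; RCC additionally compresses the references themselves against a derived sequence, which is omitted here.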
https://www.computer.org/csdl/trans/tb/2019/01/08070339-abs.html
This paper studies the classification of gene-expression trajectories from two classes, healthy and mutated (cancerous), using Boolean networks with perturbation (BNps) to model the dynamics of each class at the state level. Each class has its own BNp, which is partially known based on gene pathways. We employ a Gaussian model at the observation level to describe the expression values of the genes given the hidden binary states at each time point. We use expectation maximization (EM) to learn the BNps and the unknown model parameters, derive closed-form updates for the parameters, and propose a learning algorithm. After learning, a plug-in Bayes classifier is used to classify unlabeled trajectories, which can have missing data. Measuring gene expressions at different times yields trajectories only when the measurements come from a single cell. In multiple-cell scenarios, the expression values are averages over many cells with possibly different states. Via the central-limit theorem, we propose another model for expression data in multiple-cell scenarios. Simulations demonstrate that single-cell trajectory data can outperform multiple-cell average expression data with respect to classification error, especially in high-noise situations. We also consider data generated via a mammalian cell-cycle network, both the wild type and with a common mutation affecting p27.
02/05/2019 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2763946
Protein Remote Homology Detection and Fold Recognition Based on Sequence-Order Frequency Matrix
https://www.computer.org/csdl/trans/tb/2019/01/08078207-abs.html
Protein remote homology detection and fold recognition are two critical tasks in the study of protein structures and functions. Currently, profile-based methods achieve state-of-the-art performance in these fields. However, the widely used sequence profiles, such as the position-specific frequency matrix (PSFM) and the position-specific scoring matrix (PSSM), ignore sequence-order effects along the protein sequence. In this study, we propose a novel profile, called the sequence-order frequency matrix (SOFM), to extract the sequence-order information of neighboring residues from multiple sequence alignments (MSAs). Combined with two profile feature extraction approaches, top-n-grams and the Smith–Waterman algorithm, SOFMs are applied to protein remote homology detection and fold recognition, and two predictors called SOFM-Top and SOFM-SW are proposed. Experimental results show that the SOFM contains more information than other profiles, and that these two predictors outperform other state-of-the-art methods. It is anticipated that the SOFM will become a very useful profile in the study of protein structures and functions.
02/05/2019 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2765331
An Integrated Approach for Identification of Functionally Similar MicroRNAs in Colorectal Cancer
https://www.computer.org/csdl/trans/tb/2019/01/08078229-abs.html
Colorectal cancer (CRC) is one of the most prevalent cancers around the globe. However, the molecular causes of CRC pathogenesis are still poorly understood. Recently, the role of microRNAs, or miRNAs, in the initiation and progression of CRC has been studied. MicroRNAs are small, endogenous noncoding RNAs found in plants, animals, and some viruses that function in RNA silencing and posttranscriptional regulation of gene expression. Their role in CRC development has been studied, and they have been found to be potential biomarkers for the diagnosis and treatment of CRC. Therefore, identifying functionally similar CRC-related miRNAs may help in the development of a prognostic tool. In this regard, this paper presents a new algorithm, called <inline-formula><tex-math notation="LaTeX">$\mu$</tex-math><alternatives><inline-graphic xlink:href="paul-ieq1-2765332.gif"/></alternatives></inline-formula>Sim. It is an integrative approach for the identification of functionally similar miRNAs associated with CRC that judiciously integrates miRNA expression data with miRNA-miRNA functionally synergistic network data; the functional similarity is calculated from both data sources. The effectiveness of the proposed method in comparison to related methods is shown on four CRC miRNA data sets, where it selects more significant CRC-related miRNAs than the competing methods.
02/05/2019 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2765332
Switched Latent Force Models for Reverse-Engineering Transcriptional Regulation in Gene Expression Data
https://www.computer.org/csdl/trans/tb/2019/01/08078278-abs.html
To survive environmental conditions, cells transcribe their response activities into encoded mRNA sequences in order to produce certain amounts of protein. External conditions are mapped into the cell through the activation of special proteins called transcription factors (TFs). Because TF behavior is difficult to measure experimentally and its fast dynamics are hard to capture, different types of models based on differential equations have been proposed. However, those approaches usually involve costly procedures, and they have difficulty describing sudden changes in TF regulators. In this paper, we present a switched dynamical latent force model for reverse-engineering transcriptional regulation in gene expression data, which allows exact inference over the latent TF activities driving observed gene expressions through a linear differential equation. To deal with discontinuities in the dynamics, we introduce an approach that switches between different TF activities and different dynamical systems. This creates a versatile representation of transcription networks that can capture discrete changes and non-linearities. We evaluate our model on both simulated data and real data (e.g., the microaerobic shift in <italic>E. coli</italic> and yeast respiration), concluding that our framework fits the expression data while being able to infer continuous-time TF profiles.
02/05/2019 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2764908
Structured Penalized Logistic Regression for Gene Selection in Gene Expression Data Analysis
https://www.computer.org/csdl/trans/tb/2019/01/08089386-abs.html
In gene expression data analysis, the problems of cancer classification and gene selection are closely related: successfully selecting informative genes significantly improves classification performance. Various methods have been proposed to identify informative genes from a large number of candidates. However, gene expression data may include important correlation structures, and some of the genes can be divided into different groups based on their biological pathways. Many existing methods do not take this correlation structure within the data into consideration. Therefore, from both the knowledge-discovery and biological perspectives, an ideal gene selection method should take this structural information into account; moreover, better generalization performance can be obtained by discovering the correlation structure within the data. To discover structural information in the data and improve learning performance, we propose a structured penalized logistic regression model that simultaneously performs feature selection and model learning for gene expression data analysis. An efficient coordinate descent algorithm has been developed to optimize the model. Numerical simulation studies demonstrate that our method is able to select highly correlated features. In addition, results on real gene expression datasets show that the proposed method performs competitively with respect to previous approaches.
02/05/2019 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2767589
A Mixed-Norm Laplacian Regularized Low-Rank Representation Method for Tumor Samples Clustering
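The coordinate-descent optimization mentioned in the structured penalized logistic regression abstract above can be illustrated with a simplified stand-in: plain L1-penalized logistic regression fitted by coordinate-wise proximal updates. The paper's structured penalty is replaced by a plain L1 penalty here, and the synthetic data are invented for illustration:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_threshold(x, t):
    return math.copysign(max(abs(x) - t, 0.0), x)

def cd_logistic_lasso(X, y, lam=0.05, iters=200, step=0.5):
    """Coordinate-wise proximal descent for L1-penalized logistic regression.
    Each sweep updates one coefficient at a time: a gradient step on the
    logistic loss followed by soft-thresholding for the L1 penalty."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(iters):
        for j in range(p):
            # gradient of the average logistic loss w.r.t. w_j
            g = sum((sigmoid(sum(wk * xk for wk, xk in zip(w, X[i]))) - y[i]) * X[i][j]
                    for i in range(n)) / n
            w[j] = soft_threshold(w[j] - step * g, step * lam)
    return w

# Synthetic data: only feature 0 carries the class signal.
random.seed(0)
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
y = [1 if x0 > 0 else 0 for x0, _ in X]
w = cd_logistic_lasso(X, y)
```

The informative coefficient `w[0]` should grow large, while the noise coefficient `w[1]` is shrunk toward zero by the penalty.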
https://www.computer.org/csdl/trans/tb/2019/01/08094872-abs.html
Clustering tumor samples based on biomolecular data is a central problem in the discovery of cancer classifications, and extracting valuable information from high-dimensional genomic data has become an urgent issue in tumor sample clustering. In this paper, we introduce manifold regularization into the low-rank representation model and present a novel method named Mixed-norm Laplacian regularized Low-Rank Representation (MLLRR) to identify the differentially expressed genes for tumor clustering based on gene expression data. Then, to improve the accuracy and stability of tumor clustering, we establish a clustering model based on Penalized Matrix Decomposition (PMD) and propose a novel clustering method named MLLRR-PMD. In this method, the cancer clustering analysis consists of three steps. First, the matrix of gene expression data is decomposed by MLLRR into a low-rank representation matrix and a sparse matrix. Second, the differentially expressed genes are identified based on the sparse matrix. Finally, PMD is applied to cluster the samples based on the differentially expressed genes. Experimental results on simulated data and real genomic data illustrate that the MLLRR method enhances robustness to outliers and achieves remarkable performance in the extraction of differentially expressed genes.
02/05/2019 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2769647
ToBio: Global Pathway Similarity Search Based on Topological and Biological Features
https://www.computer.org/csdl/trans/tb/2019/01/08094921-abs.html
Pathway similarity search plays a vital role in the post-genomics era. Unfortunately, it involves the graph isomorphism problem, which is NP-complete, so efficient search algorithms are desirable. In this work, we propose a novel global pathway similarity search approach named ToBio, which considers both topological and biological features. Specifically, various topological and biological features, including subgraph signature similarities, sequence similarities, and gene ontology similarities, are considered in ToBio. Since different features carry different functional importance and dependencies, we report three schemes of ToBio using different sets of features. In addition, to strengthen the existing search algorithms for rigorous comparisons, post-processing pipelines are also proposed to investigate how different features contribute to search performance. ToBio and other state-of-the-art methods are benchmarked on gold-standard pathway datasets from three species. The results demonstrate the competitive edge of ToBio over the state-of-the-art methods, from the topological aspects to the biological aspects. Case studies have been conducted to reveal mechanistic insights into the unique search performance of ToBio.
02/05/2019 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2769642
Disease Gene Prediction by Integrating PPI Networks, Clinical RNA-Seq Data and OMIM Data
https://www.computer.org/csdl/trans/tb/2019/01/08097006-abs.html
Disease gene prediction is a challenging task with a variety of applications, such as early diagnosis and drug development. Existing machine learning methods suffer from the imbalanced-sample issue, because the number of known disease genes (positive samples) is much smaller than the number of unknown genes, which are typically treated as negative samples. In addition, most methods have not utilized clinical data from patients with a specific disease to predict disease genes. In this study, we propose a disease gene prediction algorithm (called dgSeq) that combines protein-protein interaction (PPI) networks, clinical RNA-Seq data, and Online Mendelian Inheritance in Man (OMIM) data. Our dgSeq constructs differential networks based on rewiring information calculated from clinical RNA-Seq data. To select balanced sets of non-disease genes (negative samples), a disease-gene network is also constructed from the OMIM data. After features are extracted from the PPI networks and differential networks, logistic regression classifiers are trained. Our dgSeq obtains AUC values of 0.88, 0.83, and 0.80 for identifying breast cancer genes, thyroid cancer genes, and Alzheimer's disease genes, respectively, which indicates its superiority to three other competing methods. Both gene set enrichment analysis and the predicted results demonstrate that dgSeq can effectively predict new disease genes.
02/05/2019 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2770120
Quasi-Newton Stochastic Optimization Algorithm for Parameter Estimation of a Stochastic Model of the Budding Yeast Cell Cycle
https://www.computer.org/csdl/trans/tb/2019/01/08107569-abs.html
Parameter estimation in discrete or continuous deterministic cell cycle models is challenging for several reasons, including the nature of what can be observed and the accuracy and quantity of those observations. The challenge is even greater for stochastic models, where the number of simulations and the amount of empirical data must be even larger to obtain statistically valid parameter estimates. The two main contributions of this work are (1) stochastic model parameter estimation based on directly matching multivariate probability distributions, and (2) a new quasi-Newton algorithm class, QNSTOP, for stochastic optimization problems. QNSTOP directly uses the random objective function value samples rather than creating ensemble statistics, and is used here to directly match empirical and simulated joint probability distributions rather than summary statistics. Results are given for a current state-of-the-art stochastic cell cycle model of budding yeast, whose predictions match some summary statistics and one-dimensional distributions from empirical data well, but do not match the empirical joint distributions well. The nature of the mismatch provides insight into the weakness of the stochastic model.
02/05/2019 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2773083
DNRLMF-MDA: Predicting microRNA-Disease Associations Based on Similarities of microRNAs and Diseases
https://www.computer.org/csdl/trans/tb/2019/01/08118134-abs.html
MicroRNAs (miRNAs) are a class of non-coding RNAs of <inline-formula><tex-math notation="LaTeX">$\sim$</tex-math><alternatives><inline-graphic xlink:href="wang-ieq1-2776101.gif"/></alternatives></inline-formula>22 nucleotides. Studies have shown that miRNAs play key roles in many complex human diseases, so discovering miRNA-disease associations is beneficial for understanding disease mechanisms, developing drugs, and treating complex diseases. Discovering miRNA-disease associations via biological experiments is a time-consuming and expensive process; computational models can instead provide a low-cost and high-efficiency way of predicting them. In this study, we propose a method (called DNRLMF-MDA) to predict miRNA-disease associations based on dynamic neighborhood regularized logistic matrix factorization. DNRLMF-MDA integrates known miRNA-disease associations, the functional similarity and Gaussian Interaction Profile (GIP) kernel similarity of miRNAs, and the functional similarity and GIP kernel similarity of diseases. In particular, positive observations (known miRNA-disease associations) are assigned higher importance levels than negative observations (unknown miRNA-disease associations). DNRLMF-MDA computes the probability that a miRNA interacts with a disease by a logistic matrix factorization method, where latent vectors of miRNAs and diseases represent their properties, and further improves prediction performance via dynamic neighborhood regularization. Five-fold cross validation is adopted to assess the performance of DNRLMF-MDA, as well as other competing methods for comparison. The computational experiments show that DNRLMF-MDA outperforms the state-of-the-art method PBMDA: the AUC values of DNRLMF-MDA on three datasets are 0.9357, 0.9411, and 0.9416, respectively, which are superior to PBMDA's results of 0.9218, 0.9187, and 0.9262.
The average computation times per 5-fold cross validation of DNRLMF-MDA on the three datasets are 38, 46, and 50 seconds, much shorter than PBMDA's average computation times of 10869, 916, and 8448 seconds, respectively. DNRLMF-MDA can also predict potential diseases for new miRNAs. Furthermore, case studies illustrate that DNRLMF-MDA is an effective method for predicting miRNA-disease associations.
02/05/2019 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2776101
Efficient Detection of Communities in Biological Bipartite Networks
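The core of the logistic matrix factorization step in the DNRLMF-MDA abstract above can be sketched as follows. The dynamic neighborhood regularization and similarity kernels are omitted, and the toy association matrix, weights, and hyperparameters are invented for illustration:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_lmf(A, rank=2, c=5.0, lam=0.1, lr=0.05, epochs=300, seed=0):
    """Logistic matrix factorization: P(association i,j) = sigmoid(u_i . v_j),
    with known associations (A[i][j] == 1) up-weighted by importance c,
    mirroring the higher importance given to positive observations."""
    rng = random.Random(seed)
    n, m = len(A), len(A[0])
    U = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n)]
    V = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(m)]
    for _ in range(epochs):
        for i in range(n):
            for j in range(m):
                w = c if A[i][j] == 1 else 1.0
                p = sigmoid(sum(a * b for a, b in zip(U[i], V[j])))
                err = w * (A[i][j] - p)   # weighted log-likelihood gradient
                for k in range(rank):
                    ui, vj = U[i][k], V[j][k]
                    U[i][k] += lr * (err * vj - lam * ui)
                    V[j][k] += lr * (err * ui - lam * vj)
    return U, V

# Rows: miRNAs, columns: diseases (toy data).
A = [[1, 1, 0],
     [1, 1, 0],
     [0, 0, 1]]
U, V = train_lmf(A)

def score(i, j):
    """Predicted probability that miRNA i interacts with disease j."""
    return sigmoid(sum(a * b for a, b in zip(U[i], V[j])))
```

After training, known associations should score higher than unknown ones for each miRNA.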
https://www.computer.org/csdl/trans/tb/2019/01/08118141-abs.html
Methods to efficiently uncover and extract community structures are required in a number of biological applications where networked data and their interactions can be modeled as graphs, and observing tightly-knit groups of vertices (“communities”) can offer insights into the structural and functional building blocks of the underlying network. Classical applications of community detection have largely focused on unipartite networks — i.e., graphs built out of a single type of object. However, due to the increased availability of biological data from various sources, there is now an increasing need to handle heterogeneous networks, which are built out of multiple types of objects. In this paper, we address the problem of identifying communities from biological <italic>bipartite networks</italic> — i.e., networks where interactions are observed between <italic>two different types</italic> of objects (e.g., genes and diseases, drugs and protein complexes, plants and pollinators, and hosts and pathogens). Toward detecting communities in such bipartite networks, we make the following contributions: i) (<italic>metric</italic>) we propose a variant of bipartite modularity; ii) (<italic>algorithms</italic>) we present an efficient algorithm called <italic>biLouvain</italic> that implements a set of heuristics toward fast and precise community detection in bipartite networks (<uri>https://github.com/paolapesantez/biLouvain</uri>); and iii) (<italic>experiments</italic>) we present a thorough experimental evaluation of our algorithm, including comparisons to other state-of-the-art methods for identifying communities in bipartite networks.
Experimental results show that our <italic>biLouvain</italic> algorithm identifies communities of comparable or better quality (as measured by bipartite modularity) than existing methods, while reducing the time-to-solution by one to four orders of magnitude.
02/05/2019 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2765319
Elucidating Genome-Wide Protein-RNA Interactions Using Differential Evolution
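The quality measure used in the biLouvain abstract above can be computed directly; this is the classical Barber bipartite modularity (the paper proposes its own variant), evaluated on an invented toy gene–disease graph:

```python
def barber_modularity(edges, comm_left, comm_right):
    """Barber's bipartite modularity for a fixed assignment of left and
    right vertices to communities; edges is a list of (left, right) pairs."""
    m = len(edges)
    deg_l, deg_r = {}, {}
    for u, v in edges:
        deg_l[u] = deg_l.get(u, 0) + 1
        deg_r[v] = deg_r.get(v, 0) + 1
    present = set(edges)
    q = 0.0
    for u in deg_l:
        for v in deg_r:
            if comm_left[u] == comm_right[v]:
                # actual edge minus the expected edge weight under the null model
                actual = 1.0 if (u, v) in present else 0.0
                q += actual - deg_l[u] * deg_r[v] / m
    return q / m

edges = [("g1", "d1"), ("g1", "d2"), ("g2", "d1"), ("g2", "d2"), ("g3", "d3")]
good = barber_modularity(edges, {"g1": 0, "g2": 0, "g3": 1},
                         {"d1": 0, "d2": 0, "d3": 1})
bad = barber_modularity(edges, {"g1": 0, "g2": 1, "g3": 0},
                        {"d1": 1, "d2": 0, "d3": 1})
```

The partition that groups the densely connected pairs scores 0.32 here, while the shuffled partition scores lower; a Louvain-style algorithm greedily moves vertices between communities to increase this quantity.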
https://www.computer.org/csdl/trans/tb/2019/01/08118178-abs.html
RNA-binding proteins (RBPs) play an important role in the post-transcriptional control of RNAs, including splicing, polyadenylation, mRNA stabilization, mRNA localization, and translation. Recently, non-negative matrix factorization (NMF) has been developed to combine multiple data sources to discover non-overlapping and class-specific RNA binding patterns. However, several challenges remain in determining the number of latent dimensions in the factorization steps: it is usually assumed that the number of latent dimensions (or components) is given, and the resulting trial-and-error procedures can be tedious in practice. To address this problem, a differential evolution algorithm is proposed as the model selection method to choose a suitable number of ranks, which can adaptively decompose the input protein-RNA data matrix into different nonnegative components. Experimental results demonstrate that the proposed algorithms improve the factorization quality over the recent state-of-the-art methods. The effectiveness of the proposed algorithms is supported by comprehensive performance benchmarking on 31 genome-wide cross-linking immunoprecipitation (CLIP) coupled with high-throughput sequencing (CLIP-seq) datasets. In addition, time complexity analysis and parameter analysis are conducted to demonstrate the robustness of the proposed methods.
02/05/2019 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2776224
Meta-Path Methods for Prioritizing Candidate Disease miRNAs
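The model-selection idea in the abstract above, differential evolution searching over the NMF rank, can be sketched with a toy fitness of reconstruction error plus a rank penalty. The penalty form, the 1-D DE update scheme, and all constants are assumptions for illustration, not the paper's criterion:

```python
import random

def nmf_error(X, r, iters=300, seed=0):
    """Multiplicative-update NMF; returns the squared Frobenius
    reconstruction error of a rank-r factorization of X."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(r)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(r)]
    recon = lambda: [[sum(W[i][k] * H[k][j] for k in range(r))
                      for j in range(m)] for i in range(n)]
    for _ in range(iters):
        WH = recon()
        for i in range(n):
            for k in range(r):
                num = sum(X[i][j] * H[k][j] for j in range(m))
                den = sum(WH[i][j] * H[k][j] for j in range(m)) + 1e-9
                W[i][k] *= num / den
        WH = recon()
        for k in range(r):
            for j in range(m):
                num = sum(W[i][k] * X[i][j] for i in range(n))
                den = sum(W[i][k] * WH[i][j] for i in range(n)) + 1e-9
                H[k][j] *= num / den
    WH = recon()
    return sum((X[i][j] - WH[i][j]) ** 2 for i in range(n) for j in range(m))

def de_select_rank(X, lo=1, hi=4, pop=6, gens=15, F=0.8, CR=0.9,
                   penalty=1.0, seed=1):
    """1-D differential evolution over a real-valued rank (rounded when
    evaluated); fitness = reconstruction error + penalty * rank."""
    rng = random.Random(seed)
    clamp = lambda x: max(lo, min(hi, x))
    fitness = lambda x: nmf_error(X, round(clamp(x))) + penalty * round(clamp(x))
    xs = [rng.uniform(lo, hi) for _ in range(pop)]
    fs = [fitness(x) for x in xs]
    for _ in range(gens):
        for i in range(pop):
            # classic DE/rand/1 mutation, simplified for one dimension
            a, b, c = rng.sample([x for j, x in enumerate(xs) if j != i], 3)
            trial = clamp(a + F * (b - c)) if rng.random() < CR else xs[i]
            ft = fitness(trial)
            if ft <= fs[i]:
                xs[i], fs[i] = trial, ft
    return round(clamp(xs[fs.index(min(fs))]))

# X has exact non-negative rank 2 (third row = first + second).
X = [[2, 0, 2, 0], [0, 3, 0, 3], [2, 3, 2, 3]]
best_rank = de_select_rank(X)
```

The search should settle on rank 2: rank 1 fits poorly, while ranks 3 and 4 pay a higher penalty for no error reduction.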
https://www.computer.org/csdl/trans/tb/2019/01/08118187-abs.html
MicroRNAs (miRNAs) play critical roles in regulating gene expression at the post-transcriptional level. Numerous experimental studies indicate that alterations and dysregulations of miRNAs are associated with important complex diseases, especially cancers. Predicting potential miRNA—disease associations is beneficial not only for exploring the pathogenesis of diseases, but also for understanding biological processes. In this work, we propose two methods that can effectively predict potential miRNA—disease associations using our reconstructed miRNA and disease similarity networks, which are based on the latest experimental data. We reconstruct a miRNA functional similarity network using the following biological information: miRNA family information, miRNA cluster information, experimentally validated miRNA—target associations, and disease—miRNA information. We also reconstruct a disease similarity network using disease functional information and disease semantic information. On the resulting comprehensive heterogeneous network, we present two methods: Katz with specific weights and Katz with machine learning. These methods, which achieve corresponding AUC values of 0.897 and 0.919, exhibit performance superior to the existing methods. The comprehensive data networks and reasonable considerations guarantee the high performance of our methods. Unlike several methods that cannot work in such situations, the proposed methods can also predict associations for diseases without any known related miRNAs. A web service for the download and prediction of relationships between diseases and miRNAs is available at <uri>http://lab.malab.cn/soft/MDPredict/</uri>.
02/05/2019 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2776280
Early Diagnosis of Alzheimer's Disease Based on Resting-State Brain Networks and Deep Learning
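The Katz index underlying both methods in the abstract above scores node pairs by damped counts of walks of increasing length. A truncated version on a tiny invented heterogeneous network, with a single damping factor β standing in for the paper's per-length specific weights:

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def katz_scores(A, beta=0.1, max_len=3):
    """Truncated Katz index: S = sum_{l=1..max_len} beta^l * A^l.
    Longer walks contribute less via the damping factor beta."""
    n = len(A)
    S = [[0.0] * n for _ in range(n)]
    P = [row[:] for row in A]   # current power A^l, starting at l = 1
    w = beta
    for _ in range(max_len):
        for i in range(n):
            for j in range(n):
                S[i][j] += w * P[i][j]
        P = matmul(P, A)
        w *= beta
    return S

# Nodes 0 and 1 are miRNAs; node 2 is a disease. miRNA 0 is linked to the
# disease and is similar to miRNA 1, so the pair (1, 2) is reachable only
# through a length-2 walk.
A = [[0, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]
S = katz_scores(A)
```

`S[1][2]` is nonzero only through the length-2 walk (miRNA 1 → miRNA 0 → disease), so the directly observed association `S[0][2]` scores higher, which is the expected ranking behavior.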
https://www.computer.org/csdl/trans/tb/2019/01/08119531-abs.html
Computerized healthcare has undergone rapid development thanks to advances in medical imaging and machine learning technologies. In particular, recent progress in deep learning opens a new era for multimedia-based clinical decision support. In this paper, we use deep learning with brain networks and clinically relevant text information to make an early diagnosis of Alzheimer's Disease (AD). The clinically relevant text information includes the age, gender, and <inline-formula><tex-math notation="LaTeX">$ApoE$</tex-math><alternatives><inline-graphic xlink:href="ju-ieq1-2776910.gif"/></alternatives></inline-formula> gene of the subject. The brain network is constructed by computing the functional connectivity of brain regions using resting-state functional magnetic resonance imaging (R-fMRI) data. A targeted autoencoder network is built to distinguish normal aging from mild cognitive impairment, an early stage of AD. The proposed method reveals discriminative brain network features effectively and provides a reliable classifier for AD detection. Compared to traditional classifiers based on R-fMRI time series data, the proposed deep learning method improves prediction accuracy by about 31.21 percent and reduces the standard deviation by 51.23 percent in the best case, which means our prediction model is more stable and reliable than the traditional methods. Our work demonstrates deep learning's advantages in classifying high-dimensional multimedia data in medical services and could help predict and prevent AD at an early stage.
02/07/2019 5:46 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2776910
A Comparative Study of Network Motifs in the Integrated Transcriptional Regulation and Protein Interaction Networks of <italic>Shewanella</italic>
https://www.computer.org/csdl/trans/tb/2019/01/08288571-abs.html
<italic>Shewanella</italic> species show remarkable respiratory versatility, accepting a great variety of extracellular electron acceptors (termed Extracellular Electron Transfer, EET). To explore the relevant mechanisms from the network motif view, we constructed integrated networks combining transcriptional regulation interactions (TRIs) and protein-protein interactions (PPIs) for 13 <italic>Shewanella</italic> species, and identified and compared the network motifs in these integrated networks. We found that the network motifs were evolutionarily conserved across these integrated networks. The functional significance of the highly conserved motifs is discussed, especially the important ones that are potentially involved in the <italic>Shewanella</italic> EET processes. More importantly, we found that: 1) the motif of co-regulated PPIs plays a role in a “standby mode” of protein utilization, which helps cells respond rapidly to environmental changes; and 2) the type II cofactors, which are involved in the motif of a TRI interacting with a third protein, mainly carry out a signalling role in <italic>Shewanella oneidensis</italic> MR-1.
02/05/2019 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2804393
An Efficient Mixed-Model for Screening Differentially Expressed Genes of Breast Cancer Based on LR-RF
https://www.computer.org/csdl/trans/tb/2019/01/08345249-abs.html
To screen differentially expressed genes quickly and efficiently in breast cancer, two gene microarray datasets of breast cancer, GSE15852 and GSE45255, were downloaded from GEO. By combining the Logistic Regression and Random Forest algorithms, this paper proposes a novel method named LR-RF that selects differentially expressed genes of breast cancer from microarray data using the Bonferroni test of the FWER error measure. Compared with Logistic Regression and Random Forest alone, our study shows that LR-RF is highly effective in selecting differentially expressed genes. The average prediction accuracy of the proposed LR-RF over 10 replicated random tests reaches <inline-formula><tex-math notation="LaTeX">${{93.11}}$</tex-math><alternatives><inline-graphic xlink:href="tang-ieq1-2829519.gif"/></alternatives></inline-formula> percent, with a variance as low as <inline-formula><tex-math notation="LaTeX">${{0.00045}}$</tex-math><alternatives><inline-graphic xlink:href="tang-ieq2-2829519.gif"/></alternatives></inline-formula>. The prediction accuracy reaches a maximum of 95.57 percent when the threshold value <inline-formula><tex-math notation="LaTeX">$\alpha = 0.2$</tex-math><alternatives><inline-graphic xlink:href="tang-ieq3-2829519.gif"/></alternatives></inline-formula> is used in the random forest step of ranking genes’ importance scores, and the differentially expressed genes are relatively few in number. In addition, through analysis of gene interaction networks, most of the top 20 selected genes were found to be involved in the development of breast cancer. All of these results demonstrate the reliability and efficiency of LR-RF.
It is anticipated that LR-RF will provide new knowledge and methods for biologists, medical scientists, and cognitive computing researchers to identify disease-related genes of breast cancer.
02/05/2019 2:01 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2829519
SAFETY: Secure gwAs in Federated Environment through a hYbrid Solution
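The FWER-controlled screening step in the LR-RF abstract above can be sketched as follows. The per-gene Welch test with a normal-tail p-value and the toy expression values are assumptions for illustration, and the random-forest ranking stage of LR-RF is omitted:

```python
import math

def welch_t_pvalue(a, b):
    """Welch's t statistic with a two-sided normal-tail p-value
    (an approximation; a t distribution would be more accurate)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    t = (ma - mb) / math.sqrt(va / len(a) + vb / len(b))
    return math.erfc(abs(t) / math.sqrt(2))

def bonferroni_screen(expr_by_gene, labels, alpha=0.05):
    """Keep genes whose p-value clears the Bonferroni-corrected threshold
    alpha / n_genes, controlling the family-wise error rate (FWER)."""
    n = len(expr_by_gene)
    kept = []
    for gene, values in expr_by_gene.items():
        tumor = [v for v, l in zip(values, labels) if l == 1]
        normal = [v for v, l in zip(values, labels) if l == 0]
        if welch_t_pvalue(tumor, normal) < alpha / n:
            kept.append(gene)
    return kept

labels = [1] * 10 + [0] * 10
expr = {
    "geneA": [5.0 + 0.1 * i for i in range(10)] + [1.0 + 0.1 * i for i in range(10)],
    "geneB": [2.0, 2.1] * 10,   # identical distribution in both groups
}
kept = bonferroni_screen(expr, labels)
```

Only the strongly differential gene survives the corrected threshold; in LR-RF the surviving genes would then be ranked by random-forest importance scores.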
https://www.computer.org/csdl/trans/tb/2019/01/08345622-abs.html
Recent studies demonstrate that effective healthcare can benefit from the use of human genomic information. Consequently, many institutions use statistical analyses of genomic data that are mostly based on genome-wide association studies (GWAS). GWAS analyze genome sequence variations in order to identify genetic risk factors for diseases. These studies often require pooling data from different sources in order to unravel statistical patterns and relationships between genetic variants and diseases. Here, the primary challenge is to access multiple genomic data repositories for collaborative research in a privacy-preserving manner: because of privacy concerns regarding genomic data, multi-jurisdictional laws and policies on cross-border genomic data sharing are enforced among different countries. In this article, we present SAFETY, a hybrid framework that can securely perform GWAS on federated genomic datasets using homomorphic encryption and the recently introduced secure hardware component Intel Software Guard Extensions (SGX) to ensure high efficiency and privacy at the same time. Different experimental settings show the efficacy and applicability of this hybrid framework for the secure conduct of GWAS. To the best of our knowledge, this hybrid use of homomorphic encryption along with Intel SGX has not been proposed to date. SAFETY is up to 4.82 times faster than the best existing secure computation technique.
02/05/2019 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2829760
DrPOCS: Drug Repositioning Based on Projection Onto Convex Sets
https://www.computer.org/csdl/trans/tb/2019/01/08350090-abs.html
Drug repositioning, i.e., identifying new indications for known drugs, has attracted a lot of attention recently and is becoming an effective strategy in drug development. In the literature, several computational approaches have been proposed to identify potential indications of old drugs based on various types of data sources. In this paper, by formulating the drug-disease associations as a low-rank matrix, we propose a novel method, namely DrPOCS, to identify candidate indications of old drugs based on projection onto convex sets (POCS). With the integration of drug structure and disease phenotype information, DrPOCS predicts potential associations between drugs and diseases with matrix completion. Benchmarking results demonstrate that our proposed approach outperforms popular existing approaches with high accuracy. In addition, a number of novel predicted indications are validated with various types of evidence, indicating the predictive power of our proposed approach.02/05/2019 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2830384Hardness of Covering Alignment: Phase Transition in Post-Sequence Genomics
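The abstract above describes completing a low-rank drug-disease association matrix by alternating projections. As a rough illustration of that general idea (a minimal sketch only, not DrPOCS itself: the rank-truncation step, the toy data, and the function name `complete_matrix` are illustrative assumptions, and a hard rank constraint is not a convex set, unlike the projections in the paper's POCS formulation):

```python
import numpy as np

def complete_matrix(M, observed, rank, n_iter=200):
    """Alternating-projection sketch: truncated-SVD (low-rank) step,
    then reset the observed entries (data-consistency step)."""
    X = np.where(observed, M, 0.0)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # project toward low rank
        X[observed] = M[observed]                  # keep known associations
    return X

# toy example: recover a rank-1 "association matrix" from partial entries
rng = np.random.default_rng(0)
true = np.outer(rng.random(6), rng.random(5))
mask = rng.random(true.shape) < 0.7
est = complete_matrix(true, mask, rank=1)
```

With enough observed entries, the unobserved entries of `est` approach the hidden values of `true`, which is the matrix-completion effect the abstract relies on.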
https://www.computer.org/csdl/trans/tb/2019/01/08352774-abs.html
Covering alignment problems arise from recent developments in genomics; so-called pan-genome graphs are replacing reference genomes, and advances in haplotyping enable the full content of diploid genomes to be used as the basis of sequence analysis. In this paper, we show that the computational complexity will change for natural extensions of alignments to pan-genome representations and to diploid genomes. More broadly, our approach can also be seen as a minimal extension of sequence alignment to labeled directed acyclic graphs (labeled DAGs). Namely, we show that finding a <italic>covering alignment</italic> of two labeled DAGs is NP-hard even on binary alphabets. A covering alignment asks for two paths <inline-formula><tex-math notation="LaTeX">$R_1$</tex-math><alternatives><inline-graphic xlink:href="makinen-ieq1-2831691.gif"/></alternatives></inline-formula> (red) and <inline-formula><tex-math notation="LaTeX">$G_1$</tex-math><alternatives><inline-graphic xlink:href="makinen-ieq2-2831691.gif"/></alternatives></inline-formula> (green) in DAG <inline-formula><tex-math notation="LaTeX">$D_1$</tex-math><alternatives><inline-graphic xlink:href="makinen-ieq3-2831691.gif"/></alternatives></inline-formula> and two paths <inline-formula><tex-math notation="LaTeX">$R_2$</tex-math><alternatives><inline-graphic xlink:href="makinen-ieq4-2831691.gif"/></alternatives></inline-formula> (red) and <inline-formula><tex-math notation="LaTeX">$G_2$</tex-math><alternatives><inline-graphic xlink:href="makinen-ieq5-2831691.gif"/></alternatives></inline-formula> (green) in DAG <inline-formula><tex-math notation="LaTeX">$D_2$</tex-math><alternatives><inline-graphic xlink:href="makinen-ieq6-2831691.gif"/></alternatives></inline-formula> that cover the nodes of the graphs and maximize the sum of the global alignment scores: <inline-formula><tex-math notation="LaTeX">$\mathsf {as}(\mathsf {sp}(R_1),\mathsf {sp}(R_2))+\mathsf {as}(\mathsf {sp}(G_1),\mathsf 
{sp}(G_2))$</tex-math><alternatives><inline-graphic xlink:href="makinen-ieq7-2831691.gif"/></alternatives></inline-formula>, where <inline-formula><tex-math notation="LaTeX">$\mathsf {sp}(P)$</tex-math><alternatives><inline-graphic xlink:href="makinen-ieq8-2831691.gif"/></alternatives></inline-formula> is the concatenation of labels on the path <inline-formula><tex-math notation="LaTeX">$P$</tex-math><alternatives><inline-graphic xlink:href="makinen-ieq9-2831691.gif"/></alternatives></inline-formula>. Pair-wise alignment of haplotype sequences forming a diploid chromosome can be converted to a two-path coverable labeled DAG, and then the covering alignment models the similarity of two diploids over arbitrary recombinations. We also give a reduction in the other direction, to show that such a recombination-oblivious diploid alignment is NP-hard on alphabets of size 3.02/05/2019 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2831691An Exact Algorithm for Sorting by Weighted Preserving Genome Rearrangements
https://www.computer.org/csdl/trans/tb/2019/01/08352816-abs.html
The preserving Genome Sorting Problem (pGSP) asks for a shortest sequence of rearrangement operations that transforms a given gene order into another given gene order by using rearrangement operations that preserve common intervals, i.e., groups of genes that form an interval in both given gene orders. The weighted preserving Genome Sorting Problem (wpGSP) is the weighted version of the problem, where each type of rearrangement operation has a weight and a minimum-weight sequence of rearrangement operations is sought. An exact algorithm – called <monospace>CREx2</monospace> – is presented, which solves the wpGSP for arbitrary gene orders and the following types of rearrangement operations: inversions, transpositions, inverse transpositions, and tandem duplication random loss operations. <monospace>CREx2</monospace> has a (worst case) exponential runtime, but a linear runtime for problem instances where the common intervals are organized in a linear structure. The efficiency of <monospace>CREx2</monospace> and its usefulness for phylogenetic analysis are shown empirically for gene orders of fungal mitochondrial genomes.02/05/2019 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2831661Resource Cut, a New Bounding Procedure to Algorithms for Enumerating Tree-Like Chemical Graphs
https://www.computer.org/csdl/trans/tb/2019/01/08353138-abs.html
Enumerating chemical compounds with given structural properties plays an important role in structure elucidation, with applications such as drug design. We focus on the problem of enumerating tree-like chemical graphs specified by upper and lower bounds on feature vectors, where chemical graphs represent compounds, and a feature vector characterizes frequencies of finite paths in a graph. Building on the branch-and-bound algorithm proposed in earlier work, we propose a new bounding procedure, called <sc>Resource Cut</sc>, to speed up the enumeration process. Tree-like chemical graphs are modeled as vertex-colored trees, colors representing chemical elements. The algorithm is based on a scheme of generating each unique colored tree with a specified number <inline-formula><tex-math notation="LaTeX">$n$</tex-math><alternatives><inline-graphic xlink:href="shurbevski-ieq1-2832061.gif"/></alternatives></inline-formula> of vertices. A colored tree is constructed by repeatedly appending vertices. Given a set <inline-formula><tex-math notation="LaTeX">$\mathcal {R}$</tex-math><alternatives><inline-graphic xlink:href="shurbevski-ieq2-2832061.gif"/></alternatives></inline-formula> of <inline-formula><tex-math notation="LaTeX">$n$</tex-math><alternatives><inline-graphic xlink:href="shurbevski-ieq3-2832061.gif"/></alternatives></inline-formula> colored vertices, we found that the algorithm often constructs trees that cannot be extended to a unique representation of a colored tree no matter how the remaining unused colored vertices in the set <inline-formula><tex-math notation="LaTeX">$\mathcal {R}$</tex-math><alternatives><inline-graphic xlink:href="shurbevski-ieq4-2832061.gif"/></alternatives></inline-formula> are appended. We derive a mathematical condition to detect and discard such trees. Experimental results show that <sc>Resource Cut</sc> significantly reduces the search space. 
We have been able to obtain exact numbers of chemical graphs with up to 17 vertices excluding hydrogen atoms.02/05/2019 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2832061SecureLR: Secure Logistic Regression Model via a Hybrid Cryptographic Protocol
https://www.computer.org/csdl/trans/tb/2019/01/08355587-abs.html
Machine learning applications are intensively utilized in various science fields, and increasingly in the biomedical and healthcare sector. Applying predictive modeling to biomedical data introduces privacy and security concerns requiring additional protection to prevent accidental disclosure or leakage of sensitive patient information. Significant advancements in secure computing methods have emerged in recent years; however, many of them require substantial computational and/or communication overheads, which might hinder their adoption in biomedical applications. In this work, we propose SecureLR, a novel framework allowing researchers to leverage both the computational and storage capacity of Public Cloud Servers to conduct learning and predictions on biomedical data without compromising data security or efficiency. Our model builds upon homomorphic encryption methodologies with hardware-based security reinforcement through Software Guard Extensions (SGX), and our implementation demonstrates a practical hybrid cryptographic solution to address important concerns in conducting machine learning with public clouds.02/05/2019 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2833463Parallel Computation of the Burrows-Wheeler Transform of Short Reads Using Prefix Parallelism
https://www.computer.org/csdl/trans/tb/2019/01/08360495-abs.html
The Burrows-Wheeler transform (BWT) of short-read data has unexplored potential, such as enabling efficient and sensitive variation analysis against multiple reference genome sequences, because it does not depend on any particular reference genome sequence, unlike conventional mapping-based methods. However, since the amount of read data is generally much larger than the size of the reference sequence, computation of the BWT of reads is not easy, and this hampers the development of potential applications. To alleviate this problem, a new method of computing the BWT of reads in parallel is proposed. The BWT, corresponding to a sorted list of suffixes of reads, is constructed incrementally by successively including longer and longer suffixes. The working data is divided into more than 10,000 “blocks” corresponding to sublists of suffixes with the same prefixes. Thousands of groups of blocks can be processed in parallel while making exclusive writes and concurrent reads into a shared memory. Reads and writes are basically sequential, and the read concurrency is limited to two. Thus, a fine-grained parallelism, referred to as <italic>prefix parallelism</italic>, is expected to work efficiently. The time complexity for processing <inline-formula><tex-math notation="LaTeX">$n$</tex-math><alternatives><inline-graphic xlink:href="kimura-ieq1-2837749.gif"/></alternatives></inline-formula> reads of length <inline-formula><tex-math notation="LaTeX">$\ell$</tex-math><alternatives><inline-graphic xlink:href="kimura-ieq2-2837749.gif"/></alternatives></inline-formula> is <inline-formula><tex-math notation="LaTeX">$O(n\ell ^2)$</tex-math><alternatives><inline-graphic xlink:href="kimura-ieq3-2837749.gif"/></alternatives></inline-formula>. 
On actual biological DNA sequence data of about 100 Gbp with a read length of 100 bp (base pairs), a tentative implementation of the proposed method took less than an hour on a single-node computer; i.e., it was about three times faster than one of the fastest programs developed so far.02/05/2019 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2837749An Integrated Reconciliation Framework for Domain, Gene, and Species Level Evolution
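For readers unfamiliar with the object being computed, the BWT of a read collection can be written down naively (a toy sketch, not the prefix-parallel algorithm: the per-read "$" terminator and tie-breaking by read index are assumptions of this illustration, and real tools avoid materializing every suffix):

```python
def bwt_of_reads(reads):
    """Naive BWT of a read collection: append a terminator to each read,
    sort all suffixes, and emit the character cyclically preceding each
    suffix. Ties between identical suffixes break by read index."""
    entries = []
    for k, read in enumerate(reads):
        s = read + "$"
        for i in range(len(s)):
            # s[i - 1] is the preceding character; for i == 0 it wraps
            # around to the terminator s[-1] == "$".
            entries.append((s[i:], k, s[i - 1]))
    entries.sort()
    return "".join(prev for _, _, prev in entries)

# classic single-string sanity check: BWT("banana$") == "annb$aa"
single = bwt_of_reads(["banana"])
```

This quadratic-memory construction is exactly what makes BWT computation on ~100 Gbp of reads hard, motivating the blockwise, prefix-parallel approach of the paper.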
https://www.computer.org/csdl/trans/tb/2019/01/08382181-abs.html
The majority of genes in eukaryotes consists of one or more <italic>protein domains</italic> that can be independently lost or gained during evolution. This gain and loss of protein domains, through domain duplications, transfers, or losses, has important evolutionary and functional consequences. Yet, even though it is well understood that domains evolve inside genes and genes inside species, there do not exist any computational frameworks to simultaneously model the evolution of domains, genes, and species and account for their inter-dependency. Here, we develop an integrated model of domain evolution that explicitly captures the interdependence of domain-, gene-, and species-level evolution. Our model extends the classical phylogenetic reconciliation framework, which infers gene family evolution by comparing gene trees and species trees, by explicitly considering domain-level evolution and decoupling domain-level events from gene-level events. In this paper, we (i) introduce the new integrated reconciliation framework, (ii) prove that the associated optimization problem is NP-hard, (iii) devise an efficient heuristic solution for the problem, (iv) apply our algorithm to a large biological dataset, and (v) demonstrate the impact of using our new computational framework compared to existing approaches. The implemented software is freely available from <uri>http://compbio.engr.uconn.edu/software/seadog/</uri>.02/05/2019 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2846253Arrhythmia Recognition and Classification Using ECG Morphology and Segment Feature Analysis
https://www.computer.org/csdl/trans/tb/2019/01/08382215-abs.html
In this work, arrhythmias arising from abnormal electrical activity of the heart are efficiently recognized and classified. A novel method is proposed for accurate recognition and classification of cardiac arrhythmias. First, P-QRS-T waves are segmented from the ECG waveform; second, morphological features are extracted from the P-QRS-T waves, and ECG segment features are extracted from the selected ECG segment using PCA and dynamic time warping (DTW); finally, an SVM is applied to the features and automatic diagnosis results are presented. The ECG data set used is derived from the MIT-BIH database, in which ECG signals are divided into four classes: normal beats (N), supraventricular ectopic beats (SVEBs), ventricular ectopic beats (VEBs), and fusion of ventricular and normal beats (F). Our proposed method can distinguish N, SVEBs, VEBs, and F with an accuracy of 97.80 percent. The sensitivities for the classes N, SVEBs, VEBs, and F are 99.27, 87.47, 94.71, and 73.88 percent, and the positive predictivities are 98.48, 95.25, 95.22, and 86.09 percent, respectively. The detection sensitivity for SVEBs and VEBs is better when the proposed features are combined than when the ECG morphology or ECG segment features are used separately. The proposed method is compared with four selected peer algorithms and delivers solid results.02/05/2019 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2846611Computing the Diameter of the Space of Maximum Parsimony Reconciliations in the Duplication-Transfer-Loss Model
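One of the building blocks named above, dynamic time warping, is the standard dynamic program below (a generic textbook sketch, not the authors' implementation):

```python
def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two 1-D sequences:
    D[i][j] = |a[i-1] - b[j-1]| + min(step up, step left, diagonal step),
    allowing one sequence to stretch locally to match the other."""
    inf = float("inf")
    D = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    D[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[-1][-1]
```

Because DTW tolerates local stretching, two heartbeats of slightly different duration can still yield a small distance, which is why it suits beat-to-beat ECG comparison.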
https://www.computer.org/csdl/trans/tb/2019/01/08392709-abs.html
Phylogenetic tree reconciliation is widely used in the fields of molecular evolution, cophylogenetics, parasitology, and biogeography to study the evolutionary histories of pairs of entities. In these contexts, reconciliation is often performed using maximum parsimony under the Duplication-Transfer-Loss (DTL) event model. In general, the number of maximum parsimony reconciliations (MPRs) can grow exponentially with the size of the trees. While a number of previous efforts have been made to count the number of MPRs, find representative MPRs, and compute the frequencies of events across the space of MPRs, little is known about the structure of MPR space. In particular, how different are MPRs in terms of the events that they comprise? One way to address this question is to compute the <italic>diameter</italic> of MPR space, defined to be the maximum number of DTL events that distinguish any two MPRs in the solution space. We show how to compute the diameter of MPR space in polynomial time and then apply this algorithm to a large biological dataset to study the variability of events.02/05/2019 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2849732Are My EHRs Private Enough? Event-Level Privacy Protection
https://www.computer.org/csdl/trans/tb/2019/01/08395018-abs.html
Privacy is a major concern in sharing human subject data to researchers for secondary analyses. A simple binary consent (opt-in or not) may significantly reduce the amount of sharable data, since many patients might only be concerned about a few sensitive medical conditions rather than the entire medical records. We propose event-level privacy protection, and develop a feature ablation method to protect event-level privacy in electronic medical records. Using a list of 13 sensitive diagnoses, we evaluate the feasibility and the efficacy of the proposed method. As feature ablation progresses, the identifiability of a sensitive medical condition decreases with varying speeds on different diseases. We find that these sensitive diagnoses can be divided into three categories: (1) five diseases have fast declining identifiability (AUC below 0.6 with less than 400 features excluded); (2) seven diseases with progressively declining identifiability (AUC below 0.7 with between 200 and 700 features excluded); and (3) one disease with slowly declining identifiability (AUC above 0.7 with 1,000 features excluded). The fact that the majority (12 out of 13) of the sensitive diseases fall into the first two categories suggests the potential of the proposed feature ablation method as a solution for event-level record privacy protection.02/05/2019 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2850037Natural Language Processing for EHR-Based Computational Phenotyping
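The identifiability thresholds above are stated as AUC values. A rank-based AUC and a naive one-feature-at-a-time ablation loop can be sketched as follows (illustrative only: the correlation-based feature ranking and the sum-of-features scorer are stand-ins, not the paper's classifier):

```python
import numpy as np

def auc(scores, labels):
    """Rank-sum AUC: the probability that a random positive outscores a
    random negative (assumes no tied scores)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def ablate(X, y, n_remove):
    """Zero out the feature most correlated with the label n_remove times,
    then rescore with a naive sum-of-features 'classifier'."""
    X = X.astype(float).copy()
    for _ in range(n_remove):
        corr = np.nan_to_num(np.abs(np.array(
            [np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])))
        X[:, int(np.argmax(corr))] = 0.0   # ablate the most identifying feature
    return auc(X.sum(axis=1), y)

# toy demo: one informative feature, one noise feature
rng = np.random.default_rng(1)
y = np.array([0] * 20 + [1] * 20)
X = np.column_stack([y + rng.normal(0, 0.3, 40), rng.normal(0, 1, 40)])
auc_after = ablate(X, y, n_remove=1)   # identifiability drops toward 0.5
```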
https://www.computer.org/csdl/trans/tb/2019/01/08395074-abs.html
This article reviews recent advances in applying natural language processing (NLP) to Electronic Health Records (EHRs) for computational phenotyping. NLP-based computational phenotyping has numerous applications, including diagnosis categorization, novel phenotype discovery, clinical trial screening, pharmacogenomics, drug-drug interaction (DDI) and adverse drug event (ADE) detection, as well as genome-wide and phenome-wide association studies. Significant progress has been made in algorithm development and resource construction for computational phenotyping. Among the surveyed methods, well-designed keyword search and rule-based systems often achieve good performance. However, the construction of keyword and rule lists requires significant manual effort, which is difficult to scale. Supervised machine learning models have been favored because they are capable of acquiring both classification patterns and structures from data. Recently, deep learning and unsupervised learning have received growing attention, with the former favored for its performance and the latter for its ability to find novel phenotypes. Integrating heterogeneous data sources has become increasingly important and has shown promise in improving model performance. Often, better performance is achieved by combining multiple modalities of information. Despite these many advances, challenges and opportunities remain for NLP-based computational phenotyping, including better model interpretability and generalizability, and proper characterization of feature relations in clinical narratives.02/05/2019 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2849968Taming Asynchrony for Attractor Detection in Large Boolean Networks
https://www.computer.org/csdl/trans/tb/2019/01/08398459-abs.html
Boolean networks are a well-established formalism for modelling biological systems. A vital challenge in analyzing a Boolean network is to identify all of its attractors. This becomes more challenging for large asynchronous Boolean networks, due to the asynchronous updating scheme. Existing methods are hindered by the well-known state-space explosion problem in large Boolean networks. In this paper, we tackle this challenge by proposing an SCC-based decomposition method. We prove the correctness of our proposed method and demonstrate its efficiency with two real-life biological networks.02/05/2019 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2850901Adjacent Y-Ion Ratio Distributions and Its Application in Peptide Sequencing
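For intuition about what attractor detection under the asynchronous scheme means, here is a brute-force sketch over the full state-transition graph of a tiny network (the opposite of the paper's SCC-based decomposition, which exists precisely to avoid this exhaustive enumeration):

```python
from itertools import product

def async_attractors(n, update):
    """Attractors of an asynchronous Boolean network with n variables.
    update(i, state) gives the next value of variable i; asynchronously,
    any single variable that wants to change may fire. A state lies in an
    attractor iff every state reachable from it can reach it back."""
    states = list(product((0, 1), repeat=n))
    succ = {}
    for s in states:
        nxt = [s[:i] + (update(i, s),) + s[i + 1:]
               for i in range(n) if update(i, s) != s[i]]
        succ[s] = nxt or [s]          # a fixed point stays put
    def reach(s):
        seen, stack = {s}, [s]
        while stack:
            for t in succ[stack.pop()]:
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen
    R = {s: reach(s) for s in states}
    # terminal SCCs: reachable sets that are mutually reachable
    return {frozenset(R[s]) for s in states if all(s in R[t] for t in R[s])}

# toy network x1' = x2, x2' = x1: two point attractors, 00 and 11
attractors = async_attractors(2, lambda i, s: s[1 - i])
```

The 2^n state enumeration here is exactly the explosion the abstract refers to; decomposition methods work on the network structure instead of this graph.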
https://www.computer.org/csdl/trans/tb/2019/01/08430587-abs.html
A scoring function plays a critical role in software for peptide identification with mass spectrometry. We present a general scoring feature that can be incorporated into the scoring functions of other peptide identification software. The scoring feature is based on the intensity ratios between two adjacent y-ions in the spectrum. A method is proposed to obtain the probability distributions of such ratios, and to calculate the scoring feature based on the distributions. To demonstrate the performance of the method, the new feature was incorporated into X!Tandem <xref ref-type="bibr" rid="ref1">[1]</xref>, <xref ref-type="bibr" rid="ref2">[2]</xref> and Novor <xref ref-type="bibr" rid="ref3">[3]</xref>, significantly improving the database search and de novo sequencing performance, respectively, on the testing data.02/05/2019 2:01 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2864647Modeling and Simulation Studies of Complex Biological Systems for Precision Medicine and Healthcare
https://www.computer.org/csdl/trans/tb/2019/01/08635320-abs.html
02/05/2019 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.28500782018 Index IEEE/ACM Transactions on Computational Biology and Bioinformatics Vol. 15
https://www.computer.org/csdl/trans/tb/2019/01/08635413-abs.html
02/05/2019 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2019.2895796Guest Editorial for the 16th Asia Pacific Bioinformatics Conference
https://www.computer.org/csdl/trans/tb/2019/01/08635414-abs.html
02/05/2019 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2856940DiscMLA: An Efficient Discriminative Motif Learning Algorithm over High-Throughput Datasets
https://www.computer.org/csdl/trans/tb/2018/06/07464278-abs.html
Transcription factors (TFs) can activate or suppress gene expression by binding to specific sites, and hence are crucial regulatory elements for transcription. Recently, a series of discriminative motif finders has been developed, offering a promising strategy for harnessing the power of the large quantities of accumulated high-throughput experimental data. However, in order to achieve high speed, these algorithms have to sacrifice accuracy by employing simplified statistical models during the search process. In this paper, we propose a novel approach named Discriminative Motif Learning via AUC (DiscMLA) to discover motifs on high-throughput datasets. Unlike previous approaches, DiscMLA tries to optimize a more comprehensive criterion (AUC) during motif searching. In addition, based on an experimental observation about motif identification on large-scale datasets, some novel procedures are designed to accelerate DiscMLA. The experimental results on 52 real-world datasets demonstrate that our approach substantially outperforms previous methods on discriminative motif learning problems. DiscMLA's stability, discriminability, and validity will help to exploit high-throughput datasets and answer many fundamental biological questions.12/11/2018 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2561930Effectively Identifying Compound-Protein Interactions by Learning from Positive and Unlabeled Examples
https://www.computer.org/csdl/trans/tb/2018/06/07471459-abs.html
Prediction of <italic>compound-protein interactions</italic> (CPIs) aims to find new compound-protein pairs in which a protein is targeted by at least one compound, which is a crucial step in new drug design. Currently, a number of machine learning based methods have been developed to predict new CPIs in the literature. However, as there is not yet any publicly available set of validated negative CPIs, most existing machine learning based approaches use unknown interactions (not validated CPIs), selected randomly, as the negative examples to train classifiers for predicting new CPIs. Obviously, this is not quite reasonable and unavoidably impacts the CPI prediction performance. In this paper, we simply take the unknown CPIs as unlabeled examples, and propose a new method called <italic>PUCPI</italic> (the abbreviation of <italic>PU</italic> learning for Compound-Protein Interaction identification) that employs biased-SVM (Support Vector Machine) to predict CPIs using only positive and unlabeled examples. PU learning is a class of learning methods that learns from <italic>positive and unlabeled</italic> (PU) samples. To the best of our knowledge, this is the first work that identifies CPIs using only positive and unlabeled examples. We first collect known CPIs as positive examples and then randomly select compound-protein pairs not in the positive set as unlabeled examples. For each CPI/compound-protein pair, we extract protein domains as protein features and compound substructures as chemical features, then take the tensor product of the corresponding compound features and protein features as the feature vector of the CPI/compound-protein pair. After that, biased-SVM is employed to train classifiers on different datasets of CPIs and compound-protein pairs. 
Experiments over various datasets show that our method outperforms six typical classifiers, including random forest, L1- and L2-regularized logistic regression, naive Bayes, SVM and <inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives><inline-graphic xlink:href="zhou-ieq1-2570211.gif"/></alternatives></inline-formula>-nearest neighbor (kNN), and three types of existing CPI prediction models. More information can be found at <uri>http://admis.fudan.edu.cn/projects/pucpi.html</uri>.12/11/2018 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2570211Improve Biomedical Information Retrieval Using Modified Learning to Rank Methods
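The core of a biased SVM is an asymmetric misclassification penalty: known positives are penalized more heavily than unlabeled examples, which are treated as tentative negatives. A minimal full-batch subgradient sketch of that idea (not the PUCPI implementation; the toy data, penalty values, and learning rate are illustrative assumptions):

```python
import numpy as np

def biased_linear_svm(X, y, c_pos=10.0, c_unl=1.0, lr=0.01, epochs=200, lam=0.01):
    """Linear SVM with hinge loss and per-class penalties: c_pos for known
    positives (y = +1), c_unl for unlabeled examples (labeled y = -1)."""
    w, b = np.zeros(X.shape[1]), 0.0
    c = np.where(y == 1, c_pos, c_unl)
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1                  # margin violations
        grad_w = lam * w - (c[viol][:, None] * y[viol][:, None]
                            * X[viol]).sum(axis=0) / len(y)
        grad_b = -(c[viol] * y[viol]).sum() / len(y)
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b

# toy PU data: 3 known positives, 4 unlabeled (one of them a hidden positive)
X = np.array([[2.0, 2], [3, 2], [2, 3], [-2, -2], [-3, -2], [-2, -3], [2.5, 2.5]])
y = np.array([1, 1, 1, -1, -1, -1, -1])
w, b = biased_linear_svm(X, y)
```

Because `c_pos > c_unl`, a hidden positive hiding among the unlabeled examples cannot drag the hyperplane far from the known positives, which is the behavior PU learning depends on.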
https://www.computer.org/csdl/trans/tb/2018/06/07491290-abs.html
In recent years, the number of biomedical articles has increased exponentially, making it difficult for biologists to capture all the needed information manually. Information retrieval technologies, as the core of search engines, can deal with this problem automatically, providing users with the needed information. However, it is a great challenge to apply these technologies directly to biomedical retrieval, because of the abundance of domain-specific terminologies. To enhance biomedical retrieval, we propose a novel framework based on learning to rank. Learning to rank is a family of state-of-the-art information retrieval techniques that has proven effective in many information retrieval tasks. In the proposed framework, we attempt to tackle the problem of the abundance of terminologies by constructing ranking models, which focus not only on retrieving the most relevant documents, but also on diversifying the search results to increase the completeness of the resulting list for a given query. In model training, we propose two novel document labeling strategies, and combine several traditional retrieval models as learning features. Besides, we also investigate the usefulness of different learning to rank approaches in our framework. Experimental results on TREC Genomics datasets demonstrate the effectiveness of our framework for biomedical information retrieval.12/11/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2578337Predicting MicroRNA-Disease Associations Based on Improved MicroRNA and Disease Similarities
https://www.computer.org/csdl/trans/tb/2018/06/07505911-abs.html
MicroRNAs (miRNAs) are a type of non-coding RNA of about 22 nucleotides. Increasing evidence has shown that miRNAs play critical roles in many human diseases. The identification of human disease-related miRNAs is helpful to explore the underlying pathogenesis of diseases. More and more experimentally validated associations between miRNAs and diseases have been reported in recent studies, which provide useful information for new miRNA-disease association discovery. In this study, we propose a computational framework, KBMF-MDI, to predict the associations between miRNAs and diseases based on their similarities. The sequence and function information of miRNAs is used to measure similarity among miRNAs, while the semantic and function information of diseases is used to measure similarity among diseases. In addition, the kernelized Bayesian matrix factorization method is employed to infer potential miRNA-disease associations by integrating these data sources. We applied this method to 6,084 known miRNA-disease associations and utilized 5-fold cross validation to evaluate the performance. The experimental results demonstrate that our method can effectively predict unknown miRNA-disease associations.12/11/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2586190Structure-Guided Protein Transition Modeling with a Probabilistic Roadmap Algorithm
https://www.computer.org/csdl/trans/tb/2018/06/07506262-abs.html
Proteins are macromolecules in perpetual motion, switching between structural states to modulate their function. A detailed characterization of the precise yet complex relationship between protein structure, dynamics, and function requires elucidating transitions between functionally-relevant states. Doing so challenges both wet and dry laboratories, as protein dynamics involves disparate temporal scales. In this paper, we present a novel, sampling-based algorithm to compute transition paths. The algorithm exploits two main ideas. First, it leverages known structures to initialize its search and define a reduced conformation space for rapid sampling. This is key to addressing the insufficient sampling issue suffered by sampling-based algorithms. Second, the algorithm embeds samples in a nearest-neighbor graph where transition paths can be efficiently computed via queries. The algorithm adapts the probabilistic roadmap framework that is popular in robot motion planning. In addition to efficiently computing lowest-cost paths between any given structures, the algorithm allows investigating hypotheses regarding the order of experimentally-known structures in a transition event. This novel contribution is likely to open up new avenues of research. Detailed analysis is presented for multiple-basin proteins of relevance to human disease. Multiscaling and the AMBER ff14SB force field are used to obtain energetically-credible paths at atomistic detail.12/11/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2586044Feature Selection for Optimized High-Dimensional Biomedical Data Using an Improved Shuffled Frog Leaping Algorithm
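The roadmap machinery adapted by the algorithm is standard: sample conformations, connect nearby pairs, and answer transition queries by shortest-path search. Below is a toy version over 2-D points standing in for sampled structures (a generic PRM sketch; the paper's version uses structure-guided sampling and energy-based edge costs, and this sketch assumes every node pair is reachable):

```python
import heapq
import math

def build_roadmap(points, radius):
    """PRM 'connection' step: join every pair of points within `radius`
    by an edge weighted by Euclidean distance."""
    adj = {i: [] for i in range(len(points))}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            if d <= radius:
                adj[i].append((j, d))
                adj[j].append((i, d))
    return adj

def shortest_path(adj, src, dst):
    """Roadmap query: Dijkstra search returning (cost, node path)."""
    dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, math.inf):
            continue                      # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return dist[dst], path[::-1]

# five 'conformations' along a corridor; query a transition from 0 to 4
pts = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
cost, path = shortest_path(build_roadmap(pts, radius=1.0), 0, 4)
```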
https://www.computer.org/csdl/trans/tb/2018/06/07551172-abs.html
High-dimensional biomedical datasets contain thousands of features which can be used in the molecular diagnosis of disease; however, such datasets contain many irrelevant or weakly correlated features which influence the predictive accuracy of diagnosis. Without a feature selection algorithm, it is difficult for the existing classification techniques to accurately identify patterns in the features. The purpose of feature selection is not only to identify a feature subset from an original set of features (without reducing the predictive accuracy of the classification algorithm) but also to reduce the computational overhead in data mining. In this paper, we present an improved shuffled frog leaping algorithm which introduces a chaos memory weight factor, an absolute balance group strategy, and an adaptive transfer factor. Our proposed approach explores the space of possible subsets to obtain the set of features that maximizes predictive accuracy and minimizes irrelevant features in high-dimensional biomedical data. To evaluate the effectiveness of our proposed method, we have employed the K-nearest neighbor method in a comparative analysis in which we compare our proposed approach with genetic algorithms, particle swarm optimization, and the shuffled frog leaping algorithm. Experimental results show that our improved algorithm achieves improvements in the identification of relevant subsets and in classification accuracy.12/11/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2602263Analyzing Differential Regulatory Networks Modulated by Continuous-State Genomic Features in Glioblastoma Multiforme
https://www.computer.org/csdl/trans/tb/2018/06/07769225-abs.html
Gene regulatory networks are a global representation of complex interactions between molecules that dictate cellular behavior. Study of a regulatory network modulated by single or multiple modulators' expression levels, including microRNAs (miRNAs) and transcription factors (TFs), in different conditions can further reveal the modulators’ roles in diseases such as cancers. Existing computational methods for identifying such modulated regulatory networks are typically carried out by comparing groups of samples dichotomized with respect to the modulator status, ignoring the fact that most biological features are intrinsically continuous variables. Here, we devised a sliding window-based regression scheme and proposed the <underline>R</underline>egression-based <underline>I</underline>nference of <underline>M</underline>odulation (RIM) algorithm to infer the dynamic gene regulation modulated by continuous-state modulators. We demonstrated the improvement in performance as well as computation efficiency achieved by RIM. Applying RIM to genome-wide expression profiles of 520 glioblastoma multiforme (GBM) tumors, we investigated miRNA- and TF-modulated gene regulatory networks and showed their association with dynamic cellular processes and brain-related functions in GBM. Overall, the proposed algorithm provides an efficient and robust scheme for comprehensively studying modulated gene regulatory networks.12/11/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2635646Network Community Detection Based on the <italic>Physarum</italic>-Inspired Computational Framework
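The sliding-window idea behind RIM can be illustrated with a minimal sketch (our illustration, not the authors' implementation; all names are ours): samples are sorted by the modulator's expression level, and the regulator-to-target regression slope is re-estimated within each window of samples, so a slope that drifts with the modulator level signals modulated regulation.

```python
import numpy as np

def sliding_window_regression(modulator, regulator, target, window=100, step=10):
    """Estimate how the regulator->target regression slope varies with the
    modulator level: sort samples by modulator expression, then re-fit a
    simple linear regression inside each sliding window of samples."""
    order = np.argsort(modulator)
    reg, tgt = regulator[order], target[order]
    slopes = []
    for start in range(0, len(order) - window + 1, step):
        x = reg[start:start + window]
        y = tgt[start:start + window]
        slope = np.polyfit(x, y, 1)[0]  # degree-1 fit; element [0] is the slope
        slopes.append(slope)
    return np.array(slopes)

# Toy example: the regulation strength grows with the modulator level,
# so the windowed slope should increase from the first to the last window.
rng = np.random.default_rng(0)
n = 500
mod = rng.uniform(0, 1, n)
reg = rng.normal(size=n)
tgt = mod * reg + 0.1 * rng.normal(size=n)  # true slope ~ modulator level
slopes = sliding_window_regression(mod, reg, tgt)
```

A trend in `slopes` across windows is the kind of continuous-state modulation signal that dichotomizing samples into two groups would blur.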
https://www.computer.org/csdl/trans/tb/2018/06/07782378-abs.html
Community detection is a crucial problem in the structural analysis of complex networks, as it can help us understand and predict the characteristics and functions of complex networks. Many methods, ranging from optimization-based algorithms to heuristic-based algorithms, have been proposed for solving this problem. Due to the inherent complexity of identifying network structure, how to design an effective algorithm with higher accuracy and lower computational cost remains an open problem. Inspired by the computational capability and positive feedback mechanism of the foraging process of <italic>Physarum</italic>, a slime mold, a general <italic>Physarum</italic>-based computational framework for community detection is proposed in this paper. Based on the proposed framework, inter-community edges can be distinguished from intra-community edges in a network, and the positive feedback of the solving process in an algorithm can be further enhanced; these two effects are used to improve the efficiency of original optimization-based and heuristic-based community detection algorithms, respectively. Some typical algorithms (e.g., the genetic algorithm, the ant colony optimization algorithm, and the Markov clustering algorithm) and real-world datasets have been used to evaluate the efficiency of our proposed computational framework. Experiments show that the algorithms optimized by the <italic>Physarum</italic>-inspired computational framework perform better than the original ones in terms of accuracy and computational cost.12/11/2018 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2016.2638824Environment Sensitivity-Based Cooperative Co-Evolutionary Algorithms for Dynamic Multi-Objective Optimization
https://www.computer.org/csdl/trans/tb/2018/06/07817874-abs.html
Dynamic multi-objective optimization problems (DMOPs) not only involve multiple conflicting objectives, but these objectives may also vary with time, raising a challenge for researchers to solve them. This paper presents a cooperative co-evolutionary strategy based on environment sensitivities for solving DMOPs. In this strategy, a new method for grouping decision variables is first proposed, in which all the decision variables are partitioned into two subcomponents according to their interrelation with the environment. Two populations are adopted to cooperatively optimize the two subcomponents, and two prediction methods, differential prediction and Cauchy mutation, are employed respectively to speed up their responses to environmental changes. Furthermore, two improved dynamic multi-objective optimization algorithms, DNSGAII-CO and DMOPSO-CO, are proposed by incorporating the above strategy into NSGA-II and multi-objective particle swarm optimization, respectively. The proposed algorithms are compared with three state-of-the-art algorithms by applying them to seven benchmark DMOPs. Experimental results reveal that the proposed algorithms significantly outperform the compared algorithms in terms of convergence and distribution on most DMOPs.12/11/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2652453Swarm Robots Search for Multiple Targets Based on an Improved Grouping Strategy
https://www.computer.org/csdl/trans/tb/2018/06/07878613-abs.html
This paper addresses collaborative search for multiple targets by swarm robots in unknown environments. An improved grouping strategy based on constriction-factor Particle Swarm Optimization is proposed. Under this strategy, robots are grouped after several iterations of stochastic movement, taking into account the influence range of targets and the environmental information the robots have sensed. The group structure may change dynamically, and each group focuses on searching for one target, so that all targets are eventually found. Obstacle avoidance is considered during the search process. Simulations comparing the proposed strategy with a previous method demonstrate its adaptability, accuracy, and efficiency in multiple-target search.12/14/2018 11:15 am PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2682161Robust Dynamic Multi-Objective Vehicle Routing Optimization Method
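Constriction-factor PSO, on which the grouping strategy above builds, replaces the inertia weight with Clerc and Kennedy's constriction coefficient. A minimal sketch under the usual parameter choice c1 = c2 = 2.05 (illustrative only; function and variable names are ours, and the robot grouping logic is omitted):

```python
import numpy as np

def constriction_pso(f, dim, n_particles=30, iters=200, seed=1):
    """Minimize f with constriction-factor PSO (Clerc & Kennedy).
    With c1 = c2 = 2.05, phi = c1 + c2 = 4.1 and
    chi = 2 / |2 - phi - sqrt(phi^2 - 4*phi)| ~= 0.7298."""
    c1 = c2 = 2.05
    phi = c1 + c2
    chi = 2.0 / abs(2.0 - phi - np.sqrt(phi * phi - 4.0 * phi))
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (n_particles, dim))
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_val = np.apply_along_axis(f, 1, x)
    gbest = pbest[np.argmin(pbest_val)].copy()
    for _ in range(iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # Constricted velocity update: chi damps the whole term.
        v = chi * (v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x))
        x = x + v
        val = np.apply_along_axis(f, 1, x)
        improved = val < pbest_val
        pbest[improved] = x[improved]
        pbest_val[improved] = val[improved]
        gbest = pbest[np.argmin(pbest_val)].copy()
    return gbest, pbest_val.min()

# Sphere function as a stand-in objective (e.g., distance to a target).
best, best_val = constriction_pso(lambda p: np.sum(p * p), dim=5)
```

The constriction coefficient guarantees convergent particle trajectories without a velocity clamp, which is why it is a common basis for robot search strategies.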
https://www.computer.org/csdl/trans/tb/2018/06/07883861-abs.html
In dynamic multi-objective vehicle routing problems, the vehicle waiting time, the number of serving vehicles, and the total route distance are normally considered as the optimization objectives. In addition to these objectives, this paper focuses on fuel consumption, which leads to environmental pollution and energy consumption. Considering the vehicles' load and driving distance, a corresponding carbon emission model is built and set as an optimization objective. Dynamic multi-objective vehicle routing problems with hard time windows and randomly appearing dynamic customers are then modeled. In existing planning methods, when a new service demand comes up, a global vehicle routing optimization is triggered to find the optimal routes for non-served customers, which is time-consuming. Therefore, a robust two-phase dynamic multi-objective vehicle routing method is proposed. Three highlights of the novel method are: (i) After finding optimal robust virtual routes for all customers by adopting multi-objective particle swarm optimization in the first phase, static vehicle routes for static customers are formed by removing all dynamic customers from the robust virtual routes in the next phase. (ii) Dynamically appearing customers are appended for service according to their service time and the vehicles' statuses; global vehicle routing optimization is triggered only when no suitable locations can be found for dynamic customers. (iii) A metric measuring an algorithm's robustness is given. The statistical results indicate that the routes obtained by the proposed method have better stability and robustness, but may be sub-optimal. Moreover, time-consuming global vehicle routing optimization is avoided as dynamic customers appear.12/11/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2685320Immunological Approach for Full NURBS Reconstruction of Outline Curves from Noisy Data Points in Medical Imaging
https://www.computer.org/csdl/trans/tb/2018/06/07888512-abs.html
Curve reconstruction from data points is an important issue for advanced medical imaging techniques, such as computer tomography (CT) and magnetic resonance imaging (MRI). The most powerful fitting functions for this purpose are NURBS (non-uniform rational B-splines). Solving the general reconstruction problem with NURBS requires computing all free variables of the problem (data parameters, breakpoints, control points, and their weights). This leads to a very difficult non-convex, nonlinear, high-dimensional, multimodal, and continuous optimization problem. Previous methods simplify the problem by guessing the values for some variables and computing only the remaining ones. As a result, unavoidable approximation errors are introduced. In this paper, we describe the first method in the literature to solve the full NURBS curve reconstruction problem in all its generality. Our method is based on a combination of two techniques: an immunological approach to perform data parameterization, breakpoint placement, and weight calculation, and least squares minimization to compute the control points. This procedure is repeated iteratively (until no further improvement is achieved) for higher accuracy. The method has been applied to reconstruct outline curves from MRI brain images with satisfactory results. Comparative work shows that our method outperforms the previous related approaches in the literature for all instances in our benchmark.12/11/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2688444A Grouping Particle Swarm Optimizer with Personal-Best-Position Guidance for Large Scale Optimization
https://www.computer.org/csdl/trans/tb/2018/06/07919205-abs.html
Particle Swarm Optimization (PSO) is a popular algorithm that has been widely investigated and implemented in many areas. However, the canonical PSO does not maintain population diversity well, which usually leads to premature convergence to local optima. To address this issue, we propose a PSO variant named Grouping PSO with Personal-Best-Position (<inline-formula><tex-math notation="LaTeX">$P_{best}$</tex-math><alternatives><inline-graphic xlink:href="guo-ieq1-2701367.gif"/></alternatives></inline-formula>) Guidance (GPSO-PG), which maintains population diversity by preserving the diversity of exemplars. On one hand, we adopt a uniform random allocation strategy to assign particles to different groups, and in each group the losers learn from the winner. On the other hand, we employ the personal historical best position of each particle in social learning rather than the current global best particle. In this way, exemplar diversity increases and the influence of the global best particle is eliminated. We test the proposed algorithm on the CEC 2008 and CEC 2010 benchmarks, which concern large scale optimization problems (LSOPs). Compared with several current peer algorithms, GPSO-PG exhibits competitive performance in maintaining population diversity and achieves satisfactory performance on these problems.12/11/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2701367Coevolutionary Structure-Redesigned-Based Bacterial Foraging Optimization
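The two ingredients of GPSO-PG — random grouping with winner-based learning, and personal-best-based social learning in place of a single global best — can be sketched roughly as follows (our simplified illustration, not the authors' exact update rules; names are ours):

```python
import numpy as np

def grouped_pbest_pso(f, dim, n=40, group=4, iters=300, seed=2):
    """Sketch of the grouping idea: particles are randomly partitioned
    into groups each iteration; within a group, losers learn from the
    group winner and from the mean of personal-best positions rather
    than from one global-best particle."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (n, dim))
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_val = np.apply_along_axis(f, 1, x)
    for _ in range(iters):
        val = np.apply_along_axis(f, 1, x)
        improved = val < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], val[improved]
        mean_pbest = pbest.mean(axis=0)       # pbest-based social exemplar
        perm = rng.permutation(n)             # uniform random grouping
        for g in range(0, n, group):
            idx = perm[g:g + group]
            winner = idx[np.argmin(val[idx])]
            for i in idx:
                if i == winner:
                    continue                  # the group winner keeps its position
                r1, r2, r3 = rng.random((3, dim))
                v[i] = (r1 * v[i]
                        + r2 * (x[winner] - x[i])
                        + r3 * (mean_pbest - x[i]))
                x[i] = x[i] + v[i]
    return pbest_val.min()

# Sphere function, 10 dimensions, as a toy objective.
best_val = grouped_pbest_pso(lambda p: np.sum(p * p), dim=10)
```

Because exemplars change with each regrouping and no global best is broadcast, the swarm's learning targets stay diverse, which is the mechanism the abstract credits for avoiding premature convergence.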
https://www.computer.org/csdl/trans/tb/2018/06/08017472-abs.html
This paper presents a Coevolutionary Structure-Redesigned-Based Bacterial Foraging Optimization (CSRBFO), based on the natural phenomenon that most living creatures tend to cooperate with each other so as to fulfill tasks more effectively. Aiming at lowering computational complexity while maintaining the critical search capability of standard bacterial foraging optimization (BFO), we employ a general loop to replace the nested loop and eliminate the reproduction step of BFO. Hence, the proposed CSRBFO consists of only two main steps: (1) chemotaxis and (2) elimination & dispersal. A coevolutionary strategy, by which all bacteria can learn from each other and search for optima cooperatively, is incorporated into the chemotactic step to accelerate convergence and facilitate accurate search. In the elimination & dispersal step, a three-stage evolutionary strategy with different learning methods for maintaining diversity is studied. An evaluation of the convergence status is then added to determine whether bacteria should move on to the next stage. The combination of the coevolutionary strategy and convergence status evaluation is expected to balance exploration and exploitation. Experimental results comparing seven well-known heuristic algorithms on 24 benchmark functions demonstrate that the proposed CSRBFO significantly outperforms the comparison algorithms in most cases.12/11/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2742946Novel Consensus Gene Selection Criteria for Distributed GPU Partial Least Squares-Based Gene Microarray Analysis in Diffused Large B Cell Lymphoma (DLBCL) and Related Findings
https://www.computer.org/csdl/trans/tb/2018/06/08063355-abs.html
This paper proposes novel consensus gene selection criteria for partial least squares-based gene microarray analysis. By quantifying the extent of consistency and distinctiveness of the differential gene expressions across different double cross validations (CV) or randomizations in terms of occurrence and randomization <italic>p</italic>-values, the proposed criteria are able to identify a more comprehensive set of genes associated with the underlying disease. A distributed GPU implementation has been proposed to accelerate the gene selection, and a speed-up of about 8-11 times has been achieved on the microarray datasets considered. Simulation results using various cancer gene microarray datasets show that the proposed approach achieves highly comparable classification accuracy compared with many conventional approaches. Furthermore, enrichment analysis on the selected genes for the Diffused Large B Cell Lymphoma (DLBCL) and Prostate Cancer datasets shows that only the proposed approach is able to identify gene lists enriched in different pathways with significant <italic>p</italic>-values. In contrast, sufficient statistical significance cannot be found for the conventional SVM-RFE and <italic>t</italic>-test. The reliability in identifying the genes and establishing the statistical significance of the findings makes the proposed approach an attractive alternative for cancer-related research based on gene expression profiling or other similar data.12/11/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2760827Grouped Gene Selection of Cancer via Adaptive Sparse Group Lasso Based on Conditional Mutual Information
https://www.computer.org/csdl/trans/tb/2018/06/08064713-abs.html
This paper deals with the problems of cancer classification and grouped gene selection. A weighted gene co-expression network on cancer microarray data is employed to identify modules corresponding to biological pathways, based on which a strategy for dividing genes into groups is presented. Using the conditional mutual information within each group, an integrated criterion is proposed and data-driven weights are constructed. These weights are shown to evaluate both the significance of individual genes and their influence on improving the correlation of all other gene pairs in each group. Furthermore, an adaptive sparse group lasso is proposed, for which an improved blockwise descent algorithm is developed. The results on four cancer data sets demonstrate that the proposed adaptive sparse group lasso can effectively perform classification and grouped gene selection.12/11/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2761871Co-Expression Network Approach to Studying the Effects of Botulinum Neurotoxin-A
https://www.computer.org/csdl/trans/tb/2018/06/08070322-abs.html
Botulinum Neurotoxin A (BoNT-A) is a potent neurotoxin with several clinical applications. The goal of this study was to utilize co-expression network theory to analyze temporal transcriptional data from skeletal muscle after BoNT-A treatment. Expression data for 2000 genes (extracted using a ranking heuristic) served as the basis for this analysis. Using weighted gene co-expression network analysis (WGCNA), we identified 19 co-expressed modules, further hierarchically clustered into five groups. Quantifying average expression and co-expression patterns across these groups revealed temporal aspects of muscle's response to BoNT-A. Functional analysis revealed enrichment of group 1 with metabolism; group 5 with contradictory functions of atrophy and cellular recovery; and groups 2 and 3 with extracellular matrix (ECM) and non-fast fiber isoforms. Topological positioning of two highly ranked, significantly expressed genes—<italic>Dclk1</italic> and <italic>Ostalpha</italic>—within group 5 suggested possible mechanistic roles in recovery from BoNT-A induced atrophy. Phenotypic correlations of groups with titin and myosin protein content further emphasized the effect of BoNT-A on the sarcomeric contraction machinery in early phase of chemodenervation. In summary, our approach revealed a hierarchical functional response to BoNT-A induced paralysis with early metabolic and later ECM responses and identified putative biomarkers associated with chemodenervation. Additionally, our results provide an unbiased validation of the response documented in our previous work.12/11/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2763949A Bi-Objective RNN Model to Reconstruct Gene Regulatory Network: A Modified Multi-Objective Simulated Annealing Approach
https://www.computer.org/csdl/trans/tb/2018/06/08100952-abs.html
A Gene Regulatory Network (GRN) is a virtual network in the cellular context of an organism, comprising a set of genes and their internal relationships, through which the genes regulate each other's protein production rates (gene expression levels) via coded proteins. Computational reconstruction of GRNs from gene expression data is a widely applied research area. A Recurrent Neural Network (RNN) is a useful modeling scheme for GRN reconstruction. In this research, the RNN formulation of GRN reconstruction, which has a single objective function, has been modified to incorporate a new objective function. An existing multi-objective meta-heuristic algorithm, called Archived Multi-Objective Simulated Annealing (AMOSA), has been modified and applied to this bi-objective RNN formulation. Executing the resulting algorithm (called AMOSA-GRN) on a gene expression dataset, a collection (termed an archive) of non-dominated GRNs has been obtained. Ensemble averaging has been applied to the archives obtained through a sequence of executions of AMOSA-GRN. The accuracy of the GRNs in the averaged archive, with respect to the gold standard GRN, varies in the range 0.875 – 1.0 (87.5 - 100 percent).12/11/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2771360RF-NR: Random Forest Based Approach for Improved Classification of Nuclear Receptors
https://www.computer.org/csdl/trans/tb/2018/06/08107505-abs.html
The Nuclear Receptor (NR) superfamily plays an important role in key biological, developmental, and physiological processes. Developing a method for the classification of NR proteins is an important step towards understanding the structure and functions of the newly discovered NR protein. The recent studies on NR classification are either unable to achieve optimum accuracy or are not designed for all the known NR subfamilies. In this study, we developed RF-NR, which is a Random Forest based approach for improved classification of nuclear receptors. The RF-NR can predict whether a query protein sequence belongs to one of the eight NR subfamilies or it is a non-NR sequence. The RF-NR uses spectrum-like features namely: Amino Acid Composition, Di-peptide Composition, and Tripeptide Composition. Benchmarking on two independent datasets with varying sequence redundancy reduction criteria, the RF-NR achieves better (or comparable) accuracy than other existing methods. The added advantage of our approach is that we can also obtain biological insights about the important features that are required to classify NR subfamilies. RF-NR is freely available at <uri>http://bcb.ncat.edu/RF_NR</uri><uri>/</uri>.12/11/2018 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2773063Computing Minimum Reaction Modifications in a Boolean Metabolic Network
https://www.computer.org/csdl/trans/tb/2018/06/08119863-abs.html
In metabolic network modification, we add enzymes and/or knock out genes to maximize biomass production with minimum side-effects. Although this problem has been studied in various settings via mathematical models including flux balance analysis, elementary modes, and Boolean models, some important problem settings still remain to be studied. In this paper, we consider the Boolean Reaction Modification (BRM) problem, where a host metabolic network and a reference metabolic network are given in the Boolean model. The host network initially produces some toxic compounds and cannot produce some necessary compounds, whereas the reference network can produce the necessary compounds; we should minimize the total number of reactions removed from the host network and added from the reference network so that the toxic compounds are not producible but the necessary compounds are producible in the resulting host network. We developed integer linear programming (ILP)-based methods for BRM and compared them with OptStrain and SimOptStrain. The results show that our method performed better at reducing the total number of added and removed reactions, while OptStrain and SimOptStrain performed better at optimizing the production of the target compound. Our software is freely available at “<uri>http://sunflower.kuicr.kyoto-u.ac.jp/~rogi/solBRM/solBRM.html</uri>”.12/11/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2777456SPF-CellTracker: Tracking Multiple Cells with Strongly-Correlated Moves Using a Spatial Particle Filter
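The Boolean producibility semantics underlying BRM (the paper itself solves the modification problem with ILP) can be sketched as a simple fixed-point computation; the toy network and function below are our own illustration:

```python
def producible(reactions, sources):
    """Boolean-model producibility: a compound is producible if it is a
    source nutrient, or if some reaction whose substrates are all already
    producible outputs it.  `reactions` is a list of
    (substrates, products) pairs; iterate until a fixed point is reached."""
    prod = set(sources)
    changed = True
    while changed:
        changed = False
        for subs, prods in reactions:
            # Reaction fires once every substrate is producible.
            if set(subs) <= prod and not set(prods) <= prod:
                prod |= set(prods)
                changed = True
    return prod

# Toy network: A + B -> C, C -> D, E -> F (E is not available).
rxns = [({"A", "B"}, {"C"}), ({"C"}, {"D"}), ({"E"}, {"F"})]
reachable = producible(rxns, sources={"A", "B"})
```

BRM then asks which reaction additions/removals make the "necessary" compounds land inside this fixed-point set while the "toxic" ones stay outside — a combinatorial constraint that motivates the ILP encoding.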
https://www.computer.org/csdl/trans/tb/2018/06/08186251-abs.html
Tracking many cells in time-lapse 3D image sequences is an important and challenging task in bioimage informatics. Motivated by a study of brain-wide 4D imaging of neural activity in <italic>C. elegans</italic>, we present a new method of multi-cell tracking. The data types to which the method is applicable are characterized as follows: (i) cells are imaged as globular-like objects, (ii) it is difficult to distinguish cells on the basis of shape and size only, (iii) the number of imaged cells is in the several-hundred range, (iv) the movements of nearby cells are strongly correlated, and (v) cells do not divide. We developed a tracking software suite that we call SPF-CellTracker. Incorporating the dependencies among the cells' movements into the prediction model is the key to reducing tracking errors: cell switching and coalescence of the tracked positions. We model the target cells' correlated movements as a Markov random field and derive a fast computation algorithm, which we call the spatial particle filter. With live-imaging data of the nuclei of <italic>C. elegans</italic> neurons, in which approximately 120 neuronal nuclei were imaged, the proposed method demonstrated improved accuracy compared to the standard particle filter and the method developed by Tokunaga et al. (2014).12/11/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2782255Altering Indispensable Proteins in Controlling Directed Human Protein Interaction Network
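The spatial particle filter builds on the standard bootstrap particle filter; a minimal single-object, 1-D version is sketched below (our illustration — the paper's contribution is extending this scheme to hundreds of cells with Markov-random-field-correlated moves):

```python
import numpy as np

def bootstrap_particle_filter(observations, n_particles=1000,
                              motion_sd=1.0, obs_sd=1.0, seed=3):
    """Plain bootstrap particle filter for one object in 1-D:
    predict with a random-walk motion model, weight each particle by
    its Gaussian observation likelihood, then resample."""
    rng = np.random.default_rng(seed)
    particles = rng.normal(observations[0], obs_sd, n_particles)
    estimates = []
    for z in observations:
        particles = particles + rng.normal(0, motion_sd, n_particles)  # predict
        w = np.exp(-0.5 * ((z - particles) / obs_sd) ** 2)             # weight
        w = w / w.sum()
        idx = rng.choice(n_particles, n_particles, p=w)                # resample
        particles = particles[idx]
        estimates.append(particles.mean())
    return np.array(estimates)

# Track a point drifting at constant speed under noisy position readings.
true_path = np.linspace(0.0, 20.0, 40)
obs = true_path + np.random.default_rng(4).normal(0, 1.0, 40)
est = bootstrap_particle_filter(obs)
```

Running this independently per cell is what causes the cell-switching and coalescence errors the abstract mentions; conditioning each cell's predicted move on its neighbors' moves is the correction SPF-CellTracker introduces.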
https://www.computer.org/csdl/trans/tb/2018/06/08267063-abs.html
The numerous interconnections within complex systems enable us to control networks towards a desired state through a few suitably selected nodes, which are called driver nodes. Recent works analyzed the directed human Protein-Protein Interaction (PPI) network based on structural control theory. They found that indispensable proteins, whose removal increases the number of driver nodes, are the primary targets of human viruses and drugs. However, the human PPI network is usually incomplete and may include many false-positive or false-negative interactions. That prompts us to ask whether these indispensable proteins are stable under possible structural changes. Here, we present a method to alter the type of indispensable proteins and thereby investigate their stability. By comparing the sets of indispensable proteins before and after structural changes to the network, we find that very few added or removed interactions can change the type of many indispensable nodes. Furthermore, some indispensable proteins are very sensitive to structural changes and have significantly fewer interactions than the other indispensable proteins. The results indicate that indispensable proteins are sensitive to structural changes. Therefore, approaches based on structural control theory should be used with caution because of the incomplete nature of these networks.12/11/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2796572A Simplified Description of Child Tables for Sequence Similarity Search
https://www.computer.org/csdl/trans/tb/2018/06/08288582-abs.html
Finding related nucleotide or protein sequences is a fundamental, diverse, and incompletely-solved problem in bioinformatics. It is often tackled by seed-and-extend methods, which first find “seed” matches of diverse types, such as spaced seeds, subset seeds, or minimizers. Seeds are usually found using an index of the reference sequence(s), which stores seed positions in a suffix array or related data structure. A child table is a fundamental way to achieve fast lookup in an index, but previous descriptions have been overly complex. This paper aims to provide a more accessible description of child tables, and demonstrate their generality: they apply equally to all the above-mentioned seed types and more. We also show that child tables can be used without LCP (longest common prefix) tables, reducing the memory requirement.12/11/2018 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2796064A Novel Computational Approach for Global Alignment for Multiple Biological Networks
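The seed lookup that a child table accelerates can be sketched with a plain suffix array and binary search; the child table's role is to replace such repeated binary-search comparisons with constant-time child lookups (toy construction, our own names):

```python
from bisect import bisect_left, bisect_right

def build_suffix_array(text):
    """Positions of all suffixes of `text` in lexicographic order
    (O(n^2 log n) toy construction; fine for an illustration)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def seed_positions(text, sa, seed):
    """All occurrences of `seed` in `text`, found by binary-searching
    the suffix array for the range of suffixes that start with `seed`."""
    prefixes = [text[i:i + len(seed)] for i in sa]  # length-|seed| prefixes, sorted
    lo = bisect_left(prefixes, seed)
    hi = bisect_right(prefixes, seed)
    return sorted(sa[lo:hi])

# Find every "ACG" seed match in a small reference sequence.
ref = "ACGTACGTGACG"
sa = build_suffix_array(ref)
hits = seed_positions(ref, sa, "ACG")
```

The same range-narrowing idea works for spaced seeds, subset seeds, and minimizers, which is the generality point the paper makes for child tables.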
https://www.computer.org/csdl/trans/tb/2018/06/08300635-abs.html
Due to rapid progress in modeling biological systems with biological networks, more and more protein-protein interaction (PPI) data are being produced. Analyzing protein-protein interaction networks aims to find regions of topological and functional (dis)similarity between the molecular networks of different species. The study of PPI networks has the potential to teach us much about life processes and diseases at the molecular level. Few methods have been developed for multiple PPI network alignment, and thus new network alignment methods are of compelling need. In this paper, we propose a novel algorithm for the global alignment of multiple protein-protein interaction networks called MAPPIN. The algorithm relies on information available for the proteins in the networks, such as sequence, function, and network topology. Our algorithm is designed to exploit current multi-core CPU architectures and has been extensively tested on real data (eight species). Our experimental results show that MAPPIN significantly outperforms NetCoffee in terms of coverage. Nevertheless, MAPPIN is handicapped by the time required to load the gene annotation file. An extensive comparison against pioneering PPI methods also shows that MAPPIN is often efficient in terms of coverage, mean entropy, or mean normalized entropy.12/11/2018 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2808529Identifying Gene Network Rewiring by Integrating Gene Expression and Gene Network Data
https://www.computer.org/csdl/trans/tb/2018/06/08302581-abs.html
Exploring the rewiring pattern of gene regulatory networks between different pathological states is an important task in bioinformatics. Although a number of computational approaches have been developed to infer differential networks from high-throughput data, most of them focus only on gene expression data. The valuable static gene regulatory network data accumulated in recent biomedical research are neglected. In this study, we propose a new Gaussian graphical model-based method to infer differential networks by integrating gene expression and static gene regulatory network data. We first evaluate the empirical performance of our method by comparing it with state-of-the-art methods using simulation data. We also apply our method to The Cancer Genome Atlas data to identify gene network rewiring between ovarian cancers with different platinum responses, and between breast cancers of the luminal A and basal-like subtypes. Hub genes in the estimated differential networks rediscover known genes associated with platinum resistance in ovarian cancer and signatures of the breast cancer intrinsic subtypes.12/11/2018 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2809603Using Machine Learning to Improve the Prediction of Functional Outcome in Ischemic Stroke Patients
https://www.computer.org/csdl/trans/tb/2018/06/08305616-abs.html
Ischemic stroke is a leading cause of disability and death worldwide among adults. The individual prognosis after stroke is extremely dependent on treatment decisions physicians take during the acute phase. In the last five years, several scores such as the ASTRAL, DRAGON, and THRIVE have been proposed as tools to help physicians predict the patient functional outcome after a stroke. These scores are rule-based classifiers that use features available when the patient is admitted to the emergency room. In this paper, we apply machine learning techniques to the problem of predicting the functional outcome of ischemic stroke patients, three months after admission. We show that a pure machine learning approach achieves only a marginally superior Area Under the ROC Curve (AUC) (<inline-formula><tex-math notation="LaTeX">$0.808\pm 0.085$</tex-math><alternatives><inline-graphic xlink:href="monteiro-ieq1-2811471.gif"/></alternatives></inline-formula>) than that of the best score (<inline-formula><tex-math notation="LaTeX">$0.771\pm 0.056$</tex-math><alternatives><inline-graphic xlink:href="monteiro-ieq2-2811471.gif"/></alternatives></inline-formula>) when using the features available at admission. However, we observed that by progressively adding features available at further points in time, we can significantly increase the AUC to a value above 0.90. We conclude that the results obtained validate the use of the scores at the time of admission, but also point to the importance of using more features, which require more advanced methods, when possible.12/11/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2811471Scaffolding of Ancient Contigs and Ancestral Reconstruction in a Phylogenetic Framework
https://www.computer.org/csdl/trans/tb/2018/06/08316878-abs.html
Ancestral genome reconstruction is an important task in analyzing the evolution of genomes. Recent progress in sequencing ancient DNA has led to the publication of so-called paleogenomes and allows the integration of this sequencing data into genome evolution analysis. However, the de novo assembly of ancient genomes is usually fragmented, due among other factors to DNA degradation over time. Integrated phylogenetic assembly addresses the issue of genome fragmentation in ancient DNA assembly while aiming to improve the reconstruction of all ancient genomes in the phylogeny simultaneously. The fragmented assembly of an ancient genome can be represented as an assembly graph, indicating contradictory ordering information of contigs. In this setting, our approach is to compare the ancient data with extant finished genomes. We generalize a reconstruction approach minimizing the Single-Cut-or-Join rearrangement distance towards multifurcating trees and include edge lengths to improve the reconstruction in practice. This results in a polynomial-time algorithm that includes additional ancient DNA data at one node in the tree, producing consistent reconstructions of ancestral genomes.12/12/2018 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2816034ANTENNA, a Multi-Rank, Multi-Layered Recommender System for Inferring Reliable Drug-Gene-Disease Associations: Repurposing Diazoxide as a Targeted Anti-Cancer Therapy
https://www.computer.org/csdl/trans/tb/2018/06/08318603-abs.html
Existing drug discovery processes follow a reductionist model of “one-drug-one-gene-one-disease,” which is inadequate for tackling complex diseases involving multiple malfunctioning genes. The availability of big omics data offers an opportunity to transform the drug discovery process into a new paradigm of systems pharmacology that focuses on designing drugs to target molecular interaction networks instead of a single gene. Here, we develop a reliable multi-rank, multi-layered recommender system, ANTENNA, to mine large-scale chemical genomics and disease association data for the prediction of novel drug-gene-disease associations. ANTENNA integrates a novel tri-factorization based dual-regularized weighted and imputed One Class Collaborative Filtering (OCCF) algorithm, tREMAP, with a statistical framework based on Random Walk with Restart to assess the reliability of specific predictions. In benchmark tests, tREMAP clearly outperforms the single-rank OCCF. We apply ANTENNA to a real-world problem: repurposing old drugs for new clinical indications that lack effective treatments. We discover that the FDA-approved drug diazoxide can inhibit multiple kinase genes responsible for many diseases, including cancer, and kill triple negative breast cancer (TNBC) cells efficiently <inline-formula><tex-math notation="LaTeX">${\text{(IC}}_{50} = {{0.87}}\,{{\mu}\text{M)}}$</tex-math><alternatives><inline-graphic xlink:href="lim-ieq1-2812189.gif"/></alternatives></inline-formula>. TNBC is a deadly disease without effective targeted therapies. Our finding demonstrates the power of big data analytics in drug discovery and in developing a targeted therapy for TNBC.12/11/2018 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2812189Robust Gene Circuit Control Design for Time-Delayed Genetic Regulatory Networks Without SUM Regulatory Logic
https://www.computer.org/csdl/trans/tb/2018/06/08335337-abs.html
This paper investigates the gene circuit control design problem for time-delayed genetic regulatory networks. In these networks, the time delays are unknown constants, and the genetic regulation does not follow conventional SUM regulatory logic; it is modeled as an unknown nonlinear function of the time-delayed states of the other genes in a cell. Using Lyapunov stability theory, a novel adaptive gene circuit control design approach is proposed for the genetic regulatory networks, where the unknown time delays are estimated online by adaptive algorithms and the unknown regulatory functions are approximated by neural networks. The design approach in this paper is delay-dependent and less conservative than the delay-independent approach. Theoretical analysis shows that the closed-loop system is asymptotically stable and that all the signals in the system converge to an adjustable neighborhood of the origin. Finally, a numerical example is given to show the effectiveness of the new design approach.12/11/2018 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2825445Predicting Hospital Readmission via Cost-Sensitive Deep Learning
https://www.computer.org/csdl/trans/tb/2018/06/08338085-abs.html
With the increased use of electronic medical records (EMRs), data mining on medical data has great potential to improve the quality of hospital treatment and increase the survival rate of patients. Early readmission prediction enables early intervention, which is essential to preventing serious or life-threatening events, and is a substantial contributor to reducing healthcare costs. Existing works on predicting readmission often focus on certain vital signs and diseases by extracting statistical features. They also fail to consider the skewness of class labels in medical data and the different costs of misclassification errors. In this paper, we leverage the merits of convolutional neural networks (CNNs) to automatically learn features from vital-sign time series, and categorical feature embedding to effectively encode feature vectors with heterogeneous clinical features, such as demographics, hospitalization history, vital signs, and laboratory tests. Both the features learnt via the CNN and the statistical features encoded via feature embedding are then fed into a multilayer perceptron (MLP) for prediction. We train the MLP with a cost-sensitive formulation to tackle the imbalance and skewness challenge. We validate the proposed approach on two real medical datasets from Barnes-Jewish Hospital; all data is taken from historical EMR databases and reflects the kinds of data that would realistically be available to a clinical prediction system in hospitals. We find that early prediction of readmission is possible, and our methods perform significantly better than the state-of-the-art methods used by hospitals. For example, using the general hospital wards data for 30-day readmission prediction, the area under the curve (AUC) for the proposed model was 0.70, significantly higher than that of all the baseline methods. 
Based on these results, a system is being deployed in hospital settings with the proposed forecasting algorithms to support treatment.12/11/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2827029Computational Models for Trapping Ebola Virus Using Engineered Bacteria
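The cost-sensitive formulation mentioned above can be illustrated with a per-sample weighted cross-entropy, where a missed readmission (false negative) is penalised more heavily than a false alarm; the weight values and function name below are illustrative assumptions, not the paper's.

```python
import numpy as np

def weighted_bce(y_true, p_pred, cost_fn=5.0, cost_fp=1.0, eps=1e-12):
    """Cost-sensitive binary cross-entropy: errors on true readmissions
    (y = 1) are weighted by cost_fn, errors on non-readmissions by
    cost_fp, pushing the classifier to avoid costly false negatives."""
    p = np.clip(p_pred, eps, 1.0 - eps)   # avoid log(0)
    loss = -(cost_fn * y_true * np.log(p)
             + cost_fp * (1.0 - y_true) * np.log(1.0 - p))
    return float(loss.mean())
```

With class skew (few readmissions), scaling `cost_fn` by the inverse class frequency is one common choice.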
https://www.computer.org/csdl/trans/tb/2018/06/08359015-abs.html
The outbreak of the Ebola virus in recent years has resulted in numerous research initiatives seeking new solutions to contain the virus. Among the approaches investigated are new vaccines to boost the immune system. An alternative post-exposure treatment is presented in this paper. The proposed approach for clearing the Ebola virus can be realized through a microfluidic attenuator containing engineered bacteria that trap Ebola virions flowing through the blood onto their membranes. The paper analyzes the chemical binding force between the virus and a genetically engineered bacterium, considering the opposing forces acting on the attachment point, including hydrodynamic tension and drag force. To test the efficacy of the technique, simulations of bacterial motility within a confined area to trap the virus were performed. More than 60 percent of the displaced virus could be collected within 15 minutes. While the proposed approach currently focuses on <italic>in vitro</italic> environments for trapping the virus, the system can be further developed into a future treatment system whereby blood is cycled out of the body into a microfluidic device that contains the engineered bacteria to trap viruses.12/11/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2836430Bioinformatic Workflow Extraction from Scientific Texts based on Word Sense Disambiguation
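One of the opposing hydrodynamic forces mentioned above can be approximated, at the low Reynolds numbers typical of flow around a bacterium, by Stokes' law; the default viscosity below is an assumed illustrative value for blood plasma, not a parameter taken from the paper.

```python
import math

def stokes_drag(radius_m, velocity_m_s, viscosity_pa_s=3.5e-3):
    """Stokes drag F = 6*pi*mu*r*v (newtons) on a sphere of radius r
    moving at speed v through a fluid of dynamic viscosity mu, valid
    in the low-Reynolds-number regime."""
    return 6.0 * math.pi * viscosity_pa_s * radius_m * velocity_m_s
```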
https://www.computer.org/csdl/trans/tb/2018/06/08385176-abs.html
This paper introduces a method for automatic workflow extraction from texts using Process-Oriented Case-Based Reasoning (POCBR). While current workflow management systems mostly implement complicated graphical tasks on top of advanced distributed solutions (e.g., cloud computing and grid computing), workflow knowledge acquisition from texts using case-based reasoning offers more expressive and semantic case representations. In this context, we propose an ontology-based workflow extraction framework to acquire processual knowledge from texts. Our methodology extends classic NLP techniques to extract and disambiguate complex tasks and relations in texts. Using a graph-based representation of workflows and a domain ontology, our extraction process uses a context-aware approach to recognize workflow components in texts: data and control flows. We applied our framework to a technical domain in bioinformatics: phylogenetic analyses. An evaluation based on workflow semantic similarities against a gold standard shows that our approach provides promising results in the process extraction domain. Both the data and the implementation of our framework are available at <uri>http://labo.bioinfo.uqam.ca/tgowler</uri>.12/11/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2847336A Slice-based <inline-formula><tex-math notation="LaTeX">$^{13}$</tex-math><alternatives><inline-graphic xlink:href="alazmi-ieq1-2849728.gif"/></alternatives></inline-formula>C-detected NMR Spin System Forming and Resonance Assignment Method
https://www.computer.org/csdl/trans/tb/2018/06/08395030-abs.html
Nuclear magnetic resonance (NMR) spectroscopy is attracting increasing attention in the field of computational structural biology. Until recently, <inline-formula><tex-math notation="LaTeX">$^1$</tex-math><alternatives><inline-graphic xlink:href="alazmi-ieq2-2849728.gif"/></alternatives></inline-formula>H-detected experiments were the dominant NMR technique due to the high sensitivity of <inline-formula><tex-math notation="LaTeX">$^1$</tex-math><alternatives><inline-graphic xlink:href="alazmi-ieq3-2849728.gif"/></alternatives></inline-formula>H nuclei. However, the current availability of high magnetic fields and cryogenically cooled probe heads allows researchers to overcome the low sensitivity of <inline-formula><tex-math notation="LaTeX">$^{13}$</tex-math><alternatives><inline-graphic xlink:href="alazmi-ieq4-2849728.gif"/></alternatives></inline-formula>C nuclei. Consequently, <inline-formula><tex-math notation="LaTeX">$^{13}$</tex-math><alternatives><inline-graphic xlink:href="alazmi-ieq5-2849728.gif"/></alternatives></inline-formula>C-detected experiments have become a popular technique in different NMR applications, especially resonance assignment and structure determination of large proteins. In this paper, we propose the first spin system forming method for <inline-formula><tex-math notation="LaTeX">$^{13}$</tex-math><alternatives><inline-graphic xlink:href="alazmi-ieq6-2849728.gif"/></alternatives></inline-formula>C-detected NMR spectra. Our method is able to accurately form spin systems based on as few as two <inline-formula><tex-math notation="LaTeX">$^{13}$</tex-math><alternatives><inline-graphic xlink:href="alazmi-ieq7-2849728.gif"/></alternatives></inline-formula>C-detected spectra, CBCACON and CBCANCO. Our method picks slices from the more trusted spectrum and uses them as feedback to direct the slice picking in the less trusted one. This feedback leads to picking accurate slices, which in turn helps to form better spin systems. 
We tested our method on a real dataset of ‘Ubiquitin’ and a benchmark simulated dataset consisting of 12 proteins. We fed our spin systems as inputs to a genetic algorithm to generate the chemical shift assignment, and obtained 92 percent correct chemical shift assignments for Ubiquitin. For the simulated dataset, we obtained an average recall of 86 percent and an average precision of 88 percent. Finally, our chemical shift assignment of Ubiquitin was given as input to the CS-ROSETTA server, which generated structures close to the experimentally determined structure.12/11/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2849728“Super Gene Set” Causal Relationship Discovery from Functional Genomics Data
https://www.computer.org/csdl/trans/tb/2018/06/08417929-abs.html
In this article, we present a computational framework to identify “causal relationships” among super gene sets. By “causal relationships,” we refer to both stimulatory and inhibitory regulatory relationships, regardless of whether they act through direct or indirect mechanisms. By super gene sets, we refer to “pathways, annotated lists, and gene signatures,” or PAGs. To identify causal relationships among PAGs, we extend previous work on identifying PAG-to-PAG regulatory relationships by further requiring them to be significantly enriched with gene-to-gene co-expression pairs across the two PAGs involved. This is achieved by developing a quantitative metric based on <underline>P</underline>AG-to-<underline>P</underline>AG <underline>C</underline>o-expressions (PPC), which we use to infer the likelihood that PAG-to-PAG relationships under examination are causal (either stimulatory or inhibitory). Since true causal relationships are unknown, we approximate the overall performance of inferring causal relationships with the performance of recalling known r-type PAG-to-PAG relationships from causal PAG-to-PAG inference, using a functional genomics benchmark dataset from the GEO database. We report an area-under-curve (AUC) performance of 0.81 for both precision and recall. By applying our framework to a myeloid-derived suppressor cell (MDSC) dataset, we further demonstrate that this framework is effective in helping build multi-scale biomolecular systems models with new insights on regulatory and causal links for downstream biological interpretations.12/11/2018 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2858755Guest Editorial for Special Section on BIBM 2015
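The co-expression enrichment underlying the PPC metric can be illustrated by counting strongly correlated gene pairs across two PAGs; the function and threshold below are an illustrative sketch, not the paper's exact statistic.

```python
import numpy as np

def cross_pag_coexpression(expr_a, expr_b, threshold=0.8):
    """Fraction of cross-PAG gene pairs whose expression profiles are
    strongly correlated (|Pearson r| >= threshold). expr_a and expr_b
    are genes-by-samples matrices for the two PAGs over the same
    samples; the enrichment of such pairs is the PPC-style signal."""
    na = expr_a.shape[0]
    # correlations between every gene in PAG A and every gene in PAG B
    r = np.corrcoef(np.vstack([expr_a, expr_b]))[:na, na:]
    return float((np.abs(r) >= threshold).mean())
```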
https://www.computer.org/csdl/trans/tb/2018/06/08573201-abs.html
12/11/2018 2:01 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2870626Editorial for Selected Papers of a Joint Conferences, Genome Informatics Workshop/International Conference on Bioinformatics (GIW/InCoB) 2015
https://www.computer.org/csdl/trans/tb/2018/06/08573202-abs.html
12/11/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2880126Special Section on Swarm-Based Algorithms and Applications in Computational Biology and Bioinformatics
https://www.computer.org/csdl/trans/tb/2018/06/08573238-abs.html
12/11/2018 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2879422Multiscale and Multimodal Analysis for Computational Biology
https://www.computer.org/csdl/trans/tb/2018/06/08573239-abs.html
12/11/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2838658Reviving the Two-State Markov Chain Approach
https://www.computer.org/csdl/trans/tb/2018/05/07929387-abs.html
Probabilistic Boolean networks (PBNs) are a well-established computational framework for modelling biological systems. The steady-state dynamics of PBNs is of crucial importance in the study of such systems. However, for large PBNs, which often arise in systems biology, obtaining the steady-state distribution poses a significant challenge. In this paper, we revive the two-state Markov chain approach to solve this problem. The paper makes three contributions. First, we identify a pitfall of the approach that can generate biased results, and we propose a few heuristics to avoid it. Second, we conduct an extensive experimental comparison of the extended two-state Markov chain approach and another approach based on the Skart method. We analyze the results with machine learning techniques and show that, statistically, the two-state Markov chain approach performs better. Finally, we demonstrate the potential of the extended two-state Markov chain approach on a case study of a large PBN model of apoptosis in hepatocytes.10/08/2018 4:56 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2704592Efficient Algorithms for Genomic Duplication Models
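The core of the two-state Markov chain idea can be sketched as follows: a binary property of the network state is tracked as a two-state chain, and its stationary probability is estimated from a long simulation run. The transition probabilities here are hypothetical; the real method additionally controls burn-in and sample size.

```python
import random

def stationary_estimate(p01, p10, steps, seed=0):
    """Estimate the stationary probability of state 1 for a two-state
    Markov chain with P(0->1) = p01 and P(1->0) = p10 by long-run
    simulation. The exact stationary value is p01 / (p01 + p10)."""
    rng = random.Random(seed)
    state, ones = 0, 0
    for _ in range(steps):
        if state == 0:
            state = 1 if rng.random() < p01 else 0
        else:
            state = 0 if rng.random() < p10 else 1
        ones += state
    return ones / steps
```

Bias of the kind discussed in the paper arises when the run is too short relative to the chain's mixing time.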
https://www.computer.org/csdl/trans/tb/2018/05/07932523-abs.html
An important issue in evolutionary molecular biology is to discover genomic duplication episodes and their correspondence to the species tree. Existing approaches vary in two fundamental aspects: the choice of evolutionary scenarios that model allowed locations of duplications in the species tree, and the rules for clustering gene duplications from gene trees into a single multiple duplication event. Here we study the clustering method called <italic>minimum episodes</italic> for several models of allowed evolutionary scenarios, with a focus on interval models in which every gene duplication has an interval of allowed locations in the species tree. We present mathematical foundations for general genomic duplication problems. Next, we propose the first linear time and space algorithm for minimum episodes clustering for any interval model, as well as an algorithm for the most general model in which every evolutionary scenario is allowed. We also present a comparative study of different models of genomic duplication based on simulated and empirical datasets. We provide algorithms and tools that efficiently solve minimum episodes clustering problems, and our comparative study helps to identify which model is the most reasonable choice for inferring genomic duplication events.10/08/2018 4:56 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2706679Inferring Gene-Species Assignments in the Presence of Horizontal Gene Transfer
https://www.computer.org/csdl/trans/tb/2018/05/07933200-abs.html
<italic><bold>Background:</bold></italic> Microbial communities from environmental samples show great diversity, as bacteria respond quickly to changes in their ecosystems. To assess these changes, metagenomics experiments aimed at sequencing genomic DNA from such samples are performed. These newly obtained sequences, together with already known ones, are used to infer phylogenetic trees and to assess the taxonomic groups to which the species carrying these genes belong. Here, we propose the first approach to the gene-species assignment problem that uses reconciliation with horizontal gene transfer. <italic><bold>Results:</bold></italic> We propose efficient algorithms that search for optimal gene-species mappings taking into account gene duplication, loss, and transfer events under two tractable models of HGT reconciliation. <italic><bold>Conclusions:</bold></italic> We calculate both the optimal cost and all possible optimal scenarios. Furthermore, as the number of optimal reconstructions can be large, we use a Monte-Carlo method to infer approximate distributions of gene-species assignments. We demonstrate the approach's applicability on empirical and simulated datasets.10/08/2018 4:53 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2707083Genome Rearrangement with ILP
https://www.computer.org/csdl/trans/tb/2018/05/07934076-abs.html
The <italic>weighted Genome Sorting Problem (wGSP)</italic> is to find a minimum-weight sequence of rearrangement operations that transforms a given gene order into another given gene order, using rearrangement operations that are associated with a predefined weight. This paper presents a polynomial-sized Integer Linear Program, called <italic>GeRe-ILP</italic>, for solving the wGSP for the following three types of rearrangement operations: <italic>inversion</italic>, <italic>transposition</italic>, and <italic>inverse transposition</italic>. <italic>GeRe-ILP</italic> uses <inline-formula><tex-math notation="LaTeX">$O(n^3)$</tex-math><alternatives> <inline-graphic xlink:href="hartmann-ieq1-2708121.gif"/></alternatives></inline-formula> variables and <inline-formula> <tex-math notation="LaTeX">$O(n^3)$</tex-math><alternatives><inline-graphic xlink:href="hartmann-ieq2-2708121.gif"/> </alternatives></inline-formula> constraints for gene orders of length <inline-formula><tex-math notation="LaTeX">$n$ </tex-math><alternatives><inline-graphic xlink:href="hartmann-ieq3-2708121.gif"/></alternatives></inline-formula>. We study experimentally on simulated data how different weighting schemes influence the reconstructed scenarios, as well as how the length of the gene orders and the size of the reconstructed scenarios affect the runtime of <italic>GeRe-ILP</italic>.10/08/2018 4:55 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2708121Biological Event Trigger Identification with Noise Contrastive Estimation
https://www.computer.org/csdl/trans/tb/2018/05/07936538-abs.html
Biological event extraction is an important task towards the goal of extracting biomedical knowledge from scientific publications by capturing biomedical entities and their complex relations from texts. As a crucial step in event extraction, event trigger identification, assigning words a suitable trigger category, has recently attracted substantial attention. As triggers are scattered across a large corpus, traditional linguistic parsers struggle to generate syntactic features from them. The resulting trigger sparsity problem thus restricts the model's learning process and becomes one of the main hindrances in trigger identification. In this paper, we employ Noise Contrastive Estimation with a Multi-Layer Perceptron model to address the trigger sparsity problem. Meanwhile, in light of recent advances in distributed word representations, word-embedding features generated by a language model are utilized to extract semantic and syntactic information. Finally, an experimental study on the commonly used MLEE dataset against baseline methods demonstrates promising results.10/08/2018 4:55 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2710048Local-Nearest-Neighbors-Based Feature Weighting for Gene Selection
https://www.computer.org/csdl/trans/tb/2018/05/07942061-abs.html
Selecting functional genes is essential for analyzing microarray data. Among the many available feature (gene) selection approaches, those based on the large margin nearest neighbor receive particular attention due to their low computational costs and high accuracy in analyzing high-dimensional data. Yet, there still exist some problems that hamper the existing approaches in sifting out real target genes, including selecting erroneous nearest neighbors, high sensitivity to irrelevant genes, and inappropriate evaluation criteria. Previous pioneering works have partly addressed some of these problems, but none of them solves all of them simultaneously. In this paper, we propose a new local-nearest-neighbors-based feature weighting approach to alleviate the above problems. The proposed approach locally minimizes the within-class distances and maximizes the between-class distances under the <inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="an-ieq1-2712775.gif"/></alternatives></inline-formula> nearest neighbors rule. We further define a feature weight vector, and construct it by minimizing a cost function with a regularization term. The proposed approach applies naturally to multi-class problems and requires no extra modification. Experimental results on the UCI and open microarray data sets validate the effectiveness and efficiency of the new approach.10/08/2018 4:55 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2712775Gene Tree Construction and Correction Using SuperTree and Reconciliation
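The local nearest-neighbor margin idea can be sketched with a simplified Relief-style weight update (nearest hit vs. nearest miss); this is an illustrative relative of such methods, not the paper's exact regularized cost-function formulation.

```python
import numpy as np

def relief_weights(X, y):
    """Relief-style feature weighting: for each sample, reward features
    that separate it from its nearest miss (closest other-class sample)
    and penalise features that separate it from its nearest hit (closest
    same-class sample), using Manhattan distance."""
    n, d = X.shape
    w = np.zeros(d)
    for i in range(n):
        diff = np.abs(X - X[i])           # per-feature distances to all samples
        dist = diff.sum(axis=1)
        dist[i] = np.inf                  # exclude the sample itself
        hit = int(np.argmin(np.where(y == y[i], dist, np.inf)))
        miss = int(np.argmin(np.where(y != y[i], dist, np.inf)))
        w += diff[miss] - diff[hit]       # large margin: miss far, hit near
    return w / n
```

Features with high weights separate the classes locally; irrelevant features accumulate weights near zero.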
https://www.computer.org/csdl/trans/tb/2018/05/07959594-abs.html
The supertree problem, which asks for a tree displaying a set of consistent input trees, has largely been considered for the reconstruction of species trees. Here, we instead explore this framework to reconstruct a gene tree from a set of input gene trees on partial data. In this perspective, the phylogenetic tree for the species containing the genes of interest can be used to choose among the many possible compatible “supergenetrees”, the most natural criterion being to minimize a reconciliation cost. We develop a variety of algorithmic solutions for the construction and correction of gene trees using the supertree framework. A dynamic programming supertree algorithm for constructing or correcting gene trees, exponential in the number of input trees, is first developed for the less constrained version of the problem. It is then adapted to gene trees with nodes labeled as duplication or speciation, the additional constraint being to preserve the orthology and paralogy relations between genes. Then, a quadratic time algorithm is developed for efficiently correcting an initial gene tree while preserving a set of “trusted” subtrees, as well as the relative phylogenetic distance between them, in both cases of labeled and unlabeled input trees. By applying these algorithms to the set of Ensembl gene trees, we show that this new correction framework is particularly useful for correcting weakly-supported duplication nodes. The C++ source code for the algorithms and simulations described in the paper is available at <uri>https://github.com/UdeM-LBIT/SuGeT</uri>.10/08/2018 4:56 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2720581NGS-FC: A Next-Generation Sequencing Data Format Converter
https://www.computer.org/csdl/trans/tb/2018/05/07967735-abs.html
With the widespread adoption of next-generation sequencing (NGS) technologies, millions of sequences have been produced. Many databases have been created to store and organize the high-throughput sequencing data, and numerous analysis software programs and tools have been developed over the past years. Most of them use specific formats for data representation and storage. Data interoperability has therefore become a crucial challenge, and many tools have been developed to convert NGS data from one format to another. However, most of them support only specific and limited formats. Here, we present NGS-FC (Next-Generation Sequencing Format Converter), which provides a framework to support conversion between several formats. It currently supports 14 formats and provides interfaces that enable users to improve the existing converters and add new ones. Moreover, NGS-FC achieves competitive overall performance in comparison with existing converters in terms of RAM usage and running time. The software is written in Java and can be executed standalone. The source code and documentation are freely available at <uri>http://sysbio.suda.edu.cn/NGS-FC</uri>.10/08/2018 4:53 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2722442A Study of Cell-Free DNA Fragmentation Pattern and Its Application in DNA Sample Type Classification
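As a toy illustration of the kind of conversion such a framework performs, here is a minimal FASTQ-to-FASTA converter; this sketch is not NGS-FC's actual Java interface.

```python
def fastq_to_fasta(fastq_lines):
    """Convert FASTQ records (4 lines each: '@id', sequence, '+', quality)
    to FASTA records ('>id', sequence), dropping the quality strings."""
    out = []
    for i in range(0, len(fastq_lines), 4):
        header, seq = fastq_lines[i], fastq_lines[i + 1]
        out.append('>' + header[1:])   # '@id' becomes '>id'
        out.append(seq)
    return out
```

The conversion is lossy (quality scores are discarded), which is why a general converter framework must track what each target format can represent.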
https://www.computer.org/csdl/trans/tb/2018/05/07968314-abs.html
Plasma cell-free DNA (cfDNA) has characteristic fragmentation patterns, which produce non-random base-content curves in the first cycles of sequencing data. We studied these patterns and found that we could determine whether a sample is cfDNA by examining only the first 10 cycles of its base-content curves. We analyzed 3,189 FastQ files, including 1,442 plasma cfDNA, 1,234 genomic DNA, 507 FFPE tumour DNA, and 6 urinary cfDNA. In-depth analysis of these data showed that the patterns are stable enough to distinguish cfDNA from other kinds of DNA samples. Based on this finding, we built classification models to recognize cfDNA samples from their sequencing data. Pattern recognition models were then trained with different classification algorithms such as k-nearest neighbors (KNN), random forest, and support vector machine (SVM). The result of 1,000-iteration .632+ bootstrapping showed that all these classifiers give an average accuracy higher than 98 percent, indicating that the cfDNA patterns are unique and make the dataset highly separable. The best result was obtained using a random forest classifier, with a 99.89 percent average accuracy (<inline-formula><tex-math notation="LaTeX">$\sigma =0.00068$</tex-math><alternatives> <inline-graphic xlink:href="chen-ieq1-2723388.gif"/></alternatives></inline-formula>). A tool called CfdnaPattern (<uri> http://github.com/OpenGene/CfdnaPattern</uri>) has been developed to train the model and to predict whether a sample is cfDNA.10/08/2018 4:53 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2723388Heavy-Tailed Noise Suppression and Derivative Wavelet Scalogram for Detecting DNA Copy Number Aberrations
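The classification step can be illustrated with a deliberately simple stand-in for the KNN/RF/SVM classifiers compared in the paper: a nearest-centroid rule over base-content feature vectors. The feature layout (per-base fractions over the first cycles, flattened) is an assumption for illustration.

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test vector (e.g., per-base content fractions over
    the first 10 sequencing cycles, flattened) to the class whose
    training centroid is nearest in Euclidean distance."""
    classes = np.unique(y_train)
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    # distance from every test point to every class centroid
    d = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return classes[d.argmin(axis=1)]
```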
https://www.computer.org/csdl/trans/tb/2018/05/07970153-abs.html
Most existing array comparative genomic hybridization (array CGH) data processing methods and evaluation models assume that the probability density function (pdf) of noise in array CGH data is a Gaussian distribution. In practice, however, the noise distribution is peaky and heavy-tailed. A Gaussian pdf is therefore not adequate to approximate the noise in array CGH data; it introduces false detections of chromosomal aberrations and leads to misunderstanding of disease pathogenesis. A more accurate and sufficient model of noise in array CGH data is necessary and beneficial for the detection of DNA copy number variations. We analyze real array CGH data from different platforms and show that the distribution of noise in array CGH data is fitted very well by the generalized Gaussian distribution (GGD). Based on our new noise model, we propose a novel array CGH processing method combining the advantages of both the smoothing and segmentation approaches. The new method uses a generalized Gaussian bivariate shrinkage function and a one-directional derivative wavelet scalogram in generalized Gaussian noise. In the smoothing step, with the new generalized Gaussian noise model, we derive a heavy-tailed noise suppression algorithm in the stationary wavelet domain. In the segmentation step, the 1D Gaussian derivative wavelet scalogram is employed to detect break points. Both real and simulated array CGH data with different noises (such as Gaussian noise, GGD noise, and real noise) are used in our experiments. We demonstrate that our new method outperforms other state-of-the-art methods in terms of both root mean squared error and receiver operating characteristic curves.10/08/2018 4:54 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2723884Combining Supervised and Unsupervised Learning for Improved miRNA Target Prediction
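Fitting the GGD to observed noise can be sketched with the classical moment-matching estimator of the shape parameter (beta = 2 recovers the Gaussian, beta < 2 is heavy-tailed). This is a standard textbook estimator, not necessarily the fitting procedure used in the paper, and it assumes zero-mean noise.

```python
import math

def ggd_shape(samples, lo=0.2, hi=10.0, iters=60):
    """Moment-matching estimate of the GGD shape parameter beta: invert
    r(beta) = Gamma(2/beta) / sqrt(Gamma(1/beta) * Gamma(3/beta)),
    which equals E|x| / sqrt(E[x^2]) for zero-mean GGD noise, by
    bisection (r is increasing in beta over the bracketed range)."""
    n = len(samples)
    m1 = sum(abs(x) for x in samples) / n
    m2 = sum(x * x for x in samples) / n
    target = m1 / math.sqrt(m2)

    def r(b):
        return math.gamma(2.0 / b) / math.sqrt(math.gamma(1.0 / b) * math.gamma(3.0 / b))

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if r(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```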
https://www.computer.org/csdl/trans/tb/2018/05/07979555-abs.html
MicroRNAs (miRNAs) are short non-coding RNAs that bind to mRNAs and regulate their expression. MiRNAs have been found to be associated with the initiation and progression of many complex diseases. Investigating miRNAs and their targets can thus help develop new therapies by designing anti-miRNA oligonucleotides. While existing computational approaches can predict miRNA targets, these predictions have low accuracy. In this paper, we propose a two-step approach to refine the results of sequence-based prediction algorithms. The first step, which is based on our previous work, uses an ensemble learning approach that combines multiple existing methods. The second step utilizes support vector machine (SVM) classifiers in one- and two-class modes to infer miRNA-mRNA interactions based on binding features as well as network features extracted from a gene regulatory network. Experimental results using two real data sets from TCGA indicate that the use of two-class SVM classification significantly improves the precision of miRNA-mRNA prediction.10/08/2018 4:55 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2727042Evolutionary Model for the Statistical Divergence of Paralogous and Orthologous Gene Pairs Generated by Whole Genome Duplication and Speciation
https://www.computer.org/csdl/trans/tb/2018/05/07981392-abs.html
We outline a principled approach to the analysis of duplicate gene similarity distributions, based on a model integrating sequence divergence and the process of fractionation of duplicate genes resulting from whole genome duplication (WGD). This model allows us to predict duplicate gene similarity distributions for a series of two or three WGDs, for whole genome triplication followed by a WGD, and for triplication, followed by speciation, followed by WGD. We calculate the probabilities of all possible fates of a gene pair as its two members proliferate or are lost, predicting the number of surviving pairs from each event. We discuss how to calculate maximum likelihood estimators for the parameters of these models, illustrating with an analysis of the distribution of paralog similarities in the poplar genome.10/08/2018 4:55 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2712695Hardware Accelerator for the Multifractal Analysis of DNA Sequences
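The fate calculation for a single duplicate pair after one WGD can be illustrated under the simplest possible assumption, namely that each copy is lost independently with probability u during fractionation; the paper's model tracks richer sequences of events.

```python
def pair_fates(u):
    """Probabilities of the three fates of a duplicate gene pair created
    by a WGD when each copy is independently lost with probability u
    during fractionation: (pair survives, reduced to a single gene,
    both copies lost). Independence of losses is an assumption made
    only for this illustration."""
    return ((1 - u) ** 2, 2 * u * (1 - u), u ** 2)
```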
https://www.computer.org/csdl/trans/tb/2018/05/07990225-abs.html
Multifractal analysis has made it possible to quantify the genetic variability and non-linear stability along the human genome sequence. It has implications for explaining several genetic diseases caused by certain chromosome abnormalities, among other genetic particularities. The multifractal analysis of a genome is carried out by dividing the complete DNA sequence into smaller fragments and calculating the generalized dimension spectrum of each fragment using the chaos game representation and the box-counting method. This is a time-consuming process because it involves processing large data sets using floating-point representation. To reduce the computation time, we designed an application-specific processor, here called the multifractal processor, which is based on our proposed hardware-oriented algorithm for efficiently calculating the generalized dimension spectrum of DNA sequences. The multifractal processor was implemented on a low-cost SoC-FPGA and was verified by processing a complete human genome. The execution time and numeric results of the multifractal processor were compared with those obtained from a software implementation executed on a 20-core workstation, achieving a speedup of 2.6x and an average error of 0.0003 percent.10/08/2018 4:54 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2731339Improving Alzheimer's Disease Classification by Combining Multiple Measures
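The pipeline the abstract describes — chaos game representation (CGR) of a DNA fragment followed by box counting — can be sketched in a few lines. A minimal illustration, assuming the conventional assignment of A, C, G, T to the corners of the unit square; the authors' fixed-point hardware algorithm and generalized dimension spectrum are not reproduced here.

```python
# Hedged sketch: CGR of a DNA fragment plus a simple box count, as used in
# multifractal genome analysis. Corner assignment is the conventional one;
# the sample sequence is made up for illustration.

CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(seq):
    """Map a DNA string to CGR points: each step moves halfway toward the
    corner of the current nucleotide, starting at the square's center."""
    x, y = 0.5, 0.5
    pts = []
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        pts.append((x, y))
    return pts

def box_count(points, k):
    """Number of occupied boxes on a 2**k x 2**k grid over the unit square."""
    n = 2 ** k
    occupied = {(min(int(x * n), n - 1), min(int(y * n), n - 1))
                for x, y in points}
    return len(occupied)

pts = cgr_points("ACGTACGTTTGGCCAATA" * 50)
# Occupied-box counts at successively finer scales; the slope of
# log(count) versus log(scale) underlies the dimension estimates.
counts = [box_count(pts, k) for k in range(1, 6)]
```

In the real analysis this count is taken over many scales and fragment sizes, which is what makes the workload heavy enough to motivate a hardware accelerator.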
https://www.computer.org/csdl/trans/tb/2018/05/07990572-abs.html
Several anatomical magnetic resonance imaging (MRI) markers for Alzheimer's disease (AD) have been identified. Cortical gray matter volume, cortical thickness, and subcortical volume have been used successfully to assist the diagnosis of Alzheimer's disease, including its early warning and developing stages, e.g., mild cognitive impairment (MCI), comprising MCI converted to AD (MCIc) and MCI not converted to AD (MCInc). Currently, these anatomical MRI measures have mainly been used separately, so the full potential of anatomical MRI scans for AD diagnosis might not yet have been exploited. Moreover, most current studies focus only on morphological features of regions of interest (ROIs) or on interregional features, without considering their combination. To further improve the diagnosis of AD, we propose a novel approach that extracts ROI features and interregional features based on multiple measures from MRI images to distinguish AD, MCI (including MCIc and MCInc), and healthy controls (HC). First, we construct six individual networks based on six different anatomical measures (i.e., CGMV, CT, CSA, CC, CFI, and SV) and the Automated Anatomical Labeling (AAL) atlas for each subject. Then, for each individual network, we extract all node (ROI) features and edge (interregional) features, denoted as the node feature set and the edge feature set, respectively. We thus obtain six node feature sets and six edge feature sets from the six anatomical measures. Next, each feature within a feature set is ranked by <inline-formula><tex-math notation="LaTeX">$F$</tex-math><alternatives> <inline-graphic xlink:href="wang-ieq1-2731849.gif"/></alternatives></inline-formula>-score in descending order, and the top-ranked features of each feature set are fed to the MKBoost algorithm to obtain the best classification accuracy.
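The F-score ranking step above can be illustrated with the widely used two-class F-score of Chen and Lin; this particular definition is an assumption, since the abstract does not spell out the formula, and the toy data below are made up.

```python
# Hedged sketch of F-score feature ranking (Chen-and-Lin style definition,
# assumed). Higher scores mark features whose class means are far apart
# relative to their within-class spread.

def _mean(v):
    return sum(v) / len(v)

def _var(v, m):
    """Sample variance around a given mean m."""
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)

def f_score(pos, neg):
    """Two-class F-score of one feature (higher = more discriminative)."""
    m_all = _mean(pos + neg)
    m_pos, m_neg = _mean(pos), _mean(neg)
    num = (m_pos - m_all) ** 2 + (m_neg - m_all) ** 2
    den = _var(pos, m_pos) + _var(neg, m_neg)
    return num / den

def rank_features(x_pos, x_neg):
    """Feature indices sorted by descending F-score.
    x_pos, x_neg: samples per class, each sample a list of feature values."""
    n_feat = len(x_pos[0])
    scores = [f_score([s[i] for s in x_pos], [s[i] for s in x_neg])
              for i in range(n_feat)]
    return sorted(range(n_feat), key=lambda i: scores[i], reverse=True)

# Feature 0 separates the two toy classes; feature 1 is noise.
ranking = rank_features([[1.0, 0.2], [1.1, 0.9]], [[-1.0, 0.5], [-1.2, 0.4]])
```

The top-ranked indices would then feed the downstream classifier, mirroring the abstract's "top ranked features of each feature set" step.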
After obtaining the best classification accuracy, we obtain the optimal feature subset and the corresponding classifier for each node or edge feature set. Afterwards, to investigate the classification performance with only node features, we propose a weighted multiple kernel learning (wMKL) framework to combine these six optimal node feature subsets into a combined classifier for AD classification. Similarly, we obtain the classification performance with only edge features. Finally, we combine the six optimal node feature subsets and six optimal edge feature subsets to further improve the classification performance. Experimental results show that the proposed method outperforms several state-of-the-art methods in AD classification and demonstrate that different measures contain complementary information.10/08/2018 4:55 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2731849Continuous Petri Nets and microRNA Analysis in Melanoma
https://www.computer.org/csdl/trans/tb/2018/05/07997743-abs.html
Personalized target therapies represent one possible treatment strategy in the ongoing battle against cancer, and new treatment interventions are still needed for effective and successful cancer therapy. In this scenario, we simulated and analyzed the dynamics of BRAF V600E melanoma patients treated with BRAF inhibitors in order to find potentially interesting targets that may make standard treatments more effective in particularly aggressive tumors that may not respond to selective inhibitor drugs. To this aim, we developed a continuous Petri net model that simulates fundamental signalling cascades involved in melanoma development, such as MAPK and PI3K/AKT, in order to analyze these complex kinase cascades in depth and predict new crucial nodes involved in melanomagenesis. The model pointed out that some microRNAs, such as hsa-mir-132, downregulate the expression levels of p120RasGAP: under high concentrations of p120RasGAP, MAPK pathway activation is significantly decreased, and consequently so is PI3K/PDK1/AKT activation. Furthermore, our analysis carried out through the Genomic Data Commons (GDC) Data Portal shows evidence that hsa-mir-132 is significantly associated with clinical outcome in melanoma genomic data sets of BRAF-mutated patients. In conclusion, targeting miRNAs through antisense oligonucleotide technology may offer a way to enhance the action of BRAF inhibitors.10/08/2018 4:55 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2733529Bijective Diameters of Gene Tree Parsimony Costs
https://www.computer.org/csdl/trans/tb/2018/05/08002594-abs.html
Synthesizing median trees from a collection of gene trees under the biologically motivated gene tree parsimony (GTP) costs has provided credible species tree estimates. GTP costs are defined for each of the classic evolutionary processes. These costs count the minimum number of events necessary to reconcile the gene tree with the species tree, where the leaf-genes are mapped to the leaf-species through a function called a labeling. To better understand the synthesis of median trees under these costs, there is increased interest in analyzing their diameters. The diameters of a GTP cost between a gene tree and a species tree are the maximum values of this cost over one or both topologies of the trees involved. We are concerned with the diameters of the GTP costs under bijective labelings. While these diameters are computable in linear time for the gene duplication and deep coalescence costs, this has been unknown for the classic gene duplication and loss cost and for the loss cost. For the first time, we show how to compute these diameters and prove that this can be achieved in linear time, thus completing the computational time analysis for all of the bijective diameters under the GTP costs.10/08/2018 4:53 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2735968Influence of Airway Secretion on Airflow Dynamics of Mechanical Ventilated Respiratory System
https://www.computer.org/csdl/trans/tb/2018/05/08006238-abs.html
Secretions in the airways of mechanically ventilated patients are extremely dangerous to patients’ health. Recent studies have adopted a continuous constant airflow, which is not consistent with the clinical situation. To study respiratory airflow dynamics with secretion in the airways, a mathematical model based on clinical mechanical ventilation is established in this paper. To illustrate the secretion's influence on the airflow dynamics of the mechanically ventilated respiratory system, three key parameters are studied: the cross-section area ratio of secretion to pipe, the air-secretion contact area, and the secretion viscosity. An experimental study confirms the accuracy and dependability of the model. Simulations of a model that combines two airways and two model lungs show that, when one airway is covered with secretion, the maximum pressure of the model lung attached to the end of that airway remains constant while the cross-section area ratio is below 66 percent and then declines sharply as the ratio increases, but it remains constant as the air-secretion contact area grows; the maximum flow declines as both the cross-section area ratio and the air-secretion contact area increase. For the other airway, the maximum pressure of the model lung shows no significant change as the area ratio and air-secretion contact area grow, whereas the maximum flow rises. Moreover, the secretion viscosity has barely any influence on airflow dynamics. From these results, we conclude that the cross-section area ratio of secretion to pipe has a greater influence on airflow dynamic characteristics than the air-secretion contact area and the secretion viscosity. 
This paper lays the foundation for further study of the efficacy and safety of mechanical ventilation and of secretion clearance in mechanically ventilated patients. In addition, the mathematical model proposed in this paper can also be used to study secretion movement in human airways.10/08/2018 4:54 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2737621Multiple Network Alignment via MultiMAGNA++
https://www.computer.org/csdl/trans/tb/2018/05/08013767-abs.html
Network alignment (NA) aims to find a node mapping that identifies topologically or functionally similar network regions between molecular networks of different species. Analogous to genomic sequence alignment, NA can be used to transfer biological knowledge from well- to poorly-studied species between aligned network regions. Pairwise NA (PNA) finds similar regions between two networks, while multiple NA (MNA) can align more than two networks. We focus on MNA. Existing MNA methods aim to maximize total similarity over all aligned nodes (node conservation). They then evaluate alignment quality by measuring the number of conserved edges, but only after the alignment is constructed. Directly optimizing edge conservation during alignment construction, in addition to node conservation, may result in superior alignments. Thus, we present a novel MNA method called multiMAGNA++ that can achieve this. Indeed, multiMAGNA++ outperforms or is on par with existing MNA methods, while often completing faster. That is, multiMAGNA++ scales well to larger network data and can be parallelized effectively. During method evaluation, we also introduce new MNA quality measures that allow for fairer MNA method comparison than the existing alignment quality measures. The multiMAGNA++ code is available on the method's web page at <uri> http://nd.edu/~cone/multiMAGNA++/</uri>.10/08/2018 4:55 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2740381Optimizing Phylogenetic Queries for Performance
https://www.computer.org/csdl/trans/tb/2018/05/08016384-abs.html
The vast majority of phylogenetic databases do not support declarative querying, with which their contents could be flexibly and conveniently accessed, and the template-based query interfaces they do support do not allow arbitrary speculative queries. They therefore also do not support query optimization that leverages unique phylogeny properties. While a small number of graph query languages such as XQuery, Cypher, and GraphQL exist for computer-savvy users, most are too general and complex to be useful for biologists, and too inefficient for large phylogeny querying. In this paper, we discuss a recently introduced visual query language, called <italic>PhyQL</italic>, that leverages phylogeny-specific properties to support essential and powerful constructs for a large class of phylogenetic queries. We develop a range of pruning aids and propose a substantial set of query optimization strategies using these aids, suitable for large phylogeny querying. A hybrid optimization technique that exploits a set of indices and “graphlet” partitioning is discussed. A “fail soonest” strategy is used to avoid hopeless processing and is shown to pay dividends. Possible novel optimization techniques yet to be explored are also discussed.10/08/2018 4:56 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2743706Integrating Imaging Genomic Data in the Quest for Biomarkers of Schizophrenia Disease
https://www.computer.org/csdl/trans/tb/2018/05/08025561-abs.html
It is increasingly important, yet difficult, to determine potential biomarkers of schizophrenia (SCZ), owing to the complex pathophysiology of this disease. In this study, a network-fusion-based framework was proposed to identify genetic biomarkers of SCZ. A three-step feature selection was applied to single nucleotide polymorphism (SNP), DNA methylation, and functional magnetic resonance imaging (fMRI) data to select important features, which were then used to construct two gene networks in different states for the SNP and DNA methylation data, respectively. The two health networks (one for the SNP data and the other for the DNA methylation data) were combined into a single health network from which health minimum spanning trees (MSTs) were extracted; the two disease networks followed the same procedure. Genes with significant changes were identified as SCZ biomarkers by comparing the MSTs in the two states, and they were finally validated from five aspects. The effectiveness of the proposed discovery framework was also demonstrated by comparison with other network-based discovery methods. In summary, our approach provides a general framework for discovering gene biomarkers of complex diseases by integrating imaging genomic data, which can be applied to the diagnosis of complex diseases in the future.10/08/2018 4:56 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2748944DNA Watermarking Using Codon Postfix Technique
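Extracting a minimum spanning tree from a weighted gene network, as in the framework above, is a standard graph step; a minimal Kruskal sketch on a toy graph (the gene names and weights are illustrative, not from the study).

```python
# Hedged sketch: Kruskal's MST over a toy gene network. A weight such as
# 1 - |correlation| makes the MST keep the most strongly related gene pairs.

def mst_kruskal(edges):
    """Kruskal's minimum spanning tree.
    edges: list of (weight, gene_a, gene_b); returns the chosen edges."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for w, a, b in sorted(edges):
        ra, rb = find(a), find(b)
        if ra != rb:                # edge joins two components: keep it
            parent[ra] = rb
            tree.append((w, a, b))
    return tree

# Illustrative edges only; gene symbols are placeholders.
net = [(0.1, "COMT", "DRD2"), (0.4, "DRD2", "GRIN2B"),
       (0.3, "COMT", "GRIN2B"), (0.8, "GRIN2B", "AKT1"),
       (0.6, "DRD2", "AKT1")]
tree = mst_kruskal(net)
```

Comparing the MSTs of the health and disease networks then amounts to checking which genes change their tree neighborhoods between the two states.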
https://www.computer.org/csdl/trans/tb/2018/05/08047268-abs.html
DNA watermarking is a data hiding technique that aims to protect the copyright of DNA sequences and ensure the security of private genetic information. In this paper, we propose a novel DNA watermarking technique that can be used to embed binary bits into real DNA sequences. The proposed technique mutates the codon postfix according to the embedded bit. Our method was tested on a sample set of DNA sequences, and the extracted bits showed robustness against mutation. Furthermore, the proposed DNA watermarking method proved to be secure, undetectable, resistant, and preservative of biological functions.10/08/2018 4:54 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2754496Multi-Features Prediction of Protein Translational Modification Sites
https://www.computer.org/csdl/trans/tb/2018/05/08052538-abs.html
Post-translational modification plays a significant role in biological processes. A potential post-translational modification site is composed of the center site and the adjacent amino acid residues, which are fundamental protein sequence residues. Identifying such sites can help explain how proteins perform their biological functions and contributes to understanding the molecular mechanisms that are the foundations of protein design and drug design. Existing algorithms for predicting modified sites often suffer from shortcomings such as low stability and accuracy. In this paper, a combination of physical, chemical, statistical, and biological properties of a protein is utilized as the features, and a novel framework is proposed to predict a protein's post-translational modification sites. A multi-layer neural network and a support vector machine are invoked to predict the potential modified sites using selected features that include the compositions of amino acid residues, the E-H description of protein segments, and several properties from the AAIndex database. To deal with possibly redundant information, feature selection is performed in the preprocessing step of this research. The experimental results show that the proposed method improves the accuracy of this classification task.10/08/2018 4:53 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2752703Species Tree Estimation Using ASTRAL: How Many Genes Are Enough?
https://www.computer.org/csdl/trans/tb/2018/05/08053780-abs.html
Species tree reconstruction from genomic data is increasingly performed using methods that account for sources of gene tree discordance such as incomplete lineage sorting. One popular method for reconstructing species trees from unrooted gene tree topologies is ASTRAL. In this paper, we derive theoretical sample complexity results for the number of genes required by ASTRAL to guarantee reconstruction of the correct species tree with high probability. We also validate those theoretical bounds in a simulation study. Our results indicate that ASTRAL requires <inline-formula> <tex-math notation="LaTeX">$O(f^{-2}\log n)$</tex-math><alternatives> <inline-graphic xlink:href="shekhar-ieq1-2757930.gif"/></alternatives></inline-formula> gene trees to reconstruct the species tree correctly with high probability where <inline-formula><tex-math notation="LaTeX">$n$</tex-math> <alternatives><inline-graphic xlink:href="shekhar-ieq2-2757930.gif"/></alternatives></inline-formula> is the number of species and <inline-formula><tex-math notation="LaTeX">$f$</tex-math><alternatives> <inline-graphic xlink:href="shekhar-ieq3-2757930.gif"/></alternatives></inline-formula> is the length of the shortest branch in the species tree. Our simulations, some under the anomaly zone, show trends consistent with the theoretical bounds and also provide some practical insights on the conditions where ASTRAL works well.10/15/2018 2:01 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2757930Cell Population Tracking in a Honeycomb Structure Using an IMM Filter Based 3D Local Graph Matching Model
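The practical reading of the <inline-formula><tex-math notation="LaTeX">$O(f^{-2}\log n)$</tex-math></inline-formula> bound above: halving the shortest branch length quadruples the number of gene trees required, while doubling the number of species adds only a logarithmic factor. A toy arithmetic illustration with an arbitrary constant c (an assumption; the paper's analysis determines the actual constant).

```python
# Hedged sketch of the scaling behavior implied by the O(f**-2 * log n)
# sample-complexity bound. The constant c = 1.0 is purely illustrative.
import math

def genes_needed(f, n, c=1.0):
    """Illustrative gene-tree count for n species with shortest branch f."""
    return c * f ** -2 * math.log(n)

base = genes_needed(0.1, 100)
halved_branch = genes_needed(0.05, 100)  # halving f quadruples the need
more_species = genes_needed(0.1, 200)    # doubling n adds only a log term
```

This is why short internal branches (including those near the anomaly zone) dominate the data requirements far more than the sheer number of species does.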
https://www.computer.org/csdl/trans/tb/2018/05/08060531-abs.html
Developing algorithms for plant cell population tracking is critical for modeling plant cell growth patterns and gene expression dynamics. Tracking plant cells in microscopic image stacks is challenging for several reasons: (1) plant cells are densely packed in a specific honeycomb structure; (2) they divide frequently; and (3) they are imaged in different layers within 3D image stacks. Building on an existing 2D local graph matching algorithm, this paper focuses on a 3D plant cell matching model that exploits the cells’ 3D spatiotemporal context. Furthermore, the Interacting Multiple Model (IMM) filter is combined with the 3D local graph matching model to track the plant cell population simultaneously. Because our tracking algorithm does not require the identification of “tracking seeds”, the tracking stability and efficiency are greatly enhanced. Finally, the plant cell lineages are obtained by associating the cell tracklets using a maximum-a-posteriori (MAP) method. Experimental results on multiple datasets show that, compared with the 2D matching method, our approach not only improves the tracking accuracy by 18 percent but also successfully tracks the plant cells located at the high-curvature primordial region, which was not addressed in previous work.10/08/2018 2:01 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2760300Unified Deep Learning Architecture for Modeling Biology Sequence
https://www.computer.org/csdl/trans/tb/2018/05/08062800-abs.html
Prediction of the spatial structure or function of biological macromolecules from their sequences remains an important challenge in bioinformatics. When modeling biological sequences with traditional sequence models, long-range interactions, complicated and variable output label structures, and variable sequence lengths usually force case-by-case solutions. This study proposes a unified deep learning architecture based on long short-term memory or gated recurrent units to capture long-range interactions. The architecture provides an optional reshape operator to adapt to the diversity of output labels and implements a training algorithm that supports sequence models capable of processing variable-length sequences. Merging and pooling operators enhance the ability to capture short-range interactions between the basic units of biological sequences. The proposed deep-learning architecture and its training algorithm may be capable of solving a variety of current biological sequence-modeling problems under a unified framework. We validated the model on one of the most difficult biological sequence-modeling problems, protein residue interaction prediction. The results indicate that the model's accuracy in predicting residue interactions exceeded that of popular approaches by 10 percent on multiple widely used benchmarks.10/08/2018 4:54 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2760832Identifying Condition-Specific Modules by Clustering Multiple Networks
https://www.computer.org/csdl/trans/tb/2018/05/08063914-abs.html
Condition-specific modules in multiple networks must be determined to reveal the underlying molecular mechanisms of diseases. Current algorithms exhibit limitations such as low accuracy and high sensitivity to the number of networks because they discover condition-specific modules in multiple networks by separating the specificity and modularity of modules. To overcome these limitations, we characterize a condition-specific module as a group of genes whose connectivity is strong in the corresponding network and weak in the other networks; this strategy accurately depicts the topological structure of condition-specific modules. We then transform the condition-specific module discovery problem into a clustering problem in multiple networks. We develop an efficient heuristic algorithm for the <underline>S</underline>pecific <underline>M</underline>odules in <underline>M</underline>ultiple <underline>N</underline>etworks (<italic>SMMN</italic>), which discovers condition-specific modules by considering multiple networks. Using artificial networks, we demonstrate that SMMN outperforms state-of-the-art methods. In breast cancer networks, stage-specific modules discovered by SMMN are more discriminative in predicting cancer stages than those obtained by other techniques. In pan-cancer networks, cancer-specific modules are more likely to be associated with the survival time of patients, which is critical for cancer therapy.10/08/2018 4:54 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2761339MPIGeneNet: Parallel Calculation of Gene Co-Expression Networks on Multicore Clusters
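One simple way to make the characterization above concrete — connectivity strong in the target network, weak elsewhere — is a density-gap score. This is an illustrative stand-in, not SMMN's actual objective; the gene names and edges below are made up.

```python
# Hedged sketch: score a candidate module by its edge density in the target
# network minus its worst-case density in the other networks.

def density(genes, edges):
    """Fraction of possible pairs within `genes` that are connected."""
    gs = set(genes)
    k = len(gs)
    if k < 2:
        return 0.0
    inside = sum(1 for a, b in edges if a in gs and b in gs)
    return inside / (k * (k - 1) / 2)

def specificity(module, target_edges, other_networks):
    """High when the module is dense in the target network only."""
    return density(module, target_edges) - max(
        density(module, edges) for edges in other_networks)

# Toy module: a triangle in the "cancer" network, a single edge elsewhere.
module = ["TP53", "BRCA1", "ATM"]
cancer_net = [("TP53", "BRCA1"), ("BRCA1", "ATM"), ("TP53", "ATM")]
normal_net = [("TP53", "BRCA1")]
score = specificity(module, cancer_net, [normal_net])
```

A clustering procedure like SMMN's would, roughly speaking, search for gene groups maximizing a specificity criterion of this flavor across all candidate networks jointly.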
https://www.computer.org/csdl/trans/tb/2018/05/08063929-abs.html
In this work, we present <italic>MPIGeneNet</italic>, a parallel tool that applies Pearson's correlation and Random Matrix Theory to construct gene co-expression networks. It is based on the state-of-the-art sequential tool <italic>RMTGeneNet</italic>, which provides networks with high robustness and sensitivity at the expense of relatively long runtimes for large-scale input datasets. <italic>MPIGeneNet</italic> returns the same results as <italic>RMTGeneNet</italic> but improves the memory management, reduces the I/O cost, and accelerates the two most computationally demanding steps of co-expression network construction by exploiting the compute capabilities of common multicore CPU clusters. Our performance evaluation on two different systems using three typical input datasets shows that <italic>MPIGeneNet</italic> is significantly faster than <italic>RMTGeneNet</italic>. For example, our tool is up to 175.41 times faster on a cluster with eight nodes, each containing two 12-core Intel Haswell processors. The source code of <italic>MPIGeneNet</italic>, as well as a reference manual, is available at <uri> https://sourceforge.net/projects/mpigenenet/</uri>.10/08/2018 4:53 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2761340fmpRPMF: A Web Implementation for Protein Identification by Robust Peptide Mass Fingerprinting
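The first of the demanding steps that MPIGeneNet parallelizes is the all-pairs Pearson correlation over expression profiles. A serial pure-Python sketch of just that step (the Random Matrix Theory thresholding and the MPI parallelization are omitted; profiles are assumed non-constant).

```python
# Hedged sketch: all-pairs Pearson correlation, the core of co-expression
# network construction. Toy profiles below are made up.
import math

def pearson(x, y):
    """Pearson correlation coefficient of two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # assumes non-constant profiles (sx, sy > 0)

def coexpression_matrix(profiles):
    """All-pairs correlation matrix over a list of expression profiles."""
    g = len(profiles)
    return [[pearson(profiles[i], profiles[j]) for j in range(g)]
            for i in range(g)]

m = coexpression_matrix([[1.0, 2.0, 3.0],
                         [2.0, 4.0, 6.0],
                         [3.0, 2.0, 1.0]])
```

Because the matrix is symmetric and each entry is independent, this step partitions naturally across MPI ranks, which is precisely what makes the reported speedups attainable.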
https://www.computer.org/csdl/trans/tb/2018/05/08067527-abs.html
Peptide mass fingerprinting continues to play an important role in current proteomics studies, given its good sample throughput, specificity for single peptides, and insensitivity to unexpected post-translational modifications compared with MS<sup>n</sup>. We previously proposed and evaluated the use of feature-matching pattern-based support vector machines (SVMs) for robust protein identification. This approach is now facilitated by an updated web server (fmpRPMF) incorporating several newly developed or improved modules and workflows that allow identification of proteins from MS<sup>1</sup> data. The latest fmpRPMF web tool provides a rapid and effective strategy for narrowing the range of candidate proteins. First, a mass-scanning procedure screens all candidate proteins matching the theoretical peptide mass at least three times, thereby reducing the number of candidate proteins from tens of thousands to thousands. Second, a crude ranking procedure screens true-positive proteins among the top-ranked candidates based on 17 selected features, reducing the number used for SVM prediction from thousands to tens. This improvement in efficiency meets the requirements of multi-user, multi-task identification for web services. The updated fmpRPMF server is freely available at <uri> http://bioinformatics.datawisdom.net/fmp</uri>.10/08/2018 4:54 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TCBB.2017.2762682