IEEE Transactions on Knowledge & Data Engineering
https://www.computer.org/csdl/trans/tk/index.html
The IEEE Transactions on Knowledge and Data Engineering is an archival journal published monthly. The information published in this Transactions is designed to inform researchers, developers, managers, strategic planners, users, and others interested in state-of-the-art and state-of-the-practice activities in the knowledge and data engineering area. We are interested in well-defined theoretical results and empirical studies that have potential impact on the acquisition, management, storage, and graceful degradation of knowledge and data, as well as in provision of knowledge and data services. Specific topics include, but are not limited to: a) artificial intelligence techniques, including speech, voice, graphics, images, and documents; b) knowledge and data engineering tools and techniques; c) parallel and distributed processing; d) real-time distributed systems; e) system architectures, integration, and modeling; f) database design, modeling, and management; g) query design and implementation languages; h) distributed database control; j) algorithms for data and knowledge management; k) performance evaluation of algorithms and systems; l) data communications aspects; m) system applications and experience; n) knowledge-based and expert systems; and, o) integrity, security, and fault tolerance.

IEEE Computer Society Digital Library
List of 100 recently published journal articles.
https://www.computer.org/csdl
Influence Maximization on Social Graphs: A Survey
https://www.computer.org/csdl/trans/tk/2018/10/08295265-abs.html
Influence Maximization (IM), which selects a set of <inline-formula><tex-math notation="LaTeX">$k$</tex-math> <alternatives><inline-graphic xlink:href="li-ieq1-2807843.gif"/></alternatives></inline-formula> users (called seed set) from a social network to maximize the expected number of influenced users (called influence spread), is a key algorithmic problem in social influence analysis. Due to its immense application potential and enormous technical challenges, IM has been extensively studied in the past decade. In this paper, we survey and synthesize a wide spectrum of existing studies on IM from an <italic>algorithmic perspective</italic>, with a special focus on the following key aspects: (1) a review of well-accepted diffusion models that capture the information diffusion process and build the foundation of the IM problem, (2) a fine-grained taxonomy to classify existing IM algorithms based on their design objectives, (3) a rigorous theoretical comparison of existing IM algorithms, and (4) a comprehensive study on the applications of IM techniques in combination with novel context features of social networks such as topic, location, and time. Based on this analysis, we then outline the key challenges and research directions to expand the boundary of IM research.

09/13/2018 4:37 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2807843

Mining Summaries for Knowledge Graph Search
https://www.computer.org/csdl/trans/tk/2018/10/08300649-abs.html
Querying heterogeneous and large-scale knowledge graphs is expensive. This paper studies a graph summarization framework to facilitate knowledge graph search. (1) We introduce a class of <italic>reduced summaries</italic>. Characterized by approximate graph pattern matching, these summaries are capable of summarizing entities in terms of their neighborhood similarity up to a certain hop, using small and informative graph patterns. (2) We study a <italic>diversified graph summarization</italic> problem. Given a knowledge graph, it is to discover top-<inline-formula> <tex-math notation="LaTeX">$k$</tex-math><alternatives><inline-graphic xlink:href="song-ieq1-2807442.gif"/> </alternatives></inline-formula> summaries that maximize a bi-criteria function, characterized by both informativeness and diversity. We show that diversified summarization is feasible for large graphs, by developing both sequential and parallel summarization algorithms. (a) We show that there exists a 2-approximation algorithm to discover diversified summaries. We further develop an anytime sequential algorithm which discovers summaries under resource constraints. (b) We present a new parallel algorithm with quality guarantees. The algorithm is parallel scalable, which ensures its feasibility in distributed graphs. (3) We also develop a summary-based query evaluation scheme, which only refers to a small number of summaries. Using real-world knowledge graphs, we experimentally verify the effectiveness and efficiency of our summarization algorithms, and of query processing using summaries.

09/13/2018 4:38 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2807442

Top-<italic>k</italic> Critical Vertices Query on Shortest Path
https://www.computer.org/csdl/trans/tk/2018/10/08300661-abs.html
Shortest path query is one of the most fundamental and classic problems in graph analytics, which returns the complete shortest path between any two vertices. However, in many real-life scenarios, only critical vertices on the shortest path are desirable and it is unnecessary to search for the complete path. This paper investigates the shortest path sketch by defining a top-<inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="ma-ieq1-2808495.gif"/></alternatives></inline-formula> critical vertices (<inline-formula> <tex-math notation="LaTeX">$k$</tex-math><alternatives><inline-graphic xlink:href="ma-ieq2-2808495.gif"/></alternatives> </inline-formula>CV) query on the shortest path. Given a source vertex <inline-formula><tex-math notation="LaTeX">$s$ </tex-math><alternatives><inline-graphic xlink:href="ma-ieq3-2808495.gif"/></alternatives></inline-formula> and target vertex <inline-formula><tex-math notation="LaTeX">$t$</tex-math><alternatives> <inline-graphic xlink:href="ma-ieq4-2808495.gif"/></alternatives></inline-formula> in a graph, <inline-formula> <tex-math notation="LaTeX">$k$</tex-math><alternatives><inline-graphic xlink:href="ma-ieq5-2808495.gif"/></alternatives> </inline-formula>CV query can return the top-<inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="ma-ieq6-2808495.gif"/></alternatives></inline-formula> significant vertices on the shortest path <inline-formula><tex-math notation="LaTeX">$SP(s,t)$</tex-math><alternatives> <inline-graphic xlink:href="ma-ieq7-2808495.gif"/></alternatives></inline-formula>. The significance of the vertices can be predefined. The key strategy for seeking the sketch is to apply off-line preprocessed distance oracle to accelerate on-line real-time queries. This allows us to omit unnecessary vertices and obtain the most representative sketch of the shortest path directly. 
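As a concrete (if simplified) illustration of the query semantics just described, the sketch below computes the shortest path with a plain Dijkstra search and then ranks its vertices by a caller-supplied significance function. Note that the paper's approach uses a precomputed distance oracle precisely to avoid materializing the full path; `dijkstra_path`, `kcv_query`, and the significance callback are illustrative names, not the authors' API.

```python
import heapq

def dijkstra_path(graph, s, t):
    """Return the shortest path from s to t in a weighted digraph
    given as {u: [(v, w), ...]}; None if t is unreachable."""
    dist = {s: 0}
    prev = {}
    pq = [(0, s)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == t:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(pq, (nd, v))
    if t not in dist:
        return None
    path, u = [t], t
    while u != s:
        u = prev[u]
        path.append(u)
    return path[::-1]

def kcv_query(graph, s, t, k, significance):
    """Return the k most significant vertices on SP(s, t),
    where significance is any user-defined scoring function."""
    path = dijkstra_path(graph, s, t)
    if path is None:
        return []
    return sorted(path, key=significance, reverse=True)[:k]
```

For example, with significance taken as vertex degree or centrality, the query returns only the sketch of the path rather than every intermediate vertex.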
We further explore a series of methods and optimizations to answer <inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="ma-ieq8-2808495.gif"/></alternatives></inline-formula>CV queries on both centralized and distributed platforms, using exact and approximate approaches, respectively. We evaluate our methods in terms of time and space complexity and approximation quality. Experiments on large-scale real-world networks validate that our algorithms achieve high efficiency and accuracy.

09/13/2018 4:38 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2808495

Self-Tuned Descriptive Document Clustering Using a Predictive Network
https://www.computer.org/csdl/trans/tk/2018/10/08301532-abs.html
Descriptive clustering consists of automatically organizing data instances into clusters and generating a descriptive summary for each cluster. The description should inform a user about the contents of each cluster without further examination of the specific instances, enabling a user to rapidly scan for relevant clusters. Selection of descriptions often relies on heuristic criteria. We model descriptive clustering as an auto-encoder network that predicts features from cluster assignments and predicts cluster assignments from a subset of features. The subset of features used for predicting a cluster serves as its description. For text documents, the occurrence or count of words, phrases, or other attributes provides a sparse feature representation with interpretable feature labels. In the proposed network, cluster predictions are made using logistic regression models, and feature predictions rely on logistic or multinomial regression models. Optimizing these models leads to a completely self-tuned descriptive clustering approach that automatically selects the number of clusters and the number of features for each cluster. We applied the methodology to a variety of short text documents and showed that the selected clustering, as evidenced by the selected feature subsets, is associated with a meaningful topical organization.

09/13/2018 4:38 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2781721

Locality Reconstruction Models for Book Representation
https://www.computer.org/csdl/trans/tk/2018/10/08301545-abs.html
Books, as a representative of lengthy documents, convey rich semantics. Traditional document modeling methods, such as bag-of-words models, have difficulty capturing such rich semantics when only considering term-frequency features. In order to explore term spatial distributions over a book, a tree-structured book representation is investigated in this paper. Moreover, an efficient learning framework, Tree2Vector, is introduced for mapping tree-structured book data into vectorial space. In particular, we present two types of locality reconstruction (LR) models, Euclidean-type and cosine-type, during the transformation process of tree structures into vectorial representations. The LR is used for modeling the reconstruction process, in which each parent node in a tree is supposed to be reconstructed by its child nodes. The prominent advantage of this Tree2Vector framework is that it solely utilizes the local information within a single book tree. In addition, extensive experimental results demonstrate that Tree2Vector is able to deliver comparable or better performance in comparison to methods that consider the information of all trees in a database globally. Experimental results also suggest that cosine-type LR consistently performs better than Euclidean-type LR in applications of book and author recommendations.

09/13/2018 4:37 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2808953

<inline-formula><tex-math notation="LaTeX">$\sf {SIMkNN}$</tex-math><alternatives> <inline-graphic xlink:href="cao-ieq1-2808971.gif"/></alternatives></inline-formula>: A Scalable Method for in-Memory <italic>k</italic>NN Search over Moving Objects in Road Networks
https://www.computer.org/csdl/trans/tk/2018/10/08301596-abs.html
Nowadays, many location-based applications, e.g., taxi-calling and ride-sharing services, require the ability to query <inline-formula> <tex-math notation="LaTeX">$k$</tex-math><alternatives><inline-graphic xlink:href="cao-ieq2-2808971.gif"/> </alternatives></inline-formula>-nearest neighbors over a very large scale of moving objects in road networks. A traditional grid index with equal-sized cells cannot adapt to the skewed distribution of moving objects in real scenarios. Thus, to obtain fast query response times, the grid needs to be split into smaller cells, which introduces the side effect of higher memory cost, i.e., maintaining such a large volume of cells requires a much larger memory space at the server side. In this paper, we present <inline-formula><tex-math notation="LaTeX">$\sf {SIMkNN}$</tex-math><alternatives> <inline-graphic xlink:href="cao-ieq3-2808971.gif"/></alternatives></inline-formula>, a scalable and in-memory <italic>k</italic>NN query processing technique. <inline-formula><tex-math notation="LaTeX">$\sf {SIMkNN}$</tex-math> <alternatives><inline-graphic xlink:href="cao-ieq4-2808971.gif"/></alternatives></inline-formula> is dual-index driven: we adopt an R-tree to store the topology of the road network and a <italic>hierarchical grid model</italic> to manage the moving objects in non-uniform distribution. To answer a <italic>k</italic>NN query in real time, <inline-formula><tex-math notation="LaTeX">$\sf {SIMkNN}$</tex-math><alternatives> <inline-graphic xlink:href="cao-ieq5-2808971.gif"/></alternatives></inline-formula> adopts the strategy of incrementally enlarging the search area for network-distance-based nearest neighbor evaluation. Performing this space expansion within the hierarchical grid index is far from trivial.
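The incremental-enlargement strategy can be illustrated on a flat, uniform grid with Euclidean distances; the paper's hierarchical grid and network-distance evaluation are considerably more involved. In the hypothetical sketch below, a query visits rings of cells of growing radius and stops once the current k-th distance rules out every unvisited cell.

```python
import heapq
import math
from collections import defaultdict

class GridIndex:
    """Uniform grid over 2-D points; answers kNN queries by visiting
    rings of cells of increasing Chebyshev radius around the query."""

    def __init__(self, points, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)
        for p in points:
            self.cells[self._cell(p)].append(p)
        self.total = len(points)

    def _cell(self, p):
        return (int(p[0] // self.cell_size), int(p[1] // self.cell_size))

    def knn(self, q, k):
        cx, cy = self._cell(q)
        target = min(k, self.total)
        found = []                     # max-heap of (-distance, point)
        r = 0
        while True:
            for gx in range(cx - r, cx + r + 1):
                for gy in range(cy - r, cy + r + 1):
                    if max(abs(gx - cx), abs(gy - cy)) != r:
                        continue       # interior cells were visited earlier
                    for p in self.cells.get((gx, gy), ()):
                        heapq.heappush(found, (-math.dist(q, p), p))
                        if len(found) > k:
                            heapq.heappop(found)
            # Every unvisited cell lies at Euclidean distance
            # >= r * cell_size from q, so it is safe to stop once the
            # current k-th distance is within that bound.
            if len(found) == target and (
                    target == self.total
                    or -found[0][0] <= r * self.cell_size):
                break
            r += 1
        return sorted((p for _, p in found),
                      key=lambda p: math.dist(q, p))
```

The class and method names are illustrative, not the authors'; the paper's cell-communication technique addresses the analogous "who are my neighbors" question when cells live at different levels of a hierarchy.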
For a given cell, we first define its neighbors in different directions, then propose a cell communication technique which allows each cell in the hierarchical grid index to be aware of its neighbors at any time. Accordingly, an efficient space expansion algorithm to generate the estimation area is proposed. The experimental evaluation shows that <inline-formula><tex-math notation="LaTeX">$\sf {SIMkNN}$</tex-math><alternatives><inline-graphic xlink:href="cao-ieq6-2808971.gif"/></alternatives></inline-formula> outperforms the baseline algorithm in terms of time and memory efficiency.

09/13/2018 4:38 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2808971

Efficient Parallel Skyline Query Processing for High-Dimensional Data
https://www.computer.org/csdl/trans/tk/2018/10/08302507-abs.html
Given a set of multidimensional data points, skyline queries retrieve those points that are not dominated by any other points in the set. Due to the ubiquitous use of skyline queries, such as in preference-based query answering and decision making, and the large amount of data that these queries have to deal with, enabling their scalable processing is of critical importance. However, there are several outstanding challenges that have not been well addressed. More specifically, in this paper, we are tackling the data straggler and data skew challenges introduced by distributed skyline query processing, as well as the ensuing high computation cost of merging skyline candidates. We thus introduce a new efficient three-phase approach for large scale processing of skyline queries. In the first preprocessing phase, the data is partitioned along the Z-order curve. We utilize a novel data partitioning approach that formulates data partitioning as an optimization problem to minimize the size of intermediate data. In the second phase, each compute node partitions the input data points into disjoint subsets, and then performs the skyline computation on each subset to produce skyline candidates in parallel. In the final phase, we build an index and employ an efficient algorithm to merge the generated skyline candidates. Extensive experiments demonstrate that the proposed skyline algorithm achieves more than one order of magnitude enhancement in performance compared to existing state-of-the-art approaches.

09/13/2018 4:38 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2809598

Tensor-Based Big Data Management Scheme for Dimensionality Reduction Problem in Smart Grid Systems: SDN Perspective
https://www.computer.org/csdl/trans/tk/2018/10/08302840-abs.html
Smart grid (SG) is an integration of traditional power grid with advanced information and communication infrastructure for bidirectional energy flow between grid and end users. A huge amount of data is being generated by various smart devices deployed in SG systems. Such a massive data generation from various smart devices in SG systems may lead to various challenges for the networking infrastructure deployed between users and the grid. Hence, an efficient data transmission technique is required for providing desired QoS to the end users in this environment. Generally, the data generated by smart devices in SG has high dimensions in the form of multiple heterogeneous attributes, values of which are changed with time. The high dimensions of data may affect the performance of most of the designed solutions in this environment. Most of the existing schemes reported in the literature have complex operations for the data dimensionality reduction problem which may deteriorate the performance of any implemented solution for this problem. To address these challenges, in this paper, a tensor-based big data management scheme is proposed for dimensionality reduction problem of big data generated from various smart devices. In the proposed scheme, first the Frobenius norm is applied on high-order tensors (used for data representation) to minimize the reconstruction error of the reduced tensors. Then, an empirical probability-based control algorithm is designed to estimate an optimal path to forward the reduced data using software-defined networks for minimization of the network load and effective bandwidth utilization. The proposed scheme minimizes the transmission delay incurred during the movement of the dimensionally reduced data between different nodes. The efficacy of the proposed scheme has been evaluated using extensive simulations carried out on the data traces using ‘R’ programming and Matlab. 
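For intuition about the tensor machinery mentioned above, the stdlib-only sketch below shows the Frobenius norm and a mode-0 matricization of a 3-way tensor stored as nested lists; the reconstruction error the scheme minimizes is the Frobenius norm of the difference between the original and reduced-then-reconstructed tensors. This is an illustrative sketch under the common unfolding convention (second index varies fastest), not the paper's algorithm.

```python
import math
from itertools import product

def frobenius_norm(tensor3):
    """Frobenius norm of a 3-way tensor given as nested lists:
    the square root of the sum of squared entries."""
    return math.sqrt(sum(x * x
                         for mat in tensor3 for row in mat for x in row))

def unfold_mode0(tensor3):
    """Mode-0 matricization: slice i of the tensor becomes row i,
    with the remaining two indices flattened (index 1 varies fastest)."""
    I = len(tensor3)
    J = len(tensor3[0])
    K = len(tensor3[0][0])
    return [[tensor3[i][j][k] for k, j in product(range(K), range(J))]
            for i in range(I)]
```

A reduction method (e.g., a truncated decomposition applied to the unfolded matrix) would then be judged by `frobenius_norm` of the elementwise difference between the input and its reconstruction.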
The big data traces considered for evaluation consist of more than two million entries (2,075,259) collected at a one-minute sampling rate, having heterogeneous features such as voltage, energy, frequency, electric signals, etc. Moreover, a comparative study for different data traces and a real SG testbed is also presented to prove the efficacy of the proposed scheme. The results obtained depict the effectiveness of the proposed scheme with respect to parameters such as network delay, accuracy, and throughput.

09/13/2018 4:38 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2809747

Planning with Spatio-Temporal Search Control Knowledge
https://www.computer.org/csdl/trans/tk/2018/10/08303742-abs.html
Knowledge based approaches developed for AI planning can convert an intractable planning problem to a tractable one. Current techniques often use temporal logics to express Search Control Knowledge (SCK) in logic based planning. However, traditional temporal logics are limited in expressiveness since they are unable to express spatial constraints which are as important as temporal ones in many planning domains. To this end, we propose a two-dimensional (spatial and temporal) logic namely PPTL<inline-formula><tex-math notation="LaTeX">$^{\mathrm{SL}}$ </tex-math><alternatives><inline-graphic xlink:href="lu-ieq1-2810144.gif"/></alternatives></inline-formula> by temporalizing separation logic with PPTL (Propositional Projection Temporal Logic) which is well-suited to specify SCK involving both spatial and temporal constraints in planning. We prove that PPTL<inline-formula> <tex-math notation="LaTeX">$^{\mathrm{SL}}$</tex-math><alternatives><inline-graphic xlink:href="lu-ieq2-2810144.gif"/> </alternatives></inline-formula> is decidable essentially via an equisatisfiable translation from PPTL<inline-formula> <tex-math notation="LaTeX">$^{\mathrm{SL}}$</tex-math><alternatives><inline-graphic xlink:href="lu-ieq3-2810144.gif"/> </alternatives></inline-formula> to its restricted form. Moreover, we implement a tool, <italic>S-TSolver</italic>, which effectively computes plans under the guidance of the spatio-temporal SCK expressed by PPTL<inline-formula> <tex-math notation="LaTeX">$^{\mathrm{SL}}$</tex-math><alternatives><inline-graphic xlink:href="lu-ieq4-2810144.gif"/> </alternatives></inline-formula> formulas. The effectiveness of the tool is evaluated on selected benchmark domains from the International Planning Competition.

09/13/2018 4:37 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2810144

Semi-Supervised Feature Selection via Insensitive Sparse Regression with Application to Video Semantic Recognition
https://www.computer.org/csdl/trans/tk/2018/10/08304684-abs.html
Feature selection plays a significant role in dealing with high-dimensional data to avoid the curse of dimensionality. In many real applications, like video semantic recognition, handling few labeled and large unlabeled data samples from the same population is a recently addressed challenge in feature selection. To solve this problem, we propose a novel semi-supervised feature selection method via insensitive sparse regression (ISR). Specifically, we compute the soft label matrix by the special label propagation, which can predict the labels of the unlabeled data. To guarantee the robustness of ISR to the false labeled instances or outliers, we propose the Insensitive Regression Model (IRM) with a capped <inline-formula><tex-math notation="LaTeX">$l_2$</tex-math><alternatives> <inline-graphic xlink:href="hou-ieq1-2810286.gif"/></alternatives></inline-formula>-<inline-formula> <tex-math notation="LaTeX">$l_p$</tex-math><alternatives> <inline-graphic xlink:href="hou-ieq2-2810286.gif"/></alternatives></inline-formula>-norm loss. The soft label is imposed as the weights of IRM to fully utilize the label information. Meanwhile, to perform feature selection, we incorporate an <inline-formula><tex-math notation="LaTeX"> $l_{2,q}$</tex-math><alternatives> <inline-graphic xlink:href="hou-ieq3-2810286.gif"/></alternatives></inline-formula>-norm regularizer with IRM as the structural sparsity constraint when <inline-formula><tex-math notation="LaTeX"> $0<q\leq 1$</tex-math><alternatives> <inline-graphic xlink:href="hou-ieq4-2810286.gif"/></alternatives> </inline-formula>. Moreover, we put forward an effective approach for solving the formulated non-convex optimization problem. We rigorously analyze the convergence behavior and discuss the parameter determination problem. Extensive experimental results on several public data sets verify the effectiveness of our proposed algorithm in comparison with state-of-the-art feature selection methods.
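The robustness claim above rests on the loss saturating for large residuals, so a mislabeled instance or outlier contributes at most a bounded penalty rather than dominating the fit. A minimal sketch of a capped squared-l2 loss (the paper's capped l2-lp loss generalizes this with an lp power; `cap` is a hypothetical parameter name):

```python
def capped_l2_loss(residual, cap):
    """Capped squared-l2 loss for one residual vector:
    grows like ||r||^2 for small residuals, saturates at cap."""
    sq = sum(r * r for r in residual)
    return min(sq, cap)

def total_loss(residuals, cap):
    """Sum of capped losses; each outlier adds at most `cap`."""
    return sum(capped_l2_loss(r, cap) for r in residuals)
```

With an ordinary squared loss, a single residual of norm 100 would contribute 10,000 to the objective; under the cap it contributes the same bounded amount as any other gross error.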
Finally, we successfully apply our method to video semantic recognition.

09/12/2018 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2810286

We Like, We Post: A Joint User-Post Approach for Facebook Post Stance Labeling
https://www.computer.org/csdl/trans/tk/2018/10/08305481-abs.html
Web post and user stance labeling is challenging not only because of the informality and variation in language on the Web but also because of the lack of labeled data on fast-emerging new topics—even the labeled data we do have are usually heavily skewed. In this paper, we propose a joint user-post approach for stance labeling to mitigate the latter two difficulties. In labeling post stance, the proposed approach considers post content as well as posting and liking behavior, which involves users. Sentiment analysis is applied to posts to acquire their initial stance, and then the post and user stance are updated iteratively with correlated posting-related actions. The whole process works with limited labeled data, which solves the first problem. We use real interaction between authors and readers for stance labeling. Experimental results show that the proposed approach not only substantially improves content-based post stance labeling, but also yields better performance for the minor stance class, which solves the second problem.

09/13/2018 4:37 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2810875

Multi-Label Learning with Emerging New Labels
https://www.computer.org/csdl/trans/tk/2018/10/08305522-abs.html
In a multi-label learning task, an object possesses multiple concepts where each concept is represented by a class label. Previous studies on multi-label learning have focused on a fixed set of class labels, i.e., the class label set of test data is the same as that in the training set. In many applications, however, the environment is dynamic and new concepts may emerge in a data stream. In order to maintain a good predictive performance in this environment, a multi-label learning method must have the ability to detect and classify instances with emerging new labels. To this end, we propose a new approach called Multi-label learning with Emerging New Labels (<monospace>MuENL</monospace>). It has three functions: classify instances on currently known labels, detect the emergence of a new label, and construct a new classifier for each new label that works collaboratively with the classifier for known labels. In addition, we show that <monospace>MuENL</monospace> can be easily extended to handle sparse high dimensional data streams by simply reducing the original dimensionality, and then applying <monospace>MuENL</monospace> on the reduced dimensional space. Our empirical evaluation shows the effectiveness of <monospace>MuENL</monospace> on several benchmark datasets and <monospace>MuENLHD</monospace> on the sparse high dimensional Weibo dataset.

09/13/2018 4:38 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2810872

Supervised Search Result Diversification via Subtopic Attention
https://www.computer.org/csdl/trans/tk/2018/10/08305531-abs.html
Search result diversification aims to retrieve diverse results to satisfy as many different information needs as possible. Supervised methods have been proposed recently to learn ranking functions and they have been shown to produce superior results to unsupervised methods. However, these methods use implicit approaches based on the principle of Maximal Marginal Relevance (MMR). In this paper, we propose a learning framework for explicit result diversification where subtopics are explicitly modeled. Based on the information contained in the sequence of selected documents, we use the attention mechanism to capture the subtopics to be focused on while selecting the next document, which naturally fits our task of document selection for diversification. As a preliminary attempt, we employ recurrent neural networks and max pooling to instantiate the framework. We use both distributed representations and traditional relevance features to model documents in the implementation. The framework is flexible to model query intent in either a flat list or a hierarchy. Experimental results show that the proposed method significantly outperforms all the existing search result diversification approaches.

09/13/2018 4:38 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2810873

Automated Phrase Mining from Massive Text Corpora
https://www.computer.org/csdl/trans/tk/2018/10/08306825-abs.html
As one of the fundamental tasks in text analysis, phrase mining aims at extracting quality phrases from a text corpus and has various downstream applications including information extraction/retrieval, taxonomy construction, and topic modeling. Most existing methods rely on complex, trained linguistic analyzers, and thus likely have unsatisfactory performance on text corpora of new domains and genres without extra but expensive adaption. None of the state-of-the-art models, even data-driven models, is fully automated because they require human experts for designing rules or labeling phrases. In this paper, we propose a novel framework for automated phrase mining, <inline-formula> <tex-math notation="LaTeX">$\mathsf{AutoPhrase}$</tex-math><alternatives> <inline-graphic xlink:href="shang-ieq1-2812203.gif"/></alternatives></inline-formula>, which supports any language as long as a general knowledge base (e.g., Wikipedia) in that language is available, while benefiting from, but not requiring, a POS tagger. Compared to the state-of-the-art methods, <inline-formula><tex-math notation="LaTeX"> $\mathsf{AutoPhrase}$</tex-math><alternatives><inline-graphic xlink:href="shang-ieq2-2812203.gif"/></alternatives> </inline-formula> has shown significant improvements in both effectiveness and efficiency on five real-world datasets across different domains and languages. In addition, <inline-formula><tex-math notation="LaTeX">$\mathsf{AutoPhrase}$ </tex-math><alternatives><inline-graphic xlink:href="shang-ieq3-2812203.gif"/></alternatives></inline-formula> can be extended to model single-word quality phrases.

09/13/2018 4:38 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2812203

AMDO: An Over-Sampling Technique for Multi-Class Imbalanced Problems
https://www.computer.org/csdl/trans/tk/2018/09/08065074-abs.html
Multi-class imbalanced problems have attracted growing attention in real-world engineering classification tasks. The underlying skewed distribution of multiple classes poses difficulties for learning algorithms, which becomes more challenging when considering overlapping between classes, lack of representative data, and mixed-type data. In this work, we address this problem in a data-oriented way. Motivated by a recently proposed over-sampling technique designed for numeric data sets, Mahalanobis Distance-based Over-sampling (MDO), we use this technique to capture the covariance structure of the minority class and to generate synthetic samples along the probability contours for learning algorithms. Based on MDO, we further improve the over-sampling strategy and generalize it for mixed-type data sets. The established technique, Adaptive Mahalanobis Distance-based Over-sampling (AMDO), introduces GSVD (Generalized Singular Value Decomposition) for mixed-type data, develops a partially balanced resampling scheme, and optimizes the sample synthesis. Theoretical analysis is conducted to demonstrate the reasonability of AMDO. Extensive experimental testing is performed on 15 multi-class imbalanced benchmarks and two data sets for precipitation phase recognition in comparison with several state-of-the-art multi-class imbalanced learning methods. The results validate the effectiveness and robustness of our proposal.

08/07/2018 12:33 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2761347

Privacy Characterization and Quantification in Data Publishing
https://www.computer.org/csdl/trans/tk/2018/09/08276593-abs.html
The increasing interest in collecting and publishing large amounts of individuals' data as public for purposes such as medical research, market analysis, and economical measures has created major privacy concerns about individuals' sensitive information. To deal with these concerns, many Privacy-Preserving Data Publishing (PPDP) techniques have been proposed in the literature. However, they lack a proper privacy characterization and measurement. In this paper, we first present a novel multi-variable privacy characterization and quantification model. Based on this model, we are able to analyze the prior and posterior adversarial belief about attribute values of individuals. We can also analyze the sensitivity of any identifier in privacy characterization. Then, we show that privacy should not be measured based on one metric, and we demonstrate how this could result in privacy misjudgment. We propose two different metrics for quantifying privacy leakage: distribution leakage and entropy leakage. Using these metrics, we analyzed some of the most well-known PPDP techniques such as <inline-formula><tex-math notation="LaTeX">$k$</tex-math> <alternatives><inline-graphic xlink:href="ibrahim-ieq1-2797092.gif"/></alternatives></inline-formula>-anonymity, <inline-formula><tex-math notation="LaTeX">$l$</tex-math><alternatives> <inline-graphic xlink:href="ibrahim-ieq2-2797092.gif"/></alternatives></inline-formula>-diversity, and <inline-formula> <tex-math notation="LaTeX">$t$</tex-math><alternatives><inline-graphic xlink:href="ibrahim-ieq3-2797092.gif"/> </alternatives></inline-formula>-closeness. Based on our framework and the proposed metrics, we can determine that all the existing PPDP schemes have limitations in privacy characterization. Our proposed privacy characterization and measurement framework contributes to better understanding and evaluation of these techniques.
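Entropy leakage can be read as the drop in an adversary's Shannon uncertainty about a sensitive attribute between the prior and posterior beliefs mentioned above; the sketch below uses that reading (the paper's exact definition may differ in details, and the function names are illustrative):

```python
import math

def shannon_entropy(dist):
    """Shannon entropy in bits of a distribution {value: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def entropy_leakage(prior, posterior):
    """Reduction in adversarial uncertainty after observing a release:
    prior entropy minus posterior entropy, in bits."""
    return shannon_entropy(prior) - shannon_entropy(posterior)
```

For instance, if a release narrows an adversary's belief from a uniform choice among four disease values to a uniform choice among two, one full bit of uncertainty about the sensitive attribute has leaked.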
Thus, this paper provides a foundation for design and analysis of PPDP schemes.

08/07/2018 12:33 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2797092

On Generalizing Collective Spatial Keyword Queries
https://www.computer.org/csdl/trans/tk/2018/09/08278270-abs.html
With the proliferation of spatial-textual data such as location-based services and geo-tagged websites, spatial keyword queries are ubiquitous in real life. One example is the <italic>collective spatial keyword query</italic> (CoSKQ), which, for a given query consisting of a query location and several query keywords, finds a set of objects that <italic>covers</italic> the query keywords collectively and has the smallest <italic>cost</italic> with respect to the query location. In the literature, many different functions have been proposed for defining the <inline-formula><tex-math notation="LaTeX">${cost}$</tex-math><alternatives> <inline-graphic xlink:href="chan-ieq1-2800746.gif"/></alternatives></inline-formula>, and correspondingly, many different approaches have been developed for the CoSKQ problem. In this paper, we study the CoSKQ problem systematically by proposing <italic>a unified cost function</italic> and <italic>a unified approach</italic> for the CoSKQ problem (with the unified cost function). The unified cost function includes all existing cost functions as special cases, and the unified approach solves the CoSKQ problem with the unified cost function in a unified way. Experiments conducted on both real and synthetic datasets verify our proposed approach.08/07/2018 12:33 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2800746On Power Law Growth of Social Networks
https://www.computer.org/csdl/trans/tk/2018/09/08280512-abs.html
What are the growth dynamics of social networks such as Facebook or WeChat? Do they truly exhibit exponential early growth, as predicted by celebrated models like the Bass model? And what about the dynamics of links, for which few models have been published? For the first time, we examine the growth of WeChat, the largest online social network in China, together with several other real social networks. We observe power-law growth dynamics for both nodes and links, contradicting the sigmoid curves of textbook models. We propose <sc>NetTide</sc>, along with differential equations for the growth of nodes and links. Our model fits the growth dynamics of real social networks well; it encompasses many traditional growth dynamics as special cases, while remaining parsimonious in parameters. The <sc>NetTide</sc> model for link growth is the first of its kind, accurately fitting real data and capturing the densification phenomenon. We further formulate two stochastic generators, which interpret the growth of nodes and links through survival analysis and micro-level interactions within a social network, respectively. The proposed generators reproduce realistic growth dynamics of social networks. When applied to the WeChat data, <sc>NetTide</sc> forecasted <inline-formula><tex-math notation="LaTeX">$\geq$</tex-math><alternatives> <inline-graphic xlink:href="zang-ieq1-2801844.gif"/></alternatives></inline-formula> 730 days ahead with 3 percent error.08/07/2018 12:33 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2801844Querying a Collection of Continuous Functions
https://www.computer.org/csdl/trans/tk/2018/09/08283622-abs.html
We introduce a new query primitive called <italic>Function Query</italic> (FQ). An FQ operates on a set of math functions and retrieves the functions whose output for a given input satisfies a query condition (e.g., being among the top k, or within a given range). While FQs find their natural use in querying a database of math functions, they can also be applied to a database of discrete values. We show that by interpreting the database as a set of user-defined functions, FQ can achieve the same functionality as existing analytic queries such as the top-k query and the scalar product query. We address the challenge of efficient FQ execution. The core of our solution is a novel data structure called the <italic>Intersection-tree</italic>. Our research takes advantage of the fact that 1) the intersections of a set of continuous functions partition their domain into a number of <italic>subdomains</italic>, and 2) in each of these subdomains, the functions can be sorted by their output. We evaluate the performance of the proposed techniques through analysis, prototyping, and experiments using both synthetic and real-world data. When querying a database of functions, our techniques scale well. When applied to a database of discrete values, our techniques are more versatile and outperform existing techniques on various performance metrics.08/07/2018 12:33 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2802936On Efficiently Answering Why-Not Range-Based Skyline Queries in Road Networks
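The Intersection-tree index is the paper's contribution and is not reproduced here. As a minimal sketch of FQ semantics only, with hypothetical example functions and names of our own choosing, a brute-force top-k FQ evaluates every function at the query input and keeps the k largest outputs:

```python
import heapq

# Hypothetical toy database of continuous functions (illustrative only).
functions = {
    "f1": lambda x: 2 * x + 1,
    "f2": lambda x: -x + 5,
    "f3": lambda x: 0.5 * x + 2,
}

def top_k_fq(funcs, x, k):
    """Brute-force top-k Function Query: evaluate every function at input x
    and return the names of the k functions with the largest outputs."""
    scored = ((name, f(x)) for name, f in funcs.items())
    return [name for name, _ in heapq.nlargest(k, scored, key=lambda p: p[1])]
```

A real implementation would exploit the subdomain ordering described above to avoid scanning every function per query.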
https://www.computer.org/csdl/trans/tk/2018/09/08283816-abs.html
The range-based skyline (r-skyline) query on road networks retrieves the skyline objects for each of the query points within a road region, considering the objects’ spatial and non-spatial attributes. However, reasoning about missing query results, specified by <italic>why-not questions</italic>, had until recently received little of the attention it deserves. In this paper, we systematically study why-not questions on the r-skyline query in the road network environment (abbreviated as the <italic>why-not RSQ problem</italic>). We present three modification strategies for the why-not RSQ problem: modifying the query range, modifying the why-not point, and modifying both. We also propose three efficient algorithms to tackle the why-not RSQ problem, leveraging several newly presented concepts and techniques, such as the <italic>skyline scope</italic> and <italic>skyline dominance region</italic>, <italic>non-spatial attribute modification pruning</italic>, and the <italic>G-tree index</italic>. Extensive experimental evaluation using both real and synthetic data sets demonstrates the performance of our proposed algorithms.08/07/2018 12:33 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2803821A Fast Parallel Community Discovery Model on Complex Networks Through Approximate Optimization
https://www.computer.org/csdl/trans/tk/2018/09/08283822-abs.html
Community discovery plays an essential role in the analysis of the structural features of complex networks. Since online networks grow increasingly large and complex over time, the methods traditionally used for community discovery cannot efficiently handle large-scale network data. This raises the important problem of how to effectively and efficiently discover large communities from complex networks. In this study, we propose a fast parallel community discovery model called picaso (a <bold>p</bold>arallel commun<bold>i</bold>ty dis<bold>c</bold>overy <bold>a</bold>lgorithm ba<bold>s</bold>ed on approximate <bold>o</bold>ptimization), which integrates two new techniques: (1) the Mountain model, which uses graph theory to approximate the selection of nodes to merge, and (2) the Landslide algorithm, which updates the modularity increment based on the approximated optimization. In addition, the GraphX distributed computing framework is employed to achieve parallel community detection over complex networks. In the proposed model, clustering on modularity is used to initialize the Mountain model and to compute the weight of each edge in the networks. The relationships among the communities are then simplified by applying the Landslide algorithm, which allows us to obtain the community structures of the complex networks. Extensive experiments were conducted on real and synthetic complex network datasets, and the results demonstrate that the proposed algorithm outperforms state-of-the-art methods in both effectiveness and efficiency for community detection. Moreover, we show that its overall running time is approximately four times faster than that of similar approaches.
Our results suggest a new paradigm for large-scale community discovery of complex networks.08/07/2018 12:33 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2803818Privacy Enhanced Matrix Factorization for Recommendation with Local Differential Privacy
https://www.computer.org/csdl/trans/tk/2018/09/08290673-abs.html
Recommender systems collect and analyze user data to provide a better user experience. However, several privacy concerns arise when a recommender knows a user's set of items or ratings. A number of solutions have been suggested to improve the privacy of legacy recommender systems, but the existing solutions in the literature can protect either items or ratings, not both. In this paper, we propose a recommender system that protects both a user's items and ratings. To this end, we develop novel matrix factorization algorithms under local differential privacy (LDP). In a recommender system with LDP, individual users randomize their data themselves to satisfy differential privacy and send the perturbed data to the recommender, which then computes aggregates of the perturbed data. This framework ensures that both a user's items and ratings remain private from the recommender. However, applying LDP to matrix factorization typically raises utility issues due to i) high dimensionality caused by a large number of items and ii) iterative estimation algorithms. To tackle these technical challenges, we adopt a dimensionality reduction technique and a novel binary mechanism based on sampling. We additionally introduce a factor that stabilizes the perturbed gradients. Using the MovieLens and LibimSeTi datasets, we evaluate the recommendation accuracy of our system and demonstrate that our algorithm performs better than the existing differentially private gradient descent algorithm for matrix factorization under stronger privacy requirements.08/07/2018 12:33 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2805356Optimizing Quality for Probabilistic Skyline Computation and Probabilistic Similarity Search
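The paper's binary mechanism is sampling-based and is not reproduced here. As an illustration of the general LDP idea only, the following randomized-response sketch (function names are ours) shows how each user can perturb a single bit locally while the recommender still recovers an unbiased aggregate:

```python
import math
import random

def randomized_response(bit, epsilon):
    """Report the true bit with probability e^eps / (e^eps + 1),
    otherwise its flip; this satisfies epsilon-local differential privacy."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return bit if random.random() < p else 1 - bit

def debiased_sum(reports, epsilon):
    """Unbiased estimate of the true number of 1-bits from perturbed reports."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    n = len(reports)
    return (sum(reports) - n * (1 - p)) / (2 * p - 1)
```

Smaller epsilon flips bits more often, giving stronger privacy at the cost of a noisier aggregate.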
https://www.computer.org/csdl/trans/tk/2018/09/08291012-abs.html
Probabilistic queries have been extensively explored to provide answers with confidence, in order to support real-life applications that struggle with uncertain data, such as sensor networks and data integration. However, data uncertainty may propagate, and thus the results returned by probabilistic queries contain much noise, which <italic>degrades</italic> query quality significantly. In this paper, we propose an efficient optimization framework, termed <inline-formula><tex-math notation="LaTeX">$\mathsf {QueryClean}$</tex-math><alternatives> <inline-graphic xlink:href="gao-ieq1-2805824.gif"/></alternatives></inline-formula>, for both probabilistic skyline computation and probabilistic similarity search. The goal of <inline-formula><tex-math notation="LaTeX">$\mathsf {QueryClean}$</tex-math><alternatives><inline-graphic xlink:href="gao-ieq2-2805824.gif"/></alternatives> </inline-formula> is to optimize query quality by selecting a group of uncertain objects to clean under a limited resource budget, leveraging a joint-entropy-based quality function. We develop an efficient structure called <inline-formula><tex-math notation="LaTeX">$\mathsf {ASI}$</tex-math><alternatives> <inline-graphic xlink:href="gao-ieq3-2805824.gif"/></alternatives></inline-formula> to index the possible result sets of probabilistic queries, which avoids many probabilistic query evaluations over a large number of possible worlds during quality computation. Moreover, we present <italic>exact</italic> and <italic>approximate</italic> algorithms for the optimization problem, using two newly presented heuristics.
Extensive experimental results on both real and synthetic data sets demonstrate the efficiency and scalability of our proposed framework <inline-formula><tex-math notation="LaTeX">$\mathsf {QueryClean}$</tex-math><alternatives> <inline-graphic xlink:href="gao-ieq4-2805824.gif"/></alternatives></inline-formula>.08/07/2018 12:32 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2805824Relationship between Variants of One-Class Nearest Neighbors and Creating Their Accurate Ensembles
https://www.computer.org/csdl/trans/tk/2018/09/08293843-abs.html
In one-class classification problems, only data for the target class is available, whereas data for the non-target class may be completely absent. In this paper, we study one-class nearest neighbor (OCNN) classifiers and their variants. We present a theoretical analysis of the relationships among variants of OCNN that use different neighbors or thresholds to identify unseen examples of the non-target class. We also present a method based on the inter-quartile range for optimizing the OCNN parameters in the absence of non-target data during training. We then propose two ensemble approaches, based on random subspace and random projection methods, to create accurate OCNN ensembles. We test the proposed methods on 15 benchmark and real-world domain-specific datasets and show that random-projection ensembles of OCNN perform best.08/07/2018 12:33 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2806975CRAFTER: A Tree-Ensemble Clustering Algorithm for Static Datasets with Mixed Attributes and High Dimensionality
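The paper's exact thresholding rule is not reproduced here. The following sketch (our simplification) illustrates the basic OCNN idea: accept a point as target-class if its nearest-neighbor distance is within a threshold derived from the inter-quartile range of leave-one-out nearest-neighbor distances on the target training data:

```python
import math

def nn_distance(point, data, exclude=None):
    """Euclidean distance from point to its nearest neighbor in data,
    optionally excluding one index (for leave-one-out)."""
    best = math.inf
    for i, x in enumerate(data):
        if i == exclude:
            continue
        best = min(best, math.dist(point, x))
    return best

def fit_iqr_threshold(train):
    """Set the acceptance threshold from the IQR of leave-one-out
    nearest-neighbor distances (classic upper outlier fence)."""
    dists = sorted(nn_distance(x, train, exclude=i) for i, x in enumerate(train))
    n = len(dists)
    q1, q3 = dists[n // 4], dists[(3 * n) // 4]
    return q3 + 1.5 * (q3 - q1)

def is_target(point, train, threshold):
    """Accept point as target-class if its NN distance is within threshold."""
    return nn_distance(point, train) <= threshold
```

Points close to the training cloud are accepted; points far away are rejected as non-target.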
https://www.computer.org/csdl/trans/tk/2018/09/08294273-abs.html
Clustering is an important aspect of data mining, yet clustering high-dimensional mixed-attribute data in a scalable fashion remains a challenging problem. In this paper, we propose a tree-ensemble clustering algorithm for static datasets, CRAFTER, to tackle this problem. CRAFTER handles categorical and numeric attributes simultaneously and scales well with the dimensionality and size of datasets. It leverages the advantages of a tree-ensemble to handle mixed attributes and high dimensionality, and uses class probability estimates to identify representative data points for clustering. Through a series of experiments on both synthetic and real datasets, we demonstrate that CRAFTER is superior to Random Forest Clustering (RFC), an existing tree-based clustering method, in terms of both clustering quality and computational cost.08/07/2018 12:33 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2807444A Comprehensive Survey of Graph Embedding: Problems, Techniques, and Applications
https://www.computer.org/csdl/trans/tk/2018/09/08294302-abs.html
Graphs are an important data representation appearing in a wide diversity of real-world scenarios. Effective graph analytics provides users with a deeper understanding of what is behind the data and thus can benefit many useful applications such as node classification, node recommendation, and link prediction. However, most graph analytics methods suffer from high computation and space costs. Graph embedding is an effective yet efficient way to solve the graph analytics problem: it converts the graph data into a low-dimensional space in which the graph structural information and graph properties are maximally preserved. In this survey, we conduct a comprehensive review of the literature on graph embedding. We first introduce the formal definition of graph embedding as well as related concepts. We then propose two taxonomies of graph embedding, corresponding to the challenges that arise in different graph embedding problem settings and to how existing work addresses these challenges. Finally, we summarize the applications that graph embedding enables and suggest four promising future research directions in terms of computation efficiency, problem settings, techniques, and application scenarios.08/07/2018 12:33 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2807452A Survey of Location Prediction on Twitter
https://www.computer.org/csdl/trans/tk/2018/09/08295255-abs.html
Locations, e.g., countries, states, cities, and points-of-interest, are central to news, emergency events, and people's daily lives. Automatic identification of locations associated with or mentioned in documents has been explored for decades. As one of the most popular online social network platforms, Twitter has attracted a large number of users who send millions of tweets on a daily basis. Due to the world-wide coverage of its users and the real-time freshness of tweets, location prediction on Twitter has gained significant attention in recent years. Research efforts have been devoted to the new challenges and opportunities brought by the noisy, short, and context-rich nature of tweets. In this survey, we aim to offer an overall picture of location prediction on Twitter. Specifically, we concentrate on the prediction of user home locations, tweet locations, and mentioned locations. We first define the three tasks and review the evaluation metrics. By summarizing the Twitter network, tweet content, and tweet context as potential inputs, we then structurally highlight how the problems depend on these inputs. Each dependency is illustrated by a comprehensive review of the corresponding strategies adopted in state-of-the-art approaches. In addition, we briefly review two related problems, i.e., semantic location prediction and point-of-interest recommendation. Finally, we conclude the survey and list future research directions.08/07/2018 12:33 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2807840Unsupervised Coupled Metric Similarity for Non-IID Categorical Data
https://www.computer.org/csdl/trans/tk/2018/09/08300657-abs.html
Appropriate similarity measures play a critical role in data analytics, learning, and processing. Measuring the intrinsic similarity of categorical data for unsupervised learning has not been substantially addressed, and even less effort has been made for the similarity analysis of categorical data that is not independent and identically distributed (non-IID). In this work, a Coupled Metric Similarity (CMS) is defined for unsupervised learning that flexibly captures the value-to-attribute-to-object heterogeneous coupling relationships. CMS learns similarities in terms of intrinsic heterogeneous intra- and inter-attribute couplings and attribute-to-object couplings in categorical data. The validity of CMS is guaranteed by satisfying metric properties and conditions, and CMS can flexibly adapt from IID to non-IID data. CMS is incorporated into spectral clustering and k-modes clustering and compared with relevant state-of-the-art similarity measures that are not necessarily metrics. The experimental results and theoretical analysis show the effectiveness of CMS in capturing independent and coupled data characteristics; it significantly outperforms other similarity measures on most datasets.08/07/2018 2:05 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2808532Correction to “K Nearest Neighbour Joins for Big Data on MapReduce: A Theoretical and Experimental Analysis”
https://www.computer.org/csdl/trans/tk/2018/09/08426042-abs.html
08/06/2018 2:01 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2748438Low-Rank Multi-View Embedding Learning for Micro-Video Popularity Prediction
https://www.computer.org/csdl/trans/tk/2018/08/08233154-abs.html
Recently, a prevailing trend in user-generated content (UGC) on social media sites has been the emergence of micro-videos. Micro-videos afford many potential opportunities, ranging from network content caching to online advertising, yet little effort has been dedicated to micro-video understanding. In this paper, we focus on popularity prediction of micro-videos by presenting a novel low-rank multi-view embedding learning framework, named transductive low-rank multi-view regression (TLRMVR), which boosts the performance of micro-video popularity prediction by jointly considering the intrinsic representations of the source and target samples. In particular, TLRMVR integrates low-rank multi-view embedding and regression analysis into a unified framework such that the lowest-rank representation shared by all views not only captures the global structure of all views but also reflects the regression requirements. The framework is formulated as a regression model that seeks a set of view-specific projection matrices with low-rank constraints to map multi-view features into a common subspace. In addition, a multi-graph regularization term is constructed to improve the generalization capability and prevent overfitting. Extensive experiments conducted on a publicly available dataset demonstrate that our proposed method achieves promising results compared with state-of-the-art baselines.07/06/2018 2:05 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2785784Learning Multi-Instance Deep Ranking and Regression Network for Visual House Appraisal
https://www.computer.org/csdl/trans/tk/2018/08/08253468-abs.html
This paper presents a weakly supervised regression model for the <italic>visual house appraisal</italic> problem, which aims to predict the value of a house from its photos and textual descriptions (e.g., number of bedrooms). The key idea of our approach is a multi-layer neural network, called the <italic>multi-instance Deep Ranking and Regression</italic> (MiDRR) net, which jointly solves two coupled tasks, ranking and regression, in the multiple-instance setting. The network is trained using weakly supervised data, which does not require intensive human annotation. We also design a set of human heuristics to promote deep features by imposing constraints on the solution space, e.g., a house with three bedrooms often has a higher value than one with only two bedrooms. While these constraints are specific to the studied problem, the developed formulation can be easily generalized to other regression applications. For testing and evaluation purposes, we collect a comprehensive house image benchmark that includes 900,000 photos from 30,000 houses recently traded in the USA, and apply the proposed MiDRR net to predict house values. Extensive evaluations with comparisons demonstrate that the additional use of imagery data as well as human heuristics can significantly boost system performance and that the proposed MiDRR net clearly outperforms the alternative methods.07/06/2018 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2791611NetCycle+: A Framework for Collective Evolution Inference in Dynamic Heterogeneous Networks
https://www.computer.org/csdl/trans/tk/2018/08/08254360-abs.html
Collective inference has attracted considerable attention in the last decade, where the response variables within a group of instances are correlated and should be inferred collectively rather than independently. Previous works on collective inference mainly focus on exploiting the <italic>autocorrelation</italic> among instances in a <italic>static</italic> network during the inference process. There are also approaches to time series prediction, which mainly exploit the autocorrelation within an instance at different time points. However, in many real-world applications, the response variables of related instances can co-evolve over time, and their evolution does not follow a static correlation across time but an internal <italic>life cycle</italic>. In this paper, we study the problem of <italic>collective evolution inference</italic>, where the goal is to predict the values of the response variables for a group of related instances at the end of their life cycles. This problem is extremely important for various applications, e.g., predicting fund-raising results in crowd-funding and predicting gene-expression levels in bioinformatics. It is also highly challenging because different instances in the network can co-evolve over time and can be at different stages of their life cycles, and thus have different evolving patterns. Moreover, the instances in collective evolution inference problems are usually connected through <italic>heterogeneous information networks</italic> (HINs for short), which involve complex relationships among instances interconnected by multiple types of links. We propose an approach, called <italic>NetCycle+</italic>, that incorporates information from both the correlation among related instances and their life cycles. Furthermore, to capture the deep dependencies between nodes in the network, we integrate the graph convolution model into our algorithm. 
We compare our approach with existing collective inference and time series analysis methods on two real-world networks. The results demonstrate that our proposed approach improves inference performance by considering both the autocorrelation through networks and the life cycles of the instances.07/06/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2792020SDE: A Novel Clustering Framework Based on Sparsity-Density Entropy
https://www.computer.org/csdl/trans/tk/2018/08/08254398-abs.html
Clustering data with high dimension and variable densities poses a remarkable challenge to traditional density-based clustering methods. Entropy, a numerical measure of the uncertainty of information, can be used to measure the border degree of samples in data space and to select significant features from a feature set. We use it in a new framework based on sparsity-density entropy (SDE) to cluster data with high dimension and variable densities. First, SDE conducts high-quality sampling of multidimensional data and selects representative features using sparsity score entropy (SSE). Second, the clustering results and noises are obtained using a new density-variable clustering method called density entropy (DE). DE automatically determines the border set based on the global minimum of border degrees and then adaptively performs cluster analysis for each local cluster based on the local minimum of border degrees. The effectiveness and efficiency of the proposed SDE framework are validated on synthetic and real data sets in comparison with several clustering algorithms. The results show that the proposed SDE framework concurrently detects noises and handles data with high dimension and various densities.07/06/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2792021Making a Small World Smaller: Path Optimization in Networks
https://www.computer.org/csdl/trans/tk/2018/08/08255632-abs.html
Reduction of end-to-end network delay is an optimization task with applications in multiple domains. Low delays enable improved information flow in social networks, quick spread of ideas in collaboration networks, low travel times for vehicles on road networks, and increased rate of packets in the case of communication networks. Delay reduction can be achieved by both improving the propagation capabilities of individual nodes and adding additional edges in the network. One of the main challenges in such network design problems is that the effects of local changes are not independent, and as a consequence, there is a combinatorial search-space of possible improvements. Thus, minimizing the cumulative propagation delay requires novel scalable and data-driven approaches. We consider the problem of network delay minimization via node upgrades. We show that the problem is NP-hard and prove strong inapproximability results about it (i.e., APX-hard) even for equal vertex delays. On the positive side, probabilistic approximations for a restricted version of the problem can be obtained. We propose a greedy heuristic to solve the general problem setting which has good quality in practice, but does not scale to very large instances. To enable scalability to real-world networks, we develop approximations for Greedy with probabilistic guarantees for every iteration, tailored to different models of delay distribution and network structures. Our methods scale almost linearly with the graph size and consistently outperform competitors in quality. We evaluate our approaches on several real-world graphs from different genres. 
We achieve up to two orders of magnitude speed-up compared to alternatives from the literature on moderate size networks, and obtain high-quality results in minutes on large datasets while competitors from the literature require more than four hours.07/06/2018 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2792470A Two-Phase Algorithm for Differentially Private Frequent Subgraph Mining
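The paper's probabilistic approximations are beyond a short sketch. As an illustration of the greedy baseline only, under a simplified node-delay model of our own (a path's delay is the sum of its nodes' delays, and upgrading a node lowers its delay to a fixed value), one greedy upgrade pass can look like:

```python
INF = float("inf")

def node_weighted_apsp(n, edges, d):
    """All-pairs shortest-path delays where a path's delay is the sum of the
    node delays along it (Floyd-Warshall adapted to node weights)."""
    dist = [[INF] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = d[i]
    for u, v in edges:
        dist[u][v] = dist[v][u] = min(dist[u][v], d[u] + d[v])
    for k in range(n):
        for i in range(n):
            for j in range(n):
                alt = dist[i][k] + dist[k][j] - d[k]  # node k counted once
                if alt < dist[i][j]:
                    dist[i][j] = alt
    return dist

def greedy_upgrade(n, edges, d, upgraded_delay, budget):
    """Greedily upgrade up to `budget` nodes, each time picking the node whose
    upgrade most reduces the total all-pairs delay."""
    d = list(d)
    for _ in range(budget):
        base = sum(map(sum, node_weighted_apsp(n, edges, d)))
        best, best_total = None, base
        for v in range(n):
            if d[v] <= upgraded_delay:
                continue  # nothing to gain from upgrading this node
            trial = d[:]
            trial[v] = upgraded_delay
            total = sum(map(sum, node_weighted_apsp(n, edges, trial)))
            if total < best_total:
                best, best_total = v, total
        if best is None:
            break
        d[best] = upgraded_delay
    return d
```

On a star graph, the greedy choice naturally upgrades the hub, since every path passes through it; the scalable methods above replace this exhaustive per-iteration evaluation with probabilistic estimates.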
https://www.computer.org/csdl/trans/tk/2018/08/08259370-abs.html
Mining frequent subgraphs from a collection of input graphs is an important task for exploratory data analysis on graph data. However, if the input graphs contain sensitive information, releasing discovered frequent subgraphs may pose considerable threats to individual privacy. In this paper, we study the problem of frequent subgraph mining (FSM) under the rigorous differential privacy model. We present a two-phase differentially private FSM algorithm, referred to as <italic>DFG</italic>. In <italic>DFG</italic>, frequent subgraphs are privately identified in the first phase, and the noisy support of each identified frequent subgraph is calculated in the second phase. In particular, to privately identify frequent subgraphs, we propose a frequent subgraph identification approach that improves the accuracy of discovered frequent subgraphs through candidate pruning. Moreover, to compute the noisy support of each identified frequent subgraph, we devise a lattice-based noisy support computation approach, which leverages the inclusion relations between the discovered frequent subgraphs to improve the accuracy of the noisy supports. Through formal privacy analysis, we prove that <italic>DFG</italic> satisfies <inline-formula><tex-math notation="LaTeX">$\epsilon$</tex-math><alternatives><inline-graphic xlink:href="cheng-ieq1-2793862.gif"/></alternatives></inline-formula>-differential privacy. Extensive experimental results on real datasets show that <italic>DFG</italic> can privately find frequent subgraphs while achieving high data utility.07/06/2018 2:05 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2793862TAKer: Fine-Grained Time-Aware Microblog Search with Kernel Density Estimation
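The lattice-based computation is the paper's contribution and is not reproduced here. The underlying Laplace mechanism for a single support count, assuming (as is standard) that adding or removing one input graph changes any support by at most 1, can be sketched as:

```python
import math
import random

def laplace_noise(scale):
    """Draw Laplace(0, scale) noise via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def noisy_support(true_support, epsilon, sensitivity=1.0):
    """Release a subgraph's support count with Laplace noise; with
    sensitivity 1, Laplace(1/epsilon) noise satisfies epsilon-DP."""
    return true_support + laplace_noise(sensitivity / epsilon)
```

Smaller epsilon means larger noise scale and hence less accurate released supports, which is exactly the utility loss the lattice-based approach mitigates.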
https://www.computer.org/csdl/trans/tk/2018/08/08260854-abs.html
Temporal information has been widely used to improve information retrieval (IR) performance, especially for microblog search, which usually favors the latest news and events. Previous studies mainly focused on incorporating document-level temporal information into retrieval, while the temporal relevance of each query word was not well investigated. In this paper, we propose a word temporal predictor that characterizes word-level temporal relevance via fine-grained time-aware kernel density estimation over the feedback documents. In addition, we present a fine-grained time-aware framework that integrates the proposed word temporal predictor with the traditional document temporal predictor for retrieval. Finally, we incorporate the framework into two state-of-the-art retrieval models, namely the language model (LM) and BM25. The experimental results on the TREC 2011-2014 Microblog collections show that our proposed word temporal predictor effectively boosts retrieval performance within both the LM and BM25 frameworks. In particular, we achieve significant improvements over strong baselines with optimized settings in most cases. Furthermore, our fine-grained time-aware models with the word temporal predictor are comparable to, if not better than, the state-of-the-art temporal retrieval models.07/06/2018 2:05 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2794538Differentially Private Distributed Online Learning
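The paper's exact estimator is not reproduced here. As a minimal illustration of the general idea, a Gaussian kernel density estimate over the publication times of feedback documents containing a query word (timestamps and bandwidth below are illustrative values of our own) assigns higher temporal relevance near bursts of matching documents:

```python
import math

def gaussian_kde(timestamps, bandwidth):
    """Return a Gaussian kernel density estimator over document timestamps."""
    n = len(timestamps)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    def density(t):
        return norm * sum(math.exp(-((t - ti) / bandwidth) ** 2 / 2)
                          for ti in timestamps)
    return density

# Hypothetical publication days of feedback documents containing a query word:
# a burst around day 1 and a lone document at day 8.
word_days = [1.0, 1.2, 1.5, 8.0]
density = gaussian_kde(word_days, bandwidth=0.5)
```

The density is highest inside the burst, lower at the isolated document, and near zero in the gap between them.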
https://www.computer.org/csdl/trans/tk/2018/08/08260919-abs.html
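The article below injects differential-privacy noise into a distributed online learner. A minimal sketch of the Laplace mechanism that ε-DP perturbation commonly relies on (illustrative only; the paper's noise calibration for DOLA is more involved, and all names here are hypothetical):

```python
import math
import random

def laplace_noise(scale, rng):
    # Inverse-CDF sampling from the Laplace(0, scale) distribution.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def privatize(value, epsilon, sensitivity, rng):
    # Laplace mechanism: adding Laplace(sensitivity / epsilon) noise to a
    # query of the given L1-sensitivity satisfies epsilon-differential privacy.
    return value + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(42)
# e.g., perturb one shared model parameter before sending it to a neighbor
noisy_param = privatize(0.37, epsilon=0.5, sensitivity=1.0, rng=rng)
```

Smaller ε means a larger noise scale and therefore stronger privacy at the cost of utility, which is the trade-off the mini-batch variant in the abstract aims to soften.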
In the big data era, data generation presents some new characteristics, including wide distribution, high velocity, high dimensionality, and privacy concerns. To address these challenges for big data analytics, we develop a privacy-preserving distributed online learning framework for data collected from distributed sources. Specifically, each node (i.e., data source) has the capacity to learn a model from its local dataset, and exchanges intermediate parameters with a random subset of its neighboring (logically connected) nodes. Hence, the communication topology of our distributed computing framework is not fixed in practice. As online learning often operates on sensitive data, we introduce the notion of differential privacy (DP) into our distributed online learning algorithm (DOLA) to protect data privacy during learning, preventing an adversary from inferring any significant sensitive information. Our model is of general value for big data analytics in the distributed setting, because it provides rigorous and scalable privacy proofs and has much lower computational complexity than classic schemes, e.g., secure multiparty computation (SMC). To tackle high-dimensional incoming data entries, we study a sparse version of the DOLA with novel DP techniques to save computing resources and improve utility. Furthermore, we present two modified private DOLAs to meet the needs of practical applications. One converts the DOLA to distributed stochastic optimization in an offline setting; the other uses the mini-batch approach to reduce the amount of perturbation noise and improve utility. We conduct experiments on real datasets on a configured distributed platform, and the numerical results validate the feasibility of our private DOLAs.

07/06/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2794384

Paradoxical Correlation Pattern Mining
https://www.computer.org/csdl/trans/tk/2018/08/08263124-abs.html
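The article below mines pairs whose correlation flips direction once a third item is controlled for. This effect (Simpson's paradox) is easy to reproduce on toy data; the numbers here are fabricated purely for illustration:

```python
def pearson(xs, ys):
    # Plain Pearson correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx ** 0.5 * vy ** 0.5)

# Two subpopulations in which y strictly DECREASES with x; pooling them
# shifts both x and y upward in the second group, flipping the sign.
g1_x, g1_y = [1.0, 2.0, 3.0], [8.0, 7.0, 6.0]
g2_x, g2_y = [7.0, 8.0, 9.0], [14.0, 13.0, 12.0]

pooled = pearson(g1_x + g2_x, g1_y + g2_y)   # positive at the global level
within1 = pearson(g1_x, g1_y)                # negative within group 1
within2 = pearson(g2_x, g2_y)                # negative within group 2
```

Here the global correlation alone would suggest a positive association, while controlling for group membership (the confounder) reverses the direction, which is exactly the pattern CONFOUND searches for.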
Given a large transactional database, correlation computing/association analysis aims at efficiently finding strongly correlated items. In traditional association analysis, relationships among variables are usually measured at a global level. In this study, we investigate confounding factors that can help to capture abnormal correlation behaviors at a local level. Indeed, many real-world phenomena are localized to specific markets or subpopulations. Such local relationships may not be visible, or may be miscalculated, when collectively analyzing the entire data. In particular, confounding effects that change the direction of correlation are the most severe problem, because global correlations alone lead to erroneous conclusions. To this end, we propose CONFOUND, an efficient algorithm to identify paradoxical correlation patterns (i.e., where controlling for a third item changes the direction of association for strongly correlated pairs) using effective pruning strategies. Moreover, we also provide an enhanced version of this algorithm, called CONFOUND+, which substantially speeds up the confounder search step. Finally, experimental results show that our proposed CONFOUND and CONFOUND+ algorithms effectively identify confounders and are orders of magnitude faster than benchmark methods.

07/06/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2791602

Duplicate Reduction in Graph Mining: Approaches, Analysis, and Evaluation
https://www.computer.org/csdl/trans/tk/2018/08/08263219-abs.html
At the core of graph mining lies <italic>independent</italic> expansion of substructures, where a substructure (also referred to as a subgraph) <italic>independently</italic> grows into a number of larger substructures in each iteration. Such independent expansion invariably leads to the generation of duplicates. In the presence of graph partitions, duplicates are generated both within and across partitions. Eliminating these duplicates (required for correctness) incurs not only generation and storage cost but also additional computation for their elimination. Since we show that duplicates cannot be eliminated entirely, our primary aim is to design techniques that reduce the number of duplicate substructures generated. This paper introduces three constraint-based optimization techniques, each significantly improving the overall mining cost by reducing the number of duplicates generated. These alternatives provide the flexibility to choose the right technique based on graph properties. We establish the theoretical correctness of each technique, and analyze each with respect to graph characteristics such as degree, number of unique labels, and label distribution. We also investigate the applicability of their combination for further improvements in duplicate reduction. Finally, we discuss the effects of the constraints with respect to the partitioning schemes used in graph mining. Our experiments demonstrate significant benefits of these constraints in terms of storage, computation, and communication cost (specific to partitioned approaches) across graphs with varied characteristics.

07/06/2018 2:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2795003

Capturing the Spatiotemporal Evolution in Road Traffic Networks
https://www.computer.org/csdl/trans/tk/2018/08/08263223-abs.html
Urban road networks undergo frequent traffic congestion during peak hours and around the city center. Capturing the spatiotemporal evolution of the congestion scenario in real time at an urban scale can aid in developing smart traffic management systems and in guiding commuters to make informed decisions about route choice. The congestion scenario is often represented by a set of distinguishable network partitions that have a homogeneous level of congestion inside them but are heterogeneous with respect to each other. Due to the dynamic nature of traffic, these partitions evolve with time in terms of their structure and location. In this paper, we propose a comprehensive framework to capture this evolution by incrementally updating the partitions in an efficient manner using a two-layer approach. The physical layer maintains a set of small road network building blocks at a fine granularity and performs low-level computations to incrementally update them, whereas the logical layer performs high-level computations and serves as an interface to query the physical layer about the congested partitions at a coarse granularity. We also propose an in-memory index called <monospace>Bin</monospace> that compactly stores the historical sets of building blocks in main memory with no information loss and facilitates their efficient retrieval. Our experimental results show that the proposed method is much more efficient than existing re-partitioning methods without significantly sacrificing accuracy, and that <monospace>Bin</monospace> consumes minimal space with little redundancy across time stamps.

07/06/2018 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2795001

Health Monitoring on Social Media over Time
https://www.computer.org/csdl/trans/tk/2018/08/08263414-abs.html
Social media has become a major source for analyzing all aspects of daily life. Thanks to dedicated latent topic analysis methods such as the Ailment Topic Aspect Model (ATAM), public health can now be observed on Twitter. In this work, we are interested in using social media to monitor people’s health over time. The use of tweets has several benefits, including instantaneous data availability at virtually no cost. Early monitoring of health data is complementary to post-factum studies and enables a range of applications, such as measuring behavioral risk factors and triggering health campaigns. We formulate two problems: <italic>health transition detection</italic> and <italic>health transition prediction</italic>. We first propose the Temporal Ailment Topic Aspect Model (TM–ATAM), a new latent model dedicated to solving the first problem by capturing transitions that involve health-related topics. TM–ATAM is a non-obvious extension to ATAM, which was designed to extract health-related topics. It learns health-related topic transitions by <italic>minimizing the prediction error on topic distributions between consecutive posts at different time and geographic granularities</italic>. To solve the second problem, we develop T–ATAM, a Temporal Ailment Topic Aspect Model in which time is treated as a random variable <italic>natively</italic> inside ATAM. Our experiments on an 8-month corpus of tweets show that TM–ATAM outperforms TM–LDA in estimating health-related transitions from tweets for different geographic populations. We examine the ability of TM–ATAM to detect transitions due to climate conditions in different geographic regions. We then show how T–ATAM can be used to predict the most important transition, and additionally compare T–ATAM with CDC (Centers for Disease Control) data and Google Flu Trends.

07/06/2018 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2795606

SLADE: A Smart Large-Scale Task Decomposer in Crowdsourcing
https://www.computer.org/csdl/trans/tk/2018/08/08268652-abs.html
Crowdsourcing has been shown to be effective in a wide range of applications and is seeing increasing use. A large-scale crowdsourcing task often consists of thousands or millions of atomic tasks, each of which is usually simple, such as a binary choice or a simple vote. To distribute a large-scale crowdsourcing task among limited crowd workers, a common practice is to pack a set of atomic tasks into a task bin and send it to a crowd worker in a batch. The challenge is to decompose a large-scale crowdsourcing task into batches of atomic tasks so as to ensure reliable answers at a minimal total cost: large batches lead to unreliable answers for atomic tasks, while small batches incur unnecessary cost. In this paper, we investigate a general crowdsourcing task decomposition problem, called the <underline>S</underline>mart <underline>L</underline>arge-sc<underline>A</underline>le task <underline>DE</underline>composer (SLADE) problem, which aims to decompose a large-scale crowdsourcing task to achieve the desired reliability at a minimal cost. We prove the NP-hardness of the SLADE problem and propose solutions for both <italic>homogeneous</italic> and <italic>heterogeneous</italic> scenarios. For the <italic>homogeneous</italic> SLADE problem, where all atomic tasks share the same reliability requirement, we propose a greedy heuristic algorithm and an efficient and effective approximation framework using an optimal priority queue (OPQ) structure with a provable approximation ratio. For the <italic>heterogeneous</italic> SLADE problem, where atomic tasks can have different reliability requirements, we extend the OPQ-based framework with a partition strategy and also prove its approximation guarantee. Finally, we verify the effectiveness and efficiency of the proposed solutions through extensive experiments on representative crowdsourcing platforms.

07/06/2018 2:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2797962

In Search of Indoor Dense Regions: An Approach Using Indoor Positioning Data
https://www.computer.org/csdl/trans/tk/2018/08/08274887-abs.html
As people spend significant parts of their daily lives indoors, it is useful and important to measure indoor densities and find the dense regions in many scenarios, such as space management and security control. In this paper, we propose a data-driven approach that finds top-<inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="lu-ieq1-2799215.gif"/></alternatives></inline-formula> indoor dense regions by using indoor positioning data. Such data is obtained by indoor positioning systems working at a relatively low frequency, and the reported locations in the data are discrete, drawn from a preselected location set that does not continuously cover the entire indoor space. When a search is triggered, the object positioning information is already out of date, and thus object locations are uncertain. To this end, we first integrate object location uncertainty into the definitions for counting the objects in an indoor region and computing its density. Subsequently, we conduct a thorough analysis of the location uncertainty in the context of complex indoor topology, deriving upper and lower bounds on indoor region densities and introducing a distance decaying effect into the computation of concrete indoor densities. Enabled by the outcomes of this uncertainty analysis, we design efficient search algorithms to solve the problem. Finally, we conduct extensive experimental studies on our proposals using synthetic and real data. The experimental results verify that the proposed search approach is efficient, scalable, and effective. The top-<inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives><inline-graphic xlink:href="lu-ieq2-2799215.gif"/></alternatives></inline-formula> indoor dense regions returned by our search are largely consistent with the ground truth, even though the search uses neither historical data nor extra knowledge about the objects.

07/06/2018 2:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2799215

Link Weight Prediction Using Supervised Learning Methods and Its Application to Yelp Layered Network
https://www.computer.org/csdl/trans/tk/2018/08/08281007-abs.html
Real-world networks feature weighted interactions, where link weights often represent physical attributes. In many situations, to recover missing data or predict network evolution, we need to predict link weights in a network. In this paper, we first propose a series of new centrality indices for links in the line graph. Then, utilizing these line graph indices as well as a number of original graph indices, we design three supervised learning methods to realize link weight prediction in both single-layer and multi-layer networks; these methods perform much better than several recently proposed baseline methods. We find that the resource allocation index (RA) plays a more important role in weight prediction than other topological properties, and that the line graph indices are at least as important as the original graph indices. In particular, the successful application of our methods to the Yelp layered network suggests that we can indeed predict the offline co-foraging behaviors of users based solely on their online social interactions, which may open a new direction for link weight prediction algorithms and, meanwhile, provide insights for designing better restaurant recommendation systems.

07/06/2018 2:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2801854

Road Traffic Speed Prediction: A Probabilistic Model Fusing Multi-Source Data
https://www.computer.org/csdl/trans/tk/2018/07/07955005-abs.html
Road traffic speed prediction is a challenging problem in intelligent transportation systems (ITS) and has gained increasing attention. Existing works are mainly based on raw speed sensing data obtained from infrastructure sensors or probe vehicles, which, however, are limited by the expensive cost of sensor deployment and maintenance. With sparse speed observations, traditional methods based only on speed sensing data are insufficient, especially when emergencies like traffic accidents occur. To address this issue, this paper aims to improve road traffic speed prediction by fusing traditional speed sensing data with new types of “sensing” data from cross-domain sources, such as tweet sensors from social media and trajectory sensors from map and traffic service platforms. Jointly modeling information from different datasets brings many challenges, including the location uncertainty of low-resolution data, the language ambiguity of traffic descriptions in texts, and the heterogeneity of cross-domain data. In response to these challenges, we present a unified probabilistic framework, called the Topic-Enhanced Gaussian Process Aggregation Model (TEGPAM), consisting of three components, i.e., a location disaggregation model, a traffic topic model, and a traffic speed Gaussian Process model, which integrates the new data with traditional data. Experiments on real-world data from two large cities validate the effectiveness and efficiency of our model.

06/05/2018 4:49 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2718525

DPPred: An Effective Prediction Framework with Concise Discriminative Patterns
https://www.computer.org/csdl/trans/tk/2018/07/08052529-abs.html
In the literature, two series of models have been proposed to address prediction problems, including classification and regression. Simple models, such as generalized linear models, have moderate performance but strong interpretability on a set of simple features. The other series, including tree-based models, organizes numerical, categorical, and high-dimensional features into a comprehensive structure with rich interpretable information in the data. In this paper, we propose a novel Discriminative Pattern-based Prediction framework (<inline-formula> <tex-math notation="LaTeX">$\sf {DPPred}$</tex-math><alternatives><inline-graphic xlink:href="shang-ieq1-2757476.gif"/> </alternatives></inline-formula>) that accomplishes prediction tasks by combining the advantages of both effectiveness and interpretability. Specifically, <inline-formula><tex-math notation="LaTeX">$\sf {DPPred}$</tex-math><alternatives> <inline-graphic xlink:href="shang-ieq2-2757476.gif"/></alternatives></inline-formula> adopts the concise discriminative patterns that lie on the prefix paths from the root to leaf nodes in tree-based models. <inline-formula> <tex-math notation="LaTeX">$\sf {DPPred}$</tex-math><alternatives><inline-graphic xlink:href="shang-ieq3-2757476.gif"/> </alternatives></inline-formula> selects a limited number of useful discriminative patterns by searching for the most effective pattern combination to fit generalized linear models. Extensive experiments show that in many scenarios, <inline-formula><tex-math notation="LaTeX">$\sf {DPPred}$</tex-math><alternatives> <inline-graphic xlink:href="shang-ieq4-2757476.gif"/></alternatives></inline-formula> provides accuracy competitive with the state-of-the-art as well as valuable interpretability for developers and experts. In particular, taking a clinical application dataset as a case study, our <inline-formula><tex-math notation="LaTeX">$\sf {DPPred}$</tex-math> <alternatives><inline-graphic xlink:href="shang-ieq5-2757476.gif"/></alternatives></inline-formula> outperforms the baselines by using only 40 concise discriminative patterns out of a potentially exponentially large set of patterns.

06/05/2018 4:49 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2757476

Efficient Parameter Estimation for Information Retrieval Using Black-Box Optimization
https://www.computer.org/csdl/trans/tk/2018/07/08063964-abs.html
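The article below treats retrieval effectiveness as a black-box function of the retrieval function's free parameters. The simplest optimizer it considers, exhaustive grid search, can be sketched as follows (the quadratic effectiveness curve and its peak at 0.75 are illustrative assumptions, not results from the paper):

```python
def grid_search(objective, grid):
    # Black-box optimization by exhaustive search: evaluate the
    # effectiveness measure at every grid point and keep the best.
    best_x, best_val = None, float("-inf")
    for x in grid:
        val = objective(x)
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val

# Hypothetical effectiveness curve for a single free parameter (think of
# BM25's b), peaked at 0.75; a real measure would come from relevance
# judgments on a test collection, possibly under cross-validation.
def effectiveness(b):
    return -(b - 0.75) ** 2

best_b, best_score = grid_search(effectiveness, [i / 20 for i in range(21)])
```

Line search and surrogate-model methods, also studied in the paper, aim to reach a comparable optimum with far fewer evaluations of the expensive objective.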
The retrieval function is one of the most important components of an Information Retrieval (IR) system, because it determines to what extent some information is relevant to a user query. Most retrieval functions have “free parameters” whose value must be set before retrieval, significantly affecting the effectiveness of an IR system. Choosing the optimum values for such parameters is therefore of paramount importance. However, the optimum can only be found after a computationally expensive process, especially when the generalization error is estimated via cross-validation. In this paper, we propose to determine free parameter values by solving an optimization problem aimed at maximizing a measure of retrieval effectiveness. We employ the black-box optimization paradigm, since the analytical expression of the measure of effectiveness with respect to the free parameters is unknown. We consider different methods for solving the black-box optimization problem: a simple grid-search over the whole domain, and more sophisticated techniques such as line search and surrogate model based algorithms. Experimental results on several test collections not only provide useful insight about effectiveness, but also about efficiency: they indicate that with appropriate optimization techniques, the computational cost of parameter optimization can be greatly reduced without compromising retrieval effectiveness, even when taking generalization into account.

06/05/2018 4:49 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2761749

Workload Management in Database Management Systems: A Taxonomy
https://www.computer.org/csdl/trans/tk/2018/07/08086184-abs.html
Workload management is the discipline of effectively monitoring, managing, and controlling workflow across computing systems. In particular, workload management in database management systems (DBMSs) is the process of monitoring and controlling work (i.e., requests) executing on a database system in order to make efficient use of system resources and to achieve any performance objectives assigned to that work. In the past decade, workload management research and practice have made considerable progress in both academia and industry. New techniques have been proposed by researchers, and new workload management facilities have been implemented in most commercial database products. In this paper, we provide a systematic study of workload management in today's DBMSs by developing a taxonomy of workload management techniques. We apply the taxonomy to evaluate and classify existing workload management techniques implemented in commercial databases and described in the recent research literature. We also introduce the underlying principles of today's workload management technology for DBMSs, discuss open problems, and outline research opportunities in this area.

06/05/2018 4:49 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2767044

Second-Order Online Active Learning and Its Applications
https://www.computer.org/csdl/trans/tk/2018/07/08122067-abs.html
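The article below contrasts its second-order approach with the family of first-order online active learning algorithms. A common first-order, margin-based query rule, which queries with probability δ/(δ + |margin|), can be sketched as follows (this illustrates the baseline family the abstract mentions, not SOAL itself; names and constants are hypothetical):

```python
def query_probability(margin, delta=1.0):
    # Margin-based rule: request the true label with probability
    # delta / (delta + |margin|), so instances near the decision boundary
    # (small |w . x|) are queried most often under a limited label budget.
    return delta / (delta + abs(margin))

p_uncertain = query_probability(0.05)  # near the boundary: almost surely query
p_confident = query_probability(4.0)   # far from the boundary: rarely query
```

Second-order methods such as SOAL refine both this query decision and the model update by additionally tracking confidence (second-order) information.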
The goal of online active learning is to learn predictive models from a sequence of unlabeled data given a limited label query budget. Unlike conventional online learning tasks, online active learning is considerably more challenging for two reasons. First, it is difficult to design an effective query strategy to decide when it is appropriate to query the label of an incoming instance given the limited query budget. Second, it is also challenging to decide how to update the predictive models effectively whenever the true label of an instance is queried. Most existing approaches for online active learning are based on a family of first-order online learning algorithms, which are simple and efficient but suffer from slow convergence and sub-optimal exploitation of the labeled training data. To address these issues, this paper presents a novel framework for Second-order Online Active Learning (SOAL) that fully exploits both first-order and second-order information. The proposed algorithms are able to achieve effective online learning efficacy, maximize predictive accuracy, and minimize labeling cost. To make SOAL more practical for real-world applications, especially for class-imbalanced online classification tasks (e.g., malicious web detection), we extend the SOAL framework by proposing the Cost-sensitive Second-order Online Active Learning algorithm named “SOAL<inline-formula><tex-math notation="LaTeX">$_{CS}$</tex-math><alternatives> <inline-graphic xlink:href="hoi-ieq1-2778097.gif"/></alternatives></inline-formula>”, which is devised by maximizing the sum of weighted sensitivity and specificity, or by minimizing the cost of weighted mistakes across classes. We conducted both theoretical analysis and empirical studies, including an extensive set of experiments on a variety of large-scale real-world datasets, whose promising results validate the efficacy and scalability of the proposed algorithms for large-scale online learning tasks.

06/05/2018 4:49 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2778097

Sampling and Reconstruction Using Bloom Filters
https://www.computer.org/csdl/trans/tk/2018/07/08233172-abs.html
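The article below samples from, and reconstructs, sets stored in Bloom filters. A toy Bloom filter together with the naive reconstruct-by-probing baseline can be sketched as follows (this baseline needs a candidate universe and may return false positives; the paper's BloomSampleTree and HashInvert methods are more sophisticated, and all parameters here are arbitrary):

```python
import hashlib

class BloomFilter:
    def __init__(self, m=256, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        # k hash positions derived from salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

def reconstruct(bf, universe):
    # Naive baseline: probe every candidate element; the result is a
    # superset of the stored set, since false positives are possible.
    return {x for x in universe if x in bf}

bf = BloomFilter()
stored = {"apple", "banana", "cherry"}
for w in stored:
    bf.add(w)
recovered = reconstruct(bf, ["apple", "banana", "cherry", "durian", "fig"])
```

Because membership tests never produce false negatives, every stored element is always recovered; the interesting question the paper addresses is doing this efficiently without enumerating a huge universe.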
In this paper, we address the problem of sampling from, and reconstructing, a set stored as a Bloom filter. To the best of our knowledge, our work is the first to address this question. We introduce a novel hierarchical data structure called <inline-formula><tex-math notation="LaTeX">$\mathsf{BloomSampleTree}$</tex-math><alternatives> <inline-graphic xlink:href="sengupta-ieq1-2785803.gif"/></alternatives></inline-formula> that helps us design efficient algorithms to extract an almost uniform sample from the set stored in a Bloom filter, and also allows us to reconstruct the set efficiently. In the case where the hash functions used in the Bloom filter implementation are partially invertible, in the sense that it is easy to calculate the set of elements that map to a particular hash value, we propose a second, more space-efficient method called HashInvert for the reconstruction. We study the properties of these two methods both analytically and experimentally. We provide bounds on the run times of both methods and on the sample quality of the <inline-formula><tex-math notation="LaTeX">$\mathsf{BloomSampleTree}$</tex-math><alternatives> <inline-graphic xlink:href="sengupta-ieq2-2785803.gif"/></alternatives></inline-formula>-based algorithm, and show through an extensive experimental evaluation that our methods are efficient and effective.

06/05/2018 4:49 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2785803

Learning Multiple Factors-Aware Diffusion Models in Social Networks
https://www.computer.org/csdl/trans/tk/2018/07/08234673-abs.html
Information diffusion is a natural phenomenon in social networks. The adoption behavior of a node toward an information piece can be affected by different factors, e.g., freshness and hotness. Previously, many diffusion models have been proposed that consider one or several fixed factors. In fact, the factors affecting a node's adoption decision differ from node to node and may not have been seen before. For a diffusion scenario with new factors, previous diffusion models may not model the diffusion well, or may not be applicable at all. Moreover, uncertainty of information exposure intrinsically exists between two connected nodes, which makes modeling diffusion in social networks more challenging. In this work, our aim is to design a diffusion model in which the factors considered can be flexibly extended and changed, and in which the uncertainty of information exposure is explicitly tackled. Therefore, with different factors, our diffusion model can be adapted to more diffusion scenarios without requiring modification of the learning framework. We conduct comprehensive experiments to show that our diffusion model is effective on two important information diffusion tasks, namely activation prediction and spread estimation.

06/05/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2786209

A Comprehensive Study on Social Network Mental Disorders Detection via Online Social Media Mining
https://www.computer.org/csdl/trans/tk/2018/07/08239661-abs.html
The explosive growth in the popularity of social networking has led to problematic usage. An increasing number of social network mental disorders (SNMDs), such as Cyber-Relationship Addiction, Information Overload, and Net Compulsion, have recently been noted. Symptoms of these mental disorders are usually observed passively today, resulting in delayed clinical intervention. In this paper, we argue that mining online social behavior provides an opportunity to actively identify SNMDs at an early stage. It is challenging to detect SNMDs because the mental status cannot be directly observed from online social activity logs. Our approach, new and innovative to the practice of SNMD detection, does not rely on self-revealing of mental factors via questionnaires in psychology. Instead, we propose a machine learning framework, namely <italic>Social Network Mental Disorder Detection (SNMDD)</italic>, that exploits features extracted from social network data to accurately identify potential cases of SNMDs. We also exploit multi-source learning in SNMDD and propose a new SNMD-based Tensor Model (STM) to improve the accuracy. To increase the scalability of STM, we further improve its efficiency with a performance guarantee. Our framework is evaluated via a user study with 3,126 online social network users. We conduct a feature analysis, apply SNMDD to large-scale datasets, and analyze the characteristics of the three SNMD types. The results show that SNMDD is promising for identifying online social network users with potential SNMDs.

06/05/2018 4:49 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2786695

Leveraging Conceptualization for Short-Text Embedding
https://www.computer.org/csdl/trans/tk/2018/07/08241414-abs.html
Most short-text embedding models represent each short text using only the literal meanings of its words, which leaves these models unable to discriminate among the meanings of ubiquitous polysemous words. To enhance the semantic representation capability of short texts, we (i) propose a novel short-text conceptualization algorithm to assign associated concepts to each short text, and then (ii) introduce the conceptualization results into learning conceptual short-text embeddings. This semantic representation is more expressive than widely used text representation models such as the latent topic model. The short-text conceptualization algorithm used here is based on a novel co-ranking framework that enables the signals (i.e., the words and the concepts) to fully interplay to derive a solid conceptualization for each short text. We further extend the conceptual short-text embedding models with an attention-based model that selects the relevant words within the context to make more efficient predictions. Experiments on real-world datasets demonstrate that the proposed conceptual short-text embedding model and short-text conceptualization algorithm are more effective than state-of-the-art methods.

06/05/2018 4:49 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2787709

Untangling Blockchain: A Data Processing View of Blockchain Systems
https://www.computer.org/csdl/trans/tk/2018/07/08246573-abs.html
Blockchain technologies have gained massive momentum in the last few years. Blockchains are distributed ledgers that enable parties who do not fully trust each other to maintain a set of global states. The parties agree on the existence, values, and histories of the states. As the technology landscape is expanding rapidly, it is both important and challenging to have a firm grasp of what the core technologies have to offer, especially with respect to their data processing capabilities. In this paper, we first survey the state of the art, focusing on private blockchains (in which parties are authenticated). We analyze both in-production and research systems along four dimensions: distributed ledger, cryptography, consensus protocol, and smart contract. We then present BLOCKBENCH, a benchmarking framework for understanding the performance of private blockchains under data processing workloads. We conduct a comprehensive evaluation of three major blockchain systems using BLOCKBENCH, namely Ethereum, Parity, and Hyperledger Fabric. The results demonstrate several trade-offs in the design space, as well as large performance gaps between blockchain and database systems. Drawing on design principles of database systems, we discuss several research directions for bringing blockchain performance closer to the realm of databases.

06/05/2018 4:49 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2781227

Ultra High-Dimensional Nonlinear Feature Selection for Big Biological Data
https://www.computer.org/csdl/trans/tk/2018/07/08248802-abs.html
Machine learning methods are used to discover complex nonlinear relationships in biological and medical data. However, sophisticated learning models are computationally infeasible for data with millions of features. Here, we introduce the first feature selection method for nonlinear learning problems that can scale up to large, ultra-high-dimensional biological data. More specifically, we scale up the novel Hilbert-Schmidt Independence Criterion Lasso (HSIC Lasso) to handle millions of features with tens of thousands of samples. The proposed method is guaranteed to find an optimal subset of maximally predictive features with minimal redundancy, yielding higher predictive power and improved interpretability. Its effectiveness is demonstrated through applications to classifying phenotypes based on module expression in human prostate cancer patients and to detecting enzymes among protein structures. We achieve high accuracy with as few as 20 out of one million features—a dimensionality reduction of 99.998 percent. Our algorithm can be implemented on commodity cloud computing platforms. The dramatic reduction of features may lead to the ubiquitous deployment of sophisticated prediction models in mobile health care applications.06/05/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2789451Sparse-TDA: Sparse Realization of Topological Data Analysis for Multi-Way Classification
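For context, the dependence measure underlying HSIC Lasso can be sketched with a minimal empirical HSIC using linear kernels; this is an illustrative single-machine score for ranking features by their dependence on the response, not the authors' scaled algorithm, and the function names are our own:

```python
import numpy as np

def hsic(x, y):
    # Empirical HSIC with linear kernels: a dependence score between a
    # feature vector x and a response vector y (higher = more dependent).
    n = len(x)
    K = np.outer(x, x)                   # linear kernel on the feature
    L = np.outer(y, y)                   # linear kernel on the response
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def rank_features(X, y):
    # Rank feature indices by HSIC with the response, highest first.
    return sorted(range(len(X)), key=lambda j: -hsic(X[j], y))
```

A constant (uninformative) feature scores zero after centering, while a feature equal to the response scores strictly positive, which is the intuition behind selecting maximally predictive, minimally redundant features.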
https://www.computer.org/csdl/trans/tk/2018/07/08249544-abs.html
Topological data analysis (TDA) has emerged as one of the most promising techniques to reconstruct the unknown shapes of high-dimensional spaces from observed data samples. TDA, thus, yields key shape descriptors in the form of persistent topological features that can be used for any supervised or unsupervised learning task, including multi-way classification. Sparse sampling, on the other hand, provides a highly efficient technique to reconstruct signals in the spatial-temporal domain from just a few carefully-chosen samples. Here, we present a new method, referred to as the Sparse-TDA algorithm, that combines favorable aspects of the two techniques. This combination is realized by selecting an optimal set of sparse pixel samples from the persistent features generated by a vector-based TDA algorithm. These sparse samples are selected from a low-rank matrix representation of persistent features using QR pivoting. We show that the Sparse-TDA method demonstrates promising performance on three benchmark problems related to human posture recognition and image texture classification.06/05/2018 4:49 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2790386Heterogeneous Metric Learning of Categorical Data with Hierarchical Couplings
https://www.computer.org/csdl/trans/tk/2018/07/08252792-abs.html
Learning an appropriate metric is critical for effectively capturing complex data characteristics. Metric learning for categorical data with hierarchical coupling relationships and locally heterogeneous distributions is very challenging yet rarely explored. This paper proposes Heterogeneous mEtric Learning with hIerarchical Couplings (HELIC for short) for this type of categorical data. HELIC captures both low-level value-to-attribute and high-level attribute-to-class hierarchical couplings, and reveals the intrinsic heterogeneities embedded in each level of couplings. Theoretical analyses of effectiveness and of the generalization error bound verify that HELIC effectively represents the above complexities. Extensive experiments on 30 data sets with diverse characteristics demonstrate that HELIC-enabled classification significantly enhances accuracy (by up to 40.93 percent) compared with five state-of-the-art baselines.06/05/2018 4:49 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2791525Multi-View Missing Data Completion
https://www.computer.org/csdl/trans/tk/2018/07/08253467-abs.html
Multi-view data arise naturally in a growing number of scenarios, including medical diagnosis, webpage classification, and multimedia analysis. A challenge in learning from multi-view data is that not all instances are fully represented in all views, resulting in missing view data. In this paper, we focus on feature-level completion for missing views of multi-view data. Aiming at capturing both the semantic complementarity and the identical distribution among different views, an Isomorphic Linear Correlation Analysis (ILCA) method is proposed to linearly map multi-view data to a feature-isomorphic subspace by learning a set of isomorphic features, thereby unfolding the information shared across views. Meanwhile, we assume that the missing view obeys a normal distribution, so the missing-view data matrix can be modeled as a low-rank component plus a sparse contribution. To accomplish missing-view completion, an Identical Distribution Pursuit Completion (IDPC) model based on the learned features is proposed, in which the constraint that the missing view be identically distributed with the available one in the feature-isomorphic subspace is fully exploited. Comprehensive experiments on several multi-view datasets demonstrate that our proposed framework yields promising results.06/05/2018 4:49 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2791607A Comment on “Cross-Platform Identification of Anonymous Identical Users in Multiple Social Media Networks”
https://www.computer.org/csdl/trans/tk/2018/07/08371706-abs.html
The Friend Relationship-Based User Identification (FRUI) algorithm is considered to be the ideal method. However, if the seed User Matched Pairs (UMPs) are not suitable, FRUI may stop early due to controversial UMPs. We highlight this gap and propose a minor change that makes FRUI a more general identification algorithm.06/04/2018 2:01 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2828812Longest Increasing Subsequence Computation over Streaming Sequences
https://www.computer.org/csdl/trans/tk/2018/06/08063967-abs.html
In this paper, we propose a data structure, the quadruple neighbor list (QN-list, for short), to support real-time queries of all <underline>l</underline>ongest <underline>i</underline>ncreasing <underline>s</underline>ubsequences (LIS) and of LIS with constraints over sequential data streams. The QN-List built by our algorithm requires <inline-formula><tex-math notation="LaTeX">$O(w)$</tex-math><alternatives> <inline-graphic xlink:href="zou-ieq1-2761345.gif"/></alternatives></inline-formula> space, where <inline-formula> <tex-math notation="LaTeX">$w$</tex-math><alternatives><inline-graphic xlink:href="zou-ieq2-2761345.gif"/> </alternatives></inline-formula> is the time window size. Building the initial QN-List takes <inline-formula><tex-math notation="LaTeX">$O(w\ \log w)$</tex-math><alternatives><inline-graphic xlink:href="zou-ieq3-2761345.gif"/></alternatives></inline-formula> time. Using the QN-List, inserting a new item takes <inline-formula><tex-math notation="LaTeX">$O(\log w)$ </tex-math><alternatives><inline-graphic xlink:href="zou-ieq4-2761345.gif"/></alternatives></inline-formula> time and deleting the first item takes <inline-formula><tex-math notation="LaTeX">$O(w)$</tex-math><alternatives> <inline-graphic xlink:href="zou-ieq5-2761345.gif"/></alternatives></inline-formula> time. To the best of our knowledge, this is the first work to support both LIS enumeration and constrained LIS computation with a single uniform data structure over real-time sequential data streams. Our method outperforms the state-of-the-art methods in both time and space cost, not only theoretically but also empirically.05/01/2018 1:35 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2761345Scalable Distributed Nonnegative Matrix Factorization with Block-Wise Updates
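For reference, the classic patience-sorting baseline computes the length of one LIS in O(n log n) over a static sequence; the QN-List improves on this setting by supporting sliding windows and full LIS enumeration. A minimal sketch of the baseline (not the paper's data structure):

```python
from bisect import bisect_left

def lis_length(seq):
    # Patience sorting: tails[i] holds the smallest possible tail value
    # of any strictly increasing subsequence of length i + 1.
    tails = []
    for x in seq:
        i = bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)   # extend the longest subsequence found so far
        else:
            tails[i] = x      # keep tails minimal for future extensions
    return len(tails)
```

For example, `lis_length([3, 1, 4, 1, 5, 9, 2, 6])` is 4 (one witness is 1, 4, 5, 9).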
https://www.computer.org/csdl/trans/tk/2018/06/08226795-abs.html
Nonnegative Matrix Factorization (NMF) has been applied with great success to a wide range of applications. As NMF is increasingly applied to massive datasets such as web-scale dyadic data, it is desirable to leverage a cluster of machines to store those datasets and to speed up the factorization process. However, it is challenging to implement NMF efficiently in a distributed environment. In this paper, we show that by leveraging a new form of update functions, we can perform local aggregation and fully exploit parallelism. The new form is therefore much more efficient than the traditional form in distributed implementations. Moreover, under the new form of update functions, we can perform frequent updates and lazy updates, which aim to use the most recently updated data whenever possible and to avoid unnecessary computations. As a result, frequent updates and lazy updates are more efficient than their traditional concurrent counterparts. Through a series of experiments on a local cluster as well as the Amazon EC2 cloud, we demonstrate that our implementations with frequent or lazy updates are up to two orders of magnitude faster than the existing implementation with the traditional form of update functions.05/01/2018 1:35 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2785326Structure Based User Identification across Social Networks
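The "traditional" update functions the paper contrasts with are the classical multiplicative updates of Lee and Seung; a single-machine sketch follows for orientation (this is not the paper's block-wise distributed scheme, and the small damping constant is our own assumption for numerical safety):

```python
import numpy as np

def nmf(V, r, iters=200, seed=0):
    # Classical multiplicative updates: V ≈ W @ H, with W and H kept
    # elementwise nonnegative by the multiplicative form itself.
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r)) + 0.1
    H = rng.random((r, n)) + 0.1
    for _ in range(iters):
        # Each update monotonically decreases the Frobenius error.
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H
```

The distributed difficulty the paper addresses is visible here: each update of `H` reads all of `W` and vice versa, so a naive partitioning forces heavy communication; the paper's reformulated updates allow local aggregation instead.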
https://www.computer.org/csdl/trans/tk/2018/06/08233135-abs.html
Identification of anonymous identical users across platforms refers to recognizing the accounts that belong to the same individual among multiple Social Network (SN) platforms. Evidently, cross-platform exploration may help solve many problems in social computing, in both theory and practice. However, it remains an intractable problem due to the fragmentation, inconsistency, and disruption of the accessible information among SNs. Unlike efforts based on user profiles and user-generated content, many studies have noted the accessibility and reliability of network structure in most SNs for addressing this issue. Although substantial achievements have been made, most current network-structure-based solutions are supervised or semi-supervised, requiring prior knowledge of some given identified users. Labeling such prior knowledge manually is laborious in scenarios where it is hard to obtain. Noting that friend relationships are reliable and consistent across SNs, we propose an unsupervised scheme, termed the Friend Relationship-based User Identification algorithm without Prior knowledge (FRUI-P). FRUI-P first extracts the friend feature of each user in an SN into a friend feature vector, and then calculates the similarities of all candidate identical users between two SNs. Finally, a one-to-one mapping scheme identifies the users based on these similarities. Moreover, FRUI-P is proved to be efficient theoretically. Results of extensive experiments demonstrate that FRUI-P performs much better than the current state-of-the-art network-structure-based algorithm that requires no prior knowledge. Due to its high precision, FRUI-P can additionally be utilized to generate prior knowledge for supervised and semi-supervised schemes. In applications, the unsupervised anonymous identical user identification method accommodates more scenarios, in which seed users are unobtainable.05/01/2018 1:35 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2784430MCS-GPM: Multi-Constrained Simulation Based Graph Pattern Matching in Contextual Social Graphs
https://www.computer.org/csdl/trans/tk/2018/06/08233149-abs.html
Graph Pattern Matching (GPM) has been used in many areas, such as biology, medical science, and physics. With the advent of Online Social Networks (OSNs), GPM has recently been playing a significant role in social network analysis, where it is widely used in, for example, finding experts, social community mining, and social position detection. Given a query that contains a pattern graph <inline-formula><tex-math notation="LaTeX">$G_Q$</tex-math><alternatives> <inline-graphic xlink:href="zheng-ieq1-2785824.gif"/></alternatives></inline-formula> and a data graph <inline-formula> <tex-math notation="LaTeX">$G_D$</tex-math><alternatives><inline-graphic xlink:href="zheng-ieq2-2785824.gif"/> </alternatives></inline-formula>, a GPM algorithm finds those subgraphs, <inline-formula><tex-math notation="LaTeX"> $G_M$</tex-math><alternatives><inline-graphic xlink:href="zheng-ieq3-2785824.gif"/></alternatives></inline-formula>, that match <inline-formula><tex-math notation="LaTeX">$G_Q$</tex-math><alternatives> <inline-graphic xlink:href="zheng-ieq4-2785824.gif"/></alternatives></inline-formula> in <inline-formula> <tex-math notation="LaTeX">$G_D$</tex-math><alternatives><inline-graphic xlink:href="zheng-ieq5-2785824.gif"/> </alternatives></inline-formula>. However, existing GPM methods do not consider the multiple end-to-end constraints of social contexts, such as social relationships, social trust, and social positions, on edges in <inline-formula> <tex-math notation="LaTeX">$G_Q$</tex-math><alternatives><inline-graphic xlink:href="zheng-ieq6-2785824.gif"/> </alternatives></inline-formula>, which are commonly found in various applications, such as crowdsourced travel, social-network-based e-commerce, and study group selection. In this paper, we first conceptually extend Bounded Simulation to <italic>Multi-Constrained Simulation (MCS)</italic>, and propose a novel NP-Complete Multi-Constrained Graph Pattern Matching (MC-GPM) problem.
Then, to address the efficiency issue in large-scale MC-GPM, we propose a new concept called Strong Social Component (SSC), consisting of participants with strong social connections. We also propose an approach to identifying SSCs, along with a novel index method and a graph compression method for SSCs. Moreover, we devise a multithreaded heuristic algorithm, called M-HAMC, to bidirectionally search for MC-GPM results in parallel without decompressing graphs. An extensive empirical study over five real-world large-scale social graphs demonstrates the effectiveness and efficiency of our approach.05/01/2018 1:36 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2785824Multi-Label Learning with Global and Local Label Correlation
https://www.computer.org/csdl/trans/tk/2018/06/08233207-abs.html
It is well known that exploiting label correlations is important to multi-label learning. Existing approaches either assume that the label correlations are global and shared by all instances, or that they are local and shared only by a subset of the data. In real-world applications, however, both cases may occur: some label correlations are globally applicable, while others are shared only within a local group of instances. Moreover, it is also common that only partial labels are observed, which makes exploiting label correlations much more difficult: it is hard to estimate label correlations when many labels are absent. In this paper, we propose a new multi-label approach, <monospace>GLOCAL</monospace>, that deals with both the full-label and missing-label cases and exploits global and local label correlations simultaneously by learning a latent label representation and optimizing label manifolds. Extensive experimental studies validate the effectiveness of our approach on both full-label and missing-label data.05/01/2018 1:35 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2785795Supervised Topic Modeling Using Hierarchical Dirichlet Process-Based Inverse Regression: Experiments on E-Commerce Applications
https://www.computer.org/csdl/trans/tk/2018/06/08239623-abs.html
The proliferation of e-commerce calls for mining consumer preferences and opinions from user-generated text. To this end, topic models have been widely adopted to discover the underlying semantic themes (i.e., topics). Supervised topic models have emerged to leverage discovered topics for predicting the response of interest (e.g., product quality and sales). However, supervised topic modeling remains a challenging problem because of the need to prespecify the number of topics, the lack of predictive information in topics, and limited scalability. In this paper, we propose a novel supervised topic model, <italic>Hierarchical Dirichlet Process-based Inverse Regression</italic> (HDP-IR). HDP-IR characterizes the corpus with a flexible number of topics, which provably retain as much predictive information as the original corpus. Moreover, we develop an efficient inference algorithm capable of handling large-scale corpora (millions of documents or more). Three experiments were conducted to evaluate predictive performance on major e-commerce benchmark testbeds of online reviews. Overall, HDP-IR outperformed existing state-of-the-art supervised topic models. In particular, retaining sufficient predictive information improved predictive R-squared by over 17.6 percent; topic structure flexibility contributed at least 4.1 percent of predictive R-squared. HDP-IR provides an important step for future study of user-generated text from a topic perspective.05/01/2018 1:35 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2786727RNN-DBSCAN: A Density-Based Clustering Algorithm Using Reverse Nearest Neighbor Density Estimates
https://www.computer.org/csdl/trans/tk/2018/06/08240674-abs.html
A new density-based clustering algorithm, <italic>RNN-DBSCAN</italic>, is presented that uses reverse nearest neighbor counts as an estimate of observation density. Clustering is performed using a <italic>DBSCAN</italic>-like approach based on <inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="bryant-ieq1-2787640.gif"/></alternatives></inline-formula> nearest neighbor graph traversals through dense observations. <italic>RNN-DBSCAN</italic> is preferable to the popular density-based clustering algorithm <italic>DBSCAN</italic> in two respects: first, problem complexity is reduced to a single parameter (the choice of <inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="bryant-ieq2-2787640.gif"/></alternatives></inline-formula> nearest neighbors); second, it has an improved ability to handle large variations in cluster density (heterogeneous density). The superiority of <italic>RNN-DBSCAN</italic> is demonstrated on several artificial and real-world datasets with respect to prior reverse-nearest-neighbor-based clustering approaches (<italic>RECORD</italic>, <italic>IS-DBSCAN</italic>, and <italic>ISB-DBSCAN</italic>) along with <italic>DBSCAN</italic> and <italic>OPTICS</italic>. Each of these clustering approaches is described by a common graph-based interpretation, wherein clusters of dense observations are defined as connected components, along with a discussion of their computational complexity. Heuristics for <italic>RNN-DBSCAN</italic> parameter selection are presented, and the effects of <inline-formula><tex-math notation="LaTeX">$k$ </tex-math><alternatives><inline-graphic xlink:href="bryant-ieq3-2787640.gif"/></alternatives></inline-formula> on <italic>RNN-DBSCAN</italic> clusterings are discussed. Additionally, with respect to scalability, an approximate version of <italic>RNN-DBSCAN</italic> is presented, leveraging an existing approximate <inline-formula><tex-math notation="LaTeX"> $k$</tex-math><alternatives><inline-graphic xlink:href="bryant-ieq4-2787640.gif"/></alternatives></inline-formula> nearest neighbor technique.05/01/2018 1:36 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2787640Range Queries on Multi-Attribute Trajectories
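The density estimate at the core of RNN-DBSCAN, the reverse nearest neighbor count, can be sketched with a brute-force kNN computation; the paper's graph-traversal clustering and approximate-kNN scalability tricks are omitted, and the function name is our own:

```python
def rnn_counts(points, k):
    # For each point, count how many other points include it among
    # their own k nearest neighbors (the reverse-kNN count). A point
    # with count >= k can be treated as a dense "core" observation.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n = len(points)
    knn = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist(p, points[j]))
        knn.append(set(others[:k]))
    return [sum(i in knn[j] for j in range(n)) for i in range(n)]
```

Unlike DBSCAN's fixed-radius density, this count adapts to local scale: an outlier far from everything is nobody's nearest neighbor and receives a count of zero.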
https://www.computer.org/csdl/trans/tk/2018/06/08240703-abs.html
Motivated by the trend of providing comprehensive knowledge about trajectory data, we study multi-attribute trajectories, each of which contains a sequence of time-stamped locations and a set of characteristic attributes. This enriches the data representation by providing a comprehensive description of moving objects and thus enables new types of queries on moving object trajectories. In this paper, we consider answering range queries that return trajectories (i) containing particular attribute values and (ii) passing through a certain area during the query time. We integrate standard trajectories and attributes into one unified framework and propose an index structure as well as the query algorithm. The structure is general and flexible in terms of handling both multi-attribute and standard trajectories, answering a range of queries, and supporting update-intensive applications. The evaluation is conducted in a prototype database system, and experimental results demonstrate that our method outperforms alternative methods by a factor of 3-10 on a data set of one million real trajectories with synthetic attribute values.04/30/2018 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2787711Profit Maximization for Viral Marketing in Online Social Networks: Algorithms and Analysis
https://www.computer.org/csdl/trans/tk/2018/06/08241389-abs.html
Information can be disseminated widely and rapidly through Online Social Networks (OSNs) with “word-of-mouth” effects. Viral marketing is such a typical application in which new products or commercial activities are advertised by some seed users in OSNs to other users in a cascading manner. The selection of initial seed users yields a tradeoff between the expense and reward of viral marketing. In this paper, we define a general profit metric that naturally combines the benefit of influence spread with the cost of seed selection in viral marketing. We carry out a comprehensive study on finding a set of seed nodes to maximize the profit of viral marketing. We show that the profit metric is significantly different from the influence metric in that it is no longer monotone. This characteristic differentiates the profit maximization problem from the traditional influence maximization problem. We develop new seed selection algorithms for profit maximization with strong approximation guarantees. We also derive several upper bounds to benchmark the practical performance of an algorithm on any specific problem instance. Experimental evaluations with real OSN datasets demonstrate the effectiveness of our algorithms and techniques.05/01/2018 1:36 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2787757Multi-Instance Learning with Discriminative Bag Mapping
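The non-monotonicity of the profit metric can be made concrete with a toy greedy sketch: here spread is a deterministic reachability count, an assumption for illustration only (the paper works with stochastic diffusion models and gives approximation guarantees this sketch does not have), and all names are ours:

```python
def reach(graph, seeds):
    # Deterministic spread: all nodes reachable from the seed set.
    seen, stack = set(seeds), list(seeds)
    while stack:
        for v in graph.get(stack.pop(), []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def greedy_profit(graph, cost, benefit=1.0):
    # Greedy selection for profit = benefit * spread(S) - cost(S).
    # Because profit is non-monotone, adding a seed can DECREASE it,
    # so the greedy stops once no node has positive marginal profit.
    def profit(S):
        return benefit * len(reach(graph, S)) - sum(cost[v] for v in S)

    S = set()
    while True:
        best, gain = None, 0.0
        for v in set(graph) - S:
            g = profit(S | {v}) - profit(S)
            if g > gain:
                best, gain = v, g
        if best is None:
            return S
        S.add(best)
```

On a star graph where node 1 reaches 2 and 3 and every seed costs 1.5, the greedy picks only node 1: seeding the isolated node 4 would add spread 1 at cost 1.5, a negative marginal profit, which is exactly how profit maximization differs from influence maximization.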
https://www.computer.org/csdl/trans/tk/2018/06/08242668-abs.html
Multi-instance learning (MIL) is a useful tool for tackling labeling ambiguity in learning because it allows a bag of instances to share one label. Bag mapping transforms a bag into a single instance in a new space via instance selection and has drawn significant attention recently. To date, most existing work is based on the original space, using all instances inside each bag for bag mapping, and the selected instances are not directly tied to an MIL objective. As a result, it is difficult to guarantee the distinguishing capacity of the selected instances in the new bag mapping space. In this paper, we propose a discriminative mapping approach for multi-instance learning (MILDM) that aims to identify the best instances to directly distinguish bags in the new mapping space. Accordingly, each instance bag can be mapped using the selected instances to a new feature space, and hence any generic learning algorithm, such as an instance-based learning algorithm, can be used to derive learning models for multi-instance classification. Experiments and comparisons on eight different types of real-world learning tasks (including 14 data sets) demonstrate that MILDM outperforms the state-of-the-art bag mapping multi-instance learning approaches. Results also confirm that MILDM achieves balanced performance between runtime efficiency and classification effectiveness.05/01/2018 1:36 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2788430A Topic Modeling Approach for Traditional Chinese Medicine Prescriptions
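The bag-mapping idea, representing a whole bag by its distances to a small set of selected instances, can be sketched as follows; MILDM's actual contribution, the discriminative selection of those instances, is omitted here and the prototypes are assumed given:

```python
def bag_map(bag, prototypes):
    # Map a bag (list of feature tuples) to one fixed-length vector:
    # coordinate j is the distance from the bag's closest instance to
    # selected instance j. Any single-instance learner can then be
    # trained on the mapped vectors.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    return [min(dist(inst, p) for inst in bag) for p in prototypes]
```

After mapping, every bag has the same dimensionality regardless of how many instances it contains, which is what lets generic classifiers be applied to multi-instance data.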
https://www.computer.org/csdl/trans/tk/2018/06/08242679-abs.html
In traditional Chinese medicine (TCM), prescriptions are the product of doctors’ clinical experience and have been the main way to cure diseases in China for several thousand years. Over this long history, a large number of prescriptions have been invented based on TCM theories. Regularities in prescriptions are important for both clinical practice and novel prescription development. Previous works used many methods to discover regularities in prescriptions, but rarely described how a prescription is generated under TCM theories. In this work, we propose a topic model that characterizes the generative process of prescriptions in TCM theories and further incorporates domain knowledge into the model. Trained on 33,765 prescriptions from TCM prescription books, the model reflects prescribing patterns in TCM. Our method outperforms several previous topic models and group recommendation methods on generalization performance, herb recommendation, symptom suggestion, and prescribing-pattern discovery.05/01/2018 1:35 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2787158Exploring Hierarchical Structures for Recommender Systems
https://www.computer.org/csdl/trans/tk/2018/06/08246532-abs.html
Items in real-world recommender systems exhibit certain hierarchical structures; user preferences also present hierarchical structures. Recent studies show that incorporating the hierarchy of items or user preferences can improve the performance of recommender systems. However, hierarchical structures are often not explicitly available, especially those of user preferences. Thus, there is a gap between the importance of hierarchies and their availability. In this paper, we investigate the problem of exploring implicit hierarchical structures for recommender systems when they are not explicitly available. We propose a novel recommendation framework to bridge the gap, which enables us to explore the implicit hierarchies of users and items simultaneously. We then extend the framework to integrate explicit hierarchies when they are available, yielding a unified framework for both explicit and implicit hierarchical structures. Experimental results on real-world datasets demonstrate the effectiveness of the proposed framework in incorporating implicit and explicit structures.05/01/2018 1:35 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2789443Scalable Content-Aware Collaborative Filtering for Location Recommendation
https://www.computer.org/csdl/trans/tk/2018/06/08246576-abs.html
Location recommendation plays an essential role in helping people find attractive places. Though recent research has studied how to recommend locations with social and geographical information, few studies have addressed the cold-start problem of new users. Because mobility records are often shared on social networks, semantic information can be leveraged to tackle this challenge. A typical method is to feed such information into explicit-feedback-based content-aware collaborative filtering, but these methods require drawing negative samples for better learning performance, since users’ negative preferences are not observable in human mobility; moreover, prior studies have empirically shown that sampling-based methods do not perform well. To this end, we propose a scalable Implicit-feedback-based Content-aware Collaborative Filtering (ICCF) framework that incorporates semantic content and steers clear of negative sampling. We then develop an efficient optimization algorithm that scales linearly with data size and feature size, and quadratically with the dimension of the latent space. We further establish its relationship with graph Laplacian regularized matrix factorization. Finally, we evaluate ICCF on a large-scale LBSN dataset in which users have profiles and textual content. The results show that ICCF outperforms several competing baselines, and that user information is effective not only for improving recommendations but also for coping with cold-start scenarios.05/01/2018 1:35 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2789445Space Filling Approach for Distributed Processing of Top-k Dominating Queries
https://www.computer.org/csdl/trans/tk/2018/06/08248785-abs.html
A top-k dominating query returns the <inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="amagata-ieq1-2790387.gif"/></alternatives></inline-formula> data objects that dominate the highest number of data objects in a given dataset. This query provides us with a set of intuitively preferred data and thus can support a wide variety of multi-criteria decision-making applications, e.g., e-commerce and web search. Due to the growth of data centers and cloud computing infrastructures, these applications are increasingly being operated in distributed environments. This motivates us to address the problem of distributed top-k dominating query processing. We propose an efficient decentralized algorithm that exploits virtual points and returns the exact answer. The virtual points are utilized both to focus the search on the most promising part of the data space and to limit the search space, pruning unnecessary computation and data forwarding. We also propose two other algorithms, which return an approximate answer set while further reducing query processing time. Extensive experiments on both real and synthetic data demonstrate the efficiency and scalability of our algorithms.05/01/2018 1:36 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2790387Sparse Feature Attacks in Adversarial Learning
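The query semantics can be illustrated by a brute-force centralized scorer (the paper's contribution is the distributed, virtual-point-pruned processing, which this sketch does not attempt); smaller values are assumed better in every dimension:

```python
def dominates(p, q):
    # p dominates q if p is no worse in every dimension and strictly
    # better in at least one (minimization convention).
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def topk_dominating(points, k):
    # Score each point by how many points it dominates; return the
    # k points with the highest dominance counts.
    scores = [(sum(dominates(p, q) for q in points), p) for p in points]
    scores.sort(key=lambda s: -s[0])
    return [p for _, p in scores[:k]]
```

This O(n²) scorer is exactly what becomes expensive at scale, which motivates the pruning that the virtual points provide in the distributed setting.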
https://www.computer.org/csdl/trans/tk/2018/06/08249883-abs.html
Adversarial learning is the study of machine learning techniques deployed in non-benign environments. Example applications include classification for detecting spam, network intrusion detection, and credit card scoring. In fact, as the use of machine learning grows in diverse application domains, the possibility of adversarial behavior is likely to increase. When adversarial learning is modeled in a game-theoretic setup, the standard assumption about the adversary (player) is the ability to change <italic>all</italic> features of the classifier (the opponent player) at will, with the adversary paying a cost proportional to the size of the “attack”. We refer to this form of adversarial behavior as a dense feature attack. However, the aim of an adversary is not just to subvert a classifier, but to transform the data in such a way that spam continues to remain effective. We demonstrate that an adversary could potentially achieve this objective by carrying out a sparse feature attack. We design an algorithm to show how a classifier should be constructed to be robust against sparse adversarial attacks. Our main insight is that sparse feature attacks are best defended by designing classifiers which use <inline-formula> <tex-math notation="LaTeX">$\ell _{1}$</tex-math><alternatives><inline-graphic xlink:href="liu-ieq1-2790928.gif"/> </alternatives></inline-formula> regularizers.05/01/2018 1:36 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2790928Answering Natural Language Questions by Subgraph Matching over Knowledge Graphs
https://www.computer.org/csdl/trans/tk/2018/05/08085196-abs.html
RDF question/answering (Q/A) allows users to ask questions in natural language over a knowledge base represented in RDF. To answer a natural language question, existing work takes a two-stage approach: question understanding and query evaluation, focusing on question understanding to deal with the disambiguation of natural language phrases. The most common technique is joint disambiguation, which has an exponential search space. In this paper, we propose a systematic framework to answer natural language questions over an RDF repository (RDF Q/A) from a graph data-driven perspective. We propose a semantic query graph to model the query intention of the natural language question in a structural way, based on which RDF Q/A is reduced to a subgraph matching problem. More importantly, we resolve the ambiguity of natural language questions only when matches of the query are found; the cost of disambiguation is saved if no matches are found. More specifically, we propose two different frameworks to build the semantic query graph: one is relation (edge)-first and the other is node-first. We compare our method with state-of-the-art RDF Q/A systems on the benchmark dataset. Extensive experiments confirm that our method not only improves precision but also greatly speeds up query performance.04/03/2018 3:50 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2766634Robust Prototype-Based Learning on Data Streams
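The reduction to subgraph matching can be illustrated by a minimal triple-pattern matcher over an RDF-like edge list, with variables prefixed by `?`; building the semantic query graph from natural language and resolving ambiguity are the paper's actual contributions and are not modeled here:

```python
def match_pattern(pattern, triple, binding):
    # Try to unify one (s, p, o) pattern with one data triple under the
    # current variable binding; return the extended binding or None.
    b = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if term in b and b[term] != value:
                return None
            b[term] = value
        elif term != value:
            return None
    return b

def query(patterns, data, binding=None):
    # Backtracking search: match patterns one by one against the data,
    # threading variable bindings through; return all complete bindings.
    if binding is None:
        binding = {}
    if not patterns:
        return [binding]
    results = []
    for triple in data:
        b = match_pattern(patterns[0], triple, binding)
        if b is not None:
            results += query(patterns[1:], data, b)
    return results
```

For example, matching the two-edge pattern `(?x knows ?y), (?y knows ?z)` against a two-triple chain returns a single binding, mirroring how a structural query graph finds its matches in the data graph.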
https://www.computer.org/csdl/trans/tk/2018/05/08103824-abs.html
In this paper, we propose a prototype-based classification model for evolving data streams, called SyncStream, which dynamically models time-changing concepts and makes predictions in a local fashion. Instead of learning a single model on a fixed or adaptive sliding window of historical data, or ensemble learning a set of weighted base classifiers, SyncStream captures evolving concepts by dynamically maintaining a set of prototypes in a proposed P-Tree, which are obtained through error-driven representativeness learning and synchronization-inspired constrained clustering. To identify abrupt concept drift in data streams, heuristic approaches based on PCA and statistical analysis are introduced. To further learn the associations among distributed data streams, an extended P-Tree structure and a KNN-style strategy are employed. We demonstrate that our new data stream classification approach has several attractive benefits: (a) SyncStream is capable of dynamically modeling evolving concepts from even a small set of prototypes. (b) Owing to synchronization-based constrained clustering and the P-Tree, SyncStream supports efficient and effective data representation and maintenance. (c) SyncStream is also tolerant of inappropriate or noisy examples via error-driven representativeness learning. (d) SyncStream allows learning relationships among distributed data streams at the instance level. The experimental results indicate its efficiency and effectiveness.04/03/2018 3:51 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2772239Efficient and Scalable Integrity Verification of Data and Query Results for Graph Databases
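A minimal sketch of the error-driven prototype idea only (not SyncStream's P-Tree or synchronization clustering; the 1-D stream and labels are invented): classify each stream item by its nearest prototype and absorb the item as a new prototype whenever the current prototypes misclassify it.

```python
def nearest(protos, x):
    # Prototypes are (value, label) pairs; pick the closest by squared distance.
    return min(protos, key=lambda p: (p[0] - x) ** 2)

def stream_learn(stream, protos):
    """Error-driven representativeness learning, simplified: a misclassified
    item becomes a prototype, so the model tracks concept changes locally."""
    for x, label in stream:
        if nearest(protos, x)[1] != label:
            protos.append((x, label))
    return protos

protos = [(0.0, "a"), (10.0, "b")]
stream = [(1.0, "a"), (9.0, "b"), (5.0, "b"), (4.6, "b")]
protos = stream_learn(stream, protos)
```

Only the item at 5.0 is misclassified by the initial prototypes, so exactly one prototype is added, after which nearby items are labeled correctly.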
https://www.computer.org/csdl/trans/tk/2018/05/08118163-abs.html
Graphs are used for representing and understanding objects and their relationships in numerous applications such as social networks, Semantic Webs, and biological networks. Integrity assurance of data and query results for graph databases is an essential security requirement. In this paper, we propose two efficient integrity verification schemes—HMACs for graphs (gHMAC) for two-party data sharing, and redactable HMACs for graphs (rgHMAC) for third-party data sharing, such as a cloud-based graph database service. Both schemes compute a single HMAC value; the rgHMAC scheme computes two additional verification objects, which are shared with the verifier. We show that the proposed schemes are provably secure with respect to integrity attacks on the structure and/or content of graphs and query results. The proposed schemes have linear complexity in the number of vertices and edges in the graphs, which is shown to be optimal. Our experimental results corroborate that the proposed HMAC-based schemes for graphs are highly efficient compared to digital signature-based schemes—computation of HMAC tags is about 10 times faster than the computation of digital signatures.04/03/2018 3:50 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2776221UniWalk: Unidirectional Random Walk Based Scalable SimRank Computation over Large Graph
https://www.computer.org/csdl/trans/tk/2018/05/08126261-abs.html
SimRank is an important measure of vertex-pair similarity according to the structure of graphs. Although progress has been achieved, existing methods still face challenges in handling large graphs. Besides huge index construction and maintenance costs, existing methods may incur considerable search-space and time overheads in online SimRank queries. In this paper, we design a Monte Carlo based method, UniWalk, to enable the fast top-<inline-formula> <tex-math notation="LaTeX">$k$</tex-math><alternatives><inline-graphic xlink:href="song-ieq1-2779126.gif"/> </alternatives></inline-formula> SimRank computation over large undirected graphs. UniWalk directly locates the top- <inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="song-ieq2-2779126.gif"/></alternatives></inline-formula> similar vertices for any single source vertex <inline-formula><tex-math notation="LaTeX">$u$</tex-math><alternatives> <inline-graphic xlink:href="song-ieq3-2779126.gif"/></alternatives></inline-formula> via <inline-formula> <tex-math notation="LaTeX">$R$</tex-math><alternatives><inline-graphic xlink:href="song-ieq4-2779126.gif"/> </alternatives></inline-formula> sampling paths originating from <inline-formula><tex-math notation="LaTeX">$u$ </tex-math><alternatives><inline-graphic xlink:href="song-ieq5-2779126.gif"/></alternatives></inline-formula>, which avoids selecting a candidate vertex set <inline-formula><tex-math notation="LaTeX">$\mathcal{C}$</tex-math><alternatives> <inline-graphic xlink:href="song-ieq6-2779126.gif"/></alternatives></inline-formula> and the subsequent <inline-formula> <tex-math notation="LaTeX">$O(|\mathcal{C}|R)$</tex-math><alternatives> <inline-graphic xlink:href="song-ieq7-2779126.gif"/></alternatives></inline-formula> bidirectional sampling paths. 
We also devise a path enumeration strategy to improve the SimRank precision by using path probabilities instead of path frequencies when sampling, a space-efficient method to reduce intermediate results, and a path-sharing strategy to lower the redundant path sampling cost for multiple source vertices. Furthermore, we extend UniWalk to existing distributed graph processing frameworks to improve its scalability. We conduct extensive experiments to illustrate that UniWalk has high scalability, and outperforms the state-of-the-art methods by orders of magnitude.04/03/2018 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2779126Game-Theoretic Cross Social Media Analytic: How Yelp Ratings Affect Deal Selection on Groupon?
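For context, a hedged sketch of the classic coupled-walk Monte Carlo SimRank estimator — the bidirectional baseline that UniWalk's unidirectional sampling improves on (the toy graph, decay constant, and walk counts are invented): $s(u,v)$ is estimated as the average of $c^{T}$ over sampled walk pairs, where $T$ is the first step at which the two walks meet.

```python
import random

# Toy undirected graph as adjacency lists.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

def simrank_mc(u, v, graph, c=0.6, walks=2000, max_len=10, seed=7):
    """Estimate SimRank s(u, v) by sampling coupled random walks from u
    and v and accumulating c**t at their first meeting step t."""
    if u == v:
        return 1.0
    rng = random.Random(seed)
    total = 0.0
    for _ in range(walks):
        a, b = u, v
        for t in range(1, max_len + 1):
            a = rng.choice(graph[a])
            b = rng.choice(graph[b])
            if a == b:              # walks meet: contribute c^t
                total += c ** t
                break
    return total / walks

s01 = simrank_mc(0, 1, graph)
```

UniWalk's observation is that on undirected graphs one longer walk from `u` alone can supply both halves of such a sample, avoiding the per-candidate bidirectional walks entirely.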
https://www.computer.org/csdl/trans/tk/2018/05/08128496-abs.html
Deal selection on Groupon is a typical social learning and decision making process, where the quality of a deal is usually unknown to the customers. The customers must acquire this knowledge through social learning from other social media such as reviews on Yelp. Additionally, the quality of a deal depends on both the state of the vendor and the decisions of other customers on Groupon. How social learning and network externality affect the decisions of customers in deal selection on Groupon is our main interest. We develop a data-driven game-theoretic framework to understand rational deal selection behaviors across social media. A sufficient condition for the Nash equilibrium is identified. A value-iteration algorithm is proposed to find the optimal deal selection strategy. We conduct a year-long experiment to trace the competitions among deals on Groupon and the corresponding Yelp ratings. We utilize the dataset to analyze the deal selection game with realistic settings. Finally, the performance of the proposed social learning framework is evaluated with real data. The results suggest that customers do make decisions in a rational way instead of following naive strategies, and that there is still room to improve their decisions with assistance from the proposed framework.04/03/2018 3:51 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2779494Range-Based Nearest Neighbor Queries with Complex-Shaped Obstacles
https://www.computer.org/csdl/trans/tk/2018/05/08128500-abs.html
In this paper, we study a novel variant of obstructed nearest neighbor queries, namely, <italic>range-based obstructed nearest neighbor</italic> (RONN) search. As a natural generalization of <italic>continuous obstructed nearest-neighbor</italic> (CONN) search, an RONN query retrieves a set of <italic>obstructed nearest neighbors</italic> corresponding to every point in a specified range. We propose a new index, namely the binary obstructed tree (called <italic>OB-tree</italic>), for indexing complex objects in the obstructed space. The novelty of the OB-tree lies in the idea of <italic>dividing the obstructed space into non-obstructed subspaces</italic>, aiming to efficiently retrieve highly qualified candidates for RONN processing. We develop an algorithm for constructing the OB-tree and propose a space division scheme, called the <italic>optimal obstacle balance</italic> (OOB2) scheme, to address the tree balance problem. Accordingly, we propose an efficient algorithm, called <italic>RONN by OB-tree Acceleration</italic> (RONN-OBA), which exploits the OB-tree and a binary traversal order of data objects to accelerate query processing of RONN. In addition, we extend our work in several aspects regarding the shape of obstacles, and range-based <inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="zhu-ieq1-2779487.gif"/></alternatives></inline-formula> NN queries in obstructed space. Finally, we conduct a comprehensive performance evaluation using both real and synthetic datasets to validate our ideas and the proposed algorithms. The experimental results show that the RONN-OBA algorithm significantly outperforms the two R-tree based algorithms and RONN-OA.04/03/2018 3:50 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2779487Efficient Information Flow Maximization in Probabilistic Graphs
https://www.computer.org/csdl/trans/tk/2018/05/08166795-abs.html
Reliable propagation of information through large networks, e.g., communication networks, social networks, or sensor networks, is important in many applications, including marketing, social network analysis, and wireless sensor networks. However, social ties of friendship may be obsolete, and communication links may fail, inducing the notion of uncertainty in such networks. In this paper, we address the problem of optimizing information propagation in uncertain networks given a constrained budget of edges. We show that this problem requires solving two NP-hard subproblems: the computation of expected information flow, and the optimal choice of edges. To compute the expected information flow to a source vertex, we propose the <italic>F-tree</italic>, a specialized data structure that identifies independent components of the graph for which the information flow can either be computed analytically and efficiently, or for which traditional Monte-Carlo sampling can be applied independently of the remaining network. For the problem of finding the optimal edges, we propose a series of heuristics that exploit properties of this data structure. Our evaluation shows that these heuristics lead to high-quality solutions, thus yielding high information flow, while maintaining low running time.04/03/2018 3:50 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2780123FBSGraph: Accelerating Asynchronous Graph Processing via Forward and Backward Sweeping
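A hedged sketch of the plain Monte-Carlo baseline that the F-tree accelerates (the toy network, edge probabilities, and the "expected number of reachable vertices" flow measure are invented for illustration): sample possible worlds of the uncertain graph and average the information flow from the source.

```python
import random

# Toy uncertain network: each undirected edge exists independently
# with the given probability.
edges = {(0, 1): 0.9, (1, 2): 0.5, (0, 3): 0.3, (3, 2): 0.8}

def expected_flow(source, edges, samples=5000, seed=1):
    """Monte-Carlo estimate of the expected number of vertices
    reachable from `source` across sampled possible worlds."""
    rng = random.Random(seed)
    total = 0
    for _ in range(samples):
        alive = [e for e, p in edges.items() if rng.random() < p]
        seen, stack = {source}, [source]
        while stack:                      # DFS over the sampled edges
            u = stack.pop()
            for a, b in alive:
                nxt = b if a == u else a if b == u else None
                if nxt is not None and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        total += len(seen) - 1            # reachable vertices besides source
    return total / samples

flow = expected_flow(0, edges)
```

The F-tree's contribution is to confine this sampling to small dependent components and handle the rest analytically, rather than resampling the whole graph as done here.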
https://www.computer.org/csdl/trans/tk/2018/05/08170287-abs.html
Graph algorithms are pervasive in many applications ranging from targeted advertising to natural language processing. Recently, <italic>Asynchronous Graph Processing</italic> (AGP) has become a promising model for supporting graph algorithms on large-scale distributed computing platforms, because the absence of barriers between iterations enables faster convergence and lower synchronization cost than the synchronous model. However, existing AGP methods still suffer from poor performance due to inefficient vertex state propagation. In this paper, we propose an effective and low-cost forward and backward sweeping execution method to accelerate state propagation for AGP, based on the key observation that states in AGP can be propagated between vertices much faster when the vertices are processed sequentially along a graph path within each round. By dividing the graph into paths and asynchronously processing the vertices on each path alternately in forward and backward order according to their position on the path, vertex states in our approach are quickly propagated to other vertices and converge faster with only little additional overhead. To efficiently support this over distributed platforms, we also propose a scheme to reduce the communication overhead, along with a static priority ordering scheme to further improve the convergence speed. Experimental results on a cluster with 1,024 cores show that our approach achieves excellent scalability for large-scale graph algorithms and reduces the overall execution time by at least 39.8 percent in comparison with state-of-the-art methods.04/03/2018 3:50 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2781241Density-Based Place Clustering Using Geo-Social Network Data
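A minimal single-machine sketch of the sweeping idea (assumptions: PageRank-style updates on a toy 3-cycle; the distributed machinery, path partitioning, and priority ordering of the paper are omitted): asynchronous Gauss-Seidel updates applied along a vertex order alternately forward and backward, so freshly updated state propagates within the same round.

```python
def sweep_pagerank(succ, pred, d=0.85, rounds=20):
    """In-place (Gauss-Seidel) PageRank updates, sweeping the vertex
    order forward on even rounds and backward on odd rounds."""
    n = len(succ)
    rank = [1.0] + [0.0] * (n - 1)   # start all mass at vertex 0
    order = list(range(n))
    for r in range(rounds):
        sweep = order if r % 2 == 0 else list(reversed(order))
        for v in sweep:
            rank[v] = (1 - d) / n + d * sum(
                rank[u] / len(succ[u]) for u in pred[v])
    return rank

succ = {0: [1], 1: [2], 2: [0]}      # directed 3-cycle
pred = {0: [2], 1: [0], 2: [1]}
ranks = sweep_pagerank(succ, pred)   # converges toward the uniform 1/3 each
```

Because each update reads ranks written earlier in the same sweep, state crosses the whole path in one round instead of one hop per round, which is the intuition behind the paper's speedup.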
https://www.computer.org/csdl/trans/tk/2018/05/08186182-abs.html
Spatial clustering deals with the unsupervised grouping of places into clusters and finds important applications in urban planning and marketing. Current spatial clustering models disregard who visits the clustered places and when. In this paper, we show how the density-based clustering paradigm can be extended to places visited by users of a geo-social network. Our model considers spatio-temporal information and the social relationships between users who visit the clustered places. After formally defining the model and the distance measure it relies on, we provide alternatives to both. We evaluate the effectiveness of our model via a case study on real data; in addition, we design two quantitative measures, called social entropy and community score, to evaluate the quality of the discovered clusters. The results show that temporal-geo-social clusters have special properties and cannot be found by applying simple spatial clustering approaches or other alternatives.04/03/2018 3:50 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2782256$K$-Ary Tree Hashing for Fast Graph Classification
https://www.computer.org/csdl/trans/tk/2018/05/08186208-abs.html
Existing graph classification usually relies on an exhaustive enumeration of substructure patterns, where the number of substructures expands exponentially with the size of the graph set. Recently, the Weisfeiler-Lehman (WL) graph kernel has achieved the best performance in terms of both accuracy and efficiency among state-of-the-art methods. However, it is still time-consuming, especially for large-scale graph classification tasks. In this paper, we present a <inline-formula><tex-math notation="LaTeX">$K$</tex-math></inline-formula>-Ary Tree based Hashing (KATH) algorithm, which is able to obtain competitive accuracy with a very fast runtime. The main idea of KATH is to construct a traversal table to quickly approximate the subtree patterns in WL using <inline-formula><tex-math notation="LaTeX">$K$</tex-math></inline-formula>-ary trees. Based on the traversal table, KATH employs a recursive indexing process that performs only <inline-formula><tex-math notation="LaTeX">$r$ </tex-math><alternatives><inline-graphic xlink:href="wu-ieq3-2782278.gif"/></alternatives></inline-formula> rounds of matrix indexing to generate all <inline-formula><tex-math notation="LaTeX">$(r-1)$</tex-math><alternatives> <inline-graphic xlink:href="wu-ieq4-2782278.gif"/></alternatives></inline-formula>-depth <inline-formula> <tex-math notation="LaTeX">$K$</tex-math><alternatives><inline-graphic xlink:href="wu-ieq5-2782278.gif"/></alternatives> </inline-formula>-ary trees, where the leaf node labels of a tree can uniquely specify the pattern. After that, the MinHash scheme is used to fingerprint the acquired subtree patterns for a graph. Our experimental results on both real-world and synthetic data sets show that KATH runs significantly faster than state-of-the-art methods while achieving competitive or better accuracy.04/03/2018 3:51 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2782278Minority Oversampling in Kernel Adaptive Subspaces for Class Imbalanced Datasets
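A hedged sketch of the final MinHash step only (the subtree-pattern strings below are invented stand-ins for KATH's traversal-table output): fingerprint each graph's set of subtree patterns so that agreement between fingerprints estimates the Jaccard similarity of the pattern sets.

```python
import hashlib

def minhash(patterns, num_hashes=64):
    """One minimum per salted hash function; the signature of a set."""
    sig = []
    for i in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{i}:{p}".encode()).hexdigest(), 16)
            for p in patterns))
    return sig

# Toy subtree-pattern sets for two graphs (Jaccard similarity = 2/4 = 0.5).
g1 = {"A(B,C)", "B(C)", "C(A,B)"}
g2 = {"A(B,C)", "B(C)", "C(A)"}

sig1, sig2 = minhash(g1), minhash(g2)
sim = sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)
```

The fraction of agreeing signature positions is an unbiased estimate of Jaccard similarity, so fixed-length fingerprints replace exhaustive pattern comparison.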
https://www.computer.org/csdl/trans/tk/2018/05/08187729-abs.html
The class imbalance problem in machine learning occurs when certain classes are underrepresented relative to others, leading to a learning bias toward the majority classes. To cope with the skewed class distribution, many learning methods featuring minority oversampling have been proposed, which have proved effective. To reduce information loss during feature space projection, this study proposes a novel oversampling algorithm, named minority oversampling in kernel adaptive subspaces (MOKAS), which exploits the invariant feature extraction capability of a kernel version of the adaptive-subspace self-organizing map. The synthetic instances are generated from well-trained subspaces and their pre-images are then reconstructed in the input space. Additionally, these instances characterize nonlinear structures present in the minority class data distribution and help the learning algorithms to counterbalance the skewed class distribution in a desirable manner. Experimental results on both real and synthetic data show that the proposed MOKAS is capable of modeling complex data distributions and outperforms a set of state-of-the-art oversampling algorithms.04/03/2018 3:51 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2779849Diagnosing and Minimizing Semantic Drift in Iterative Bootstrapping Extraction
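Not MOKAS itself: a minimal SMOTE-style interpolation sketch of the generic minority-oversampling step the abstract builds on (the 2-D minority points are invented), with synthetic instances generated on line segments between minority examples.

```python
import random

def oversample(minority, n_new, seed=3):
    """Generate n_new synthetic minority points by linear interpolation
    between random pairs of existing minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_pts = oversample(minority, 5)
```

MOKAS differs by interpolating in learned kernel subspaces and reconstructing pre-images, so its synthetic instances can follow nonlinear minority-class structure that straight-line interpolation like this misses.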
https://www.computer.org/csdl/trans/tk/2018/05/08194873-abs.html
Semantic drift is a common problem in iterative information extraction. Previous approaches for minimizing semantic drift may incur substantial loss in recall. We observe that most semantic drift is introduced by a small number of questionable extractions in the earlier rounds of iterations. These extractions subsequently introduce a large number of questionable results, which lead to the semantic drift phenomenon. We call these questionable extractions Drifting Points (DPs). If erroneous extractions are the “symptoms” of semantic drift, then DPs are the “causes” of semantic drift. In this paper, we propose a method to minimize semantic drift by identifying the DPs and removing the effect they introduce. We use isA (concept-instance) extraction as an example to describe our approach to cleaning information extraction errors caused by semantic drift, but we perform experiments on different relation extraction processes over three large real-world extraction collections. The experimental results show that our DP cleaning method enables us to clean around 90 percent of incorrect instances or patterns with about 90 percent precision, outperforming the previous approaches we compare against.04/03/2018 3:50 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2782697A Unified View of Social and Temporal Modeling for B2B Marketing Campaign Recommendation
https://www.computer.org/csdl/trans/tk/2018/05/08214219-abs.html
Business to Business (B2B) marketing aims at meeting the needs of other businesses instead of individual consumers, and thus entails management of more complex business needs than consumer marketing. The buying processes of business customers involve a series of different marketing campaigns providing multifaceted information about the products or services. While most existing studies focus on individual consumers, little has been done to guide business customers due to the dynamic and complex nature of these business buying processes. To this end, in this paper, we focus on providing a unified view of social and temporal modeling for B2B marketing campaign recommendation. Along this line, we first exploit the temporal behavior patterns in B2B buying processes and develop a marketing campaign recommender system. Specifically, we start with constructing a temporal graph as the knowledge representation of the buying process of each business customer. The temporal graph can effectively extract and integrate the campaign order preferences of individual business customers. It is also worth noting that our system is backward compatible, since the participation frequency used in conventional static recommender systems is naturally embedded in our temporal graph. The campaign recommender is then built in a low-rank graph reconstruction framework based on probabilistic graphical models. Our framework can identify common graph patterns and predict missing edges in the temporal graphs. In addition, since business customers very often have different decision makers from the same company, we also incorporate social factors, such as community relationships of the business customers, to further improve the overall performance of missing edge prediction and recommendation. 
Finally, we have performed extensive empirical studies on real-world B2B marketing data sets and the results show that the proposed method can effectively improve the quality of the campaign recommendations for challenging B2B marketing tasks.04/03/2018 3:50 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2783926Index-Based Densest Clique Percolation Community Search in Networks
https://www.computer.org/csdl/trans/tk/2018/05/08214241-abs.html
Community search is important in graph analysis and can be used in many real applications. In the literature, various community models have been proposed. However, most of them cannot properly identify the overlaps between communities, which are an essential feature of real graphs. To address this issue, the <inline-formula> <tex-math notation="LaTeX">$k$</tex-math><alternatives><inline-graphic xlink:href="yuan-ieq1-2783933.gif"/> </alternatives></inline-formula>-clique percolation community model was proposed and has been proven effective in many applications. Motivated by this, in this paper, we adopt the <inline-formula><tex-math notation="LaTeX">$k$</tex-math> <alternatives><inline-graphic xlink:href="yuan-ieq2-2783933.gif"/></alternatives></inline-formula>-clique percolation community model and study the densest clique percolation community search problem, which aims to find the <inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="yuan-ieq3-2783933.gif"/></alternatives></inline-formula>-clique percolation community with the maximum <inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="yuan-ieq4-2783933.gif"/></alternatives></inline-formula> value that contains a given set of query nodes. We adopt an index-based approach to solve this problem. 
Based on the observation that a <inline-formula> <tex-math notation="LaTeX">$k$</tex-math><alternatives><inline-graphic xlink:href="yuan-ieq5-2783933.gif"/> </alternatives></inline-formula>-clique percolation community is a union of maximal cliques, we devise a novel compact index, <inline-formula><tex-math notation="LaTeX">$\mathsf {DCPC}$</tex-math><alternatives> <inline-graphic xlink:href="yuan-ieq6-2783933.gif"/></alternatives></inline-formula>-<inline-formula> <tex-math notation="LaTeX">$\mathsf {Index}$</tex-math><alternatives> <inline-graphic xlink:href="yuan-ieq7-2783933.gif"/></alternatives></inline-formula>, to preserve the maximal cliques and their connectivity information of the input graph. With <inline-formula><tex-math notation="LaTeX">$\mathsf {DCPC}$</tex-math><alternatives><inline-graphic xlink:href="yuan-ieq8-2783933.gif"/></alternatives></inline-formula>- <inline-formula><tex-math notation="LaTeX">$\mathsf {Index}$</tex-math><alternatives> <inline-graphic xlink:href="yuan-ieq9-2783933.gif"/></alternatives></inline-formula>, we can answer the densest clique percolation community query efficiently. Besides, we also propose an index construction algorithm based on the definition of <inline-formula><tex-math notation="LaTeX">$\mathsf {DCPC}$</tex-math><alternatives> <inline-graphic xlink:href="yuan-ieq10-2783933.gif"/></alternatives></inline-formula>-<inline-formula> <tex-math notation="LaTeX">$\mathsf {Index}$</tex-math><alternatives> <inline-graphic xlink:href="yuan-ieq11-2783933.gif"/></alternatives></inline-formula> and further improve the algorithm in terms of efficiency and memory consumption. We conduct extensive performance studies on real graphs and the experimental results demonstrate the efficiency of our index-based query processing algorithm and index construction algorithm.04/03/2018 3:51 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2783933Authenticating Aggregate Queries over Set-Valued Data with Confidentiality
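An illustrative sketch of the k-clique percolation community definition the DCPC-Index is built on (toy graph, k = 3; the invented merge loop below is a brute-force stand-in for the index): communities are maximal unions of k-cliques chained together by overlaps of k - 1 vertices.

```python
from itertools import combinations

# Toy undirected graph: two triangle chains joined at vertex 3.
edges = {(0, 1), (0, 2), (1, 2), (1, 3), (2, 3),
         (3, 4), (4, 5), (3, 5), (4, 6), (5, 6)}
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

k = 3
# Enumerate all k-cliques (here: triangles) by brute force.
cliques = [frozenset(c) for c in combinations(sorted(adj), k)
           if all(b in adj[a] for a, b in combinations(c, 2))]

# Percolate: cliques sharing k-1 vertices fall into the same community.
communities = []
for c in cliques:
    merged = [g for g in communities
              if any(len(c & q) >= k - 1 for q in g)]
    for g in merged:
        communities.remove(g)
    communities.append(set.union({c}, *merged) if merged else {c})

members = [frozenset(v for q in g for v in q) for g in communities]
```

On this graph the two communities {0,1,2,3} and {3,4,5,6} overlap at vertex 3, which is exactly the overlap behavior the model is chosen for; the paper's index exists to avoid this exhaustive clique enumeration.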
https://www.computer.org/csdl/trans/tk/2018/04/08107548-abs.html
With recent advances in data-as-a-service (DaaS) and cloud computing, aggregate query services over set-valued data are becoming widely available for business intelligence that drives decision making. However, as the service provider is often a third-party delegate of the data owner, the integrity of the query results cannot be guaranteed and thus must be authenticated. Unfortunately, existing query authentication techniques either do not work for set-valued data or lack data confidentiality. In this paper, we propose authenticated aggregate queries over set-valued data that not only ensure the integrity of query results but also preserve the confidentiality of source data. As many aggregate queries are composed of multiset operations such as set union and subset, we first develop a family of privacy-preserving authentication protocols for primitive multiset operations. Using these protocols as building blocks, we present a privacy-preserving authentication framework for various aggregate queries and further optimize their authentication performance. Security analysis and empirical evaluation show that our proposed privacy-preserving authentication techniques are feasible and robust under a wide range of system workloads.03/06/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2773541Janus: A Hybrid Scalable Multi-Representation Cloud Datastore
https://www.computer.org/csdl/trans/tk/2018/04/08110657-abs.html
Cloud-based data-intensive applications have to process high volumes of transactional and analytical requests on large-scale data. Businesses base their decisions on the results of analytical requests, creating a need for real-time analytical processing. We propose Janus, a hybrid scalable cloud datastore, which enables the efficient execution of diverse workloads by storing data in different representations. Janus manages big datasets in the context of datacenters, thus supporting scaling out by partitioning the data across multiple servers. This requires Janus to efficiently support distributed transactions. In order to support the different datacenter requirements, Janus also allows diverse partitioning strategies for the different representations. Janus introduces a novel data movement pipeline that continuously keeps the different representations up to date. Unlike existing multi-representation storage systems and Change Data Capture (CDC) pipelines, the data movement pipeline in Janus supports partitioning and handles both distributed transactions and diverse partitioning strategies. In this paper, we focus on supporting Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) workloads, and hence use row- and column-oriented representations, which are the most efficient representations for these workloads. Our evaluations over Amazon AWS illustrate that Janus can provide real-time analytical results, in addition to processing high-throughput transactional workloads.03/06/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2773607Bidding Machine: Learning to Bid for Directly Optimizing Profits in Display Advertising
https://www.computer.org/csdl/trans/tk/2018/04/08115218-abs.html
Real-time bidding (RTB) based display advertising has become one of the key technological advances in computational advertising. RTB enables advertisers to buy individual ad impressions via an auction in real-time and facilitates the evaluation and the bidding of individual impressions across multiple advertisers. In RTB, advertisers face three main challenges when optimizing their bidding strategies, namely (i) estimating the utility (e.g., conversions, clicks) of the ad impression, (ii) forecasting the market value (and thus the cost) of the given ad impression, and (iii) deciding the optimal bid for the given auction based on the first two. Previous solutions assume the first two are solved before addressing the bid optimization problem. However, these challenges are strongly correlated, and dealing with any individual problem independently may not be globally optimal. In this paper, we propose <italic>Bidding Machine</italic>, a comprehensive <italic>learning to bid</italic> framework, which consists of three optimizers dealing with each challenge above and, as a whole, jointly optimizes these three parts. We show that such a joint optimization would largely increase the campaign effectiveness and the profit. From the learning perspective, we show that the bidding machine can be updated smoothly with either offline periodic batch or online sequential training schemes. Our extensive offline empirical study and online A/B testing verify the high effectiveness of the proposed bidding machine.03/06/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2775228Propagation-Based Temporal Network Summarization
https://www.computer.org/csdl/trans/tk/2018/04/08118109-abs.html
Modern networks are very large in size and also evolve with time. As their sizes grow, the complexity of performing network analysis grows as well. Getting a smaller representation of a temporal network with similar properties helps in various data mining tasks. In this paper, we study the novel problem of getting a smaller diffusion-equivalent representation of a set of time-evolving networks. We first formulate a well-founded and general temporal-network condensation problem based on the so-called system-matrix of the network. We then propose <sc>NetCondense</sc>, a scalable and effective algorithm which solves this problem using careful transformations, in sub-quadratic running time and linear space. Our extensive experiments show that we can significantly reduce the size of large real temporal networks (from multiple domains such as social, co-authorship, and email) without much loss of information. We also show the wide applicability of <sc>NetCondense</sc> by leveraging it for several tasks: for example, we use it to understand, explore, and visualize the original datasets and to speed up algorithms for the influence-maximization and event detection problems on temporal networks.03/06/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2776282Reverse $k$ Nearest Neighbor Search over Trajectories
https://www.computer.org/csdl/trans/tk/2018/04/08118111-abs.html
GPS-enabled mobile devices continuously provide new opportunities to improve our daily lives. For example, the data collected in applications created by Uber or Public Transport Authorities can be used to plan transportation routes, estimate capacities, and proactively identify low-coverage areas. In this paper, we study a new kind of query—<italic>Reverse <inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="wang-ieq2-2776268.gif"/></alternatives></inline-formula> Nearest Neighbor Search over Trajectories</italic> (<inline-formula><tex-math notation="LaTeX">$\mathbf{R}{k}\mathbf{NNT}$</tex-math><alternatives> <inline-graphic xlink:href="wang-ieq3-2776268.gif"/></alternatives></inline-formula>), which can be used for route planning and capacity estimation. Given a set of existing routes <inline-formula><tex-math notation="LaTeX"> $\mathcal{D}_{\mathcal{R}}$</tex-math><alternatives><inline-graphic xlink:href="wang-ieq4-2776268.gif"/></alternatives> </inline-formula>, a set of passenger transitions <inline-formula><tex-math notation="LaTeX"> $\mathcal{D}_{\mathcal{T}}$</tex-math><alternatives><inline-graphic xlink:href="wang-ieq5-2776268.gif"/></alternatives> </inline-formula>, and a query route <inline-formula><tex-math notation="LaTeX">$Q$</tex-math><alternatives> <inline-graphic xlink:href="wang-ieq6-2776268.gif"/></alternatives></inline-formula>, an <inline-formula> <tex-math notation="LaTeX">$\mathbf{R}{k}\mathbf{NNT}$</tex-math><alternatives> <inline-graphic xlink:href="wang-ieq7-2776268.gif"/></alternatives></inline-formula> query returns all transitions that take <inline-formula><tex-math notation="LaTeX">$Q$</tex-math><alternatives> <inline-graphic xlink:href="wang-ieq8-2776268.gif"/></alternatives></inline-formula> as one of their <inline-formula> <tex-math notation="LaTeX">$k$</tex-math><alternatives><inline-graphic xlink:href="wang-ieq9-2776268.gif"/> </alternatives></inline-formula> nearest travel routes. 
To solve the problem, we first develop an index to handle dynamic trajectory updates, so that the most up-to-date transition data are available for answering an <inline-formula> <tex-math notation="LaTeX">$\mathbf{R}{k}\mathbf{NNT}$</tex-math><alternatives> <inline-graphic xlink:href="wang-ieq10-2776268.gif"/></alternatives></inline-formula> query. Then we introduce a filter-and-refinement framework for processing <inline-formula><tex-math notation="LaTeX">$\mathbf{R}{k}\mathbf{NNT}$</tex-math> <alternatives><inline-graphic xlink:href="wang-ieq11-2776268.gif"/></alternatives></inline-formula> queries using the proposed indexes. Next, we show how to use <inline-formula><tex-math notation="LaTeX">$\mathbf{R}{k}\mathbf{NNT}$ </tex-math><alternatives><inline-graphic xlink:href="wang-ieq12-2776268.gif"/></alternatives></inline-formula> to solve the optimal route planning problem <inline-formula><tex-math notation="LaTeX">$\mathbf{MaxR}{k}\mathbf{NNT}$</tex-math> <alternatives><inline-graphic xlink:href="wang-ieq13-2776268.gif"/></alternatives></inline-formula> (<inline-formula> <tex-math notation="LaTeX">$\mathbf{MinR}{k}\mathbf{NNT}$</tex-math><alternatives> <inline-graphic xlink:href="wang-ieq14-2776268.gif"/></alternatives></inline-formula>), which searches for the route from a start location to an end location that attracts the maximum (or minimum) number of passengers under a predefined travel distance threshold. Experiments on real datasets demonstrate the efficiency and scalability of our approaches. To the best of our knowledge, this is the first work to study the <inline-formula> <tex-math notation="LaTeX">$\mathbf{R}{k}\mathbf{NNT}$</tex-math><alternatives> <inline-graphic xlink:href="wang-ieq15-2776268.gif"/></alternatives></inline-formula> problem for route planning.
03/06/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2776268
Community Deception or: How to Stop Fearing Community Detection Algorithms
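To make the RkNNT semantics concrete, here is a minimal brute-force sketch in Python. It assumes Euclidean distance between trajectory vertices and models a transition as an (origin, destination) point pair; the paper's dynamic index, filter-and-refinement processing, and road-network distances are not reproduced.

```python
from math import hypot

def point_to_route_dist(p, route):
    # Minimum Euclidean distance from a point to any vertex of the route
    # (a real system would use segment or road-network distances).
    return min(hypot(p[0] - x, p[1] - y) for (x, y) in route)

def transition_to_route_dist(transition, route):
    # A transition is an (origin, destination) pair; score it by the
    # summed distances of both endpoints to the route.
    o, d = transition
    return point_to_route_dist(o, route) + point_to_route_dist(d, route)

def rknnt(routes, transitions, q, k):
    """Return all transitions that have route q among their k nearest routes."""
    candidates = routes + [q]
    result = []
    for t in transitions:
        ranked = sorted(candidates, key=lambda r: transition_to_route_dist(t, r))
        if any(r is q for r in ranked[:k]):
            result.append(t)
    return result
```

This O(|transitions| × |routes|) scan is exactly the cost that the indexing and pruning techniques described above are designed to avoid.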
https://www.computer.org/csdl/trans/tk/2018/04/08118127-abs.html
In this paper, we study the community deception problem, which consists in developing techniques to hide a target community (<inline-formula><tex-math notation="LaTeX">$\mathcal{C}$</tex-math><alternatives> <inline-graphic xlink:href="pirro-ieq1-2776133.gif"/></alternatives></inline-formula>) from community detection algorithms. This need emerges whenever a group (e.g., activists, law enforcement agents, or network participants in general) wants to observe and cooperate in a social network while avoiding detection. We introduce and formalize the community deception problem and devise an efficient algorithm that achieves deception by identifying a certain number (<inline-formula><tex-math notation="LaTeX">$\beta$</tex-math><alternatives> <inline-graphic xlink:href="pirro-ieq2-2776133.gif"/></alternatives></inline-formula>) of <inline-formula> <tex-math notation="LaTeX">$\mathcal{C}$</tex-math><alternatives><inline-graphic xlink:href="pirro-ieq3-2776133.gif"/> </alternatives></inline-formula>’s members’ connections to be rewired. Deception can be practically achieved in social networks like Facebook by friending or unfriending network members as indicated by our algorithm. We compare our approach with another technique based on modularity. By considering a variety of (large) real networks, we provide a systematic evaluation of the robustness of community detection algorithms to deception techniques. Finally, we pose some challenging research questions about the design of detection algorithms robust to deception techniques.
03/06/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2776133
A Novel Representation and Compression for Queries on Trajectories in Road Networks
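As a toy illustration of the rewiring idea behind community deception (not the paper's modularity-based score or algorithm), the following sketch greedily replaces a target community's internal edges with edges to outsiders, reducing a simple visibility proxy:

```python
def internal_fraction(edges, community):
    """Fraction of the community members' incident edges that stay inside
    the community -- a crude proxy for how visible the community is to a
    detection algorithm (not the paper's deception measure)."""
    incident = [e for e in edges if e[0] in community or e[1] in community]
    internal = [e for e in incident if e[0] in community and e[1] in community]
    return len(internal) / len(incident) if incident else 0.0

def deceive(edges, community, outsiders, beta):
    """Greedily rewire up to beta edges: drop an internal edge of the
    target community and add an edge to an outsider instead (the
    'unfriend one member, friend an outsider' move)."""
    edges = set(edges)
    for step in range(beta):
        internal = sorted(e for e in edges
                          if e[0] in community and e[1] in community)
        if not internal:
            break
        u, v = internal[0]
        edges.discard((u, v))
        edges.add((u, outsiders[step % len(outsiders)]))
    return edges
```

Each rewiring keeps the members' degrees roughly stable while dispersing their links across the rest of the network, which is the intuition behind hiding from modularity-style detectors.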
https://www.computer.org/csdl/trans/tk/2018/04/08119585-abs.html
Recording and querying time-stamped trajectories incurs high data storage and computing costs. In this paper, we explore several characteristics of trajectories in road networks, which motivate the idea of coding trajectories by associating timestamps with the relative spatial path and locations. Such a representation exposes a large amount of duplicated information and therefore achieves a lower entropy than existing representations, thereby drastically cutting the storage cost. We propose several techniques to compress the spatial path and the locations separately, which support fast positioning and achieve better compression ratios. For locations, we propose two novel encoding schemes such that the binary codes preserve distance information, which is very helpful for LBS applications. In addition, an unresolved question in this area is whether it is possible to search directly on compressed trajectories and, if so, how. Here, we show that directly querying compressed trajectories under our encoding scheme is possible and can be done efficiently. We design a set of primitive operations for this purpose and propose index structures to reduce query response time. We demonstrate the advantages of our method and compare it against existing ones through a thorough experimental study on real trajectories in road networks.
03/06/2018 2:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2776927
Learning Dynamic Conditional Gaussian Graphical Models
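The intuition that coding timestamps and positions relative to their predecessors lowers entropy can be seen with a toy delta codec. This is only a sketch of the general idea, not the paper's representation or its distance-preserving location codes:

```python
def delta_encode(values):
    # Keep the first value, then store successive differences; on road-
    # network trajectories the deltas are small and highly repetitive,
    # so a generic entropy coder compresses them far better than the
    # raw absolute values.
    return values[:1] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    # Invert delta_encode by accumulating the differences.
    out = deltas[:1]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out
```

For a bus pinging every 30 seconds, `delta_encode([1000, 1030, 1060, 1090])` yields `[1000, 30, 30, 30]`: three identical small symbols instead of four large distinct ones.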
https://www.computer.org/csdl/trans/tk/2018/04/08119838-abs.html
In this paper, we propose a class of dynamic conditional Gaussian graphical models (DCGGMs) based on a set of non-identically distributed observations that change smoothly with time or condition. Specifically, the DCGGMs model a dynamic output network influenced by conditioning input variables, which are encoded by a set of varying parameters. Moreover, we propose a joint smooth graphical Lasso to estimate the DCGGMs, which combines a kernel smoother with a sparse group Lasso penalty. We also design an efficient accelerated proximal gradient algorithm to solve this estimator. Theoretically, we establish the asymptotic properties of our model with respect to consistency and sparsistency in the high-dimensional setting. In particular, we highlight a class of consistency theory for dynamic graphical models, in which the effective sample size can be seen as <inline-formula><tex-math notation="LaTeX">$n^{4/5}$ </tex-math><alternatives><inline-graphic xlink:href="huang-ieq1-2777462.gif"/></alternatives></inline-formula> for estimating a local graphical model when the bandwidth parameter <inline-formula><tex-math notation="LaTeX">$h$ </tex-math><alternatives><inline-graphic xlink:href="huang-ieq2-2777462.gif"/></alternatives></inline-formula> of the kernel smoother is chosen as <inline-formula><tex-math notation="LaTeX">$h\; \asymp\; n^{-1/5}$</tex-math> <alternatives><inline-graphic xlink:href="huang-ieq3-2777462.gif"/></alternatives></inline-formula> to capture the dynamics. Finally, extensive numerical experiments on both synthetic and real datasets support the effectiveness of the proposed method.
03/06/2018 2:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2777462
To Meet or Not to Meet: Finding the Shortest Paths in Road Networks
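A standard building block behind such kernel-smoothed estimators is the kernel-weighted empirical second-moment matrix, sketched below with a Gaussian kernel. The paper additionally applies the sparse group Lasso penalty and the accelerated proximal gradient solver, both omitted here:

```python
from math import exp

def kernel_cov(samples, times, t, h):
    """Kernel-smoothed empirical second-moment matrix at time t: each
    observation x_i (a list of p values, assumed centred) is weighted
    by a Gaussian kernel K_h(t_i - t) with bandwidth h."""
    weights = [exp(-0.5 * ((ti - t) / h) ** 2) for ti in times]
    total = sum(weights)
    p = len(samples[0])
    S = [[0.0] * p for _ in range(p)]
    for w, x in zip(weights, samples):
        for i in range(p):
            for j in range(p):
                S[i][j] += w * x[i] * x[j] / total
    return S
```

With a small bandwidth, observations far from `t` get negligible weight, so the estimate at each time point effectively uses only a local fraction of the data, which is the source of the reduced effective sample size noted above.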
https://www.computer.org/csdl/trans/tk/2018/04/08120025-abs.html
Finding the shortest path in road networks is one of the key issues in location-based services (LBS). The problem of finding the optimal meeting point for a group of users has also been well studied in existing works. In this paper, we investigate a new problem for two users. Each user has his/her own source and destination; however, whether they should meet before proceeding to their destinations is uncertain. We model this as the minimum path pair (<italic>MPP</italic>) query, which consists of two source-destination pairs and a user-specified weight <inline-formula> <tex-math notation="LaTeX">$\alpha$</tex-math><alternatives><inline-graphic xlink:href="shang-ieq1-2777851.gif"/> </alternatives></inline-formula> that balances the two different needs. The result is a pair of paths connecting the two sources and destinations, respectively, minimizing the overall cost of the two paths and of the shortest route between them. To solve <italic>MPP</italic> queries, we devise algorithms that enumerate node pairs, and adopt a location-based pruning strategy to reduce the number of node pairs to enumerate. An efficient algorithm based on point-to-point shortest path calculation is proposed to further improve query efficiency. We also give two fast approximate algorithms with approximation bounds. Extensive experiments show the effectiveness and efficiency of our methods.
03/06/2018 2:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2777851
Efficient Computation of G-Skyline Groups
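A brute-force baseline for the MPP query can be written by enumerating meeting node pairs, which is exactly the search space the pruning strategies above shrink. The linear blending of path cost and meeting detour via alpha is one plausible reading of the weight, assumed here for illustration only:

```python
import heapq

def dijkstra(graph, src):
    """Shortest-path distances from src on a dict-of-dicts weighted graph."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float('inf')):
            continue  # stale queue entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def mpp(graph, s1, t1, s2, t2, alpha):
    """Brute-force minimum path pair: try every meeting node pair (u, v),
    route user 1 via u and user 2 via v, and blend the total path cost
    with the distance between the meeting nodes via alpha."""
    dist = {n: dijkstra(graph, n) for n in graph}
    best = (float('inf'), None)
    for u in graph:
        for v in graph:
            paths = dist[s1][u] + dist[u][t1] + dist[s2][v] + dist[v][t2]
            total = (1 - alpha) * paths + alpha * dist[u][v]
            best = min(best, (total, (u, v)))
    return best
```

The quadratic enumeration over node pairs is what makes location-based pruning essential on real road networks.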
https://www.computer.org/csdl/trans/tk/2018/04/08120106-abs.html
The skyline of a data point set consists of the best points in the set and is very important for multi-criteria decision making. In recent years, the skyline problem has attracted increasing attention, and many variants of the traditional skyline have emerged in the database field. One recent and important variant is the group-based skyline, which aims to find the best groups of points in a given set. In this paper, we propose an efficient approach, called <italic>minimum dominance search</italic> (MDS), to solve the g-skyline problem, the latest group-based skyline problem. MDS consists of two steps: in the first step, we construct a novel g-skyline support structure, the minimum dominance graph (MDG), which we prove to be a minimum g-skyline support structure. In the second step, we search for g-skyline groups over the MDG with two search algorithms, and employ a skyline-combination-based optimization strategy to improve both. Comprehensive experiments on both synthetic and real-world data sets show that our algorithms are orders of magnitude faster than the state of the art in most cases.
03/06/2018 2:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2777994
Linguistic Petri Nets Based on Cloud Model Theory for Knowledge Representation and Reasoning
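For readers unfamiliar with skylines, the point-level dominance test underlying all skyline variants (including g-skyline, which lifts it to groups of points) can be sketched as follows, assuming smaller values are better in every dimension:

```python
def dominates(p, q):
    """p dominates q if p is no worse in every dimension and strictly
    better in at least one (smaller is better here)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Points not dominated by any other point in the set."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For example, among hotels described by (price, distance), `(1, 2)` and `(2, 1)` survive while `(2, 2)` is dominated by `(1, 2)`. This quadratic sketch is only the classical per-point skyline; the g-skyline algorithms above operate on groups and use the MDG structure to avoid such exhaustive checks.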
https://www.computer.org/csdl/trans/tk/2018/04/08123830-abs.html
Fuzzy Petri nets (FPNs) are a vital modeling technique for the construction of knowledge-based systems and have been commonly used in many fields, such as fault diagnosis, risk assessment, workflow management, and disassembly process planning. However, conventional FPNs have been criticized for two reasons: 1) the representation parameters in FPNs cannot precisely model experts' experience, since it is difficult to manage the fuzziness and randomness of knowledge assessments simultaneously, and 2) the weight coefficients in existing approximate reasoning algorithms can hardly reflect the associated weights of reordered places. In response, we propose a new type of FPN, called cloud reasoning Petri nets (CRPNs), based on the concept of interval clouds and the hybrid averaging operator. The cloud production rules in a knowledge-based system are modeled by CRPNs, where the truth degrees of places, the certainty factors of rules, and the thresholds of transitions are represented by interval clouds. Moreover, a matrix-operation-based reasoning algorithm is proposed to improve the efficiency of calculating final truth degrees, in which both local and ordered weight coefficients are taken into consideration. Finally, a practical example concerning a power system demonstrates the usefulness and advantages of the proposed CRPN model.
03/06/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2778256
Towards Why-Not Spatial Keyword Top-$k$ Queries: A Direction-Aware Approach
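For background, one reasoning step in a classical scalar-valued FPN can be sketched as below; the CRPNs described above replace these scalar truth degrees, certainty factors, and thresholds with interval clouds and add ordered weight coefficients, which this sketch omits:

```python
def fpn_step(truth, rules):
    """One reasoning step in a classical fuzzy Petri net.  Each rule is
    (input_places, output_place, certainty_factor, threshold): it fires
    when the minimum truth degree of its input places reaches the
    threshold, asserting the output with min(inputs) * certainty_factor.
    Competing assertions of the same place are combined by max."""
    new = dict(truth)
    for inputs, output, cf, threshold in rules:
        t = min(truth.get(p, 0.0) for p in inputs)
        if t >= threshold:
            new[output] = max(new.get(output, 0.0), t * cf)
    return new
```

Iterating `fpn_step` until the truth assignment stops changing propagates evidence from observed places to conclusions, which is the behaviour the matrix-operation-based algorithm computes in bulk.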
https://www.computer.org/csdl/trans/tk/2018/04/08125132-abs.html
With the continued proliferation of location-based services, a growing number of web-accessible data objects are geo-tagged and have text descriptions. An important query over such web objects is the <italic>direction-aware spatial keyword query</italic>, which aims to retrieve the top-<inline-formula><tex-math notation="LaTeX">$k$</tex-math> <alternatives><inline-graphic xlink:href="xu-ieq2-2778731.gif"/></alternatives></inline-formula> objects that best match the query parameters in terms of spatial distance and textual similarity in a given query direction. In some cases, it can be difficult for users to specify appropriate query parameters. After receiving a query result, users may find that some desired objects are unexpectedly missing and may therefore question the entire result. Enabling why-not questions in this setting may aid users in retrieving better results, thus improving the overall utility of the query functionality. This paper studies the direction-aware why-not spatial keyword top-<inline-formula> <tex-math notation="LaTeX">$k$</tex-math><alternatives><inline-graphic xlink:href="xu-ieq3-2778731.gif"/></alternatives> </inline-formula> query problem. We propose efficient query refinement techniques that revive missing objects by minimally modifying users’ direction-aware queries. We prove that the best refined query directions lie in a finite solution space for a special case, and reduce the search for the optimal refinement to a linear programming problem for the general case. Extensive experimental studies demonstrate that the proposed techniques outperform a baseline method by two orders of magnitude and are robust across a broad range of settings.
03/06/2018 2:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2778731
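An illustrative, deliberately simplified direction-aware top-k scorer is sketched below. The angular filter, the Euclidean distance, and the linear mixing of text overlap and distance are assumptions made for exposition, not the ranking function of the paper:

```python
from math import atan2, degrees, hypot

def in_direction(query_loc, obj_loc, direction, width):
    """True if obj_loc lies within a sector of `width` degrees centred
    on the query direction (degrees, measured counter-clockwise)."""
    ang = degrees(atan2(obj_loc[1] - query_loc[1],
                        obj_loc[0] - query_loc[0])) % 360
    return abs((ang - direction + 180) % 360 - 180) <= width / 2

def topk(objects, query_loc, query_terms, direction, width, k, alpha=0.5):
    """Rank objects inside the query sector by a weighted mix of keyword
    overlap (higher is better) and spatial distance (lower is better)."""
    qt = set(query_terms)
    scored = []
    for loc, terms in objects:
        if not in_direction(query_loc, loc, direction, width):
            continue  # outside the query direction: never in the result
        d = hypot(loc[0] - query_loc[0], loc[1] - query_loc[1])
        text = len(qt & set(terms)) / len(qt) if qt else 0.0
        scored.append((alpha * text - (1 - alpha) * d, (loc, terms)))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [obj for _, obj in scored[:k]]
```

A why-not question then asks how to minimally widen or rotate `direction`/`width` (or adjust `alpha` and `k`) so that a named missing object enters the result, which is the refinement problem the paper solves.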