IEEE Transactions on Knowledge & Data Engineering
https://www.computer.org/csdl/trans/tk/index.html
The IEEE Transactions on Knowledge and Data Engineering is an archival journal published monthly. The information published in this Transactions is designed to inform researchers, developers, managers, strategic planners, users, and others interested in state-of-the-art and state-of-the-practice activities in the knowledge and data engineering area. We are interested in well-defined theoretical results and empirical studies that have potential impact on the acquisition, management, storage, and graceful degradation of knowledge and data, as well as in the provision of knowledge and data services. Specific topics include, but are not limited to: a) artificial intelligence techniques, including speech, voice, graphics, images, and documents; b) knowledge and data engineering tools and techniques; c) parallel and distributed processing; d) real-time distributed systems; e) system architectures, integration, and modeling; f) database design, modeling, and management; g) query design and implementation languages; h) distributed database control; j) algorithms for data and knowledge management; k) performance evaluation of algorithms and systems; l) data communications aspects; m) system applications and experience; n) knowledge-based and expert systems; and o) integrity, security, and fault tolerance.
IEEE Computer Society Digital Library
List of 100 recently published journal articles.
https://www.computer.org/csdl
Platform-Independent Robust Query Processing
https://www.computer.org/csdl/trans/tk/2019/01/07843652-abs.html
To address the classical selectivity estimation problem for OLAP queries in relational databases, a radically different approach called PlanBouquet was recently proposed in [1], wherein the estimation process is completely abandoned and replaced with a calibrated discovery mechanism. The beneficial outcome of this new construction is that provable guarantees on worst-case performance, measured as Maximum Sub-Optimality (MSO), are obtained, thereby facilitating robust query processing. The PlanBouquet formulation suffers, however, from a systemic drawback: the MSO bound is a function of not only the query, but also the optimizer's behavioral profile over the underlying database platform. As a result, there are adverse consequences: (i) the bound value becomes highly variable, depending on the specifics of the current operating environment, and (ii) it becomes infeasible to compute the value without substantial investments in preprocessing overheads. In this paper, we first present SpillBound, a new query processing algorithm that retains the core strength of the PlanBouquet discovery process but reduces the bound dependency to only the query. It does so by incorporating plan termination and selectivity monitoring mechanisms in the database engine. Specifically, SpillBound delivers a worst-case multiplicative bound of $D^2+3D$, where $D$ is simply the number of error-prone predicates in the user query.
Consequently, the bound value becomes independent of the optimizer and the database platform, and the guarantee can be issued simply by query inspection. We go on to prove that SpillBound is within an $O(D)$ factor of the best possible deterministic selectivity discovery algorithm in its class. We next devise techniques to bridge this quadratic-to-linear MSO gap by introducing the notion of contour alignment, a characterization of the nature of plan structures along the boundaries of the selectivity space. Specifically, we propose a variant of SpillBound, called AlignedBound, which exploits the alignment property and provides a guarantee in the range $[2D+2, D^2+3D]$. Finally, a detailed empirical evaluation over the standard decision-support benchmarks indicates that: (i) SpillBound provides markedly superior performance w.r.t. MSO as compared to PlanBouquet, and (ii) AlignedBound provides additional benefits for query instances that are challenging for SpillBound, often coming close to the ideal of MSO linearity in $D$. From an absolute perspective, AlignedBound evaluates virtually all the benchmark queries considered in our study with an MSO of around 10 or less.
Therefore, in an overall sense, SpillBound and AlignedBound offer a substantive step forward in the long-standing quest for robust query processing.
12/07/2018 8:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2664827
Inferring Higher-Order Structure Statistics of Large Networks from Sampled Edges
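To make the guarantees quoted in this abstract concrete, the following minimal sketch tabulates the stated worst-case MSO bounds as functions of $D$, the number of error-prone predicates; the function names are illustrative, not from the paper.

```python
# Worst-case MSO guarantees quoted in the abstract, as functions of D,
# the number of error-prone predicates in the user query.
def spillbound_mso(d: int) -> int:
    return d * d + 3 * d                 # SpillBound: D^2 + 3D

def alignedbound_mso_range(d: int) -> tuple:
    return (2 * d + 2, d * d + 3 * d)    # AlignedBound: [2D+2, D^2+3D]

for d in range(1, 6):
    lo, hi = alignedbound_mso_range(d)
    print(f"D={d}: SpillBound <= {spillbound_mso(d)}, AlignedBound in [{lo}, {hi}]")
```

For small $D$ the two coincide (at $D=1$ both give 4), while the gap between the quadratic and linear ends of the range grows with $D$.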
https://www.computer.org/csdl/trans/tk/2019/01/07884989-abs.html
Recently, exploring locally connected subgraphs (also known as motifs or graphlets) of complex networks has attracted a lot of attention. Previous work made the strong assumption that the graph topology of interest is known in advance. In practice, researchers sometimes have to deal with situations where the graph topology is unknown, because it is expensive to collect and store all topological information. Hence, what is typically available to researchers is only a snapshot of the graph, i.e., a subgraph of the graph. Crawling methods such as breadth-first sampling can be used to generate the snapshot. However, these methods fail to sample a streaming graph represented as a high-speed stream of edges. Therefore, graph mining applications such as network traffic monitoring usually use random edge sampling (i.e., sampling each edge with a fixed probability) to collect edges and generate a sampled graph, which we call a “RESampled graph”. Clearly, a RESampled graph's motif statistics may be quite different from those of the original graph. To resolve this, we propose a framework, Minfer, which takes the given RESampled graph and accurately infers the underlying graph's motif statistics. Experiments using large scale datasets show the accuracy and efficiency of our method.
12/07/2018 8:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2685584
Interactive Data Exploration with Smart Drill-Down
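The random edge sampling step the abstract describes is simple to state in code. A minimal sketch (names are illustrative; this is not Minfer itself, only the sampling stage whose bias Minfer must correct):

```python
import random

def resample_graph(edges, p, seed=0):
    """Random edge sampling: keep each edge independently with
    probability p, yielding a 'RESampled graph'."""
    rng = random.Random(seed)
    return [e for e in edges if rng.random() < p]

edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]
sampled = resample_graph(edges, p=0.5)
# Motif counts in `sampled` are biased low: e.g., a triangle survives
# only with probability p**3, which is the kind of distortion that
# inference over the RESampled graph must account for.
```

This also shows why naive motif counts on the sampled graph underestimate the original: each k-edge motif survives only with probability p**k.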
https://www.computer.org/csdl/trans/tk/2019/01/07885129-abs.html
We present smart drill-down, an operator for interactively exploring a relational table to discover and summarize “interesting” groups of tuples. Each group of tuples is described by a rule. For instance, the rule $(a, b, \star, 1000)$ tells us that there are 1,000 tuples with value $a$ in the first column and $b$ in the second column (and any value in the third column). Smart drill-down presents an analyst with a list of rules that together describe interesting aspects of the table. The analyst can tailor the definition of interesting, and can interactively apply smart drill-down on an existing rule to explore that part of the table. We demonstrate that the underlying optimization problems are NP-hard, describe an algorithm for finding the approximately optimal list of rules to display when the user performs a smart drill-down, and present a dynamic sampling scheme for efficiently interacting with large tables. Finally, we perform experiments on real datasets with our experimental prototype to demonstrate the usefulness of smart drill-down and study the performance of our algorithms.
12/07/2018 8:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2685998
$l$-Injection: Toward Effective Collaborative Filtering Using Uninteresting Items
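The rule semantics in the abstract's $(a, b, \star, 1000)$ example can be sketched directly: a rule is a tuple of values and wildcards, and its count is the number of table tuples it covers. This is only an illustration of the rule notation, not the paper's optimization algorithm.

```python
STAR = "*"  # wildcard: matches any value in that column

def matches(rule, row):
    return all(r == STAR or r == v for r, v in zip(rule, row))

def rule_count(rule, table):
    """The count displayed with a rule (the 1000 in (a, b, *, 1000))
    is the number of tuples the rule covers."""
    return sum(matches(rule, row) for row in table)

table = [("a", "b", "x"), ("a", "b", "y"), ("a", "c", "x")]
assert rule_count(("a", "b", STAR), table) == 2
assert rule_count(("a", STAR, "x"), table) == 2
```

Drilling down on a rule then amounts to replacing one of its wildcards with concrete values and re-ranking the resulting, more specific rules.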
https://www.computer.org/csdl/trans/tk/2019/01/07913668-abs.html
We develop a novel framework, named $l$-injection, to address the sparsity problem of recommender systems. By carefully injecting low values into a selected set of unrated user-item pairs in a user-item matrix, we demonstrate that the top-N recommendation accuracies of various collaborative filtering (CF) techniques can be significantly and consistently improved. We first adopt the notion of pre-use preferences of users toward a vast amount of unrated items. Using this notion, we identify uninteresting items that have not been rated yet but are likely to receive low ratings from users, and selectively impute them as low values. As our proposed approach is method-agnostic, it can be easily applied to a variety of CF algorithms. Through comprehensive experiments with three real-life datasets (MovieLens, Ciao, and Watcha), we demonstrate that our solution consistently and universally enhances the accuracies of existing CF algorithms (e.g., item-based CF, SVD-based CF, and SVD++) by 2.5 to 5 times on average. Furthermore, our solution improves the running time of those CF methods by 1.2 to 2.3 times when its setting produces the best accuracy. The datasets and codes that we used in the experiments are available at: https://goo.gl/KUrmip.
12/07/2018 8:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2698461
Passive and Partially Active Fault Tolerance for Massively Parallel Stream Processing Engines
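The injection step can be sketched as follows. Note the hedge: how the paper identifies "uninteresting" items is via its pre-use preference model; here, purely for illustration, item popularity is used as a crude stand-in, and the function name and parameters are invented for this sketch.

```python
def inject_low_values(ratings, frac=0.5, low=1.0):
    """Sketch of the l-injection idea: among unrated cells (None),
    impute a low value for items deemed uninteresting. Popularity
    (how many users rated an item) is an illustrative proxy here,
    NOT the paper's pre-use preference model."""
    n_items = len(ratings[0])
    popularity = [sum(1 for row in ratings if row[j] is not None)
                  for j in range(n_items)]
    unrated = [(u, j) for u, row in enumerate(ratings)
               for j, v in enumerate(row) if v is None]
    unrated.sort(key=lambda uj: popularity[uj[1]])  # least popular first
    out = [list(row) for row in ratings]
    for u, j in unrated[: int(frac * len(unrated))]:
        out[u][j] = low          # inject a low rating
    return out

R = [[5, None, None],
     [None, 4, None]]
injected = inject_low_values(R, frac=0.5, low=1.0)
```

Any CF algorithm can then be trained on the densified matrix unchanged, which is what makes the approach method-agnostic.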
https://www.computer.org/csdl/trans/tk/2019/01/07959652-abs.html
Fault-tolerance techniques for stream processing engines can be categorized into passive and active approaches. However, both approaches have their own inadequacies in Massively Parallel Stream Processing Engines (MPSPEs). The passive approach incurs a long recovery latency, especially when a number of correlated nodes fail simultaneously, while the active approach requires extra replication resources. In this paper, we propose a new fault-tolerance framework that is Passive and Partially Active (PPA). In a PPA scheme, the passive approach is applied to all tasks, while only a selected set of tasks is actively replicated. The number of actively replicated tasks depends on the available resources. If tasks without active replicas fail, tentative outputs are generated before the completion of the recovery process. We also propose effective and efficient algorithms to optimize a partially active replication plan so as to maximize the quality of tentative outputs. We implemented PPA on top of Storm, an open-source MPSPE, and conducted extensive experiments using both real and synthetic datasets to verify the effectiveness of our approach.
12/07/2018 8:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2720602
A Hardware-Accelerated Solution for Hierarchical Index-Based Merge-Join
https://www.computer.org/csdl/trans/tk/2019/01/08330031-abs.html
Hardware acceleration through field-programmable gate arrays (FPGAs) has recently become a technique of growing interest for many data-intensive applications. The join is one of the most fundamental query types in relational database management systems. However, the available solutions so far have incurred higher costs in comparison to other query types. In this paper, we develop a novel solution to accelerate the processing of sort-merge join queries with low match rates. Specifically, our solution makes use of hierarchical indexes to identify result-yielding regions in the solution space in order to take advantage of result sparseness. Further, in addition to one-dimensional equi-join query processing, our solution supports the processing of multidimensional similarity join queries. Experimental results show that our solution is superior to the best existing method in a low-match-rate setting; the method achieves a speedup factor of 4.8 for join queries with a match rate of 5 percent.
12/07/2018 8:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2822707
Order-Sensitive Imputation for Clustered Missing Values
https://www.computer.org/csdl/trans/tk/2019/01/08330055-abs.html
The issue of missing values (MVs) appears widely in real-world datasets and has hindered the use of many statistical and machine learning algorithms for data analytics, due to their inability to handle incomplete datasets. To address this issue, several MV imputation algorithms have been developed. However, these approaches do not perform well when most of the incomplete tuples are clustered with each other, coined here as the Clustered Missing Values Phenomenon, which is attributed to the lack of sufficient complete tuples near an MV for imputation. In this paper, we propose the Order-Sensitive Imputation for Clustered Missing values (OSICM) framework, in which missing values are imputed sequentially such that the values filled earlier in the process are also used for later imputation of other MVs. Obviously, the order of imputations is critical to the effectiveness and efficiency of the OSICM framework. We formulate the search for the optimal imputation order as an optimization problem, and show its NP-hardness. Furthermore, we devise an algorithm to find the exact optimal solution and propose two approximate/heuristic algorithms to trade off effectiveness for efficiency. Finally, we conduct extensive experiments on real and synthetic datasets to demonstrate the superiority of our OSICM framework.
12/07/2018 8:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2822662
Achieving Data Truthfulness and Privacy Preservation in Data Markets
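The key mechanism, values filled earlier feeding later imputations, can be sketched with a deliberately simple fixed order and column-mean filling; the OSICM framework instead searches for a (near-)optimal order and uses its own imputation model, so everything below is illustrative only.

```python
import math

def sequential_impute(rows):
    """Order-sensitive imputation sketch: fill missing values (NaN)
    one tuple at a time, most-complete tuples first, so that values
    filled earlier are reused when imputing later tuples."""
    rows = [list(r) for r in rows]
    order = sorted(range(len(rows)),
                   key=lambda i: sum(math.isnan(v) for v in rows[i]))
    for i in order:
        for j, v in enumerate(rows[i]):
            if math.isnan(v):
                # column values known so far, including earlier fills
                known = [r[j] for r in rows if not math.isnan(r[j])]
                if known:
                    rows[i][j] = sum(known) / len(known)
    return rows

data = [[1.0, 2.0],
        [float("nan"), 2.0],
        [3.0, float("nan")]]
filled = sequential_impute(data)
```

Under the Clustered Missing Values Phenomenon, a bad order leaves an MV with few known neighbors at fill time, which is exactly why the choice of order matters.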
https://www.computer.org/csdl/trans/tk/2019/01/08330057-abs.html
As a significant business paradigm, many online information platforms have emerged to satisfy society's needs for person-specific data, where a service provider collects raw data from data contributors and then offers value-added data services to data consumers. However, in the data trading layer, data consumers face a pressing problem: how can they verify whether the service provider has truthfully collected and processed the data? Furthermore, data contributors are usually unwilling to reveal their sensitive personal data and real identities to data consumers. In this paper, we propose TPDM, which efficiently integrates Truthfulness and Privacy preservation in Data Markets. TPDM is structured internally in an Encrypt-then-Sign fashion, using partially homomorphic encryption and identity-based signatures. It simultaneously facilitates batch verification, data processing, and outcome verification, while maintaining identity preservation and data confidentiality. We also instantiate TPDM with a profile matching service and a data distribution service, and extensively evaluate their performance on the Yahoo! Music ratings dataset and the 2009 RECS dataset, respectively. Our analysis and evaluation results reveal that TPDM achieves several desirable properties while incurring low computation and communication overheads when supporting large-scale data markets.
12/07/2018 8:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2822727
Top-$k$ Durable Graph Pattern Queries on Temporal Graphs
https://www.computer.org/csdl/trans/tk/2019/01/08332489-abs.html
Graphs offer a natural model for the relationships and interactions among entities, such as those occurring among users in social and cooperation networks, and proteins in biological networks. Since most such networks are dynamic, to capture their evolution over time we assume a sequence of graph snapshots, where each snapshot represents the state of the network at a different time instance. Given this sequence, we seek to find the top-$k$ most durable matches of an input graph pattern query, that is, the matches that exist for the longest period of time. The straightforward way to address this problem is to apply a state-of-the-art graph pattern matching algorithm at each snapshot and then aggregate the results. However, for large networks and long sequences, this approach is computationally expensive, since all matches have to be generated at each snapshot, including those appearing only once. We propose a new approach that uses a compact representation of the sequence of graph snapshots, appropriate time indexes to prune the search space, and strategies to determine the duration of candidate matches. Finally, we present experiments with real datasets that illustrate the efficiency and effectiveness of our approach.
12/07/2018 8:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2823754
Finding Optimal Skyline Product Combinations under Price Promotion
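The straightforward baseline the abstract criticizes, match per snapshot, then aggregate, can be sketched in a few lines. Here duration is simplified to the number of snapshots a match appears in (ignoring contiguity), and matches are represented abstractly as hashable tuples; both are assumptions of this sketch, not the paper's definitions.

```python
from collections import Counter

def topk_durable(snapshot_matches, k):
    """Baseline: given the match set of each snapshot, a match's
    duration is the number of snapshots containing it; return the
    k matches with the longest duration."""
    duration = Counter()
    for matches in snapshot_matches:
        duration.update(set(matches))
    return duration.most_common(k)

snapshots = [
    {("u1", "u2"), ("u2", "u3")},   # t = 0
    {("u1", "u2")},                 # t = 1
    {("u1", "u2"), ("u2", "u3")},   # t = 2
]
assert topk_durable(snapshots, 1) == [(("u1", "u2"), 3)]
```

The cost of this baseline is that every snapshot's full match set must be materialized, even for matches that appear only once, which is the inefficiency the proposed time indexes are designed to prune.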
https://www.computer.org/csdl/trans/tk/2019/01/08332494-abs.html
Nowadays, with the development of e-commerce, a growing number of customers choose to shop online. To find attractive products in online shopping marketplaces, the skyline query is a useful tool that offers more interesting and preferable choices to customers. The skyline query and its variants have been extensively investigated. However, to the best of our knowledge, they have not taken into account the requirements of customers in certain practical application scenarios. Online shopping marketplaces usually hold price promotion campaigns to attract customers and increase their purchase intention. Considering the requirements of customers in this practical scenario, we study product selection under price promotion. We formulate a constrained optimal product combination (COPC) problem, which aims to find the skyline product combinations that both meet a customer's willingness to pay and bring the maximum discount rate. The COPC problem offers powerful decision support for customers under price promotion, as confirmed by a customer study. To process the COPC problem, we first propose a two-list exact (TLE) algorithm. The COPC problem is proven to be NP-hard, and the TLE algorithm is not scalable because it needs to process an exponential number of product combinations. Accordingly, we design a lower-bound approximate (LBA) algorithm that has a guarantee on the accuracy of the results and an incremental greedy (IG) algorithm that has good performance. The experimental results demonstrate the efficiency and effectiveness of our proposed algorithms.
12/11/2018 4:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2823707
An Efficient Method for High Quality and Cohesive Topical Phrase Mining
https://www.computer.org/csdl/trans/tk/2019/01/08332520-abs.html
A phrase is a natural, meaningful, and essential semantic unit. In topic modeling, visualizing phrases for individual topics is an effective way to explore and understand unstructured text corpora. However, from the perspectives of phrase quality and topical cohesion, the outcomes of existing approaches remain to be improved. Usually, the process of topical phrase mining is twofold: phrase mining and topic modeling. For phrase mining, existing approaches often suffer from order sensitivity and inappropriate segmentation, which lead them to extract inferior-quality phrases. For topic modeling, traditional topic models do not fully consider the constraints induced by phrases, which may weaken cohesion. Moreover, existing approaches often lose domain terminologies because they neglect the impact of domain-level topical distribution. In this paper, we propose an efficient method for high-quality and cohesive topical phrase mining. A high-quality phrase should satisfy frequency, phraseness, completeness, and appropriateness criteria. In our framework, we integrate a quality-guaranteed phrase mining method, a novel topic model incorporating the constraint of phrases, and a novel document clustering method into an iterative framework to improve both phrase quality and topical cohesion. We also describe algorithmic designs to execute these methods efficiently. The empirical verification demonstrates that our method outperforms state-of-the-art methods in terms of both interpretability and efficiency.
12/07/2018 8:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2823758
HCBC: A Hierarchical Case-Based Classifier Integrated with Conceptual Clustering
https://www.computer.org/csdl/trans/tk/2019/01/08333767-abs.html
Structured case representation improves case-based reasoning (CBR) by exploring structures in the case base and the relevance of case structures. Recent CBR classifiers have mostly been built upon the attribute-value case representation rather than a structured case representation, so the structural relations embodied in the representation structure are overlooked when improving the similarity measure. This results in retrieval inefficiency and limits the performance of CBR classifiers. This paper proposes a hierarchical case-based classifier, HCBC, which introduces a concept lattice to hierarchically organize cases. By exploiting structural case relations in the concept lattice, a novel dynamic weighting model is proposed to enhance the concept similarity measure. Based on this similarity measure, HCBC retrieves the top-K concepts that are most similar to a new case by using a bottom-up pruning-based recursive retrieval (PRR) algorithm. The concepts extracted in this way are used to suggest a class label for the case by weighted majority voting. Experimental results show that HCBC outperforms other classifiers in terms of classification performance and robustness on categorical data, and also performs well on numeric datasets. In addition, PRR effectively reduces the search space and greatly improves the retrieval efficiency of HCBC.
12/07/2018 8:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2824317
I/O Efficient Core Graph Decomposition: Application to Degeneracy Ordering
https://www.computer.org/csdl/trans/tk/2019/01/08354806-abs.html
Core decomposition is a fundamental graph problem with a large number of applications. Most existing approaches for core decomposition assume that the graph is kept in the memory of a machine. Nevertheless, many real-world graphs are too big to reside in memory. In this paper, we study I/O-efficient core decomposition following a semi-external model, which only allows node information to be loaded in memory. We propose a semi-external algorithm and an optimized algorithm for I/O-efficient core decomposition. To handle dynamic graph updates, we first show that our algorithm can be naturally extended to handle edge deletion. Then, we propose an I/O-efficient core maintenance algorithm to handle edge insertion, and an improved algorithm to further reduce I/O and CPU cost. In addition, based on our core decomposition algorithms, we propose an I/O-efficient semi-external algorithm for degeneracy ordering, an important graph problem that is closely related to core decomposition. We also consider how to maintain the degeneracy order. We conduct extensive experiments on 12 real large graphs. Our optimized core decomposition algorithm significantly outperforms the existing I/O-efficient algorithm in terms of both processing time and memory consumption, and our algorithms are very scalable to web-scale graphs. As an example, we are the first to handle a web graph with 978.5 million nodes and 42.6 billion edges using less than 4.2 GB of memory. We also show that our proposed algorithms for degeneracy order computation and maintenance can handle big graphs efficiently with small memory overhead.
12/07/2018 8:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2833070
A Note on the Behavior of Majority Voting in Multi-Class Domains with Biased Annotators
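For readers unfamiliar with the underlying problem, core decomposition itself has a classic in-memory peeling algorithm, sketched below for intuition; the paper's contribution is performing this computation semi-externally, with only per-node information resident in memory, which this sketch does not attempt.

```python
from collections import defaultdict

def core_numbers(edges):
    """Classic in-memory peeling: repeatedly remove a minimum-degree
    node; a node's core number is the largest degree threshold in
    effect at the moment it is peeled."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    deg = {u: len(ns) for u, ns in adj.items()}
    remaining, core, k = set(adj), {}, 0
    while remaining:
        u = min(remaining, key=deg.__getitem__)  # peel min-degree node
        k = max(k, deg[u])
        core[u] = k
        remaining.remove(u)
        for v in adj[u]:
            if v in remaining:
                deg[v] -= 1
    return core

# A triangle with a pendant node: the triangle is the 2-core.
print(core_numbers([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]))
```

The degeneracy ordering mentioned in the abstract is exactly the order in which this peeling removes nodes, which is why the two problems are so closely related.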
https://www.computer.org/csdl/trans/tk/2019/01/08375733-abs.html
Majority voting is a popular and robust strategy for aggregating different opinions in learning from crowds, where each worker labels examples according to their own criteria. Although it has been extensively studied in the binary case, its behavior with multiple classes is not completely clear, specifically when annotations are biased. This paper attempts to fill that gap. The behavior of the majority voting strategy is studied in depth in multi-class domains, emphasizing the effect of annotation bias. By means of a complete experimental setting, we show the limitations of the standard majority voting strategy. The use of three simple techniques that infer global information from the annotations and annotators allows us to put the performance of the majority voting strategy in context.
12/07/2018 8:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2845400
2018 Index IEEE Transactions on Knowledge and Data Engineering Vol. 30
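The aggregation strategy under study is one line of code, which is part of its appeal. A minimal sketch (tie-breaking by first-seen label is an arbitrary choice of this sketch):

```python
from collections import Counter

def majority_vote(annotations):
    """Aggregate multi-class labels from several workers by plain
    majority; the abstract's point is that this degrades when
    annotators are biased toward particular classes."""
    return Counter(annotations).most_common(1)[0][0]

votes = ["cat", "dog", "cat", "bird", "cat"]
assert majority_vote(votes) == "cat"
```

With biased annotators, the systematic over-reporting of some classes can outvote the true label, which is the failure mode the paper's global-inference techniques are meant to expose.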
https://www.computer.org/csdl/trans/tk/2019/01/08566022-abs.html
Presents the 2018 subject/author index for this publication.
12/07/2018 8:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2882359
Special Section on the International Conference on Data Engineering 2016
https://www.computer.org/csdl/trans/tk/2019/01/08566026-abs.html
12/07/2018 8:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2876580
Dynamic Data Exchange in Distributed RDF Stores
https://www.computer.org/csdl/trans/tk/2018/12/08323222-abs.html
When RDF datasets become too large to be managed by centralised systems, they are often distributed in a cluster of shared-nothing servers, and queries are answered using a distributed join algorithm. Although such solutions have been extensively studied in relational and RDF databases, we argue that existing approaches exhibit two drawbacks. First, they usually decide statically (i.e., at query compile time) how to shuffle the data, which can lead to missed opportunities for local computation. Second, they often materialise large intermediate relations whose size is determined by the entire dataset (and not the data stored in each server), so these relations can easily exceed the memory of individual servers. As a possible remedy, we present a novel distributed join algorithm for RDF. Our approach decides when to shuffle data dynamically, which ensures that query answers that can be wholly produced within a server involve only local computation. It also uses a novel flow-control mechanism to ensure that every query can be answered even if each server has a bounded amount of memory that is much smaller than the intermediate relations. We complement our algorithm with a new query planning approach that balances the cost of communication against the cost of local processing at each server. Moreover, as in several existing approaches, we distribute RDF data using graph partitioning so as to maximise local computation, but we refine the partitioning algorithm to produce more balanced partitions. We show empirically that our techniques can outperform the state of the art by orders of magnitude in terms of query evaluation times, network communication, and memory use.
In particular, bounding the memory use in individual servers can mean the difference between success and failure for answering queries with large answer sets.
11/11/2018 7:15 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2818696
Partially Related Multi-Task Clustering
https://www.computer.org/csdl/trans/tk/2018/12/08323233-abs.html
Multi-task clustering improves the clustering performance of each task by transferring knowledge across related tasks. Most existing multi-task clustering methods are based on the ideal assumption that the tasks are completely related. However, in real applications the tasks are usually only partially related. In these cases, brute-force transfer may cause a negative effect that degrades clustering performance. In this paper, we propose two multi-task clustering methods for partially related tasks: the self-adapted multi-task clustering (SAMTC) method and the manifold regularized coding multi-task clustering (MRCMTC) method, which can automatically identify and transfer related instances among the tasks, thus avoiding negative transfer. Both SAMTC and MRCMTC construct the similarity matrix for each target task by exploiting useful information from the source tasks through related-instance transfer, and adopt spectral clustering to obtain the final clustering results. However, they learn the related instances from the source tasks in different ways. Experimental results on real datasets show the superiority of the proposed algorithms over traditional single-task clustering methods and existing multi-task clustering methods on both completely and partially related tasks.
11/11/2018 7:13 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2818705
Semi-Supervised Ensemble Clustering Based on Selected Constraint Projection
https://www.computer.org/csdl/trans/tk/2018/12/08323237-abs.html
Traditional cluster ensemble approaches have several limitations: (1) few make use of prior knowledge provided by experts; (2) it is difficult to achieve good performance on high-dimensional datasets; (3) all ensemble members are weighted equally, which ignores the different contributions of different members; and (4) not all pairwise constraints contribute to the final result. In view of these limitations, we propose double-weighting semi-supervised ensemble clustering based on selected constraint projection (DCECP), which applies constraint weighting and ensemble member weighting to address them. Specifically, DCECP first adopts the random subspace technique in combination with the constraint projection procedure to handle high-dimensional datasets. Second, it treats prior knowledge of experts as pairwise constraints and assigns different subsets of pairwise constraints to different ensemble members. An adaptive ensemble member weighting process is designed to associate different weight values with different ensemble members. Third, the weighted normalized cut algorithm is adopted to summarize clustering solutions and generate the final result. Finally, nonparametric statistical tests are used to compare multiple algorithms on real-world datasets. Our experiments on 15 high-dimensional datasets show that DCECP performs better than most clustering algorithms.
11/11/2018 7:14 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2818729
Privacy-Preserving Collaborative Model Learning: The Case of Word Vector Training
https://www.computer.org/csdl/trans/tk/2018/12/08325493-abs.html
Nowadays, machine learning is becoming a new paradigm for mining hidden knowledge in big data. The collection and manipulation of big data not only create considerable value, but also raise serious privacy concerns. To protect the huge amount of potentially sensitive data, a straightforward approach is to encrypt the data with specialized cryptographic tools. However, it is challenging to utilize or operate on encrypted data, and especially to run machine learning algorithms over it. In this paper, we investigate the problem of training high-quality word vectors over large-scale encrypted data (from distributed data owners) with privacy-preserving collaborative neural network learning algorithms. We leverage, and also design, a suite of arithmetic primitives (e.g., multiplication, fixed-point representation, and sigmoid function computation) on encrypted data, which serve as components of our construction. We theoretically analyze the security and efficiency of the proposed construction, and conduct extensive experiments on representative real-world datasets to verify its practicality and effectiveness.

11/11/2018 7:14 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2819673

Uncertain Graph Sparsification
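As a rough illustration of one of the arithmetic primitives named in the word-vector training abstract above, fixed-point representation, the sketch below shows how real values can be encoded as scaled integers so that integer-only (e.g., cryptographic) machinery can process them. The scale factor and function names are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch (not the paper's construction): fixed-point encoding,
# as commonly used when real-valued parameters must pass through
# integer-only cryptographic tools.
SCALE = 1 << 16  # 16 fractional bits; an assumed, illustrative choice

def fx_encode(x: float) -> int:
    """Encode a real number as a scaled integer."""
    return round(x * SCALE)

def fx_decode(n: int) -> float:
    """Decode a scaled integer back to a real number."""
    return n / SCALE

def fx_mul(a: int, b: int) -> int:
    """Multiply two fixed-point numbers, rescaling the doubled scale factor."""
    return (a * b) // SCALE

# Example: 0.5 * 0.25 computed entirely on integers.
p = fx_mul(fx_encode(0.5), fx_encode(0.25))
print(fx_decode(p))  # 0.125
```

In a real privacy-preserving pipeline the scaled integers would be encrypted before any arithmetic; the rescaling step after multiplication is the part that typically requires a dedicated protocol.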
https://www.computer.org/csdl/trans/tk/2018/12/08325513-abs.html
Uncertain graphs are prevalent in several applications, including communication systems, biological databases, and social networks. The ever-increasing size of the underlying data renders both graph storage and query processing extremely expensive. Sparsification has often been used to reduce the size of deterministic graphs by maintaining only the important edges. However, adapting deterministic sparsification methods fails in the uncertain setting. To overcome this problem, we introduce the first sparsification techniques aimed explicitly at uncertain graphs. The proposed methods reduce the number of edges and redistribute their probabilities in order to decrease the graph size while preserving its underlying structure. The resulting graph can be used to efficiently and accurately approximate any query and mining task on the original graph. An extensive experimental evaluation with real and synthetic datasets illustrates the effectiveness of our techniques on several common graph tasks, including clustering coefficient, PageRank, reliability, and shortest-path distance.

11/11/2018 7:14 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2819651

Inferring Cognitive Wellness from Motor Patterns
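To make the idea in the "Uncertain Graph Sparsification" abstract above concrete, here is a deliberately naive sketch of the general recipe (drop edges, then redistribute probability mass so an aggregate property is preserved). The keep-top-probability rule and the expected-edge-count invariant are my illustrative assumptions; the paper's redistribution schemes are more refined.

```python
# Naive illustration only: sparsify an uncertain graph by keeping the
# highest-probability edges, then rescale probabilities so the expected
# number of edges matches the original graph.
def sparsify(edges, keep_ratio):
    """edges: list of (u, v, prob). Returns a smaller, re-weighted edge list."""
    expected = sum(p for _, _, p in edges)          # expected #edges originally
    kept = sorted(edges, key=lambda e: e[2], reverse=True)
    kept = kept[:max(1, int(len(edges) * keep_ratio))]
    factor = expected / sum(p for _, _, p in kept)  # redistribute lost mass
    return [(u, v, min(1.0, p * factor)) for u, v, p in kept]

g = [(0, 1, 0.9), (1, 2, 0.8), (0, 2, 0.2), (2, 3, 0.1)]
print(sparsify(g, 0.5))  # two edges remain, probabilities scaled up
```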
https://www.computer.org/csdl/trans/tk/2018/12/08326510-abs.html
Changes in motor patterns have been shown to be useful early indicators of cognitive disorders, such as Parkinson's disease (PD) and cerebral small vessel disease (SVD). It would be highly advantageous to tap into data containing people's motor patterns from motion sensing devices to analyze subtle changes in cognitive abilities, thereby providing personalized interventions before the actual onset of such conditions. However, this goal is very challenging due to two main technical problems: 1) the amount of data labeled by doctors is small, and 2) the available data tends to be highly imbalanced (the vast majority of it comes from normal subjects, with only a small fraction from subjects with a cognitive disorder). In order to effectively deal with these challenges and infer cognitive wellness from motor patterns with high accuracy, we propose the MOtor-Cognitive Analytics (MOCA) framework. MOCA first uses a random oversampling iterative random forest based feature selection method to reduce the feature space dimensionality and avoid overfitting, and then adds a bias to the optimization problem of the weighted extreme learning machine to achieve good generalization ability on imbalanced, small-sample datasets. Experimental results on two real-world datasets, including SVD and stroke patients, show that MOCA can effectively reduce the rate of misdiagnosis and significantly outperform state-of-the-art methods in inferring people's cognitive capabilities. This work opens up opportunities for population-level pre-screening using motion sensing devices and can inform current discussions on reforming the health-care infrastructure.

11/11/2018 7:14 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2820024

Attributed Social Network Embedding
https://www.computer.org/csdl/trans/tk/2018/12/08326519-abs.html
Embedding network data into a low-dimensional vector space has shown promising performance for many real-world applications, such as node classification and entity retrieval. However, most existing methods focus only on leveraging the network structure. For social networks, besides the network structure, there also exists rich information about the social actors, such as the user profiles of friendship networks and the textual content of citation networks. This rich attribute information about social actors reveals the homophily effect, which exerts a huge impact on the formation of social networks. In this paper, we explore attributes as a rich evidence source in social networks to improve network embedding. We propose a generic Attributed Social Network Embedding framework (<italic>ASNE</italic>), which learns representations for social actors (i.e., nodes) by preserving both <italic>structural proximity</italic> and <italic>attribute proximity</italic>. While the <italic>structural proximity</italic> captures the global network structure, the <italic>attribute proximity</italic> accounts for the homophily effect. To justify our proposal, we conduct extensive experiments on four real-world social networks. Compared to the state-of-the-art network embedding approaches, <italic>ASNE</italic> learns more informative representations, achieving substantial gains on the tasks of link prediction and node classification. Specifically, <italic>ASNE</italic> significantly outperforms <italic>node2vec</italic> with an 8.2 percent relative improvement on the link prediction task and a 12.7 percent gain on the node classification task.

11/11/2018 7:15 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2819980

Topology-Driven Diversity for Targeted Influence Maximization with Application to User Engagement in Social Networks
https://www.computer.org/csdl/trans/tk/2018/12/08326536-abs.html
Research on influence maximization often has to cope with marketing needs relating to the propagation of information towards specific users. However, little attention has been paid to the fact that the success of an information diffusion campaign might depend not only on the number of initial influencers to be detected but also on their <italic>diversity</italic> w.r.t. the target of the campaign. Our main hypothesis is that if we learn seeds that are not only capable of influencing but are also linked to more diverse (groups of) users, then the influence triggers will be diversified as well, and hence the target users will have a higher chance of being engaged. Based on this intuition, we define a novel problem, named <italic>Diversity-sensitive Targeted Influence Maximization (DTIM)</italic>, which models user diversity by exploiting only topological information within a social graph. To the best of our knowledge, we are the first to bring the concept of topology-driven diversity into targeted IM problems, and we provide two alternative definitions of it. Accordingly, we propose approximate solutions to DTIM, which detect a size-<inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="tagarelli-ieq1-2820010.gif"/></alternatives></inline-formula> set of users that maximizes the diversity-sensitive capital objective function for a given selection of target users. We evaluate our DTIM methods on a special case of user engagement in online social networks, concerning users who are not actively involved in the community life. Experimental evaluation on real networks demonstrates the meaningfulness of our approach and highlights opportunities for further development of solutions for DTIM applications.

11/11/2018 7:13 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2820010

A Thorough Evaluation of Distance-Based Meta-Features for Automated Text Classification
https://www.computer.org/csdl/trans/tk/2018/12/08326556-abs.html
We address the problem of automatically learning to classify texts by exploiting information derived from meta-features, i.e., features derived from the original bag-of-words representation. Specifically, we provide an in-depth analysis of the recently proposed distance-based meta-features, a <italic>data engineering</italic> technique that relies on the distance between documents to transform the original feature space into a new one that is potentially smaller and more informed. Despite its potential, the meta-feature space may be unnecessarily complex and high-dimensional, which increases the tendency to overfit, limits the application of meta-features in different contexts, and increases computational costs. In this work, we propose the use of multi-objective strategies to reduce the number of meta-features while maximizing classification effectiveness, taking into account the adequacy of the selected meta-features for a particular dataset or classification method. We present effective and efficient proposals for meta-feature selection that can reduce the number of meta-features by up to 89 percent while keeping or improving classification effectiveness, something not possible with any of the evaluated baselines. We also use our selection strategies as evaluation tools to analyze different combinations of meta-features. We found very compact combinations of meta-features that achieve high classification effectiveness on most datasets, despite their peculiarities.

11/11/2018 7:14 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2820051

Diverse Relevance Feedback for Time Series with Autoencoder Based Summarizations
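To give a feel for what a distance-based meta-feature (as discussed in the meta-features abstract above) might look like, here is a toy sketch that represents a document by its distances to the k nearest training documents of each class. The Euclidean distance and the per-class k-nearest layout are simplifying assumptions for illustration; the actual meta-feature definitions in the literature are richer.

```python
# Toy illustration of distance-based meta-features: each document is re-encoded
# as its distances to the k closest training documents of every class.
import math

def euclid(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def meta_features(doc, train, k=2):
    """train: list of (vector, label). Returns k smallest distances per class,
    concatenated in sorted class order."""
    by_class = {}
    for vec, label in train:
        by_class.setdefault(label, []).append(euclid(doc, vec))
    feats = []
    for label in sorted(by_class):
        feats.extend(sorted(by_class[label])[:k])
    return feats

train = [([0, 0], "a"), ([1, 0], "a"), ([5, 5], "b"), ([6, 5], "b")]
print(meta_features([0, 1], train))  # small distances to "a", large to "b"
```

The intuition is that a document near class "a" and far from class "b" becomes linearly separable in this new, much smaller space.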
https://www.computer.org/csdl/trans/tk/2018/12/08327532-abs.html
We present a relevance-feedback-based browsing methodology using different representations for time series data. The best-performing representation type (e.g., among the dual-tree complex wavelet transform, the Fourier transform, and symbolic aggregate approximation (SAX)) is learned from user annotations of the presented query results with representation feedback. We also present the use of autoencoder-type neural networks to summarize time series, or their representations, into sparse vectors, which serve as another representation learned from the data. Experiments on 85 real data sets confirm that diversity in the result set increases precision, and that representation feedback incorporates item diversity and helps to identify the appropriate representation. The results also illustrate that the autoencoders can enhance the base representations and achieve comparably accurate results with reduced data sizes.

11/11/2018 7:14 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2820119

A General Framework for Implicit and Explicit Social Recommendation
https://www.computer.org/csdl/trans/tk/2018/12/08328917-abs.html
Research on social recommendation aims at exploiting social information to improve the quality of a recommender system. It can be further divided into two classes: explicit social recommendation assumes the existence of not only the users’ ratings on items but also explicit social connections between users; implicit social recommendation assumes the availability of only the ratings, and attempts to infer implicit social connections between users with the goal of boosting recommendation accuracy. This paper proposes a unified framework that is applicable to both explicit and implicit social recommendation. We propose an optimization framework that learns the degree of social correlation and the rating prediction jointly, so that these two tasks can mutually boost each other's performance. Furthermore, a well-known challenge for implicit social recommendation is that it takes quadratic time to learn the strength of pairwise connections. This paper further proposes several practical tricks to reduce the complexity of our model so that it is linear in the number of observed ratings. The experiments show that the proposed model, with only two parameters, can significantly outperform the state-of-the-art solutions for both explicit and implicit social recommender systems.

11/11/2018 7:14 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2821174

Characterizing and Predicting Early Reviewers for Effective Product Marketing on E-Commerce Websites
https://www.computer.org/csdl/trans/tk/2018/12/08329164-abs.html
Online reviews have become an important source of information for users before making an informed purchase decision. Early reviews of a product tend to have a high impact on subsequent product sales. In this paper, we take the initiative to study the behavior characteristics of early reviewers through their posted reviews on two real-world large e-commerce platforms, i.e., Amazon and Yelp. Specifically, we divide the product lifetime into three consecutive stages, namely <italic>early</italic>, <italic>majority</italic>, and <italic>laggards</italic>. A user who has posted a review in the early stage is considered an early reviewer. We quantitatively characterize early reviewers based on their rating behaviors, the helpfulness scores received from others, and the correlation of their reviews with product popularity. We have found that (1) an early reviewer tends to assign a higher average rating score, and (2) an early reviewer tends to post more helpful reviews. Our analysis of product reviews also indicates that early reviewers’ ratings and their received helpfulness scores are likely to influence product popularity. By viewing the review posting process as a multiplayer competition game, we propose a novel margin-based embedding model for early reviewer prediction. Extensive experiments on two different e-commerce datasets show that our proposed approach outperforms a number of competitive baselines.

11/11/2018 7:14 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2821671

Ensemble Learning for Multi-Type Classification in Heterogeneous Networks
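The early/majority/laggards split described in the early-reviewer abstract above can be sketched as a simple rank-based staging rule. The 10 percent cutoffs below are invented for illustration; the paper does not necessarily use these thresholds.

```python
# Illustrative sketch: assign each review to a lifetime stage by its rank in
# the product's review sequence. The 10% head/tail cutoffs are assumptions
# for illustration, not thresholds taken from the paper.
def stage_of(rank, total, early_frac=0.1, laggard_frac=0.1):
    """rank: 1-based position of the review among `total` reviews."""
    if rank <= max(1, round(total * early_frac)):
        return "early"
    if rank > total - round(total * laggard_frac):
        return "laggards"
    return "majority"

stages = [stage_of(r, 20) for r in range(1, 21)]
print(stages.count("early"), stages.count("majority"), stages.count("laggards"))
# 2 16 2
```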
https://www.computer.org/csdl/trans/tk/2018/12/08329517-abs.html
Heterogeneous networks are networks consisting of different types of objects and links. They can be found in several fields, ranging from the Internet to social sciences, biology, epidemiology, geography, finance, and many others. In the literature, several methods have been proposed for the analysis of network data, but they usually focus on homogeneous networks, where all the objects are of the same type and the links among them describe a single type of relationship. More recently, the complexity of real scenarios has impelled researchers to design methods for the analysis of heterogeneous networks, especially focused on classification and clustering tasks. However, these methods often make assumptions on the structure of the network that are too restrictive, or do not fully exploit different forms of network correlation and autocorrelation. Moreover, when the nodes which are the main subject of the classification task are linked to several nodes of the network having missing values, standard methods can lead either to building incomplete classification models or to discarding possibly relevant dependencies (correlation or autocorrelation). In this paper, we propose an ensemble learning approach for multi-type classification. We adopt the system Mr-SBC, which can natively analyze heterogeneous networks of arbitrary structure, within an ensemble learning approach. The ensemble allows us to improve the classification accuracy of Mr-SBC by exploiting i) the possible presence of correlation and autocorrelation phenomena, and ii) the classification of instances (which contain missing values) of other node types in the network. As a beneficial side effect, the models are also more stable, in terms of the standard deviation of the accuracy, across the different samples used for training. Experiments performed on real-world datasets show that the proposed method significantly outperforms the standard implementation of Mr-SBC. Moreover, it enables Mr-SBC to outperform four other well-known algorithms for the classification of data organized in a network.

11/11/2018 7:14 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2822307

Deep Air Learning: Interpolation, Prediction, and Feature Analysis of Fine-Grained Air Quality
https://www.computer.org/csdl/trans/tk/2018/12/08333777-abs.html
The interpolation, prediction, and feature analysis of fine-grained air quality are three important topics in the area of urban air computing. Solutions to these topics can provide extremely useful information to support air pollution control, and consequently generate great societal and technical impact. Most existing work solves the three problems separately with different models. In this paper, we propose a general and effective approach to solve all three problems in one model, called Deep Air Learning (DAL). The main idea of DAL lies in embedding feature selection and semi-supervised learning in different layers of the deep learning network. The proposed approach utilizes the information pertaining to unlabeled spatio-temporal data to improve the performance of the interpolation and the prediction, and performs feature selection and association analysis to reveal the main features relevant to the variation of the air quality. We evaluate our approach with extensive experiments based on real data sources obtained in Beijing, China. Experiments show that DAL is superior to peer models from the recent literature when solving the topics of interpolation, prediction, and feature analysis of fine-grained air quality.

11/11/2018 7:14 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2823740

Similarity Metrics for SQL Query Clustering
https://www.computer.org/csdl/trans/tk/2018/12/08352666-abs.html
Database access logs are the starting point for many forms of database administration, from database performance tuning, to security auditing, to benchmark design, and many more. Unfortunately, query logs are also large and unwieldy, and it can be difficult for an analyst to extract broad patterns from the set of queries found therein. Clustering is a natural first step towards understanding these massive query logs. However, many clustering methods rely on the notion of pairwise similarity, which is challenging to compute for SQL queries, especially when the underlying data and database schema are unavailable. We investigate the problem of computing similarity between queries, relying only on the query structure. We conduct a rigorous evaluation of three query similarity heuristics proposed in the literature, applied to query clustering on multiple query log datasets representing different types of query workloads. To improve the accuracy of the three heuristics, we propose a generic feature engineering strategy that uses classical query rewrites to standardize query structure. The proposed strategy results in a significant improvement in the performance of all three similarity heuristics.

11/11/2018 7:14 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2831214

NAIS: Neural Attentive Item Similarity Model for Recommendation
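To illustrate the flavor of structure-only query similarity discussed in the SQL query clustering abstract above, here is a toy sketch: normalize the query text (masking literals, a crude stand-in for the rewrite-based standardization), extract structural tokens, and compare with Jaccard similarity. This is not one of the three heuristics evaluated in the paper, just an invented minimal example.

```python
# Toy sketch of structure-only SQL similarity: mask constants, tokenize, and
# compute Jaccard similarity over the token sets. Real heuristics work on
# parsed query structure rather than raw text.
import re

def tokens(sql):
    """Lowercase, replace string/number literals with '?', keep identifiers."""
    sql = re.sub(r"'[^']*'|\b\d+\b", "?", sql.lower())
    return set(re.findall(r"[a-z_][a-z0-9_.]*|\?", sql))

def jaccard(q1, q2):
    a, b = tokens(q1), tokens(q2)
    return len(a & b) / len(a | b)

s = jaccard("SELECT name FROM users WHERE age > 30",
            "SELECT name FROM users WHERE age > 40")
print(s)  # 1.0: the queries differ only in a constant
```

Masking literals before comparison is a tiny instance of the standardization idea: two queries that differ only in constants should cluster together.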
https://www.computer.org/csdl/trans/tk/2018/12/08352808-abs.html
Item-to-item collaborative filtering (<italic>aka</italic> item-based CF) has long been used for building recommender systems in industrial settings, owing to its interpretability and efficiency in real-time personalization. It builds a user's profile from her historically interacted items and recommends new items that are similar to the user's profile. As such, the key to an item-based CF method is the estimation of item similarities. Early approaches use statistical measures such as cosine similarity and the Pearson coefficient to estimate item similarities, which are less accurate since they lack tailored optimization for the recommendation task. In recent years, several works have attempted to learn item similarities from data, by expressing the similarity as an underlying model and estimating the model parameters by optimizing a recommendation-aware objective function. While extensive efforts have been made to use shallow linear models for learning item similarities, there has been relatively little work exploring nonlinear neural network models for item-based CF. In this work, we propose a neural network model named <italic>Neural Attentive Item Similarity model</italic> (NAIS) for item-based CF. The key to our design of NAIS is an attention network, which is capable of distinguishing which historical items in a user profile are more important for a prediction. Compared to the state-of-the-art item-based CF method <italic>Factored Item Similarity Model</italic> (FISM) <xref ref-type="bibr" rid="ref1">[1]</xref>, our NAIS has stronger representation power with only a few additional parameters brought by the attention network. Extensive experiments on two public benchmarks demonstrate the effectiveness of NAIS.
This work is the first attempt to design neural network models for item-based CF, opening up new research possibilities for future developments of neural recommender systems.

11/11/2018 7:14 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2831682

Approximate Order-Sensitive <italic>k</italic>-NN Queries over Correlated High-Dimensional Data
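The attention idea in the NAIS abstract above can be sketched in miniature: score each item in the user's history, turn the scores into softmax weights, and take a weighted sum of the target item's similarities to the history. In the sketch below the similarity and attention scores are supplied directly as toy numbers; NAIS learns both from data, so this is only an illustration of the aggregation step.

```python
# Minimal sketch, in the spirit of attention-based item CF: the prediction is
# a softmax-weighted sum of similarities between the target item and the
# user's historical items. Scores here are toy inputs, not learned.
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def predict(history_sims, attention_scores):
    """history_sims[i]: similarity(target item, i-th history item)."""
    w = softmax(attention_scores)
    return sum(wi * si for wi, si in zip(w, history_sims))

score = predict([0.9, 0.1, 0.4], [2.0, 0.0, 1.0])
print(round(score, 3))  # dominated by the highly attended first item
```

Note how the first history item, with the largest attention score, dominates the prediction even though the other items are present: that is exactly the "which historical items matter more" behavior the attention network provides.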
https://www.computer.org/csdl/trans/tk/2018/11/08307089-abs.html
The <italic>k</italic> Nearest Neighbor (<italic>k</italic>-NN) query has been gaining importance in a wide range of applications involving information retrieval, data mining, and databases. Specifically, in order to trade accuracy for efficiency, approximate solutions for the <italic>k</italic>-NN query have been extensively explored. However, the precision is usually order-insensitive: it is defined on the result set instead of the result sequence, and in many situations it cannot reasonably reflect the quality of the query result. In this paper, we focus on the approximate <italic>k</italic>-NN query problem with an order-sensitive precision requirement and propose a novel scheme based on the projection-filter-refinement framework. Basically, we adopt PCA to project the high-dimensional data objects into a low-dimensional space. Then, a filter condition is inferred to execute efficient pruning over the projected data. In addition, an index strategy named OR-tree is proposed to reduce the I/O cost. Extensive experiments based on several real-world data sets and a synthetic data set are conducted to verify the effectiveness and efficiency of the proposed solution. Compared to the state-of-the-art methods, our method can support order-sensitive <italic>k</italic>-NN queries with higher result precision while retaining satisfactory CPU and I/O efficiency.

10/05/2018 2:08 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2812153

Efficient Detection of Soft Concatenation Mapping
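The projection-filter-refinement pattern in the order-sensitive k-NN abstract above can be illustrated generically: because distances under an orthogonal projection never exceed the true distances, a candidate whose projected distance already beats the current k-th best exact distance can be pruned. The coordinate-dropping "projection" below is a stand-in for PCA, and the whole sketch is an invented illustration of the framework, not the paper's algorithm or its OR-tree index.

```python
# Generic project-filter-refine sketch: projected distances lower-bound true
# distances, so they can prune candidates before the exact computation.
import heapq, math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn(query, data, k, dims=2):
    proj = lambda v: v[:dims]               # stand-in for a PCA projection
    best = []                               # heap of (-exact_dist, index)
    for i, v in enumerate(data):
        if len(best) == k and dist(proj(query), proj(v)) > -best[0][0]:
            continue                        # pruned by the lower bound
        heapq.heappush(best, (-dist(query, v), i))
        if len(best) > k:
            heapq.heappop(best)             # drop the current farthest
    return sorted((-d, i) for d, i in best)  # ascending exact distance

data = [[0, 0, 0], [1, 1, 1], [5, 5, 5], [0.5, 0, 0]]
print([i for _, i in knn([0, 0, 0], data, 2)])  # [0, 3]
```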
https://www.computer.org/csdl/trans/tk/2018/11/08307189-abs.html
In modern big data warehouse systems, we observe a common phenomenon: a column of data values can often be derived from one or several other columns by transforming and concatenating those columns. We call this relationship between columns a Soft Concatenation Mapping (SCM). SCMs imply significant redundancy in the schema or data, and can therefore be exploited for data integration or data compression. In this paper, we formalize the problem of SCM detection and prove that it is NP-hard. We then propose efficient approximate algorithms to detect all SCMs, or an optimal set of SCMs, in a table. Our experiments on both real-world and synthetic datasets show promising results.

10/05/2018 2:07 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2812822

A New Query Recommendation Method Supporting Exploratory Search Based on Search Goal Shift Graphs
https://www.computer.org/csdl/trans/tk/2018/11/08315046-abs.html
Exploratory search is an increasingly important activity for Web searchers. However, current search systems cannot provide sufficient support for exploratory search. We therefore conducted an in-depth analysis of exploratory search processes and found that search goal shifts occur frequently in exploratory search. Based on this observation, we have designed a new query recommendation method to support exploratory search. First, according to the behavioral characteristics of searchers during search goal shifts, all queries submitted during search goal shift processes are extracted from search engine logs using machine learning. Then, we use these queries to build a search goal shift graph. Finally, a random walk algorithm is used to obtain the query recommendations from the search goal shift graph. In addition, we demonstrate the effectiveness of the method for exploratory search through comparative experiments with other methods.

10/05/2018 2:08 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2815544

Efficient Detection of Overlapping Communities Using Asymmetric Triangle Cuts
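The recommendation step in the search-goal-shift abstract above, a random walk over a query graph, can be sketched with a small random-walk-with-restart iteration: candidate queries are ranked by their stationary visit probability when the walk restarts at the user's current query. The graph, restart probability, and iteration count below are toy values; the abstract does not specify this exact variant.

```python
# Sketch of scoring candidate queries by random walk with restart over a
# tiny directed query-to-query "goal shift" graph. All values are toy inputs.
def random_walk(adj, start, restart=0.15, iters=50):
    """adj: {query: [successor queries]}. Returns visit-probability scores."""
    nodes = list(adj)
    score = {n: 0.0 for n in nodes}
    score[start] = 1.0
    for _ in range(iters):
        nxt = {n: (restart if n == start else 0.0) for n in nodes}
        for n in nodes:
            out = adj[n]
            for m in out:                       # spread (1-restart) mass
                nxt[m] += (1 - restart) * score[n] / len(out)
        score = nxt
    return score

adj = {"q1": ["q2", "q3"], "q2": ["q3"], "q3": ["q1"]}
scores = random_walk(adj, "q1")
best = max((v, k) for k, v in scores.items() if k != "q1")[1]
print(best)  # the most strongly recommended follow-up query
```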
https://www.computer.org/csdl/trans/tk/2018/11/08315057-abs.html
Real social networks contain many communities, where members within each community are densely connected with each other, while they are sparsely connected with members outside of the community. Since each member can join multiple communities simultaneously, communities in social networks usually overlap with each other. How to efficiently and effectively identify overlapping communities in a large social network has become a fundamental problem in the big data era. Most existing studies on community finding focus on non-overlapping communities based on several well-known community fitness metrics. However, recent investigations have shown that these fitness metrics may suffer from free-rider and separation effects, whereby the overlapping region of two communities always belongs to the denser one rather than to both of them. In this paper, we study the overlapping community detection problem in social networks, taking into consideration not only the quality of the found overlapping communities but also the free-rider and separation effects on them. Specifically, we first propose a novel community fitness metric, the triangle-based fitness metric, for overlapping community detection that can minimize the free-rider and separation effects on the found overlapping communities, and we show that the problem is NP-hard. We then propose an efficient yet scalable algorithm for the problem that can deliver a feasible solution. We finally validate the effectiveness of the proposed fitness metric and evaluate the performance of the proposed algorithm by conducting extensive experiments on real-world datasets with over 100 million vertices and edges. Experimental results demonstrate that the proposed algorithm is very promising.

10/05/2018 2:06 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2815554

Matching Heterogeneous Event Data
https://www.computer.org/csdl/trans/tk/2018/11/08315460-abs.html
Identifying events from different sources is essential to various business process applications such as provenance querying or process mining. Distinct features of heterogeneous events, including opaque names and dislocated traces, prevent existing data integration techniques from performing well. To address these issues, in this paper, (1) we propose an event similarity function that iteratively evaluates similar neighbors, and (2) in addition to event nodes, we further employ the similarity of edges (indicating relationships among events) in event matching. We prove the NP-hardness of finding the optimal event matching w.r.t. node and edge similarities, and propose an efficient heuristic for event matching. Experiments demonstrate that the proposed event matching approach can achieve significantly higher accuracy than state-of-the-art matching methods. In particular, by considering the event edge similarity, our heuristic matching algorithm further improves the matching accuracy without introducing much overhead.

10/05/2018 2:08 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2815695

Rule-Based Entity Resolution on Database with Hidden Temporal Information
https://www.computer.org/csdl/trans/tk/2018/11/08316959-abs.html
In this paper, we deal with the problem of rule-based entity resolution on imprecise temporal data. Entity resolution (ER) has been widely explored in the research community, but the problem on temporal data, especially without available timestamps, has not been studied well yet. Because of the elapsing of time, records referring to the same entity observed in different time periods may differ. Beyond traditional similarity-based ER approaches, by carefully exploring several data quality rules, e.g., matching dependencies and data currency, much information can be obtained to help cope with this problem. In this paper, we use such rules to derive temporal records’ information about the time order and the trend of their attributes’ evolvement with the elapsing of time. Specifically, we first group records into smaller blocks, and then, by exploring data currency constraints, we propose a temporal clustering approach with two steps, i.e., skeleton clustering and banding clustering. Experimental results on both real and synthetic data show that our entity resolution method can achieve both high accuracy and efficiency on datasets with hidden temporal information.

10/05/2018 2:07 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2816018

Identifying Genetic Risk Factors for Alzheimer's Disease via Shared Tree-Guided Feature Learning Across Multiple Tasks
https://www.computer.org/csdl/trans/tk/2018/11/08317000-abs.html
The genome-wide association study (GWAS) is a popular approach to identify disease-associated genetic factors for Alzheimer's Disease (AD). However, it remains challenging because of the small number of samples, the very high feature dimensionality, and the complex structures involved. To accurately identify genetic risk factors for AD, we propose a novel method based on an in-depth exploration of the hierarchical structure among the features and the commonality across related tasks. Specifically, we first extract and encode the tree hierarchy among features; then, we integrate the tree structures with multi-task feature learning (MTFL) to simultaneously learn, across related tasks, the shared features that are predictive of AD. Thus, we can unify the strength of both the prior structure information and MTFL to boost prediction performance. However, due to the highly complex regularizer that encodes the tree structure and the extremely high feature dimensionality, the learning process can be computationally prohibitive. To address this, we further develop a novel safe screening rule to quickly identify and remove irrelevant features before training. Experimental results demonstrate that the proposed approach significantly outperforms the state of the art in detecting genetic risk factors of AD, and that the speedup gained by the proposed screening can be several orders of magnitude.

10/05/2018 2:07 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2816029

Conditional Reliability in Uncertain Graphs
https://www.computer.org/csdl/trans/tk/2018/11/08318619-abs.html
Network reliability is a well-studied problem that requires measuring the probability that a target node is reachable from a source node in a probabilistic (or uncertain) graph, i.e., a graph where every edge is assigned a probability of existence. Many approaches and problem variants have been considered in the literature, with the majority of them assuming that edge-existence probabilities are fixed. Nevertheless, in real-world graphs, edge probabilities typically depend on external conditions: in metabolic networks, a protein can be converted into another protein with some probability depending on the presence of certain enzymes; in social influence networks, the probability that a tweet of some user will be re-tweeted by her followers depends on whether the tweet contains specific hashtags; in transportation networks, the probability that a network segment will work properly might depend on external conditions such as the weather or the time of day. In this paper, we overcome this limitation and focus on <italic>conditional reliability</italic>, that is, assessing reliability when edge-existence probabilities depend on a set of conditions. In particular, we study the problem of determining the top-<inline-formula> <tex-math notation="LaTeX">$k$</tex-math><alternatives><inline-graphic xlink:href="khan-ieq1-2816653.gif"/> </alternatives></inline-formula> conditions that maximize the reliability between two nodes. We thoroughly characterize our problem and show that, even employing polynomial-time reliability-estimation methods, it is <inline-formula> <tex-math notation="LaTeX">$\mathbf {NP}$</tex-math><alternatives><inline-graphic xlink:href="khan-ieq2-2816653.gif"/> </alternatives></inline-formula>-hard, does not admit any <inline-formula><tex-math notation="LaTeX">$\mathbf {PTAS}$ </tex-math><alternatives><inline-graphic xlink:href="khan-ieq3-2816653.gif"/></alternatives></inline-formula>, and its underlying objective function is non-submodular.
We then devise a practical method that targets both accuracy and efficiency. We also study natural generalizations of the problem with multiple source and target nodes. An extensive empirical evaluation on several large, real-life graphs demonstrates the effectiveness and scalability of our methods.10/05/2018 2:07 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2816653SLA Definition for Multi-Tenant DBMS and its Impact on Query Optimization
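The s-t reliability at the core of this problem is commonly estimated by sampling possible worlds of the probabilistic graph. Below is a minimal Monte Carlo sketch of such an estimator; the function name, parameters, and sampling budget are illustrative assumptions, not the paper's method.

```python
import random

def estimate_reliability(n, edges, s, t, samples=2000, seed=0):
    """Monte Carlo estimate of s-t reliability in a probabilistic graph.

    `edges` is a list of (u, v, p) triples: undirected edge (u, v)
    exists with probability p in each sampled possible world.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        # Sample a possible world by keeping each edge with its probability.
        adj = {v: [] for v in range(n)}
        for u, v, p in edges:
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
        # BFS to test whether t is reachable from s in this world.
        seen, frontier = {s}, [s]
        while frontier:
            nxt = []
            for u in frontier:
                for w in adj[u]:
                    if w not in seen:
                        seen.add(w)
                        nxt.append(w)
            frontier = nxt
        hits += t in seen
    return hits / samples

# A path 0-1-2 where each edge exists with probability 0.9:
# the true reliability is 0.9 * 0.9 = 0.81.
est = estimate_reliability(3, [(0, 1, 0.9), (1, 2, 0.9)], 0, 2)
```

Conditional reliability, as studied in the paper, would additionally make each `p` a function of a chosen condition set before sampling.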
https://www.computer.org/csdl/trans/tk/2018/11/08319945-abs.html
In the cloud context, users are often called tenants. A cloud DBMS shared by many tenants is called a multi-tenant DBMS. The resource consolidation in such a DBMS allows the tenants to only pay for the resources that they consume, while providing the opportunity for the provider to increase its economic gain. For this, a Service Level Agreement (SLA) is usually established between the provider and a tenant. However, in current systems, the SLA is often defined solely by the provider, and the tenant must agree to it before using the service. In addition, only the availability objective is described in the SLA, but not the performance objective. In this paper, an SLA negotiation framework is proposed, in which the provider and the tenant define the performance objective together in a fair way. To demonstrate the feasibility and the advantage of this framework, we evaluate its impact on query optimization. We formally define the problem by including the cost-efficiency aspect, we design a cost model and study the plan search space for this problem, we revise two search methods to adapt to the new context, and we propose a heuristic to solve the resource contention problem caused by concurrent queries of multiple tenants. We also conduct a performance evaluation to show that our optimization approach (i.e., driven by the SLA) can be much more cost-effective than the traditional approach, which always minimizes the query completion time.10/05/2018 2:08 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2817235BEATS: Blocks of Eigenvalues Algorithm for Time Series Segmentation
https://www.computer.org/csdl/trans/tk/2018/11/08319952-abs.html
The massive collection of data via emerging technologies like the Internet of Things (IoT) requires finding optimal ways to reduce the observations in the time series analysis domain. The IoT time series require aggregation methods that can preserve and represent the key characteristics of the data. In this paper, we propose a segmentation algorithm that adapts to unannounced mutations of the data (i.e., data drifts). The algorithm splits the data streams into blocks and groups them in square matrices, computes the Discrete Cosine Transform (DCT), and quantizes them. The key information is contained in the upper-left part of the resulting matrix. We extract this sub-matrix, compute the modulus of its eigenvalues, and remove duplicates. The algorithm, called BEATS, is designed to tackle dynamic IoT streams, whose distribution changes over time. We run experiments on six datasets combining real-world data, synthetic data, and data with drifts. Compared to other segmentation methods like Symbolic Aggregate approXimation (SAX), BEATS shows significant improvements. When combined with classification and clustering algorithms, it provides efficient results. BEATS is an effective mechanism to work with dynamic and multi-variate data, making it suitable for IoT data sources. The datasets, the code of the algorithm, and the analysis results can be accessed publicly at: <uri> https://github.com/auroragonzalez/BEATS</uri>.10/05/2018 2:06 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2817229Online Product Quantization
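The block-wise DCT/eigenvalue pipeline described in the abstract can be sketched in pure Python. This is an illustrative toy only: the 4x4 block size, the 2x2 upper-left corner, and the quantization step `q` are assumptions, not the paper's exact parameters.

```python
import cmath
import math

def dct2(M):
    """Unnormalized 2-D DCT-II of a square matrix (pure-Python sketch)."""
    n = len(M)
    C = [[math.cos(math.pi * (j + 0.5) * i / n) for j in range(n)]
         for i in range(n)]
    # Apply the DCT basis along both axes: C * M * C^T.
    A = [[sum(C[i][k] * M[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    return [[sum(A[i][k] * C[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

def eig2_moduli(m):
    """Eigenvalue moduli of a 2x2 matrix via the characteristic equation."""
    (a, b), (c, d) = m
    tr, det = a + d, a * d - b * c
    disc = cmath.sqrt(tr * tr - 4 * det)
    return sorted(abs((tr + s * disc) / 2) for s in (1, -1))

def beats_block(block, q=10.0):
    """One BEATS-style step: 2-D DCT of a data block, coarse quantization,
    keep the 2x2 upper-left (most informative) corner, and emit the moduli
    of its eigenvalues with duplicates removed."""
    D = dct2(block)                                   # energy compacts top-left
    Q = [[round(x / q) for x in row] for row in D]    # coarse quantization
    top = [row[:2] for row in Q[:2]]                  # upper-left sub-matrix
    return sorted(set(round(v, 6) for v in eig2_moduli(top)))

block = [[float((i + j) % 5) for j in range(4)] for i in range(4)]
feats = beats_block(block)
```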
https://www.computer.org/csdl/trans/tk/2018/11/08320306-abs.html
Approximate nearest neighbor (ANN) search has achieved great success in many tasks. However, existing popular methods for ANN search, such as hashing and quantization methods, are designed for static databases only. They cannot handle well databases whose data distribution evolves dynamically, due to the high computational cost of retraining the model on the new database. In this paper, we address the problem by developing an online product quantization (online PQ) model and incrementally updating the quantization codebook to accommodate the incoming streaming data. Moreover, to further alleviate the issue of large-scale computation for the online PQ update, we design two budget constraints for the model to update part of the PQ codebook instead of all of it. We derive a loss bound which guarantees the performance of our online PQ model. Furthermore, we develop an online PQ model over a sliding window with both data insertion and deletion supported, to reflect the real-time behavior of the data. The experiments demonstrate that our online PQ model is both time-efficient and effective for ANN search in dynamic large-scale databases compared with baseline methods, and that the idea of partial PQ codebook update further reduces the update cost.10/05/2018 2:06 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2817526Classifier Ensemble by Exploring Supplementary Ordering Information
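The incremental codebook update at the heart of online PQ can be illustrated with a running-mean step: assign a new point to its nearest sub-codewords, then nudge only those codewords toward it. This is a minimal sketch under assumed names and a simple 1/count learning rate, not the paper's exact update rule or budget constraints.

```python
def encode(vec, codebooks):
    """Product quantization: split `vec` into equal sub-vectors and map
    each one to the index of its nearest sub-codeword."""
    m = len(codebooks)
    d = len(vec) // m
    code = []
    for j, book in enumerate(codebooks):
        sub = vec[j * d:(j + 1) * d]
        code.append(min(range(len(book)),
                        key=lambda k: sum((a - b) ** 2
                                          for a, b in zip(sub, book[k]))))
    return code

def online_update(vec, codebooks, counts):
    """One online PQ step: assign the new point, then move each chosen
    sub-codeword toward it with a running-mean update (only the assigned
    codewords change, leaving the rest of the codebook untouched)."""
    d = len(vec) // len(codebooks)
    code = encode(vec, codebooks)
    for j, k in enumerate(code):
        counts[j][k] += 1
        lr = 1.0 / counts[j][k]
        sub = vec[j * d:(j + 1) * d]
        codebooks[j][k] = [c + lr * (x - c)
                           for c, x in zip(codebooks[j][k], sub)]
    return code

# Two sub-spaces of dimension 2, each with two codewords.
codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [1.0, 1.0]]]
counts = [[1, 1], [1, 1]]
code = online_update([1.0, 1.0, 0.0, 0.0], codebooks, counts)
```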
https://www.computer.org/csdl/trans/tk/2018/11/08322055-abs.html
Supplementary information has been proven to be particularly useful in many machine learning tasks. In ensemble learning for a set of trained base classifiers, there also exists abundant implicit supplementary information about the performance orderings for the trained base classifiers in previous literature. However, few classifier ensemble studies consider exploring and utilizing supplementary information. The current study proposes a new learning method for stack classifier ensembles by considering the implicit supplementary ordering information regarding a set of trained classifiers. First, a new metric learning algorithm for measuring the similarities between two arbitrary learning tasks is introduced. Second, supplementary ordering information for the trained classifiers of a given learning task is inferred based on the learned similarities and related performance results reported in the previous literature. Third, a set of ordered soft constraints is generated based on the supplementary ordering information, and achieving the optimal combination weights of the trained classifiers is formalized into a goal programming problem. The optimal combination weights are then obtained. Finally, the experimental results verify the effectiveness of the proposed new classifier ensemble method.10/05/2018 2:08 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2818138NHAD: Neuro-Fuzzy Based Horizontal Anomaly Detection in Online Social Networks
https://www.computer.org/csdl/trans/tk/2018/11/08322278-abs.html
Social networks are a basic part of today's life. With the advent of more and more online social media, the information available and its utilization have come under the threat of several anomalies. Anomalies are a major cause of online fraud, allowing information access by unauthorized users as well as information forging. One of the anomalies that acts as a silent attacker is the horizontal anomaly. These are anomalies caused by a user's variable behavior towards different sources. Horizontal anomalies are difficult to detect and hazardous for any network. In this paper, a self-healing neuro-fuzzy approach (NHAD) is used for the detection, recovery, and removal of horizontal anomalies efficiently and accurately. The proposed approach operates over five paradigms, namely, missing links, reputation gain, significant difference, trust properties, and trust score. The proposed approach is evaluated with three datasets: the DARPA'98 benchmark dataset, a synthetic dataset, and real-time traffic. Results show that the accuracy of the proposed NHAD model for 10 to 30 percent anomalies in the synthetic dataset ranges between 98.08 and 99.88 percent. The evaluation over the DARPA'98 dataset demonstrates that the proposed approach is better than the existing solutions as it provides a 99.97 percent detection rate for the anomalous class. For real-time traffic, the proposed NHAD model operates with an average accuracy of 99.42 percent at a 99.90 percent detection rate.10/05/2018 2:07 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2818163EMOMA: Exact Match in One Memory Access
https://www.computer.org/csdl/trans/tk/2018/11/08323198-abs.html
An important function in modern routers and switches is to perform a lookup for a key. Hash-based methods, and in particular cuckoo hash tables, are popular for such lookup operations, but for large structures stored in off-chip memory, such methods have the downside that they may require more than one off-chip memory access to perform the key lookup. Although the number of off-chip memory accesses can be reduced using on-chip approximate membership structures such as Bloom filters, some lookups may still require more than one off-chip memory access. This can be problematic for some hardware implementations, as having only a single off-chip memory access enables a predictable processing of lookups and avoids the need to queue pending requests. We provide a data structure for hash-based lookups based on cuckoo hashing that uses only one off-chip memory access per lookup, by utilizing an on-chip pre-filter to determine which of multiple locations holds a key. We make particular use of the flexibility to move elements within a cuckoo hash table to ensure the pre-filter always gives the correct response. While this requires a slightly more complex insertion procedure and some additional memory accesses during insertions, it is suitable for most packet processing applications where key lookups are much more frequent than insertions. An important feature of our approach is its simplicity. Our approach is based on simple logic that can be easily implemented in hardware, and hardware implementations would benefit most from the single off-chip memory access per lookup.10/05/2018 2:07 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2818716High-Order Proximity Preserved Embedding for Dynamic Networks
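The single-probe lookup idea can be sketched as a two-table cuckoo hash plus a structure recording which table currently holds each key, kept consistent across cuckoo moves. In this toy, an exact Python dict stands in for the paper's on-chip approximate pre-filter, and the hash functions are illustrative.

```python
import hashlib

def _h(key, i, buckets):
    # Two independent hash functions derived from SHA-256 (illustrative).
    digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % buckets

class FilteredCuckoo:
    """Two-table cuckoo hash where a pre-filter records which table holds
    each key, so a lookup probes exactly one (off-chip) bucket."""

    def __init__(self, buckets=64):
        self.buckets = buckets
        self.table = [[None, None] for _ in range(buckets)]  # side 0 / side 1
        self.choice = {}  # pre-filter: key -> which side (0 or 1)

    def insert(self, key, value, max_kicks=50):
        item, side = (key, value), 0
        for _ in range(max_kicks):
            b = _h(item[0], side, self.buckets)
            if self.table[b][side] is None:
                self.table[b][side] = item
                self.choice[item[0]] = side
                return True
            # Kick the resident to its alternate side (cuckoo move) and
            # keep the pre-filter consistent with the move.
            self.table[b][side], item = item, self.table[b][side]
            self.choice[self.table[b][side][0]] = side
            side = 1 - side
        return False  # table too full

    def lookup(self, key):
        side = self.choice.get(key)
        if side is None:
            return None
        b = _h(key, side, self.buckets)  # the single memory probe
        slot = self.table[b][side]
        return slot[1] if slot and slot[0] == key else None

t = FilteredCuckoo()
for i in range(20):
    t.insert(f"k{i}", i)
```

The extra work all lands on the insertion path, matching the paper's observation that lookups dominate insertions in packet processing.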
https://www.computer.org/csdl/trans/tk/2018/11/08329541-abs.html
Network embedding, aiming to embed a network into a low dimensional vector space while preserving the inherent structural properties of the network, has attracted considerable attention. However, most existing embedding methods focus on the static network while neglecting the evolving characteristic of real-world networks. Meanwhile, most previous methods cannot well preserve the high-order proximity, which is a critical structural property of networks. These problems motivate us to seek an effective and efficient way to preserve the high-order proximity in embedding vectors when the networks evolve over time. In this paper, we propose a novel method of Dynamic High-order Proximity preserved Embedding (DHPE). Specifically, we adopt the generalized SVD (GSVD) to preserve the high-order proximity. Then, by transforming the GSVD problem to a generalized eigenvalue problem, we propose a generalized eigen perturbation to incrementally update the results of GSVD to incorporate the changes of dynamic networks. Further, we propose an accelerated solution to the DHPE model so that it achieves a linear time complexity with respect to the number of nodes and number of changed edges in the network. Our empirical experiments on one synthetic network and several real-world networks demonstrate the effectiveness and efficiency of the proposed method.10/05/2018 2:06 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2822283Correction to A Survey of Location Prediction on Twitter
https://www.computer.org/csdl/trans/tk/2018/11/08482519-abs.html
10/05/2018 2:07 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2867987Influence Maximization on Social Graphs: A Survey
https://www.computer.org/csdl/trans/tk/2018/10/08295265-abs.html
Influence Maximization (IM), which selects a set of <inline-formula><tex-math notation="LaTeX">$k$</tex-math> <alternatives><inline-graphic xlink:href="li-ieq1-2807843.gif"/></alternatives></inline-formula> users (called seed set) from a social network to maximize the expected number of influenced users (called influence spread), is a key algorithmic problem in social influence analysis. Due to its immense application potential and enormous technical challenges, IM has been extensively studied in the past decade. In this paper, we survey and synthesize a wide spectrum of existing studies on IM from an <italic>algorithmic perspective</italic>, with a special focus on the following key aspects: (1) a review of well-accepted diffusion models that capture the information diffusion process and build the foundation of the IM problem, (2) a fine-grained taxonomy to classify existing IM algorithms based on their design objectives, (3) a rigorous theoretical comparison of existing IM algorithms, and (4) a comprehensive study on the applications of IM techniques in combining with novel context features of social networks such as topic, location, and time. Based on this analysis, we then outline the key challenges and research directions to expand the boundary of IM research.09/13/2018 4:37 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2807843Mining Summaries for Knowledge Graph Search
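The baseline most IM algorithms build on is greedy hill-climbing with Monte Carlo estimation of the expected spread under the independent cascade diffusion model. A minimal sketch follows; the graph, probabilities, and simulation budget are illustrative assumptions.

```python
import random

def simulate_ic(adj, seeds, rng):
    """One Monte Carlo run of the independent cascade model: each newly
    activated node u tries once to activate each out-neighbor v with the
    edge's propagation probability. Returns the spread (activated count)."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v, p in adj.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

def greedy_im(adj, nodes, k, runs=200, seed=0):
    """Greedy hill-climbing for IM: repeatedly add the node whose
    estimated expected spread (with the current seed set) is largest."""
    rng = random.Random(seed)
    seeds = []
    for _ in range(k):
        best, best_gain = None, -1.0
        for v in nodes:
            if v in seeds:
                continue
            gain = sum(simulate_ic(adj, seeds + [v], rng)
                       for _ in range(runs)) / runs
            if gain > best_gain:
                best, best_gain = v, gain
        seeds.append(best)
    return seeds

# Tiny two-community graph: nodes 0 and 3 each reach one community.
adj = {0: [(1, 0.9), (2, 0.9)], 3: [(4, 0.9), (5, 0.9)]}
seeds = greedy_im(adj, range(6), 2)
```

Because the spread function is monotone and submodular under this model, the greedy picks above carry the classic (1 - 1/e) approximation guarantee (up to Monte Carlo error).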
https://www.computer.org/csdl/trans/tk/2018/10/08300649-abs.html
Querying heterogeneous and large-scale knowledge graphs is expensive. This paper studies a graph summarization framework to facilitate knowledge graph search. (1) We introduce a class of <italic>reduced summaries</italic>. Characterized by approximate graph pattern matching, these summaries are capable of summarizing entities in terms of their neighborhood similarity up to a certain hop, using small and informative graph patterns. (2) We study a <italic> diversified graph summarization</italic> problem. Given a knowledge graph, it is to discover top-<inline-formula> <tex-math notation="LaTeX">$k$</tex-math><alternatives><inline-graphic xlink:href="song-ieq1-2807442.gif"/> </alternatives></inline-formula> summaries that maximize a bi-criteria function, characterized by both informativeness and diversity. We show that diversified summarization is feasible for large graphs, by developing both sequential and parallel summarization algorithms. (a) We show that there exists a 2-approximation algorithm to discover diversified summaries. We further develop an anytime sequential algorithm which discovers summaries under resource constraints. (b) We present a new parallel algorithm with quality guarantees. The algorithm is parallel scalable, which ensures its feasibility in distributed graphs. (3) We also develop a summary-based query evaluation scheme, which only refers to a small number of summaries. Using real-world knowledge graphs, we experimentally verify the effectiveness and efficiency of our summarization algorithms, and query processing using summaries.09/13/2018 4:38 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2807442Top-<italic>k</italic> Critical Vertices Query on Shortest Path
https://www.computer.org/csdl/trans/tk/2018/10/08300661-abs.html
The shortest path query, which returns the complete shortest path between any two vertices, is one of the most fundamental and classic problems in graph analytics. However, in many real-life scenarios, only critical vertices on the shortest path are desirable, and it is unnecessary to search for the complete path. This paper investigates the shortest path sketch by defining a top-<inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="ma-ieq1-2808495.gif"/></alternatives></inline-formula> critical vertices (<inline-formula> <tex-math notation="LaTeX">$k$</tex-math><alternatives><inline-graphic xlink:href="ma-ieq2-2808495.gif"/></alternatives> </inline-formula>CV) query on the shortest path. Given a source vertex <inline-formula><tex-math notation="LaTeX">$s$ </tex-math><alternatives><inline-graphic xlink:href="ma-ieq3-2808495.gif"/></alternatives></inline-formula> and target vertex <inline-formula><tex-math notation="LaTeX">$t$</tex-math><alternatives> <inline-graphic xlink:href="ma-ieq4-2808495.gif"/></alternatives></inline-formula> in a graph, the <inline-formula> <tex-math notation="LaTeX">$k$</tex-math><alternatives><inline-graphic xlink:href="ma-ieq5-2808495.gif"/></alternatives> </inline-formula>CV query returns the top-<inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="ma-ieq6-2808495.gif"/></alternatives></inline-formula> significant vertices on the shortest path <inline-formula><tex-math notation="LaTeX">$SP(s,t)$</tex-math><alternatives> <inline-graphic xlink:href="ma-ieq7-2808495.gif"/></alternatives></inline-formula>. The significance of the vertices can be predefined. The key strategy for seeking the sketch is to apply an off-line preprocessed distance oracle to accelerate on-line real-time queries. This allows us to omit unnecessary vertices and obtain the most representative sketch of the shortest path directly. 
We further explore a series of methods and optimizations to answer <inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="ma-ieq8-2808495.gif"/></alternatives></inline-formula>CV query on both centralized and distributed platforms, using exact and approximate approaches, respectively. We evaluate our methods in terms of time, space complexity and approximation quality. Experiments on large-scale real-world networks validate that our algorithms are of high efficiency and accuracy.09/13/2018 4:38 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2808495Self-Tuned Descriptive Document Clustering Using a Predictive Network
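As a point of reference, a naive kCV answer can be obtained by materializing the full shortest path and ranking its vertices by a predefined significance; the paper's contribution is precisely to avoid this full-path search via a distance oracle. A stdlib sketch of the naive baseline (graph, significance scores, and names are illustrative):

```python
import heapq

def shortest_path(adj, s, t):
    """Dijkstra's algorithm returning the full s-t shortest path."""
    dist, prev = {s: 0}, {}
    pq = [(0, s)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == t:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    # Walk predecessors back from t to reconstruct the path.
    path, u = [t], t
    while u != s:
        u = prev[u]
        path.append(u)
    return path[::-1]

def top_k_critical(adj, s, t, significance, k):
    """Naive kCV: compute SP(s, t), then keep only the k most
    significant vertices on it (significance is user-defined)."""
    path = shortest_path(adj, s, t)
    return sorted(sorted(path, key=lambda v: -significance[v])[:k])

adj = {0: [(1, 1)], 1: [(2, 1)], 2: [(3, 1)], 3: []}
sig = {0: 5, 1: 1, 2: 9, 3: 2}
# Path is 0-1-2-3; the two most significant vertices on it are 2 and 0.
crit = top_k_critical(adj, 0, 3, sig, 2)
```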
https://www.computer.org/csdl/trans/tk/2018/10/08301532-abs.html
Descriptive clustering consists of automatically organizing data instances into clusters and generating a descriptive summary for each cluster. The description should inform a user about the contents of each cluster without further examination of the specific instances, enabling a user to rapidly scan for relevant clusters. Selection of descriptions often relies on heuristic criteria. We model descriptive clustering as an auto-encoder network that predicts features from cluster assignments and predicts cluster assignments from a subset of features. The subset of features used for predicting a cluster serves as its description. For text documents, the occurrence or count of words, phrases, or other attributes provides a sparse feature representation with interpretable feature labels. In the proposed network, cluster predictions are made using logistic regression models, and feature predictions rely on logistic or multinomial regression models. Optimizing these models leads to a completely self-tuned descriptive clustering approach that automatically selects the number of clusters and the number of features for each cluster. We applied the methodology to a variety of short text documents and showed that the selected clustering, as evidenced by the selected feature subsets, is associated with a meaningful topical organization.09/13/2018 4:38 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2781721Locality Reconstruction Models for Book Representation
https://www.computer.org/csdl/trans/tk/2018/10/08301545-abs.html
Books, as a representative of lengthy documents, convey rich semantics. Traditional document modeling methods, such as bag-of-words models, have difficulty capturing such rich semantics when only considering term-frequency features. In order to explore term spatial distributions over a book, a tree-structured book representation is investigated in this paper. Moreover, an efficient learning framework, Tree2Vector, is introduced for mapping tree-structured book data into vectorial space. In particular, we present two types of locality reconstruction (LR) models: Euclidean-type and cosine-type, during the transformation process of tree structures into vectorial representations. The LR is used for modeling the reconstruction process, in which each parent node in a tree is supposed to be reconstructed by its child nodes. The prominent advantage of this Tree2Vector framework is that it solely utilizes the local information within a single book tree. In addition, extensive experimental results demonstrate that Tree2Vector is able to deliver comparable or better performance in comparison to methods that consider the information of all trees in a database globally. Experimental results also suggest that cosine-type LR consistently performs better than Euclidean-type LR in applications of book and author recommendations.09/13/2018 4:37 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2808953<inline-formula><tex-math notation="LaTeX">$\sf {SIMkNN}$</tex-math><alternatives> <inline-graphic xlink:href="cao-ieq1-2808971.gif"/></alternatives></inline-formula>: A Scalable Method for in-Memory <italic>k</italic>NN Search over Moving Objects in Road Networks
https://www.computer.org/csdl/trans/tk/2018/10/08301596-abs.html
Nowadays, many location-based applications require the ability to query <inline-formula> <tex-math notation="LaTeX">$k$</tex-math><alternatives><inline-graphic xlink:href="cao-ieq2-2808971.gif"/> </alternatives></inline-formula>-nearest neighbors over a very large scale of moving objects in road networks, e.g., taxi-calling and ride-sharing services. A traditional grid index with equal-sized cells cannot adapt to the skewed distribution of moving objects in real scenarios. Thus, to obtain fast query response times, the grid needs to be split into smaller cells, which introduces the side-effect of higher memory cost, i.e., maintaining such a large volume of cells requires a much larger memory space at the server side. In this paper, we present <inline-formula><tex-math notation="LaTeX">$\sf {SIMkNN}$</tex-math><alternatives> <inline-graphic xlink:href="cao-ieq3-2808971.gif"/></alternatives></inline-formula>, a scalable and in-memory <italic>k </italic>NN query processing technique. <inline-formula><tex-math notation="LaTeX">$\sf {SIMkNN}$</tex-math> <alternatives><inline-graphic xlink:href="cao-ieq4-2808971.gif"/></alternatives></inline-formula> is dual-index driven, where we adopt an R-tree to store the topology of the road network and a <italic>hierarchical grid model</italic> to manage the non-uniformly distributed moving objects. To answer a <italic>k</italic>NN query in real time, <inline-formula><tex-math notation="LaTeX">$\sf {SIMkNN}$</tex-math><alternatives> <inline-graphic xlink:href="cao-ieq5-2808971.gif"/></alternatives></inline-formula> adopts a strategy that incrementally enlarges the search area for network-distance-based nearest neighbor evaluation. It is far from trivial to perform the space expansion within the hierarchical grid index. 
For a given cell, we first define its neighbors in different directions, then propose a cell communication technique which allows each cell in the hierarchical grid index to be aware of its neighbors at any time. Accordingly, an efficient space expansion algorithm to generate the estimation area is proposed. The experimental evaluation shows that <inline-formula><tex-math notation="LaTeX">$\sf {SIMkNN}$</tex-math><alternatives><inline-graphic xlink:href="cao-ieq6-2808971.gif"/></alternatives></inline-formula> outperforms the baseline algorithm in terms of time and memory efficiency.09/13/2018 4:38 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2808971Efficient Parallel Skyline Query Processing for High-Dimensional Data
https://www.computer.org/csdl/trans/tk/2018/10/08302507-abs.html
Given a set of multidimensional data points, skyline queries retrieve those points that are not dominated by any other points in the set. Due to the ubiquitous use of skyline queries, such as in preference-based query answering and decision making, and the large amount of data that these queries have to deal with, enabling their scalable processing is of critical importance. However, there are several outstanding challenges that have not been well addressed. More specifically, in this paper, we are tackling the data straggler and data skew challenges introduced by distributed skyline query processing, as well as the ensuing high computation cost of merging skyline candidates. We thus introduce a new efficient three-phase approach for large scale processing of skyline queries. In the first preprocessing phase, the data is partitioned along the Z-order curve. We utilize a novel data partitioning approach that formulates data partitioning as an optimization problem to minimize the size of intermediate data. In the second phase, each compute node partitions the input data points into disjoint subsets, and then performs the skyline computation on each subset to produce skyline candidates in parallel. In the final phase, we build an index and employ an efficient algorithm to merge the generated skyline candidates. Extensive experiments demonstrate that the proposed skyline algorithm achieves more than one order of magnitude enhancement in performance compared to existing state-of-the-art approaches.09/13/2018 4:38 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2809598Tensor-Based Big Data Management Scheme for Dimensionality Reduction Problem in Smart Grid Systems: SDN Perspective
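The skyline definition in the first sentence corresponds to the following naive quadratic computation (a minimization convention is assumed); the paper's contribution is scaling this up via Z-order partitioning and parallel candidate merging.

```python
def dominates(p, q):
    """p dominates q if p is no worse in every dimension and strictly
    better in at least one (here: smaller is better)."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def skyline(points):
    """Naive O(n^2) skyline: keep exactly the points not dominated by
    any other point in the set."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

pts = [(1, 4), (2, 2), (4, 1), (3, 3), (5, 5)]
# (3, 3) and (5, 5) are dominated by (2, 2), so they drop out.
sky = skyline(pts)
```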
https://www.computer.org/csdl/trans/tk/2018/10/08302840-abs.html
Smart grid (SG) is an integration of traditional power grid with advanced information and communication infrastructure for bidirectional energy flow between grid and end users. A huge amount of data is being generated by various smart devices deployed in SG systems. Such a massive data generation from various smart devices in SG systems may lead to various challenges for the networking infrastructure deployed between users and the grid. Hence, an efficient data transmission technique is required for providing desired QoS to the end users in this environment. Generally, the data generated by smart devices in SG has high dimensions in the form of multiple heterogeneous attributes, values of which are changed with time. The high dimensions of data may affect the performance of most of the designed solutions in this environment. Most of the existing schemes reported in the literature have complex operations for the data dimensionality reduction problem which may deteriorate the performance of any implemented solution for this problem. To address these challenges, in this paper, a tensor-based big data management scheme is proposed for dimensionality reduction problem of big data generated from various smart devices. In the proposed scheme, first the Frobenius norm is applied on high-order tensors (used for data representation) to minimize the reconstruction error of the reduced tensors. Then, an empirical probability-based control algorithm is designed to estimate an optimal path to forward the reduced data using software-defined networks for minimization of the network load and effective bandwidth utilization. The proposed scheme minimizes the transmission delay incurred during the movement of the dimensionally reduced data between different nodes. The efficacy of the proposed scheme has been evaluated using extensive simulations carried out on the data traces using ‘R’ programming and Matlab. 
The big data traces considered for evaluation consist of more than two million entries (2,075,259) collected at a one-minute sampling rate, with heterogeneous features such as voltage, energy, frequency, and electric signals. Moreover, a comparative study for different data traces and a real SG testbed is also presented to prove the efficacy of the proposed scheme. The results obtained demonstrate the effectiveness of the proposed scheme with respect to parameters such as network delay, accuracy, and throughput.09/13/2018 4:38 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2809747Planning with Spatio-Temporal Search Control Knowledge
https://www.computer.org/csdl/trans/tk/2018/10/08303742-abs.html
Knowledge based approaches developed for AI planning can convert an intractable planning problem to a tractable one. Current techniques often use temporal logics to express Search Control Knowledge (SCK) in logic based planning. However, traditional temporal logics are limited in expressiveness since they are unable to express spatial constraints which are as important as temporal ones in many planning domains. To this end, we propose a two-dimensional (spatial and temporal) logic namely PPTL<inline-formula><tex-math notation="LaTeX">$^{\mathrm{SL}}$ </tex-math><alternatives><inline-graphic xlink:href="lu-ieq1-2810144.gif"/></alternatives></inline-formula> by temporalizing separation logic with PPTL (Propositional Projection Temporal Logic) which is well-suited to specify SCK involving both spatial and temporal constraints in planning. We prove that PPTL<inline-formula> <tex-math notation="LaTeX">$^{\mathrm{SL}}$</tex-math><alternatives><inline-graphic xlink:href="lu-ieq2-2810144.gif"/> </alternatives></inline-formula> is decidable essentially via an equisatisfiable translation from PPTL<inline-formula> <tex-math notation="LaTeX">$^{\mathrm{SL}}$</tex-math><alternatives><inline-graphic xlink:href="lu-ieq3-2810144.gif"/> </alternatives></inline-formula> to its restricted form. Moreover, we implement a tool, <italic>S-TSolver</italic>, which effectively computes plans under the guidance of the spatio-temporal SCK expressed by PPTL<inline-formula> <tex-math notation="LaTeX">$^{\mathrm{SL}}$</tex-math><alternatives><inline-graphic xlink:href="lu-ieq4-2810144.gif"/> </alternatives></inline-formula> formulas. The effectiveness of the tool is evaluated on selected benchmark domains from the International Planning Competition.09/13/2018 4:37 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2810144Semi-Supervised Feature Selection via Insensitive Sparse Regression with Application to Video Semantic Recognition
https://www.computer.org/csdl/trans/tk/2018/10/08304684-abs.html
Feature selection plays a significant role in dealing with high-dimensional data to avoid the curse of dimensionality. In many real applications, like video semantic recognition, handling few labeled and large unlabeled data samples from the same population is a recently addressed challenge in feature selection. To solve this problem, we propose a novel semi-supervised feature selection method via insensitive sparse regression (ISR). Specifically, we compute the soft label matrix by a special label propagation, which can predict the labels of the unlabeled data. To guarantee the robustness of ISR to falsely labeled instances or outliers, we propose the Insensitive Regression Model (IRM) with a capped <inline-formula><tex-math notation="LaTeX">$l_2$</tex-math><alternatives> <inline-graphic xlink:href="hou-ieq1-2810286.gif"/></alternatives></inline-formula>-<inline-formula> <tex-math notation="LaTeX">$l_p$</tex-math><alternatives> <inline-graphic xlink:href="hou-ieq2-2810286.gif"/></alternatives></inline-formula>-norm loss. The soft label is imposed as the weights of IRM to fully utilize the label information. Meanwhile, to perform feature selection, we incorporate the <inline-formula><tex-math notation="LaTeX"> $l_{2,q}$</tex-math><alternatives> <inline-graphic xlink:href="hou-ieq3-2810286.gif"/></alternatives></inline-formula> -norm regularizer with IRM as the structural sparsity constraint when <inline-formula><tex-math notation="LaTeX"> $0<q\leq 1$</tex-math><alternatives> <inline-graphic xlink:href="hou-ieq4-2810286.gif"/></alternatives> </inline-formula>. Moreover, we put forward an effective approach for solving the formulated non-convex optimization problem. We rigorously analyze the convergence behavior and discuss the parameter determination problem. Extensive experimental results on several public data sets verify the effectiveness of our proposed algorithm in comparison with state-of-the-art feature selection methods. 
Finally, we successfully apply our method to video semantic recognition.
09/12/2018 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2810286
We Like, We Post: A Joint User-Post Approach for Facebook Post Stance Labeling
https://www.computer.org/csdl/trans/tk/2018/10/08305481-abs.html
Web post and user stance labeling is challenging not only because of the informality and variation in language on the Web but also because of the lack of labeled data on fast-emerging new topics; even the labeled data we do have are usually heavily skewed. In this paper, we propose a joint user-post approach for stance labeling to mitigate the latter two difficulties. In labeling post stance, the proposed approach considers post content as well as posting and liking behavior, which involves users. Sentiment analysis is applied to posts to acquire their initial stance, and the post and user stances are then updated iteratively through correlated posting-related actions. The whole process works with limited labeled data, which addresses the first of these difficulties. We use real interactions between authors and readers for stance labeling. Experimental results show that the proposed approach not only substantially improves content-based post stance labeling but also yields better performance for the minority stance class, which addresses the second.
09/13/2018 4:37 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2810875
Multi-Label Learning with Emerging New Labels
https://www.computer.org/csdl/trans/tk/2018/10/08305522-abs.html
In a multi-label learning task, an object possesses multiple concepts, where each concept is represented by a class label. Previous studies on multi-label learning have focused on a fixed set of class labels, i.e., the class label set of the test data is the same as that of the training set. In many applications, however, the environment is dynamic and new concepts may emerge in a data stream. In order to maintain good predictive performance in this environment, a multi-label learning method must have the ability to detect and classify instances with emerging new labels. To this end, we propose a new approach called Multi-label learning with Emerging New Labels (<monospace>MuENL</monospace>). It has three functions: classify instances on currently known labels, detect the emergence of a new label, and construct a new classifier for each new label that works collaboratively with the classifier for known labels. In addition, we show that <monospace>MuENL</monospace> can be easily extended to handle sparse high-dimensional data streams by simply reducing the original dimensionality and then applying <monospace>MuENL</monospace> on the reduced dimensional space. Our empirical evaluation shows the effectiveness of <monospace>MuENL</monospace> on several benchmark datasets and of its high-dimensional extension <monospace>MuENLHD</monospace> on the sparse high-dimensional Weibo dataset.
09/13/2018 4:38 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2810872
Supervised Search Result Diversification via Subtopic Attention
https://www.computer.org/csdl/trans/tk/2018/10/08305531-abs.html
Search result diversification aims to retrieve diverse results to satisfy as many different information needs as possible. Supervised methods have been proposed recently to learn ranking functions, and they have been shown to produce superior results to unsupervised methods. However, these methods use implicit approaches based on the principle of Maximal Marginal Relevance (MMR). In this paper, we propose a learning framework for explicit result diversification where subtopics are explicitly modeled. Based on the information contained in the sequence of selected documents, we use the attention mechanism to capture the subtopics to be focused on while selecting the next document, which naturally fits our task of document selection for diversification. As a preliminary attempt, we employ recurrent neural networks and max pooling to instantiate the framework. We use both distributed representations and traditional relevance features to model documents in the implementation. The framework is flexible enough to model query intent as either a flat list or a hierarchy. Experimental results show that the proposed method significantly outperforms all the existing search result diversification approaches.
09/13/2018 4:38 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2810873
Automated Phrase Mining from Massive Text Corpora
https://www.computer.org/csdl/trans/tk/2018/10/08306825-abs.html
As one of the fundamental tasks in text analysis, phrase mining aims at extracting quality phrases from a text corpus and has various downstream applications including information extraction/retrieval, taxonomy construction, and topic modeling. Most existing methods rely on complex, trained linguistic analyzers, and thus are likely to perform unsatisfactorily on text corpora of new domains and genres without extra, expensive adaptation. None of the state-of-the-art models, even data-driven ones, is fully automated, because they require human experts to design rules or label phrases. In this paper, we propose a novel framework for automated phrase mining, <inline-formula> <tex-math notation="LaTeX">$\mathsf{AutoPhrase}$</tex-math><alternatives> <inline-graphic xlink:href="shang-ieq1-2812203.gif"/></alternatives></inline-formula>, which supports any language as long as a general knowledge base (e.g., Wikipedia) in that language is available, while benefiting from, but not requiring, a POS tagger. Compared to the state-of-the-art methods, <inline-formula><tex-math notation="LaTeX"> $\mathsf{AutoPhrase}$</tex-math><alternatives><inline-graphic xlink:href="shang-ieq2-2812203.gif"/></alternatives> </inline-formula> shows significant improvements in both effectiveness and efficiency on five real-world datasets across different domains and languages. Moreover, <inline-formula><tex-math notation="LaTeX">$\mathsf{AutoPhrase}$ </tex-math><alternatives><inline-graphic xlink:href="shang-ieq3-2812203.gif"/></alternatives></inline-formula> can be extended to model single-word quality phrases.
09/13/2018 4:38 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2812203
AMDO: An Over-Sampling Technique for Multi-Class Imbalanced Problems
https://www.computer.org/csdl/trans/tk/2018/09/08065074-abs.html
Multi-class imbalanced problems arise in many real-world classification tasks in engineering and have attracted growing attention. The underlying skewed distribution over multiple classes poses difficulties for learning algorithms, which become more challenging when class overlap, a lack of representative data, and mixed-type data are also present. In this work, we address this problem in a data-oriented way. Motivated by a recently proposed over-sampling technique designed for numeric data sets, Mahalanobis Distance-based Over-sampling (MDO), we use this technique to capture the covariance structure of the minority class and to generate synthetic samples along its probability contours for learning algorithms. Based on MDO, we further improve the over-sampling strategy and generalize it to mixed-type data sets. The resulting technique, Adaptive Mahalanobis Distance-based Over-sampling (AMDO), introduces Generalized Singular Value Decomposition (GSVD) for mixed-type data, develops a partially balanced resampling scheme, and optimizes the sample synthesis. Theoretical analysis is conducted to demonstrate the soundness of AMDO. Extensive experimental testing is performed on 15 multi-class imbalanced benchmarks and two data sets for precipitation phase recognition, in comparison with several state-of-the-art multi-class imbalanced learning methods. The results validate the effectiveness and robustness of our proposal.
08/07/2018 12:33 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2761347
Privacy Characterization and Quantification in Data Publishing
https://www.computer.org/csdl/trans/tk/2018/09/08276593-abs.html
The increasing interest in collecting and publicly publishing large amounts of individuals’ data for purposes such as medical research, market analysis, and economic measures has created major privacy concerns about individuals’ sensitive information. To deal with these concerns, many Privacy-Preserving Data Publishing (PPDP) techniques have been proposed in the literature. However, they lack a proper privacy characterization and measurement. In this paper, we first present a novel multi-variable privacy characterization and quantification model. Based on this model, we are able to analyze the prior and posterior adversarial beliefs about attribute values of individuals. We can also analyze the sensitivity of any identifier in privacy characterization. We then show that privacy should not be measured with a single metric, and demonstrate how doing so can result in privacy misjudgment. We propose two different metrics for quantifying privacy leakage: distribution leakage and entropy leakage. Using these metrics, we analyze some of the most well-known PPDP techniques, such as <inline-formula><tex-math notation="LaTeX">$k$</tex-math> <alternatives><inline-graphic xlink:href="ibrahim-ieq1-2797092.gif"/></alternatives></inline-formula>-anonymity, <inline-formula><tex-math notation="LaTeX">$l$</tex-math><alternatives> <inline-graphic xlink:href="ibrahim-ieq2-2797092.gif"/></alternatives></inline-formula>-diversity, and <inline-formula> <tex-math notation="LaTeX">$t$</tex-math><alternatives><inline-graphic xlink:href="ibrahim-ieq3-2797092.gif"/> </alternatives></inline-formula>-closeness. Based on our framework and the proposed metrics, we determine that all the existing PPDP schemes have limitations in privacy characterization. Our proposed privacy characterization and measurement framework contributes to a better understanding and evaluation of these techniques.
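The general idea behind an entropy-based leakage metric can be sketched in a few lines: leakage is the drop in the adversary's uncertainty (Shannon entropy) about a sensitive attribute after observing the published data. The distributions below are toy values, and the paper's exact definition of entropy leakage may differ from this illustration.

```python
import math

def entropy(dist):
    # Shannon entropy (in bits) of a discrete probability distribution.
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Adversary's prior belief over a sensitive attribute, and the posterior
# belief after observing an individual's published equivalence class.
prior = [0.5, 0.3, 0.2]
posterior = [0.9, 0.05, 0.05]

# Entropy leakage: reduction in adversarial uncertainty, in bits.
entropy_leakage = entropy(prior) - entropy(posterior)
```

A positive value means publication sharpened the adversary's belief about the sensitive attribute.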
Thus, this paper provides a foundation for the design and analysis of PPDP schemes.
08/07/2018 12:33 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2797092
On Generalizing Collective Spatial Keyword Queries
https://www.computer.org/csdl/trans/tk/2018/09/08278270-abs.html
With the proliferation of spatial-textual data such as location-based services and geo-tagged websites, spatial keyword queries are ubiquitous in real life. One example is the so-called <italic>collective spatial keyword query</italic> (CoSKQ), which, given a query consisting of a query location and several query keywords, finds a set of objects that <italic>covers</italic> the query keywords collectively and has the smallest <italic>cost</italic> with respect to the query location. In the literature, many different functions have been proposed for defining the <inline-formula><tex-math notation="LaTeX">${cost}$</tex-math><alternatives> <inline-graphic xlink:href="chan-ieq1-2800746.gif"/></alternatives></inline-formula> and, correspondingly, many different approaches have been developed for the CoSKQ problem. In this paper, we study the CoSKQ problem systematically by proposing <italic>a unified cost function</italic> and <italic>a unified approach</italic> for the CoSKQ problem (with the unified cost function). The unified cost function includes all existing cost functions as special cases, and the unified approach solves the CoSKQ problem with the unified cost function in a unified way. Experiments conducted on both real and synthetic datasets verify our proposed approach.
08/07/2018 12:33 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2800746
On Power Law Growth of Social Networks
https://www.computer.org/csdl/trans/tk/2018/09/08280512-abs.html
What are the growth dynamics of social networks like Facebook or WeChat? Do they truly exhibit exponential early growth, as predicted by celebrated models such as the Bass model? What about the dynamics of links, for which there are few published models? For the first time, we examine the growth of WeChat, the largest online social network in China, together with several other real social networks. We observe power-law growth dynamics for both nodes and links, a fact that breaks the textbook models featuring sigmoid curves. We propose <sc>NetTide</sc>, along with differential equations for the growth of nodes and links. Our model fits the growth dynamics of real social networks well; it encompasses many traditional growth dynamics as special cases, while remaining parsimonious in parameters. The <sc>NetTide</sc> model for link growth is the first of its kind, accurately fitting real data and capturing the densification phenomenon. We further formulate two stochastic generators, which interpret the growth of nodes and links through survival analysis and micro-level interactions within a social network, respectively. The proposed generators reproduce realistic growth dynamics of social networks. When applied to the WeChat data, our <sc>NetTide</sc> forecasted <inline-formula><tex-math notation="LaTeX">$\geq$</tex-math><alternatives> <inline-graphic xlink:href="zang-ieq1-2801844.gif"/></alternatives></inline-formula> 730 days ahead with 3 percent error.
08/07/2018 12:33 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2801844
Querying a Collection of Continuous Functions
https://www.computer.org/csdl/trans/tk/2018/09/08283622-abs.html
We introduce a new query primitive called the <italic>Function Query</italic> (FQ). An FQ operates on a set of math functions and retrieves the functions whose output for a given input satisfies a query condition (e.g., being among the top k, or within a given range). While the FQ finds its natural use in querying a database of math functions, it can also be applied to a database of discrete values. We show that by interpreting the database as a set of user-defined functions, an FQ can achieve the same functionality as existing analytic queries such as the top-k query and the scalar product query. We address the challenge of efficient FQ execution. The core of our solution is a novel data structure called the <italic>Intersection-tree</italic>. Our research takes advantage of the fact that 1) the intersections of a set of continuous functions partition their domain into a number of <italic>subdomains</italic>, and 2) in each of these subdomains, the functions can be sorted based on their output. We evaluate the performance of the proposed techniques through analysis, prototyping, and experiments using both synthetic and real-world data. When querying a database of functions, our techniques scale well. When applied to a database of discrete values, our techniques are more versatile and outperform existing techniques in terms of various performance metrics.
08/07/2018 12:33 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2802936
On Efficiently Answering Why-Not Range-Based Skyline Queries in Road Networks
https://www.computer.org/csdl/trans/tk/2018/09/08283816-abs.html
The range-based skyline (r-skyline) query on road networks retrieves the skyline objects for each of the query points within a road region, considering the objects’ spatial and non-spatial attributes. However, reasoning about missing query results, specified by <italic>why-not questions</italic>, did not, until recently, receive the attention it deserves. In this paper, we carry out a systematic study of why-not questions on the r-skyline query in the road network environment (abbreviated as the <italic>why-not RSQ problem</italic>). We present three modification strategies for supporting the why-not RSQ problem: modifying the query range, modifying the why-not point, and modifying both. We also propose three efficient algorithms to tackle the why-not RSQ problem, which leverage several newly presented concepts and techniques, such as the <italic>skyline scope</italic> and <italic>skyline dominance region</italic>, <italic>non-spatial attribute modification pruning</italic>, and the <italic>G-tree index</italic>. Extensive experimental evaluation using both real and synthetic data sets demonstrates the performance of our proposed algorithms.
08/07/2018 12:33 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2803821
A Fast Parallel Community Discovery Model on Complex Networks Through Approximate Optimization
https://www.computer.org/csdl/trans/tk/2018/09/08283822-abs.html
Community discovery plays an essential role in the analysis of the structural features of complex networks. Since online networks grow increasingly large and complex over time, the methods traditionally used for community discovery cannot efficiently handle large-scale network data. This raises the important problem of how to effectively and efficiently discover large communities in complex networks. In this study, we propose a fast parallel community discovery model called picaso (a <bold>p</bold>arallel commun<bold>i</bold>ty dis<bold>c</bold>overy <bold>a</bold>lgorithm ba<bold>s</bold>ed on approximate <bold>o</bold>ptimization), which integrates two new techniques: (1) the Mountain model, which utilizes graph theory to approximate the selection of nodes needed for merging, and (2) the Landslide algorithm, which updates the modularity increment based on the approximate optimization. In addition, the GraphX distributed computing framework is employed in order to achieve parallel community detection over complex networks. In the proposed model, clustering on modularity is used to initialize the Mountain model as well as to compute the weight of each edge in the network. The relationships among the communities are then simplified by applying the Landslide algorithm, which allows us to obtain the community structures of the complex networks. Extensive experiments were conducted on real and synthetic complex network datasets, and the results demonstrate that the proposed algorithm outperforms state-of-the-art methods, in both effectiveness and efficiency, on the community detection problem. Moreover, we show that its overall running time is approximately four times faster than that of similar approaches.
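The Mountain model and Landslide algorithm themselves are not reproduced here, but the modularity objective whose increments they approximate can be sketched directly from its standard definition; the two-triangle toy graph below is an illustrative assumption, not data from the paper.

```python
from collections import defaultdict

def modularity(edges, community):
    # Q = sum over communities c of  e_c/m - (d_c / 2m)^2,
    # where e_c = number of edges inside c, d_c = total degree of c's nodes.
    m = len(edges)
    degree = defaultdict(int)
    inside = defaultdict(int)
    deg_sum = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            inside[community[u]] += 1
    for node, k in degree.items():
        deg_sum[community[node]] += k
    return sum(inside[c] / m - (deg_sum[c] / (2.0 * m)) ** 2
               for c in deg_sum)

# Two triangles joined by a single bridge edge: a clear two-community graph.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
q = modularity(edges, {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'B'})
```

Greedy agglomerative methods repeatedly merge the pair of communities with the largest increase in this Q; approximating those increments cheaply is where the Landslide-style update pays off.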
In effect, our results suggest a new paradigm for large-scale community discovery on complex networks.
08/07/2018 12:33 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2803818
Privacy Enhanced Matrix Factorization for Recommendation with Local Differential Privacy
https://www.computer.org/csdl/trans/tk/2018/09/08290673-abs.html
Recommender systems collect and analyze user data to provide a better user experience. However, several privacy concerns have been raised when a recommender knows a user's set of items or their ratings. A number of solutions have been suggested to improve the privacy of legacy recommender systems, but the existing solutions in the literature can protect either items or ratings only. In this paper, we propose a recommender system that protects both users' items and ratings. To this end, we develop novel matrix factorization algorithms under local differential privacy (LDP). In a recommender system with LDP, individual users randomize their data themselves to satisfy differential privacy and send the perturbed data to the recommender. The recommender then computes aggregates of the perturbed data. This framework ensures that both users' items and ratings remain private from the recommender. However, applying LDP to matrix factorization typically raises utility issues due to i) the high dimensionality caused by a large number of items and ii) iterative estimation algorithms. To tackle these technical challenges, we adopt a dimensionality reduction technique and a novel binary mechanism based on sampling. We additionally introduce a factor that stabilizes the perturbed gradients. With the MovieLens and LibimSeTi datasets, we evaluate the recommendation accuracy of our recommender system and demonstrate that our algorithm performs better than the existing differentially private gradient descent algorithm for matrix factorization under stronger privacy requirements.
08/07/2018 12:33 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2805356
Optimizing Quality for Probabilistic Skyline Computation and Probabilistic Similarity Search
https://www.computer.org/csdl/trans/tk/2018/09/08291012-abs.html
Probabilistic queries have been extensively explored to provide answers with confidence, in order to support real-life applications struggling with uncertain data, such as sensor networks and data integration. However, the uncertainty of data may propagate, and thus the results returned by probabilistic queries contain much noise, which <italic>degrades</italic> query quality significantly. In this paper, we propose an efficient optimization framework, termed <inline-formula><tex-math notation="LaTeX">$\mathsf {QueryClean}$</tex-math><alternatives> <inline-graphic xlink:href="gao-ieq1-2805824.gif"/></alternatives></inline-formula>, for both probabilistic skyline computation and probabilistic similarity search. The goal of <inline-formula><tex-math notation="LaTeX">$\mathsf {QueryClean}$</tex-math><alternatives><inline-graphic xlink:href="gao-ieq2-2805824.gif"/></alternatives> </inline-formula> is to optimize query quality by selecting a group of uncertain objects to clean under limited available resources, where a joint-entropy based quality function is leveraged. We develop an efficient structure called <inline-formula><tex-math notation="LaTeX">$\mathsf {ASI}$</tex-math><alternatives> <inline-graphic xlink:href="gao-ieq3-2805824.gif"/></alternatives></inline-formula> to index the possible result sets of probabilistic queries, which helps to avoid many types of probabilistic query evaluations over a large number of possible worlds for quality computation. Moreover, we present <italic>exact</italic> and <italic>approximate</italic> algorithms for the optimization problem, using two newly presented heuristics.
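The joint-entropy quality notion can be illustrated with a tiny sketch: enumerate the possible worlds of an uncertain database and measure the entropy of the query-result distribution, where lower entropy means a less ambiguous answer. The objects, their value distributions, and the min query below are toy assumptions, not the paper's formulation or its ASI index.

```python
import math
from itertools import product

# Each uncertain object is a distribution over its possible values.
objects = {
    'o1': {1: 0.6, 2: 0.4},
    'o2': {2: 0.7, 3: 0.3},
}

def result_distribution(objects, query):
    # Enumerate every possible world and accumulate the probability
    # of each distinct query answer.
    names = list(objects)
    dist = {}
    for choice in product(*(objects[n].items() for n in names)):
        world = {n: v for n, (v, _) in zip(names, choice)}
        prob = math.prod(p for _, p in choice)
        ans = query(world)
        dist[ans] = dist.get(ans, 0.0) + prob
    return dist

def entropy(dist):
    # Uncertainty (bits) of the query result; 0 means a certain answer.
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Toy query: the minimum value over all objects.
dist = result_distribution(objects, lambda w: min(w.values()))
h = entropy(dist)
```

Cleaning an object collapses its distribution to one value; choosing the objects whose cleaning most reduces this entropy, without exhaustively enumerating worlds, is the optimization the framework addresses.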
Extensive experimental results on both real and synthetic data sets demonstrate the efficiency and scalability of our proposed framework <inline-formula><tex-math notation="LaTeX">$\mathsf {QueryClean}$</tex-math><alternatives> <inline-graphic xlink:href="gao-ieq4-2805824.gif"/></alternatives></inline-formula>.
08/07/2018 12:32 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2805824
Relationship between Variants of One-Class Nearest Neighbors and Creating Their Accurate Ensembles
https://www.computer.org/csdl/trans/tk/2018/09/08293843-abs.html
In one-class classification problems, only data for the target class are available, whereas data for the non-target class may be completely absent. In this paper, we study one-class nearest neighbor (OCNN) classifiers and their different variants. We present a theoretical analysis to show the relationships among different variants of OCNN that may use different neighbors or thresholds to identify unseen examples of the non-target class. We also present a method based on the inter-quartile range for optimizing the parameters used in OCNN in the absence of non-target data during training. We then propose two ensemble approaches, based on the random subspace and random projection methods, to create accurate OCNN ensembles. We tested the proposed methods on 15 benchmark and real-world domain-specific datasets and show that random-projection ensembles of OCNN perform best.
08/07/2018 12:33 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2806975
CRAFTER: A Tree-Ensemble Clustering Algorithm for Static Datasets with Mixed Attributes and High Dimensionality
https://www.computer.org/csdl/trans/tk/2018/09/08294273-abs.html
Clustering is an important aspect of data mining, yet clustering high-dimensional mixed-attribute data in a scalable fashion remains a challenging problem. In this paper, we propose a tree-ensemble clustering algorithm for static datasets, CRAFTER, to tackle this problem. CRAFTER is able to handle categorical and numeric attributes simultaneously, and scales well with the dimensionality and the size of datasets. CRAFTER leverages the advantages of a tree ensemble to handle mixed attributes and high dimensionality. The concept of class probability estimates is utilized to identify the representative data points for clustering. Through a series of experiments on both synthetic and real datasets, we demonstrate that CRAFTER is superior to Random Forest Clustering (RFC), an existing tree-based clustering method, in terms of both clustering quality and computational cost.
08/07/2018 12:33 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2807444
A Comprehensive Survey of Graph Embedding: Problems, Techniques, and Applications
https://www.computer.org/csdl/trans/tk/2018/09/08294302-abs.html
Graphs are an important data representation appearing in a wide diversity of real-world scenarios. Effective graph analytics provides users with a deeper understanding of what lies behind the data, and thus can benefit many useful applications such as node classification, node recommendation, and link prediction. However, most graph analytics methods suffer from high computation and space costs. Graph embedding is an effective yet efficient way to solve the graph analytics problem. It converts the graph data into a low-dimensional space in which the graph structural information and graph properties are maximally preserved. In this survey, we conduct a comprehensive review of the literature on graph embedding. We first introduce the formal definition of graph embedding as well as the related concepts. After that, we propose two taxonomies of graph embedding, which correspond to the challenges that exist in different graph embedding problem settings and how the existing work addresses these challenges in its solutions. Finally, we summarize the applications that graph embedding enables and suggest four promising future research directions in terms of computation efficiency, problem settings, techniques, and application scenarios.
08/07/2018 12:33 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2807452
A Survey of Location Prediction on Twitter
https://www.computer.org/csdl/trans/tk/2018/09/08295255-abs.html
Locations, e.g., countries, states, cities, and points of interest, are central to news, emergency events, and people's daily lives. Automatic identification of locations associated with or mentioned in documents has been explored for decades. As one of the most popular online social network platforms, Twitter has attracted a large number of users who send millions of tweets on a daily basis. Due to the world-wide coverage of its users and the real-time freshness of tweets, location prediction on Twitter has gained significant attention in recent years. Research efforts have been devoted to dealing with the new challenges and opportunities brought by the noisy, short, and context-rich nature of tweets. In this survey, we aim to offer an overall picture of location prediction on Twitter. Specifically, we concentrate on the prediction of user home locations, tweet locations, and mentioned locations. We first define the three tasks and review the evaluation metrics. By summarizing the Twitter network, tweet content, and tweet context as potential inputs, we then structurally highlight how the problems depend on these inputs. Each dependency is illustrated by a comprehensive review of the corresponding strategies adopted in state-of-the-art approaches. In addition, we briefly review two related problems, i.e., semantic location prediction and point-of-interest recommendation. Finally, we conclude the survey and list future research directions.
08/07/2018 12:33 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2807840
Unsupervised Coupled Metric Similarity for Non-IID Categorical Data
https://www.computer.org/csdl/trans/tk/2018/09/08300657-abs.html
Appropriate similarity measures always play a critical role in data analytics, learning, and processing. Measuring the intrinsic similarity of categorical data for unsupervised learning has not been substantially addressed, and even less effort has been made for the similarity analysis of categorical data that is not independent and identically distributed (non-IID). In this work, a Coupled Metric Similarity (CMS) is defined for unsupervised learning, which flexibly captures the value-to-attribute-to-object heterogeneous coupling relationships. CMS learns similarities in terms of intrinsic heterogeneous intra- and inter-attribute couplings and attribute-to-object couplings in categorical data. The validity of CMS is guaranteed by satisfying metric properties and conditions, and CMS can flexibly adapt to both IID and non-IID data. CMS is incorporated into spectral clustering and k-modes clustering and compared with relevant state-of-the-art similarity measures that are not necessarily metrics. The experimental results and theoretical analysis show the effectiveness of CMS in capturing independent and coupled data characteristics, significantly outperforming other similarity measures on most datasets.
08/07/2018 2:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2808532
Correction to “K Nearest Neighbour Joins for Big Data on MapReduce: A Theoretical and Experimental Analysis”
https://www.computer.org/csdl/trans/tk/2018/09/08426042-abs.html
08/06/2018 2:01 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2748438
Low-Rank Multi-View Embedding Learning for Micro-Video Popularity Prediction
https://www.computer.org/csdl/trans/tk/2018/08/08233154-abs.html
Recently, micro-videos have become a prevailing form of user-generated content (UGC) on social media sites. Micro-videos afford many potential opportunities ranging from network content caching to online advertising, yet little effort has been dedicated to research on micro-video understanding. In this paper, we focus on popularity prediction of micro-videos by presenting a novel low-rank multi-view embedding learning framework. We name it transductive low-rank multi-view regression (TLRMVR); it boosts the performance of micro-video popularity prediction by jointly considering the intrinsic representations of the source and target samples. In particular, TLRMVR integrates low-rank multi-view embedding and regression analysis into a unified framework such that the lowest-rank representation shared by all views not only captures the global structure of all views but also reflects the regression requirements. The framework is formulated as a regression model that seeks a set of view-specific projection matrices with low-rank constraints to map multi-view features into a common subspace. In addition, a multi-graph regularization term is constructed to improve the generalization capability and prevent overfitting. Extensive experiments conducted on a publicly available dataset demonstrate that our proposed method achieves promising results compared with state-of-the-art baselines.
07/06/2018 2:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2785784
Learning Multi-Instance Deep Ranking and Regression Network for Visual House Appraisal
https://www.computer.org/csdl/trans/tk/2018/08/08253468-abs.html
This paper presents a weakly supervised regression model for the <italic>visual house appraisal</italic> problem, which aims to predict the value of a house from its photos and textual descriptions (e.g., number of bedrooms). The key idea of our approach is a multi-layer neural network, called the <italic>multi-instance Deep Ranking and Regression</italic> (MiDRR) net, which jointly solves two coupled tasks, ranking and regression, in the multiple-instance setting. The network is trained using weakly supervised data, which does not require intensive human annotation. We also design a set of human heuristics to promote deep features by imposing constraints over the solution space, e.g., a house with three bedrooms often has a higher value than one with only two bedrooms. While these constraints are specific to the studied problem, the developed formulation can be easily generalized to other regression applications. For test and evaluation purposes, we collect a comprehensive house image benchmark that includes 900,000 photos from 30,000 houses recently traded in the USA, and apply the proposed MiDRR net to predict house values. Extensive evaluations with comparisons demonstrate that the additional use of imagery data as well as human heuristics can significantly boost system performance, and that the proposed MiDRR net clearly outperforms the alternative methods.
07/06/2018 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2791611
NetCycle+: A Framework for Collective Evolution Inference in Dynamic Heterogeneous Networks
https://www.computer.org/csdl/trans/tk/2018/08/08254360-abs.html
Collective inference has attracted considerable attention in the last decade; the response variables within a group of instances are correlated and should be inferred collectively, rather than independently. Previous works on collective inference mainly focus on exploiting the <italic>autocorrelation</italic> among instances in a <italic>static</italic> network during the inference process. There are also approaches to time series prediction, which mainly exploit the autocorrelation within an instance at different time points. However, in many real-world applications, the response variables of related instances can co-evolve over time, and their evolution does not follow a static correlation across time but instead follows an internal <italic>life cycle</italic>. In this paper, we study the problem of <italic>collective evolution inference</italic>, where the goal is to predict the values of the response variables for a group of related instances at the end of their life cycles. This problem is extremely important for various applications, e.g., predicting fund-raising results in crowd-funding and predicting gene-expression levels in bioinformatics. It is also highly challenging because different instances in the network can co-evolve over time and can be at different stages of their life cycles, and thus have different evolving patterns. Moreover, the instances in collective evolution inference problems are usually connected through <italic>heterogeneous information networks</italic> (HINs for short), which involve complex relationships among instances interconnected by multiple types of links. We propose an approach, called <italic>NetCycle+</italic>, that incorporates information from both the correlation among related instances and their life cycles. Furthermore, in order to study the deep dependencies between nodes in the network, we incorporate the graph convolution model into our algorithm.
We compared our approach with existing methods of collective inference and time series analysis on two real-world networks. The results demonstrate that our proposed approach can improve the inference performance by considering the autocorrelation through networks and the life cycles of the instances.07/06/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2792020SDE: A Novel Clustering Framework Based on Sparsity-Density Entropy
https://www.computer.org/csdl/trans/tk/2018/08/08254398-abs.html
Clustering of data with high dimension and variable densities poses a remarkable challenge to traditional density-based clustering methods. Entropy, a numerical measure of the uncertainty of information, can be used to measure the border degree of samples in data space and to select significant features from a feature set. We use it in a new framework based on sparsity-density entropy (SDE) to cluster data with high dimension and variable densities. First, SDE conducts high-quality sampling for multidimensional data and selects representative features using sparsity score entropy (SSE). Second, the clustering results and noise are obtained by a new density-variable clustering method called density entropy (DE). DE automatically determines the border set based on the global minimum of border degrees and then adaptively performs cluster analysis for each local cluster based on the local minimum of border degrees. The effectiveness and efficiency of the proposed SDE framework are validated on synthetic and real data sets in comparison with several clustering algorithms. The results show that the proposed SDE framework concurrently detects noise and handles data with high dimension and varied densities.07/06/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2792021Making a Small World Smaller: Path Optimization in Networks
https://www.computer.org/csdl/trans/tk/2018/08/08255632-abs.html
Reduction of end-to-end network delay is an optimization task with applications in multiple domains. Low delays enable improved information flow in social networks, quick spread of ideas in collaboration networks, low travel times for vehicles on road networks, and increased rate of packets in the case of communication networks. Delay reduction can be achieved by both improving the propagation capabilities of individual nodes and adding additional edges in the network. One of the main challenges in such network design problems is that the effects of local changes are not independent, and as a consequence, there is a combinatorial search-space of possible improvements. Thus, minimizing the cumulative propagation delay requires novel scalable and data-driven approaches. We consider the problem of network delay minimization via node upgrades. We show that the problem is NP-hard and prove strong inapproximability results about it (i.e., APX-hard) even for equal vertex delays. On the positive side, probabilistic approximations for a restricted version of the problem can be obtained. We propose a greedy heuristic to solve the general problem setting which has good quality in practice, but does not scale to very large instances. To enable scalability to real-world networks, we develop approximations for Greedy with probabilistic guarantees for every iteration, tailored to different models of delay distribution and network structures. Our methods scale almost linearly with the graph size and consistently outperform competitors in quality. We evaluate our approaches on several real-world graphs from different genres. 
We achieve up to two orders of magnitude speed-up compared to alternatives from the literature on moderate size networks, and obtain high-quality results in minutes on large datasets while competitors from the literature require more than four hours.07/06/2018 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2792470A Two-Phase Algorithm for Differentially Private Frequent Subgraph Mining
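The greedy heuristic described in the abstract above can be illustrated as follows. This is a minimal sketch under assumed conventions (per-node delays, upgrades that zero a node's delay, cumulative pairwise delay as the objective); the function names and delay model are ours, not the paper's:

```python
import heapq

def delays_from(adj, delay, src):
    # Dijkstra variant: entering node v costs delay[v]; the source pays its own delay.
    dist = {src: delay[src]}
    pq = [(delay[src], src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue
        for v in adj[u]:
            nd = d + delay[v]
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def total_delay(adj, delay):
    # Cumulative delay over all ordered source/target pairs.
    return sum(sum(delays_from(adj, delay, u).values()) for u in adj)

def greedy_upgrades(adj, delay, k, upgraded_delay=0.0):
    # In each round, upgrade the node whose improvement shrinks the cumulative
    # delay the most (the greedy heuristic, not the scalable approximations).
    delay = dict(delay)
    chosen = []
    for _ in range(k):
        best = min((u for u in adj if delay[u] > upgraded_delay),
                   key=lambda u: total_delay(adj, {**delay, u: upgraded_delay}))
        delay[best] = upgraded_delay
        chosen.append(best)
    return chosen, delay
```

On a three-node path with a slow middle node, the first upgrade picks the middle node, since it lies on every path; this exhaustive re-evaluation per candidate is exactly what makes the plain greedy approach hard to scale.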
https://www.computer.org/csdl/trans/tk/2018/08/08259370-abs.html
Mining frequent subgraphs from a collection of input graphs is an important task for exploratory data analysis on graph data. However, if the input graphs contain sensitive information, releasing discovered frequent subgraphs may pose considerable threats to individual privacy. In this paper, we study the problem of frequent subgraph mining (FSM) under the rigorous differential privacy model. We present a two-phase differentially private FSM algorithm, which is referred to as <italic>DFG</italic>. In <italic>DFG</italic>, frequent subgraphs are privately identified in the first phase, and the noisy support of each identified frequent subgraph is calculated in the second phase. In particular, to privately identify frequent subgraphs, we propose a frequent subgraph identification approach, which can improve the accuracy of discovered frequent subgraphs through candidate pruning. Moreover, to compute the noisy support of each identified frequent subgraph, we devise a lattice-based noisy support computation approach, which leverages the inclusion relations between the discovered frequent subgraphs to improve the accuracy of the noisy supports. Through formal privacy analysis, we prove that <italic>DFG</italic> satisfies <inline-formula><tex-math notation="LaTeX"> $\epsilon$</tex-math><alternatives><inline-graphic xlink:href="cheng-ieq1-2793862.gif"/></alternatives></inline-formula> -differential privacy. Extensive experimental results on real datasets show that <italic>DFG</italic> can privately find frequent subgraphs while achieving high data utility.07/06/2018 2:05 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2793862TAKer: Fine-Grained Time-Aware Microblog Search with Kernel Density Estimation
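The second phase above adds calibrated noise to each support count. A minimal sketch of the standard Laplace mechanism that such noisy-support computations build on (the paper's lattice-based approach is more refined; the function names here are ours):

```python
import math
import random

def laplace_noise(scale, rng):
    # Inverse-CDF sampling from Laplace(0, scale).
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_support(true_support, epsilon, sensitivity=1.0, rng=None):
    # Standard Laplace mechanism: adding Laplace(sensitivity / epsilon) noise
    # to a count answers that single query with epsilon-differential privacy.
    rng = rng or random.Random()
    return true_support + laplace_noise(sensitivity / epsilon, rng)
```

The noise is zero-mean, so averaging many noisy answers concentrates around the true support; answering many supports from one budget requires splitting epsilon across queries, which is what the lattice structure helps spend more wisely.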
https://www.computer.org/csdl/trans/tk/2018/08/08260854-abs.html
Temporal information has been widely used to improve information retrieval (IR) performance, especially for microblog search, which usually prefers the latest news and events. Previous studies mainly focused on incorporating document-level temporal information into retrieval, while the temporal relevance of each query word was not well investigated. In this paper, we propose a word temporal predictor to characterize word-level temporal relevance by fine-grained time-aware kernel density estimation over the feedback documents. In addition, we present a fine-grained time-aware framework to integrate the proposed word temporal predictor with the traditional document temporal predictor for retrieval. Finally, we incorporate the framework into two state-of-the-art retrieval models, namely the language model (LM) and BM25. The experimental results on the TREC 2011-2014 Microblog collections show that our proposed word temporal predictor effectively boosts retrieval performance within both the LM and BM25 frameworks. In particular, we achieve significant improvements over strong baselines with optimized settings in most cases. Furthermore, our fine-grained time-aware models with the word temporal predictor are comparable to, if not better than, the state-of-the-art temporal retrieval models.07/06/2018 2:05 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2794538Differentially Private Distributed Online Learning
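A word temporal predictor of the kind described can be approximated by a plain Gaussian kernel density estimate over the timestamps of feedback documents containing the word. This is an illustrative sketch, not the paper's fine-grained estimator:

```python
import math

def temporal_score(t, doc_times, bandwidth=1.0):
    # Gaussian kernel density estimate over the timestamps of feedback
    # documents that contain the word; a high density at the query time t
    # marks the word as temporally relevant.
    n = len(doc_times)
    norm = n * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((t - s) / bandwidth) ** 2) for s in doc_times) / norm
```

A word whose supporting documents cluster near the query time scores higher than one whose mentions are far in the past, which is the signal the retrieval framework combines with document-level recency.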
https://www.computer.org/csdl/trans/tk/2018/08/08260919-abs.html
In the big data era, the generation of data presents some new characteristics, including wide distribution, high velocity, high dimensionality, and privacy concerns. To address these challenges for big data analytics, we develop a privacy-preserving distributed online learning framework on the data collected from distributed data sources. Specifically, each node (i.e., data source) has the capacity to learn a model from its local dataset, and exchanges intermediate parameters with a random subset of its neighboring (logically connected) nodes. Hence, the topology of the communications in our distributed computing framework is not fixed in practice. As online learning is typically performed on sensitive data, we introduce the notion of differential privacy (DP) into our distributed online learning algorithm (DOLA) to protect data privacy during learning, which prevents an adversary from inferring any significant sensitive information. Our model is of general value for big data analytics in the distributed setting, because it can provide a rigorous and scalable privacy proof and has much lower computational complexity than classic schemes, e.g., secure multiparty computation (SMC). To tackle high-dimensional incoming data entries, we study a sparse version of the DOLA with novel DP techniques to save computing resources and improve the utility. Furthermore, we present two modified private DOLAs to meet the needs of practical applications: one converts the DOLA to distributed stochastic optimization in an offline setting, while the other uses a mini-batch approach to reduce the amount of perturbation noise and improve the utility. We conduct experiments on real datasets in a configured distributed platform. The numerical results validate the feasibility of our private DOLAs.07/06/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2794384Paradoxical Correlation Pattern Mining
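One common way to make a single online gradient step differentially private is gradient perturbation: clip, add noise, descend. The sketch below is illustrative only; its noise calibration is simplified, and it is not the DOLA algorithm from the abstract:

```python
import math
import random

def l1_clip(grad, bound):
    # Rescale so the L1 norm is at most `bound`, bounding the step's sensitivity.
    norm = sum(abs(g) for g in grad)
    if norm <= bound or norm == 0.0:
        return list(grad)
    return [g * bound / norm for g in grad]

def private_gradient_step(w, grad, lr, epsilon, clip_bound=1.0, rng=None):
    # One online update with gradient perturbation: clip the gradient, add
    # per-coordinate Laplace noise (the difference of two i.i.d. exponentials
    # is Laplace-distributed), then take the descent step.
    rng = rng or random.Random()
    scale = clip_bound / epsilon
    g = l1_clip(grad, clip_bound)
    noisy = [gi + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
             for gi in g]
    return [wi - lr * ni for wi, ni in zip(w, noisy)]
```

With a very large epsilon the noise vanishes and the step reduces to plain clipped gradient descent; smaller epsilon trades accuracy for stronger privacy, the same tension the sparse and mini-batch DOLA variants are designed to ease.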
https://www.computer.org/csdl/trans/tk/2018/08/08263124-abs.html
Given a large transactional database, correlation computing/association analysis aims at efficiently finding strongly correlated items. In traditional association analysis, relationships among variables are usually measured at a global level. In this study, we investigate confounding factors that can help capture abnormal correlation behaviors at a local level. Indeed, many real-world phenomena are localized to specific markets or subpopulations. Such local relationships may not be visible, or may be miscalculated, when collectively analyzing the entire data. In particular, confounding effects that change the direction of correlation are the most severe problem, because global correlations alone lead to erroneous conclusions. To this end, we propose CONFOUND, an efficient algorithm to identify paradoxical correlation patterns (i.e., where controlling for a third item changes the direction of association for strongly correlated pairs) using effective pruning strategies. Moreover, we also provide an enhanced version of this algorithm, called CONFOUND+, which substantially speeds up the confounder search step. Finally, experimental results show that our proposed CONFOUND and CONFOUND+ algorithms can effectively identify confounders, and that their computational performance is orders of magnitude faster than benchmark methods.07/06/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2791602Duplicate Reduction in Graph Mining: Approaches, Analysis, and Evaluation
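The sign-reversal phenomenon CONFOUND searches for is Simpson's paradox. A tiny constructed example, with a hand-rolled Pearson correlation, shows a positive global correlation that is negative within every stratum of a hypothetical confounder (the data here is ours, purely for illustration):

```python
import math

def pearson(xs, ys):
    # Sample Pearson correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Two strata of a hypothetical confounder: within each stratum the
# association is perfectly negative...
x1, y1 = [1, 2, 3], [3, 2, 1]
x2, y2 = [4, 5, 6], [6, 5, 4]
# ...yet pooling the strata flips the sign of the correlation.
global_r = pearson(x1 + x2, y1 + y2)
```

Detecting such reversals for every strongly correlated pair against every candidate third item is what makes the search space combinatorial and pruning essential.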
https://www.computer.org/csdl/trans/tk/2018/08/08263219-abs.html
At the core of graph mining lies <italic>independent</italic> expansion of substructures, where a substructure (also referred to as a subgraph) <italic>independently</italic> grows into a number of larger substructures in each iteration. Such independent expansion invariably leads to the generation of duplicates. In the presence of graph partitions, duplicates are generated both within and across partitions. Eliminating these duplicates (required for correctness) incurs not only generation and storage cost but also additional computation. Our primary aim is to design techniques that reduce the generation of duplicate substructures, as we show that duplicates cannot be eliminated altogether. This paper introduces three constraint-based optimization techniques, each significantly improving the overall mining cost by reducing the number of duplicates generated. These alternatives provide flexibility to choose the right technique based on graph properties. We establish the theoretical correctness of each technique and analyze it with respect to graph characteristics such as degree, number of unique labels, and label distribution. We also investigate the applicability of their combination for further improvements in duplicate reduction. Finally, we discuss the effects of the constraints with respect to the partitioning schemes used in graph mining. Our experiments demonstrate significant benefits of these constraints in terms of storage, computation, and communication cost (specific to partitioned approaches) across graphs with varied characteristics.07/06/2018 2:05 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2795003Capturing the Spatiotemporal Evolution in Road Traffic Networks
https://www.computer.org/csdl/trans/tk/2018/08/08263223-abs.html
Urban road networks undergo frequent traffic congestion during peak hours and around the city center. Capturing the spatiotemporal evolution of the congestion scenario in real time at an urban scale can aid in developing smart traffic management systems and in guiding commuters to make informed decisions about route choice. The congestion scenario is often represented by a set of distinguishable network partitions that have a homogeneous level of congestion inside them but are heterogeneous to others. Due to the dynamic nature of traffic, these partitions evolve with time in terms of their structure and location. In this paper, we propose a comprehensive framework to capture this evolution by incrementally updating the partitions in an efficient manner using a two-layer approach. The physical layer maintains a set of small road network building blocks at a fine granularity and performs low-level computations to incrementally update them, whereas the logical layer performs high-level computations in order to serve as an interface for querying the physical layer about the congested partitions at a coarse granularity. We also propose an in-memory index called <monospace>Bin</monospace> that compactly stores the historical sets of building blocks in main memory with no information loss and facilitates their efficient retrieval. Our experimental results show that the proposed method is much more efficient than existing re-partitioning methods without a significant sacrifice in accuracy. The proposed <monospace>Bin</monospace> consumes minimal space with the least redundancy at different time stamps.07/06/2018 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2795001Health Monitoring on Social Media over Time
https://www.computer.org/csdl/trans/tk/2018/08/08263414-abs.html
Social media has become a major source for analyzing all aspects of daily life. Thanks to dedicated latent topic analysis methods such as the Ailment Topic Aspect Model (ATAM), public health can now be observed on Twitter. In this work, we are interested in using social media to monitor people’s health over time. The use of tweets has several benefits including instantaneous data availability at virtually no cost. Early monitoring of health data is complementary to post-factum studies and enables a range of applications such as measuring behavioral risk factors and triggering health campaigns. We formulate two problems: <italic>health transition detection</italic> and <italic> health transition prediction.</italic> We first propose the Temporal Ailment Topic Aspect Model (TM–ATAM), a new latent model dedicated to solving the first problem by capturing transitions that involve health-related topics. TM–ATAM is a non-obvious extension to ATAM that was designed to extract health-related topics. It learns health-related topic transitions by <italic>minimizing the prediction error on topic distributions between consecutive posts at different time and geographic granularities.</italic> To solve the second problem, we develop T–ATAM, a Temporal Ailment Topic Aspect Model where time is treated as a random variable <italic>natively</italic> inside ATAM. Our experiments on an 8-month corpus of tweets show that TM–ATAM outperforms TM–LDA in estimating health-related transitions from tweets for different geographic populations. We examine the ability of TM–ATAM to detect transitions due to climate conditions in different geographic regions. We then show how T–ATAM can be used to predict the most important transition and additionally compare T–ATAM with CDC (Center for Disease Control) data and Google Flu Trends.07/06/2018 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2795606SLADE: A Smart Large-Scale Task Decomposer in Crowdsourcing
https://www.computer.org/csdl/trans/tk/2018/08/08268652-abs.html
Crowdsourcing has been shown to be effective in a wide range of applications and is seeing increasing use. A large-scale crowdsourcing task often consists of thousands or millions of atomic tasks, each of which is usually a simple task such as a binary choice or simple voting. To distribute a large-scale crowdsourcing task to a limited pool of crowd workers, a common practice is to pack a set of atomic tasks into a task bin and send it to a crowd worker in a batch. The challenge is to decompose a large-scale crowdsourcing task and execute batches of atomic tasks in a way that ensures reliable answers at a minimal total cost: large batches lead to unreliable answers for atomic tasks, while small batches incur unnecessary cost. In this paper, we investigate a general crowdsourcing task decomposition problem, called the <underline>S</underline>mart <underline>L</underline>arge-sc<underline>A</underline>le task <underline>DE </underline>composer (SLADE) problem, which aims to decompose a large-scale crowdsourcing task to achieve the desired reliability at a minimal cost. We prove the NP-hardness of the SLADE problem and propose solutions in both <italic>homogeneous</italic> and <italic>heterogeneous</italic> scenarios. For the <italic>homogeneous</italic> SLADE problem, where all the atomic tasks share the same reliability requirement, we propose a greedy heuristic algorithm and an efficient and effective approximation framework using an optimal priority queue (OPQ) structure with a provable approximation ratio. For the <italic>heterogeneous</italic> SLADE problem, where the atomic tasks can have different reliability requirements, we extend the OPQ-based framework leveraging a partition strategy, and also prove its approximation guarantee.
Finally, we verify the effectiveness and efficiency of the proposed solutions through extensive experiments on representative crowdsourcing platforms.07/06/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2797962In Search of Indoor Dense Regions: An Approach Using Indoor Positioning Data
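The reliability/cost tension described above can be illustrated with majority voting: replicating an atomic task across more workers raises reliability but also cost. The worker-accuracy model below is an assumption for illustration only, not SLADE's formulation:

```python
from math import comb

def majority_reliability(p, r):
    # Probability that a majority of r independent workers (r odd), each
    # correct with probability p, yields the right answer.
    return sum(comb(r, i) * p**i * (1 - p)**(r - i)
               for i in range(r // 2 + 1, r + 1))

def min_replication(p, target, max_r=199):
    # Smallest odd replication count meeting the reliability target;
    # replication drives the cost side of the reliability/cost trade-off.
    for r in range(1, max_r + 1, 2):
        if majority_reliability(p, r) >= target:
            return r
    return None
```

For workers that are 70 percent accurate, nine votes are needed to reach 90 percent reliability, while near-perfect workers need only one; heterogeneous reliability requirements make this per-task calculation the input to the decomposition.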
https://www.computer.org/csdl/trans/tk/2018/08/08274887-abs.html
As people spend significant parts of daily lives indoors, it is useful and important to measure indoor densities and find the dense regions in many indoor scenarios like space management and security control. In this paper, we propose a data-driven approach that finds top-<inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="lu-ieq1-2799215.gif"/></alternatives></inline-formula> indoor dense regions by using indoor positioning data. Such data is obtained by indoor positioning systems working at a relatively low frequency, and the reported locations in the data are discrete, from a preselected location set that does not continuously cover the entire indoor space. When a search is triggered, the object positioning information is already out-of-date and thus object locations are uncertain. To this end, we first integrate object location uncertainty into the definitions for counting objects in an indoor region and computing its density. Subsequently, we conduct a thorough analysis of the location uncertainty in the context of complex indoor topology, deriving upper and lower bounds of indoor region densities and introducing distance decaying effect into computing concrete indoor densities. Enabled by the uncertainty analysis outcomes, we design efficient search algorithms for solving the problem. Finally, we conduct extensive experimental studies on our proposals using synthetic and real data. The experimental results verify that the proposed search approach is efficient, scalable, and effective. 
The top-<inline-formula><tex-math notation="LaTeX"> $k$</tex-math><alternatives><inline-graphic xlink:href="lu-ieq2-2799215.gif"/></alternatives></inline-formula> indoor dense regions returned by our search are highly consistent with the ground truth, even though the search uses neither historical data nor extra knowledge about the objects.07/06/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2799215Link Weight Prediction Using Supervised Learning Methods and Its Application to Yelp Layered Network
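The core of the uncertain-count definitions can be sketched as follows: each object contributes its probability of lying in the region to the expected count, with certain and possible membership giving lower and upper bounds. This is a simplified reading of the idea, not the paper's exact derivation:

```python
def expected_count(inside_probs):
    # With uncertain locations, a region's expected object count is the sum,
    # over objects, of the probability of lying inside the region.
    return sum(inside_probs)

def count_bounds(inside_probs):
    # Lower bound: objects certainly inside (p == 1);
    # upper bound: objects that could possibly be inside (p > 0).
    lower = sum(1 for p in inside_probs if p >= 1.0)
    upper = sum(1 for p in inside_probs if p > 0.0)
    return lower, upper
```

Such bounds let a top-k search prune a region early: if its upper bound is below the k-th best lower bound seen so far, its exact density never needs computing.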
https://www.computer.org/csdl/trans/tk/2018/08/08281007-abs.html
Real-world networks feature weighted interactions, where link weights often represent physical attributes. In many situations, to recover missing data or predict network evolution, we need to predict link weights in a network. In this paper, we first propose a series of new centrality indices for links in a line graph. Then, utilizing these line graph indices, as well as a number of original graph indices, we design three supervised learning methods to realize link weight prediction in both single-layer and multi-layer networks; these methods perform much better than several recently proposed baseline methods. We find that the resource allocation index (RA) plays a more important role in weight prediction than other topological properties, and that the line graph indices are at least as important as the original graph indices in link weight prediction. In particular, the successful application of our methods to the Yelp layered network suggests that we can indeed predict the offline co-foraging behaviors of users based solely on their online social interactions, which may open a new direction for link weight prediction algorithms and, meanwhile, provide insights for designing better restaurant recommendation systems.07/06/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2801854Road Traffic Speed Prediction: A Probabilistic Model Fusing Multi-Source Data
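A line-graph link centrality of the kind the abstract proposes can be sketched by building the line graph explicitly and computing degree centrality on it. The construction below is the textbook line graph; the paper's specific indices may differ:

```python
from itertools import combinations

def line_graph(edges):
    # Line-graph nodes are the original links; two are adjacent exactly
    # when the corresponding links share an endpoint.
    adj = {e: set() for e in edges}
    for e1, e2 in combinations(edges, 2):
        if set(e1) & set(e2):
            adj[e1].add(e2)
            adj[e2].add(e1)
    return adj

def link_degree_centrality(edges):
    # Degree centrality of each link, computed in the line graph.
    adj = line_graph(edges)
    n = len(edges)
    return {e: len(nbrs) / (n - 1) for e, nbrs in adj.items()}
```

Any node-level index (degree, closeness, RA-style scores) can be transplanted to links this way, which is what makes line-graph features a natural companion to the original-graph features in a supervised weight predictor.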
https://www.computer.org/csdl/trans/tk/2018/07/07955005-abs.html
Road traffic speed prediction is a challenging problem in intelligent transportation systems (ITS) and has gained increasing attention. Existing works are mainly based on raw speed sensing data obtained from infrastructure sensors or probe vehicles, which, however, are limited by the high cost of sensor deployment and maintenance. With sparse speed observations, traditional methods based only on speed sensing data are insufficient, especially when emergencies like traffic accidents occur. To address this issue, this paper aims to improve road traffic speed prediction by fusing traditional speed sensing data with new types of “sensing” data from cross-domain sources, such as tweet sensors from social media and trajectory sensors from map and traffic service platforms. Jointly modeling information from different datasets brings many challenges, including location uncertainty of low-resolution data, language ambiguity of traffic descriptions in texts, and heterogeneity of cross-domain data. In response to these challenges, we present a unified probabilistic framework, called the Topic-Enhanced Gaussian Process Aggregation Model (TEGPAM), consisting of three components, i.e., a location disaggregation model, a traffic topic model, and a traffic speed Gaussian Process model, which integrates the new types of data with traditional data. Experiments on real-world data from two large cities validate the effectiveness and efficiency of our model.06/05/2018 4:49 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2718525DPPred: An Effective Prediction Framework with Concise Discriminative Patterns
https://www.computer.org/csdl/trans/tk/2018/07/08052529-abs.html
In the literature, two series of models have been proposed to address prediction problems, including classification and regression. Simple models, such as generalized linear models, have ordinary performance but strong interpretability on a set of simple features. The other series, including tree-based models, organizes numerical, categorical, and high-dimensional features into a comprehensive structure with rich interpretable information in the data. In this paper, we propose a novel Discriminative Pattern-based Prediction framework (<inline-formula> <tex-math notation="LaTeX">$\sf {DPPred}$</tex-math><alternatives><inline-graphic xlink:href="shang-ieq1-2757476.gif"/> </alternatives></inline-formula>) to accomplish prediction tasks while combining the advantages of both: effectiveness and interpretability. Specifically, <inline-formula><tex-math notation="LaTeX">$\sf {DPPred}$</tex-math><alternatives> <inline-graphic xlink:href="shang-ieq2-2757476.gif"/></alternatives></inline-formula> adopts the concise discriminative patterns that lie on the prefix paths from the root to the leaf nodes in the tree-based models. <inline-formula> <tex-math notation="LaTeX">$\sf {DPPred}$</tex-math><alternatives><inline-graphic xlink:href="shang-ieq3-2757476.gif"/> </alternatives></inline-formula> selects a limited number of useful discriminative patterns by searching for the most effective pattern combination to fit generalized linear models. Extensive experiments show that in many scenarios, <inline-formula><tex-math notation="LaTeX">$\sf {DPPred}$</tex-math><alternatives> <inline-graphic xlink:href="shang-ieq4-2757476.gif"/></alternatives></inline-formula> provides accuracy competitive with the state-of-the-art as well as valuable interpretability for developers and experts.
In particular, taking a clinical application dataset as a case study, our <inline-formula><tex-math notation="LaTeX">$\sf {DPPred}$</tex-math> <alternatives><inline-graphic xlink:href="shang-ieq5-2757476.gif"/></alternatives></inline-formula> outperforms the baselines by using only 40 concise discriminative patterns out of a potentially exponentially large set of patterns.06/05/2018 4:49 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2757476Efficient Parameter Estimation for Information Retrieval Using Black-Box Optimization
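Extracting patterns from root-to-leaf prefix paths can be sketched as follows. The tuple encoding of the tree is ours, and the pattern-selection and GLM-fitting stages are omitted:

```python
def prefix_patterns(tree, path=()):
    # Enumerate the condition patterns along root-to-leaf prefix paths of a
    # decision tree given as nested (feature, threshold, left, right) tuples,
    # with None marking leaves. Every non-empty prefix is a candidate pattern.
    if tree is None:
        return [path] if path else []
    feat, thr, left, right = tree
    patterns = [path] if path else []
    patterns += prefix_patterns(left, path + ((feat, "<=", thr),))
    patterns += prefix_patterns(right, path + ((feat, ">", thr),))
    return patterns
```

Each pattern is a conjunction of threshold tests; turning every pattern into a binary feature and fitting a sparse linear model over them is the general shape of the pipeline the abstract describes.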
https://www.computer.org/csdl/trans/tk/2018/07/08063964-abs.html
The retrieval function is one of the most important components of an Information Retrieval (IR) system, because it determines to what extent some information is relevant to a user query. Most retrieval functions have “free parameters” whose value must be set before retrieval, significantly affecting the effectiveness of an IR system. Choosing the optimum values for such parameters is therefore of paramount importance. However, the optimum can only be found after a computationally expensive process, especially when the generalization error is estimated via cross-validation. In this paper, we propose to determine free parameter values by solving an optimization problem aimed at maximizing a measure of retrieval effectiveness. We employ the black-box optimization paradigm, since the analytical expression of the measure of effectiveness with respect to the free parameters is unknown. We consider different methods for solving the black-box optimization problem: a simple grid-search over the whole domain, and more sophisticated techniques such as line search and surrogate model based algorithms. Experimental results on several test collections not only provide useful insight about effectiveness, but also about efficiency: they indicate that with appropriate optimization techniques, the computational cost of parameter optimization can be greatly reduced without compromising retrieval effectiveness, even when taking generalization into account.06/05/2018 4:49 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2761749Workload Management in Database Management Systems: A Taxonomy
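The simplest of the black-box strategies mentioned, grid search, can be sketched as below. The effectiveness function in the usage example is a synthetic stand-in for a cross-validated retrieval measure over, e.g., BM25's free parameters:

```python
import itertools

def grid_search(effectiveness, grid):
    # Black-box optimization by exhaustive grid search: the analytical form
    # of the effectiveness measure is unknown, so we can only evaluate it
    # point by point over a Cartesian grid of parameter values.
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = effectiveness(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

The cost grows multiplicatively with the number of parameters and grid resolution, and each evaluation may itself be an expensive cross-validated retrieval run, which is exactly the motivation for the cheaper line-search and surrogate-model alternatives.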
https://www.computer.org/csdl/trans/tk/2018/07/08086184-abs.html
Workload management is the discipline of effectively monitoring, managing, and controlling workflow across computing systems. In particular, workload management in database management systems (DBMSs) is the process or act of monitoring and controlling the work (i.e., requests) executing on a database system in order to make efficient use of system resources, in addition to achieving any performance objectives assigned to that work. In the past decade, workload management research and practice have made considerable progress in both academia and industry. New techniques have been proposed by researchers, and new workload management facilities have been implemented in most commercial database products. In this paper, we provide a systematic study of workload management in today's DBMSs by developing a taxonomy of workload management techniques. We apply the taxonomy to evaluate and classify existing workload management techniques implemented in commercial databases and available in the recent research literature. We also introduce the underlying principles of today's workload management technology for DBMSs, discuss open problems, and outline research opportunities in this area.06/05/2018 4:49 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2767044Second-Order Online Active Learning and Its Applications
https://www.computer.org/csdl/trans/tk/2018/07/08122067-abs.html
The goal of online active learning is to learn predictive models from a sequence of unlabeled data given a limited label query budget. Unlike conventional online learning tasks, online active learning is considerably more challenging for two reasons. First, it is difficult to design an effective query strategy to decide when it is appropriate to query the label of an incoming instance given the limited query budget. Second, it is also challenging to decide how to update the predictive models effectively whenever the true label of an instance is queried. Most existing approaches for online active learning are based on a family of first-order online learning algorithms, which are simple and efficient but suffer from slow convergence and sub-optimal solutions in exploiting the labeled training data. To address these issues, this paper presents a novel framework of Second-order Online Active Learning (SOAL) that fully exploits both first-order and second-order information. The proposed algorithms achieve effective online learning, maximize predictive accuracy, and minimize labeling cost. To make SOAL more practical for real-world applications, especially for class-imbalanced online classification tasks (e.g., malicious web detection), we extend the SOAL framework by proposing the Cost-sensitive Second-order Online Active Learning algorithm named “SOAL<inline-formula><tex-math notation="LaTeX">$_{CS}$</tex-math><alternatives> <inline-graphic xlink:href="hoi-ieq1-2778097.gif"/></alternatives></inline-formula>”, which is devised by maximizing the sum of weighted sensitivity and specificity, or by minimizing the cost of weighted mistakes of different classes.
We conducted both theoretical analysis and empirical studies, including an extensive set of experiments on a variety of large-scale real-world datasets, whose promising results validate the efficacy and scalability of the proposed algorithms on large-scale online learning tasks.
06/05/2018 4:49 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2778097
Sampling and Reconstruction Using Bloom Filters
https://www.computer.org/csdl/trans/tk/2018/07/08233172-abs.html
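As background for the abstract that follows: a Bloom filter supports insertion and approximate membership queries but no direct enumeration, so reconstructing the stored set requires probing candidate elements, and any reconstruction can only return a superset of the true set. A minimal sketch with illustrative parameters — this is the naive full-scan baseline, not the paper's BloomSampleTree or HashInvert methods:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter; m (bits) and k (hash functions) are illustrative."""
    def __init__(self, m=256, k=3):
        self.m, self.k, self.bits = m, k, [False] * m

    def _positions(self, item):
        # Derive k bit positions by salting a cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all(self.bits[p] for p in self._positions(item))

def reconstruct(bf, universe):
    """Brute-force reconstruction by probing every candidate element.
    Returns a superset of the stored set (false positives possible);
    hierarchical or hash-inverting schemes exist to avoid this full scan."""
    return {x for x in universe if x in bf}
```

The full scan over the universe is exactly the cost that a hierarchical index over candidate ranges is designed to prune.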
In this paper, we address the problem of sampling from, and reconstructing, a set stored as a Bloom filter. To the best of our knowledge, our work is the first to address this question. We introduce a novel hierarchical data structure called <inline-formula><tex-math notation="LaTeX">$\mathsf{BloomSampleTree}$</tex-math><alternatives> <inline-graphic xlink:href="sengupta-ieq1-2785803.gif"/></alternatives></inline-formula> that lets us design efficient algorithms both to extract an almost uniform sample from the set stored in a Bloom filter and to reconstruct the set efficiently. When the hash functions used in the Bloom filter implementation are partially invertible, in the sense that it is easy to compute the set of elements that map to a particular hash value, we propose a second, more space-efficient reconstruction method called HashInvert. We study the properties of these two methods both analytically and experimentally. We provide bounds on the run times of both methods and on the sample quality of the <inline-formula><tex-math notation="LaTeX">$\mathsf{BloomSampleTree}$</tex-math><alternatives> <inline-graphic xlink:href="sengupta-ieq2-2785803.gif"/></alternatives></inline-formula>-based algorithm, and show through an extensive experimental evaluation that our methods are efficient and effective.
06/05/2018 4:49 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2785803
Learning Multiple Factors-Aware Diffusion Models in Social Networks
https://www.computer.org/csdl/trans/tk/2018/07/08234673-abs.html
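As background for the abstract that follows: classical diffusion models fix the adoption factors up front. A well-known example is the independent cascade model, sketched below, where a single activation probability plays the role of the fixed factor. This is an illustrative baseline, not the paper's multi-factor model:

```python
import random

def independent_cascade(graph, seeds, p=0.1, rng=None):
    """Classic independent-cascade diffusion (a fixed-factor baseline):
    each newly activated node gets one chance to activate each inactive
    neighbor, succeeding independently with probability p."""
    rng = rng or random.Random(0)
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt          # only fresh activations spread next round
    return active               # final set of activated nodes
```

Running many such simulations from a seed set gives a Monte Carlo spread estimate; factor-aware models generalize the single constant `p` to edge- and context-dependent activation probabilities.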
Information diffusion is a natural phenomenon in social networks. The adoption behavior of a node toward a piece of information can be affected by different factors, e.g., freshness and hotness. Most previously proposed diffusion models consider one or several fixed factors. In practice, however, the factors affecting a node's adoption decision differ from node to node and may not have been seen before. For a diffusion scenario with new factors, previous diffusion models may not model the diffusion well, or may not be applicable at all. Moreover, uncertainty of information exposure intrinsically exists between two connected nodes, which makes modeling diffusion in social networks even more challenging. In this work, we aim to design a diffusion model in which the factors considered can be flexibly extended and changed, and the uncertainty of information exposure is explicitly tackled. With different factors, our diffusion model can therefore be adapted to more diffusion scenarios without requiring modification of the learning framework. We conduct comprehensive experiments showing that our diffusion model is effective on two important information diffusion tasks, namely activation prediction and spread estimation.
06/05/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2786209
A Comprehensive Study on Social Network Mental Disorders Detection via Online Social Media Mining
https://www.computer.org/csdl/trans/tk/2018/07/08239661-abs.html
The explosive growth in the popularity of social networking has led to problematic usage. An increasing number of social network mental disorders (SNMDs), such as Cyber-Relationship Addiction, Information Overload, and Net Compulsion, have recently been noted. Symptoms of these mental disorders are usually observed passively today, resulting in delayed clinical intervention. In this paper, we argue that mining online social behavior provides an opportunity to actively identify SNMDs at an early stage. Detecting SNMDs is challenging because mental status cannot be directly observed from online social activity logs. Our approach, new and innovative to the practice of SNMD detection, does not rely on self-reporting of these mental factors via psychological questionnaires. Instead, we propose a machine learning framework, namely <italic>Social Network Mental Disorder Detection (SNMDD)</italic>, that exploits features extracted from social network data to accurately identify potential cases of SNMDs. We also exploit multi-source learning in SNMDD and propose a new SNMD-based Tensor Model (STM) to improve accuracy. To increase the scalability of STM, we further improve its efficiency with a performance guarantee. Our framework is evaluated via a user study with 3,126 online social network users. We conduct a feature analysis, apply SNMDD on large-scale datasets, and analyze the characteristics of the three SNMD types. The results show that SNMDD is promising for identifying online social network users with potential SNMDs.
06/05/2018 4:49 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2786695