IEEE Transactions on Knowledge & Data Engineering
https://www.computer.org/csdl/trans/tk/index.html
The IEEE Transactions on Knowledge and Data Engineering is an archival journal published monthly. The information published in this Transactions is designed to inform researchers, developers, managers, strategic planners, users, and others interested in state-of-the-art and state-of-the-practice activities in the knowledge and data engineering area. We are interested in well-defined theoretical results and empirical studies that have potential impact on the acquisition, management, storage, and graceful degradation of knowledge and data, as well as in the provision of knowledge and data services. Specific topics include, but are not limited to: a) artificial intelligence techniques, including speech, voice, graphics, images, and documents; b) knowledge and data engineering tools and techniques; c) parallel and distributed processing; d) real-time distributed systems; e) system architectures, integration, and modeling; f) database design, modeling, and management; g) query design and implementation languages; h) distributed database control; j) algorithms for data and knowledge management; k) performance evaluation of algorithms and systems; l) data communications aspects; m) system applications and experience; n) knowledge-based and expert systems; and o) integrity, security, and fault tolerance.

List of 100 recently published journal articles.
https://www.computer.org/csdl
Nonintrusive Smartphone User Verification Using Anonymized Multimodal Data
https://www.computer.org/csdl/trans/tk/2019/03/08341498-abs.html
Smartphone user verification is important as personal daily activities are increasingly conducted on the phone and sensitive information is constantly logged. The commonly adopted user verification methods are typically active, i.e., they require a user's cooperative input of a security token to gain access permission. Though popular, these methods impose a heavy burden on smartphone users, who must memorize, maintain, and input the token at high frequency. To alleviate this imposition on users and to provide additional security, we propose a new nonintrusive and continuous mobile user verification framework that can reduce the frequency at which a user must input his/her security token. Using tailored Hidden Markov Models and a sequential likelihood ratio test, our verification is built on low-cost, readily available, anonymized, and multimodal smartphone data, without additional data collection effort or risk of privacy leakage. In extensive evaluation, we achieve a rate of about 94 percent for detecting illegitimate smartphone uses and a rate of 74 percent for confirming legitimate uses. In a practical setting, this translates into a 74 percent reduction in the frequency of inputting a security token under an active authentication method, with only about a 6 percent risk of missing a random intruder, which is highly desirable.
02/05/2019 2:05 pm PST | DOI: http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2828309

Detecting Pickpocket Suspects from Large-Scale Public Transit Records
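The sequential likelihood ratio test at the core of the smartphone verification abstract above can be sketched generically. This is Wald's SPRT on a hypothetical stream of binary behavior-match features; the Bernoulli parameters, thresholds, and observation stream are illustrative, not the paper's actual models:

```python
import math

def sprt(observations, p_legit, p_intruder, alpha=0.05, beta=0.05):
    """Wald's sequential probability ratio test over a stream of binary
    behavioral features (1 = observation matches the owner's profile).

    H0: legitimate user (match probability p_legit)
    H1: intruder        (match probability p_intruder)
    Returns (decision, number of samples consumed).
    """
    upper = math.log((1 - beta) / alpha)   # cross above -> accept "intruder"
    lower = math.log(beta / (1 - alpha))   # cross below -> accept "legitimate"
    llr = 0.0
    for n, x in enumerate(observations, 1):
        p1 = p_intruder if x else 1 - p_intruder
        p0 = p_legit if x else 1 - p_legit
        llr += math.log(p1 / p0)           # accumulate log-likelihood ratio
        if llr >= upper:
            return "intruder", n
        if llr <= lower:
            return "legitimate", n
    return "undecided", len(observations)

# Behavior that consistently matches the owner's profile confirms quickly:
decision, used = sprt([1, 1, 1, 1, 1, 1, 1], p_legit=0.9, p_intruder=0.3)
```

The appeal of the sequential test is visible here: a decision is typically reached after only a few observations, which is what allows verification to run continuously in the background.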
https://www.computer.org/csdl/trans/tk/2019/03/08357468-abs.html
Massive data collected by automated fare collection (AFC) systems provide opportunities for studying both personal traveling behaviors and collective mobility patterns in urban areas. Existing studies on AFC data have primarily focused on identifying passengers' movement patterns. However, we creatively leveraged such data for identifying pickpocket suspects. Stopping pickpockets in the public transit system is crucial for improving passenger satisfaction and public safety. Nonetheless, in practice, it is challenging to discern thieves from regular passengers. In this paper, we developed a suspect detection and surveillance system, which can identify pickpocket suspects based on their daily transit records. Specifically, we first extracted a number of useful features from each passenger's daily activities in the transit system. Then, we took a two-step approach that exploits the strengths of unsupervised outlier detection and supervised classification models to identify thieves, who typically exhibit abnormal traveling behaviors. Experimental results demonstrated the effectiveness of our method. We also developed a prototype system for potential use by security personnel.
02/07/2019 6:06 pm PST | DOI: http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2834909

C2Net: A Network-Efficient Approach to Collision Counting LSH Similarity Join
https://www.computer.org/csdl/trans/tk/2019/03/08359017-abs.html
Similarity join of two datasets <inline-formula><tex-math notation="LaTeX">$P$</tex-math><alternatives><inline-graphic xlink:href="yu-ieq1-2836464.gif"/></alternatives></inline-formula> and <inline-formula><tex-math notation="LaTeX">$Q$</tex-math><alternatives><inline-graphic xlink:href="yu-ieq2-2836464.gif"/></alternatives></inline-formula> is a primitive operation that is useful in many application domains. The operation involves identifying pairs <inline-formula><tex-math notation="LaTeX">$(p,q)$</tex-math><alternatives><inline-graphic xlink:href="yu-ieq3-2836464.gif"/></alternatives></inline-formula> in the Cartesian product of <inline-formula><tex-math notation="LaTeX">$P$</tex-math><alternatives><inline-graphic xlink:href="yu-ieq4-2836464.gif"/></alternatives></inline-formula> and <inline-formula><tex-math notation="LaTeX">$Q$</tex-math><alternatives><inline-graphic xlink:href="yu-ieq5-2836464.gif"/></alternatives></inline-formula> such that <inline-formula><tex-math notation="LaTeX">$(p,q)$</tex-math><alternatives><inline-graphic xlink:href="yu-ieq6-2836464.gif"/></alternatives></inline-formula> satisfies a stipulated similarity condition. In a high-dimensional space, an approximate similarity join based on locality-sensitive hashing (LSH) provides a good solution, reducing the processing cost with a predictable loss of accuracy. A distributed processing framework such as MapReduce allows the handling of large and high-dimensional datasets. However, network cost frequently turns into a bottleneck in a distributed processing environment, making it challenging to achieve fast and efficient similarity join. This paper focuses on collision counting LSH-based similarity join in MapReduce and proposes a network-efficient solution called C2Net that improves the utilization of MapReduce combiners. The solution uses two graph partitioning schemes: (i) <italic>minimum spanning tree</italic> for organizing LSH bucket replication; and (ii) <italic>spectral clustering</italic> for runtime collision counting task scheduling. Experiments have shown that, in comparison to the state of the art, the proposed solution achieves a 20 percent data reduction and a 50 percent reduction in shuffle time.
02/05/2019 2:06 pm PST | DOI: http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2836464

Representing Urban Forms: A Collective Learning Model with Heterogeneous Human Mobility Data
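The collision-counting LSH primitive that C2Net distributes can be sketched locally, without the MapReduce machinery. This uses sign-random-projection hashes; the dimensionality, number of hash functions, and toy vectors are illustrative, not the paper's parameters:

```python
import random

def signature(vec, planes):
    """Sign-random-projection LSH: one bit per random hyperplane."""
    return [1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
            for plane in planes]

def collisions(sig_a, sig_b):
    """Number of hash functions on which the two points collide."""
    return sum(a == b for a, b in zip(sig_a, sig_b))

rng = random.Random(42)
dim, n_hashes = 8, 200
planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_hashes)]

p = [1.0] * 8
q = [1.0] * 7 + [-1.0]    # small angle to p
r = [-1.0] * 8            # opposite direction to p

# For sign projections, E[collisions] = n_hashes * (1 - angle(a, b) / pi):
# close pairs collide in most hash functions, so counting collisions
# approximates angular similarity.
c_pq = collisions(signature(p, planes), signature(q, planes))
c_pr = collisions(signature(p, planes), signature(r, planes))
```

A similarity-join candidate pair is then one whose collision count clears a threshold; the hard part, which the paper addresses, is doing this counting with minimal network shuffle.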
https://www.computer.org/csdl/trans/tk/2019/03/08359195-abs.html
Human mobility data refers to records of human movements, such as cellphone traces, vehicle GPS trajectories, geo-tagged posts, and photos. While successfully mining human mobility data can benefit many applications such as city planning, transportation, urban economics, and public safety, it is very challenging to model large-scale Heterogeneous Human Mobility Data (HHMD) generated from different sources. In this paper, we develop a general collective learning approach to model HHMD at an individual level towards identifying and quantifying the urban forms of residential communities. Specifically, our proposed method exploits two geographic regularities among HHMD. First, we jointly capture the correlations among residential communities, urban functions, temporal effects, and user mobility patterns by analogizing communities as documents and mobility patterns as words. Second, we further combine explicit LASSO analysis and significance testing into latent representation learning as a regularization term by analogizing compatible Points-of-Interest (POIs) as the meta-data of communities. In this way, we can learn the urban forms of residential communities, including a mix of functions and corresponding portfolios, from HHMD and POIs. We further leverage these learned results to address two application problems: real estate ranking and restaurant popularity prediction. Finally, we conduct intensive evaluations with a variety of real-world data, where experimental results demonstrate the effectiveness of our proposed modeling method and its successful applications to other problems.
02/05/2019 2:05 pm PST | DOI: http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2837027

Towards Confidence Interval Estimation in Truth Discovery
https://www.computer.org/csdl/trans/tk/2019/03/08359426-abs.html
The demand for automatic extraction of true information (i.e., truths) from conflicting multi-source data has soared recently. A variety of <italic>truth discovery</italic> methods have achieved great success by jointly estimating source reliability and truths. All existing truth discovery methods focus on providing a point estimator for each object's truth, but in many real-world applications, confidence interval estimation of truths is more desirable, since a confidence interval contains richer information. To address this challenge, in this paper, we propose a novel truth discovery method (<italic>ETCIBoot</italic>) that constructs confidence interval estimates as well as identifies truths, where bootstrapping techniques are nicely integrated into the truth discovery procedure. Due to the properties of bootstrapping, the estimators obtained by <italic>ETCIBoot</italic> are more accurate and robust compared with the state-of-the-art truth discovery approaches. The proposed framework is further adapted to deal with large-scale truth discovery tasks in a distributed paradigm. Theoretically, we prove the asymptotic consistency of the confidence intervals obtained by <italic>ETCIBoot</italic>. Experimentally, we demonstrate that <italic>ETCIBoot</italic> is not only effective in constructing confidence intervals but also able to obtain better truth estimates.
02/05/2019 2:05 pm PST | DOI: http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2837026

Stacked Robust Adaptively Regularized Auto-Regressions for Domain Adaptation
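The bootstrap idea that ETCIBoot integrates into truth discovery can be illustrated in isolation with a plain percentile-bootstrap confidence interval over toy source claims; this is a generic sketch, not the paper's actual procedure:

```python
import random

def bootstrap_mean_ci(values, n_boot=2000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        # Resample the data with replacement and record the resample mean.
        resample = [rng.choice(values) for _ in values]
        means.append(sum(resample) / len(resample))
    means.sort()
    lo_idx = int((1.0 - level) / 2.0 * n_boot)
    hi_idx = int((1.0 + level) / 2.0 * n_boot) - 1
    return means[lo_idx], means[hi_idx]

# Conflicting claims about one object's value reported by eight sources:
claims = [9.8, 10.1, 10.0, 9.9, 10.3, 10.0, 9.7, 10.2]
lo, hi = bootstrap_mean_ci(claims)
```

The interval (lo, hi) conveys how much the conflicting sources actually constrain the truth, which is the "richer information" a point estimate cannot provide.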
https://www.computer.org/csdl/trans/tk/2019/03/08360058-abs.html
Domain adaptation is the supervised learning setting in which the training data are sampled from the source domain while the test data are sampled from a target domain that follows a different distribution. The key to solving such a problem is to reduce the effects of the discrepancy between the training and test data. Recently, deep learning methods that employ stacked denoising auto-encoders (SDAs) to learn new representations for both domains have been successfully applied in domain adaptation, and remarkable performance on multi-domain sentiment analysis datasets has been reported, making deep learning a promising approach to domain adaptation problems. In this paper, a deep learning method called Stacked Robust Adaptively Regularized Auto-regressions (SRARAs) is proposed to learn useful representations for domain adaptation problems. Each layer of SRARAs contains two steps: a linear transformation step, which is based on robust adaptively regularized auto-regression, and a non-linear squashing transformation step. The first step aims at reducing the discrepancy between the training and test data, and the second step introduces non-linearity and controls the range of the elements in the outputs. The experimental results on text and image datasets demonstrate that the proposed method is very effective.
02/05/2019 2:05 pm PST | DOI: http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2837085

Progressive Approaches for Pareto Optimal Groups Computation
https://www.computer.org/csdl/trans/tk/2019/03/08360085-abs.html
The group skyline query is a powerful tool for optimal group analysis. Most existing group skyline queries select optimal groups by comparing the dominance relationships between aggregate-based points; this feature makes it difficult for users to specify an appropriate aggregate function. Besides, many significant groups that are highly attractive to users in practice may be overlooked. To address these issues, the group skyline (GSky) query is formulated on the basis of a general definition of the group dominance operator. While the existing GSky query algorithms are effective, there is still room for improvement in terms of progressiveness and efficiency. In this paper, we propose new lemmas that facilitate direct generation of the GSky query results. Building on them, we design a layered unit-based (LU) algorithm that applies a layered optimum strategy. Additionally, for the GSky query over data that are dynamically produced and cannot be indexed, we propose a novel index-independent algorithm, called the sorted-based progressive (SP) algorithm. The experimental results demonstrate the effectiveness, efficiency, and progressiveness of the proposed algorithms. Compared with the state-of-the-art algorithm for the GSky query, our LU algorithm is more scalable and two orders of magnitude faster.
02/05/2019 2:05 pm PST | DOI: http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2837117

Robust Image Hashing with Tensor Decomposition
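The dominance test underlying any skyline computation, including the group-level operator above, reduces to a simple Pareto check. A minimal point-skyline sketch with a smaller-is-better convention and a naive O(n^2) scan (the group generalization is the paper's contribution and is not reproduced):

```python
def dominates(p, q):
    """p dominates q (minimization): p is no worse in every dimension
    and strictly better in at least one."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def skyline(points):
    """Naive skyline: keep every point not dominated by any other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Toy 2-D data, e.g., (price, distance): lower is better in both dimensions.
pts = [(1, 9), (2, 8), (4, 4), (9, 1), (5, 6), (3, 9)]
sky = skyline(pts)
```

Here (5, 6) is dominated by (4, 4) and (3, 9) by (1, 9), so neither survives; progressive algorithms like LU and SP aim to emit the surviving points one by one without first materializing all candidates.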
https://www.computer.org/csdl/trans/tk/2019/03/08360464-abs.html
This paper presents a new image hashing scheme designed with tensor decomposition (TD), referred to as TD hashing, where image hash generation is viewed as deriving a compact representation from a tensor. Specifically, a stable third-order tensor is first constructed from the normalized image, so as to enhance the robustness of our TD hashing. A popular TD algorithm, called Tucker decomposition, is then exploited to decompose the third-order tensor into a core tensor and three orthogonal factor matrices. As the factor matrices reflect the intrinsic structure of the original tensor, hash construction with the factor matrices gives the TD hashing desirable discrimination. To examine these claims, 14,551 images are selected for our experiments. A receiver operating characteristic (ROC) graph is used to conduct theoretical analysis, and the ROC comparisons illustrate that TD hashing outperforms some state-of-the-art algorithms in the classification performance between robustness and discrimination.
02/05/2019 2:05 pm PST | DOI: http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2837745

Webpage Depth Viewability Prediction Using Deep Sequential Neural Networks
https://www.computer.org/csdl/trans/tk/2019/03/08362690-abs.html
Display advertising is the most important revenue source for publishers in the online publishing industry. Ad pricing standards are shifting to a new model in which ads are paid for only if they are viewed. Consequently, an important problem for publishers is to predict the probability that an ad at a given page depth will be shown on a user's screen for a certain dwell time. This paper proposes deep learning models based on Long Short-Term Memory (LSTM) to predict the viewability of any page depth for any given dwell time. The main novelty of our best model lies in the combination of bi-directional LSTM networks, an encoder-decoder structure, and residual connections. The experimental results over a dataset collected from a large online publisher demonstrate that the proposed LSTM-based sequential neural networks outperform the comparison methods in terms of prediction performance.
02/06/2019 2:02 pm PST | DOI: http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2839599

Nonnegative Matrix Factorization with Side Information for Time Series Recovery and Prediction
https://www.computer.org/csdl/trans/tk/2019/03/08362697-abs.html
Motivated by the recovery and prediction of electricity consumption time series, we extend Nonnegative Matrix Factorization to take into account external features as side information. We consider general linear measurement settings, and propose a framework which models non-linear relationships between external features and the response variable. We extend previous theoretical results to obtain a sufficient condition on the identifiability of NMF with side information. Based on the classical Hierarchical Alternating Least Squares (HALS) algorithm, we propose a new algorithm (HALSX, or Hierarchical Alternating Least Squares with eXogeneous variables) which estimates NMF in this setting. The algorithm is validated on both simulated and real electricity consumption datasets as well as a recommendation system dataset, to show its performance in matrix recovery and prediction for new rows and columns.
02/05/2019 2:05 pm PST | DOI: http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2839678

Comments Mining With TF-IDF: The Inherent Bias and Its Removal
https://www.computer.org/csdl/trans/tk/2019/03/08364601-abs.html
Text mining has gained great momentum in recent years, with user-generated content becoming widely available. One key use is comment mining, with much attention being given to sentiment analysis and opinion mining. An essential step in the process of comment mining is text pre-processing, a step in which each linguistic term is assigned a weight that commonly increases with its number of appearances in the studied text, yet is offset by the frequency of the term in the domain of interest. A common practice is to use the well-known <italic>tf-idf</italic> formula to compute these weights. This paper reveals the bias introduced by between-participants’ discourse to the study of comments in social media, and proposes an adjustment. We find that content extracted from discourse is often highly correlated, resulting in dependency structures between observations in the study and thus introducing a statistical bias. Ignoring this bias can yield a non-robust analysis at best and an entirely wrong conclusion at worst. We propose an adjustment to tf-idf that accounts for this bias. We illustrate the effects of both the bias and the correction with data from seven Facebook fan pages, covering different domains, including news, finance, politics, sport, shopping, and entertainment.
02/05/2019 2:05 pm PST | DOI: http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2840127

A General Theory of IR Evaluation Measures
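For reference alongside the tf-idf discussion above, the standard weighting can be computed as follows. This is the plain tf * ln(N/df) variant on a toy corpus; the paper's correction for discourse-induced correlation is not reproduced here:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns a {term: weight} dict per document,
    with weight = tf * ln(N / df), where df counts documents containing
    the term and N is the number of documents."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # each doc contributes once per term
    out = []
    for doc in docs:
        tf = Counter(doc)
        out.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return out

# Toy comments; "boring" is rare across documents, so idf weights it up.
comments = [
    ["great", "match", "great", "goal"],
    ["boring", "match"],
    ["great", "goal"],
]
w = tf_idf(comments)
```

The bias the paper studies arises when comments are not independent draws, e.g., replies echoing each other inflate tf and deflate idf in a correlated way, which this textbook formula silently assumes away.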
https://www.computer.org/csdl/trans/tk/2019/03/08365758-abs.html
Interval scales are assumed by several basic descriptive statistics, such as the mean and variance, and by many statistical significance tests that are routinely used in IR to compare systems. Unfortunately, so far there has been no systematic and formal study of the actual scale properties of IR measures. Therefore, in this paper, we develop a theory of <italic>Information Retrieval (IR)</italic> evaluation measures, based on the representational theory of measurement, to determine whether and when IR measures are interval scales. We find that common set-based retrieval measures, namely Precision, Recall, and F-measure, are always interval scales in the case of binary relevance; this also holds in the case of multi-graded relevance, but only when the relevance degrees themselves are on a ratio scale and we define a specific partial order among systems. Among rank-based retrieval measures, namely AP, gRBP, DCG, and ERR, only gRBP is an interval scale, and only when we choose a specific value of the parameter <inline-formula><tex-math notation="LaTeX">$p$</tex-math><alternatives><inline-graphic xlink:href="ferro-ieq1-2840708.gif"/></alternatives></inline-formula> and define a specific total order among systems; all the other IR measures are not interval scales. Besides the formal framework itself and the proofs of the scale properties of several commonly used IR measures, the paper also defines some brand-new set-based and rank-based IR evaluation measures which are guaranteed to be interval scales.
02/05/2019 2:06 pm PST | DOI: http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2840708

Viral Cascade Probability Estimation and Maximization in Diffusion Networks
https://www.computer.org/csdl/trans/tk/2019/03/08367882-abs.html
People use social networks to share millions of stories every day, but these stories rarely become viral. Can we estimate the probability that a story becomes a <italic>viral cascade</italic>? If so, can we find a set of users that are more likely to trigger viral cascades? These estimation and maximization problems are very challenging, since both the rare-event nature of viral cascades and efficiency requirements must be considered. Unfortunately, this problem remains largely unexplored to date. In this paper, given the temporal dynamics of a network, we first develop an efficient <bold>vi</bold>ral <bold>c</bold>ascade probability <bold>e</bold>stimation method, <sc>ViCE</sc>, that leverages a special <italic>importance sampling</italic> approximation to achieve high accuracy, even in cases of very <italic>small probability</italic> of influence. We then show that finding the most influential nodes in this model is NP-hard, and develop an efficient <bold>vi</bold>ral <bold>c</bold>ascade probability <bold>m</bold>aximization method, <sc>ViCEM</sc>, that maximizes a surrogate submodular function using a greedy algorithm. Experiments on both synthetic and real-world data show that <sc>ViCE</sc> can accurately estimate viral cascade probabilities using at least two orders of magnitude fewer samples than naive sampling, and that <sc>ViCEM</sc> finds a set of users with higher viral cascade probability than alternatives. Additionally, experiments show that these algorithms are robust across different network topologies.
02/05/2019 2:05 pm PST | DOI: http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2840998

Privacy-Preserving Social Media Data Publishing for Personalized Ranking-Based Recommendation
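The rare-event difficulty that ViCE tackles with importance sampling can be seen in a generic sketch: estimating a Gaussian tail probability rather than the paper's cascade model. The proposal distribution, sample count, and target are illustrative only:

```python
import math
import random

def tail_prob_is(t, n_samples=20000, seed=7):
    """Estimate the rare-event probability P(Z > t) for Z ~ N(0, 1) by
    sampling from the proposal N(t, 1), which is centered on the rare
    region, and reweighting each hit by the likelihood ratio
    w(x) = phi(x) / phi(x - t) = exp(t^2 / 2 - t * x)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(t, 1.0)
        if x > t:
            total += math.exp(t * t / 2.0 - t * x)
    return total / n_samples

# P(Z > 4) is about 3.2e-5: naive Monte Carlo would need on the order of
# 30,000 draws just to observe one such event, while the proposal hits the
# rare region on roughly half of its 20,000 draws.
est = tail_prob_is(4.0)
```

The same principle applies to cascades: sampling from a tilted diffusion process that makes virality common, then unbiasing with likelihood ratios, is what buys the orders-of-magnitude sample savings.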
https://www.computer.org/csdl/trans/tk/2019/03/08367884-abs.html
Personalized recommendation is crucial to helping users find pertinent information. It often relies on a large collection of user data, in particular users’ online activity (e.g., tagging/rating/checking-in) on social media, to mine user preferences. However, releasing such user activity data makes users vulnerable to inference attacks, as private data (e.g., gender) can often be inferred from the activity data. In this paper, we propose PrivRank, a customizable and continuous privacy-preserving social media data publishing framework that protects users against inference attacks while enabling personalized ranking-based recommendations. Its key idea is to continuously obfuscate user activity data such that the privacy leakage of user-specified private data is minimized under a given data distortion budget, which bounds the ranking loss incurred by the obfuscation process in order to preserve the utility of the data for recommendations. An empirical evaluation on both synthetic and real-world datasets shows that our framework can efficiently provide effective and continuous protection of user-specified private data, while still preserving the utility of the obfuscated data for personalized ranking-based recommendation. Compared to state-of-the-art approaches, PrivRank achieves both better privacy protection and higher utility in all the ranking-based recommendation use cases we tested.
02/05/2019 2:05 pm PST | DOI: http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2840974

Correlated Matrix Factorization for Recommendation with Implicit Feedback
https://www.computer.org/csdl/trans/tk/2019/03/08367897-abs.html
As a typical latent factor model, Matrix Factorization (MF) has demonstrated great effectiveness in recommender systems. Users and items are represented in a shared low-dimensional space so that a user's preference can be modeled by linearly combining the item factor vector <inline-formula><tex-math notation="LaTeX">$V$</tex-math><alternatives><inline-graphic xlink:href="he-ieq1-2840993.gif"/></alternatives></inline-formula> using the user-specific coefficients <inline-formula><tex-math notation="LaTeX">$U$</tex-math><alternatives><inline-graphic xlink:href="he-ieq2-2840993.gif"/></alternatives></inline-formula>. From a generative model perspective, <inline-formula><tex-math notation="LaTeX">$U$</tex-math><alternatives><inline-graphic xlink:href="he-ieq3-2840993.gif"/></alternatives></inline-formula> and <inline-formula><tex-math notation="LaTeX">$V$</tex-math><alternatives><inline-graphic xlink:href="he-ieq4-2840993.gif"/></alternatives></inline-formula> are drawn from two <italic>independent</italic> Gaussian distributions, which is not faithful to reality. Items are produced to maximally meet users’ requirements, which makes <inline-formula><tex-math notation="LaTeX">$U$</tex-math><alternatives><inline-graphic xlink:href="he-ieq5-2840993.gif"/></alternatives></inline-formula> and <inline-formula><tex-math notation="LaTeX">$V$</tex-math><alternatives><inline-graphic xlink:href="he-ieq6-2840993.gif"/></alternatives></inline-formula> strongly correlated. Meanwhile, the linear combination of <inline-formula><tex-math notation="LaTeX">$U$</tex-math><alternatives><inline-graphic xlink:href="he-ieq7-2840993.gif"/></alternatives></inline-formula> and <inline-formula><tex-math notation="LaTeX">$V$</tex-math><alternatives><inline-graphic xlink:href="he-ieq8-2840993.gif"/></alternatives></inline-formula> forces a bijection (one-to-one mapping), thereby neglecting the mutual correlation between the latent factors.
In this paper, we address the above drawbacks and propose a new model, named Correlated Matrix Factorization (CMF). Technically, we apply Canonical Correlation Analysis (CCA) to map <inline-formula><tex-math notation="LaTeX">$U$</tex-math><alternatives><inline-graphic xlink:href="he-ieq9-2840993.gif"/></alternatives></inline-formula> and <inline-formula><tex-math notation="LaTeX">$V$</tex-math><alternatives><inline-graphic xlink:href="he-ieq10-2840993.gif"/></alternatives></inline-formula> into a new semantic space. Besides achieving the optimal fit on the rating matrix, each component in one vector (<inline-formula><tex-math notation="LaTeX">$U$</tex-math><alternatives><inline-graphic xlink:href="he-ieq11-2840993.gif"/></alternatives></inline-formula> or <inline-formula><tex-math notation="LaTeX">$V$</tex-math><alternatives><inline-graphic xlink:href="he-ieq12-2840993.gif"/></alternatives></inline-formula>) is also tightly correlated with every single component in the other. We derive efficient inference and learning algorithms based on variational EM methods. The effectiveness of our proposed model is comprehensively verified on four public datasets. Experimental results show that our approach achieves competitive performance in both prediction accuracy and efficiency compared with the current state of the art.
02/05/2019 2:05 pm PST | DOI: http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2840993

Adaptive Cost-Sensitive Online Classification
https://www.computer.org/csdl/trans/tk/2019/02/08337008-abs.html
Cost-sensitive online classification has drawn extensive attention in recent years, where the main approach is to directly optimize two well-known cost-sensitive metrics online: (i) the weighted sum of sensitivity and specificity and (ii) the weighted misclassification cost. However, existing methods only consider first-order information of the data stream. This is insufficient in practice, since many recent studies have shown that incorporating second-order information enhances the prediction performance of classification models. Thus, in this paper we propose a family of cost-sensitive online classification algorithms with adaptive regularization. We theoretically analyze the proposed algorithms and empirically validate their effectiveness and properties in extensive experiments. Then, for a better trade-off between performance and efficiency, we further introduce a sketching technique into our algorithms, which significantly accelerates computation with only slight performance loss. Finally, we apply our algorithms to several real-world online anomaly detection tasks. The promising results show that the proposed algorithms are effective and efficient in solving cost-sensitive online classification problems in various real-world domains.
01/10/2019 2:02 pm PST | DOI: http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2826011

Applying Simulated Annealing and Parallel Computing to the Mobile Sequential Recommendation
https://www.computer.org/csdl/trans/tk/2019/02/08338062-abs.html
We speed up the solution of the mobile sequential recommendation (MSR) problem, which requires searching for optimal routes for empty taxi cabs by mining massive taxi GPS data. We develop new methods that combine parallel computing and simulated annealing with novel global and local searches. While existing approaches usually involve costly offline algorithms and methodical pruning of the search space, our new methods provide a direct real-time search for the optimal route without offline preprocessing. Our methods reduce the computational time for high-dimensional MSR problems from days to seconds on real-world as well as synthetic data. We efficiently solve MSR problems with thousands of pick-up points without offline training, compared to the published record of 25 pick-up points.
01/10/2019 2:03 pm PST | DOI: http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2827047

Engineering Methods for Differentially Private Histograms: Efficiency Beyond Utility
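A generic simulated-annealing loop of the kind the MSR work above parallelizes looks like this. The objective, cooling schedule, and neighborhood move are a toy illustration, not the paper's route-search formulation:

```python
import math
import random

def anneal(cost, start, neighbor, t0=10.0, cooling=0.995, steps=4000, seed=1):
    """Plain simulated annealing: always accept improvements, accept worse
    moves with probability exp(-delta / T), and cool T geometrically."""
    rng = random.Random(seed)
    state = best = start
    temp = t0
    for _ in range(steps):
        cand = neighbor(state, rng)
        delta = cost(cand) - cost(state)
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            state = cand
            if cost(state) < cost(best):
                best = state            # track the best state ever visited
        temp *= cooling
    return best

def bumpy(x):
    # Toy 1-D objective with several local bumps; global minimum at x = 0.
    return x * x + 10.0 * math.sin(x) ** 2

best = anneal(bumpy, start=8.0, neighbor=lambda x, r: x + r.uniform(-1.0, 1.0))
```

Independent annealing chains like this one are trivially parallel, which is one reason the combination of simulated annealing and parallel computing scales to thousands of pick-up points.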
https://www.computer.org/csdl/trans/tk/2019/02/08338405-abs.html
Publishing histograms with <inline-formula><tex-math notation="LaTeX">$\epsilon$</tex-math><alternatives><inline-graphic xlink:href="kellaris-ieq1-2827378.gif"/></alternatives></inline-formula>-<italic>differential privacy</italic> has been studied extensively in the literature. Existing schemes aim at maximizing the <italic>utility</italic> of the published data, while previous experimental evaluations analyze the privacy/utility trade-off. In this paper, we provide the first experimental evaluation of differentially private methods that goes beyond utility, also emphasizing another important aspect, namely <italic>efficiency</italic>. Towards this end, we first observe that all existing schemes are composed of a small set of common blocks. We then optimize and choose the best implementation for each block, determine the combinations of blocks that capture the entire literature, and propose novel block combinations. We qualitatively assess the quality of the schemes based on the skyline of efficiency and utility, i.e., based on whether a method is dominated in both aspects or not. Using exhaustive experiments on four real datasets with different characteristics, we conclude that there are always trade-offs between utility and efficiency. We demonstrate that the schemes derived from our novel block combinations provide the best trade-offs for time-critical applications. Our work can serve as a guide to help practitioners <italic>engineer</italic> a differentially private histogram scheme depending on their application requirements.
01/10/2019 2:02 pm PST | DOI: http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2827378

Fast Cosine Similarity Search in Binary Space with Angular Multi-Index Hashing
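The simplest building block that differentially private histogram schemes share, a Laplace-noised histogram, can be sketched as follows. The epsilon value, bin edges, and data are illustrative, and this shows only the baseline mechanism, not the paper's optimized block combinations:

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) via inverse-CDF sampling."""
    u = rng.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def dp_histogram(values, bins, epsilon, seed=3):
    """epsilon-differentially-private histogram. Bins are disjoint, so
    adding or removing one record changes a single count by 1 (L1
    sensitivity 1); Laplace noise with scale 1/epsilon then suffices."""
    rng = random.Random(seed)
    counts = [0] * len(bins)
    for v in values:
        for i, (lo, hi) in enumerate(bins):
            if lo <= v < hi:
                counts[i] += 1
                break
    return [c + laplace_noise(1.0 / epsilon, rng) for c in counts]

# Illustrative data: ages bucketed into three disjoint ranges.
ages = [23, 34, 29, 41, 57, 38, 25, 63, 31, 47]   # true counts: 5, 3, 2
bins = [(20, 35), (35, 50), (50, 65)]
noisy = dp_histogram(ages, bins, epsilon=1.0)
```

Schemes in the surveyed literature differ in the blocks wrapped around this core, such as grouping, smoothing, or adaptive bin structures, and those blocks are exactly where the utility/efficiency trade-offs arise.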
https://www.computer.org/csdl/trans/tk/2019/02/08340865-abs.html
Given a large dataset of binary codes and a binary query point, we address how to efficiently find the <inline-formula><tex-math notation="LaTeX">$K$</tex-math><alternatives><inline-graphic xlink:href="eghbali-ieq1-2828095.gif"/></alternatives></inline-formula> codes in the dataset that yield the largest cosine similarities to the query. The straightforward answer to this problem is to compare the query with all items in the dataset, but this is practical only for small datasets. One potential solution for reducing the search time and achieving sublinear cost is to use a hash table populated with the binary codes of the dataset and then look up the buckets near the query to retrieve the nearest neighbors. However, if codes are compared in terms of cosine similarity rather than Hamming distance, the main issue is that the order in which buckets should be probed is not evident. We therefore first elaborate on the connection between Hamming distance and cosine similarity, which allows us to systematically find the probing sequence in the hash table. However, solving the nearest neighbor search with a single table is only practical for short binary codes. To address this issue, we propose the angular multi-index hashing search algorithm, which relies on building multiple hash tables on binary code substrings. The proposed search algorithm solves the exact angular <inline-formula><tex-math notation="LaTeX">$K$</tex-math><alternatives><inline-graphic xlink:href="eghbali-ieq2-2828095.gif"/></alternatives></inline-formula> nearest neighbor problem in a time that is often orders of magnitude faster than the linear scan baseline and even approximation methods.
01/10/2019 2:03 pm PST | DOI: http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2828095

Harnessing Multi-Source Data about Public Sentiments and Activities for Informed Design
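For the binary-code search above, cosine similarity reduces to popcounts. A brute-force sketch of the linear-scan baseline that the paper's angular multi-index hashing is designed to outperform (codes stored as Python ints; the toy database is illustrative):

```python
import math

def cosine_binary(x, y):
    """Cosine similarity of two binary codes held as Python ints:
    cos = popcount(x AND y) / sqrt(popcount(x) * popcount(y))."""
    inter = bin(x & y).count("1")
    nx, ny = bin(x).count("1"), bin(y).count("1")
    return inter / math.sqrt(nx * ny) if nx and ny else 0.0

def top_k(query, codes, k):
    """Linear-scan K nearest neighbors by cosine similarity: exact, but
    costs one popcount comparison per database item."""
    return sorted(codes, key=lambda c: cosine_binary(query, c),
                  reverse=True)[:k]

q = 0b11110000
database = [0b11110000, 0b11100001, 0b00001111, 0b11001100]
nn = top_k(q, database, k=2)
```

Because both the intersection popcount and the code weights are integers, cosine order within a fixed-weight bucket tracks Hamming distance, which is the connection that makes a principled bucket-probing sequence possible.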
https://www.computer.org/csdl/trans/tk/2019/02/08341813-abs.html
The intelligence of Smart Cities (SC) is represented by the ability to collect, manage, integrate, analyze, and mine multi-source data for valuable insights. In order to harness multi-source data for an informed place design, this paper presents “Public Sentiments and Activities in Places” multi-source data analysis flow (PSAP) in an Informed Design Platform (IDP). In terms of key contributions, PSAP implements 1) an Interconnected Data Model (IDM) to manage multi-source data independently and integrally, 2) an efficient and effective data mining mechanism based on multi-dimension and multi-measure queries (MMQs), and 3) concurrent data processing cascades with Sentiments in Places Analysis Mechanism (SPAM) and Activities in Places Analysis Mechanism (APAM), to fuse social network data with other data on public sentiment and activity comprehensively. As shown by a holistic evaluation, both SPAM and APAM outperform compared methods. Specifically, SPAM improves its classification accuracy gradually and significantly from 72.37 to about 85 percent within nine crowd-calibration cycles, and APAM with an ensemble classifier achieves the highest precision of 92.13 percent, which is approximately 13 percent higher than the second-best method. Finally, by applying MMQs on “Sentiment&Activity Linked Data”, various place design insights of our testbed are mined to improve its livability.01/10/2019 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2828431Efficient Mining of Frequent Patterns on Uncertain Graphs
https://www.computer.org/csdl/trans/tk/2019/02/08349950-abs.html
Uncertainty is intrinsic to a wide spectrum of real-life applications, which inevitably applies to graph data. Representative uncertain graphs are seen in bio-informatics, social networks, etc. This paper motivates the problem of frequent subgraph mining on single uncertain graphs, and investigates two different - probabilistic and expected - semantics in terms of support definitions. First, we present an enumeration-evaluation algorithm to solve the problem under probabilistic semantics. After showing that support computation under probabilistic semantics is #P-complete, we develop an approximation algorithm with an accuracy guarantee for efficient problem solving. To enhance the solution, we devise computation sharing techniques to achieve better mining performance. The algorithm is then extended in a similar manner to handle the problem under expected semantics, where checkpoint-based pruning and validation techniques are integrated. Experimental results on real-life datasets confirm the practical usability of the mining algorithms.01/10/2019 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2830336Predicting Consumption Patterns with Repeated and Novel Events
https://www.computer.org/csdl/trans/tk/2019/02/08353141-abs.html
There are numerous contexts where individuals typically consume a few items from a large selection of possible items. Examples include purchasing products, listening to music, visiting locations in physical or virtual environments, and so on. There has been significant prior work in such contexts on developing predictive modeling techniques for recommending new items to individuals, often using techniques such as matrix factorization. There are many situations, however, where making predictions for both previously-consumed and new items for an individual is important, rather than just recommending new items. We investigate this problem and find that widely-used matrix factorization methods are limited in their ability to capture important details in historical behavior, resulting in relatively low predictive accuracy for these types of problems. As an alternative we propose an interpretable and scalable mixture model framework that balances individual preferences in terms of exploration and exploitation. We evaluate our model in terms of accuracy in user consumption predictions using several real-world datasets, including location data, social media data, and music listening data. Experimental results show that the mixture model approach is systematically more accurate and more efficient for these problems compared to a variety of state-of-the-art matrix factorization methods.01/11/2019 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2832132Community Detection in Multi-Layer Networks Using Joint Nonnegative Matrix Factorization
https://www.computer.org/csdl/trans/tk/2019/02/08353344-abs.html
Many complex systems are composed of coupled networks through different layers, where each layer represents one of many possible types of interactions. A fundamental question is how to extract communities in multi-layer networks. The current algorithms either collapse multi-layer networks into a single-layer network or extend the algorithms for single-layer networks by using consensus clustering. However, these approaches have been criticized for ignoring the connection among various layers, thereby resulting in low accuracy. To address this problem, a quantitative function (multi-layer modularity density) is proposed for community detection in multi-layer networks. Afterward, we prove that the trace optimization of multi-layer modularity density is equivalent to the objective functions of algorithms, such as kernel <inline-formula><tex-math notation="LaTeX">$K$</tex-math><alternatives><inline-graphic xlink:href="ma-ieq1-2832205.gif"/></alternatives></inline-formula>-means, nonnegative matrix factorization (NMF), spectral clustering and multi-view clustering, for multi-layer networks, which serves as the theoretical foundation for designing algorithms for community detection. Furthermore, a <underline>S</underline>emi-<underline>S</underline>upervised <underline>j</underline>oint <underline>N</underline>onnegative <underline>M</underline>atrix <underline>F</underline>actorization algorithm (<italic>S2-jNMF</italic>) is developed by simultaneously factorizing matrices that are associated with multi-layer networks. Unlike the traditional semi-supervised algorithms, the partial supervision is integrated into the objective of the S2-jNMF algorithm. 
Finally, through extensive experiments on both artificial and real world networks, we demonstrate that the proposed method outperforms the state-of-the-art approaches for community detection in multi-layer networks.01/10/2019 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2832205Supergraph Search in Graph Databases via Hierarchical Feature-Tree
https://www.computer.org/csdl/trans/tk/2019/02/08354892-abs.html
Supergraph search is a fundamental problem in graph databases that is widely applied in many application scenarios. Given a graph database and a query-graph, supergraph search retrieves all data-graphs contained in the query-graph from the graph database. Most existing solutions for supergraph search follow the pruning-and-verification framework, which prunes false answers based on features in the pruning phase and performs subgraph isomorphism tests on the remaining graphs in the verification phase. However, they are not scalable to handle large-sized data-graphs and query-graphs due to three drawbacks. First, they rely on a frequent subgraph mining algorithm to select features, which is expensive and cannot generate large features. Second, they require a costly verification phase. Third, they process features in a fixed order without considering their relationships to the query-graph. In this paper, we address the three drawbacks and propose new indexing and query processing algorithms. In indexing, we select features directly from the data-graphs without expensive frequent subgraph mining. The features form a feature-tree that contains all-sized features and both the cost sharing and pruning power of the features are considered. In query processing, we propose a new algorithm, where the order to process features is query-dependent by considering both the cost sharing and the pruning power. We explore two optimization strategies to further improve the algorithm efficiency. The first strategy applies a lightweight graph compression technique and the second strategy optimizes the inclusion of answers. We further show how to efficiently maintain the index incrementally when the graph database is updated dynamically. Moreover, we propose an approximation approach to significantly reduce the computational cost for large data-graphs and/or query-graphs while preserving a high result quality. 
Finally, we conduct extensive performance studies on two real large datasets to demonstrate the efficiency and effectiveness of our algorithms.01/10/2019 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2833124Efficient Vertical Mining of High Average-Utility Itemsets Based on Novel Upper-Bounds
https://www.computer.org/csdl/trans/tk/2019/02/08355591-abs.html
Mining High Average-Utility Itemsets (HAUIs) in a quantitative database is an extension of the traditional problem of frequent itemset mining, having several practical applications. Discovering HAUIs is more challenging than mining frequent itemsets using the traditional support model since the average-utilities of itemsets do not satisfy the downward-closure property. To design algorithms for mining HAUIs that reduce the search space of itemsets, prior studies have proposed various upper-bounds on the average-utilities of itemsets. However, these algorithms can generate a huge number of unpromising HAUI candidates, which results in high memory consumption and long runtimes. To address this problem, this paper proposes four tight average-utility upper-bounds, based on a vertical database representation, and three efficient pruning strategies. Furthermore, a novel generic framework for comparing average-utility upper-bounds is presented. Based on these theoretical results, an efficient algorithm named dHAUIM is introduced for mining the complete set of HAUIs. dHAUIM represents the search space and quickly computes upper-bounds using a novel IDUL structure. Extensive experiments show that dHAUIM outperforms four state-of-the-art algorithms for mining HAUIs in terms of runtime on both real-life and synthetic databases. Moreover, results show that the proposed pruning strategies dramatically reduce the number of candidate HAUIs.01/10/2019 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2833478Heterogeneous Information Network Embedding for Recommendation
https://www.computer.org/csdl/trans/tk/2019/02/08355676-abs.html
Due to the flexibility in modelling data heterogeneity, heterogeneous information network (HIN) has been adopted to characterize complex and heterogeneous auxiliary data in recommender systems, called <italic>HIN based recommendation</italic>. It is challenging to develop effective methods for HIN based recommendation, in terms of both extracting and exploiting the information in HINs. Most HIN based recommendation methods rely on path based similarity, which cannot fully mine latent structure features of users and items. In this paper, we propose a novel heterogeneous network embedding based approach for HIN based recommendation, called HERec. To embed HINs, we design a meta-path based random walk strategy to generate meaningful node sequences for network embedding. The learned node embeddings are first transformed by a set of fusion functions, and subsequently integrated into an extended matrix factorization (MF) model. The extended MF model and the fusion functions are jointly optimized for the rating prediction task. Extensive experiments on three real-world datasets demonstrate the effectiveness of the HERec model. Moreover, we show the capability of the HERec model for the cold-start problem, and reveal that the transformed embedding information from HINs can improve the recommendation performance.01/14/2019 9:57 am PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2833443Collaboratively Tracking Interests for User Clustering in Streams of Short Texts
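The meta-path-guided walk idea can be illustrated in a few lines. This is a hedged sketch only: the graph, node names, the user/item meta-path, and the cyclic-repetition convention are assumptions for illustration, not HERec's exact procedure.

```python
import random

def metapath_walk(graph, node_types, start, metapath, length, rng=random):
    """One meta-path-guided random walk: each step may only move to a
    neighbor whose type matches the next symbol of the (cyclically
    repeated) meta-path, e.g., 'ui' for user-item-user-item-...
    `graph` maps node -> neighbor list; `node_types` maps node -> type."""
    assert node_types[start] == metapath[0], "start node must match the meta-path head"
    walk = [start]
    while len(walk) < length:
        next_type = metapath[len(walk) % len(metapath)]
        candidates = [v for v in graph[walk[-1]] if node_types[v] == next_type]
        if not candidates:  # dead end: no neighbor of the required type
            break
        walk.append(rng.choice(candidates))
    return walk
```

Restricting each step by node type is what makes the generated sequences semantically meaningful, so a downstream skip-gram style embedder sees homogeneous, interpretable contexts.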
https://www.computer.org/csdl/trans/tk/2019/02/08355681-abs.html
In this paper, we aim at tackling the problem of user clustering in the context of their published short text streams. Clustering users by short text streams is more challenging than in the case of long documents associated with them as it is difficult to track users’ dynamic interests in streaming sparse data. To obtain better user clustering performance, we propose two user collaborative interest tracking models that aim at tracking changes of each user's dynamic topic distributions in collaboration with their followees’ dynamic topic distributions, based both on the content of current short texts and the previously estimated distributions. Our models can be either short-term or long-term dependency topic models. Short-term dependency model collaboratively tracks users’ interests based on users’ topic distributions at the previous time period only, whereas long-term dependency model collaboratively tracks users’ interests based on users’ topic distributions at multiple time periods in the past. We also propose two collapsed Gibbs sampling algorithms for collaboratively inferring users’ dynamic interests for their clustering in our short-term and long-term dependency topic models, respectively. We evaluate our proposed models via a benchmark dataset consisting of Twitter users and their tweets. Experimental results validate the effectiveness of our proposed models that integrate both users’ and their collaborative interests for user clustering by short text streams.01/10/2019 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2832211An Efficient Semi-Supervised Multi-label Classifier Capable of Handling Missing Labels
https://www.computer.org/csdl/trans/tk/2019/02/08356120-abs.html
Multi-label classification has received considerable interest in recent years. Multi-label classifiers usually need to address many issues, including: handling large-scale datasets with many instances and a large set of labels, compensating missing label assignments in the training set, considering correlations between labels, as well as exploiting unlabeled data to improve prediction performance. To tackle datasets with a large set of labels, embedding-based methods represent the label assignments in a low-dimensional space. Many state-of-the-art embedding-based methods use a linear dimensionality reduction to map the label assignments to a low-dimensional space. However, by doing so, these methods actually neglect the tail labels - labels that are infrequently assigned to instances. In this paper, we propose an embedding-based method that non-linearly embeds the label vectors using a stochastic approach, thereby predicting the tail labels more accurately. Moreover, the proposed method has excellent mechanisms for handling missing labels, dealing with large-scale datasets, as well as exploiting unlabeled data. Experiments on real-world datasets show that our method outperforms state-of-the-art multi-label classifiers by a large margin, in terms of prediction performance, as well as training time. Our implementation of the proposed method is available online at: <uri>https://github.com/Akbarnejad/ESMC_Implementation</uri>.01/10/2019 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2833850A Correlation-Based Feature Weighting Filter for Naive Bayes
https://www.computer.org/csdl/trans/tk/2019/02/08359364-abs.html
Due to its simplicity, efficiency, and efficacy, naive Bayes (NB) has continued to be one of the top 10 algorithms in the data mining and machine learning community. Among the numerous approaches to alleviating its conditional independence assumption, feature weighting has placed more emphasis on highly predictive features than on those that are less predictive. In this paper, we argue that for NB highly predictive features should be highly correlated with the class (maximum mutual relevance), yet uncorrelated with other features (minimum mutual redundancy). Based on this premise, we propose a correlation-based feature weighting (CFW) filter for NB. In CFW, the weight for a feature is a sigmoid transformation of the difference between the feature-class correlation (mutual relevance) and the average feature-feature intercorrelation (average mutual redundancy). Experimental results show that NB with CFW significantly outperforms NB and all the other existing state-of-the-art feature weighting filters used for comparison. Compared to feature weighting wrappers for improving NB, the main advantages of CFW are its low computational complexity (no search involved) and the fact that it maintains the simplicity of the final model. In addition, we apply CFW to text classification and achieve remarkable improvements.01/10/2019 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.28364402018 Reviewers List
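The weighting rule described above (sigmoid of relevance minus average redundancy) can be sketched directly. This is a simplified sketch using raw mutual information; the paper's exact correlation measure, normalization, and sigmoid parameterization may differ.

```python
import math
from collections import Counter

def mutual_info(xs, ys):
    """Mutual information (in bits) between two discrete value sequences."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def cfw_weights(features, labels):
    """Weight each feature by a sigmoid of (feature-class relevance minus
    average feature-feature redundancy), both measured as mutual information.
    `features` is a list of columns; `labels` is the class column."""
    d = len(features)
    weights = []
    for i in range(d):
        relevance = mutual_info(features[i], labels)
        redundancy = sum(mutual_info(features[i], features[j])
                         for j in range(d) if j != i) / max(d - 1, 1)
        weights.append(1.0 / (1.0 + math.exp(-(relevance - redundancy))))
    return weights
```

A feature that perfectly predicts the class but is independent of the others gets a weight above 0.5, while an irrelevant, non-redundant feature sits at exactly 0.5.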
https://www.computer.org/csdl/trans/tk/2019/02/08606810-abs.html
01/10/2019 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2885263Platform-Independent Robust Query Processing
https://www.computer.org/csdl/trans/tk/2019/01/07843652-abs.html
To address the classical selectivity estimation problem for OLAP queries in relational databases, a radically different approach called <monospace>PlanBouquet</monospace> was recently proposed in <xref ref-type="bibr" rid="ref1"> [1]</xref> , wherein the estimation process is completely abandoned and replaced with a calibrated discovery mechanism. The beneficial outcome of this new construction is that provable guarantees on worst-case performance, measured as Maximum Sub-Optimality (<italic>MSO</italic>), are obtained thereby facilitating robust query processing. The <monospace> PlanBouquet</monospace> formulation suffers, however, from a systemic drawback—the MSO bound is a function of not only the query, but also the optimizer's behavioral profile over the underlying database platform. As a result, there are adverse consequences: (i) the bound value becomes highly variable, depending on the specifics of the current operating environment, and (ii) it becomes infeasible to compute the value without substantial investments in preprocessing overheads. In this paper, we first present <monospace>SpillBound</monospace>, a new query processing algorithm that retains the core strength of the <monospace>PlanBouquet</monospace> discovery process, but reduces the bound dependency to only the query. It does so by incorporating plan termination and selectivity monitoring mechanisms in the database engine. Specifically, <monospace>SpillBound</monospace> delivers a worst-case multiplicative bound, of <inline-formula><tex-math notation="LaTeX">$D^2+3D$</tex-math><alternatives> <inline-graphic xlink:href="venkatesh-ieq1-2664827.gif"/></alternatives></inline-formula>, where <inline-formula> <tex-math notation="LaTeX">$D$</tex-math><alternatives><inline-graphic xlink:href="venkatesh-ieq2-2664827.gif"/> </alternatives></inline-formula> is simply the number of error-prone predicates in the user query. 
Consequently, the bound value becomes independent of the optimizer and the database platform, and the guarantee can be issued simply by query inspection. We go on to prove that <monospace>SpillBound</monospace> is within an <inline-formula> <tex-math notation="LaTeX">$O(D)$</tex-math><alternatives><inline-graphic xlink:href="venkatesh-ieq3-2664827.gif"/> </alternatives></inline-formula> factor of the <italic>best possible</italic> deterministic selectivity discovery algorithm in its class. We next devise techniques to bridge this quadratic-to-linear MSO gap by introducing the notion of <italic>contour alignment</italic>, a characterization of the nature of plan structures along the <italic>boundaries</italic> of the selectivity space. Specifically, we propose a variant of <monospace>SpillBound</monospace>, called <monospace>AlignedBound</monospace>, which exploits the alignment property and provides a guarantee in the range <inline-formula><tex-math notation="LaTeX">$\mathbf {[2D+2,D^2+3D]}$</tex-math><alternatives> <inline-graphic xlink:href="venkatesh-ieq4-2664827.gif"/></alternatives></inline-formula>. Finally, a detailed empirical evaluation over the standard decision-support benchmarks indicates that: (i) <monospace>SpillBound</monospace> provides markedly superior performance w.r.t. MSO as compared to <monospace>PlanBouquet</monospace>, and (ii) <monospace>AlignedBound</monospace> provides additional benefits for query instances that are challenging for <monospace>SpillBound</monospace>, often coming close to the ideal of MSO linearity in <inline-formula> <tex-math notation="LaTeX">$D$</tex-math><alternatives><inline-graphic xlink:href="venkatesh-ieq5-2664827.gif"/> </alternatives></inline-formula>. From an absolute perspective, <monospace>AlignedBound</monospace> evaluates virtually all the benchmark queries considered in our study with MSO of around <bold>10</bold> or less. 
Therefore, in an overall sense, <monospace>SpillBound</monospace> and <monospace>AlignedBound</monospace> offer a substantive step forward in the long-standing quest for robust query processing.12/07/2018 8:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2664827Inferring Higher-Order Structure Statistics of Large Networks from Sampled Edges
https://www.computer.org/csdl/trans/tk/2019/01/07884989-abs.html
Recently, exploring locally connected subgraphs (also known as motifs or graphlets) of complex networks has attracted considerable attention. Previous work made the strong assumption that the graph topology of interest is known in advance. In practice, sometimes researchers have to deal with the situation where the graph topology is unknown because it is expensive to collect and store all topological information. Hence, typically what is available to researchers is only a snapshot of the graph, i.e., a subgraph of the graph. Crawling methods such as breadth-first sampling can be used to generate the snapshot. However, these methods fail to sample a streaming graph represented as a high speed stream of edges. Therefore, graph mining applications such as network traffic monitoring usually use random edge sampling (i.e., sample each edge with a fixed probability) to collect edges and generate a sampled graph, which we call a “<italic>RESampled graph</italic>”. Clearly, a RESampled graph's motif statistics may be quite different from those of the original graph. To resolve this, we propose a framework, Minfer, which takes the given RESampled graph and accurately infers the underlying graph's motif statistics. Experiments using large scale datasets show the accuracy and efficiency of our method.12/07/2018 8:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2685584Interactive Data Exploration with Smart Drill-Down
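For intuition: under random edge sampling with probability p, a motif with m edges survives with probability p^m, so counts observed in the RESampled graph can be inverted by scaling. A hedged toy sketch for triangles only; Minfer's actual inference handles general motifs and the dependencies between overlapping occurrences, which this simple scaling ignores.

```python
import random
from itertools import combinations

def sample_edges(edges, p, seed=None):
    """Random edge sampling: keep each edge independently with probability p."""
    rng = random.Random(seed)
    return [e for e in edges if rng.random() < p]

def triangle_estimate(sampled_edges, p):
    """Estimate the original graph's triangle count: a triangle survives
    edge sampling with probability p**3, so the count observed in the
    RESampled graph is scaled by 1 / p**3."""
    adj = {}
    for u, v in sampled_edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    observed = sum(1 for a, b, c in combinations(sorted(adj), 3)
                   if b in adj[a] and c in adj[a] and c in adj[b])
    return observed / p ** 3
```

With p = 1 the estimate is exact; as p shrinks, the estimator stays unbiased for triangles but its variance grows, which is the gap a framework like Minfer must manage.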
https://www.computer.org/csdl/trans/tk/2019/01/07885129-abs.html
We present <italic>smart drill-down</italic>, an operator for interactively exploring a relational table to discover and summarize “interesting” groups of tuples. Each group of tuples is described by a <italic>rule</italic> . For instance, the rule <inline-formula><tex-math notation="LaTeX">$(a, b, \star, 1000)$</tex-math><alternatives> <inline-graphic xlink:href="joglekar-ieq1-2685998.gif"/></alternatives></inline-formula> tells us that there are 1,000 tuples with value <inline-formula><tex-math notation="LaTeX">$a$</tex-math><alternatives> <inline-graphic xlink:href="joglekar-ieq2-2685998.gif"/></alternatives></inline-formula> in the first column and <inline-formula><tex-math notation="LaTeX">$b$</tex-math><alternatives> <inline-graphic xlink:href="joglekar-ieq3-2685998.gif"/></alternatives></inline-formula> in the second column (and any value in the third column). Smart drill-down presents an analyst with a list of rules that together describe interesting aspects of the table. The analyst can tailor the definition of interesting, and can interactively apply smart drill-down on an existing rule to explore that part of the table. We demonstrate that the underlying optimization problems are <sc>NP-Hard</sc>, and describe an algorithm for finding the approximately optimal list of rules to display when the user uses a smart drill-down, and a dynamic sampling scheme for efficiently interacting with large tables. Finally, we perform experiments on real datasets on our experimental prototype to demonstrate the usefulness of smart drill-down and study the performance of our algorithms.12/07/2018 8:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2685998<inline-formula><tex-math notation="LaTeX">$l$</tex-math><alternatives> <inline-graphic xlink:href="kim-ieq1-2698461.gif"/></alternatives></inline-formula>-Injection: Toward Effective Collaborative Filtering Using Uninteresting Items
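The rule semantics above can be shown concretely: a rule fixes some column values and leaves the rest as wildcards, and its count is the number of covered tuples. A minimal sketch with hypothetical table data (the optimization over which rules to display is the hard part the paper addresses, not this counting step):

```python
STAR = "*"  # wildcard: matches any value in that column

def rule_count(table, rule):
    """Number of tuples covered by a drill-down rule, where STAR in a
    rule position matches any value in that column."""
    return sum(all(r == STAR or r == v for r, v in zip(rule, row))
               for row in table)

table = [("a", "b", "x"), ("a", "b", "y"), ("a", "c", "x")]
count = rule_count(table, ("a", "b", STAR))  # the rule (a, b, *) covers 2 tuples
```

In the paper's notation this corresponds to the rule (a, b, ⋆, 2): the tuple pattern plus its coverage count.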
https://www.computer.org/csdl/trans/tk/2019/01/07913668-abs.html
We develop a novel framework, named <inline-formula><tex-math notation="LaTeX">$l$</tex-math><alternatives> <inline-graphic xlink:href="kim-ieq2-2698461.gif"/></alternatives></inline-formula>-injection, to address the sparsity problem of recommender systems. By carefully injecting low values to a selected set of unrated user-item pairs in a user-item matrix, we demonstrate that top-<italic>N</italic> recommendation accuracies of various collaborative filtering (CF) techniques can be significantly and consistently improved. We first adopt the notion of <italic>pre-use preferences</italic> of users toward a vast amount of <italic>unrated</italic> items. Using this notion, we identify <italic>uninteresting</italic> items that have not been rated yet but are likely to receive low ratings from users, and selectively impute them as low values. As our proposed approach is method-agnostic, it can be easily applied to a variety of CF algorithms. Through comprehensive experiments with three real-life datasets (namely, Movielens, Ciao, and Watcha), we demonstrate that our solution consistently and universally enhances the accuracies of existing CF algorithms (e.g., item-based CF, SVD-based CF, and SVD++) by 2.5 to 5 times on average. Furthermore, our solution speeds up those CF methods by 1.2 to 2.3 times under the setting that produces the best accuracy. The datasets and codes that we used in the experiments are available at: <uri>https://goo.gl/KUrmip</uri>.12/07/2018 8:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2698461Passive and Partially Active Fault Tolerance for Massively Parallel Stream Processing Engines
https://www.computer.org/csdl/trans/tk/2019/01/07959652-abs.html
Fault-tolerance techniques for stream processing engines can be categorized into passive and active approaches. However, both approaches have their own inadequacies in Massively Parallel Stream Processing Engines (MPSPE). The passive approach incurs a long recovery latency, especially when a number of correlated nodes fail simultaneously, while the active approach requires extra replication resources. In this paper, we propose a new fault-tolerance framework, which is Passive and Partially Active (PPA). In a PPA scheme, the passive approach is applied to all tasks while only a selected set of tasks will be actively replicated. The number of actively replicated tasks depends on the available resources. If tasks without active replicas fail, tentative outputs will be generated before the completion of the recovery process. We also propose effective and efficient algorithms to optimize a partially active replication plan to maximize the quality of tentative outputs. We implemented PPA on top of Storm, an open-source MPSPE, and conducted extensive experiments using both real and synthetic datasets to verify the effectiveness of our approach.12/07/2018 8:05 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2720602A Hardware-Accelerated Solution for Hierarchical Index-Based Merge-Join
https://www.computer.org/csdl/trans/tk/2019/01/08330031-abs.html
Hardware acceleration through <italic>field programmable gate arrays (FPGAs)</italic> has recently become a technique of growing interest for many data-intensive applications. The join is one of the most fundamental and useful query types in relational database management systems. However, the FPGA-based solutions available so far have been beset by higher costs than those for other query types. In this paper, we develop a novel solution to accelerate the processing of sort-merge join queries with low match rates. Specifically, our solution makes use of hierarchical indexes to identify result-yielding regions in the solution space in order to take advantage of result sparseness. Further, in addition to one-dimensional <italic>equi-join</italic> query processing, our solution supports processing of multidimensional similarity join queries. Experimental results show that our solution is superior to the best existing method in a low match rate setting; the method achieves a speedup factor of 4.8 for join queries with a match rate of 5 percent.12/07/2018 8:05 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2822707Order-Sensitive Imputation for Clustered Missing Values
https://www.computer.org/csdl/trans/tk/2019/01/08330055-abs.html
The issue of missing values (MVs) has appeared widely in real-world datasets and hindered the use of many statistical or machine learning algorithms for data analytics due to their inability to handle incomplete datasets. To address this issue, several MV imputation algorithms have been developed. However, these approaches do not perform well when most of the incomplete tuples are clustered with each other, coined here as the <italic>Clustered Missing Values Phenomenon</italic>, which is attributable to the lack of sufficient complete tuples near an MV for imputation. In this paper, we propose the <italic>Order-Sensitive Imputation for Clustered Missing values</italic> (OSICM) framework, in which missing values are imputed sequentially such that the values filled earlier in the process are also used for later imputation of other MVs. Obviously, the order of imputations is critical to the effectiveness and efficiency of the OSICM framework. We formulate the search for the optimal imputation order as an optimization problem, and show its NP-hardness. Furthermore, we devise an algorithm to find the exact optimal solution and propose two approximate/heuristic algorithms to trade off effectiveness for efficiency. Finally, we conduct extensive experiments on real and synthetic datasets to demonstrate the superiority of our OSICM framework.12/07/2018 8:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2822662Achieving Data Truthfulness and Privacy Preservation in Data Markets
https://www.computer.org/csdl/trans/tk/2019/01/08330057-abs.html
As a significant business paradigm, many online information platforms have emerged to satisfy society's needs for person-specific data, where a service provider collects raw data from data contributors, and then offers value-added data services to data consumers. However, in the data trading layer, the data consumers face a pressing problem, i.e., how to verify whether the service provider has truthfully collected and processed the data. Furthermore, the data contributors are usually unwilling to reveal their sensitive personal data and real identities to the data consumers. In this paper, we propose TPDM, which efficiently integrates <underline>T</underline>ruthfulness and <underline>P</underline>rivacy preservation in <underline>D</underline>ata <underline>M</underline>arkets. TPDM is structured internally in an Encrypt-then-Sign fashion, using partially homomorphic encryption and identity-based signature. It simultaneously facilitates batch verification, data processing, and outcome verification, while maintaining identity preservation and data confidentiality. We also instantiate TPDM with a profile matching service and a data distribution service, and extensively evaluate their performances on the Yahoo! Music ratings dataset and the 2009 RECS dataset, respectively. Our analysis and evaluation results reveal that TPDM achieves several desirable properties, while incurring low computation and communication overheads when supporting large-scale data markets.12/07/2018 8:05 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2822727Top-<inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="semertzidis-ieq1-2823754.gif"/></alternatives></inline-formula> Durable Graph Pattern Queries on Temporal Graphs
https://www.computer.org/csdl/trans/tk/2019/01/08332489-abs.html
Graphs offer a natural model for the relationships and interactions among entities, such as those occurring among users in social and cooperation networks, and proteins in biological networks. Since most such networks are dynamic, to capture their evolution over time, we assume a sequence of graph snapshots where each snapshot represents the state of the network at a different time instance. Given this sequence, we seek to find the top-<inline-formula> <tex-math notation="LaTeX">$k$</tex-math><alternatives><inline-graphic xlink:href="semertzidis-ieq2-2823754.gif"/> </alternatives></inline-formula> <italic>most durable matches</italic> of an input graph pattern query, that is, the matches that exist for the longest period of time. The straightforward way to address this problem is to apply a state-of-the-art graph pattern matching algorithm at each snapshot and then aggregate the results. However, for large networks and long sequences, this approach is computationally expensive, since all matches have to be generated at each snapshot, including those appearing only once. We propose a new approach that uses a compact representation of the sequence of graph snapshots, appropriate time indexes to prune the search space, and strategies to determine the duration of the candidate matches. Finally, we present experiments with real datasets that illustrate the efficiency and effectiveness of our approach.
12/07/2018 8:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2823754
Finding Optimal Skyline Product Combinations under Price Promotion
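The straightforward per-snapshot baseline described above can be sketched in a few lines. Here the per-snapshot matches are assumed to be already computed by some matcher, and durability is counted as the number of snapshots containing a match; note the paper's notion of duration may instead refer to the longest contiguous interval:

```python
from collections import Counter

# Naive baseline for "most durable matches": given the match set of
# each snapshot, count in how many snapshots each match appears and
# keep the top-k by that count.
def top_k_durable(matches_per_snapshot, k):
    duration = Counter()
    for matches in matches_per_snapshot:
        duration.update(matches)
    return duration.most_common(k)

snapshots = [{"abc", "abd"}, {"abc"}, {"abc", "abd"}]
print(top_k_durable(snapshots, 1))  # 'abc' appears in all 3 snapshots
```

The cost the paper avoids is visible here: every snapshot's full match set must be materialized, even for matches that appear only once.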
https://www.computer.org/csdl/trans/tk/2019/01/08332494-abs.html
Nowadays, with the development of e-commerce, a growing number of customers choose to shop online. To find attractive products in online shopping marketplaces, the skyline query is a useful tool that offers interesting and preferable choices for customers. The skyline query and its variants have been extensively investigated. However, to the best of our knowledge, existing work has not taken into account the requirements of customers in certain practical application scenarios. Online shopping marketplaces often hold price promotion campaigns to attract customers and increase their purchase intention. Considering the requirements of customers in this practical scenario, we study product selection under price promotion. We formulate a constrained optimal product combination (COPC) problem, which aims to find the skyline product combinations that both meet a customer's willingness to pay and bring the maximum discount rate. The COPC problem offers powerful decision support for customers under price promotion, as confirmed by a customer study. To process the COPC problem, we first propose a two-list exact (TLE) algorithm. The COPC problem is proven to be NP-hard, and the TLE algorithm is not scalable because it needs to process an exponential number of product combinations. We therefore design a lower bound approximate (LBA) algorithm that has an accuracy guarantee on the results, and an incremental greedy (IG) algorithm that has good practical performance. The experimental results demonstrate the efficiency and effectiveness of our proposed algorithms.
12/11/2018 4:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2823707
An Efficient Method for High Quality and Cohesive Topical Phrase Mining
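The dominance test at the heart of any skyline query can be stated precisely; the sketch below is the generic skyline definition (minimize all dimensions, e.g., price and shipping cost), not the paper's TLE/LBA/IG algorithms:

```python
# A point is in the skyline if no other point dominates it, i.e.,
# is <= on every dimension and strictly < on at least one.
def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and \
           any(a < b for a, b in zip(p, q))

def skyline(points):
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# (price, shipping cost) pairs; lower is better on both dimensions.
offers = [(5, 3), (4, 4), (6, 2), (5, 5)]
print(skyline(offers))  # (5, 5) is dominated by (5, 3)
```

This quadratic formulation is fine for illustration; the paper's difficulty comes from combining products under a budget constraint, which makes the combinatorial version NP-hard.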
https://www.computer.org/csdl/trans/tk/2019/01/08332520-abs.html
A phrase is a natural, meaningful, and essential semantic unit. In topic modeling, visualizing phrases for individual topics is an effective way to explore and understand unstructured text corpora. However, from the perspectives of phrase quality and topical cohesion, the outcomes of existing approaches leave room for improvement. Typically, the process of topical phrase mining is twofold: phrase mining and topic modeling. For phrase mining, existing approaches often suffer from order-sensitivity and inappropriate segmentation, which cause them to extract phrases of inferior quality. For topic modeling, traditional topic models do not fully consider the constraints induced by phrases, which may weaken topical cohesion. Moreover, existing approaches often lose domain terminology because they neglect the impact of the domain-level topical distribution. In this paper, we propose an efficient method for high-quality and cohesive topical phrase mining. A high-quality phrase should satisfy the criteria of frequency, phraseness, completeness, and appropriateness. Our framework integrates a quality-guaranteed phrase mining method, a novel topic model incorporating phrase constraints, and a novel document clustering method into an iterative framework that improves both phrase quality and topical cohesion. We also describe algorithmic designs to execute these methods efficiently. Empirical verification demonstrates that our method outperforms the state-of-the-art methods in terms of both interpretability and efficiency.
12/07/2018 8:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2823758
HCBC: A Hierarchical Case-Based Classifier Integrated with Conceptual Clustering
https://www.computer.org/csdl/trans/tk/2019/01/08333767-abs.html
The structured case representation improves case-based reasoning (CBR) by exploring structures in the case base and the relevance of case structures. Recent CBR classifiers have mostly been built upon the attribute-value case representation rather than a structured case representation, so the structural relations embodied in the representation are overlooked when improving the similarity measure. This results in retrieval inefficiency and limits the performance of CBR classifiers. This paper proposes a hierarchical case-based classifier, HCBC, which introduces a concept lattice to organize cases hierarchically. By exploiting structural case relations in the concept lattice, a novel dynamic weighting model is proposed to enhance the concept similarity measure. Based on this similarity measure, HCBC retrieves the top-K concepts most similar to a new case using a bottom-up pruning-based recursive retrieval (PRR) algorithm. The concepts extracted in this way suggest a class label for the case by weighted majority voting. Experimental results show that HCBC outperforms other classifiers in terms of classification performance and robustness on categorical data, and also performs well on numeric datasets. In addition, PRR effectively reduces the search space and greatly improves the retrieval efficiency of HCBC.
12/07/2018 8:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2824317
I/O Efficient Core Graph Decomposition: Application to Degeneracy Ordering
https://www.computer.org/csdl/trans/tk/2019/01/08354806-abs.html
Core decomposition is a fundamental graph problem with a large number of applications. Most existing approaches for core decomposition assume that the graph is kept in the memory of a machine. Nevertheless, many real-world graphs are too big to reside in memory. In this paper, we study I/O efficient core decomposition following a semi-external model, which only allows node information to be loaded in memory. We propose a semi-external algorithm and an optimized algorithm for I/O efficient core decomposition. To handle dynamic graph updates, we first show that our algorithm can be naturally extended to handle edge deletions. Then, we propose an I/O efficient core maintenance algorithm to handle edge insertions, and an improved algorithm that further reduces I/O and CPU cost. In addition, based on our core decomposition algorithms, we propose an I/O efficient semi-external algorithm for degeneracy ordering, an important graph problem that is closely related to core decomposition. We also consider how to maintain the degeneracy order. We conduct extensive experiments on 12 real large graphs. Our optimized core decomposition algorithm significantly outperforms the existing I/O efficient algorithm in terms of both processing time and memory consumption, and is scalable to web-scale graphs. As an example, we are the first to handle a web graph with 978.5 million nodes and 42.6 billion edges using less than 4.2 GB of memory. We also show that our proposed algorithms for degeneracy order computation and maintenance can handle big graphs efficiently with small memory overhead.
12/07/2018 8:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2833070
A Note on the Behavior of Majority Voting in Multi-Class Domains with Biased Annotators
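The classic in-memory peeling algorithm that the semi-external variants build on can be sketched directly. This is the textbook algorithm, not the paper's I/O-efficient one:

```python
# Peeling algorithm for core decomposition: repeatedly remove a node
# of minimum degree; a node's core number is the running maximum of
# the minimum degree observed when it is removed.
def core_numbers(adj):
    degree = {v: len(ns) for v, ns in adj.items()}
    core = {}
    k = 0
    remaining = set(adj)
    while remaining:
        v = min(remaining, key=degree.get)  # node of minimum degree
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core

# A triangle plus a pendant node: the triangle forms the 2-core.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(core_numbers(adj))
```

The semi-external setting in the paper keeps only per-node state (like `degree` and `core` here) in memory while streaming edges from disk.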
https://www.computer.org/csdl/trans/tk/2019/01/08375733-abs.html
Majority voting is a popular and robust strategy to aggregate different opinions in learning from crowds, where each worker labels examples according to their own criteria. Although it has been extensively studied in the binary case, its behavior with multiple classes is not completely clear, specifically when annotations are biased. This paper attempts to fill that gap. The behavior of the majority voting strategy is studied in-depth in multi-class domains, emphasizing the effect of annotation bias. By means of a complete experimental setting, we show the limitations of the standard majority voting strategy. The use of three simple techniques that infer global information from the annotations and annotators allows us to put the performance of the majority voting strategy in context.
12/07/2018 8:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2845400
2018 Index IEEE Transactions on Knowledge and Data Engineering Vol. 30
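The standard majority voting strategy studied above is simple to state precisely; this generic sketch aggregates multi-class crowd labels per example:

```python
from collections import Counter

# Plain majority voting: each example receives the label that most
# annotators assigned. Ties are broken arbitrarily here; handling
# ties and annotator bias is where the subtleties studied in the
# paper arise.
def majority_vote(annotations):
    # annotations: one list of annotator labels per example
    return [Counter(labels).most_common(1)[0][0] for labels in annotations]

labels = [["a", "a", "b"], ["c", "b", "c"], ["b", "b", "b"]]
print(majority_vote(labels))  # ['a', 'c', 'b']
```

With biased annotators, the per-example counts systematically favor some classes, which is exactly the failure mode the paper analyzes.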
https://www.computer.org/csdl/trans/tk/2019/01/08566022-abs.html
Presents the 2018 subject/author index for this publication.
12/07/2018 8:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2882359
Special Section on the International Conference on Data Engineering 2016
https://www.computer.org/csdl/trans/tk/2019/01/08566026-abs.html
12/07/2018 8:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2876580
Dynamic Data Exchange in Distributed RDF Stores
https://www.computer.org/csdl/trans/tk/2018/12/08323222-abs.html
When RDF datasets become too large to be managed by centralised systems, they are often distributed in a cluster of shared-nothing servers, and queries are answered using a distributed join algorithm. Although such solutions have been extensively studied in relational and RDF databases, we argue that existing approaches exhibit two drawbacks. First, they usually decide <italic>statically</italic> (i.e., at query compile time) how to shuffle the data, which can lead to missed opportunities for local computation. Second, they often materialise large intermediate relations whose size is determined by the entire dataset (and not by the data stored in each server), so these relations can easily exceed the memory of individual servers. As a remedy, we present a novel distributed join algorithm for RDF. Our approach decides <italic>dynamically</italic> when to shuffle data, which ensures that query answers that can be wholly produced within a server involve only local computation. It also uses a novel flow control mechanism to ensure that every query can be answered even if each server has a bounded amount of memory that is much smaller than the intermediate relations. We complement our algorithm with a new query planning approach that balances the cost of communication against the cost of local processing at each server. Moreover, as in several existing approaches, we distribute RDF data using graph partitioning so as to maximise local computation, but we refine the partitioning algorithm to produce more balanced partitions. We show empirically that our techniques can outperform the state of the art by orders of magnitude in terms of query evaluation times, network communication, and memory use. In particular, bounding the memory use in individual servers can mean the difference between success and failure when answering queries with large answer sets.
11/11/2018 7:15 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2818696
Partially Related Multi-Task Clustering
https://www.computer.org/csdl/trans/tk/2018/12/08323233-abs.html
Multi-task clustering improves the clustering performance of each task by transferring knowledge across related tasks. Most existing multi-task clustering methods are based on the ideal assumption that the tasks are completely related. However, in real applications the tasks are usually only partially related, and brute-force transfer may cause negative transfer, which degrades clustering performance. In this paper, we propose two multi-task clustering methods for partially related tasks: the self-adapted multi-task clustering (SAMTC) method and the manifold regularized coding multi-task clustering (MRCMTC) method, which can automatically identify and transfer related instances among the tasks, thus avoiding negative transfer. Both SAMTC and MRCMTC construct the similarity matrix for each target task by exploiting useful information from the source tasks through related-instance transfer, and adopt spectral clustering to obtain the final clustering results. However, they learn the related instances from the source tasks in different ways. Experimental results on real datasets show the superiority of the proposed algorithms over traditional single-task clustering methods and existing multi-task clustering methods on both completely and partially related tasks.
11/11/2018 7:13 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2818705
Semi-Supervised Ensemble Clustering Based on Selected Constraint Projection
https://www.computer.org/csdl/trans/tk/2018/12/08323237-abs.html
Traditional cluster ensemble approaches have several limitations: (1) few make use of prior knowledge provided by experts; (2) it is difficult to achieve good performance on high-dimensional datasets; (3) all ensemble members are weighted equally, which ignores their different contributions; and (4) not all pairwise constraints contribute to the final result. To address these limitations, we propose double-weighting semi-supervised ensemble clustering based on selected constraint projection (DCECP), which applies both constraint weighting and ensemble-member weighting. Specifically, DCECP first adopts the random subspace technique in combination with the constraint projection procedure to handle high-dimensional datasets. Second, it treats the prior knowledge of experts as pairwise constraints and assigns different subsets of pairwise constraints to different ensemble members; an adaptive weighting process then associates different weight values with different ensemble members. Third, the weighted normalized cut algorithm is adopted to summarize clustering solutions and generate the final result. Finally, nonparametric statistical tests are used to compare multiple algorithms on real-world datasets. Our experiments on 15 high-dimensional datasets show that DCECP performs better than most clustering algorithms.
11/11/2018 7:14 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2818729
Privacy-Preserving Collaborative Model Learning: The Case of Word Vector Training
https://www.computer.org/csdl/trans/tk/2018/12/08325493-abs.html
Nowadays, machine learning is becoming a new paradigm for mining hidden knowledge in big data. The collection and manipulation of big data not only create considerable value, but also raise serious privacy concerns. To protect the huge amount of potentially sensitive data, a straightforward approach is to encrypt the data with specialized cryptographic tools. However, it is challenging to utilize or operate on encrypted data, especially to perform machine learning algorithms. In this paper, we investigate the problem of training high-quality word vectors over large-scale encrypted data (from distributed data owners) using privacy-preserving collaborative neural-network learning algorithms. We leverage and also design a suite of arithmetic primitives (e.g., multiplication, fixed-point representation, and sigmoid function computation) on encrypted data, which serve as components of our construction. We theoretically analyze the security and efficiency of the proposed construction, and conduct extensive experiments on representative real-world datasets to verify its practicality and effectiveness.
11/11/2018 7:14 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2819673
Uncertain Graph Sparsification
https://www.computer.org/csdl/trans/tk/2018/12/08325513-abs.html
Uncertain graphs are prevalent in several applications, including communication systems, biological databases, and social networks. The ever-increasing size of the underlying data renders both graph storage and query processing extremely expensive. Sparsification has often been used to reduce the size of deterministic graphs by maintaining only the important edges, but adapting deterministic sparsification methods fails in the uncertain setting. To overcome this problem, we introduce the first sparsification techniques aimed explicitly at uncertain graphs. The proposed methods reduce the number of edges and redistribute their probabilities in order to decrease the graph size while preserving its underlying structure. The resulting graph can be used to efficiently and accurately approximate any query and mining task on the original graph. An extensive experimental evaluation with real and synthetic datasets illustrates the effectiveness of our techniques on several common graph tasks, including clustering coefficient, PageRank, reliability, and shortest path distance.
11/11/2018 7:14 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2819651
Inferring Cognitive Wellness from Motor Patterns
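Queries such as reliability on an uncertain graph are answered in expectation over "possible worlds", which is why query processing is so expensive and sparsification pays off. A standard Monte Carlo sketch of two-node reachability (a generic illustration, not the paper's method) samples each edge with its probability:

```python
import random

# Estimate the probability that t is reachable from s in an uncertain
# undirected graph by sampling possible worlds: each edge (u, v) with
# probability p is kept independently with probability p.
def estimate_reachability(edges, s, t, trials=2000, seed=7):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        adj = {}
        for (u, v), p in edges.items():
            if rng.random() < p:
                adj.setdefault(u, []).append(v)
                adj.setdefault(v, []).append(u)
        # DFS from s in the sampled world
        seen, stack = {s}, [s]
        while stack:
            x = stack.pop()
            for y in adj.get(x, []):
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        hits += t in seen
    return hits / trials

edges = {(0, 1): 0.9, (1, 2): 0.9, (0, 2): 0.1}
print(estimate_reachability(edges, 0, 2))  # exact value is 0.829
```

A sparsified graph is useful precisely when such sampled-world estimates stay close to those on the original graph while each sampled world is much smaller.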
https://www.computer.org/csdl/trans/tk/2018/12/08326510-abs.html
Changes in motor patterns have been shown to be useful early indicators of cognitive disorders, such as Parkinson's disease (PD) and cerebral small vessel disease (SVD). It would be highly advantageous to tap into data containing people's motor patterns from motion-sensing devices to analyze subtle changes in cognitive abilities, thereby providing personalized interventions before the actual onset of such conditions. However, this goal is very challenging due to two main technical problems: 1) the amount of data labeled by doctors is small, and 2) the available data tend to be highly imbalanced (the vast majority come from normal subjects, with only a small fraction from subjects with a cognitive disorder). To effectively deal with these challenges and infer cognitive wellness from motor patterns with high accuracy, we propose the MOtor-Cognitive Analytics (MOCA) framework. MOCA first uses a random-oversampling iterative random forest feature selection method to reduce the dimensionality of the feature space and avoid overfitting, and then adds a bias to the optimization problem of the weighted extreme learning machine to achieve good generalization on imbalanced, small-sample datasets. Experimental results on two real-world datasets including SVD and stroke patients show that MOCA can effectively reduce the rate of misdiagnosis and significantly outperforms state-of-the-art methods in inferring people's cognitive capabilities. This work opens up opportunities for population-level pre-screening using motion-sensing devices and can inform current discussions on reforming the health-care infrastructure.
11/11/2018 7:14 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2820024
Attributed Social Network Embedding
https://www.computer.org/csdl/trans/tk/2018/12/08326519-abs.html
Embedding network data into a low-dimensional vector space has shown promising performance for many real-world applications, such as node classification and entity retrieval. However, most existing methods focus only on leveraging the network structure. For social networks, besides the network structure, there also exists rich information about social actors, such as the user profiles of friendship networks and the textual content of citation networks. This rich attribute information reveals the homophily effect, which exerts a huge impact on the formation of social networks. In this paper, we explore attributes as a rich evidence source in social networks to improve network embedding. We propose a generic Attributed Social Network Embedding framework (<italic>ASNE</italic>), which learns representations for social actors (i.e., nodes) by preserving both <italic>structural proximity</italic> and <italic>attribute proximity</italic>. While structural proximity captures the global network structure, attribute proximity accounts for the homophily effect. To justify our proposal, we conduct extensive experiments on four real-world social networks. Compared to state-of-the-art network embedding approaches, <italic>ASNE</italic> learns more informative representations, achieving substantial gains on link prediction and node classification. Specifically, <italic>ASNE</italic> significantly outperforms <italic>node2vec</italic> with an 8.2 percent relative improvement on the link prediction task, and a 12.7 percent gain on the node classification task.
11/11/2018 7:15 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2819980
Topology-Driven Diversity for Targeted Influence Maximization with Application to User Engagement in Social Networks
https://www.computer.org/csdl/trans/tk/2018/12/08326536-abs.html
Research on influence maximization often has to cope with marketing needs relating to the propagation of information towards specific users. However, little attention has been paid to the fact that the success of an information diffusion campaign might depend not only on the number of initial influencers to be detected but also on their <italic>diversity</italic> w.r.t. the target of the campaign. Our main hypothesis is that if we learn seeds that are not only capable of influencing but are also linked to more diverse (groups of) users, then the influence triggers will be diversified as well, and hence the target users will have a higher chance of being engaged. Based on this intuition, we define a novel problem, named <italic>Diversity-sensitive Targeted Influence Maximization (DTIM)</italic>, which models user diversity by exploiting only topological information within a social graph. To the best of our knowledge, we are the first to bring the concept of topology-driven diversity into targeted IM problems, for which we provide two alternative definitions. Accordingly, we propose approximate solutions to DTIM, which detect a size-<inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="tagarelli-ieq1-2820010.gif"/></alternatives></inline-formula> set of users that maximizes the diversity-sensitive capital objective function for a given selection of target users. We evaluate our DTIM methods on a special case of user engagement in online social networks, concerning users who are not actively involved in the community life. Experimental evaluation on real networks demonstrates the meaningfulness of our approach and highlights opportunities for further development of solutions for DTIM applications.
11/11/2018 7:13 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2820010
A Thorough Evaluation of Distance-Based Meta-Features for Automated Text Classification
https://www.computer.org/csdl/trans/tk/2018/12/08326556-abs.html
We address the problem of automatically learning to classify texts by exploiting information derived from meta-features, i.e., features derived from the original bag-of-words representation. Specifically, we provide an in-depth analysis of the recently proposed distance-based meta-features, a <italic>data engineering</italic> technique that relies on the distance between documents to transform the original feature space into a new one that is potentially smaller and more informative. Despite its potential, the meta-feature space may be unnecessarily complex and high-dimensional, which increases the tendency to overfit, limits the application of meta-features in different contexts, and increases computational costs. In this work, we propose the use of multi-objective strategies to reduce the number of meta-features while maximizing classification effectiveness, considering the adequacy of the selected meta-features for a particular dataset or classification method. We present effective and efficient proposals for meta-feature selection that can reduce the number of meta-features by up to 89 percent while maintaining or improving classification effectiveness, something not possible with any of the evaluated baselines. We also use our selection strategies as evaluation tools to analyze different combinations of meta-features. We found very compact combinations of meta-features that achieve high classification effectiveness on most datasets, despite their peculiarities.
11/11/2018 7:14 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2820051
Diverse Relevance Feedback for Time Series with Autoencoder Based Summarizations
https://www.computer.org/csdl/trans/tk/2018/12/08327532-abs.html
We present a relevance feedback based browsing methodology that uses different representations for time series data. The best-performing representation type (e.g., dual-tree complex wavelet transform, Fourier, or symbolic aggregate approximation (SAX)) is learned from user annotations of the presented query results via representation feedback. We also use autoencoder-type neural networks to summarize time series, or their representations, into sparse vectors, which serve as another representation learned from the data. Experiments on 85 real datasets confirm that diversity in the result set increases precision, and that representation feedback incorporates item diversity and helps to identify the appropriate representation. The results also illustrate that the autoencoders can enhance the base representations and achieve comparably accurate results with reduced data sizes.
11/11/2018 7:14 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2820119
A General Framework for Implicit and Explicit Social Recommendation
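Of the representations mentioned, SAX is compact enough to sketch. The z-normalization, piecewise aggregate approximation (PAA), and fixed breakpoints below follow the common SAX convention for a 3-letter alphabet; this is a generic illustration, not the paper's pipeline:

```python
import statistics

# Minimal SAX sketch: z-normalize the series, average it into equal
# segments (PAA), then map each segment mean to a symbol using the
# standard Gaussian breakpoints for a 3-letter alphabet.
def sax(series, segments, breakpoints=(-0.43, 0.43), alphabet="abc"):
    mu, sigma = statistics.mean(series), statistics.pstdev(series)
    z = [(x - mu) / sigma for x in series]
    n = len(z) // segments  # assumes len(series) divisible by segments
    word = ""
    for i in range(segments):
        seg_mean = sum(z[i * n:(i + 1) * n]) / n
        # symbol index = number of breakpoints below the segment mean
        word += alphabet[sum(seg_mean > b for b in breakpoints)]
    return word

print(sax([1, 1, 2, 2, 8, 8, 9, 9], 4))  # low-low-high-high -> "aacc"
```

The resulting symbolic word is what makes SAX attractive for browsing: it is short, discrete, and cheap to compare across series.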
https://www.computer.org/csdl/trans/tk/2018/12/08328917-abs.html
Research on social recommendation aims to exploit social information to improve the quality of a recommender system. It can be further divided into two classes. Explicit social recommendation assumes the existence of not only the users' ratings on items, but also explicit social connections between users. Implicit social recommendation assumes the availability of only the ratings, not the social connections, and attempts to infer implicit social connections between users with the goal of boosting recommendation accuracy. This paper proposes a unified framework applicable to both explicit and implicit social recommendation. We propose an optimization framework that learns the degree of social correlation and the rating prediction jointly, so the two tasks can mutually boost each other's performance. Furthermore, a well-known challenge for implicit social recommendation is that learning the strength of pairwise connections takes quadratic time. We therefore propose several practical techniques that reduce the complexity of our model to linear in the number of observed ratings. The experiments show that the proposed model, with only two parameters, can significantly outperform the state-of-the-art solutions for both explicit and implicit social recommender systems.
11/11/2018 7:14 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2821174
Characterizing and Predicting Early Reviewers for Effective Product Marketing on E-Commerce Websites
https://www.computer.org/csdl/trans/tk/2018/12/08329164-abs.html
Online reviews have become an important source of information for users before making an informed purchase decision. Early reviews of a product tend to have a high impact on subsequent product sales. In this paper, we study the behavior characteristics of early reviewers through their posted reviews on two large real-world e-commerce platforms, Amazon and Yelp. Specifically, we divide the product lifetime into three consecutive stages: <italic>early</italic>, <italic>majority</italic>, and <italic>laggards</italic>. A user who has posted a review in the early stage is considered an early reviewer. We quantitatively characterize early reviewers based on their rating behaviors, the helpfulness scores received from others, and the correlation of their reviews with product popularity. We have found that (1) an early reviewer tends to assign a higher average rating score; and (2) an early reviewer tends to post more helpful reviews. Our analysis of product reviews also indicates that early reviewers' ratings and their received helpfulness scores are likely to influence product popularity. By viewing the review posting process as a multiplayer competition game, we propose a novel margin-based embedding model for early reviewer prediction. Extensive experiments on two different e-commerce datasets show that our proposed approach outperforms a number of competitive baselines.
11/11/2018 7:14 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2821671
Ensemble Learning for Multi-Type Classification in Heterogeneous Networks
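One simple way to operationalize the three consecutive stages is to rank reviews by timestamp and cut the ranking into fractions. The 10/80/10 cut-offs below are purely illustrative assumptions, not the paper's definition of the stages:

```python
# Hypothetical stage labeling: sort a product's reviews by timestamp,
# call the first fraction "early", the last fraction "laggards", and
# everything in between "majority". The 0.1/0.9 cut-offs are
# illustrative, not taken from the paper.
def label_stages(timestamps, early=0.1, late=0.9):
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    n = len(timestamps)
    stages = [None] * n
    for rank, i in enumerate(order):
        if rank < early * n:
            stages[i] = "early"
        elif rank >= late * n:
            stages[i] = "laggards"
        else:
            stages[i] = "majority"
    return stages

print(label_stages([3, 1, 2, 5, 4, 6, 7, 8, 9, 10]))
```

Given such labels, "early reviewer" prediction becomes a supervised task over user features, which is where the paper's margin-based embedding model comes in.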
https://www.computer.org/csdl/trans/tk/2018/12/08329517-abs.html
Heterogeneous networks are networks consisting of different types of objects and links. They can be found in several fields, ranging from the Internet to the social sciences, biology, epidemiology, geography, finance, and many others. In the literature, several methods have been proposed for the analysis of network data, but they usually focus on homogeneous networks, where all the objects are of the same type and links among them describe a single type of relationship. More recently, the complexity of real scenarios has impelled researchers to design methods for the analysis of heterogeneous networks, especially for classification and clustering tasks. However, these methods often make assumptions on the structure of the network that are too restrictive, or do not fully exploit different forms of network correlation and autocorrelation. Moreover, when the nodes that are the main subject of the classification task are linked to network nodes with missing values, standard methods can lead either to incomplete classification models or to discarding possibly relevant dependencies (correlation or autocorrelation). In this paper, we propose an ensemble learning approach for multi-type classification. We adopt the system Mr-SBC, which is able to analyze heterogeneous networks of arbitrary structure, within an ensemble learning approach. The ensemble allows us to improve the classification accuracy of Mr-SBC by exploiting i) the possible presence of correlation and autocorrelation phenomena, and ii) the classification of instances (which contain missing values) of other node types in the network. As a beneficial side effect, the models are also more stable in terms of the standard deviation of the accuracy over different training samples. Experiments performed on real-world datasets show that the proposed method significantly outperforms the standard implementation of Mr-SBC. Moreover, it gives Mr-SBC the advantage of outperforming four other well-known algorithms for the classification of data organized in a network.
11/11/2018 7:14 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2822307
Deep Air Learning: Interpolation, Prediction, and Feature Analysis of Fine-Grained Air Quality
https://www.computer.org/csdl/trans/tk/2018/12/08333777-abs.html
The interpolation, prediction, and feature analysis of fine-grained air quality are three important topics in the area of urban air computing. The solutions to these topics can provide extremely useful information to support air pollution control, and consequently generate great societal and technical impacts. Most of the existing work solves the three problems separately with different models. In this paper, we propose a general and effective approach to solve the three problems in one model called Deep Air Learning (DAL). The main idea of DAL lies in embedding feature selection and semi-supervised learning in different layers of the deep learning network. The proposed approach utilizes the information pertaining to the unlabeled spatio-temporal data to improve the performance of the interpolation and the prediction, and performs feature selection and association analysis to reveal the features most relevant to the variation of air quality. We evaluate our approach with extensive experiments based on real data sources obtained in Beijing, China. Experiments show that DAL is superior to the peer models from the recent literature when solving the topics of interpolation, prediction, and feature analysis of fine-grained air quality.
11/11/2018 7:14 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2823740

Similarity Metrics for SQL Query Clustering
https://www.computer.org/csdl/trans/tk/2018/12/08352666-abs.html
Database access logs are the starting point for many forms of database administration, from database performance tuning, to security auditing, to benchmark design, and many more. Unfortunately, query logs are also large and unwieldy, and it can be difficult for an analyst to extract broad patterns from the set of queries found therein. Clustering is a natural first step towards understanding massive query logs. However, many clustering methods rely on the notion of pairwise similarity, which is challenging to compute for SQL queries, especially when the underlying data and database schema are unavailable. We investigate the problem of computing similarity between queries, relying only on the query structure. We conduct a rigorous evaluation of three query similarity heuristics proposed in the literature, applied to query clustering on multiple query log datasets representing different types of query workloads. To improve the accuracy of the three heuristics, we propose a generic feature engineering strategy that uses classical query rewrites to standardize query structure. The proposed strategy results in a significant improvement in the performance of all three similarity heuristics.
11/11/2018 7:14 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2831214

NAIS: Neural Attentive Item Similarity Model for Recommendation
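To make the structure-only similarity idea from the SQL query clustering abstract concrete, here is a minimal sketch (not one of the paper's evaluated heuristics; the regex-based feature extractor and literal masking are our own illustrative assumptions) that compares two queries by the Jaccard similarity of their structural token sets:

```python
import re

def query_features(sql):
    """Extract a crude structural feature set from a SQL string:
    lowercased keywords, identifiers, and comparison operators,
    with string and numeric literals masked out."""
    sql = re.sub(r"'[^']*'", " ", sql)   # mask string literals
    sql = re.sub(r"\b\d+\b", " ", sql)   # mask numeric literals
    return set(re.findall(r"[a-zA-Z_][a-zA-Z_0-9.]*|[<>=!]+", sql.lower()))

def query_similarity(q1, q2):
    """Jaccard similarity of the two queries' structural feature sets."""
    a, b = query_features(q1), query_features(q2)
    return len(a & b) / len(a | b) if a | b else 1.0
```

Because literals are masked, queries that differ only in constants come out identical, which is the kind of standardization the paper achieves more systematically via query rewrites.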
https://www.computer.org/csdl/trans/tk/2018/12/08352808-abs.html
Item-to-item collaborative filtering (a.k.a. item-based CF) has long been used for building recommender systems in industrial settings, owing to its interpretability and efficiency in real-time personalization. It builds a user's profile from her historically interacted items, recommending new items that are similar to the user's profile. As such, the key to an item-based CF method is the estimation of item similarities. Early approaches use statistical measures such as cosine similarity and the Pearson coefficient to estimate item similarities, which are less accurate since they lack tailored optimization for the recommendation task. In recent years, several works have attempted to learn item similarities from data by expressing the similarity as an underlying model and estimating the model parameters by optimizing a recommendation-aware objective function. While extensive efforts have been made to use shallow linear models for learning item similarities, there has been relatively little work exploring nonlinear neural network models for item-based CF. In this work, we propose a neural network model named Neural Attentive Item Similarity model (NAIS) for item-based CF. The key to our design of NAIS is an attention network, which is capable of distinguishing which historical items in a user profile are more important for a prediction. Compared to the state-of-the-art item-based CF method Factored Item Similarity Model (FISM) [1], our NAIS has stronger representation power with only a few additional parameters brought by the attention network. Extensive experiments on two public benchmarks demonstrate the effectiveness of NAIS. This work is the first attempt to design neural network models for item-based CF, opening up new research possibilities for future developments of neural recommender systems.
11/11/2018 7:14 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2831682

Approximate Order-Sensitive k-NN Queries over Correlated High-Dimensional Data
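To illustrate the attention idea behind NAIS, here is a hedged sketch with randomly initialized (untrained) parameters: each history item's contribution to the prediction is weighted by a small attention MLP with a smoothed softmax. The dimensions, the ReLU attention network, and the smoothing exponent `beta` are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, d, k = 50, 8, 8                 # items, embedding dim, attention dim
P = rng.normal(0, 0.1, (n_items, d))     # embeddings of target items
Q = rng.normal(0, 0.1, (n_items, d))     # embeddings of history items
W = rng.normal(0, 0.1, (k, d))           # attention MLP weights
b = np.zeros(k)
h = rng.normal(0, 0.1, k)                # attention projection vector
beta = 0.5                               # smoothing exponent on the softmax

def attention(i, history):
    """Attention weight of each history item for predicting target item i."""
    z = np.maximum(W @ (P[i] * Q[history]).T + b[:, None], 0)  # ReLU layer
    scores = h @ z
    e = np.exp(scores - scores.max())
    return e / (e.sum() ** beta)         # smoothed (sub-normalized) softmax

def predict(i, history):
    """Predicted preference: target embedding dotted with the
    attention-weighted sum of history-item embeddings."""
    a = attention(i, history)
    return float(P[i] @ (a[:, None] * Q[history]).sum(axis=0))
```

In a real system the parameters would be learned by optimizing a recommendation-aware objective, which is where NAIS gains over fixed statistical similarities.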
https://www.computer.org/csdl/trans/tk/2018/11/08307089-abs.html
The k Nearest Neighbor (k-NN) query has been gaining importance in a wide range of applications involving information retrieval, data mining, and databases. In order to trade off accuracy for efficiency, approximate solutions for the k-NN query have been extensively explored. However, their precision is usually order-insensitive: it is defined on the result set instead of the result sequence, and in many situations it cannot reasonably reflect the quality of the query result. In this paper, we focus on the approximate k-NN query problem with an order-sensitive precision requirement and propose a novel scheme based on the projection-filter-refinement framework. Basically, we adopt PCA to project the high-dimensional data objects into a low-dimensional space. Then, a filter condition is inferred to execute efficient pruning over the projected data. In addition, an index strategy named OR-tree is proposed to reduce the I/O cost. Extensive experiments based on several real-world data sets and a synthetic data set are conducted to verify the effectiveness and efficiency of the proposed solution. Compared to the state-of-the-art methods, our method can support order-sensitive k-NN queries with higher result precision while retaining satisfactory CPU and I/O efficiency.
10/05/2018 2:08 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2812153

Efficient Detection of Soft Concatenation Mapping
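The projection-filter-refinement framework of the order-sensitive k-NN abstract can be sketched as follows: PCA-project the data, keep the closest candidates in the projected space, then re-rank them by exact distance so the output sequence is order-correct. The candidate-set size heuristic is our own assumption; the paper's inferred filter condition and OR-tree index are omitted:

```python
import numpy as np

def pca_knn(data, query, k, dim=2, candidate_factor=5):
    """Approximate order-sensitive k-NN: filter candidates by distance in a
    PCA-projected low-dimensional space, then refine with exact distances."""
    mean = data.mean(axis=0)
    X = data - mean
    # principal axes via SVD of the centered data
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    proj = Vt[:dim]
    Xp = X @ proj.T
    qp = (query - mean) @ proj.T
    # filter: keep the closest candidates in the projected space
    n_cand = min(len(data), candidate_factor * k)
    cand = np.argsort(((Xp - qp) ** 2).sum(axis=1))[:n_cand]
    # refine: exact distances in the original space give an ordered result
    exact = ((data[cand] - query) ** 2).sum(axis=1)
    return cand[np.argsort(exact)][:k]
```

The refinement step is what makes the result order-sensitive: the returned indices are sorted by true distance, not merely a correct set.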
https://www.computer.org/csdl/trans/tk/2018/11/08307189-abs.html
In modern big data warehouse systems, we observe a common phenomenon that a column of data values can be derived from one or several other columns by transforming and concatenating these columns. We call this relationship between columns a Soft Concatenation Mapping (SCM). SCMs imply significant redundancy in the schema or data, and therefore can be exploited for data integration or data compression. In this paper, we formalize the problem of SCM detection and prove it is NP-hard. We then propose efficient approximate algorithms to detect all SCMs or an optimal set of SCMs in a table. Our experiments on both real-world and synthetic datasets show promising results.
10/05/2018 2:07 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2812822

A New Query Recommendation Method Supporting Exploratory Search Based on Search Goal Shift Graphs
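Once a candidate set of source columns and transformations is fixed, verifying a soft concatenation mapping is a direct check; the NP-hard part the SCM paper addresses is searching over candidates. A minimal sketch of the verification step (the helper's signature is our own illustration):

```python
def check_concat_mapping(rows, target, sources, transforms, sep=""):
    """Return True iff, in every row, the target column equals the
    concatenation of the transformed source-column values."""
    for row in rows:
        derived = sep.join(t(row[c]) for c, t in zip(sources, transforms))
        if derived != row[target]:
            return False
    return True
```

For example, an `email` column derived from lowercased `first` and `last` names would pass this check, exposing the redundancy the paper exploits for integration or compression.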
https://www.computer.org/csdl/trans/tk/2018/11/08315046-abs.html
Exploratory search is an increasingly important activity for Web searchers, yet current search systems cannot provide sufficient support for it. We therefore conducted an in-depth analysis of exploratory search processes and found that search goal shifts occur frequently during exploratory search. Based on this observation, we designed a new query recommendation method to support exploratory search. First, according to the behavioral characteristics of searchers during search goal shifts, all the queries submitted in the search goal shift processes are extracted from search engine logs using machine learning. We then use these queries to build a search goal shift graph; finally, a random walk algorithm is applied to the graph to obtain the query recommendations. In addition, we demonstrate the effectiveness of the method for exploratory search through comparative experiments against other methods.
10/05/2018 2:08 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2815544

Efficient Detection of Overlapping Communities Using Asymmetric Triangle Cuts
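The final recommendation step of the search-goal-shift method, a random walk over the query graph, might look like this minimal sketch (a restart-style walk over an adjacency matrix; the restart probability and iteration count are illustrative assumptions, and the paper does not specify this exact variant):

```python
import numpy as np

def random_walk_recommend(adj, start, restart=0.15, iters=100):
    """Rank queries by the visit probabilities of a random walk with restart
    from `start` over a search-goal-shift graph given as an adjacency matrix."""
    A = np.asarray(adj, dtype=float)
    P = A / A.sum(axis=1, keepdims=True)      # row-stochastic transitions
    r = np.zeros(len(A)); r[start] = 1.0      # restart distribution
    p = r.copy()
    for _ in range(iters):
        p = (1 - restart) * (P.T @ p) + restart * r
    return np.argsort(-p), p                  # ranked query ids, scores
```

Queries ranked just below the starting query (which trivially scores highest) are the natural recommendation candidates.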
https://www.computer.org/csdl/trans/tk/2018/11/08315057-abs.html
Real social networks contain many communities, where members within each community are densely connected with each other, while they are sparsely connected with members outside the community. Since each member can join multiple communities simultaneously, communities in social networks usually overlap with each other. How to efficiently and effectively identify overlapping communities in a large social network has become a fundamental problem in the big data era. Most existing studies on community finding focus on non-overlapping communities based on several well-known community fitness metrics. However, recent investigations have shown that these fitness metrics may suffer from free rider and separation effects, where the overlapping region of two communities always belongs to the denser one rather than to both of them. In this paper, we study the overlapping community detection problem in social networks, taking into consideration not only the quality of the found overlapping communities but also the free rider and separation effects on them. Specifically, we first propose a novel community fitness metric, the triangle-based fitness metric, for overlapping community detection that can minimize the free rider and separation effects on the found overlapping communities, and we show that the problem is NP-hard. We then propose an efficient yet scalable algorithm for the problem that can deliver a feasible solution. We finally validate the effectiveness of the proposed fitness metric and evaluate the performance of the proposed algorithm through extensive experiments on real-world datasets with over 100 million vertices and edges. Experimental results demonstrate that the proposed algorithm is very promising.
10/05/2018 2:06 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2815554

Matching Heterogeneous Event Data
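The basic quantity behind any triangle-based fitness metric, the number of triangles falling entirely inside a community, can be sketched as follows (brute-force enumeration for clarity; the paper's metric and its scalable algorithm are more involved):

```python
from itertools import combinations

def triangles_within(edges, community):
    """Count triangles whose three vertices all lie inside `community`."""
    nodes = set(community)
    adj = {v: set() for v in nodes}
    for u, v in edges:
        if u in nodes and v in nodes:
            adj[u].add(v)
            adj[v].add(u)
    return sum(1 for a, b, c in combinations(sorted(nodes), 3)
               if b in adj[a] and c in adj[a] and c in adj[b])
```

Counting triangles rather than single edges rewards internally cohesive regions, which is what lets such metrics credit an overlap to both communities instead of only the denser one.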
https://www.computer.org/csdl/trans/tk/2018/11/08315460-abs.html
Identifying events from different sources is essential to various business process applications such as provenance querying or process mining. Distinct features of heterogeneous events, including opaque names and dislocated traces, prevent existing data integration techniques from performing well. To address these issues, in this paper, (1) we propose an event similarity function by iteratively evaluating similar neighbors. (2) In addition to event nodes, we further employ the similarity of edges (indicating relationships among events) in event matching. We prove NP-hardness of finding the optimal event matching w.r.t. node and edge similarities, and propose an efficient heuristic for event matching. Experiments demonstrate that the proposed event matching approach can achieve significantly higher accuracy than state-of-the-art matching methods. In particular, by considering the event edge similarity, our heuristic matching algorithm further improves the matching accuracy without introducing much overhead.
10/05/2018 2:08 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2815695

Rule-Based Entity Resolution on Database with Hidden Temporal Information
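The "iteratively evaluating similar neighbors" idea in the event matching abstract is in the spirit of SimRank-style propagation; here is a minimal sketch of that family (not the paper's similarity function, which also copes with opaque names and uses edge similarity):

```python
def simrank(graph, iters=5, c=0.8):
    """Iterative neighbor-based similarity: two nodes are similar to the
    extent that their neighbors are similar. `graph` maps node -> neighbors."""
    nodes = sorted(set(graph) | {v for vs in graph.values() for v in vs})
    nbrs = {n: graph.get(n, []) for n in nodes}
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iters):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                elif nbrs[a] and nbrs[b]:
                    total = sum(sim[(x, y)] for x in nbrs[a] for y in nbrs[b])
                    new[(a, b)] = c * total / (len(nbrs[a]) * len(nbrs[b]))
                else:
                    new[(a, b)] = 0.0
        sim = new
    return sim
```

Two events with no lexical overlap but structurally identical neighborhoods score highly under such a scheme, which is exactly what opaque event names require.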
https://www.computer.org/csdl/trans/tk/2018/11/08316959-abs.html
In this paper, we deal with the problem of rule-based entity resolution on imprecise temporal data. Entity resolution (ER) has been widely explored in the research community, but the problem on temporal data, especially without available timestamps, has not yet been well studied. As time elapses, records referring to the same entity but observed in different time periods may differ. Besides traditional similarity-based ER approaches, carefully exploring several data quality rules, e.g., matching dependencies and data currency, yields much information that helps cope with this problem. In this paper, we use such rules to derive the time order of temporal records and the trend of their attributes' evolution over time. Specifically, we first group records into smaller blocks, and then, by exploring data currency constraints, we propose a temporal clustering approach with two steps, i.e., skeleton clustering and banding clustering. Experimental results on both real and synthetic data show that our entity resolution method can achieve both high accuracy and efficiency on datasets with hidden temporal information.
10/05/2018 2:07 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2816018

Identifying Genetic Risk Factors for Alzheimer's Disease via Shared Tree-Guided Feature Learning Across Multiple Tasks
https://www.computer.org/csdl/trans/tk/2018/11/08317000-abs.html
The genome-wide association study (GWAS) is a popular approach to identify disease-associated genetic factors for Alzheimer's Disease (AD). However, it remains challenging because of the small number of samples, the very high feature dimensionality, and the complex structures among features. To accurately identify genetic risk factors for AD, we propose a novel method based on an in-depth exploration of the hierarchical structure among the features and the commonality across related tasks. Specifically, we first extract and encode the tree hierarchy among features; then, we integrate the tree structures with multi-task feature learning (MTFL) to learn, simultaneously across related tasks, the shared features that are predictive of AD. Thus, we can unify the strength of both the prior structure information and MTFL to boost prediction performance. However, due to the highly complex regularizer that encodes the tree structure and the extremely high feature dimensionality, the learning process can be computationally prohibitive. To address this, we further develop a novel safe screening rule to quickly identify and remove irrelevant features before training. Experimental results demonstrate that the proposed approach significantly outperforms the state-of-the-art in detecting genetic risk factors of AD, and that the speedup gained by the proposed screening can be several orders of magnitude.
10/05/2018 2:07 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2816029

Conditional Reliability in Uncertain Graphs
https://www.computer.org/csdl/trans/tk/2018/11/08318619-abs.html
Network reliability is a well-studied problem that requires measuring the probability that a target node is reachable from a source node in a probabilistic (or uncertain) graph, i.e., a graph where every edge is assigned a probability of existence. Many approaches and problem variants have been considered in the literature, with the majority of them assuming that edge-existence probabilities are fixed. Nevertheless, in real-world graphs, edge probabilities typically depend on external conditions. In metabolic networks, a protein can be converted into another protein with some probability depending on the presence of certain enzymes. In social influence networks, the probability that a tweet of some user will be retweeted by her followers depends on whether the tweet contains specific hashtags. In transportation networks, the probability that a network segment will work properly might depend on external conditions such as weather or time of day. In this paper, we overcome this limitation and focus on conditional reliability, that is, assessing reliability when edge-existence probabilities depend on a set of conditions. In particular, we study the problem of determining the top-k conditions that maximize the reliability between two nodes. We deeply characterize our problem and show that, even employing polynomial-time reliability-estimation methods, it is NP-hard, does not admit any PTAS, and its underlying objective function is non-submodular. We then devise a practical method that targets both accuracy and efficiency. We also study natural generalizations of the problem with multiple source and target nodes. An extensive empirical evaluation on several large, real-life graphs demonstrates the effectiveness and scalability of our methods.
10/05/2018 2:07 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2816653

SLA Definition for Multi-Tenant DBMS and its Impact on Query Optimization
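For background on the conditional reliability abstract, the standard reliability estimator such work builds on is Monte Carlo sampling of possible worlds: keep each edge with its probability, then test s-t reachability. A textbook sketch (sample count and seed are arbitrary choices, not the paper's):

```python
import random

def reliability(edges, source, target, samples=2000, seed=7):
    """Monte Carlo estimate of s-t reliability in an uncertain graph given
    as (u, v, probability) triples of directed edges."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        adj = {}
        for u, v, p in edges:            # sample one possible world
            if rng.random() < p:
                adj.setdefault(u, []).append(v)
        stack, seen = [source], {source}  # DFS reachability check
        while stack:
            u = stack.pop()
            if u == target:
                hits += 1
                break
            for w in adj.get(u, []):
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
    return hits / samples
```

The conditional variant would re-run such an estimator under each candidate set of conditions, which is why the top-k condition search remains hard even with a polynomial-time estimator.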
https://www.computer.org/csdl/trans/tk/2018/11/08319945-abs.html
In the cloud context, users are often called tenants, and a cloud DBMS shared by many tenants is called a multi-tenant DBMS. The resource consolidation in such a DBMS allows tenants to pay only for the resources they consume, while providing the opportunity for the provider to increase its economic gain. To this end, a Service Level Agreement (SLA) is usually established between the provider and a tenant. However, in current systems, the SLA is often defined by the provider, and the tenant must agree to it before using the service. In addition, only the availability objective is described in the SLA, not the performance objective. In this paper, an SLA negotiation framework is proposed in which the provider and the tenant define the performance objective together in a fair way. To demonstrate the feasibility and the advantage of this framework, we evaluate its impact on query optimization. We formally define the problem by including the cost-efficiency aspect; we design a cost model and study the plan search space for this problem; we revise two search methods to adapt them to the new context; and we propose a heuristic to solve the resource contention problem caused by the concurrent queries of multiple tenants. We also conduct a performance evaluation to show that our optimization approach (i.e., driven by the SLA) can be much more cost-effective than the traditional approach, which always minimizes query completion time.
10/05/2018 2:08 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2817235

BEATS: Blocks of Eigenvalues Algorithm for Time Series Segmentation
https://www.computer.org/csdl/trans/tk/2018/11/08319952-abs.html
The massive collection of data via emerging technologies like the Internet of Things (IoT) requires finding optimal ways to reduce the number of observations in time series analysis. IoT time series require aggregation methods that can preserve and represent the key characteristics of the data. In this paper, we propose a segmentation algorithm that adapts to unannounced mutations of the data (i.e., data drifts). The algorithm splits the data streams into blocks, groups them in square matrices, computes the Discrete Cosine Transform (DCT), and quantizes them. The key information is contained in the upper-left part of the resulting matrix. We extract this sub-matrix, compute the moduli of its eigenvalues, and remove duplicates. The algorithm, called BEATS, is designed to tackle dynamic IoT streams whose distribution changes over time. We conduct experiments with six datasets combining real, synthetic, and drifting data. Compared to other segmentation methods such as Symbolic Aggregate approXimation (SAX), BEATS shows significant improvements, and it provides efficient results when combined with classification and clustering algorithms. BEATS is an effective mechanism for working with dynamic and multi-variate data, making it suitable for IoT data sources. The datasets, the code of the algorithm, and the analysis results can be accessed publicly at https://github.com/auroragonzalez/BEATS.
10/05/2018 2:06 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2817229

Online Product Quantization
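The BEATS pipeline described above (blocks, square matrices, DCT, quantization, upper-left sub-matrix, eigenvalue moduli with duplicates removed) can be sketched as follows. Block size, sub-matrix size, and quantization step are illustrative defaults, not the paper's tuned values:

```python
import numpy as np

def dct2(block):
    """2-D type-II DCT (orthonormal), via separable 1-D transforms."""
    n = block.shape[0]
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    C = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2 / n)
    C[0] /= np.sqrt(2)
    return C @ block @ C.T

def beats_segment(stream, block=4, keep=2, quant=1.0):
    """BEATS-style aggregation: pack the stream into block x block matrices,
    DCT, quantize, keep the upper-left keep x keep sub-matrix, and emit the
    moduli of its eigenvalues with duplicates removed."""
    x = np.asarray(stream, dtype=float)
    n = block * block
    out = []
    for start in range(0, len(x) - n + 1, n):
        M = x[start:start + n].reshape(block, block)
        D = np.round(dct2(M) / quant)          # quantized DCT coefficients
        sub = D[:keep, :keep]                   # low-frequency corner
        mods = np.unique(np.round(np.abs(np.linalg.eigvals(sub)), 6))
        out.append(mods.tolist())
    return out
```

Each block of raw samples is thus reduced to a handful of eigenvalue moduli, which is the aggregated representation downstream classifiers and clusterers would consume.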
https://www.computer.org/csdl/trans/tk/2018/11/08320306-abs.html
Approximate nearest neighbor (ANN) search has achieved great success in many tasks. However, existing popular methods for ANN search, such as hashing and quantization methods, are designed for static databases only. They cannot handle well databases whose data distribution evolves dynamically, due to the high computational effort of retraining the model on the new database. In this paper, we address the problem by developing an online product quantization (online PQ) model that incrementally updates the quantization codebook to accommodate the incoming streaming data. Moreover, to further alleviate the large-scale computation of the online PQ update, we design two budget constraints for the model to update part of the PQ codebook instead of all of it. We derive a loss bound which guarantees the performance of our online PQ model. Furthermore, we develop an online PQ model over a sliding window, with both data insertion and deletion supported, to reflect the real-time behavior of the data. The experiments demonstrate that our online PQ model is both time-efficient and effective for ANN search in dynamic large-scale databases compared with baseline methods, and that the idea of partial PQ codebook update further reduces the update cost.
10/05/2018 2:06 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2817526

Classifier Ensemble by Exploring Supplementary Ordering Information
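A minimal sketch of the machinery underlying online PQ: standard product-quantization encoding plus an incremental, per-centroid codebook update toward each incoming point. The learning-rate update rule is our own illustrative stand-in; the paper's loss bound, budget constraints, and sliding-window variant are not modeled:

```python
import numpy as np

def pq_encode(X, codebooks):
    """Product-quantization encoding: split each vector into m sub-vectors
    and store, per subspace, the index of its nearest centroid."""
    m = len(codebooks)
    subs = np.split(X, m, axis=1)
    return np.stack(
        [((s[:, None, :] - cb[None]) ** 2).sum(-1).argmin(axis=1)
         for s, cb in zip(subs, codebooks)], axis=1)

def online_update(codebooks, x, codes, lr=0.1):
    """Online-PQ-style update: nudge each assigned centroid toward the
    corresponding sub-vector of the incoming point."""
    m = len(codebooks)
    for cb, sub, c in zip(codebooks, np.split(x, m), codes):
        cb[c] += lr * (sub - cb[c])
```

Only the centroids actually assigned to the new point move, which is the intuition behind updating part of the codebook instead of retraining it wholesale.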
https://www.computer.org/csdl/trans/tk/2018/11/08322055-abs.html
Supplementary information has proven particularly useful in many machine learning tasks. For a set of trained base classifiers in ensemble learning, the previous literature likewise contains abundant implicit supplementary information about their performance orderings. However, few classifier ensemble studies consider exploring and utilizing such supplementary information. The current study proposes a new learning method for stacked classifier ensembles that considers the implicit supplementary ordering information regarding a set of trained classifiers. First, a new metric learning algorithm for measuring the similarities between two arbitrary learning tasks is introduced. Second, supplementary ordering information for the trained classifiers of a given learning task is inferred based on the learned similarities and related performance results reported in the previous literature. Third, a set of ordered soft constraints is generated based on the supplementary ordering information, and finding the optimal combination weights of the trained classifiers is formalized as a goal programming problem; the optimal combination weights are then obtained. Finally, experimental results verify the effectiveness of the proposed classifier ensemble method.
10/05/2018 2:08 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2818138

NHAD: Neuro-Fuzzy Based Horizontal Anomaly Detection in Online Social Networks
https://www.computer.org/csdl/trans/tk/2018/11/08322278-abs.html
Social networks are a basic part of everyday life. With the advent of more and more online social media, the information available and its utilization have come under the threat of several anomalies. Anomalies are a major cause of online fraud, enabling information access by unauthorized users as well as information forging. One anomaly that acts as a silent attacker is the horizontal anomaly: an anomaly caused by a user's variable behavior towards different sources. Horizontal anomalies are difficult to detect and hazardous for any network. In this paper, a self-healing neuro-fuzzy approach (NHAD) is used for the detection, recovery, and removal of horizontal anomalies efficiently and accurately. The proposed approach operates over five paradigms, namely missing links, reputation gain, significant difference, trust properties, and trust score. It is evaluated with three datasets: the DARPA'98 benchmark dataset, a synthetic dataset, and real-time traffic. Results show that the accuracy of the proposed NHAD model for 10 to 30 percent anomalies in the synthetic dataset ranges between 98.08 and 99.88 percent. The evaluation on the DARPA'98 dataset demonstrates that the proposed approach is better than existing solutions, providing a 99.97 percent detection rate for the anomalous class. For real-time traffic, the proposed NHAD model operates with an average accuracy of 99.42 percent at a 99.90 percent detection rate.
10/05/2018 2:07 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2818163

EMOMA: Exact Match in One Memory Access
https://www.computer.org/csdl/trans/tk/2018/11/08323198-abs.html
An important function in modern routers and switches is to perform a lookup for a key. Hash-based methods, and in particular cuckoo hash tables, are popular for such lookup operations, but for large structures stored in off-chip memory, such methods have the downside that they may require more than one off-chip memory access to perform the key lookup. Although the number of off-chip memory accesses can be reduced using on-chip approximate membership structures such as Bloom filters, some lookups may still require more than one off-chip memory access. This can be problematic for some hardware implementations, as having only a single off-chip memory access enables a predictable processing of lookups and avoids the need to queue pending requests. We provide a data structure for hash-based lookups based on cuckoo hashing that uses only one off-chip memory access per lookup, by utilizing an on-chip pre-filter to determine which of multiple locations holds a key. We make particular use of the flexibility to move elements within a cuckoo hash table to ensure the pre-filter always gives the correct response. While this requires a slightly more complex insertion procedure and some additional memory accesses during insertions, it is suitable for most packet processing applications where key lookups are much more frequent than insertions. An important feature of our approach is its simplicity. Our approach is based on simple logic that can be easily implemented in hardware, and hardware implementations would benefit most from the single off-chip memory access per lookup.
10/05/2018 2:07 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2818716

High-Order Proximity Preserved Embedding for Dynamic Networks
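The one-access lookup idea from the EMOMA abstract can be sketched as follows, with dictionaries standing in for off-chip buckets and a small array standing in for the on-chip pre-filter. To keep the pre-filter always correct, this toy version simply places a colliding key in the table its filter cell already points to, whereas the paper instead moves elements within the cuckoo table; the class and its parameters are illustrative assumptions:

```python
class OneAccessTable:
    """Toy sketch of exact match in one memory access: a key can live in
    one of two tables, and a per-cell pre-filter records which table to
    read, so every lookup touches exactly one table."""

    def __init__(self, filter_bits=1024):
        self.bits = [None] * filter_bits   # on-chip pre-filter stand-in
        self.tables = [dict(), dict()]     # off-chip bucket stand-ins

    def _cell(self, key):
        return hash(key) % len(self.bits)

    def insert(self, key, value):
        c = self._cell(key)
        if self.bits[c] is None:           # free cell: balance the load
            self.bits[c] = 0 if len(self.tables[0]) <= len(self.tables[1]) else 1
        # colliding keys follow the existing cell, keeping the filter correct
        self.tables[self.bits[c]][key] = value

    def lookup(self, key):
        c = self._cell(key)
        if self.bits[c] is None:
            return None
        return self.tables[self.bits[c]].get(key)  # one "off-chip" access
```

The essential property, visible even in this toy, is that the lookup path reads the pre-filter and then exactly one table, never both.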
https://www.computer.org/csdl/trans/tk/2018/11/08329541-abs.html
Network embedding, which aims to embed a network into a low-dimensional vector space while preserving its inherent structural properties, has attracted considerable attention. However, most existing embedding methods focus on static networks while neglecting the evolving nature of real-world networks. Meanwhile, most previous methods cannot well preserve high-order proximity, a critical structural property of networks. These problems motivate us to seek an effective and efficient way to preserve high-order proximity in embedding vectors as networks evolve over time. In this paper, we propose a novel method, Dynamic High-order Proximity preserved Embedding (DHPE). Specifically, we adopt the generalized SVD (GSVD) to preserve high-order proximity. Then, by transforming the GSVD problem into a generalized eigenvalue problem, we propose a generalized eigen-perturbation to incrementally update the results of GSVD to incorporate the changes of dynamic networks. Further, we propose an accelerated solution to the DHPE model so that it achieves linear time complexity with respect to the number of nodes and the number of changed edges in the network. Our empirical experiments on one synthetic network and several real-world networks demonstrate the effectiveness and efficiency of the proposed method.
10/05/2018 2:06 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2822283

Correction to A Survey of Location Prediction on Twitter
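For background on the DHPE abstract, one standard instantiation of high-order proximity is the (truncated) Katz proximity, S = sum over i of beta^i A^i, which weights longer paths geometrically less. The GSVD machinery and incremental eigen-perturbation of the paper are beyond this sketch; beta and the truncation order are illustrative:

```python
import numpy as np

def katz_proximity(A, beta=0.1, order=5):
    """Truncated Katz high-order proximity: S = sum_{i=1..order} beta^i A^i."""
    A = np.asarray(A, dtype=float)
    S = np.zeros_like(A)
    Ak = np.eye(len(A))  # running power of A
    b = 1.0              # running power of beta
    for _ in range(order):
        Ak = Ak @ A
        b *= beta
        S += b * Ak
    return S
```

An embedding that preserves S (e.g., via a low-rank factorization) then captures multi-hop relationships, not just direct edges; DHPE's contribution is keeping such a factorization current as edges change.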
https://www.computer.org/csdl/trans/tk/2018/11/08482519-abs.html
10/05/2018 2:07 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2867987

Influence Maximization on Social Graphs: A Survey
https://www.computer.org/csdl/trans/tk/2018/10/08295265-abs.html
Influence Maximization (IM), which selects a set of k users (called a seed set) from a social network so as to maximize the expected number of influenced users (called the influence spread), is a key algorithmic problem in social influence analysis. Due to its immense application potential and enormous technical challenges, IM has been extensively studied in the past decade. In this paper, we survey and synthesize a wide spectrum of existing studies on IM from an algorithmic perspective, with a special focus on the following key aspects: (1) a review of well-accepted diffusion models that capture the information diffusion process and build the foundation of the IM problem, (2) a fine-grained taxonomy that classifies existing IM algorithms based on their design objectives, (3) a rigorous theoretical comparison of existing IM algorithms, and (4) a comprehensive study of the application of IM techniques in combination with novel context features of social networks such as topic, location, and time. Based on this analysis, we then outline the key challenges and research directions for expanding the boundary of IM research.
09/13/2018 4:37 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2807843

Mining Summaries for Knowledge Graph Search
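A useful reference point for the algorithms such a survey classifies is the classic greedy algorithm with Monte Carlo spread estimation under the independent cascade model, a textbook sketch (uniform edge probability, sample count, and seed are illustrative simplifications):

```python
import random

def ic_spread(graph, seeds, p=0.1, samples=200, seed=1):
    """Monte Carlo estimate of expected spread under the independent
    cascade model: each edge activates its endpoint with probability p."""
    rng = random.Random(seed)
    total = 0
    for _ in range(samples):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            u = frontier.pop()
            for v in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    frontier.append(v)
        total += len(active)
    return total / samples

def greedy_im(graph, k, **kw):
    """Classic greedy seed selection: repeatedly add the node with the
    largest marginal gain in estimated spread."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    seeds = []
    for _ in range(k):
        best = max((n for n in nodes if n not in seeds),
                   key=lambda n: ic_spread(graph, seeds + [n], **kw))
        seeds.append(best)
    return seeds
```

This greedy scheme carries the well-known (1 - 1/e) approximation guarantee for submodular spread functions; much of the surveyed literature is about making it scale.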
https://www.computer.org/csdl/trans/tk/2018/10/08300649-abs.html
Querying heterogeneous and large-scale knowledge graphs is expensive. This paper studies a graph summarization framework to facilitate knowledge graph search. (1) We introduce a class of reduced summaries. Characterized by approximate graph pattern matching, these summaries are capable of summarizing entities in terms of their neighborhood similarity up to a certain hop, using small and informative graph patterns. (2) We study a diversified graph summarization problem: given a knowledge graph, discover the top-k summaries that maximize a bi-criteria function characterized by both informativeness and diversity. We show that diversified summarization is feasible for large graphs by developing both sequential and parallel summarization algorithms. (a) We show that there exists a 2-approximation algorithm to discover diversified summaries, and we further develop an anytime sequential algorithm which discovers summaries under resource constraints. (b) We present a new parallel algorithm with quality guarantees. The algorithm is parallel scalable, which ensures its feasibility on distributed graphs. (3) We also develop a summary-based query evaluation scheme, which refers only to a small number of summaries. Using real-world knowledge graphs, we experimentally verify the effectiveness and efficiency of our summarization algorithms and of query processing using summaries.
09/13/2018 4:38 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2807442

Top-k Critical Vertices Query on Shortest Path
https://www.computer.org/csdl/trans/tk/2018/10/08300661-abs.html
Shortest path query is one of the most fundamental and classic problems in graph analytics, which returns the complete shortest path between any two vertices. However, in many real-life scenarios, only critical vertices on the shortest path are desirable and it is unnecessary to search for the complete path. This paper investigates the shortest path sketch by defining a top-<inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="ma-ieq1-2808495.gif"/></alternatives></inline-formula> critical vertices (<inline-formula> <tex-math notation="LaTeX">$k$</tex-math><alternatives><inline-graphic xlink:href="ma-ieq2-2808495.gif"/></alternatives> </inline-formula>CV) query on the shortest path. Given a source vertex <inline-formula><tex-math notation="LaTeX">$s$ </tex-math><alternatives><inline-graphic xlink:href="ma-ieq3-2808495.gif"/></alternatives></inline-formula> and target vertex <inline-formula><tex-math notation="LaTeX">$t$</tex-math><alternatives> <inline-graphic xlink:href="ma-ieq4-2808495.gif"/></alternatives></inline-formula> in a graph, <inline-formula> <tex-math notation="LaTeX">$k$</tex-math><alternatives><inline-graphic xlink:href="ma-ieq5-2808495.gif"/></alternatives> </inline-formula>CV query can return the top-<inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="ma-ieq6-2808495.gif"/></alternatives></inline-formula> significant vertices on the shortest path <inline-formula><tex-math notation="LaTeX">$SP(s,t)$</tex-math><alternatives> <inline-graphic xlink:href="ma-ieq7-2808495.gif"/></alternatives></inline-formula>. The significance of the vertices can be predefined. The key strategy for seeking the sketch is to apply off-line preprocessed distance oracle to accelerate on-line real-time queries. This allows us to omit unnecessary vertices and obtain the most representative sketch of the shortest path directly. 
We further explore a series of methods and optimizations to answer <inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="ma-ieq8-2808495.gif"/></alternatives></inline-formula>CV query on both centralized and distributed platforms, using exact and approximate approaches, respectively. We evaluate our methods in terms of time, space complexity and approximation quality. Experiments on large-scale real-world networks validate that our algorithms are of high efficiency and accuracy.09/13/2018 4:38 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2808495Self-Tuned Descriptive Document Clustering Using a Predictive Network
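The path-membership test at the heart of a kCV query can be sketched with a classic identity: a vertex v lies on some shortest s-t path exactly when dist(s, v) + dist(v, t) = dist(s, t). The sketch below is a minimal illustration, with a plain Dijkstra standing in for the paper's precomputed distance oracle; the graph, `significance` scores, and function names are illustrative assumptions, not the paper's implementation:

```python
import heapq

def dijkstra(adj, src):
    """Single-source shortest distances over a weighted graph."""
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def kcv(adj, s, t, k, significance):
    """Top-k critical vertices: v lies on a shortest s-t path
    iff dist(s, v) + dist(v, t) == dist(s, t)."""
    ds, dt = dijkstra(adj, s), dijkstra(adj, t)
    inf = float("inf")
    on_path = [v for v in adj if ds.get(v, inf) + dt.get(v, inf) == ds[t]]
    return sorted(on_path, key=lambda v: -significance[v])[:k]

adj = {
    "a": [("b", 1), ("d", 2)], "b": [("a", 1), ("c", 1)],
    "c": [("b", 1), ("e", 1)], "d": [("a", 2), ("e", 2)],
    "e": [("c", 1), ("d", 2)],
}
significance = {"a": 0, "b": 5, "c": 3, "d": 9, "e": 1}
print(kcv(adj, "a", "e", 2, significance))  # → ['b', 'c']
```

With a real off-line distance oracle, the two Dijkstra runs become constant-time distance lookups, which is what makes the on-line query fast.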
https://www.computer.org/csdl/trans/tk/2018/10/08301532-abs.html
Descriptive clustering consists of automatically organizing data instances into clusters and generating a descriptive summary for each cluster. The description should inform a user about the contents of each cluster without further examination of the specific instances, enabling a user to rapidly scan for relevant clusters. Selection of descriptions often relies on heuristic criteria. We model descriptive clustering as an auto-encoder network that predicts features from cluster assignments and predicts cluster assignments from a subset of features. The subset of features used for predicting a cluster serves as its description. For text documents, the occurrence or count of words, phrases, or other attributes provides a sparse feature representation with interpretable feature labels. In the proposed network, cluster predictions are made using logistic regression models, and feature predictions rely on logistic or multinomial regression models. Optimizing these models leads to a completely self-tuned descriptive clustering approach that automatically selects the number of clusters and the number of features for each cluster. We applied the methodology to a variety of short text documents and showed that the selected clustering, as evidenced by the selected feature subsets, is associated with a meaningful topical organization.09/13/2018 4:38 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2781721Locality Reconstruction Models for Book Representation
https://www.computer.org/csdl/trans/tk/2018/10/08301545-abs.html
Books, as a representative of lengthy documents, convey rich semantics. Traditional document modeling methods, such as bag-of-words models, have difficulty capturing such rich semantics when only considering term-frequency features. In order to explore term spatial distributions over a book, a tree-structured book representation is investigated in this paper. Moreover, an efficient learning framework, Tree2Vector, is introduced for mapping tree-structured book data into vectorial space. In particular, we present two types of locality reconstruction (LR) models: Euclidean-type and cosine-type, during the transformation process of tree structures into vectorial representations. The LR is used for modeling the reconstruction process, in which each parent node in a tree is supposed to be reconstructed by its child nodes. The prominent advantage of this Tree2Vector framework is that it solely utilizes the local information within a single book tree. In addition, extensive experimental results demonstrate that Tree2Vector is able to deliver comparable or better performance in comparison to methods that consider the information of all trees in a database globally. Experimental results also suggest that cosine-type LR consistently performs better than Euclidean-type LR in applications of book and author recommendations.09/13/2018 4:37 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2808953<inline-formula><tex-math notation="LaTeX">$\sf {SIMkNN}$</tex-math><alternatives> <inline-graphic xlink:href="cao-ieq1-2808971.gif"/></alternatives></inline-formula>: A Scalable Method for in-Memory <italic>k</italic>NN Search over Moving Objects in Road Networks
https://www.computer.org/csdl/trans/tk/2018/10/08301596-abs.html
Nowadays, many location-based applications require the ability of querying <inline-formula> <tex-math notation="LaTeX">$k$</tex-math><alternatives><inline-graphic xlink:href="cao-ieq2-2808971.gif"/> </alternatives></inline-formula>-nearest neighbors over a very large scale of moving objects in road networks, e.g., taxi-calling and ride-sharing services. A traditional grid index with equal-sized cells cannot adapt to the skewed distribution of moving objects in real scenarios. Thus, to obtain fast query response times, the grid needs to be split into smaller cells, which introduces the side effect of higher memory cost, i.e., maintaining such a large volume of cells requires a much larger memory space at the server side. In this paper, we present <inline-formula><tex-math notation="LaTeX">$\sf {SIMkNN}$</tex-math><alternatives> <inline-graphic xlink:href="cao-ieq3-2808971.gif"/></alternatives></inline-formula>, a scalable and in-memory <italic>k</italic>NN query processing technique. <inline-formula><tex-math notation="LaTeX">$\sf {SIMkNN}$</tex-math> <alternatives><inline-graphic xlink:href="cao-ieq4-2808971.gif"/></alternatives></inline-formula> is dual-index driven, where we adopt an R-tree to store the topology of the road network and a <italic>hierarchical grid model</italic> to manage the moving objects in non-uniform distribution. To answer a <italic>k</italic>NN query in real time, <inline-formula><tex-math notation="LaTeX">$\sf {SIMkNN}$</tex-math><alternatives> <inline-graphic xlink:href="cao-ieq5-2808971.gif"/></alternatives></inline-formula> adopts the strategy of incrementally enlarging the search area for network-distance-based nearest neighbor evaluation. It is far from trivial to perform the space expansion within the hierarchical grid index. 
For a given cell, we first define its neighbors in different directions, then propose a cell communication technique which allows each cell in the hierarchical grid index to be aware of its neighbors at any time. Accordingly, an efficient space expansion algorithm to generate the estimation area is proposed. The experimental evaluation shows that <inline-formula><tex-math notation="LaTeX">$\sf {SIMkNN}$</tex-math><alternatives><inline-graphic xlink:href="cao-ieq6-2808971.gif"/></alternatives></inline-formula> outperforms the baseline algorithm in terms of time and memory efficiency.09/13/2018 4:38 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2808971Efficient Parallel Skyline Query Processing for High-Dimensional Data
https://www.computer.org/csdl/trans/tk/2018/10/08302507-abs.html
Given a set of multidimensional data points, skyline queries retrieve those points that are not dominated by any other points in the set. Due to the ubiquitous use of skyline queries, such as in preference-based query answering and decision making, and the large amount of data that these queries have to deal with, enabling their scalable processing is of critical importance. However, there are several outstanding challenges that have not been well addressed. More specifically, in this paper, we are tackling the data straggler and data skew challenges introduced by distributed skyline query processing, as well as the ensuing high computation cost of merging skyline candidates. We thus introduce a new efficient three-phase approach for large scale processing of skyline queries. In the first preprocessing phase, the data is partitioned along the Z-order curve. We utilize a novel data partitioning approach that formulates data partitioning as an optimization problem to minimize the size of intermediate data. In the second phase, each compute node partitions the input data points into disjoint subsets, and then performs the skyline computation on each subset to produce skyline candidates in parallel. In the final phase, we build an index and employ an efficient algorithm to merge the generated skyline candidates. Extensive experiments demonstrate that the proposed skyline algorithm achieves more than one order of magnitude enhancement in performance compared to existing state-of-the-art approaches.09/13/2018 4:38 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2809598Tensor-Based Big Data Management Scheme for Dimensionality Reduction Problem in Smart Grid Systems: SDN Perspective
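The dominance relation that defines a skyline can be made concrete in a few lines. This is only the naive quadratic formulation (assuming smaller values are better in every dimension), not the paper's partitioned three-phase algorithm:

```python
def dominates(p, q):
    """p dominates q (minimization): p is no worse in every dimension
    and strictly better in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Naive O(n^2) skyline: keep the points dominated by no other point."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

points = [(1, 4), (2, 2), (4, 1), (3, 3), (5, 5)]
print(skyline(points))  # → [(1, 4), (2, 2), (4, 1)]
```

The quadratic cost of this check on merged candidates is exactly the merging bottleneck that motivates the paper's indexed final phase.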
https://www.computer.org/csdl/trans/tk/2018/10/08302840-abs.html
Smart grid (SG) is an integration of the traditional power grid with advanced information and communication infrastructure for bidirectional energy flow between grid and end users. A huge amount of data is being generated by various smart devices deployed in SG systems. Such massive data generation from various smart devices in SG systems may lead to various challenges for the networking infrastructure deployed between users and the grid. Hence, an efficient data transmission technique is required for providing the desired QoS to the end users in this environment. Generally, the data generated by smart devices in SG has high dimensions in the form of multiple heterogeneous attributes, whose values change over time. The high dimensions of data may affect the performance of most of the designed solutions in this environment. Most of the existing schemes reported in the literature have complex operations for the data dimensionality reduction problem, which may deteriorate the performance of any implemented solution for this problem. To address these challenges, in this paper, a tensor-based big data management scheme is proposed for the dimensionality reduction problem of big data generated from various smart devices. In the proposed scheme, first the Frobenius norm is applied on high-order tensors (used for data representation) to minimize the reconstruction error of the reduced tensors. Then, an empirical probability-based control algorithm is designed to estimate an optimal path to forward the reduced data using software-defined networks for minimization of the network load and effective bandwidth utilization. The proposed scheme minimizes the transmission delay incurred during the movement of the dimensionally reduced data between different nodes. The efficacy of the proposed scheme has been evaluated using extensive simulations carried out on the data traces using ‘R’ programming and Matlab. 
The big data traces considered for evaluation consist of more than two million entries (2,075,259) collected at a one-minute sampling rate, with heterogeneous features such as voltage, energy, frequency, and electric signals. Moreover, a comparative study for different data traces and a real SG testbed is also presented to prove the efficacy of the proposed scheme. The results obtained depict the effectiveness of the proposed scheme with respect to parameters such as network delay, accuracy, and throughput.09/13/2018 4:38 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2809747Planning with Spatio-Temporal Search Control Knowledge
https://www.computer.org/csdl/trans/tk/2018/10/08303742-abs.html
Knowledge based approaches developed for AI planning can convert an intractable planning problem to a tractable one. Current techniques often use temporal logics to express Search Control Knowledge (SCK) in logic based planning. However, traditional temporal logics are limited in expressiveness since they are unable to express spatial constraints which are as important as temporal ones in many planning domains. To this end, we propose a two-dimensional (spatial and temporal) logic namely PPTL<inline-formula><tex-math notation="LaTeX">$^{\mathrm{SL}}$ </tex-math><alternatives><inline-graphic xlink:href="lu-ieq1-2810144.gif"/></alternatives></inline-formula> by temporalizing separation logic with PPTL (Propositional Projection Temporal Logic) which is well-suited to specify SCK involving both spatial and temporal constraints in planning. We prove that PPTL<inline-formula> <tex-math notation="LaTeX">$^{\mathrm{SL}}$</tex-math><alternatives><inline-graphic xlink:href="lu-ieq2-2810144.gif"/> </alternatives></inline-formula> is decidable essentially via an equisatisfiable translation from PPTL<inline-formula> <tex-math notation="LaTeX">$^{\mathrm{SL}}$</tex-math><alternatives><inline-graphic xlink:href="lu-ieq3-2810144.gif"/> </alternatives></inline-formula> to its restricted form. Moreover, we implement a tool, <italic>S-TSolver</italic>, which effectively computes plans under the guidance of the spatio-temporal SCK expressed by PPTL<inline-formula> <tex-math notation="LaTeX">$^{\mathrm{SL}}$</tex-math><alternatives><inline-graphic xlink:href="lu-ieq4-2810144.gif"/> </alternatives></inline-formula> formulas. The effectiveness of the tool is evaluated on selected benchmark domains from the International Planning Competition.09/13/2018 4:37 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2810144Semi-Supervised Feature Selection via Insensitive Sparse Regression with Application to Video Semantic Recognition
https://www.computer.org/csdl/trans/tk/2018/10/08304684-abs.html
Feature selection plays a significant role in dealing with high-dimensional data to avoid the curse of dimensionality. In many real applications, like video semantic recognition, handling a few labeled samples together with a large number of unlabeled samples from the same population is a recently addressed challenge in feature selection. To solve this problem, we propose a novel semi-supervised feature selection method via insensitive sparse regression (ISR). Specifically, we compute the soft label matrix by a special label propagation, which can predict the labels of the unlabeled data. To guarantee the robustness of ISR to falsely labeled instances or outliers, we propose the Insensitive Regression Model (IRM) by a capped <inline-formula><tex-math notation="LaTeX">$l_2$</tex-math><alternatives> <inline-graphic xlink:href="hou-ieq1-2810286.gif"/></alternatives></inline-formula>-<inline-formula> <tex-math notation="LaTeX">$l_p$</tex-math><alternatives> <inline-graphic xlink:href="hou-ieq2-2810286.gif"/></alternatives></inline-formula>-norm loss. The soft label is imposed as the weights of IRM to fully utilize the label information. Meanwhile, to perform feature selection, we incorporate an <inline-formula><tex-math notation="LaTeX"> $l_{2,q}$</tex-math><alternatives> <inline-graphic xlink:href="hou-ieq3-2810286.gif"/></alternatives></inline-formula>-norm regularizer with IRM as the structural sparsity constraint when <inline-formula><tex-math notation="LaTeX"> $0<q\leq 1$</tex-math><alternatives> <inline-graphic xlink:href="hou-ieq4-2810286.gif"/></alternatives> </inline-formula>. Moreover, we put forward an effective approach for solving the formulated non-convex optimization problem. We rigorously analyze the convergence and discuss the parameter determination problem. Extensive experimental results on several public data sets verify the effectiveness of our proposed algorithm in comparison with the state-of-the-art feature selection methods. 
Finally, we apply our method to video semantic recognition successfully.09/12/2018 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2810286We Like, We Post: A Joint User-Post Approach for Facebook Post Stance Labeling
https://www.computer.org/csdl/trans/tk/2018/10/08305481-abs.html
Web post and user stance labeling is challenging not only because of the informality and variation in language on the Web but also because of the lack of labeled data on fast-emerging new topics—even the labeled data we do have are usually heavily skewed. In this paper, we propose a joint user-post approach for stance labeling to mitigate the latter two difficulties. In labeling post stance, the proposed approach considers post content as well as posting and liking behavior, which involves users. Sentiment analysis is applied to posts to acquire their initial stance, and then the post and user stance are updated iteratively with correlated posting-related actions. The whole process works with limited labeled data, which solves the first problem. We use real interaction between authors and readers for stance labeling. Experimental results show that the proposed approach not only substantially improves content-based post stance labeling, but also yields better performance for the minor stance class, which solves the second problem.09/13/2018 4:37 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2810875Multi-Label Learning with Emerging New Labels
https://www.computer.org/csdl/trans/tk/2018/10/08305522-abs.html
In a multi-label learning task, an object possesses multiple concepts where each concept is represented by a class label. Previous studies on multi-label learning have focused on a fixed set of class labels, i.e., the class label set of test data is the same as that in the training set. In many applications, however, the environment is dynamic and new concepts may emerge in a data stream. In order to maintain a good predictive performance in this environment, a multi-label learning method must have the ability to detect and classify instances with emerging new labels. To this end, we propose a new approach called Multi-label learning with Emerging New Labels (<monospace>MuENL</monospace>). It has three functions: classify instances on currently known labels, detect the emergence of a new label, and construct a new classifier for each new label that works collaboratively with the classifier for known labels. In addition, we show that <monospace>MuENL</monospace> can be easily extended to handle sparse high dimensional data streams by simply reducing the original dimensionality, and then applying <monospace>MuENL</monospace> on the reduced dimensional space. Our empirical evaluation shows the effectiveness of <monospace>MuENL</monospace> on several benchmark datasets and <monospace>MuENLHD</monospace> on the sparse high dimensional Weibo dataset.09/13/2018 4:38 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2810872Supervised Search Result Diversification via Subtopic Attention
https://www.computer.org/csdl/trans/tk/2018/10/08305531-abs.html
Search result diversification aims to retrieve diverse results to satisfy as many different information needs as possible. Supervised methods have been proposed recently to learn ranking functions and they have been shown to produce superior results to unsupervised methods. However, these methods use implicit approaches based on the principle of Maximal Marginal Relevance (MMR). In this paper, we propose a learning framework for explicit result diversification where subtopics are explicitly modeled. Based on the information contained in the sequence of selected documents, we use the attention mechanism to capture the subtopics to be focused on while selecting the next document, which naturally fits our task of document selection for diversification. As a preliminary attempt, we employ recurrent neural networks and max pooling to instantiate the framework. We use both distributed representations and traditional relevance features to model documents in the implementation. The framework is flexible to model query intent in either a flat list or a hierarchy. Experimental results show that the proposed method significantly outperforms all the existing search result diversification approaches.09/13/2018 4:38 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2810873Automated Phrase Mining from Massive Text Corpora
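The Maximal Marginal Relevance principle that the abstract contrasts with can be written down directly: greedily pick the document maximizing λ·relevance minus (1−λ)·maximum similarity to the documents already selected. A minimal sketch of that baseline (the document names, relevance scores, and similarity table below are made-up illustrations, not data from the paper):

```python
def mmr_rank(docs, rel, sim, lam, k):
    """Greedy Maximal Marginal Relevance: at each step pick the document
    maximizing lam * relevance - (1 - lam) * max similarity to selected."""
    selected = []
    while len(selected) < k:
        best = max(
            (d for d in docs if d not in selected),
            key=lambda d: lam * rel[d]
            - (1 - lam) * max((sim[frozenset((d, s))] for s in selected), default=0.0),
        )
        selected.append(best)
    return selected

docs = ["d1", "d2", "d3", "d4"]
rel = {"d1": 0.9, "d2": 0.85, "d3": 0.8, "d4": 0.3}
sim = {frozenset(p): v for p, v in [
    (("d1", "d2"), 0.95), (("d1", "d3"), 0.1), (("d1", "d4"), 0.0),
    (("d2", "d3"), 0.2), (("d2", "d4"), 0.0), (("d3", "d4"), 0.0),
]}
print(mmr_rank(docs, rel, sim, lam=0.5, k=2))  # → ['d1', 'd3']
```

Note that d2, a near-duplicate of d1, is demoted despite its high relevance; the paper's subtopic-attention framework replaces this implicit redundancy penalty with explicit subtopic modeling.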
https://www.computer.org/csdl/trans/tk/2018/10/08306825-abs.html
As one of the fundamental tasks in text analysis, phrase mining aims at extracting quality phrases from a text corpus and has various downstream applications including information extraction/retrieval, taxonomy construction, and topic modeling. Most existing methods rely on complex, trained linguistic analyzers, and thus likely have unsatisfactory performance on text corpora of new domains and genres without extra, expensive adaptation. None of the state-of-the-art models, even data-driven models, is fully automated because they require human experts for designing rules or labeling phrases. In this paper, we propose a novel framework for automated phrase mining, <inline-formula> <tex-math notation="LaTeX">$\mathsf{AutoPhrase}$</tex-math><alternatives> <inline-graphic xlink:href="shang-ieq1-2812203.gif"/></alternatives></inline-formula>, which supports any language as long as a general knowledge base (e.g., Wikipedia) in that language is available, while benefiting from, but not requiring, a POS tagger. Compared to the state-of-the-art methods, <inline-formula><tex-math notation="LaTeX"> $\mathsf{AutoPhrase}$</tex-math><alternatives><inline-graphic xlink:href="shang-ieq2-2812203.gif"/></alternatives> </inline-formula> has shown significant improvements in both effectiveness and efficiency on five real-world datasets across different domains and languages. Besides, <inline-formula><tex-math notation="LaTeX">$\mathsf{AutoPhrase}$ </tex-math><alternatives><inline-graphic xlink:href="shang-ieq3-2812203.gif"/></alternatives></inline-formula> can be extended to model single-word quality phrases.09/13/2018 4:38 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2812203AMDO: An Over-Sampling Technique for Multi-Class Imbalanced Problems
https://www.computer.org/csdl/trans/tk/2018/09/08065074-abs.html
Multi-class imbalanced problems have attracted growing attention in real-world engineering classification tasks. The underlying skewed distribution of multiple classes poses difficulties for learning algorithms, which become more challenging when considering overlap between classes, lack of representative data, and mixed-type data. In this work, we address this problem in a data-oriented way. Motivated by a recently proposed over-sampling technique designed for numeric data sets, Mahalanobis Distance-based Over-sampling (MDO), we use this technique to capture the covariance structure of the minority class and to generate synthetic samples along the probability contours for learning algorithms. Based on MDO, we further improve the over-sampling strategy and generalize it for mixed-type data sets. The established technique, Adaptive Mahalanobis Distance-based Over-sampling (AMDO), introduces GSVD (Generalized Singular Value Decomposition) for mixed-type data, develops a partially balanced resampling scheme and optimizes the sample synthesis. Theoretical analysis is conducted to demonstrate the soundness of AMDO. Extensive experimental testing is performed on 15 multi-class imbalanced benchmarks and two data sets for precipitation phase recognition in comparison with several state-of-the-art multi-class imbalanced learning methods. The results validate the effectiveness and robustness of our proposal.08/07/2018 12:33 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2017.2761347Privacy Characterization and Quantification in Data Publishing
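The core MDO idea, synthesizing minority samples along equal-Mahalanobis-distance probability contours, can be sketched in two dimensions with a hand-rolled Cholesky factor. This illustrates plain MDO only (the paper's AMDO additionally handles mixed-type data via GSVD), and the sample data below is made up:

```python
import math, random

def mdo_synthesize(minority, n_new, rng=random.Random(0)):
    """2-D sketch of Mahalanobis Distance-based Over-sampling: each synthetic
    point lies on the same probability contour (equal Mahalanobis distance
    from the class mean) as a randomly chosen real minority sample."""
    n = len(minority)
    mx = sum(p[0] for p in minority) / n
    my = sum(p[1] for p in minority) / n
    # Sample covariance [[a, b], [b, c]] of the minority class.
    a = sum((p[0] - mx) ** 2 for p in minority) / (n - 1)
    b = sum((p[0] - mx) * (p[1] - my) for p in minority) / (n - 1)
    c = sum((p[1] - my) ** 2 for p in minority) / (n - 1)
    # Cholesky factor L (lower triangular) with Cov = L L^T.
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 ** 2)
    out = []
    for _ in range(n_new):
        px, py = rng.choice(minority)
        # Mahalanobis radius of the seed point: |L^(-1) (p - mean)|.
        z1 = (px - mx) / l11
        z2 = ((py - my) - l21 * z1) / l22
        r = math.hypot(z1, z2)
        # Rotate to a random direction on that contour, then map back.
        theta = rng.uniform(0.0, 2.0 * math.pi)
        u1, u2 = r * math.cos(theta), r * math.sin(theta)
        out.append((mx + l11 * u1, my + l21 * u1 + l22 * u2))
    return out

minority = [(0.0, 0.0), (2.0, 2.0), (4.0, 0.0), (2.0, -2.0)]
synthetic = mdo_synthesize(minority, 5)
```

Because each synthetic point shares the Mahalanobis radius of a real seed point, the covariance structure of the minority class is respected rather than smeared out.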
https://www.computer.org/csdl/trans/tk/2018/09/08276593-abs.html
The increasing interest in collecting large amounts of individuals’ data and publishing it publicly for purposes such as medical research, market analysis, and economic measures has created major privacy concerns about individuals’ sensitive information. To deal with these concerns, many Privacy-Preserving Data Publishing (PPDP) techniques have been proposed in the literature. However, they lack a proper privacy characterization and measurement. In this paper, we first present a novel multi-variable privacy characterization and quantification model. Based on this model, we are able to analyze the prior and posterior adversarial belief about attribute values of individuals. We can also analyze the sensitivity of any identifier in privacy characterization. Then, we show that privacy should not be measured based on one metric. We demonstrate how this could result in privacy misjudgment. We propose two different metrics for quantification of privacy leakage: distribution leakage and entropy leakage. Using these metrics, we analyzed some of the most well-known PPDP techniques such as <inline-formula><tex-math notation="LaTeX">$k$</tex-math> <alternatives><inline-graphic xlink:href="ibrahim-ieq1-2797092.gif"/></alternatives></inline-formula>-anonymity, <inline-formula><tex-math notation="LaTeX">$l$</tex-math><alternatives> <inline-graphic xlink:href="ibrahim-ieq2-2797092.gif"/></alternatives></inline-formula>-diversity, and <inline-formula> <tex-math notation="LaTeX">$t$</tex-math><alternatives><inline-graphic xlink:href="ibrahim-ieq3-2797092.gif"/> </alternatives></inline-formula>-closeness. Based on our framework and the proposed metrics, we can determine that all the existing PPDP schemes have limitations in privacy characterization. Our proposed privacy characterization and measurement framework contributes to better understanding and evaluation of these techniques. 
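As a concrete point of reference, the k in k-anonymity, one of the PPDP schemes the paper analyzes, is simply the size of the smallest equivalence class over the quasi-identifiers. A minimal sketch (the records and column names below are made-up illustrations):

```python
from collections import Counter

def k_anonymity(records, quasi_ids):
    """A table is k-anonymous when every combination of quasi-identifier
    values is shared by at least k records; return that minimum class size."""
    classes = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return min(classes.values())

records = [
    {"age": "3*", "zip": "130**", "disease": "flu"},
    {"age": "3*", "zip": "130**", "disease": "cold"},
    {"age": "3*", "zip": "130**", "disease": "flu"},
    {"age": "4*", "zip": "148**", "disease": "asthma"},
    {"age": "4*", "zip": "148**", "disease": "flu"},
]
print(k_anonymity(records, ["age", "zip"]))  # → 2
```

A single number like this is exactly the kind of one-metric privacy measure the paper argues is insufficient on its own.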
Thus, this paper provides a foundation for design and analysis of PPDP schemes.08/07/2018 12:33 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2797092On Generalizing Collective Spatial Keyword Queries
https://www.computer.org/csdl/trans/tk/2018/09/08278270-abs.html
With the proliferation of spatial-textual data such as location-based services and geo-tagged websites, spatial keyword queries are ubiquitous in real life. One example of a spatial keyword query is the so-called <italic>collective spatial keyword query</italic> (CoSKQ), which is to find, for a given query consisting of a query location and several query keywords, a set of objects which <italic>covers</italic> the query keywords collectively and has the smallest <italic>cost</italic> with respect to the query location. In the literature, many different functions were proposed for defining the <inline-formula><tex-math notation="LaTeX">${cost}$</tex-math><alternatives> <inline-graphic xlink:href="chan-ieq1-2800746.gif"/></alternatives></inline-formula> and correspondingly, many different approaches were developed for the CoSKQ problem. In this paper, we study the CoSKQ problem systematically by proposing <italic>a unified cost function</italic> and <italic>a unified approach</italic> for the CoSKQ problem (with the unified cost function). The unified cost function includes all existing cost functions as special cases and the unified approach solves the CoSKQ problem with the unified cost function in a unified way. Experiments were conducted on both real and synthetic datasets which verified our proposed approach.08/07/2018 12:33 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2800746On Power Law Growth of Social Networks
https://www.computer.org/csdl/trans/tk/2018/09/08280512-abs.html
What is the growth dynamics of social networks, like Facebook or WeChat? Does it truly exhibit exponential early-growth, as predicted by the celebrated models, like the Bass model? How about the dynamics of links, for which there are few published models? For the first time, we examine the growth of WeChat which is the largest online social network in China, together with several other real social networks. We observe Power-Law growth dynamics for both nodes and links, a fact that breaks the textbook models featuring Sigmoid curves. We propose <sc>NetTide</sc>, along with differential equations for the growth of nodes and links. Our model fits the growth dynamics of real social networks well; it encompasses many traditional growth dynamics as special cases, while remaining parsimonious in parameters. The <sc>NetTide</sc> for link growth is the first one of its kind, accurately fitting real data, and capturing densification phenomenon. We further formulate two stochastic generators, which interpret the growth of nodes and links through survival analysis and micro-level interactions within a social network, respectively. The proposed generators reproduce realistic growth dynamics of social networks. When applied on the WeChat data, our <sc> NetTide</sc> forecasted <inline-formula><tex-math notation="LaTeX">$\geq$</tex-math><alternatives> <inline-graphic xlink:href="zang-ieq1-2801844.gif"/></alternatives></inline-formula> 730 days ahead with 3 percent error.08/07/2018 12:33 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2801844Querying a Collection of Continuous Functions
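Power-law growth n(t) = c·t^β, the dynamics the paper observes in place of the sigmoid curves of textbook models, is typically fit by linear least squares in log-log space. A minimal sketch on synthetic data (this is not the paper's NetTide fitting procedure, only the standard log-log regression):

```python
import math

def fit_power_law(ts, ns):
    """Least-squares fit of n(t) = c * t**beta in log-log space."""
    xs = [math.log(t) for t in ts]
    ys = [math.log(n) for n in ns]
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    beta = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    c = math.exp(my - beta * mx)
    return c, beta

ts = list(range(1, 11))
ns = [3.0 * t ** 1.7 for t in ts]   # synthetic power-law growth data
c, beta = fit_power_law(ts, ns)     # recovers c ≈ 3.0, beta ≈ 1.7
```

A straight line on a log-log plot of nodes (or links) versus time is the signature that distinguishes this regime from exponential early growth.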
https://www.computer.org/csdl/trans/tk/2018/09/08283622-abs.html
We introduce a new query primitive called <italic>Function Query</italic> (FQ). An FQ operates on a set of math functions and retrieves the functions whose output with a given input satisfies a query condition (e.g., being among top k, within a given range). While FQ finds its natural uses in querying a database of math functions, it can also be applied on a database of discrete values. We show that by interpreting the database as a set of user-defined functions, FQ can achieve the same functionality as existing analytic queries such as top-k query and scalar product query. We address the challenge of efficient execution of FQ. The core of our solution is a novel data structure called <italic>Intersection-tree</italic>. Our research takes advantage of the fact that 1) the intersections of a set of continuous functions partition their domain into a number of <italic>subdomains</italic>, and 2) in each of these subdomains, the functions can be sorted based on their output. We evaluate the performance of the proposed techniques through analysis, prototyping, and experiments using both synthetic and real-world data. When querying a database of functions, our techniques scale well. When applied on a database of discrete values, our techniques are more versatile and outperform existing techniques in terms of various performance metrics.08/07/2018 12:33 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2802936On Efficiently Answering Why-Not Range-Based Skyline Queries in Road Networks
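In its simplest form, an FQ can be answered by brute force: evaluate every stored function at the query input and rank the outputs. The point of the paper's Intersection-tree is precisely to avoid this full scan by pre-sorting functions within the subdomains induced by their intersections; the sketch below (function names made up) shows only the brute-force semantics:

```python
def function_query_topk(funcs, x, k):
    """Brute-force FQ: evaluate every function at input x and return the
    names of the k functions with the largest outputs."""
    return [name for name, f in sorted(funcs.items(), key=lambda kv: -kv[1](x))][:k]

funcs = {"f1": lambda x: x, "f2": lambda x: x * x, "f3": lambda x: 2 - x}
print(function_query_topk(funcs, 0.5, 2))  # → ['f3', 'f1']
```

Because f1 and f3 intersect at x = 1, the top-k order flips between the subdomains x < 1 and x > 1, which is the observation the Intersection-tree exploits.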
https://www.computer.org/csdl/trans/tk/2018/09/08283816-abs.html
The range-based skyline (r-skyline) query on road networks retrieves the skyline objects for each of the query points that are within a road region, considering the objects’ spatial and non-spatial attributes. However, reasoning about missing query results, specified by <italic>why-not questions</italic>, has until recently received little of the attention it deserves. In this paper, we systematically carry out the study of why-not questions on the r-skyline query in the road network environment (abbrev. as the <italic>why-not RSQ problem</italic>). We present three modification strategies, including modifying the query range, modifying the why-not point, and modifying both of them, for supporting the why-not RSQ problem. We also propose three efficient algorithms to tackle the why-not RSQ problem, where several newly presented effective concepts/techniques are leveraged, such as the concepts of <italic>skyline scope</italic> and <italic>skyline dominance region</italic>, <italic>non-spatial attribute modification pruning</italic>, and the <italic>G-tree index</italic>. Extensive experimental evaluation using both real and synthetic data sets demonstrates the performance of our proposed algorithms.08/07/2018 12:33 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2803821A Fast Parallel Community Discovery Model on Complex Networks Through Approximate Optimization
https://www.computer.org/csdl/trans/tk/2018/09/08283822-abs.html
Community discovery plays an essential role in analyzing the structural features of complex networks. Since online networks grow increasingly large and complex over time, the methods traditionally used for community discovery cannot efficiently handle large-scale network data. This raises the important problem of how to discover large communities from complex networks both effectively and efficiently. In this study, we propose a fast parallel community discovery model called picaso (a <bold>p</bold>arallel commun<bold>i</bold>ty dis<bold>c</bold>overy <bold>a</bold>lgorithm ba<bold>s</bold>ed on approximate <bold>o</bold>ptimization), which integrates two new techniques: (1) the Mountain model, which uses graph theory to approximate the selection of nodes to be merged, and (2) the Landslide algorithm, which updates the modularity increment based on the approximate optimization. In addition, the GraphX distributed computing framework is employed to achieve parallel community detection over complex networks. In the proposed model, clustering on modularity is used to initialize the Mountain model and to compute the weight of each edge in the network. The relationships among the communities are then simplified by applying the Landslide algorithm, which allows us to obtain the community structures of the complex networks. Extensive experiments on real and synthetic complex network datasets demonstrate that the proposed algorithm outperforms state-of-the-art methods in both effectiveness and efficiency on the community detection problem. Moreover, its overall running time is approximately four times shorter than that of similar approaches. Overall, our results suggest a new paradigm for large-scale community discovery on complex networks.
08/07/2018 12:33 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2803818
Privacy Enhanced Matrix Factorization for Recommendation with Local Differential Privacy
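The picaso abstract above does not spell out the Mountain model or the Landslide algorithm; as an assumed toy illustration of the modularity increment that agglomerative community discovery optimizes, the standard formulation dQ(i, j) = 2(e_ij − a_i·a_j) can be sketched as follows, where e_ij is the fraction of edges between communities i and j and a_i the fraction of edge endpoints in i (the example graph and all names are illustrative assumptions):

```python
# Toy modularity-gain check on two triangle communities joined by one edge.
def modularity_gain(e_ij, a_i, a_j):
    """Change in modularity if communities i and j were merged."""
    return 2.0 * (e_ij - a_i * a_j)

edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
m = len(edges)
comm = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}        # two candidate communities
e01 = sum(1 for u, v in edges if comm[u] != comm[v]) / m
a0 = sum(1 for u, v in edges for x in (u, v) if comm[x] == 0) / (2 * m)
a1 = 1.0 - a0
# Negative gain: merging the two well-formed triangles would lower modularity.
print(round(modularity_gain(e01, a0, a1), 4))
```

A greedy agglomerative method repeatedly merges the pair with the largest positive gain; approximating which pairs to evaluate is where a scheme like the Mountain model would save work.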
https://www.computer.org/csdl/trans/tk/2018/09/08290673-abs.html
Recommender systems collect and analyze user data to provide a better user experience. However, privacy concerns arise when a recommender knows a user's set of items or their ratings. A number of solutions have been suggested to improve the privacy of legacy recommender systems, but the existing solutions in the literature can protect only the items or only the ratings. In this paper, we propose a recommender system that protects both a user's items and ratings. To this end, we develop novel matrix factorization algorithms under local differential privacy (LDP). In a recommender system with LDP, individual users randomize their data themselves to satisfy differential privacy and send the perturbed data to the recommender. The recommender then computes aggregates over the perturbed data. This framework ensures that both the user's items and ratings remain private from the recommender. However, applying LDP to matrix factorization typically raises utility issues: i) high dimensionality due to a large number of items, and ii) iterative estimation algorithms. To tackle these technical challenges, we adopt a dimensionality reduction technique and a novel binary mechanism based on sampling. We additionally introduce a factor that stabilizes the perturbed gradients. With the MovieLens and LibimSeTi datasets, we evaluate the recommendation accuracy of our system and demonstrate that our algorithm outperforms the existing differentially private gradient descent algorithm for matrix factorization under stronger privacy requirements.
08/07/2018 12:33 pm PST
http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2805356
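The binary mechanism in the abstract above is not spelled out; a minimal sketch of the classic LDP building block it alludes to is randomized response on a single bit with privacy budget eps, plus the recommender-side debiasing of the aggregate (function names, dataset size, and the budget value are illustrative assumptions, not the paper's algorithm):

```python
# Toy local-differential-privacy sketch: each user flips their bit with a
# calibrated probability; the recommender debiases the aggregate estimate.
import math
import random

def perturb_bit(bit, eps, rng=random):
    """Randomized response: report the true bit with prob e^eps / (e^eps + 1)."""
    p = math.exp(eps) / (math.exp(eps) + 1.0)
    return bit if rng.random() < p else 1 - bit

def debias_mean(reports, eps):
    """Unbiased estimate of the true mean from the perturbed bits:
    E[report] = (1 - p) + mu * (2p - 1), solved for mu."""
    p = math.exp(eps) / (math.exp(eps) + 1.0)
    return (sum(reports) / len(reports) - (1 - p)) / (2 * p - 1)

random.seed(0)
true_bits = [1] * 700 + [0] * 300                  # true mean = 0.7
reports = [perturb_bit(b, eps=1.0) for b in true_bits]
print(round(debias_mean(reports, eps=1.0), 2))     # close to 0.7
```

The recommender never sees a raw bit, yet the debiased aggregate converges to the true mean as the number of users grows; the paper's sampling-based mechanism and gradient stabilization address the much harsher noise of high-dimensional, iterative factorization.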