IEEE Transactions on Computers
https://www.computer.org/csdl/trans/tc/index.html
The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers, brief contributions, and comments on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.
IEEE Computer Society Digital Library
List of 100 recently published journal articles.
https://www.computer.org/csdl
Designing Checksums for Detecting Errors in Fast Unitary Transforms
https://www.computer.org/csdl/trans/tc/2018/04/08039530-abs.html
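The core idea in this article, checking a linear transform with parity sums, can be sketched in a few lines. The toy NumPy illustration below uses a plain all-ones checksum over a unitary DFT; it is my own minimal sketch of the general technique, not the checksum design from the paper.

```python
import numpy as np

# Because y = F x is linear, a precomputed row vector c = w^T F lets us
# verify w^T y == c x up to roundoff, catching a single corrupted output.
# (Illustrative sketch only; not the checksum construction from the paper.)

n = 8
F = np.fft.fft(np.eye(n)) / np.sqrt(n)   # unitary DFT matrix
w = np.ones(n)                           # simple all-ones checksum weights
c = w @ F                                # precomputed input-side checksum row

x = np.random.default_rng(0).standard_normal(n)
y = F @ x                                # the transform we want to protect

# Fault-free: the two parities agree down to roundoff noise.
assert abs(w @ y - c @ x) < 1e-10

# Inject a single error on one output line: the parity check detects it.
y_faulty = y.copy()
y_faulty[3] += 0.5
assert abs(w @ y_faulty - c @ x) > 1e-3
```

The same comparison can be applied between stages of a fast algorithm, which is where the error spaces discussed in the abstract come into play.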
Parity computations (checksums) over the input and output data of fast unitary transforms are compared, down to roundoff-noise level, to detect the effects of a single error on any one line between stages of the fast algorithm. Error spaces and their dual spaces guide the design process.
03/09/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2753774

ADAS on COTS with OpenCL: A Case Study with Lane Detection
https://www.computer.org/csdl/trans/tc/2018/04/08057795-abs.html
The concept of autonomous cars is driving a boost in car electronics, and the automotive electronics market is forecast to double in size by 2025. How to benefit from this boost is an interesting question. This article presents a case study testing the feasibility of using OpenCL as the programming language and COTS components as the underlying computing platforms for ADAS development. As a representative ADAS application, a scalable lane detection is developed that can tune the trade-off between detection accuracy and speed. Our OpenCL implementation is tested on 14 video streams from different datasets with different road scenarios on 5 COTS platforms. We demonstrate that the COTS platforms can provide more than sufficient computing power for the lane detection, while our OpenCL implementation exploits the massive parallelism these platforms provide.
03/09/2018 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2759203

MC-Fluid: Multi-Core Fluid-Based Mixed-Criticality Scheduling
https://www.computer.org/csdl/trans/tc/2018/04/08059775-abs.html
Owing to growing complexity and scale, safety-critical real-time systems are generally designed using the concept of mixed criticality, wherein applications with different criticality or importance levels are hosted on the same hardware platform. To guarantee non-interference between these applications, the hardware resources, in particular the processor, are statically partitioned among them. To overcome the inefficiencies in resource utilization of such a static scheme, the concept of mixed-criticality real-time scheduling has emerged as a promising solution. Although there are several studies on such scheduling strategies for uniprocessor platforms, the problem of efficient scheduling for the multiprocessor case has largely remained open. In this work, we design a fluid-model-based mixed-criticality scheduling algorithm for multiprocessors, in which multiple tasks are allowed to execute on the same processor simultaneously. We derive an exact schedulability test for this algorithm and present an optimal strategy for assigning fractional execution rates to tasks. Since fluid-model-based scheduling is not implementable on real hardware, we also present an algorithm that transforms a fluid schedule into a non-fluid one. We also show through experimental evaluation that the designed algorithms outperform existing scheduling algorithms in their ability to schedule a variety of task systems.
03/09/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2759765

Mapping and Scheduling Mixed-Criticality Systems with On-Demand Redundancy
https://www.computer.org/csdl/trans/tc/2018/04/08066347-abs.html
Embedded systems in domains such as avionics and automotive are subject to inspection by certification authorities. These authorities are interested in verifying the safety-critical aspects of a system and typically do not certify non-critical parts. The design of such Mixed-Criticality Systems (MCS) has received increasing attention in recent years. However, although MCS must be designed to tolerate transient faults, their susceptibility to such faults is often overlooked. In this paper, we consider the problem of mapping and scheduling efficient, certifiable MCS that can survive transient faults. We generalize previous MCS models and analyses to support On-Demand Redundancy (ODR). A task-set transformation is proposed that generates a modified task set supporting various forms of ODR while satisfying reliability and certification requirements. The analysis is incorporated into a design space exploration algorithm that supports a wide range of fault-tolerance mechanisms and heterogeneous platforms. Experiments show that ODR can improve the Quality of Service (QoS) provided to non-critical tasks by 29 percent on average compared to lockstep execution. Moreover, combining several fault-tolerance mechanisms can lead to additional improvements in schedulability and QoS.
03/09/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2762293

Utilization-Based Scheduling of Flexible Mixed-Criticality Real-Time Tasks
https://www.computer.org/csdl/trans/tc/2018/04/08068215-abs.html
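For context on the scheduling policy this article builds on: the classical EDF-VD utilization test (Baruah et al.) computes a virtual-deadline scaling factor from the low- and high-criticality utilizations. The sketch below implements that classical test, not the FMC extension derived in the paper; the task-set format and values are my own illustrative assumptions.

```python
def edf_vd_schedulable(tasks):
    """Classical EDF-VD sufficient test (the baseline that FMC refines).

    tasks: list of (C_LO, C_HI, T, crit) with crit in {'LO', 'HI'} and
    implicit deadlines D = T.  Returns (schedulable, scaling factor x).
    """
    u11 = sum(c_lo / t for c_lo, _, t, crit in tasks if crit == 'LO')
    u21 = sum(c_lo / t for c_lo, _, t, crit in tasks if crit == 'HI')
    u22 = sum(c_hi / t for _, c_hi, t, crit in tasks if crit == 'HI')
    if u11 + u22 <= 1.0:          # plain EDF suffices, no deadline scaling
        return True, 1.0
    if u11 >= 1.0:
        return False, None
    x = u21 / (1.0 - u11)         # virtual-deadline scaling factor for HI tasks
    ok = x * u11 + u22 <= 1.0     # sufficient schedulability condition
    return ok, (x if ok else None)

# Hypothetical task set: one LO task, one HI task (C_LO, C_HI, T, crit).
tasks = [(1, 1, 5, 'LO'), (2, 9, 10, 'HI')]
ok, x = edf_vd_schedulable(tasks)
assert ok and abs(x - 0.25) < 1e-9
```

Here u11 + u22 = 1.1 > 1, so plain EDF fails, but scaling HI-task deadlines by x = 0.25 makes the set schedulable under the sufficient test.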
Mixed-criticality models are an emerging paradigm for the design of real-time systems because of their significantly improved resource efficiency. However, formal mixed-criticality models have traditionally been characterized by two impractical assumptions: once any high-criticality task overruns, all low-criticality tasks are suspended and all other high-criticality tasks are assumed to exhibit high-criticality behavior at the same time. In this paper, we propose a more realistic mixed-criticality model, called the flexible mixed-criticality (FMC) model, in which these two issues are addressed in a combined manner. In this new model, only the overrunning task itself is assumed to exhibit high-criticality behavior, while other high-criticality tasks remain in their current mode. The guaranteed service levels of low-criticality tasks are gracefully degraded as high-criticality tasks overrun. We derive a utilization-based technique to analyze the schedulability of this new mixed-criticality model under EDF-VD scheduling. During run time, the proposed test condition serves as an important criterion for dynamic service-level tuning, by means of which the maximum available execution budget for low-criticality tasks can be determined directly with minimal overhead while guaranteeing mixed-criticality schedulability. Experiments demonstrate the effectiveness of the FMC scheme compared with state-of-the-art techniques.
03/09/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2763133

Reliability Optimization on Multi-Core Systems with Multi-Tasking and Redundant Multi-Threading
https://www.computer.org/csdl/trans/tc/2018/04/08094023-abs.html
Using Redundant Multithreading (RMT) for error detection and recovery is a prominent technique to mitigate soft-error effects in multi-core systems. Simultaneous Redundant Threading (SRT) on the same core or Chip-level Redundant Multithreading (CRT) on different cores can be adopted to implement RMT. However, only a few previously proposed approaches use adaptive CRT management at the system level, and none of them considers both SRT and CRT at the task level. In this paper, we propose to use a combination of SRT and CRT, called Mixed Redundant Threading (MRT), as an additional option at the task level. In our coarse-grained approach, we consider SRT, CRT, and MRT at the system level simultaneously, whereas existing approaches apply either SRT or CRT at the system level, but not both. In addition, we consider further fine-grained task-level optimizations to improve system reliability under hard real-time constraints. To optimize system reliability, we develop several dynamic programming approaches to select the redundancy levels under Federated Scheduling. Simulation results illustrate that our approaches can significantly improve system reliability compared to state-of-the-art techniques.
03/09/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2769044

Efficient Scheduling for Multi-Block Updates in Erasure Coding Based Storage Systems
https://www.computer.org/csdl/trans/tc/2018/04/08094270-abs.html
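As background for the update traffic this article schedules: erasure-coded systems commonly update parity incrementally by shipping only a delta (old block XOR new block) to the parity node. The sketch below shows this for simple XOR parity; it is an illustration of the standard delta-update idea, not the paper's UCODR algorithm.

```python
# Delta-based update of a single parity block, the kind of multi-block
# update traffic a scheduler such as UCODR reorders.  Shown for plain
# XOR parity; illustrative only, not the paper's code.

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

data = [b'\x01\x02', b'\x10\x20', b'\xaa\xbb']      # three data blocks
parity = xor_bytes(xor_bytes(data[0], data[1]), data[2])

# Update block 1: instead of re-reading every block, send only the delta
# (old XOR new) to the parity node and fold it in, saving most of the I/O.
new_block = b'\x55\x66'
delta = xor_bytes(data[1], new_block)
parity = xor_bytes(parity, delta)
data[1] = new_block

# The incrementally updated parity matches a full recomputation.
assert parity == xor_bytes(xor_bytes(data[0], data[1]), data[2])
```

When several blocks in one stripe change, the order in which these deltas are read, shipped, and folded determines the total I/O, which is the scheduling problem the abstract describes.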
This paper considers the problem of reducing the I/O overhead of data update operations in erasure-coding-based storage systems. To this end, we first analyze the I/O overhead of update operations under current update approaches. We find that the key to reducing this I/O overhead is a scheduling algorithm that constructs the sequence of update operations. Such an algorithm needs to execute within a time limit, since update requests operate under a stringent latency constraint. To quickly schedule the order of update operations, we propose an efficient algorithm, namely UCODR. Our theoretical analysis verifies that UCODR can effectively reduce the I/O overhead of update operations when multiple blocks are updated. To further confirm its effectiveness, we implement a prototype storage system to deploy UCODR with different erasure codes. Extensive experiments are conducted on the prototype storage system with real-world traces. The experimental results show that UCODR can reduce the time of update operations by up to 35 percent and improve the throughput of the storage system by up to 67 percent, compared with state-of-the-art update approaches.
03/09/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2769051

Static Instruction Scheduling for High Performance on Limited Hardware
https://www.computer.org/csdl/trans/tc/2018/04/08094900-abs.html
Complex out-of-order (OoO) processors have been designed to overcome the restrictions of outstanding long-latency misses, at the cost of increased energy consumption. Simple, limited OoO processors are a compromise in terms of energy consumption and performance, as they have fewer hardware resources to tolerate the penalties of long-latency loads. In the worst case, these loads may stall the processor entirely. We present Clairvoyance, a compiler-based technique that generates code able to hide memory latency and better utilize simple OoO processors. By clustering loads found across basic-block boundaries, Clairvoyance overlaps the outstanding latencies to increase memory-level parallelism. We show that these simple OoO processors, equipped with the appropriate compiler support, can effectively hide long-latency loads and achieve performance improvements for memory-bound applications. To this end, Clairvoyance tackles (i) statically unknown dependencies, (ii) insufficient independent instructions, and (iii) register pressure. Clairvoyance achieves a geomean execution-time improvement of 14 percent for memory-bound applications, on top of standard O3 optimizations, while maintaining the high performance of compute-bound applications.
03/09/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2769641

Thermal-Aware Application Mapping Strategy for Network-on-Chip Based System Design
https://www.computer.org/csdl/trans/tc/2018/04/08096996-abs.html
Rapid progress in technology scaling makes transistors smaller and faster over successive generations, and consequently the core count in a system increases. However, transistor power consumption no longer scales commensurately. Increased power density calls for better thermal safety of multi-core systems, in which a flexible and scalable packet-switched architecture, Network-on-Chip (NoC), is commonly used for communication among the cores. This paper proposes a strategy that increases the thermal safety of NoC-based systems through a graceful trade-off in communication cost, together with an Integer Linear Programming (ILP) formulation of the problem. To overcome the huge computational overhead of ILP, another solution strategy, based on the meta-heuristic Particle Swarm Optimization (PSO), is also proposed. Several innovative augmentations have been introduced into the basic PSO to generate better-quality solutions. A thermal-aware mapping heuristic is proposed to generate some intelligent solutions, which become part of the initial population of the PSO. A trade-off is established between communication cost and peak temperature of the die. Experiments on big-data and graph-analytics workloads are reported. The results obtained are better than those of many contemporary approaches reported in the literature.
03/09/2018 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2770130

Simultaneous and Speculative Thread Migration for Improving Energy Efficiency of Heterogeneous Core Architectures
https://www.computer.org/csdl/trans/tc/2018/04/08097407-abs.html
This paper proposes a microarchitectural mechanism to minimize the latency of thread migration for a tightly-coupled heterogeneous core that has two execution backends (e.g., in-order and out-of-order execution pipelines). The proposed mechanism examines the dependencies between all in-flight instructions residing in one of the backend pipelines and allows both pipelines to perform instruction execution simultaneously. At the microarchitectural level, instruction dispatching and execution are performed seamlessly across thread migration; this simultaneous backend execution can therefore accelerate program execution in a way that existing migration mechanisms cannot. Accelerating thread migration increases overall performance with low power overhead, providing high energy efficiency. Compared to a baseline heterogeneous core with an existing migration mechanism, simultaneous backend execution reduces total execution cycles by 8.2 percent and consumes 2.9 percent less total energy on average across the SPEC CPU2006 benchmarks, resulting in a 10.9 percent improvement in energy efficiency in terms of the energy-delay product.
03/09/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2770126

Selective I/O Bypass and Load Balancing Method for Write-Through SSD Caching in Big Data Analytics
https://www.computer.org/csdl/trans/tc/2018/04/08100898-abs.html
Fast network-quality analysis in the telecom industry is an important method used to provide quality service. SK Telecom, based in South Korea, built a Hadoop-based analytical system consisting of a hundred nodes, each of which contains only hard disk drives (HDDs). Because the analysis process is a set of parallel I/O-intensive jobs, adding solid state drives (SSDs) with appropriate settings is the most cost-efficient way to improve performance, as shown in previous studies. Therefore, we decided to configure SSDs as a write-through cache instead of increasing the number of HDDs. To improve the cost-per-performance of the SSD cache, we introduce a selective I/O bypass (SIB) method, which redirects an automatically calculated number of read I/O requests from the SSD cache to idle HDDs when the SSDs are I/O over-saturated, i.e., when disk utilization exceeds 100 percent. To calculate disk utilization precisely, we also introduce a combinational approach for SSDs, because the method currently used for HDDs cannot be applied to SSDs due to their internal parallelism. In our experiments, the proposed approach achieved up to 2x faster performance than other approaches.
03/09/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2771491

Spectral Features of Higher-Order Side-Channel Countermeasures
https://www.computer.org/csdl/trans/tc/2018/04/08103813-abs.html
This brief deals with the problem of mathematically formalizing hardware circuits’ vulnerability to side-channel attacks. We investigate whether spectral analysis is a useful analytical tool for this purpose by building a mathematically sound theory of the vulnerability phenomenon. This research was originally motivated by the need for deeper, more formal knowledge about vulnerable nonlinear circuits. However, while building this new theoretical framework, we discovered that it can consistently integrate known results about linear circuits as well. Eventually, we found it adequate to formally model side-channel leakage in several significant scenarios. In particular, we have been able to find the vulnerability perimeter of a known cryptographic primitive (i.e., Keccak [1]) and thus tackle the analysis of vulnerability when signal glitches are present. We believe the conceptual framework we propose will be useful for researchers and practitioners in the field of applied cryptography and side-channel attacks.
03/09/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2772231

Dynamic Scheduling with Service Curve for QoS Guarantee of Large-Scale Cloud Storage
https://www.computer.org/csdl/trans/tc/2018/04/08107532-abs.html
With the growing popularity of cloud storage, more and more diverse applications with diverse service level agreements (SLAs) are being accommodated in it, making quality of service (QoS) support for applications in shared cloud storage important. However, performance isolation, diverse performance requirements (especially strict latency guarantees), and high system utilization are all challenging and desirable goals for QoS design. In this paper, we propose a service-curve-based QoS algorithm to support latency-guarantee applications, IOPS-guarantee applications, and best-effort applications on the same storage system, which not only provides a QoS guarantee for applications but also pursues better system utilization. Three priority queues are employed, and different service curves are applied to different types of applications. I/O requests from different applications are scheduled and dispatched among the three queues according to their service curves and I/O urgency status, so that the QoS requirements of all applications can be guaranteed on the shared storage system. Our experimental results show that our algorithm not only simultaneously guarantees the QoS targets of latency and throughput (IOPS), but also improves the utilization of storage resources.
03/09/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2773511

On Practical Discrete Gaussian Samplers for Lattice-Based Cryptography
https://www.computer.org/csdl/trans/tc/2018/03/07792671-abs.html
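To illustrate the CDT sampling technique this article recommends: precompute a cumulative table for the target distribution, then locate a uniform draw in it. The sketch below scans the whole table on every sample so the work done is independent of the value drawn, mimicking the constant-time property in software. It is a toy one-sided sampler under my own parameter choices, not the paper's fixed-point hardware design.

```python
import math
import secrets

# One-sided discrete Gaussian over {0, ..., TAIL-1} via a cumulative
# distribution table (CDT).  Full-table scan => data-independent timing
# in terms of comparisons performed.  Illustrative sketch only.

SIGMA, TAIL = 3.2, 40
weights = [math.exp(-(z * z) / (2 * SIGMA * SIGMA)) for z in range(TAIL)]
total = sum(weights)
cdt = []
acc = 0.0
for wgt in weights:
    acc += wgt
    cdt.append(acc / total)            # cumulative probabilities ending at 1.0

def sample():
    u = secrets.randbelow(1 << 53) / float(1 << 53)   # uniform in [0, 1)
    z = 0
    for c in cdt:                      # scan every entry, never break early
        z += (u > c)                   # branch-free accumulate of comparisons
    return z

xs = [sample() for _ in range(2000)]
assert all(0 <= z < TAIL for z in xs)
```

A binary search over the table would be faster but leaks the sampled value through its memory access pattern, which is exactly the side channel the constant-time designs in the paper avoid.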
Lattice-based cryptography is one of the most promising branches of quantum-resilient cryptography, offering versatility and efficiency. Discrete Gaussian samplers are a core building block in most, if not all, lattice-based cryptosystems, and optimised samplers are desirable for both high-speed and low-area applications. Due to the inherent structure of existing discrete Gaussian sampling methods, lattice-based cryptosystems are vulnerable to side-channel attacks, such as timing analysis. In this paper, the first comprehensive evaluation of discrete Gaussian samplers in hardware is presented, targeting FPGA devices. Novel optimised discrete Gaussian sampler hardware architectures are proposed for the main sampling techniques. A timing-independent design of each of the samplers is presented, offering security against side-channel timing attacks and including the first proposed constant-time Bernoulli, Knuth-Yao, and discrete Ziggurat sampler hardware designs. For balanced performance, the Cumulative Distribution Table (CDT) sampler is recommended; the proposed hardware CDT design achieves a throughput of 59.4 million samples per second for encryption, utilising just 43 slices on a Virtex 6 FPGA, and 16.3 million samples per second for signatures with 179 slices on a Spartan 6 device.
02/08/2018 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2016.2642962

Hardware/Software Co-Design of an Accelerator for FV Homomorphic Encryption Scheme Using Karatsuba Algorithm
https://www.computer.org/csdl/trans/tc/2018/03/07797469-abs.html
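The Karatsuba algorithm this article applies to polynomial multiplication replaces four half-size products with three, which is where the hardware savings come from. The short sketch below shows the recursion on coefficient lists (lowest degree first); it is a schoolbook illustration of the classic algorithm, not the paper's degree-2,560 hardware datapath.

```python
# Karatsuba polynomial multiplication: split a = a0 + a1*x^m and
# b = b0 + b1*x^m, then a*b = z0 + z1*x^m + z2*x^(2m) with only three
# recursive products, using z1 = (a0+a1)(b0+b1) - z0 - z2.

def poly_add(a, b):
    n = max(len(a), len(b))
    a, b = a + [0] * (n - len(a)), b + [0] * (n - len(b))
    return [x + y for x, y in zip(a, b)]

def poly_sub(a, b):
    n = max(len(a), len(b))
    a, b = a + [0] * (n - len(a)), b + [0] * (n - len(b))
    return [x - y for x, y in zip(a, b)]

def shift(p, k):                        # multiply by x^k
    return [0] * k + p

def karatsuba(a, b):
    if not a or not b:
        return [0]
    n = max(len(a), len(b))
    if n == 1:
        return [a[0] * b[0]]
    a, b = a + [0] * (n - len(a)), b + [0] * (n - len(b))
    m = n // 2
    a0, a1, b0, b1 = a[:m], a[m:], b[:m], b[m:]
    z0 = karatsuba(a0, b0)
    z2 = karatsuba(a1, b1)
    z1 = poly_sub(poly_sub(karatsuba(poly_add(a0, a1), poly_add(b0, b1)), z0), z2)
    return poly_add(poly_add(z0, shift(z1, m)), shift(z2, 2 * m))

# (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2
assert karatsuba([1, 2], [3, 4]) == [3, 10, 8]
```

For large degrees the O(n^1.585) cost of this recursion, with coefficient arithmetic done modulo the scheme's modulus, is the operation the accelerator implements in hardware.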
Somewhat Homomorphic Encryption (SHE) schemes allow operations to be carried out on data in the cipher domain. In a cloud computing scenario, personal information can thus be processed secretly, providing a high level of confidentiality. For many years, practical parameters of SHE schemes were overestimated, leading the community to consider only the FFT algorithm for accelerating SHE in hardware. Nevertheless, recent work demonstrates that parameters can be lowered without compromising security [1]. Following this trend, this work investigates the benefits of using the Karatsuba algorithm instead of the FFT for the Fan-Vercauteren (FV) homomorphic encryption scheme. The proposed accelerator relies on a hardware/software co-design approach and is designed to perform fast arithmetic operations on degree-2,560 polynomials with 135-bit coefficients, allowing small algorithms to be computed homomorphically. Compared to a functionally equivalent design using the FFT, our accelerator performs a homomorphic multiplication in 11.9 ms instead of 15.46 ms, and halves the logic and register utilization on the FPGA.
02/08/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2016.2645204

Hardware-Based Trusted Computing Architectures for Isolation and Attestation
https://www.computer.org/csdl/trans/tc/2018/03/07807249-abs.html
Attackers target many different types of computer systems in use today, exploiting software vulnerabilities to take over the device and make it act maliciously. Numerous attacks have been reported against the constrained embedded devices of the Internet of Things, mobile devices like smartphones and tablets, high-performance desktop and server environments, as well as complex industrial control systems. Trusted computing architectures give users and remote parties, such as software vendors, guarantees about the behaviour of the software they run, protecting them against software-level attackers. This paper defines the security properties offered by such architectures and presents detailed descriptions of twelve hardware-based attestation and isolation architectures from academia and industry. We compare all twelve designs with respect to the security properties and architectural features they offer. The presented architectures have been designed for a wide range of devices and support different security properties.
02/08/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2647955

Bitstream Fault Injections (BiFI)–Automated Fault Attacks Against SRAM-Based FPGAs
https://www.computer.org/csdl/trans/tc/2018/03/07809042-abs.html
This contribution is concerned with the question of whether an adversary can automatically manipulate an unknown FPGA bitstream realizing a cryptographic primitive such that the underlying secret key is revealed. In general, if an attacker has full knowledge of the bitstream structure and can make changes to the target FPGA design, she can alter the bitstream to enable key recovery. However, this requires challenging reverse-engineering steps in practice. We argue that this is a major reason why bitstream fault injection attacks have been largely neglected in the past. In this paper, we show that malicious bitstream modifications are i) much easier to conduct than commonly assumed and ii) surprisingly powerful. We introduce a novel class of bitstream fault injection (BiFI) attacks which does not require any reverse-engineering. Our attacks can be mounted automatically, without any detailed knowledge of either the bitstream format or the design of the crypto primitive under attack. Bitstream encryption features do not necessarily prevent our attack if the integrity of the encrypted bitstream is not carefully checked. We have verified the feasibility of our attacks in practice by considering several publicly available AES designs, conducting our experiments on Spartan-6 and Virtex-5 Xilinx FPGAs.
02/08/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2016.2646367

Hybrid Obfuscation to Protect Against Disclosure Attacks on Embedded Microprocessors
https://www.computer.org/csdl/trans/tc/2018/03/07809080-abs.html
The risk of code reverse-engineering is particularly acute for embedded processors, which often have limited resources available to protect program information. Previous efforts involving code obfuscation provide some additional security against reverse-engineering of programs, but the security benefits are typically limited and not quantifiable. Hence, new approaches to code protection, and metrics to go with them, are highly desirable. This paper has two main contributions: we propose the first hybrid diversification approach for protecting embedded software, and we provide statistical metrics to evaluate the protection. Diversification is achieved by combining hardware obfuscation at the microarchitecture level with software-level obfuscation techniques tailored to embedded systems. Both measures rely on a compiler that generates obfuscated programs and on an embedded processor, implemented in an FPGA, with a randomized Instruction Set Architecture (ISA) encoding to execute the hybrid obfuscated program. We employ a fine-grained, hardware-enforced access control mechanism for information exchange with the processor, and hardware-assisted booby traps to actively counteract manipulation attacks. We show that our approach is effective against a wide variety of possible information disclosure attacks by a physically present adversary. Moreover, we propose a novel statistical evaluation methodology that provides a security metric for hybrid-obfuscated programs.
02/08/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2649520

GliFreD: Glitch-Free Duplication Towards Power-Equalized Circuits on FPGAs
https://www.computer.org/csdl/trans/tc/2018/03/07827086-abs.html
Designers of secure hardware are required to harden their implementations against physical threats, such as power analysis attacks. In particular, cryptographic hardware circuits need to decorrelate their current consumption from the (secret) data they process. A common technique to achieve this goal is the use of special logic styles that aim to equalize the current consumption at every single processing step. However, since hiding techniques like Dual-Rail Precharge (DRP) were originally developed for ASICs, deploying such countermeasures on FPGA devices, with their fixed and predefined logic structure, poses a particular challenge. In this work, we propose and practically evaluate a new DRP scheme (GliFreD) designed exclusively for FPGA platforms. GliFreD overcomes the well-known early-propagation issue, prevents glitches, uses an isolated dual-rail concept, and mitigates imbalanced routing. With all these features, GliFreD significantly exceeds the level of physical security achieved by any previously reported, related countermeasure for FPGAs.
02/08/2018 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2651829

A Multiplexer-Based Arbiter PUF Composition with Enhanced Reliability and Security
https://www.computer.org/csdl/trans/tc/2018/03/08025790-abs.html
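Some background for this article: an arbiter PUF is conventionally modeled by the additive linear-delay model, response = sign(w . phi(c)), where phi is a parity transform of the challenge bits; compositions such as the XOR APUF combine several such instances. The sketch below simulates that textbook model (with my own randomly drawn delay weights), not the MPUF circuits from the paper.

```python
import numpy as np

# Additive linear-delay model of an arbiter PUF (APUF):
#   response = 1 if w . phi(c) > 0 else 0,
# where phi_i = prod_{j >= i} (1 - 2 c_j) plus a constant bias feature.
# Illustrative model only; weights here are random, not measured silicon.

rng = np.random.default_rng(1)
N = 64                                   # number of challenge bits / stages

def phi(challenge):
    s = 1 - 2 * challenge                # map bits {0,1} -> signs {+1,-1}
    return np.append(np.cumprod(s[::-1])[::-1], 1.0)

def apuf_response(w, challenge):
    return int(w @ phi(challenge) > 0)

w = rng.standard_normal(N + 1)           # per-stage delay differences
c = rng.integers(0, 2, N)
r = apuf_response(w, c)
assert r in (0, 1)

# An XOR APUF XORs the responses of several independent APUF instances,
# which is the kind of composition the MPUF variants compete with.
ws = rng.standard_normal((4, N + 1))
xor_r = np.bitwise_xor.reduce([apuf_response(wi, c) for wi in ws])
assert xor_r in (0, 1)
```

It is exactly the linearity of this model that makes a single APUF easy to learn, motivating compositions like the XOR APUF and the MPUF variants in the paper.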
Arbiter Physically Unclonable Functions (APUFs), while relatively lightweight, are extremely vulnerable to modeling attacks. Hence, various compositions of APUFs, such as the XOR APUF and the Lightweight Secure PUF, have been proposed as secure alternatives. Previous research has demonstrated that PUF compositions face two major challenges: vulnerability to modeling and statistical attacks, and lack of reliability. In this paper, we introduce a multiplexer-based composition of APUFs, denoted MPUF, to overcome both challenges simultaneously. In addition to the basic MPUF design, we propose two variants, namely cMPUF and rMPUF, to improve robustness against cryptanalysis and reliability-based modeling attacks, respectively. The rMPUF demonstrates enhanced robustness against the reliability-based modeling attack, which has been used to model, with linear data and time complexities, even the well-known XOR APUF that is otherwise robust to machine-learning-based modeling attacks. The rMPUF can provide a good trade-off between security and hardware overhead while maintaining a significantly higher reliability level than any practical XOR APUF instance. Moreover, the MPUF variants are, to the best of our knowledge, the first APUF compositions that can achieve the Strict Avalanche Criterion without using any additional input network (or hardware) for challenge transformation. Finally, we validate our theoretical findings using Matlab-based simulations of MPUFs.
02/08/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2749226

Achieving Load Balance for Parallel Data Access on Distributed File Systems
https://www.computer.org/csdl/trans/tc/2018/03/08027054-abs.html
The distributed file system HDFS is widely deployed as the bedrock for many parallel big-data analyses. However, when multiple parallel applications run over the shared file system, the data requests from different processes/executors are, unfortunately, served in a surprisingly imbalanced fashion on the distributed storage servers. These imbalanced access patterns among storage nodes arise because: a) unlike conventional parallel file systems, which use striping policies to distribute data evenly among storage nodes, data-intensive file systems such as HDFS store each data unit, referred to as a chunk file, in several copies placed by a relatively random policy, which can result in an uneven data distribution among storage nodes; and b) under the data retrieval policy of HDFS, the more data a storage node contains, the higher the probability that it is selected to serve the data. Therefore, on nodes serving multiple chunk files, data requests from different processes/executors compete for shared resources such as the hard disk head and network bandwidth, degrading I/O performance. In this paper, we first conduct a complete analysis of how remote and imbalanced read/write patterns occur and how they are affected by the size of the cluster. We then propose novel methods, referred to as Opass, to optimize parallel data reads and to reduce the imbalance of parallel writes on distributed file systems. Our proposed methods can benefit parallel data-intensive analysis with various parallel data access strategies. Opass adopts new matching-based algorithms to match processes to data so as to achieve the maximum degree of data locality and balanced data access. Furthermore, to reduce the imbalance of parallel writes, Opass employs a heatmap for monitoring the I/O status of storage nodes and applies an HM-LRU policy to select a locally optimal storage node for serving write requests. Experiments are conducted on PRObE’s Marmot 128-node cluster testbed, and the results from both benchmarks and well-known parallel applications show the performance benefits and scalability of Opass.
02/08/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2749229

Randomized Mixed-Radix Scalar Multiplication
https://www.computer.org/csdl/trans/tc/2018/03/08031048-abs.html
A set of congruence relations is a <inline-formula><tex-math notation="LaTeX">$\mathbb {Z}$</tex-math><alternatives> <inline-graphic xlink:href="imbert-ieq1-2750677.gif"/></alternatives></inline-formula>-covering if each integer belongs to at least one congruence class from that set. In this paper, we first show that most existing scalar multiplication algorithms can be formulated in terms of covering systems of congruences. Then, using a special form of covering systems called exact <inline-formula><tex-math notation="LaTeX">$n$</tex-math><alternatives> <inline-graphic xlink:href="imbert-ieq2-2750677.gif"/></alternatives></inline-formula>-covers, we present a novel uniformly randomized scalar multiplication algorithm with built-in protections against most passive side-channel attacks. Our algorithm randomizes the addition chain using a mixed-radix representation of the scalar. Its reduced overhead and purposeful robustness could make it a sound replacement for several conventional countermeasures. In particular, it is significantly faster than Coron's scalar blinding technique for elliptic curves when the choice of a particular finite field tailored for speed compels doubling the size of the scalar, and hence the cost of the scalar multiplication. 02/08/2018 2:03 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2017.2750677 Towards the Design of Efficient and Consistent Index Structure with Minimal Write Activities for Non-Volatile Memory
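The covering notions above are easy to check computationally, because coverage is periodic with period equal to the lcm of the moduli. A minimal sketch (our own helper, not the paper's algorithm):

```python
from functools import reduce
from math import lcm  # Python 3.9+

def cover_count(congruences, n):
    """Number of congruence classes (r, m) containing the integer n."""
    return sum(1 for r, m in congruences if n % m == r % m)

def is_covering(congruences):
    """True if every integer lies in at least one class (a Z-covering);
    it suffices to test one full period, the lcm of all moduli."""
    period = reduce(lcm, (m for _, m in congruences))
    return all(cover_count(congruences, n) >= 1 for n in range(period))

def is_exact_n_cover(congruences, n):
    """True for an exact n-cover: every integer lies in exactly n classes."""
    period = reduce(lcm, (m for _, m in congruences))
    return all(cover_count(congruences, k) == n for k in range(period))
```

For example, {0 mod 2, 1 mod 2} is an exact 1-cover, while adding 0 mod 1 to it yields an exact 2-cover.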
https://www.computer.org/csdl/trans/tc/2018/03/08046077-abs.html
Index structures can significantly accelerate the data retrieval operations in data-intensive systems, such as databases. Tree structures, such as the B<inline-formula><tex-math notation="LaTeX">$^{+}$</tex-math><alternatives> <inline-graphic xlink:href="zhuge-ieq1-2754381.gif"/></alternatives></inline-formula>-tree, are commonly employed as index structures; however, we found that the tree structure may not be appropriate for Non-Volatile Memory (NVM) in terms of the requirements for high performance and high endurance. This paper studies which index structure is best for NVM-based systems and how to design such index structures. The design of an NVM-friendly index structure faces several challenges. <italic>First</italic>, to prolong the lifetime of NVM, the write activities on NVM should be minimized; to this end, the index structure should be as simple as possible. The index proposed in this paper is based on the simplest data structure, i.e., the linked list. <italic>Second</italic>, the simple structure makes it challenging to achieve high-performance data retrieval operations. To overcome this challenge, we design a novel technique that explicitly builds up a contiguous virtual address space on the linked list, such that efficient search algorithms can be performed. <italic>Third</italic>, we need to carefully consider data consistency issues in NVM-based systems, because the order of memory writes may be changed and the data content in NVM may become inconsistent due to the write-back effects of the CPU cache. This paper devises a novel indexing scheme, called “<bold>V</bold>irtual <bold>L</bold>inear <bold>A</bold>ddressable <bold>B</bold>uckets” (VLAB). We implement VLAB in a storage engine and plug it into MySQL. Evaluations are conducted on an NVDIMM workstation using YCSB workloads and real-world traces. 
Results show that the write activities of state-of-the-art indexes are 6.98 times those of ours; meanwhile, VLAB achieves a 2.53 times speedup. 02/08/2018 2:03 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2017.2754381 Digit Serial Methods with Applications to Division and Square Root
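The core VLAB idea, a write-friendly linked list made binary-searchable through a contiguous virtual address space of node references, can be sketched as a toy model. Names and structure below are our own illustration, not the paper's implementation:

```python
import bisect

class _Node:
    __slots__ = ("key", "value", "next")
    def __init__(self, key, value, nxt=None):
        self.key, self.value, self.next = key, value, nxt

class VirtualLinearIndex:
    """A sorted singly linked list (the simple, low-write structure kept
    in NVM) plus parallel arrays of keys and node references -- the
    'contiguous virtual address space' -- enabling binary search."""

    def __init__(self):
        self.head = None
        self.vkeys, self.vnodes = [], []   # virtual address space

    def insert(self, key, value):
        pos = bisect.bisect_left(self.vkeys, key)
        prev = self.vnodes[pos - 1] if pos else None
        node = _Node(key, value, prev.next if prev else self.head)
        if prev:                      # splice into the list: few NVM writes
            prev.next = node
        else:
            self.head = node
        self.vkeys.insert(pos, key)   # keep the virtual space sorted
        self.vnodes.insert(pos, node)

    def search(self, key):
        pos = bisect.bisect_left(self.vkeys, key)
        if pos < len(self.vkeys) and self.vkeys[pos] == key:
            return self.vnodes[pos].value
        return None
```

Searches run in O(log n) over the virtual space while the persistent structure stays a plain list.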
https://www.computer.org/csdl/trans/tc/2018/03/08060979-abs.html
We present a generic digit serial method (DSM) to compute the digits of a real number <inline-formula> <tex-math notation="LaTeX">$V$</tex-math><alternatives><inline-graphic xlink:href="ferguson-ieq1-2759764.gif"/> </alternatives></inline-formula>. Bounds on these digits, and on the errors in the associated estimates of <inline-formula><tex-math notation="LaTeX">$V$</tex-math><alternatives> <inline-graphic xlink:href="ferguson-ieq2-2759764.gif"/></alternatives></inline-formula> formed from these digits, are derived. To illustrate our results, we derive such bounds for a parameterized family of high-radix algorithms for division and square root. These bounds enable a DSM designer to determine, for example, whether a given choice of parameters allows rapid formation and rounding of its approximation to <inline-formula><tex-math notation="LaTeX">$V$ </tex-math><alternatives><inline-graphic xlink:href="ferguson-ieq3-2759764.gif"/></alternatives></inline-formula>.02/08/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2759764Special Section on Secure Computer Architectures
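A digit serial method emits one digit of V per step while maintaining a bounded partial remainder. A toy radix-r instance for division (our own simplification of the DSM idea, not the paper's parameterized high-radix algorithm):

```python
def digit_serial_divide(x, d, radix=10, ndigits=8):
    """Emit the first `ndigits` radix-`radix` fraction digits of
    V = x/d (with 0 <= x < d), keeping a partial remainder w."""
    assert 0 <= x < d
    digits, w = [], x
    for _ in range(ndigits):
        w *= radix
        q = w // d          # next digit, bounded by radix - 1
        digits.append(q)
        w -= q * d          # new partial remainder, still in [0, d)
    return digits
```

Bounding the digits and the residual error of the truncated estimate is exactly the kind of analysis the paper generalizes.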
https://www.computer.org/csdl/trans/tc/2018/03/08287086-abs.html
02/08/2018 2:03 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2017.2788658 Start Simple and then Refine: Bias-Variance Decomposition as a Diagnosis Tool for Leakage Profiling
https://www.computer.org/csdl/trans/tc/2018/02/07990260-abs.html
Evaluating the resistance of cryptosystems to side-channel attacks is an important research challenge. Profiled attacks reveal the degree of resilience of a cryptographic device when an adversary examines its physical characteristics. To date, evaluation laboratories have launched several physical attacks (based on engineering intuition) in order to find one strategy that eventually extracts secret information (such as a secret cryptographic key). The certification step represents a complex task because, in practice, the evaluators operate under tight memory and time constraints. In this paper, we propose a principled way of guiding the design of the most successful evaluation strategies thanks to the (bias-variance) decomposition of a security metric of profiled attacks. Our results show that we can successfully apply our framework to unprotected and protected algorithms implemented in software and hardware. 01/12/2018 2:07 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2017.2731342 On-Chip Fault Monitoring Using Self-Reconfiguring IEEE 1687 Networks
https://www.computer.org/csdl/trans/tc/2018/02/07990262-abs.html
Efficient handling of faults during operation is highly dependent on the interval (latency) from the time embedded monitoring instruments detect faults to the time the fault manager localizes them. In this article, we propose a self-reconfiguring IEEE 1687 network in which all instruments that have detected faults are automatically included in the scan path, together with a hardware fault detection and localization module that detects the configuration of the network after self-reconfiguration and extracts the error codes reported by the fault monitoring instruments. To enable self-reconfiguration, we propose a modified segment insertion bit (SIB) compliant with IEEE 1687. We provide time analyses of fault detection and fault localization for single and multiple faults, and suggest how the self-reconfiguring IEEE 1687 network should be designed so that the time for fault detection and fault localization is kept low and deterministic. We show that, compared with previous schemes, our proposed network significantly reduces the fault localization time. For validation, we implemented a number of self-reconfiguring networks as well as their corresponding fault detection and localization modules in hardware, and performed post-layout simulations. We show that for a large number of instruments, implementing the fault detection and localization module in hardware results in less area than the corresponding software-based implementation. Another benefit of the hardware implementation over its software counterpart is that, to achieve the same fault detection and localization time, the hardware module can be clocked at a lower frequency than the core on which the corresponding software implementation would run. 01/12/2018 2:08 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2017.2731338 Approximate DCT Image Compression Using Inexact Computing
https://www.computer.org/csdl/trans/tc/2018/02/07990539-abs.html
This paper proposes a new framework for digital image processing; it relies on inexact computing to address some of the challenges associated with discrete cosine transform (DCT) compression. The proposed framework has three levels of processing. The first level uses an approximate DCT for image compression, eliminating all computationally intensive floating-point multiplications and executing the DCT with integer additions and, in some cases, logical right/left shifts. The second level further reduces the amount of data (from the first level) that needs to be processed by filtering out those frequencies that cannot be detected by human senses. Finally, to reduce power consumption and delay, the third level introduces circuit-level inexact adders to compute the DCT. For assessment, a set of standardized images is compressed using the proposed three-level framework. Different figures of merit (such as energy consumption, delay, peak signal-to-noise ratio, average difference, and absolute maximum difference) are compared with those of existing compression methods; an error analysis is also pursued, confirming the simulation results. Results show substantial reductions in energy and delay, while maintaining acceptable accuracy levels for image processing applications. 01/12/2018 2:08 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2017.2731770 DuCNoC: A High-Throughput FPGA-Based NoC Simulator Using Dual-Clock Lightweight Router Micro-Architecture
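The multiplication-free flavor of the first level can be illustrated with a 4-point transform whose matrix has entries in {+1, -1}, so every output coefficient is formed by additions and subtractions only. This Walsh-Hadamard-style matrix is our own stand-in for exposition, not the paper's approximate DCT:

```python
def approx_dct4(x):
    """Multiplication-free 4-point transform (entries in {+1, -1});
    the first output is the DC term, later ones higher 'frequencies'."""
    a, b, c, d = x
    return [a + b + c + d,   # DC term
            a + b - c - d,   # low frequency
            a - b - c + d,
            a - b + c - d]
```

On a constant block all energy lands in the DC term, and on smooth blocks it concentrates in the low-frequency outputs, which is what makes coefficient filtering (the second level) effective.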
https://www.computer.org/csdl/trans/tc/2018/02/08000664-abs.html
On-chip interconnections play an important role in multi/many-processor systems-on-chip (MPSoCs). In order to achieve efficient optimization, each specific application must utilize a specific architecture, and consequently a specific interconnection network. For design space exploration and finding the best NoC solution for each specific application, a fast and flexible NoC simulator is necessary, especially for large design spaces. In this paper, we present an FPGA-based NoC co-simulator, which is able to be configured via software. In our proposed NoC simulator, entitled <italic>DuCNoC</italic>, we implement a <italic>Dual-Clock</italic> router micro-architecture, which demonstrates 75x<inline-formula><tex-math notation="LaTeX">$-$</tex-math><alternatives> <inline-graphic xlink:href="mardanikamali-ieq1-2735399.gif"/></alternatives></inline-formula>350x speed-up against BOOKSIM. Additionally, we implement a two-layer configurable global interconnection in our proposed architecture to (1) reduce virtualization time overhead, (2) make an efficient trade-off between the resource utilization and simulation time of the whole simulator, and especially (3) provide the capability of simulating irregular topologies. Migration of some important sub-modules like traffic generators (TGs) and traffic receptors (TRs) to software side, and implementing a dual-clock context switching in virtualization are other major features of DuCNoC. 
Thanks to its dual-clock router micro-architecture, as well as TGs and TRs migration to software side, DuCNoC can simulate a 100-node (10 <inline-formula><tex-math notation="LaTeX">$\times$</tex-math><alternatives> <inline-graphic xlink:href="mardanikamali-ieq2-2735399.gif"/></alternatives></inline-formula> 10) non-virtualized or a 2048-node virtualized mesh network on Xilinx Zynq-7000.01/12/2018 2:07 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2735399Bubble Budgeting: Throughput Optimization for Dynamic Workloads by Exploiting Dark Cores in Many Core Systems
https://www.computer.org/csdl/trans/tc/2018/02/08006237-abs.html
Not all cores of a many-core chip can be active at the same time, for reasons such as low CPU utilization in server systems and the limited power budget of the dark silicon era. These free cores (referred to as bubbles) can be placed near active cores for heat dissipation, so that the active cores can run at a higher frequency level, boosting the performance of the applications that run on them. Budgeting inactive cores (bubbles) to applications to boost performance poses three challenges. First, the number of bubbles varies due to open workloads. Second, communication distance increases when a bubble is inserted between two communicating tasks (a task is a thread or process of a parallel application), leading to performance degradation. Third, budgeting too many bubbles as coolers for running applications leaves insufficient cores for future applications. To address these challenges, this paper proposes a bubble budgeting scheme that budgets free cores to each application so as to optimize the throughput of the whole system. The throughput of the system depends on the execution time of each application and the waiting time incurred by newly arrived applications. Essentially, the proposed algorithm determines the number and locations of bubbles to optimize the performance and waiting time of each application, after which the tasks of each application are mapped to a core region. A rollout algorithm is used to budget power to the cores as the last step. Experiments show that our approach achieves 50 percent higher throughput than state-of-the-art thermal-aware runtime task mapping approaches. The runtime overhead of the proposed algorithm is on the order of 1M cycles, making it an efficient runtime task management method for large-scale many-core systems. 01/12/2018 2:08 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2017.2735967 Efficient Protection of the Register File in Soft-Processors Implemented on Xilinx FPGAs
https://www.computer.org/csdl/trans/tc/2018/02/08008792-abs.html
Soft-processors implemented on SRAM-based FPGAs are increasingly being adopted in on-board computing for space and avionics applications due to their flexibility and ease of integration. However, efficient component-level protection techniques against radiation-induced upsets are necessary for these processors, as system failures could otherwise manifest. The register file is one of the critical structures: it stores vital information that the processor uses for user computations and program execution. In this paper, we present a fault tolerance technique for the register file of a microprocessor implemented in Xilinx SRAM-based FPGAs. The proposed scheme leverages the inherent implementation redundancy created by the FPGA design automation tools when mapping the register file to on-chip distributed memory. Parity-based error detection and switching logic are added for fault masking against single-bit errors. The proposed scheme has been implemented and evaluated in lowRISC, a RISC-V ISA soft-processor implementation. The effectiveness of the proposed scheme was tested using fault injection. The fault-masking overhead required in terms of FPGA resources was much lower than that of traditional Triple Modular Redundancy protection. Therefore, the proposed scheme is an interesting option for protecting the register file of soft processors implemented in Xilinx FPGAs. 01/12/2018 2:07 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2017.2737996 Genetic Programming for Energy-Efficient and Energy-Scalable Approximate Feature Computation in Embedded Inference Systems
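The parity-check-and-switch behavior can be modeled in a few lines. The class below is our own behavioral sketch (per-register parity over two redundant copies, standing in for the duplication the FPGA mapping provides), not the paper's circuit:

```python
class ParityProtectedRegfile:
    """Each register is held in two copies, each with a stored parity
    bit; a read checks parity and switches to the redundant copy on a
    single-bit upset."""

    def __init__(self, nregs=32):
        self.copies = [[0] * nregs, [0] * nregs]
        self.parity = [[0] * nregs, [0] * nregs]

    @staticmethod
    def _parity(v):
        return bin(v).count("1") & 1

    def write(self, idx, value):
        for c in (0, 1):
            self.copies[c][idx] = value
            self.parity[c][idx] = self._parity(value)

    def read(self, idx):
        for c in (0, 1):
            v = self.copies[c][idx]
            if self._parity(v) == self.parity[c][idx]:
                return v          # parity check passed
        raise RuntimeError("uncorrectable: both copies corrupted")

    def inject_bit_flip(self, copy, idx, bit):
        """Fault-injection helper: flip one data bit in one copy."""
        self.copies[copy][idx] ^= 1 << bit
```

As in the abstract's fault model, a single-bit upset in either copy is masked; multi-bit upsets in one word may escape a single parity bit.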
https://www.computer.org/csdl/trans/tc/2018/02/08008802-abs.html
With the increasing interest in deploying embedded sensors in a range of applications, there is also interest in deploying embedded inference capabilities. Doing so under the strict and often variable energy constraints of the embedded platforms requires algorithmic, in addition to circuit and architectural, approaches to reducing energy. A broad approach that has recently received considerable attention in the context of inference systems is approximate computing. This stems from the observation that many inference systems exhibit various forms of tolerance to data noise. While some systems have demonstrated significant approximation-versus-energy knobs to exploit this, those knobs have been applicable only to specific kernels and architectures; the more generally available knobs have been relatively weak, yielding large data noise for relatively modest energy savings (e.g., voltage overscaling, bit-precision scaling). In this work, we explore the use of genetic programming (GP) to compute approximate features. Further, we leverage a method that enhances tolerance to feature-data noise through directed retraining of the inference stage. Previous work in GP has shown that it generalizes well to enable approximation of a broad range of computations, raising the potential for broad applicability of the proposed approach. The focus on feature extraction is deliberate because feature extractors involve diverse, often highly nonlinear, operations, which challenge the general applicability of energy-reducing approaches. We evaluate the proposed methodologies through two case studies, based on energy modeling of a custom low-power microprocessor with a classification accelerator. The first case study is on electroencephalogram-based seizure detection. 
We find that the choice of two primitive functions (square root, subtraction) out of seven possible primitive functions (addition, subtraction, multiplication, logarithm, exponential, square root, and square) enables us to approximate features in 0.41<inline-formula><tex-math notation="LaTeX">$mJ$</tex-math><alternatives> <inline-graphic xlink:href="lu-ieq1-2738642.gif"/></alternatives></inline-formula> per feature vector (FV), as compared to 4.79<inline-formula><tex-math notation="LaTeX">$mJ$</tex-math><alternatives> <inline-graphic xlink:href="lu-ieq2-2738642.gif"/></alternatives></inline-formula> per FV required for baseline feature extraction. This represents a feature extraction energy reduction of 11.68<inline-formula><tex-math notation="LaTeX"> $\times$</tex-math><alternatives><inline-graphic xlink:href="lu-ieq3-2738642.gif"/></alternatives></inline-formula>. The important system-level performance metrics for seizure detection are sensitivity, latency, and number of false alarms per hour. Our set of GP models achieves 100 percent sensitivity, 4.37 second latency, and 0.15 false alarms per hour. The baseline performance is 100 percent sensitivity, 3.84 second latency, and 0.06 false alarms per hour. The second case study is on electrocardiogram-based arrhythmia detection. In this case, just one primitive function (multiplication) suffices to approximate features in 1.13<inline-formula><tex-math notation="LaTeX">$\mu J$</tex-math> <alternatives><inline-graphic xlink:href="lu-ieq4-2738642.gif"/></alternatives></inline-formula> per FV, as compared to 11.69<inline-formula><tex-math notation="LaTeX">$\mu J$</tex-math><alternatives> <inline-graphic xlink:href="lu-ieq5-2738642.gif"/></alternatives></inline-formula> per FV required for baseline feature extraction. 
This represents a feature extraction energy reduction of 10.35<inline-formula><tex-math notation="LaTeX"> $\times$</tex-math><alternatives><inline-graphic xlink:href="lu-ieq6-2738642.gif"/></alternatives></inline-formula>. The important system-level metrics in this case are sensitivity, specificity, and accuracy. Our set of GP models achieves 81.17 percent sensitivity, 80.63 percent specificity, and 81.86 percent accuracy, whereas the baseline achieves 82.05 percent sensitivity, 88.12 percent specificity, and 87.92 percent accuracy. These case studies demonstrate the possibility of a significant reduction in feature extraction energy at the expense of a slight degradation in system performance.01/12/2018 2:07 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2738642Compact CA-Based Single Byte Error Correcting Codec
https://www.computer.org/csdl/trans/tc/2018/02/08010467-abs.html
Memory contents can be corrupted by soft errors caused by external radiation, which reduces the reliability of memory systems. To enhance the reliability of memory systems, error correcting codes (ECC) are widely used to detect and correct errors. Single-bit error correcting, double-bit error detecting codes are generally used in memory systems, but in the case of multiple-cell errors these codes are unable to detect and correct the errors. Recently, single-byte error correcting Reed-Solomon (SEC-RS) codes have been used to detect and correct single-byte errors in memory systems. In this paper, a new single-byte error correcting (SEC) code, termed CASEC, is proposed based on the concept of cellular automata. The main aim of this work is to reduce the area and power of the SEC encoder and decoder circuits without affecting delay. CASEC(10,8,8), CASEC(18,16,8), 2xCASEC(10,8,4) and 2xCASEC(19,6,4) codecs are designed and implemented. The CASEC(18,16,8) codec has 67.79 percent lower hardware complexity than the existing design. The proposed codecs are simulated and synthesized for both FPGA and ASIC platforms. It is found that the speed of the proposed design is almost equal to that of the existing design, but it requires less area and power. The area-delay products (ADP) of the proposed CASEC(10,8,8), CASEC(18,16,8), and 2xCASEC(10,8,4) codecs are better than those of the existing designs. 01/12/2018 2:08 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2017.2739726 Bi-Objective Optimization of Data-Parallel Applications on Homogeneous Multicore Clusters for Performance and Energy
https://www.computer.org/csdl/trans/tc/2018/02/08013836-abs.html
Performance and energy are now the most dominant objectives for optimization on modern parallel platforms composed of multicore CPU nodes. The existing intra-node and inter-node optimization methods employ a large set of decision variables but do not consider problem size as a decision variable and assume a linear relationship between performance and problem size and between energy consumption and problem size. We demonstrate using experiments of real-life data-parallel applications on modern multicore CPUs that these relationships have complex (non-linear and even non-convex) properties and, therefore, that the problem size has become an important decision variable that can no longer be ignored. This key finding motivates our work in this paper. In this paper, we first formulate the bi-objective optimization problem for performance and energy (BOPPE) for data-parallel applications on homogeneous clusters of modern multicore CPUs. It contains only one but heretofore unconsidered decision variable, the problem size. We then present an efficient and exact global optimization algorithm called <italic>ALEPH</italic> that solves the <italic>BOPPE</italic>. It takes as inputs, discrete functions of performance and dynamic energy consumption against problem size and outputs the globally Pareto-optimal set of solutions. The solutions are the workload distributions, which achieve inter-node optimization of data-parallel applications for performance and energy. While existing solvers for <italic>BOPPE</italic> give only one solution when the problem size and number of processors are fixed, our algorithm gives a diverse set of globally Pareto-optimal solutions. 
The algorithm has time complexity of <inline-formula><tex-math notation="LaTeX">$O(m^2 \times p^2)$</tex-math><alternatives> <inline-graphic xlink:href="manumachu-ieq1-2742513.gif"/></alternatives></inline-formula> where <inline-formula> <tex-math notation="LaTeX">$m$</tex-math><alternatives><inline-graphic xlink:href="manumachu-ieq2-2742513.gif"/> </alternatives></inline-formula> is the number of points in the discrete speed/energy function and <inline-formula> <tex-math notation="LaTeX">$p$</tex-math><alternatives><inline-graphic xlink:href="manumachu-ieq3-2742513.gif"/> </alternatives></inline-formula> is the number of available processors. We experimentally study the efficiency and scalability of our algorithm for two data-parallel applications, matrix multiplication and fast Fourier transform, on a modern multicore CPU and on homogeneous clusters of such CPUs. Based on our experiments, we show that the (average, maximum) sizes of the globally Pareto-optimal sets determined by our algorithm are (15, 34) for the first application and (7, 20) for the second. Compared with the load-balanced workload distribution solution, the (average, maximum) percentage improvements demonstrated for the first application are (13%, 97%) in performance and (18%, 71%) in energy. For the second application, these improvements are (40%, 95%) and (22%, 127%). Assuming a 5 percent performance degradation from the optimal is acceptable, the (average, maximum) improvements in energy consumption demonstrated for the two applications are (9, 44) percent and (8, 20) percent, respectively. Using the algorithm and its building blocks, we also present a study of the interplay between performance and energy. 
We demonstrate how <italic>ALEPH</italic> can be combined with <italic>DVFS</italic>-based Multi-Objective Optimization (MOP) methods to give a better set of (globally Pareto-optimal) solutions. 01/12/2018 2:08 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2017.2742513 D$^{3}$: A Dynamic Dual-Phase Deduplication Framework for Distributed Primary Storage
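Given discrete (time, energy) points, extracting the globally Pareto-optimal subset is a simple sort-and-scan filter; the sketch below illustrates only that final selection step, not ALEPH itself, which exploits the structure of the discrete performance/energy functions:

```python
def pareto_front(points):
    """Return the Pareto-optimal subset of (time, energy) points,
    lower being better in both objectives.  After sorting by time, a
    point is kept iff it strictly improves the best energy seen so far."""
    front, best_energy = [], float("inf")
    for t, e in sorted(points):
        if e < best_energy:
            front.append((t, e))
            best_energy = e
    return front
```

Every kept point trades time against energy: no other point is at least as good in both objectives.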
https://www.computer.org/csdl/trans/tc/2018/02/08015182-abs.html
Deploying deduplication for distributed primary storage is a sophisticated and challenging task, considering that the demands of low read/write latency, stable read/write performance, and efficient space saving are all of paramount importance. Unfortunately, existing schemes cannot present a satisfactory solution for the aforementioned requirements simultaneously. In this article, we propose D<inline-formula><tex-math notation="LaTeX">$^{3}$</tex-math><alternatives> <inline-graphic xlink:href="deng-ieq2-2743199.gif"/></alternatives></inline-formula>, a dynamic dual-phase deduplication framework for distributed primary storage. Several major innovations are established in D<inline-formula> <tex-math notation="LaTeX">$^{3}$</tex-math><alternatives><inline-graphic xlink:href="deng-ieq3-2743199.gif"/> </alternatives></inline-formula>. First, we formulate a deduplication-oriented taxonomy called <italic>Dedup-Type </italic>, to group data with similar deduplication-related characteristics into larger categories. It serves as coarse-grained filter and one of the prioritizing references in D<inline-formula><tex-math notation="LaTeX">$^{3}$ </tex-math><alternatives><inline-graphic xlink:href="deng-ieq4-2743199.gif"/></alternatives></inline-formula>. Second, D<inline-formula><tex-math notation="LaTeX">$^{3}$</tex-math><alternatives> <inline-graphic xlink:href="deng-ieq5-2743199.gif"/></alternatives></inline-formula> is a dual-phase framework—inline-phase and offline-phase deduplication processes work in concert with each other. Third, D <inline-formula><tex-math notation="LaTeX">$^{3}$</tex-math><alternatives> <inline-graphic xlink:href="deng-ieq6-2743199.gif"/></alternatives></inline-formula> operates in a dynamic manner. We design two critical mechanisms: <italic>context-aware threshold adjustment</italic> (CTA) for local inline-phase deduplication, and <italic>deferred priority-based enforcement</italic> (DPE) for global offline-phase deduplication. 
The CTA mechanism enables selective deduplication under a periodically updated threshold. Data skipped during the inline phase is regarded as a candidate for the offline phase, and is handled in prioritized order under the governance of the DPE mechanism. Evaluation results demonstrate that, compared with conventional inline and offline deduplication schemes, D<inline-formula><tex-math notation="LaTeX">$^{3}$</tex-math><alternatives> <inline-graphic xlink:href="deng-ieq7-2743199.gif"/></alternatives></inline-formula> achieves more efficient and more stable read/write performance with competitive space savings. 01/12/2018 2:08 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2017.2743199 Principal Component Analysis Based Filtering for Scalable, High Precision k-NN Search
https://www.computer.org/csdl/trans/tc/2018/02/08024082-abs.html
Approximate <inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="eyers-ieq1-2748131.gif"/></alternatives></inline-formula> Nearest Neighbours (A <inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="eyers-ieq2-2748131.gif"/></alternatives></inline-formula>NN) search is widely used in domains such as computer vision and machine learning. However, A<inline-formula><tex-math notation="LaTeX">$k$ </tex-math><alternatives><inline-graphic xlink:href="eyers-ieq3-2748131.gif"/></alternatives></inline-formula>NN search in high-dimensional datasets does not scale well on multicore platforms, due to its large memory footprint. Parallel A <inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="eyers-ieq4-2748131.gif"/></alternatives></inline-formula>NN search using space subdivision for filtering helps reduce the memory footprint, but its loss of precision is unstable. In this paper, we propose a new data filtering method—PCAF—for parallel A<inline-formula><tex-math notation="LaTeX">$k$</tex-math> <alternatives><inline-graphic xlink:href="eyers-ieq5-2748131.gif"/></alternatives></inline-formula>NN search based on principal component analysis. PCAF improves on previous methods, demonstrating sustained, high scalability for a wide range of high-dimensional datasets on both Intel and AMD multicore platforms. Moreover, PCAF maintains highly precise A<inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="eyers-ieq6-2748131.gif"/></alternatives></inline-formula>NN search results.01/12/2018 2:08 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2748131ClusterFetch: A Lightweight Prefetcher for Intensive Disk Reads
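The PCA-based filtering idea can be sketched as: project all points onto the leading principal component (computed here by plain power iteration), discard points whose projection lies far from the query's, then search the survivors exactly. The function names and the fixed `radius` parameter are our own illustration, not PCAF's actual design:

```python
def top_component(points, iters=50):
    """Leading principal component via power iteration on the
    (unnormalised) covariance, in pure Python."""
    dim = len(points[0])
    mean = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    centred = [[p[i] - mean[i] for i in range(dim)] for p in points]
    v = [1.0] * dim
    for _ in range(iters):
        proj = [sum(c[i] * v[i] for i in range(dim)) for c in centred]
        w = [sum(proj[k] * centred[k][i] for k in range(len(centred)))
             for i in range(dim)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return mean, v

def filtered_nn(points, query, radius):
    """Exact 1-NN search restricted to points whose first-component
    projection is within `radius` of the query's projection."""
    mean, v = top_component(points)
    def proj(p):
        return sum((p[i] - mean[i]) * v[i] for i in range(len(p)))
    qp = proj(query)
    cands = [p for p in points if abs(proj(p) - qp) <= radius]
    return min(cands, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, query)))
```

Filtering on a one-dimensional projection shrinks the candidate set (and memory traffic) while the dominant direction preserves most of the distance structure.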
https://www.computer.org/csdl/trans/tc/2018/02/08025580-abs.html
By overlapping disk accesses with computation-intensive operations, prefetching can reduce delays in launching an application and in loading significant amounts of data while the application is running. The key to effective prefetching is balancing the accuracy of selecting relevant blocks against the time needed to decide on those blocks. To address this problem, we propose a new prefetcher called <italic>ClusterFetch</italic>. In its learning mode, ClusterFetch detects periods of intensive disk accesses by monitoring the speed at which read requests are queued; it re-organizes these reads and locates the file opened by the application just before each such period. During subsequent runs of the same application, ClusterFetch prefetches the data associated with the opening of this “trigger” file. Our experimental results show that ClusterFetch, implemented in Linux, can reduce application launch time by up to 41.3 percent and loading time by up to 38.2 percent, while taking up less than 200 KB of main memory. 01/12/2018 2:08 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2017.2748939 A Generic Construction of Quantum-Oblivious-Key-Transfer-Based Private Query with Ideal Database Security and Zero Failure
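The two modes can be caricatured with a tiny trigger table: learning associates the file opened just before a burst of intensive reads with the blocks read in that burst, and replay prefetches those blocks when the trigger file is opened again. This is a sketch with invented method names, not the Linux implementation:

```python
class TriggerPrefetcher:
    """Toy model of a trigger-file prefetch table."""

    def __init__(self):
        self.table = {}          # trigger path -> block list
        self._last_open = None

    # -- learning mode ---------------------------------------------------
    def on_open(self, path):
        self._last_open = path   # candidate trigger for the next burst

    def on_read_burst(self, blocks):
        if self._last_open is not None:
            self.table[self._last_open] = list(blocks)

    # -- replay mode -----------------------------------------------------
    def prefetch_for(self, path):
        return self.table.get(path, [])
```

A real prefetcher must also decide what counts as a "burst" (the queueing-rate monitoring the abstract describes) and bound the table's memory footprint.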
https://www.computer.org/csdl/trans/tc/2018/01/07962191-abs.html
Higher security and a lower failure probability have been constant pursuits in quantum-oblivious-key-transfer-based private query (QOKT-PQ) protocols since Jacobi et al. [<italic>Phys. Rev. A</italic> 83, 022301 (2011)] proposed the first protocol of this kind. However, higher database security generally has to be obtained at the cost of a higher failure probability, and vice versa. Recently, based on a round-robin differential-phase-shift quantum key distribution protocol, Liu et al. [<italic>Sci. China-Phys. Mech. Astron.</italic> 58, 100301 (2015)] presented a private query protocol (the RRDPS-PQ protocol) that utilizes an ideal single-photon signal and realizes both ideal database security and a zero failure probability. However, an ideal single-photon source is not available today, and for large databases the required pulse train is too long to implement. Here, we re-examine the security of the RRDPS-PQ protocol under an imperfect source and present an improved protocol using a special “low-shift and addition” (LSA) technique, which not only can be used to query large databases but also retains the features of “ideal database security” and “zero failure” even under a weak coherent source. Finally, we generalize the LSA technique and establish a generic QOKT-PQ model in which both “ideal database security” and “zero failure” are achieved with acceptable communication costs. 12/11/2017 2:04 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2017.2721404 Type Information Elimination from Objects on Architectures with Tagged Pointers Support
https://www.computer.org/csdl/trans/tc/2018/01/07962268-abs.html
Implementations of object-oriented programming languages associate type information with each object to perform various runtime tasks such as dynamic dispatch, type introspection, and reflection. A common means of storing such a relation is to insert a pointer to the associated type information into every object. Such an approach, however, introduces memory and performance overheads when compared with non-object-oriented languages. Recent 64-bit computer architectures have added support for <italic>tagged pointers</italic> by ignoring a number of bits, the <italic>tag</italic>, of memory addresses during memory access operations and utilizing them for other purposes, mainly security. This paper presents the first investigation into how this hardware support can be exploited by a Java Virtual Machine to remove type information from objects. Moreover, we propose novel hardware extensions to the address generation and load-store units to achieve low-overhead type information retrieval and tagged object pointer compression-decompression. The evaluation has been conducted by integrating the Maxine VM and the ZSim microarchitectural simulator. The results, across the entire DaCapo benchmark suite, pseudo-SPECjbb2005, SLAMBench, and GraphChi-PR executed to completion, show up to 26 and 10 percent geometric mean heap space savings, up to 50 and 12 percent geometric mean dynamic DRAM energy reduction, and up to 49 and 3 percent geometric mean execution time reduction with no significant performance regressions.
12/11/2017 2:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2709739
Time-Triggered Co-Scheduling of Computation and Communication with Jitter Requirements
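As a minimal illustration of the tagged-pointer idea in the type-information-elimination abstract above: architectures with tag support ignore a number of high address bits on memory access, so metadata such as a type id can ride in them. This is a sketch under assumed parameters (an 8-bit top-byte tag, as in ARM's Top-Byte-Ignore), not the paper's Maxine VM implementation; all names are hypothetical.

```python
# Hypothetical sketch: pack a type id into the ignored top byte of a
# 64-bit pointer-sized integer, and recover both address and type id.
TAG_BITS = 8                       # assumed tag width (e.g., ARM TBI)
TAG_SHIFT = 64 - TAG_BITS
ADDR_MASK = (1 << TAG_SHIFT) - 1   # low 56 bits hold the real address

def tag_pointer(addr, type_id):
    """Embed a type id in the tag byte; the address bits are untouched."""
    assert 0 <= type_id < (1 << TAG_BITS)
    return (type_id << TAG_SHIFT) | (addr & ADDR_MASK)

def untag(ptr):
    """Recover (address, type_id); tag-aware hardware ignores the tag on access."""
    return ptr & ADDR_MASK, ptr >> TAG_SHIFT

ptr = tag_pointer(0x7f00_dead_beef, type_id=42)
addr, tid = untag(ptr)
```

The object itself then needs no type-information pointer, which is the source of the heap savings the abstract reports.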
https://www.computer.org/csdl/trans/tc/2018/01/07967685-abs.html
The complexity of embedded application design is increasing with growing user demands. In particular, automotive embedded systems are highly complex in nature, and their functionality is realized by a set of periodic tasks. These tasks may have hard real-time requirements and communicate over an interconnect. The problem is to efficiently co-schedule task execution on cores and message transmission on the interconnect so that timing constraints are satisfied. Contemporary works typically deal with zero-jitter scheduling, which results in lower resource utilization but has lower memory requirements. This article focuses on jitter-constrained scheduling, which puts constraints on task jitter, increasing schedulability over zero-jitter scheduling. The contributions of this article are: 1) Integer Linear Programming and Satisfiability Modulo Theory models exploiting problem-specific information to reduce the formulations' complexity when scheduling small applications. 2) A heuristic approach, employing three levels of scheduling, that scales to real-world use-cases with 10,000 tasks and messages. 3) An experimental evaluation of the proposed approaches on a case study and on synthetic data sets showing the efficiency of both zero-jitter and jitter-constrained scheduling. It shows that up to 28 percent higher resource utilization can be achieved with relaxed jitter requirements, at the cost of up to 10 times longer computation time.
12/11/2017 2:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2722443
PowerCool: Simulation of Cooling and Powering of 3D MPSoCs with Integrated Flow Cell Arrays
https://www.computer.org/csdl/trans/tc/2018/01/07967719-abs.html
Integrated Flow-Cell Arrays (FCAs) represent a combination of integrated liquid cooling and on-chip power generation, converting chemical energy of the flowing electrolyte solutions to electrical energy. The FCA technology provides a promising way to address both heat removal and power delivery issues in 3D Multiprocessor Systems-on-Chips (MPSoCs). In this paper, we motivate the benefits of FCAs in 3D MPSoCs via a qualitative analysis and explore the capabilities of the proposed technology using our extended PowerCool simulator. PowerCool is a tool that performs combined compact thermal and electrochemical simulation of 3D MPSoCs with inter-tier FCA-based cooling and power generation. We validate our electrochemical model against experimental data obtained using a micro-scale FCA, and extend PowerCool with a compact thermal model (3D-ICE) and subthreshold leakage estimation. We show the sensitivity of the FCA cooling and power generation to both design-time (FCA geometry) and run-time (fluid inlet temperature, flow rate) parameters. Our results show that we can optimize the FCA to keep the maximum chip temperature below 95 <inline-formula><tex-math notation="LaTeX">$^\circ$</tex-math><alternatives> <inline-graphic xlink:href="andreev-ieq1-2695179.gif"/></alternatives></inline-formula>C for an average chip power consumption of 50 W/cm<sup>2</sup> while generating up to 3.6 W per cm<sup>2</sup> of chip area.
12/11/2017 2:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2695179
Efficient Detection for Malicious and Random Errors in Additive Encrypted Computation
https://www.computer.org/csdl/trans/tc/2018/01/07967774-abs.html
Although data confidentiality is the primary security objective in additive encrypted computation applications, such as the aggregation of encrypted votes in electronic elections, ensuring the trustworthiness of data is equally important. And yet, integrity protections are generally orthogonal to additive homomorphic encryption, which enables efficient encrypted computation, due to the inherent malleability of homomorphic ciphertexts. Since additive homomorphic schemes are founded on modular arithmetic, our framework extends residue numbering to support fast modular reductions and homomorphic syndromes for detecting random errors inside homomorphic ALUs and data memories. In addition, our methodology detects <italic>malicious modifications</italic> of memory data, using keyed syndromes and block cipher-based integrity trees, which preserve the homomorphism of ALU operations while enforcing non-malleability of memory data. Compared to traditional memory integrity protections, our tree-based syndrome generation and updating is parallelizable for increased efficiency, while requiring a small Trusted Computing Base for secret key storage and block cipher operations. Our evaluation shows a more than 99.999 percent detection rate for random ALU errors, as well as a 100 percent detection rate for single bit-flips and clustered multiple bit upsets, for a runtime overhead between 1.2 and 5.5 percent and a small area penalty.
12/11/2017 2:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2722440
EXTREME: Exploiting Page Table for Reducing Refresh Power of 3D-Stacked DRAM Memory
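The core idea behind the homomorphic syndromes in the error-detection abstract above can be sketched in miniature: each value carries a small modular checksum, addition preserves the checksum homomorphically, and a corrupted value is caught when value and syndrome disagree. The modulus `P` below is illustrative, not a parameter from the paper, and this toy omits the keyed syndromes and integrity trees.

```python
# Toy residue-syndrome sketch (illustrative modulus, not the paper's scheme).
P = 251  # small checking modulus; detection escapes only errors divisible by P

def protect(x):
    return (x, x % P)              # value paired with its syndrome

def add(a, b):
    (x, sx), (y, sy) = a, b
    return (x + y, (sx + sy) % P)  # syndromes add homomorphically

def check(a):
    x, s = a
    return x % P == s

a = add(protect(123456), protect(654321))
ok_before = check(a)
corrupted = (a[0] ^ (1 << 17), a[1])   # flip one bit of the value only
ok_after = check(corrupted)
```

A single bit-flip changes the value by ±2^17, which is not a multiple of 251, so the mismatch is guaranteed to be detected here.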
https://www.computer.org/csdl/trans/tc/2018/01/07968377-abs.html
For future exascale computing systems, ultra-high-density memories that consume low power will be required to process massive data. Of the various memory devices, 3D-stacked DRAMs using TSVs are a promising solution for this purpose. In addition to providing high capacity, they offer functional flexibility to the computing system by attaching a logic die to each 3D-stacked DRAM chip. However, high-capacity 3D-stacked DRAMs suffer from significant refresh power consumption, which is required solely to maintain data. Although various schemes have been proposed to mitigate this issue, they cannot be adopted by commercial products due to compatibility issues. To tackle this issue, we propose EXTREME, which effectively reduces the refresh power of 3D-stacked DRAMs. In order to retain compatibility with the OS, a simple page table manager is implemented at the logic die of 3D-stacked DRAM devices, which pins the page table to a specific memory space. The experimental results demonstrate that this reduces the refresh power at idle time by up to 98 percent with 16 KB of SRAM (static RAM) and 64 KB of DRAM register overhead for a 2 GB 3D-stacked DRAM device.
12/11/2017 2:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2723392
Analytical Miss Rate Calculation of L2 Cache from the RD Profile of L1 Cache
https://www.computer.org/csdl/trans/tc/2018/01/07970187-abs.html
Reuse distance (RD) is an important metric for analytical estimation of cache miss rate. To find the miss rate of a particular cache, the reuse distance profile has to be measured for that particular level and configuration of the cache. A significant amount of simulation time and overhead can be saved if we can find the miss rate of a higher-level cache, such as L2, from the RD profile with respect to a lower-level cache (i.e., a cache that is closer to the processor) such as L1. The objective of this paper is to give an analytical method to find the miss rate of the L2 cache for various configurations from the RD profile with respect to the L1 cache. We consider all three types of cache inclusion policies, namely (i) Strictly Inclusive, (ii) Mutually Exclusive, and (iii) Non-Inclusive Non-Exclusive. We first prove some general results relating the RD profile of the L1 cache to that of the L2 cache, using probabilistic analysis for our derivations. We validate our model against simulations, using the multi-core simulator Sniper with the PARSEC and the SPLASH benchmark suites.
12/11/2017 2:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2723878
Solving Large Problem Sizes of Index-Digit Algorithms on GPU: FFT and Tridiagonal System Solvers
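The basic RD-to-miss-rate step that the reuse-distance abstract above builds on can be made concrete: for a fully associative LRU cache of capacity C, a reference misses exactly when its reuse distance (the number of distinct blocks touched since the last access to the same block) is at least C. The paper's contribution, deriving L2 miss rates from the L1 RD profile, is more involved; this sketch shows only the basic relation.

```python
# Minimal reuse-distance sketch for a fully associative LRU cache.
def reuse_distances(trace):
    stack = []                        # LRU stack, most recently used last
    rds = []
    for block in trace:
        if block in stack:
            # distinct blocks above this one in the LRU stack = its RD
            rds.append(len(stack) - 1 - stack.index(block))
            stack.remove(block)
        else:
            rds.append(float('inf'))  # cold miss: infinite reuse distance
        stack.append(block)
    return rds

def miss_rate(rds, capacity):
    # LRU of size C misses exactly the references with RD >= C
    return sum(rd >= capacity for rd in rds) / len(rds)

trace = ['a', 'b', 'c', 'a', 'b', 'c', 'a']
rds = reuse_distances(trace)
```

For this trace every warm reference has RD 2, so a 3-entry cache suffers only the three cold misses, while a 2-entry cache misses everything.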
https://www.computer.org/csdl/trans/tc/2018/01/07970194-abs.html
Current <italic>Graphics Processing Units</italic> (GPUs) are capable of obtaining high computational performance in scientific applications. Nevertheless, programmers have to use suitable parallel algorithms for these architectures and usually have to consider optimization techniques in the implementation in order to achieve said performance. There are many efficient proposals for limited-size problems which fit directly in the shared memory of CUDA GPUs; however, there are few GPU proposals that tackle the design of efficient algorithms for large problem sizes that exceed shared memory storage capacity. In this work, we present a tuning strategy that addresses this problem for some parallel prefix algorithms that can be represented according to a set of common permutations of the digits of each of their element indices <xref ref-type="bibr" rid="ref1">[1]</xref>, denoted as Index-Digit (ID) algorithms. Specifically, our strategy has been applied to develop flexible Multi-Stage (MS) algorithms for the Fast Fourier Transform (FFT) (<italic>MS-ID-FFT</italic>) and a tridiagonal system solver (<italic>MS-ID-TS</italic>) on the GPU. The resulting implementation is compact and outperforms other well-known and commonly used state-of-the-art libraries, with an improvement of up to 1.47x with respect to <italic> NVIDIA's</italic> complex <italic>CUFFT</italic>, and up to 33.2x in comparison with <italic>NVIDIA's</italic> <italic>CUSPARSE</italic> for real-data tridiagonal systems.
12/11/2017 2:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2723879
STABLE: Stress-Aware Boolean Matching to Mitigate BTI-Induced SNM Reduction in SRAM-Based FPGAs
https://www.computer.org/csdl/trans/tc/2018/01/07974802-abs.html
The <italic>Bias-Temperature-Instability</italic> (BTI) aging mechanism reduces the <italic>Static-Noise-Margin</italic> (SNM) of SRAM cells. This leads to a higher <italic>Soft-Error-Rate</italic> (SER), lower reliability, and lower SRAM stability in FPGAs. SNM can be partially recovered by leveraging the recovery phase of BTI through flipping SRAM contents. We propose STABLE, a three-step post-synthesis stress-aware technique, to reduce the impact of BTI-induced SNM reduction in FPGA <italic>Look-up-Tables</italic> (LUTs) using the SAT-based <italic>Boolean Matching </italic> (BM) algorithm. STABLE partitions the <italic>Data-Flow-Graph</italic> (DFG) of the implemented design into different cones. First, the SAT-based BM algorithm finds a new configuration for each cone such that its functionality is preserved and all SRAMs are flipped. Second, cones that did not pass the first step can benefit from unused SRAMs in their partially-used LUTs for storing the flipped configurations of such LUTs. Finally, flipped configurations of fully-used LUTs are stored in the closest unused LUTs. The main configuration of the implemented FPGA design is periodically swapped with the new flipped configuration. Our extensive experimental analysis demonstrates average improvements of 69 and 70 percent in the SNM reduction (<inline-formula><tex-math notation="LaTeX">$\Delta SNM$</tex-math></inline-formula>) and the SER increase (<inline-formula> <tex-math notation="LaTeX">$\Delta SER$</tex-math></inline-formula>), respectively. Since the proposed methodology is deployed after the FPGA placement and routing of the application, the overhead is negligible.
12/11/2017 2:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2725952
Performability Analysis of Large-Scale Multi-State Computing Systems
https://www.computer.org/csdl/trans/tc/2018/01/07981361-abs.html
Modern computing systems typically use a large number of independent, non-identical computing nodes to perform a set of coordinated computations in parallel. The computing system and its constituent computing nodes often exhibit more than two performance levels or states corresponding to different computing powers. This paper models and evaluates the performability of large-scale multi-state computing systems, which is the probability that a computing system performs at a particular performance level. The heterogeneity in the constituent components of different nodes (due to factors such as different model generations, model suppliers, and operating environments) makes performability analysis difficult and challenging. In this paper, a specification method for the system performance level (SPL) is first introduced. A multi-valued decision diagram (MDD) based approach is then proposed for performability analysis of multi-state computing systems consisting of nodes with different state occupation probabilities, which encompasses novel and efficient MDD model generation procedures. Example and benchmark studies are performed to show that the proposed approach can offer efficient performability analysis of large-scale computing systems.
12/11/2017 2:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2723390
Leveraging Hardware-Assisted Virtualization for Deterministic Replay on Commodity Multi-Core Processors
https://www.computer.org/csdl/trans/tc/2018/01/07982675-abs.html
Deterministic replay, which provides the ability to travel backward in time and reconstruct the past execution flow of a multiprocessor system, has many prominent applications. Prior research in this area can be classified into two categories: hardware-only schemes and software-only schemes. While hardware-only schemes deliver high performance, they require significant modifications to the existing hardware. In contrast, software-only schemes work on commodity hardware, but suffer from excessive performance overhead and huge logs. In this article, we present the design and implementation of a novel system, Samsara, which uses the hardware-assisted virtualization (HAV) extensions to achieve efficient deterministic replay without requiring any hardware modification. Unlike prior software schemes which trace every single memory access to record interleaving, Samsara leverages HAV on commodity processors to track the read-set and write-set for implementing a chunk-based recording scheme in software. By doing so, we avoid all memory access detections, which is a major source of overhead in prior works. Evaluation results show that compared with prior software-only schemes, Samsara significantly reduces the log file size to 1/70th on average, and further reduces the recording overhead from about <inline-formula><tex-math notation="LaTeX">$10 \times$</tex-math><alternatives> <inline-graphic xlink:href="ren-ieq1-2727492.gif"/></alternatives></inline-formula>, reported by state-of-the-art works, to <inline-formula><tex-math notation="LaTeX">$2.1 \times$</tex-math><alternatives> <inline-graphic xlink:href="ren-ieq2-2727492.gif"/></alternatives></inline-formula> on average.
12/11/2017 2:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2727492
2017 Index IEEE Transactions on Computers Vol. 66
https://www.computer.org/csdl/trans/tc/2018/01/08176067-abs.html
12/11/2017 2:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2774209
2017 Reviewers List
https://www.computer.org/csdl/trans/tc/2018/01/08176068-abs.html
12/11/2017 2:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2773319
State of the Journal
https://www.computer.org/csdl/trans/tc/2018/01/08176070-abs.html
12/11/2017 2:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2770229
Arithmetical Improvement of the Round-Off for Cryptosystems in High-Dimensional Lattices
https://www.computer.org/csdl/trans/tc/2017/12/07891511-abs.html
In lattice-based cryptography (LBC), ciphertexts are represented as points near a lattice, and Babai’s round-off algorithm allows one to decrypt them when the secret key is known. Recently, an accelerated variant of the round-off, based on Residue Number Systems (RNSs), has been proposed. Herein, we combine this technique with the use of lattices in Optimal Hermite Normal Form (OHNF) and propose further refinements, so as to reduce the decryption complexity. This approach lends itself largely to data-level parallelism, allowing for low-latency decryption operations on multi-core CPUs with Single Instruction Multiple Data (SIMD) extensions, and achieves high throughput on GPUs. Finally, we are able to perform decryptions up to 20 times faster than the most efficient implementation in the related art, which exploits the Mixed-Radix System (MRS), on an Intel i7 6700K CPU, and we are able to decrypt up to 11,832 messages/s on a Titan X GPU.
11/07/2017 2:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2690420
Correctly Rounded Arbitrary-Precision Floating-Point Summation
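Babai's round-off step, which the lattice-decryption abstract above accelerates, can be shown in a deliberately tiny form: with a good secret basis B, a point t = Bz + e with small error e is decoded by rounding B⁻¹t to integers. This is a toy 2x2 sketch with made-up numbers, not the paper's RNS/OHNF variant, which works in high dimensions.

```python
# Toy Babai round-off in dimension 2 (illustrative basis and point).
def solve2x2(B, t):
    """Return B^{-1} t for a 2x2 matrix via Cramer's rule."""
    (a, b), (c, d) = B
    det = a * d - b * c
    return ((d * t[0] - b * t[1]) / det, (-c * t[0] + a * t[1]) / det)

def babai_round_off(B, t):
    x, y = solve2x2(B, t)
    zx, zy = round(x), round(y)        # nearest integer coefficients
    (a, b), (c, d) = B
    return (a * zx + b * zy, c * zx + d * zy)   # decoded lattice point

B = [[7, 1], [2, 8]]                   # short, nearly orthogonal secret basis
point = (7 * 3 + 1 * -2, 2 * 3 + 8 * -2)       # B applied to z = (3, -2)
noisy = (point[0] + 0.4, point[1] - 0.3)       # ciphertext-like noisy point
decoded = babai_round_off(B, noisy)
```

Decoding succeeds because the error is small relative to the (good) basis; with a long, skewed public basis, the same rounding would fail, which is the asymmetry LBC relies on.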
https://www.computer.org/csdl/trans/tc/2017/12/07891894-abs.html
We present a fast algorithm, together with its low-level implementation, for correctly rounded arbitrary-precision floating-point summation. The arithmetic is the one used by the GNU MPFR library: radix 2; no subnormals; each variable (each input and the output) has its own precision. We also give a worst-case complexity of this algorithm and describe how the implementation is tested.
11/07/2017 2:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2690632
Exponential Sums and Correctly-Rounded Functions
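To see what "correctly rounded summation" buys in the abstract above, a fixed-precision analogue suffices: Python's `math.fsum` returns the nearest binary64 value to the exact sum of its inputs (a single rounding), while naive left-to-right accumulation rounds at every step. The MPFR algorithm extends this guarantee to arbitrary, per-variable precisions.

```python
# Naive accumulation vs. correctly rounded summation of binary64 values.
import math

values = [1e16, 1.0, -1e16] * 1000   # the 1.0s are lost by naive addition

naive = 0.0
for v in values:
    naive += v                # each += rounds to binary64: 1e16 + 1.0 == 1e16

correct = math.fsum(values)   # one rounding of the exact sum
```

The exact sum is 1000.0, which `fsum` recovers; the naive loop returns 0.0 because each intermediate 1e16 + 1.0 rounds back to 1e16 (the spacing of binary64 at 1e16 is 2).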
https://www.computer.org/csdl/trans/tc/2017/12/07891945-abs.html
The 2008 revision of the IEEE-754 standard, which governs floating-point arithmetic, recommends that a certain set of elementary functions should be correctly rounded. Successful attempts at solving the Table Maker's Dilemma in binary64 made it possible to design <monospace>CRlibm</monospace>, a library which offers correctly rounded evaluation in binary64 of some functions of the usual <monospace>libm</monospace>. It evaluates functions using a two-step strategy, which relies on a folklore heuristic that is widely accepted in the community of mathematical function designers. Under this heuristic, one can compute the distribution of the lengths of runs of zeros/ones after the rounding bit of the value of the function at a given floating-point number. The goal of this paper is to change, whenever possible, this heuristic into a rigorous statement. The underlying mathematical problem amounts to counting integer points in the neighborhood of a curve, which we tackle using so-called exponential sums techniques, a tool from analytic number theory.
11/07/2017 2:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2690850
Optimization of Constant Matrix Multiplication with Low Power and High Throughput
https://www.computer.org/csdl/trans/tc/2017/12/07919250-abs.html
Constant matrix multiplication (CMM), i.e., the multiplication of a constant matrix with a vector, is a common operation in digital signal processing. It is a generalization of multiple constant multiplication (MCM), where a single variable is multiplied by a constant vector. Like MCM, CMM can be reduced to additions/subtractions and bit shifts. Finding a circuit with a minimal number of add/subtract operations is known as the CMM problem. While this leads to a reduction in circuit area, it may be less efficient for power consumption or throughput. It is well studied for the MCM problem that a) reducing the adder depth (AD) leads to a reduced power consumption and b) pipeline resources have to be considered during optimization to enhance throughput without wasting area. This paper addresses the optimization of CMM circuits considering both adder depth and pipelining for the first time. For that, a heuristic is proposed which evaluates the most attractive graph topologies. It is shown that the proposed method requires 12.5% fewer adders with minimal AD and 38.5% fewer pipelined operations. Synthesis results for recent FPGAs show that these reductions also translate to superior results in terms of delay and power consumption compared to the state-of-the-art.
11/07/2017 2:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2701365
Exact Lookup Tables for the Evaluation of Trigonometric and Hyperbolic Functions
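The add/shift decomposition underlying the CMM abstract above, in its simplest form: multiplication by a constant becomes shifts and adds/subtracts, e.g. 7x = (x << 3) − x. The 2x2 matrix below is chosen for illustration, not taken from the paper, and the sketch ignores the sharing and adder-depth optimization that is the paper's actual subject.

```python
# Constant multiplications expressed as shifts and adds/subtracts.
def mul7(x):    # 7 = 8 - 1: one shift, one subtract
    return (x << 3) - x

def mul10(x):   # 10 = 8 + 2: two shifts, one add
    return (x << 3) + (x << 1)

def cmm(v):
    """y = [[7, 10], [10, 7]] @ v using only shifts and adds/subtracts."""
    return (mul7(v[0]) + mul10(v[1]),
            mul10(v[0]) + mul7(v[1]))

y = cmm((3, 5))
```

In a real CMM circuit the point is to share intermediate terms across matrix rows (here both rows reuse the x7 and x10 blocks), which is what the graph-topology heuristic in the paper optimizes.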
https://www.computer.org/csdl/trans/tc/2017/12/07927421-abs.html
Elementary mathematical functions are pervasively used in many applications such as electronic calculators, computer simulations, or critical embedded systems. Their evaluation is always an approximation, which usually makes use of mathematical properties, precomputed tabulated values, and polynomial approximations. Each step generally combines error of approximation and error of evaluation on finite-precision arithmetic. When they are used, tabulated values generally embed rounding error inherent to the transcendence of elementary functions. In this article, we propose a general method to use error-free tabulated values, which is worthwhile when two or more terms have to be tabulated in each table row. For the trigonometric and hyperbolic functions, we show that Pythagorean triples can lead to such tables in little time and memory usage. When targeting correct rounding in double precision for the same functions, we also show that this method saves memory and floating-point operations by up to 29 and 42 percent, respectively.
11/07/2017 2:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2703870
Single Precision Logarithm and Exponential Architectures for Hard Floating-Point Enabled FPGAs
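Why Pythagorean triples give error-free table entries, as claimed in the exact-lookup-table abstract above: for a triple (a, b, c) with a² + b² = c², the pair (a/c, b/c) lies exactly on the unit circle, so a table of such cos/sin-like pairs embeds no rounding error from transcendence. A sketch using Euclid's parameterization (the table layout here is illustrative, not the paper's):

```python
# Generate exact points on the unit circle from Pythagorean triples.
from fractions import Fraction
from math import atan2

def triples(limit):
    """Triples (a, b, c) from Euclid's formula (m^2-n^2, 2mn, m^2+n^2), c <= limit."""
    out = []
    m = 2
    while m * m + 1 <= limit:
        for n in range(1, m):
            a, b, c = m * m - n * n, 2 * m * n, m * m + n * n
            if c <= limit:
                out.append((a, b, c))
        m += 1
    return out

# Each row: an exact (cos-like, sin-like) pair plus the angle it corresponds to.
table = [(Fraction(a, c), Fraction(b, c), atan2(b, a)) for a, b, c in triples(100)]
exact = all(co * co + si * si == 1 for co, si, _ in table)
```

Because the identity holds in exact rational arithmetic, the two tabulated terms per row carry no table rounding error, which is exactly the property the paper exploits.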
https://www.computer.org/csdl/trans/tc/2017/12/07927449-abs.html
In this article, we present a novel method for implementing floating-point (FP) elementary functions using the new FP single-precision addition and multiplication features of the Arria 10 and Stratix 10 DSP Block architecture. Our application examples are <inline-formula><tex-math notation="LaTeX">$\log (x)$</tex-math><alternatives> <inline-graphic xlink:href="pasca-ieq1-2703923.gif"/></alternatives></inline-formula> and <inline-formula> <tex-math notation="LaTeX">$\exp (x)$</tex-math><alternatives><inline-graphic xlink:href="pasca-ieq2-2703923.gif"/> </alternatives></inline-formula>, two of the most commonly required functions for emerging datacenter and computing FPGA targets. We explain why the combination of new FPGA technology and, at the same time, a massive increase in computing performance requirements fuels the need for this work. We show a comprehensive error analysis, and discuss various implementation trade-offs that demonstrate that the hard FP (HFP) Blocks, in conjunction with the traditional flexibility and connectivity of the FPGA, can provide a robust and high-performance solution. The architectures presented in this work meet OpenCL accuracy requirements. Our methods map extensively to embedded structures, and therefore result in a significant reduction in logic resources and routing stress compared to current methods. The methods also leverage the routing architectures introduced in the Stratix 10 device, which results in high function performance.
11/07/2017 2:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2703923
Fast Modular Arithmetic on the Kalray MPPA-256 Processor for an Energy-Efficient Implementation of ECM
https://www.computer.org/csdl/trans/tc/2017/12/07927487-abs.html
The Kalray MPPA-256 processor is based on a recent low-energy manycore architecture. In this article, we investigate its performance in multiprecision arithmetic for number-theoretic applications. We have developed a library for modular arithmetic that takes full advantage of the particularities of this architecture. This is in turn used in an implementation of the ECM, an algorithm for integer factorization using elliptic curves. For parameters corresponding to a cryptanalytic context, our implementation compares well to state-of-the-art implementations on GPU, while using much less energy.
11/07/2017 2:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2704082
High Performance Parallel Decimal Multipliers Using Hybrid BCD Codes
https://www.computer.org/csdl/trans/tc/2017/12/07931610-abs.html
A parallel decimal multiplier with improved performance is proposed in this paper by exploiting the properties of three different binary coded decimal (BCD) codes, namely the redundant BCD excess-3 code (XS-3), the overloaded decimal digit set (ODDS) code, and the BCD-4221/5211 code. Signed-digit radix-10 recoding is used to recode the BCD multiplier from the digit set [0, 9] to [-5, 5]. The redundant BCD XS-3 code is adopted to generate the multiplicand multiples in a carry-free manner. The XS-3 coded partial products (PPs) are converted to ODDS PPs to fit binary partial product reduction (PPR). In this paper, a regular decimal PPR tree using the ODDS and BCD-4221/5211 codes is proposed; it consists of a binary PPR tree block, a non-fixed-size BCD-4221 counter block, and a BCD-4221/5211 PPR tree block. The decimal carry-save algorithm based on BCD-4221/5211 is used in the PPR tree to obtain high-performance multipliers. Moreover, an improved PPG circuit and an improved parallel prefix/carry-select decimal adder are proposed to further improve the performance of the proposed multipliers. Analysis and comparison using 45 nm technology show that the proposed decimal multipliers are faster and require less hardware area than previous designs found in the technical literature.
11/08/2017 7:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2706262
Hardware Division by Small Integer Constants
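The signed-digit radix-10 recoding step mentioned in the decimal-multiplier abstract above can be sketched directly: each digit d in [0, 9] (plus an incoming transfer) maps into [-5, 5] by replacing d > 5 with d − 10 and carrying 1 into the next digit, so only the easy x1..x5 multiplicand multiples are ever needed. This is a behavioral sketch of the recoding rule, not the paper's circuit.

```python
# Recode a radix-10 number from digits in [0, 9] to signed digits in [-5, 5].
def recode_sd10(digits):
    """digits: least-significant first, each 0..9. Returns SD digits in [-5, 5]."""
    out, carry = [], 0
    for d in digits:
        d += carry
        if d > 5:              # 6..10 -> d - 10, carry 1 into the next digit
            out.append(d - 10)
            carry = 1
        else:
            out.append(d)
            carry = 0
    if carry:
        out.append(carry)      # possible extra leading digit
    return out

def value(digits):
    return sum(d * 10 ** i for i, d in enumerate(digits))

sd = recode_sd10([9, 7, 8])    # 879, least-significant digit first
```

The recoding preserves the numeric value: 879 becomes digits (-1, -2, -1, 1), i.e. 1000 − 100 − 20 + 1... more precisely 1000 − 100 − 20 + (−1)·1 reading positionally, and the value check below confirms it.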
https://www.computer.org/csdl/trans/tc/2017/12/07933010-abs.html
This article studies the design of custom circuits for division by a small positive constant. Such circuits can be useful for specific FPGA and ASIC applications. The first problem studied is the Euclidean division of an unsigned integer by a constant, computing a quotient and remainder. Several new solutions are proposed and compared against the state-of-the-art. As the proposed solutions use small look-up tables, they match well with the hardware resources of an FPGA. The article then studies whether the division by the product of two constants is better implemented as two successive dividers or as one atomic divider. It also considers the case when only a quotient or only a remainder is needed. Finally, it addresses the correct rounding of the division of a floating-point number by a small integer constant. All these solutions, and the previous state-of-the-art, are compared in terms of timing, area, and area-timing product. In general, the relevance domains of the various techniques are different on FPGA and on ASIC.
11/07/2017 2:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2707488
Efficient Multibyte Floating Point Data Formats Using Vectorization
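A software analogue of the constant-division circuits in the abstract above: division by a known constant d becomes multiplication by a precomputed reciprocal M = ceil(2^s / d) followed by a shift, with the remainder recovered by back-multiplication. The choice d = 3, s = 34 below is a standard sufficient configuration for 32-bit dividends, shown for illustration; the paper's LUT-based hardware designs refine this idea further.

```python
# Euclidean division by a constant via multiply-shift.
D, S = 3, 34
M = (1 << S) // D + 1          # ceil(2^34 / 3)

def divmod_const(x):
    """Quotient and remainder of x // 3 for 0 <= x < 2^32, with no division."""
    q = (x * M) >> S           # quotient via reciprocal multiply and shift
    return q, x - q * D        # remainder by back-multiplication

ok = all(divmod_const(x) == divmod(x, D)
         for x in list(range(10_000)) + [2**32 - 1, 2**31, 123_456_789])
```

Correctness for all 32-bit x follows because the reciprocal error ceil(2^34/3) − 2^34/3 = 2/3 is small enough that the rounding never crosses a quotient boundary below 2^33.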
https://www.computer.org/csdl/trans/tc/2017/12/07950938-abs.html
We propose a scheme for reduced-precision representation of floating point data on a continuum between IEEE-754 floating point types. Our scheme enables the use of lower precision formats for a reduction in storage space requirements and data transfer volume. We describe how our scheme can be accelerated using existing hardware vector units on two general-purpose processor (GPP) microarchitectures (Intel Ivy Bridge and Haswell), as well as on a numerical accelerator (Intel Xeon Phi). Our evaluation demonstrates that supporting reduced precision by exploiting native vector instructions can yield a low overhead custom-precision floating point solution that does not require specialized hardware support. In our experiments we find cases where our scheme is actually faster than native floating point types where the underlying vector instruction set supports efficient byte-level permutations.
11/07/2017 2:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2716355
Introduction to the Special Issue on Computer Arithmetic
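The continuum the multibyte-format abstract above describes can be sketched in a few lines: store only the leading k bytes of a big-endian IEEE-754 binary64 value and zero-fill the rest on load, giving a storage format anywhere between 1 and 8 bytes. The paper accelerates exactly this kind of byte permutation with vector shuffle instructions; plain `struct` calls stand in for them here, and the byte count k = 6 is an arbitrary illustration.

```python
# Byte-truncated binary64 storage: keep the k most significant bytes.
import struct

def compress(x, k):
    """Keep the k leading bytes of x's big-endian binary64 encoding."""
    return struct.pack('>d', x)[:k]

def decompress(buf, k):
    """Zero-fill the dropped low-order mantissa bytes and decode."""
    return struct.unpack('>d', buf + bytes(8 - k))[0]

x = 3.141592653589793
roundtrip6 = decompress(compress(x, 6), 6)   # ~6-byte storage, 16 mantissa bits lost
roundtrip8 = decompress(compress(x, 8), 8)   # full binary64, lossless
err = abs(roundtrip6 - x)
```

Because sign and exponent live in the leading bytes, truncation only discards low-order mantissa bits, so the error is bounded by the weight of the dropped bits (here below 2^-35 for values near pi).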
https://www.computer.org/csdl/trans/tc/2017/12/08097398-abs.html
The papers in this special issue focus on computer arithmetic, which is used in many applications, usually totally silently (one should keep in mind that even when running programs that are not at all numeric, memory addresses are computed, which involves additions, multiplications, and sometimes divisions). However, in some areas, it plays a central role.
11/07/2017 2:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2761278
Compressed Sharer Tracking and Relinquishment Coherence for Superior Directory Efficiency of Chip Multiprocessors
https://www.computer.org/csdl/trans/tc/2017/11/07911232-abs.html
To lower on-chip SRAM area overhead for chip multiprocessors (CMPs), this work presents a novel directory design which compresses <underline>p</underline>resent-bit <underline>v</underline>ectors (PVs) by dropping the “runs of zeros” that commonly exist, and which lets PVs be transformed into variants after sharer relinquishment for hashing to alternative table sets, lifting table utilization. Featuring <underline>re</underline>linquishment <underline>c</underline>oherence and <underline>co</underline>mpressed <underline>s</underline>harer <underline>t</underline>racking (ReCoST), the proposed design attains superior directory efficiency and maintains “exact” directory representations, as a result of dropping the abundant long runs of zeros present in PVs. According to full-system simulation using gem5 for a range of core counts under PARSEC benchmarks, ReCoST is found to enjoy 3.21<inline-formula> <tex-math notation="LaTeX">$\times$</tex-math><alternatives><inline-graphic xlink:href="shu-ieq1-2698043.gif"/> </alternatives></inline-formula> (or 2.64<inline-formula><tex-math notation="LaTeX">$\times$</tex-math><alternatives> <inline-graphic xlink:href="shu-ieq2-2698043.gif"/></alternatives></inline-formula>) higher directory storage efficiency than conventional bit-tracking directories (or the best directory known so far, called SCD) for a 64-core CMP under monotasking (or multitasking) workloads, while keeping execution slowdowns within 2.4 percent (or 3.3 percent).
10/06/2017 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2698043
LAWC: Optimizing Write Cache Using Layout-Aware I/O Scheduling for All Flash Storage
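The compression intuition in the ReCoST abstract above, shown in miniature: sharer presence-bit vectors are mostly zeros, so encoding only the set bits (i.e., dropping the runs of zeros) shrinks the common sparse case. This plain sparse encoding is used purely for illustration; ReCoST's actual PV format, variant transformation, and set hashing are more elaborate.

```python
# Sparse encoding of a presence-bit vector: store set-bit indices only.
def compress_pv(pv):
    """pv: list of 0/1 presence bits, one per core. Returns set-bit indices."""
    return [i for i, bit in enumerate(pv) if bit]

def decompress_pv(sharers, cores):
    pv = [0] * cores
    for i in sharers:
        pv[i] = 1
    return pv

pv = [0] * 64
pv[3] = pv[17] = 1             # two sharers out of 64 cores
enc = compress_pv(pv)          # 2 small indices instead of 64 bits
```

With two sharers among 64 cores, two 6-bit indices replace a 64-bit vector while keeping the representation exact, which mirrors the paper's point that exactness need not cost a full bit per core.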
https://www.computer.org/csdl/trans/tc/2017/11/07932926-abs.html
Flash memory-based SSD-RAIDs are swiftly replacing conventional hard disk drives by exhibiting improved performance and stability, especially in I/O-intensive environments. However, the variations in latency and throughput caused by uncoordinated internal garbage collection hamper further performance gains. In addition, the unwanted variations in each SSD can adversely influence the overall performance of the entire flash storage. This performance bottleneck can be substantially reduced by an internal write cache in the RAID controller, designed prudently by considering the crucial device characteristics. State-of-the-art write caches for RAID controllers fail to incorporate the device characteristics of flash memory-based SSDs and thus limit the performance gain. In this paper, we propose a novel cache design, namely Layout-Aware Write Cache (LAWC), to overcome the performance barrier imposed by independent garbage collections. LAWC implements (i) improved I/O scheduling for logically partitioned write caches, (ii) a destage write synchronization mechanism to allow individual write caches to flush write blocks into the SSD array in a coordinated manner, and (iii) a two-level hybrid cache algorithm utilizing a small front-level cache for improved write cache efficiency. LAWC reduces response time significantly, by 82.39 percent on RAID-0 and 68.51 percent on RAID-5 types of SSDs, when compared with state-of-the-art write cache algorithms.
10/06/2017 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2707408
Probabilistic Error Analysis of Approximate Recursive Multipliers
https://www.computer.org/csdl/trans/tc/2017/11/07935435-abs.html
Approximate multipliers are gaining importance in energy-efficient computing and require careful error analysis. In this paper, we present an error probability analysis for recursive approximate multipliers with approximate partial products. Since these multipliers are constructed from smaller approximate multiplier building blocks, we propose to derive the error probability of an arbitrary bit-width multiplier from the probabilistic model of the basic building block and the probability distributions of the inputs. The analysis is based on common features of recursive multipliers identified by carefully studying the behavioral models of state-of-the-art designs. Building further on the analysis, the Probability Mass Function (PMF) of the error is computed by individually considering all possible error cases and their inter-dependencies. We further discuss generalizations to approximate adder trees, signed multipliers, squarers and constant multipliers. The proposed analysis is validated by applying it to several state-of-the-art approximate multipliers and comparing with corresponding simulation results. The results show that the proposed analysis serves as an effective tool for predicting, evaluating and comparing the accuracy of various multipliers, yielding accurate error performance evaluations for the majority of the recursive multipliers. We also predict the multipliers’ performance in an image processing application to demonstrate the practical significance of the analysis.
10/06/2017 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2709542
Routing or Computing? The Paradigm Shift Towards Intelligent Computer Network Packet Transmission Based on Deep Learning
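The building-block idea above can be grounded in a tiny baseline: for a small enough block, the error PMF can simply be enumerated exhaustively over all inputs. The sketch below is not the paper's analytical method; it uses the well-known underdesigned 2x2 multiplier from the approximate-computing literature (exact everywhere except 3×3, which it maps to 7) as an assumed example block.

```python
from collections import Counter
from fractions import Fraction

def approx_mult_2x2(a, b):
    # Well-known underdesigned 2x2 multiplier: exact except 3*3 -> 7 (not 9).
    return 7 if (a == 3 and b == 3) else a * b

def error_pmf(mult, bits=2):
    # Exhaustively enumerate all input pairs (assuming uniform inputs) and
    # tally error = approximate - exact into a probability mass function.
    n = 1 << bits
    counts = Counter(mult(a, b) - a * b for a in range(n) for b in range(n))
    total = n * n
    return {e: Fraction(c, total) for e, c in counts.items()}
```

For this block, a single input pair out of 16 errs, with error 7 - 9 = -2, so the PMF assigns 1/16 to -2 and 15/16 to 0; the paper's contribution is propagating such block-level PMFs analytically through larger recursive multipliers instead of enumerating.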
https://www.computer.org/csdl/trans/tc/2017/11/07935536-abs.html
In recent years, Software Defined Routers (SDRs) (programmable routers) have emerged as a viable solution to provide a cost-effective packet processing platform with easy extensibility and programmability. Multi-core platforms significantly promote SDRs’ parallel computing capacities, enabling them to adopt artificial intelligence techniques, e.g., deep learning, to manage routing paths. In this paper, we explore new opportunities in packet processing with deep learning to inexpensively shift the computing needs from rule-based route computation to deep learning based route estimation for high-throughput packet processing. Even though deep learning techniques have been extensively exploited in various computing areas, researchers have, to date, not been able to effectively utilize deep learning based route computation for high-speed core networks. We envision a supervised deep learning system to construct the routing tables and show how the proposed method can be integrated with programmable routers using both Central Processing Units (CPUs) and Graphics Processing Units (GPUs). We demonstrate how our uniquely characterized input and output traffic patterns can enhance the route computation of the deep learning based SDRs through both analysis and extensive computer simulations. In particular, the simulation results demonstrate that our proposal outperforms the benchmark method in terms of delay, throughput, and signaling overhead.
10/06/2017 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2709742
A New Look at Counters: Don’t Run Like Marathon in a Hundred Meter Race
https://www.computer.org/csdl/trans/tc/2017/11/07936672-abs.html
In cryptography, counters (classically encoded as bit strings of fixed size for all inputs) are employed to prevent collisions on the inputs of the underlying primitive, which helps in proving security. In this paper, we present a unified notion for counters, called the <italic>counter function family</italic>, and identify some necessary and sufficient conditions on counters which give (possibly) simple proofs of security for various counter-based cryptographic schemes. We observe that these conditions are trivially true for the classical counters. We also identify and study two variants of the classical counter which satisfy the security conditions. The first variant has a message-length-dependent counter size, whereas the second variant uses universal coding to generate a message-length-independent counter size. Furthermore, these variants provide better performance for shorter messages. For instance, when the message size is $2^{19}$ bits, AES-LightMAC with a $64$-bit (classical) counter takes $1.51$ cycles per byte (cpb), whereas it takes $0.81$ cpb and $0.89$ cpb for the first and second variant, respectively. 
We benchmark the software performance of these variants against the classical counter by implementing them in MACs and the HAIFA hash function.
10/06/2017 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2710125
Enhancing Energy Efficiency of Multimedia Applications in Heterogeneous Mobile Multi-Core Processors
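As a rough illustration of the idea (not the authors' exact constructions), a classical counter pads every block index to one fixed width regardless of the message, while a message-length-dependent variant spends only as many bits as the number of blocks requires, so short messages waste fewer input bits on the counter. The encodings below are hypothetical sketches:

```python
def classic_counter(i, width=64):
    # Classical counter: every block index is padded to a fixed width,
    # independent of the message length.
    return format(i, f"0{width}b")

def length_dependent_counter(i, num_blocks):
    # Illustrative variant: counter width depends on the message length,
    # using just enough bits to index every block of this message.
    width = max(1, (num_blocks - 1).bit_length())
    return format(i, f"0{width}b")
```

For an 8-block message the variant needs only 3 bits per counter instead of 64, which is the intuition behind the reported short-message speedups (the real schemes must also keep distinct messages collision-free, which the paper's conditions formalize).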
https://www.computer.org/csdl/trans/tc/2017/11/07937919-abs.html
Recent smart devices have adopted heterogeneous multi-core processors with high-performance big cores and low-power small cores. Unfortunately, the conventional task scheduler for heterogeneous multi-core processors does not provide an appropriate amount of CPU resources to multimedia applications (whose QoS is important to users), resulting in energy waste; it often executes multimedia applications and non-multimedia applications on the same core. In this paper, we propose an advanced task scheduler for heterogeneous multi-core processors that provides an appropriate amount of CPU resources to multimedia applications. Our proposed task scheduler isolates multimedia applications from non-multimedia applications at runtime, exploiting the fact that multimedia applications have a specific thread for video/audio playback (to play video/audio, a multimedia application must use a function that generates this specific thread). Since multimedia applications usually require a smaller amount of CPU resources than non-multimedia applications thanks to dedicated hardware decoders, our proposed task scheduler allocates the former to the small cores and the latter to the big cores. In our experiments on an Android-based development board, our proposed task scheduler reduces system-wide (not just CPU) energy consumption by 8.9 percent, on average, compared to the conventional task scheduler, while preserving the QoS of multimedia applications. In addition, it improves the performance of non-multimedia applications by 13.7 percent, on average, compared to the conventional task scheduler.
10/06/2017 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2710317
Non-Volatile Memory Based Page Swapping for Building High-Performance Mobile Devices
https://www.computer.org/csdl/trans/tc/2017/11/07938390-abs.html
Smartphones are becoming increasingly high-performance, with advances in mobile processors and larger main memories to support feature-rich applications. However, the storage subsystem has always been a prohibitive factor that slows the pace of reaching even higher performance while maintaining a good user experience. Although today’s smartphones are equipped with larger-than-ever main memories, they consume more energy and still run out of memory. The slow NAND flash based storage vetoes the possibility of swapping—an important technique to extend main memory—and leaves a system that constantly terminates user applications under memory pressure. In this paper, we propose <italic>NVM-Swap</italic> by revisiting swapping for smartphones with fast, byte-addressable, non-volatile memory (NVM) technologies. Instead of using flash, we build the swap area with NVM, to allow high performance without sacrificing user experience. NVM-Swap supports <italic>Lazy Swap-in</italic>, which can reduce memory copy operations by giving swapped-out pages a second chance to stay in the byte-addressable NVM backed swap area. To avoid premature wear-out of certain NVMs, we also propose <italic>Heap-Wear</italic>, a wear leveling algorithm that distributes writes in NVM more evenly. Evaluation results based on the Google Nexus 5 smartphone show that our solution can effectively enhance smartphone performance and achieve better wear leveling of NVM.
10/06/2017 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2711620
A Fully-Pipelined Hardware Design for Gaussian Mixture Models
https://www.computer.org/csdl/trans/tc/2017/11/07938761-abs.html
Gaussian Mixture Models (GMMs) are widely used in many applications such as data mining, signal processing and computer vision, for probability density modeling and soft clustering. However, the parameters of a GMM need to be estimated from data by, for example, the Expectation-Maximization algorithm for Gaussian Mixture Models (EM-GMM), which is computationally demanding. This paper presents a novel design for the EM-GMM algorithm targeting reconfigurable platforms, with five main contributions. First, a pipeline-friendly EM-GMM with diagonal covariance matrices that can easily be mapped to hardware architectures. Second, a function evaluation unit for Gaussian probability density based on fixed-point arithmetic. Third, an extension of our approach to support a wide range of dimensions and/or components by fitting multiple pieces of smaller dimensions onto an FPGA chip. Fourth, a cost and performance model that estimates logic resources. Fifth, a dataflow design targeting the Maxeler MPC-X2000 with a Stratix-5SGSD8 FPGA that can run over 200 times faster than a 6-core Xeon E5645 processor, and over 39 times faster than a Pascal TITAN-X GPU. Our design provides a practical solution for training and exploring better parameters of GMMs with hundreds of millions of high-dimensional input instances, for low-latency and high-performance applications.
10/06/2017 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2712152
Performance Evaluation of Host Aware Shingled Magnetic Recording (HA-SMR) Drives
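A software sketch of EM for a diagonal-covariance GMM (plain NumPy floating point, not the paper's fixed-point FPGA dataflow) shows why the diagonal restriction is pipeline-friendly: the density factorizes into independent per-dimension terms, so no matrix inversion or determinant is needed.

```python
import numpy as np

def em_diag_gmm(X, k, iters=50, seed=0):
    # EM for a Gaussian mixture with diagonal covariances. With diagonal
    # covariance the log-density is a simple per-dimension sum, which is
    # what makes the computation easy to pipeline in hardware.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, k, replace=False)]   # means, initialized from data
    var = np.ones((k, d))                     # per-dimension variances
    pi = np.full(k, 1.0 / k)                  # mixing weights
    for _ in range(iters):
        # E-step: responsibilities from the factorized log-density.
        diff = X[:, None, :] - mu[None, :, :]             # (n, k, d)
        logp = -0.5 * np.sum(diff**2 / var + np.log(2 * np.pi * var), axis=2)
        logp += np.log(pi)
        logp -= logp.max(axis=1, keepdims=True)           # for stability
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)                 # (n, k)
        # M-step: weighted means, diagonal variances, and weights.
        nk = r.sum(axis=0)
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ X**2) / nk[:, None] - mu**2 + 1e-6   # keep positive
        pi = nk / n
    return pi, mu, var
```

The hardware design additionally replaces the exponential/log evaluations with a fixed-point function evaluation unit; this float sketch only mirrors the algorithmic structure.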
https://www.computer.org/csdl/trans/tc/2017/11/07942044-abs.html
Shingled Magnetic Recording (SMR) drives can benefit large-scale storage systems by reducing the Total Cost of Ownership (TCO) of dealing with explosive data growth. Among all existing SMR models, Host Aware SMR (HA-SMR) looks the most promising for its backward compatibility with legacy I/O stacks and its ability to use new SMR-specific APIs to support host I/O stack optimization. Building storage systems using HA-SMR drives calls for a deep understanding of the drive’s performance characteristics. To this end, we conduct in-depth performance evaluations of HA-SMR drives with a special emphasis on the performance implications of the SMR-specific APIs and how these drives can be deployed in large storage systems. We discover both favorable and adverse effects of using HA-SMR drives under various workloads. We also investigate the drives’ performance under legacy production environments using real-world enterprise traces. Finally, we propose a novel host-controlled buffer that can reduce the severity of the decline in HA-SMR performance under the unfavorable I/O access patterns we discovered. Even without a detailed, comprehensive design, we show the potential of the host-controlled buffer through a case study.
10/06/2017 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2713360
Tackling the Bus Turnaround Overhead in Real-Time SDRAM Controllers
https://www.computer.org/csdl/trans/tc/2017/11/07946168-abs.html
Synchronous dynamic random access memories (SDRAMs) are widely employed in multi- and many-core platforms due to their high density and low cost. Nevertheless, their benefits come at the price of a complex two-stage access protocol, which reflects their bank-based structure and an internal level of explicitly managed caching. In scenarios in which requestors demand real-time guarantees, these features pose a predictability challenge, and several SDRAM controllers have been proposed to tackle it. In this context, recent research shows that a combination of bank privatization and the <italic>open-row</italic> policy (exploiting the caching over the boundary of a single request) represents an effective way to tackle the problem. However, such an approach uncovered a new challenge: the data bus turnaround overhead. In SDRAMs, a single data bus is shared by read and write operations. Alternating read and write operations is, consequently, highly undesirable, as the data bus must remain idle during a turnaround. Therefore, in this article, we propose an SDRAM controller that reorders read and write commands to minimize data bus turnarounds. Moreover, we compare our approach analytically and experimentally with existing real-time SDRAM controllers from both the worst-case latency and power consumption perspectives.
10/06/2017 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2714672
Marginal Performance: Formalizing and Quantifying Power Over/Under Provisioning in NoC DVFS
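A toy model (not the proposed controller) makes the payoff of reordering concrete: count bus-direction switches under strict FIFO service versus a policy that batches same-direction commands before switching.

```python
from collections import deque

def turnarounds(schedule):
    # Count data-bus direction switches ('R' <-> 'W') in a command schedule;
    # each switch costs idle bus cycles on a real SDRAM.
    return sum(1 for a, b in zip(schedule, schedule[1:]) if a != b)

def reorder_batched(requests, batch=4):
    # Toy reordering policy: serve up to `batch` pending commands of the
    # current bus direction before switching, instead of strict FIFO.
    # (A real-time controller must also bound how long a command waits.)
    reads = deque(r for r in requests if r == 'R')
    writes = deque(r for r in requests if r == 'W')
    out, cur = [], 'R'
    while reads or writes:
        q = reads if cur == 'R' else writes
        for _ in range(min(batch, len(q))):
            out.append(q.popleft())
        cur = 'W' if cur == 'R' else 'R'
    return out
```

For the worst-case pattern `RWRWRWRW`, FIFO pays a turnaround on every request while the batched schedule `RRRRWWWW` pays exactly one; the paper's contribution is doing such reordering while still providing analyzable worst-case latency bounds.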
https://www.computer.org/csdl/trans/tc/2017/11/07947135-abs.html
In network-on-chip (NoC) based CMPs, DVFS is commonly used to co-optimize performance and power. To achieve optimal efficiency, it is important to gain performance growth proportional to power. However, power over/under provisioning often exists. To properly evaluate and guide NoC DVFS techniques, it is highly desirable to formalize and quantify power over/under provisioning. In this paper, we first show that application performance does not grow linearly with network power in an NoC-based CMP. Instead, their relationship is non-linear and can be captured by a performance-power characteristics curve (PPCC) with three distinct regions: an inertial region, a linear region, and a saturation region. We note that conventional DVFS metrics such as Performance Per Watt (PPW) cannot accurately evaluate such a non-linear relationship. Based on the PPCC, we propose a new figure of merit called Marginal Performance (MP), which evaluates the incremental performance per power increment after the inertial region. The MP concept enables us to formally define power over- and under-provisioning with reference to the linear region, in which an efficient NoC DVFS should operate. Applying the PPCC and MP concepts in full-system simulations with PARSEC and SPEC OMP2012 benchmarks, we are able to identify power over/under provisioning occurrences and to measure and compare their statistics in two recent NoC DVFS techniques. Moreover, we show evidence that MP can accurately and consistently evaluate NoC DVFS techniques, avoiding the misjudgement and inconsistency of PPW-based evaluations.
10/06/2017 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2715018
Customizing Clos Network-on-Chip for Neural Networks
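The MP idea can be sketched numerically: compute the incremental performance per unit of extra power between consecutive measured operating points, then compare each segment's slope against the linear region. The threshold and labels below are illustrative assumptions, not the paper's exact formalization.

```python
def marginal_performance(power, perf):
    # Marginal performance: incremental performance per power increment
    # between consecutive operating points on a measured PPCC.
    pts = list(zip(power, perf))
    return [(p2 - p1) / (w2 - w1) for (w1, p1), (w2, p2) in zip(pts, pts[1:])]

def classify(mp, linear_slope, tol=0.25):
    # Illustrative labeling: a segment whose slope falls well below the
    # linear region's slope wastes power (over-provisioned / saturation);
    # one well above it is starved (under-provisioned); otherwise linear.
    labels = []
    for m in mp:
        if m < (1 - tol) * linear_slope:
            labels.append('over-provisioned')
        elif m > (1 + tol) * linear_slope:
            labels.append('under-provisioned')
        else:
            labels.append('linear')
    return labels
```

On a synthetic PPCC that saturates (performance flattening while power keeps rising), the last segments are flagged as over-provisioned even though PPW could still look acceptable, which is the misjudgement the paper argues against.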
https://www.computer.org/csdl/trans/tc/2017/11/07948744-abs.html
Large-scale neural network accelerators are often implemented as a many-core chip and rely on a network-on-chip to manage the huge amount of inter-neuron traffic. The baseline and different variations of the well-known mesh and tree topologies are the most popular topologies in prior many-core implementations of neural networks. However, the grid-like mesh and hierarchical tree topologies suffer from high diameter and low bisection bandwidth, respectively. In this paper, we present ClosNN, a customized <italic>Clos</italic> topology for <italic>N</italic>eural <italic>N</italic>etworks. The inherent capability of Clos to support multicast and broadcast traffic in a simple and efficient way, as well as its adaptable bisection bandwidth, is the major motivation behind proposing a customized version of this topology as the communication infrastructure of large-scale neural network implementations. We compare ClosNN with some state-of-the-art NoC topologies adopted in recent neural network hardware accelerators and show that it offers lower average message hop count and higher throughput, which directly translates to faster neural information processing.
10/06/2017 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2715158
Majority Logic Formulations for Parallel Adder Designs at Reduced Delay and Circuit Complexity
https://www.computer.org/csdl/trans/tc/2017/10/07909019-abs.html
The design of high-performance adders has experienced renewed interest in the last few years; among high-performance schemes, parallel prefix adders constitute an important class. They require a logarithmic number of stages and are typically realized using AND-OR logic; moreover, with the emergence of new device technologies based on majority logic, new and improved adder designs are possible. However, the best existing majority gate-based prefix adder incurs a delay of $2\log_2(n) - 1$ (due to the $n$th carry); this is only marginally better than a design using only AND-OR gates (the latter has a $2\log_2(n) + 1$ gate delay). This paper initially shows that this delay is caused by the output carry equation in majority gate-based adders, which is still largely defined in terms of AND-OR gates. Two new majority gate-based recursive techniques are then proposed. The first technique relies on a novel formulation of the majority gate-based equations in the <italic>group generate</italic> and <italic>group propagate</italic> hardware; this results in a new definition for the output carry, thus reducing the delay. The second contribution of this manuscript utilizes recursive properties of majority gates (through a novel operator) to reduce the circuit complexity of prefix adder designs. 
Overall, the proposed techniques compute the output carry of an $n$-bit adder with a majority gate delay of only $\log_2(n) + 1$. This leads to a reduction of 40 percent in delay and 30 percent in circuit complexity (in terms of the number of majority gates) for multi-bit addition in comparison to the best existing designs in the technical literature.
09/06/2017 2:06 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2696524
Polysynchronous Clocking: Exploiting the Skew Tolerance of Stochastic Circuits
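The basic majority-logic fact behind these designs can be checked in a few lines: the carry-out of a full adder is exactly a three-input majority, $c_{i+1} = \mathrm{MAJ}(x_i, y_i, c_i)$. The sketch below verifies a ripple adder built on that identity; sum bits use XOR for brevity, and it does not reproduce the paper's log-depth prefix structure.

```python
def maj(a, b, c):
    # Three-input majority gate: outputs 1 iff at least two inputs are 1.
    return (a & b) | (b & c) | (a & c)

def ripple_add(x, y, bits=8):
    # Ripple-carry adder where every carry is a single majority gate:
    # c[i+1] = MAJ(x[i], y[i], c[i]). Sum bits use XOR here for brevity;
    # the paper expresses those in majority logic as well.
    carry, out = 0, 0
    for i in range(bits):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        out |= (xi ^ yi ^ carry) << i
        carry = maj(xi, yi, carry)
    return out
```

The paper's prefix formulations restructure the carry recurrence so the $n$th carry needs only $\log_2(n) + 1$ majority-gate levels instead of this linear chain.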
https://www.computer.org/csdl/trans/tc/2017/10/07911306-abs.html
In the paradigm of stochastic computing, arithmetic functions are computed on randomized bit streams. The method naturally and effectively tolerates very high clock skew. Exploiting this advantage, this paper introduces polysynchronous clocking, a design strategy in which clock domains are split at a very fine level. Each domain is synchronized by an inexpensive local clock. Alternatively, the skew requirements for a global clock distribution network can be relaxed. This allows a higher working frequency and thus lower latency. The benefits of both approaches are quantified. Polysynchronous clocking results in significant latency, area, and energy savings for a wide variety of applications.
09/06/2017 2:06 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2697881
HRC: A 3D NoC Architecture with Genuine Support for Runtime Thermal-Aware Task Management
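The skew tolerance rests on a basic stochastic-computing fact: ANDing two independent unipolar bit streams multiplies the probabilities they encode, and this statistic does not depend on which particular clock edges sample the bits. A minimal software sketch of the encoding and the AND-as-multiply property:

```python
import random

def to_stream(p, n, rng):
    # Unipolar encoding: value p in [0, 1] becomes a random bit stream
    # whose fraction of 1s approximates p.
    return [1 if rng.random() < p else 0 for _ in range(n)]

def sc_multiply(p, q, n=100_000, seed=0):
    # ANDing two independent unipolar streams yields a stream encoding
    # p * q; the result depends only on bitwise statistics, not on exact
    # clock alignment, which is the skew tolerance the paper exploits.
    rng = random.Random(seed)
    a = to_stream(p, n, rng)
    b = to_stream(q, n, rng)
    return sum(x & y for x, y in zip(a, b)) / n
```

With long enough streams the product converges up to sampling noise (e.g., 0.5 and 0.6 multiply to roughly 0.3), which is why a bit arriving one skewed clock edge late merely perturbs the estimate rather than corrupting a positional binary value.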
https://www.computer.org/csdl/trans/tc/2017/10/07914637-abs.html
In spite of escalating thermal challenges imposed by high power consumption, most reported 3D Network-on-Chip (NoC) systems that adopt the classic 3D cube (mesh) topology are unable to tackle thermal management issues directly at the architectural level. Rather, to keep the chip from overheating, tasks running on a “hot” node have to be migrated to a “cooler” one, increasing the distance between communicating nodes and ultimately hurting performance. In this paper, we propose a new 3D NoC architecture that genuinely supports runtime thermal-aware task management. Dubbed Hierarchical Ring Cluster (HRC), this hierarchical 3D NoC architecture has three levels across its network hierarchy: 1) nodes are grouped into rings, 2) rings are then grouped into cubes, and 3) multiple cubes are connected to form the whole network. Routing in an HRC system is also performed hierarchically: paths are set up within rings using low-latency circuit switching, and data that need to cross rings or cubes are routed following dimension-order routing supported by wormhole switching. In this organization, “hot” tasks that need to migrate can move along the rings without incurring increased communication distances. Our experimental results confirm that the proposed HRC architecture has a much lower network latency than other known 3D NoC architectures. When working with runtime thermal-aware task migration approaches, HRC can help reduce latency by as much as 80 percent compared to thermal-aware task migration applied to 3D mesh NoC topologies.
09/06/2017 2:07 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2698456
Towards Accurate Statistical Analysis of Security Margins: New Searching Strategies for Differential Attacks
https://www.computer.org/csdl/trans/tc/2017/10/07914659-abs.html
In today's internet-connected world, billions of computer systems are linked to one another in a global network. The internet provides an unsecured channel over which hundreds of terabytes of data are transmitted daily. Computer and software systems rely on encryption algorithms such as block ciphers to ensure that sensitive data remains confidential and secure. However, adversaries can leverage the statistical behavior of the underlying ciphers to recover encryption keys. Accurately evaluating the security margins of these encryption algorithms remains a major challenge. In this paper, we tackle this issue by introducing several searching strategies based on differential cryptanalysis. By clustering differential paths, the searching algorithm derives more accurate distinguishers than examining individual paths, which in turn provides a more accurate estimation of cipher security margins. We verify the effectiveness of this technique on ciphers with generalized Feistel and SPN structures, whereby the best distinguishers for each of the investigated ciphers were obtained by discovering clusters with thousands of paths. With the KATAN block cipher family as a test case, we also show how to apply the searching algorithm alongside other cryptanalysis techniques such as the boomerang attack and the related-key model to obtain the best cryptanalytic results. This also depicts the flexibility of the proposed searching scheme, which can be tailored to improve upon other differential attack variants. In short, the proposed searching strategy realizes an automated security evaluation tool with higher accuracy compared to previous techniques. In addition, it is applicable to a wide range of encryption schemes, which makes it a flexible tool for both academic research and industrial purposes.
09/06/2017 2:06 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2699190
A Differential Fault Attack on Plantlet
https://www.computer.org/csdl/trans/tc/2017/10/07917296-abs.html
Lightweight stream ciphers have received serious attention in the last few years. The present design paradigm considers a very small state (less than twice the key size) and the use of secret key bits during pseudo-random stream generation. One such effort, Sprout, was proposed two years ago and was broken almost immediately. After careful study of these attacks, a modified version named Plantlet has been designed very recently. While the designers of Plantlet do not provide any analysis of fault attacks, we note that Plantlet is even weaker than Sprout in terms of Differential Fault Attack (DFA). Our investigation, following ideas similar to those in the analysis against Sprout, shows that only around 4 faults are required to break Plantlet by DFA in a few hours' time. While a fault attack is indeed difficult to implement and our result does not reveal any weakness of the cipher in normal mode, we believe these initial results will be useful for further understanding of Plantlet.
09/06/2017 2:06 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2700469
VEGa: A High Performance Vehicular Ethernet Gateway on Hybrid FPGA
https://www.computer.org/csdl/trans/tc/2017/10/07917319-abs.html
Modern vehicles employ a large amount of distributed computation and require the underlying communication scheme to provide high bandwidth and low latency. Existing communication protocols like Controller Area Network (CAN) and FlexRay do not provide the required bandwidth, paving the way for the adoption of Ethernet as the next-generation network backbone for in-vehicle systems. Ethernet would co-exist with safety-critical communication on legacy networks, providing a scalable platform for evolving vehicular systems. This requires a high-performance network gateway that can simultaneously handle high bandwidth, low latency, and isolation; features that are not achievable with traditional processor-based gateway implementations. We present VEGa, a configurable vehicular Ethernet gateway architecture utilising a hybrid FPGA to closely couple software control on a processor with a dedicated switching circuit on the reconfigurable fabric. The fabric implements isolated interface ports and an accelerated routing mechanism, which can be controlled and monitored from software. Further, reconfigurability enables the switching behaviour to be altered at run-time under software control, while the configurable architecture allows easy adaptation to different vehicular architectures using high-level parameter settings. We demonstrate the architecture on the Xilinx Zynq platform and evaluate the bandwidth, latency, and isolation using extensive tests in hardware.
09/06/2017 2:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2700277
Vector Instruction Set Extensions for Efficient Computation of <sc>Keccak</sc>
https://www.computer.org/csdl/trans/tc/2017/10/07918507-abs.html
We investigate the design of a new instruction set for the <sc>Keccak</sc> permutation, a cryptographic kernel for hashing, authenticated encryption, keystream generation and random-number generation. <sc>Keccak</sc> is the basis of the SHA-3 standard and the newly proposed <sc>Keyak</sc> and <sc>Ketje</sc> authenticated ciphers. We develop the instruction extensions for a 128-bit interface, commonly available in the vector-processing unit of many modern processors. We examine the trade-off between flexibility and efficiency, and we propose a set of six custom instructions to support a broad range of <sc>Keccak</sc>-based cryptographic applications. We motivate our custom-instruction selections using a design space exploration that considers various methods of partitioning the state and the operations of the <sc>Keccak</sc> permutation, and we demonstrate an efficient implementation of this permutation with the proposed instructions. To evaluate their performance, we integrate a simulation model of the proposed ARM NEON vector instructions into the GEM5 micro-architecture simulator. With this simulation model, we evaluate the performance improvement for several cryptographic operations that use the <sc>Keccak</sc> permutation. Compared to a state-of-the-art NEON software implementation, we demonstrate a performance improvement of 2.2x for SHA-3. Compared to optimized 32-bit assembly programming, we demonstrate a performance improvement of 2.6x, 1.6x, and 1.4x for <sc>River</sc> <sc>Keyak</sc>, <sc>Ketje</sc>SR and <sc>Ketje</sc>JR respectively. The proposed instructions require 4,658 gate equivalents (GE) in 90 nm, which represents only a tiny fraction of the hardware cost of a modern processor.
09/06/2017 2:06 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2700795
Single Event Transient Tolerant Bloom Filter Implementations
https://www.computer.org/csdl/trans/tc/2017/10/07921607-abs.html
Bloom filters have been used to reduce delay in networking and computing applications where a set membership check is applied. Error sources can affect the behavior of Bloom filters, resulting in a wrong outcome of this membership test and possibly affecting the system's output. Single event transients are a type of temporary error that alters the operation of combinational logic. A single event transient affecting the hash generation logic of a hardware-implemented Bloom filter can produce errors such as false negatives. This paper presents different approaches to building Bloom filters that are tolerant to single event transients occurring in the hash generation circuitry. They are compared with traditional Modular Redundancy approaches. The results show that the new schemes can significantly reduce the circuit area needed to implement the Bloom filter.
09/06/2017 2:07 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2702174
Checking Big Suffix and LCP Arrays by Probabilistic Methods
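For reference, the fault-free Bloom filter behavior being protected (no false negatives, occasional false positives) can be sketched as follows; a transient fault that corrupts one hash index during a query is precisely what would break the no-false-negative guarantee. The salted SHA-256 hashing is an illustrative software choice, not the hardware hash circuitry studied in the paper.

```python
import hashlib

class BloomFilter:
    # Minimal Bloom filter: k hash functions derived from SHA-256 with
    # distinct salts. Membership queries may yield false positives, but a
    # correctly operating filter never yields false negatives.
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _indexes(self, item):
        # Derive k bit positions from salted digests of the item.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx] = 1

    def __contains__(self, item):
        # If a transient fault flipped one index here, a stored item could
        # probe an unset bit and wrongly report absence (a false negative).
        return all(self.bits[idx] for idx in self._indexes(item))
```

The paper's schemes harden the `_indexes` equivalent (the hash generation circuitry) against such transients at lower area cost than triplicating it.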
https://www.computer.org/csdl/trans/tc/2017/10/07922513-abs.html
For full-text indexing of massive data, the suffix and LCP (longest common prefix) arrays have been recognized as fundamental data structures, and there are at least two practical needs for checking their correctness: program debugging and verifying arrays constructed by probabilistic algorithms. Two probabilistic methods are proposed to check the suffix and LCP arrays of constant or integer alphabets in external memory using a Karp-Rabin fingerprinting technique, where the check is wrong only with a negligible error probability. The first method checks the lexicographical order and the LCP-value of two suffixes by computing and comparing the fingerprints of their LCPs. This method is general in that it can verify any full or sparse suffix/LCP array of any order. The second method uses less space: it first employs the fingerprinting technique to verify a subset of the given suffix and LCP arrays, from which two new suffix and LCP arrays are induced and compared with the given arrays for verification; for constant alphabets, the induced suffix and LCP arrays can be discarded to save space.
09/06/2017 2:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2702642
NPAM: NVM-Aware Page Allocation for Multi-Core Embedded Systems
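The fingerprinting primitive behind both methods can be sketched in memory (the paper works in external memory): with prefix fingerprints precomputed, any substring fingerprint is available in O(1), so the LCP of two suffixes can be verified by binary search on the longest equal-fingerprint prefix. The base and modulus below are illustrative choices; equal substrings always fingerprint equally, and unequal ones collide only with negligible probability, which is the error model of the paper.

```python
MOD = (1 << 61) - 1   # a Mersenne prime modulus (illustrative choice)
BASE = 131            # hash base (in practice, chosen at random)

def prefix_fingerprints(s):
    # fp[i] = Karp-Rabin fingerprint of s[:i]; pw[i] = BASE**i mod MOD.
    fp, pw = [0], [1]
    for ch in s:
        fp.append((fp[-1] * BASE + ord(ch)) % MOD)
        pw.append(pw[-1] * BASE % MOD)
    return fp, pw

def substr_fp(fp, pw, i, j):
    # Fingerprint of s[i:j] in O(1) from the prefix fingerprints.
    return (fp[j] - fp[i] * pw[j - i]) % MOD

def lcp(s, fp, pw, i, j):
    # Verify/compute the LCP of suffixes s[i:] and s[j:] by binary search
    # on the longest prefix whose fingerprints agree.
    lo, hi = 0, min(len(s) - i, len(s) - j)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if substr_fp(fp, pw, i, i + mid) == substr_fp(fp, pw, j, j + mid):
            lo = mid
        else:
            hi = mid - 1
    return lo
```

For `s = "banana"`, the suffixes at positions 1 (`anana`) and 3 (`ana`) share the 3-character prefix `ana`, which the fingerprint comparison recovers without character-by-character scanning.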
https://www.computer.org/csdl/trans/tc/2017/10/07926310-abs.html
Energy consumption is one of the prominent design constraints of multi-core embedded systems. Since the memory subsystem is responsible for a considerable portion of the energy consumption of embedded systems, Non-Volatile Memories (NVMs) have been proposed as candidates for replacing conventional memories such as SRAM and DRAM. The advantages of NVMs compared to conventional memories are that they consume less leakage power and provide higher density. However, these memories suffer from increased write operation overhead and limited lifetime. To address these issues, researchers have proposed NVM-aware memory management techniques that consider the characteristics of the system's memories when deciding on the placement of application data. In systems equipped with a memory management unit (MMU), application data is partitioned into pages during the compile phase and managed at page level during the runtime phase. In this paper, we present an NVM-aware data partitioning and mapping technique for MMU-equipped multi-core embedded systems that determines the placement of application data based on the access pattern of the data and the characteristics of the memories. The experimental results show that the proposed technique reduces the energy consumption of the system by 28.10 percent on average.
09/06/2017 2:06 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2703824
Off-the-Hook: An Efficient and Usable Client-Side Phishing Prevention Application
https://www.computer.org/csdl/trans/tc/2017/10/07926371-abs.html
Phishing is a major problem on the Web. Despite the significant attention it has received over the years, there has been no definitive solution. While the state-of-the-art solutions have reasonably good performance, they suffer from several drawbacks, including the potential to compromise user privacy, difficulty in detecting phishing websites whose content changes dynamically, and reliance on features that are too dependent on the training data. To address these limitations, we present a new approach for detecting phishing webpages in real-time as they are visited by a browser. It relies on modeling inherent phisher limitations stemming from the constraints they face while building a webpage. Consequently, the implementation of our approach, <italic>Off-the-Hook</italic>, exhibits several notable properties, including high accuracy, brand independence, good language independence, fast decisions, and resilience to both dynamic phish and evolution in phishing techniques. <italic>Off-the-Hook</italic> is implemented as a fully client-side browser add-on, which preserves user privacy. In addition, <italic>Off-the-Hook</italic> identifies the target website that a phishing webpage is attempting to mimic and includes this target in its warning. We evaluated <italic>Off-the-Hook</italic> in two user studies. Our results show that users prefer <italic>Off-the-Hook</italic> warnings to Firefox warnings.09/06/2017 2:06 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2703808Improved Schedulability Analysis Using Carry-In Limitation for Non-Preemptive Fixed-Priority Multiprocessor Scheduling
https://www.computer.org/csdl/trans/tc/2017/10/07927414-abs.html
A time instant is said to be a <italic>critical instant</italic> for a task, if the task’s arrival at the instant makes the duration between the task’s arrival and completion the longest. Critical instants for a task, once revealed, make it possible to check the task’s schedulability by investigating situations associated with the critical instants. This potentially results in efficient and tight schedulability tests, which is important in real-time systems. For example, existing studies have discovered critical instants under preemptive fixed-priority scheduling (P-FP), which limit interference from carry-in jobs, yielding the state-of-the-art schedulability tests on both uniprocessor and multiprocessor platforms. However, studies on schedulability tests associated with critical instants have not matured yet for non-preemptive scheduling, especially on a multiprocessor platform. In this paper, we find necessary conditions for critical instants for non-preemptive global fixed-priority scheduling (NP-FP) on a multiprocessor platform, and develop a new schedulability test that takes advantage of the finding for reducing carry-in jobs’ interference. Evaluation results show that the proposed schedulability test finds up to 14.3 percent additional task sets schedulable by NP-FP, which are not deemed schedulable by the state-of-the-art NP-FP schedulability test.09/06/2017 2:05 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2704083A Dual-Clock Multiple-Queue Shared Buffer
https://www.computer.org/csdl/trans/tc/2017/10/07930511-abs.html
Multiple parallel queues are versatile hardware data structures that are extensively used in modern digital systems. To achieve maximum scalability, the multiple queues are built on top of a dynamically-allocated shared buffer that allocates the buffer space to the various active queues, based on a linked-list organization. This work focuses on dynamically-allocated multiple-queue shared buffers that allow their read and write ports to operate in different clock domains. The proposed dual-clock shared buffer follows a tightly-coupled organization that merges the tasks of signal synchronization across asynchronous clock domains and queueing (buffering) in a common hardware module. When compared to other state-of-the-art dual-clock multiple-queue designs, the new architecture is demonstrated to yield a substantially lower-cost implementation. Specifically, hardware area savings of up to 55 percent are achieved, while still supporting full-throughput operation.09/06/2017 2:07 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2705141Thread Criticality Assisted Replication and Migration for Chip Multiprocessor Caches
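The linked-list organization that the shared-buffer abstract above builds on can be illustrated in software: one pool of slots is shared by all logical queues, with each queue threaded through a next-pointer array and free slots kept on a free list. This is a behavioral sketch under our own naming; the paper's actual contribution, the dual-clock synchronization, is not modeled here.

```python
class SharedMultiQueue:
    """Many logical queues dynamically sharing one buffer via linked lists."""

    def __init__(self, num_slots, num_queues):
        self.data = [None] * num_slots
        self.next = list(range(1, num_slots)) + [-1]  # free-list links
        self.free_head = 0                            # first free slot
        self.head = [-1] * num_queues                 # per-queue head slot
        self.tail = [-1] * num_queues                 # per-queue tail slot

    def enqueue(self, q, value):
        if self.free_head == -1:
            return False                              # shared buffer full
        slot, self.free_head = self.free_head, self.next[self.free_head]
        self.data[slot] = value
        self.next[slot] = -1
        if self.head[q] == -1:
            self.head[q] = slot                       # queue was empty
        else:
            self.next[self.tail[q]] = slot            # link after old tail
        self.tail[q] = slot
        return True

    def dequeue(self, q):
        slot = self.head[q]
        if slot == -1:
            return None                               # queue empty
        value = self.data[slot]
        self.head[q] = self.next[slot]
        if self.head[q] == -1:
            self.tail[q] = -1
        self.next[slot] = self.free_head              # return slot to pool
        self.free_head = slot
        return value
```

Because the pool is shared, any queue may grow until the whole buffer is exhausted, which is exactly the scalability property the abstract attributes to dynamic allocation.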
https://www.computer.org/csdl/trans/tc/2017/10/07931700-abs.html
Non-Uniform Cache Architecture (NUCA) is a viable solution to mitigate the problem of large on-chip wire delay due to the rapid increase in the cache capacity of chip multiprocessors (CMPs). By partitioning the last-level cache (LLC) into smaller banks connected by an on-chip network, access latency exhibits a non-uniform distribution. Various works have explored the NUCA design space, including block migration, block replication, and block searching. However, all of the previous mechanisms designed for NUCA are thread-oblivious when multi-threaded applications are deployed on CMP systems. Due to interference on shared resources, threads often demonstrate unbalanced progress, wherein the lagging threads with slow progress are more critical to overall performance. In this paper, we propose a novel NUCA design called thread <underline>C</underline>riticality <underline>A</underline>ssisted <underline>R</underline>eplication and <underline>M</underline>igration (CARM). CARM exploits runtime thread criticality information as hints to adjust block replication and migration in NUCA. Specifically, CARM aims at boosting parallel application execution by prioritizing block replication and migration for critical threads. Full-system experimental results show that CARM reduces the execution time of a set of PARSEC workloads by 13.7 and 6.8 percent on average compared with the traditional D-NUCA and Re-NUCA, respectively. Moreover, CARM also consumes much less energy than the evaluated schemes.09/06/2017 2:05 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2705678MONTRES: Merge ON-the-Run External Sorting Algorithm for Large Data Volumes on SSD Based Storage Systems
https://www.computer.org/csdl/trans/tc/2017/10/07932112-abs.html
External sorting algorithms are commonly used by data-centric applications to sort quantities of data that are larger than main memory. Many external sorting algorithms have been proposed in state-of-the-art studies to take advantage of SSD performance properties to accelerate the sorting process. In this paper, we demonstrate that, unfortunately, many of those algorithms fail to scale when the dataset size increases under memory pressure. To address this issue, we propose a new sorting algorithm named MONTRES. MONTRES relies on an SSD performance model while decreasing the overall number of I/O operations. It does this by reducing the amount of temporary data generated during the sorting process, continuously evicting small values to the final sorted file. MONTRES scales well with growing datasets under memory pressure. We tested MONTRES using several data distributions, different amounts of main-memory workspace, and three SSD models. Results show that MONTRES outperforms state-of-the-art algorithms, reducing the sorting execution time of TPC-H datasets by more than 30 percent when the ratio of file size to main-memory size is high.09/06/2017 2:06 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2706678A Reconfigurable Wireless NoC for Large Scale Microbiome Community Analysis
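For context on what algorithms like MONTRES improve upon, the textbook two-pass external merge sort (run generation followed by a k-way merge) can be sketched as follows; in-memory lists stand in for the temporary run files, and this is the baseline scheme, not the MONTRES algorithm itself.

```python
import heapq

def external_sort(records, memory_capacity):
    """Sort an iterable using at most `memory_capacity` records of
    workspace during run generation; runs stand in for temporary files."""
    runs, buf = [], []
    for rec in records:
        buf.append(rec)
        if len(buf) == memory_capacity:   # workspace full: emit a sorted run
            runs.append(sorted(buf))
            buf = []
    if buf:
        runs.append(sorted(buf))          # final partial run
    return list(heapq.merge(*runs))       # single k-way merge pass
```

The temporary data this baseline writes (every record appears once in a run before the merge) is precisely the overhead the abstract says MONTRES reduces by evicting small values directly to the final sorted output.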
https://www.computer.org/csdl/trans/tc/2017/10/07932130-abs.html
Understanding the role of competition and cooperation among the multiple interacting species of microorganisms that constitute the microbiome, and deciphering how they enforce homeostasis or trigger diseases, requires the development of multi-scale computational models capable of capturing both intra-cell processing (i.e., gene-to-protein interactions) and inter-cell interactions. The multi-scale interdependency that governs the interactions from genes to proteins within a cell, and from molecular messengers to cells to microbial communities within the environment, raises numerous computation and communication challenges. Internal cell processing cannot be simulated without knowledge of the surroundings. Similarly, cell-cell communication cannot be fully abstracted without the state of internal processing and the diffusion effects of molecular messengers. To address the compute- and communication-intensive nature of modeling microbial communities, in this paper we propose a novel reconfigurable NoC-based manycore architecture capable of simulating a large-scale microbial community. The reconfiguration of the NoC topology is achieved through fractal analysis of NoC traffic and use of on-chip wireless interfaces. More precisely, we analyze the computational and communication workloads and exploit the observed fractal characteristics to propose a mathematical strategy for NoC reconfiguration. Experimental results demonstrate that the proposed NoC architecture achieves 56.6 and 62.8 percent improvement in energy-delay product over the conventional wireline mesh and flattened-butterfly-based high-radix NoC architectures, respectively.09/06/2017 2:06 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2706278Time and Space-Efficient Write Parallelism in PCM by Exploiting Data Patterns
https://www.computer.org/csdl/trans/tc/2017/09/07870591-abs.html
The size of the write unit in PCM, namely the number of bits that can be written concurrently, is restricted due to high write energy consumption. Several serially executed write units are therefore typically needed to finish a cache-line service when using PCM as the main memory, which results in long write latency and high energy consumption. To address this poor write performance, we propose a novel PCM write scheme called Min-WU (Minimize the number of Write Units). We observe that a few frequent zero-extended values dominate the write data patterns in typical multi-threaded applications (more than 40 and 44.9 percent of all memory accesses in PARSEC workloads and SPEC 2006 benchmarks, respectively). By leveraging a carefully designed chip-level data redistribution method, the data amount is balanced and the data pattern is made identical among all PCM chips. The key idea behind Min-WU is to minimize the number of serially executed write units in a cache-line service after data redistribution, through sFPC (simplified Frequent Pattern Compression), eRW (efficient Reordering Write operations method), and fWP (fine-tuned Write Parallelism circuits). Using Min-WU, the zero parts of write units can be indicated with predefined prefixes and the residues can be reordered and written simultaneously under power constraints. Our design improves the performance, energy consumption, and endurance of PCM-based main memory with low space and time overhead. Experimental results on 12 multi-threaded PARSEC 2.0 workloads show that Min-WU reduces read latency by 44 percent, write latency by 28 percent, running time by 32.5 percent, and energy by 48 percent, while achieving a 32 percent IPC improvement over the conventional write scheme, at the cost of a few memory cycles and less than 3 percent storage space overhead.
Evaluation results of 8 SPEC 2006 benchmarks demonstrate that Min-WU achieves 57.8/46.0 percent read/write latency reductions, a 28.7 percent IPC improvement, a 28 percent running time reduction, and a 62.1 percent energy reduction compared with the baseline under realistic memory hierarchy configurations.08/07/2017 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2677903Mastrovito Form of Non-Recursive Karatsuba Multiplier for All Trinomials
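The zero-extended-value observation that motivates Min-WU can be illustrated with a toy model: a word whose high-order bits are all zero needs only a short prefix plus its residue, so fewer serially executed write units are required per cache line. The word, residue, and unit sizes below are our assumptions, not the paper's, and the prefix encoding is a stand-in for the sFPC scheme.

```python
def compress_word(value, residue_bits=16):
    """Return (prefix, value): prefix 'Z' marks a zero-extended word whose
    high bits are all zero, so only residue_bits must be written."""
    if value < (1 << residue_bits):
        return ("Z", value)              # high half elided via the prefix
    return ("F", value)                  # full word must be written

def units_for_line(words, residue_bits=16, word_bits=32, unit_bits=64):
    """Serially executed write units needed for one cache line,
    without and with the zero-eliding compression."""
    raw_bits = len(words) * word_bits
    comp_bits = sum(residue_bits if p == "Z" else word_bits
                    for p, _ in (compress_word(w, residue_bits) for w in words))
    ceil_div = lambda a, b: -(-a // b)
    return ceil_div(raw_bits, unit_bits), ceil_div(comp_bits, unit_bits)
```

For a line of eight 32-bit words where seven are zero-extended, the compressed payload fits in three 64-bit write units instead of four, shortening the serialized write sequence in the same spirit as Min-WU.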
https://www.computer.org/csdl/trans/tc/2017/09/07870679-abs.html
We present a new type of bit-parallel non-recursive Karatsuba multiplier over <inline-formula> <tex-math notation="LaTeX">$GF(2^m)$</tex-math><alternatives><inline-graphic xlink:href="li-ieq1-2677913.gif"/> </alternatives></inline-formula> generated by an arbitrary irreducible trinomial. This design effectively exploits the Mastrovito approach and a shifted polynomial basis (SPB) to reduce the time complexity, and the Karatsuba algorithm to reduce the space complexity. We show that this type of multiplier is only one <inline-formula><tex-math notation="LaTeX"> $T_X$</tex-math><alternatives><inline-graphic xlink:href="li-ieq2-2677913.gif"/></alternatives></inline-formula> slower than the fastest bit-parallel multiplier for all trinomials, where <inline-formula><tex-math notation="LaTeX">$T_X$ </tex-math><alternatives><inline-graphic xlink:href="li-ieq3-2677913.gif"/></alternatives></inline-formula> is the delay of one 2-input XOR gate. Meanwhile, its space complexity is roughly 3/4 that of those multipliers. To the best of our knowledge, our scheme is the first to reach this time-delay bound. This result outperforms previously proposed non-recursive Karatsuba multipliers.08/07/2017 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2677913DRAM-Based Error Detection Method to Reduce the Post-Silicon Debug Time for Multiple Identical Cores
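The space-saving idea behind the Karatsuba part of the multiplier above can be illustrated in software: one Karatsuba level replaces four half-size GF(2)[x] products with three, and the result is reduced modulo a trinomial x^m + x^k + 1. This sketch models only the arithmetic, not the bit-parallel Mastrovito/SPB hardware structure; Python integers serve as GF(2) coefficient vectors.

```python
def clmul(a, b):
    """Carry-less (GF(2)[x]) schoolbook multiplication of bit-vectors."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def karatsuba_gf2(a, b, half):
    """One Karatsuba level: split each operand at `half` bits, so three
    half-size multiplications replace four."""
    mask = (1 << half) - 1
    a0, a1 = a & mask, a >> half
    b0, b1 = b & mask, b >> half
    low = clmul(a0, b0)
    high = clmul(a1, b1)
    mid = clmul(a0 ^ a1, b0 ^ b1) ^ low ^ high
    return low ^ (mid << half) ^ (high << (2 * half))

def reduce_trinomial(c, m, k):
    """Reduce a product modulo x^m + x^k + 1, using x^m ≡ x^k + 1."""
    while c >> m:
        hi = c >> m
        c = (c & ((1 << m) - 1)) ^ hi ^ (hi << k)
    return c
```

For instance, in GF(2^3) defined by x^3 + x + 1, the square of x^2 is x^4, which reduces to x^2 + x; the Karatsuba product itself always agrees with the schoolbook carry-less product.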
https://www.computer.org/csdl/trans/tc/2017/09/07872459-abs.html
In the post-silicon debug of multicore designs, the debug time has increased significantly because the number of cores undergoing debug has increased; however, the resources available to debug the design are limited. This paper proposes a new DRAM-based error detection method to overcome this challenge. The proposed method requires only three debug sessions even if multiple cores are present. The first debug session detects the error intervals of each core using golden signatures. The second session detects the error clock cycles in each core using a golden data stream. Instead of storing all of the golden data, the golden data stream is generated by selecting, for each interval, debug data guaranteed error-free by the first session. Finally, the error data in all cores are captured only during the third session. The experimental results on various debug cases show significant reductions in total debug time and in the amount of DRAM usage compared to previous methods.08/07/2017 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2678504A Block-Level Log-Block Management Scheme for MLC NAND Flash Memory Storage Systems
https://www.computer.org/csdl/trans/tc/2017/09/07874142-abs.html
NAND flash memory is the major storage medium for both mobile storage cards and enterprise Solid-State Drives (SSDs). Log-block-based Flash Translation Layer (FTL) schemes have been widely used to manage NAND flash memory storage systems in industry. In log-block-based FTLs, a few physical blocks called log blocks hold all page updates from a large number of data blocks. Frequent page updates in log blocks introduce significant overhead, so log blocks become the system bottleneck. To address this problem, this paper presents <italic>BLog</italic>, a block-level log-block management scheme for MLC NAND flash memory storage systems. In BLog, with block-level management, the update pages of a data block can be collected together and placed into the same log block as much as possible; therefore, we can effectively reduce the associativities of log blocks and thus the garbage collection overhead. We also propose a novel partial merge operation strategy called <italic>reduced-order merge</italic>, by which we can effectively postpone the garbage collection of log blocks so as to maximally utilize valid pages and reduce unnecessary erase operations in log blocks. Based on BLog, we design an FTL called <italic>BLogFTL</italic> for Multi-Level Cell (MLC) NAND flash. We conduct a set of experiments on a real hardware platform, on which both representative FTL schemes and the proposed BLogFTL have been implemented. The experimental results show that our scheme effectively reduces garbage collection operations and system response time compared to previous log-block-based FTLs for MLC NAND flash.08/07/2017 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2679180Hardware Assisted Fully Homomorphic Function Evaluation and Encrypted Search
https://www.computer.org/csdl/trans/tc/2017/09/07884927-abs.html
In this paper we propose a scheme to perform homomorphic evaluations of arbitrary depth with the assistance of a special module, the <italic>recryption box</italic>. Existing somewhat homomorphic encryption schemes can only perform homomorphic operations until the noise in the ciphertexts reaches a critical bound that depends on the parameters of the homomorphic encryption scheme. The classical approach of bootstrapping also allows arbitrary-depth evaluations, but has a detrimental impact on the size of the parameters, making the whole setup inefficient. We describe two different instantiations of our recryption box for assisting homomorphic evaluations of arbitrary depth. The recryption box refreshes ciphertexts by lowering the inherent noise and can be used with any instantiation of the parameters; i.e., unlike bootstrapping, it imposes no minimum parameter size. To demonstrate the practicality of the proposal, we design the recryption box on a Xilinx Virtex-6 FPGA board (ML605) to support the FV somewhat homomorphic encryption scheme. The recryption box requires 0.43 ms to refresh one ciphertext. Further, we use this recryption box to boost the performance of encrypted search. On a 40-core Intel server, we can perform encrypted search in a table of <inline-formula><tex-math notation="LaTeX">$2^{16}$</tex-math><alternatives> <inline-graphic xlink:href="sinharoy-ieq1-2686385.gif"/></alternatives></inline-formula> entries in around 20 seconds. This is roughly 20 times faster than the implementation without the recryption box.08/18/2017 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2686385Extending Real-Time Analysis for Wormhole NoCs
https://www.computer.org/csdl/trans/tc/2017/09/07884964-abs.html
The delay upper-bound analysis problem is of fundamental importance to real-time applications in Networks-on-Chip (NoCs). In this paper, we revisit two state-of-the-art analysis models for real-time communication in wormhole NoCs with priority-based preemptive arbitration and show that these models only support specific router architectures with large buffer sizes. We then propose an extended analysis model that estimates delay upper-bounds for all router architectures and buffer sizes by identifying and analyzing the differences between upstream and downstream indirect interference according to the relative positions of traffic flows, and by taking the buffer influence into consideration. Simulation-based evaluations show that our model supports one more router architecture and applies to smaller buffer sizes than the previous models.08/07/2017 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2686391MemFlex: A Shared Memory Swapper for High Performance VM Execution
https://www.computer.org/csdl/trans/tc/2017/09/07885535-abs.html
Ballooning is a popular solution for dynamic memory balancing. However, existing solutions may perform poorly in the presence of heavy guest swapping. Furthermore, when the host has sufficient free memory, guest virtual machines (VMs) under memory pressure are not able to use it in a timely fashion. Even after the guest VM has been recharged with sufficient memory via ballooning, the applications running on the VM are unable to utilize the free memory in the guest VM to quickly recover from severe performance degradation. To address these problems, we present <italic>MemFlex</italic>, a shared memory swapper for improving guest swapping performance in virtualized environments, with three novel features: (1) <italic>MemFlex</italic> effectively utilizes host idle memory by redirecting the VM swapping traffic to the host-guest shared memory area. (2) <italic>MemFlex</italic> provides a hybrid memory swapping model, which treats a fast but small shared-memory swap partition as the primary swap area whenever possible, and smoothly transitions to conventional disk-based VM swapping on demand. (3) Once ballooned with sufficient VM memory, <italic>MemFlex</italic> provides a fast swap-in optimization, which enables the VM to proactively swap in pages from shared memory using an efficient batch implementation. Instead of relying on costly page faults, this optimization offers just-in-time performance recovery by enabling memory-intensive applications to quickly regain their runtime momentum. Performance evaluation results demonstrate the effectiveness of <italic>MemFlex</italic> compared with existing swapping approaches.08/07/2017 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2686850Rate-Selective Caching for Adaptive Streaming Over Information-Centric Networks
https://www.computer.org/csdl/trans/tc/2017/09/07887688-abs.html
The growing demand for video content is reshaping our view of the current Internet and mandating a fundamental change for future Internet paradigms. The current focus on Information-Centric Networks (ICN) promises a novel approach to intrinsically handling large-scale content dissemination, caching, and retrieval. While ubiquitous in-network caching in ICNs can expedite video delivery, a pressing challenge lies in provisioning scalable video streaming over adaptive requests for different bit rates. In this paper, we propose novel video caching schemes in ICN to address variable bit rates and content sizes for best cache utilization. Our objective is to maximize overall throughput to improve the Quality of Service (QoS). To achieve this goal, we model the dynamic characteristics of rate adaptation, derive caps on average delay, and propose <italic>DaCPlace</italic>, which optimizes cache placement decisions. Building on <italic>DaCPlace</italic>, we further present a heuristic scheme, <italic>StreamCache</italic>, for low-overhead adaptive video caching. We conduct comprehensive simulations on NS-3 (specifically under the ndnSIM module). Results demonstrate that <italic>DaCPlace</italic> enables users to achieve the least delay per bit, and that <italic>StreamCache</italic> outperforms existing schemes, achieving performance close to that of <italic>DaCPlace</italic>.08/07/2017 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2687920An Energy-Efficient GPGPU Register File Architecture Using Racetrack Memory
https://www.computer.org/csdl/trans/tc/2017/09/07891951-abs.html
Extreme multi-threading and fast thread switching in modern GPGPUs require a large, power-hungry register file (RF), which quickly becomes one of the major obstacles to scaling energy-efficient GPGPU computing. In this work, we propose to implement a power-efficient GPGPU RF built on the newly emerged racetrack memory. Racetrack memory has small cell area, low dynamic power, and nonvolatility. Its unique access mechanism, however, results in a long and location-dependent access latency, which offsets the energy-saving benefit and can harm performance. To overcome the adverse impacts of racetrack-memory-based RF designs, we first propose a register mapping scheme to reduce the average access latency. Based on the register mapping, we develop a racetrack memory aware warp scheduling (RMWS) algorithm to further suppress the access latency. The RMWS design includes a new write buffer structure that improves scheduling efficiency as well as energy saving. We also investigate and optimize the design where multiple concurrent RMWS schedulers are employed. Experimental results show that our proposed techniques keep GPGPU performance similar to the SRAM-based RF baseline while reducing RF energy by 48.5 percent.08/07/2017 2:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2690855Performance/Reliability-Aware Resource Management for Many-Cores in Dark Silicon Era
https://www.computer.org/csdl/trans/tc/2017/09/07892847-abs.html
Aggressive technology scaling has enabled the fabrication of many-core architectures while triggering challenges such as a limited power budget and increased reliability issues, like aging phenomena. Dynamic power management and runtime mapping strategies can be utilized in such systems to achieve optimal performance while satisfying power constraints. However, lifetime reliability is generally neglected. We propose a novel lifetime-reliability/performance-aware resource co-management approach for many-core architectures in the dark silicon era. The approach is based on a two-layered architecture, composed of a long-term runtime reliability controller and a short-term runtime mapping and resource management unit. The former evaluates the cores’ aging status with respect to a target reference specified by the designer, and performs recovery actions on highly stressed cores by means of power capping. The aging status is utilized in runtime application mapping to maximize system performance while fulfilling reliability requirements and honoring the power budget. Experimental evaluation demonstrates the effectiveness of the proposed strategy, which outperforms the most recent state-of-the-art contributions.08/07/2017 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2691009