IEEE Transactions on Computers
https://www.computer.org/csdl/trans/tc/index.html
The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers, brief contributions, and comments on research in areas of current interest to its readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.
IEEE Computer Society Digital Library: a list of 100 recently published journal articles.
https://www.computer.org/csdl
Unbiased Rounding for HUB Floating-Point Addition
https://www.computer.org/csdl/trans/tc/2018/09/08300633-abs.html
Half-Unit-Biased (HUB) is an emerging format based on shifting the represented numbers by half unit in the last place (ULP). This format simplifies two's complement and round-to-nearest operations by preventing any carry propagation, saving power, time, and area. Since the IEEE floating-point standard uses unbiased rounding as the default mode, this feature is also desirable for HUB approaches. In this paper, we study unbiased rounding for HUB floating-point addition, both as a standalone operation and within a fused multiply-add (FMA). We show two different alternatives to eliminate the bias when rounding the sum results, either partially or totally. We also present an error analysis and the implementation results of the proposed architectures to help designers decide which option suits them best.
08/07/2018 2:07 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2807429

A Simulation Analysis of Redundancy and Reliability in Primary Storage Deduplication
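The core HUB property described in the abstract above (round-to-nearest without carry propagation) can be illustrated with a toy model of a HUB fraction. This is a hedged sketch of the format's arithmetic only, not the paper's adder or FMA architecture, and the bit width is arbitrary.

```python
# Toy model of a Half-Unit-Biased (HUB) fraction (illustrative only).
# A HUB number appends an implicit '1' bit after the stored bits,
# biasing every representable point by half a ULP; round-to-nearest
# then degenerates to truncation, so no rounding carry can propagate.

def hub_value(stored_bits: int, n_bits: int) -> float:
    """Real value of an n-bit HUB fraction in [0, 1)."""
    # The implicit least significant bit contributes half a ULP.
    return stored_bits / 2**n_bits + 2**-(n_bits + 1)

def hub_round(x: float, n_bits: int) -> int:
    """Round-to-nearest of x in [0, 1) to HUB: plain truncation."""
    return int(x * 2**n_bits)

# With 3 stored bits, 0.40 truncates to 0b011, which represents
# 3/8 + 1/16 = 0.4375 -- the HUB point nearest to 0.40.
```

Because every HUB point sits at the midpoint of its truncation interval, truncation and round-to-nearest coincide, which is the carry-free property the abstract refers to.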
https://www.computer.org/csdl/trans/tc/2018/09/08300656-abs.html
Deduplication has been widely used to improve storage efficiency in modern primary and secondary storage systems, yet how deduplication fundamentally affects storage system reliability remains debatable. This paper aims to analyze and compare storage system reliability with and without deduplication in primary workloads, using public file system snapshots from two research groups. We first study the redundancy characteristics of the file system snapshots. We then propose a trace-driven, deduplication-aware simulation framework to analyze data loss at both the chunk and file levels due to sector errors and whole-disk failures. Compared to storage without deduplication, our analysis shows that deduplication consistently reduces the damage of sector errors due to intra-file redundancy elimination, but potentially increases the damage of whole-disk failures if the highly referenced chunks are not carefully placed on disk. To improve reliability, we examine a deliberate copy technique that stores and repairs first the most referenced chunks in a small dedicated physical area (e.g., 1 percent of the physical capacity), and demonstrate its effectiveness through our simulation framework.
08/07/2018 2:07 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2808496

Towards a Cryptographic Minimal Design: The sLiSCP Family of Permutations
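The redundancy metric driving the deduplication analysis above (how many files reference each unique chunk) can be sketched with a toy fixed-size-chunking deduplicator. This is a generic illustration of the mechanism, not the paper's trace-driven simulator.

```python
# Toy chunk-level deduplication with reference counting (a sketch of
# the general mechanism only). Highly referenced chunks are exactly
# those whose loss on a whole-disk failure damages many files at once.

import hashlib
from collections import defaultdict

CHUNK = 4096  # fixed-size chunking for simplicity

def dedup(files):
    store = {}                  # fingerprint -> unique chunk data
    refs = defaultdict(int)     # fingerprint -> reference count
    recipes = {}                # file name -> list of fingerprints
    for name, data in files.items():
        fps = []
        for i in range(0, len(data), CHUNK):
            chunk = data[i:i + CHUNK]
            fp = hashlib.sha256(chunk).hexdigest()
            store.setdefault(fp, chunk)   # keep one physical copy
            refs[fp] += 1
            fps.append(fp)
        recipes[name] = fps
    return store, refs, recipes

files = {"a": b"x" * 8192, "b": b"x" * 4096 + b"y" * 4096}
store, refs, recipes = dedup(files)
# Three logical copies of the b"x"*4096 chunk collapse into one stored
# copy with reference count 3; losing it would corrupt both files.
```

This is why the abstract's deliberate copy technique prioritizes the most referenced chunks: their loss has the widest blast radius.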
https://www.computer.org/csdl/trans/tc/2018/09/08305605-abs.html
The security of highly resource-constrained applications is often viewed in the literature from a single aspect of a specific cryptographic primitive. More precisely, most of the proposed lightweight cryptographic primitives focus on providing a single functionality within the available hardware area dedicated for security purposes. In this paper, we argue that for such applications, a cryptographic primitive that follows the <italic>cryptographic minimal design </italic> strategy may be the only realistically adoptable security solution where there is a constrained GE budget for all security functionalities. Indeed, it is reasonable, if not desirable, for the adopted cryptographic design to have well justified building components and to provide minimal overhead for multiple cryptographic functionalities including encryption, hashing, authentication, and pseudorandom bit generation. Following such a strategy, we propose the <inline-formula><tex-math notation="LaTeX">$\sf{sLiSCP}$</tex-math><alternatives> <inline-graphic xlink:href="altawy-ieq1-2811467.gif"/></alternatives></inline-formula> family of lightweight cryptographic permutations which employs two of the most hardware efficient and extensively cryptanalyzed constructions, namely a 4-subblock Type-2 Generalized Feistel-like Structure (GFS) and round-reduced unkeyed Simeck. 
In addition to the hardware efficiency, we follow restrictive security design goals which enable us to provide resistance against differential and linear cryptanalysis, as well as guaranteed resistance to diffusion-based, algebraic, and self-symmetry distinguishers, and accordingly, we claim that there exist no structural distinguishers for <inline-formula><tex-math notation="LaTeX">$\sf{sLiSCP}$</tex-math><alternatives> <inline-graphic xlink:href="altawy-ieq2-2811467.gif"/></alternatives></inline-formula>-<inline-formula> <tex-math notation="LaTeX">$b$</tex-math><alternatives><inline-graphic xlink:href="altawy-ieq3-2811467.gif"/> </alternatives></inline-formula> with a complexity below <inline-formula><tex-math notation="LaTeX">$2^{b/2}$</tex-math> <alternatives><inline-graphic xlink:href="altawy-ieq4-2811467.gif"/></alternatives></inline-formula> where <inline-formula><tex-math notation="LaTeX">$b$</tex-math><alternatives> <inline-graphic xlink:href="altawy-ieq5-2811467.gif"/></alternatives></inline-formula> is the state size. Moreover, we present the <inline-formula><tex-math notation="LaTeX">$\sf{sLiSCP}$</tex-math><alternatives> <inline-graphic xlink:href="altawy-ieq6-2811467.gif"/></alternatives></inline-formula> duplex sponge mode to illustrate how the permutations can be used in a unified design that provides (authenticated) encryption, hashing, and pseudorandom bit generation functionalities. Finally, we report two efficient parallel hardware implementations for the <inline-formula><tex-math notation="LaTeX">$\sf{sLiSCP}$</tex-math><alternatives> <inline-graphic xlink:href="altawy-ieq7-2811467.gif"/></alternatives></inline-formula> unified duplex sponge mode when using <inline-formula><tex-math notation="LaTeX">$\sf{sLiSCP}$</tex-math><alternatives> <inline-graphic xlink:href="altawy-ieq8-2811467.gif"/></alternatives></inline-formula>-192 (resp. 
<inline-formula> <tex-math notation="LaTeX">$\sf{sLiSCP}$</tex-math><alternatives><inline-graphic xlink:href="altawy-ieq9-2811467.gif"/> </alternatives></inline-formula>-256) in CMOS <inline-formula><tex-math notation="LaTeX">$65$</tex-math><alternatives> <inline-graphic xlink:href="altawy-ieq10-2811467.gif"/></alternatives></inline-formula> nm ASIC with an area of 2289 (resp. 3039) GE and a throughput of 29.62 (resp. 44.44) kbps, and their areas in CMOS <inline-formula> <tex-math notation="LaTeX">$130$</tex-math><alternatives><inline-graphic xlink:href="altawy-ieq11-2811467.gif"/> </alternatives></inline-formula> nm are 2498 (resp. 3319) GE.
08/07/2018 2:08 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2811467

Network Synthesis for Distributed Embedded Systems
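The nonlinear component named in the sLiSCP abstract, round-reduced unkeyed Simeck, follows a simple Feistel step. In the sketch below, the word size and round constant are illustrative placeholders rather than sLiSCP's exact parameters.

```python
# One unkeyed Simeck-style round (a sketch of the nonlinear component;
# word size w and the round constant rc are placeholders, not sLiSCP's
# actual parameters).

def rol(x: int, r: int, w: int = 24) -> int:
    """Rotate the w-bit word x left by r positions."""
    return ((x << r) | (x >> (w - r))) & ((1 << w) - 1)

def simeck_round(x: int, y: int, rc: int, w: int = 24):
    """Feistel step with Simeck's f(x) = (x AND (x <<< 5)) XOR (x <<< 1)."""
    f = (x & rol(x, 5, w)) ^ rol(x, 1, w)
    return (y ^ f ^ rc) & ((1 << w) - 1), x

# The round is invertible: given (a, b) = simeck_round(x, y, rc),
# x is recovered as b and y as a ^ f(b) ^ rc.
```

Invertibility of each round is what makes the whole construction a permutation, which the sponge mode described in the abstract requires.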
https://www.computer.org/csdl/trans/tc/2018/09/08307094-abs.html
The amazing proliferation of communication technologies for embedded systems opens the way for completely new applications, but forces designers to adopt new methodologies to meet time-to-market constraints. Computer-Aided Design (CAD) has traditionally been applied to computers and embedded systems <italic>in isolation</italic>, without considering them as a global interconnected system. This paper contributes to filling this gap by proposing <italic>1)</italic> a communication-aware design flow for network-interconnected embedded systems and <italic>2)</italic> a formal framework to efficiently synthesize their network aspects by formulating and solving an optimization problem. The presented case studies show the potential of the proposed approach to address heterogeneous scenarios, ranging from smart spaces up to the Internet of Things.
08/07/2018 2:07 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2812797

FFT-Based McLaughlin's Montgomery Exponentiation without Conditional Selections
https://www.computer.org/csdl/trans/tc/2018/09/08307235-abs.html
Modular multiplication forms the basis of many cryptographic functions such as RSA, Diffie-Hellman key exchange, and ElGamal encryption. For large RSA moduli, combining the fast Fourier transform (FFT) with McLaughlin's Montgomery modular multiplication (MLM) has been shown to offer cost-effective implementations. However, the conditional selections in McLaughlin's algorithm are considered inefficient and vulnerable to timing attacks, since extra long additions or subtractions may take place and the running time of MLM varies. In this work, we restrict the parameters of MLM by a set of new bounds and present a modified MLM algorithm involving no conditional selection. Compared to the original MLM algorithm, we eliminate the extra operations caused by the conditional selections and achieve constant running time for modular multiplications with different inputs. As a result, we improve both area-time efficiency and security against timing attacks. Based on the proposed algorithm, efficient FFT-based modular multiplication and exponentiation are derived. Exponentiation architectures with dual FFT-based multipliers are designed, obtaining area-latency efficient solutions. The results show that our work offers better efficiency than state-of-the-art works for operand sizes of 2048 bits and above. For single FFT-based modular multiplication, we have achieved constant running time and obtained area-latency efficiency improvements of up to 24.3 percent for 1,024-bit and 35.5 percent for 4,096-bit operands, respectively.
08/07/2018 2:08 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2811466

A Hybrid Multicast Routing Approach with Enhanced Methods for Mesh-Based Networks-on-Chip
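The constant-running-time claim in the MLM abstract above hinges on removing data-dependent conditional selections; the paper's actual remedy is a new set of parameter bounds. As a generic illustration of why a conditional leaks timing and how a selection can be made branchless, consider the classic final Montgomery correction t mod m for t < 2m (Python integers are not truly constant-time, so this shows the algorithmic idea only).

```python
# Branchless conditional subtraction: compute t mod m for 0 <= t < 2m
# without an "if t >= m" branch, so the same operations execute for
# every input. (Generic constant-time idiom; the paper instead bounds
# MLM's parameters so that no selection is needed at all.)

def cond_sub(t: int, m: int, bits: int) -> int:
    d = t - m
    mask = d >> bits        # arithmetic shift: -1 (all ones) iff d < 0
    return d + (m & mask)   # add m back exactly when the subtraction borrowed

# cond_sub(17, 13, 64) == 4 and cond_sub(5, 13, 64) == 5, with the
# identical operation sequence in both cases.
```

The branchy version (`t - m if t >= m else t`) takes different paths for different inputs, which is the timing channel the abstract refers to.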
https://www.computer.org/csdl/trans/tc/2018/09/08309347-abs.html
Multicast communication can greatly enhance the performance of Networks-on-Chip. Currently, most multicast routing algorithms are either tree-based or path-based. The former has low latency but needs additional hardware resources to resolve multicast deadlocks. The latter can avoid deadlocks easily but may require long routing paths. In this paper, we propose a hybrid multicast routing approach that combines the advantages of both path- and tree-based methods. The proposed approach ensures deadlock-free multicast routing without requiring additional virtual channels or large buffers to hold large packets. High routing performance is achieved using an adaptive routing strategy that considers the traffic load in nearby routers. Two techniques, namely node balancing and path balancing, are further developed to enhance this hybrid routing algorithm. Extensive experiments with different buffer sizes, packet sizes, and numbers of destinations per packet under random and Rent's-rule traffic at various traffic injection rates have been conducted. The results show that the average latency of our approach is lower than that of previous multicast routing algorithms in most cases, and the saturation points of our approach always occur at much higher injection rates.
08/07/2018 2:07 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2813394

WASP: Selective Data Prefetching with Monitoring Runtime Warp Progress on GPUs
https://www.computer.org/csdl/trans/tc/2018/09/08309426-abs.html
This paper proposes a new data prefetching technique for Graphics Processing Units (GPUs) called Warp Aware Selective Prefetching (WASP). The main idea of WASP is to dynamically select warps whose progress is slower than that of the current warp as prefetching target warps. Under the in-order instruction execution model of GPUs, these prefetching target warps will certainly execute the same load as the current warp. Exploiting this property, WASP prefetches data on behalf of the target warps, which allows the prefetched data to be accessed accurately. To track warp progress simply, WASP monitors the count of dynamic load executions for each warp. When a warp executes a load, WASP searches for warps with lower load execution counts than the current warp and generates prefetch requests for them. In our evaluation, WASP achieves a 16.8 percent speedup compared to the baseline GPU.
08/07/2018 2:08 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2813379

Queuing-Based eDRAM Refreshing for Ultra-Low Power Processors
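WASP's selection rule described above reduces to comparing per-warp load counters. The toy sketch below uses a hypothetical counter table as a stand-in for the hardware structure.

```python
# Toy sketch of WASP's target selection (the per-warp counter table is
# a hypothetical stand-in for the hardware structure). Warps with a
# lower dynamic load-execution count lag behind the current warp and,
# under in-order execution, will reach the same load later -- so data
# prefetched for them is guaranteed to be used.

def select_targets(load_counts: dict, current: int) -> list:
    """Return the warps lagging behind the current warp."""
    cur = load_counts[current]
    return [w for w, c in load_counts.items() if w != current and c < cur]

counts = {0: 12, 1: 9, 2: 12, 3: 7}
# When warp 0 issues a load, warps 1 and 3 become prefetch targets;
# warp 2 has the same count and is skipped.
```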
https://www.computer.org/csdl/trans/tc/2018/09/08310027-abs.html
Ultra-low power processors designed to work at very low voltage are the enablers of the Internet of Things (IoT) era. Their internal memories, which are usually implemented with static random access memory (SRAM) technology, stop functioning properly at low voltage. Some recent commercial products have replaced SRAM with embedded DRAM (eDRAM), in which stored data decay over time, thus requiring periodic refreshing that causes performance loss. This article presents a queuing-based opportunistic refreshing algorithm that eliminates most if not all of the performance loss and is shown to be optimal. The queues used for refreshing miss refreshing opportunities not only when they are saturated but also when they are empty, increasing the probability of performance loss. We examine the optimal policy for handling a saturated or empty queue, and the ways in which system performance depends on queue capacity and memory size. This analysis results in a closed-form performance expression capturing read/write probabilities, memory size, and queue capacity, enabling optimization of the CPU-internal memory architecture.
08/07/2018 2:07 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2811470

Aging-Aware Boosting
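The refresh-queue dynamics described in the eDRAM abstract above (opportunities missed when the queue is empty, stalls when it saturates) can be mimicked with a toy Monte Carlo model. All parameters below are hypothetical, and the paper's own analysis is closed-form rather than simulation-based.

```python
# Toy Monte Carlo model of queuing-based opportunistic refreshing
# (all parameters hypothetical). Rows due for refresh enter a queue;
# an idle memory cycle refreshes one queued row for free. The queue
# hurts when saturated (a refresh must stall the CPU) and wastes
# opportunities when empty.

import random

def simulate(cycles: int, access_prob: float, due_prob: float,
             capacity: int, seed: int = 1):
    rng = random.Random(seed)
    queue = stalls = idle_wasted = 0
    for _ in range(cycles):
        if rng.random() < due_prob:           # a row becomes due
            if queue == capacity:
                stalls += 1                   # saturated: forced refresh stalls the CPU
            else:
                queue += 1
        if rng.random() >= access_prob:       # idle cycle: free refresh slot
            if queue > 0:
                queue -= 1                    # opportunistic refresh
            else:
                idle_wasted += 1              # empty queue: opportunity wasted
    return stalls, idle_wasted
```

With the idle rate above the due rate and a modest capacity, stalls become vanishingly rare, mirroring the capacity/performance dependence the closed-form analysis captures.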
https://www.computer.org/csdl/trans/tc/2018/09/08319494-abs.html
DVFS-based boosting techniques have been widely employed by commercial multi-core processors due to their effectiveness in improving performance. <italic>Boosting</italic>, however, places particular stress on circuits and hence contributes significantly to an accelerated aging process. Circuit aging has become a real reliability concern because it leads to an increase in transistor threshold voltage that may cause timing errors as a result of higher delays in critical paths. Thus, high performance is desirable, but it shortens the circuit lifetime through aging, forcing a trade-off. Besides well-known long-term aging effects, recent research has also reported short-term aging effects. Our claim is that DVFS-based boosting techniques should consider both long- and short-term aging effects. This could be circumvented by wider timing guardbands, but at a higher cost. The goal of this work is therefore to analyze and optimize <italic>boosting</italic> under specific consideration of long-term and short-term aging effects. As a result of our findings, we propose the first comprehensive aging-aware, yet efficient boosting technique. The aging-aware cell libraries employed in this work are publicly available at <uri> http://ces.itec.kit.edu/dependable-hardware.php</uri>.
08/07/2018 2:07 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2816014

A Stochastic Computational Multi-Layer Perceptron with Backward Propagation
https://www.computer.org/csdl/trans/tc/2018/09/08319953-abs.html
Stochastic computation has recently been proposed for implementing artificial neural networks with reduced hardware and power consumption, but at the cost of decreased accuracy and processing speed. Most existing implementations are based on pre-training, such that the weights are predetermined for neurons at different layers; thus, these implementations lack the ability to update the values of the network parameters. In this paper, a stochastic computational multi-layer perceptron (SC-MLP) is proposed by implementing the backward propagation algorithm for updating the layer weights. Using extended stochastic logic (ESL), a reconfigurable stochastic computational activation unit (SCAU) is designed to implement different types of activation functions such as the <inline-formula><tex-math notation="LaTeX">$tanh$ </tex-math><alternatives><inline-graphic xlink:href="han-ieq1-2817237.gif"/></alternatives></inline-formula> and the rectifier function. A triple modular redundancy (TMR) technique is employed for reducing the random fluctuations in stochastic computation. A probability estimator (PE) and a divider based on the TMR and a binary search algorithm are further proposed with progressive precision for reducing the required stochastic sequence length. Therefore, the latency and energy consumption of the SC-MLP are significantly reduced. The simulation results show that the proposed design is capable of implementing both the training and inference processes. For the classification of nonlinearly separable patterns, with a slight accuracy loss of 1.32-1.34 percent, the proposed design requires only 28.5-30.1 percent of the area and 18.9-23.9 percent of the energy consumption incurred by a design using floating-point arithmetic. Compared to a fixed-point implementation, the SC-MLP consumes a smaller area (40.7-45.5 percent) and less energy (38.0-51.0 percent) with a similar processing speed and a slight accuracy drop of 0.15-0.33 percent.
The area and energy consumption of the proposed design are 80.7-87.1 percent and 71.9-93.1 percent, respectively, of those of a binarized neural network (BNN) with similar accuracy.
08/07/2018 2:08 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2817237

Cloudlets Activation Scheme for Scalable Mobile Edge Computing with Transmission Power Control and Virtual Machine Migration
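The stochastic-computation substrate the SC-MLP above builds on can be shown with the textbook bipolar encoding, where multiplication reduces to a single XNOR gate per bit pair. This is the generic construction; the paper's SCAU, TMR, and PE circuits are beyond this sketch.

```python
# Bipolar stochastic computing, the textbook substrate behind designs
# like the SC-MLP (generic construction, not the paper's circuits).
# A value v in [-1, 1] becomes a bitstream with P(bit=1) = (v + 1) / 2;
# multiplying two independent streams is a bitwise XNOR, and accuracy
# improves as the stream length grows.

import random

def encode(v: float, n: int, rng: random.Random) -> list:
    p = (v + 1) / 2
    return [1 if rng.random() < p else 0 for _ in range(n)]

def decode(bits: list) -> float:
    return 2 * sum(bits) / len(bits) - 1

def sc_mul(a: list, b: list) -> list:
    return [1 - (x ^ y) for x, y in zip(a, b)]  # one XNOR gate per bit

rng = random.Random(42)
n = 1 << 16
x, y = 0.5, -0.6
prod = decode(sc_mul(encode(x, n, rng), encode(y, n, rng)))
# prod approximates x * y = -0.3; the random fluctuation around the
# true product is what techniques like TMR are meant to suppress.
```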
https://www.computer.org/csdl/trans/tc/2018/09/08322166-abs.html
Mobile devices have several restrictions due to design choices that guarantee their mobility. One way to overcome these limitations is to utilize cloud servers called cloudlets on the edge of the network through Mobile Edge Computing. However, as the number of clients and devices grows, the service must also increase its scalability in order to guarantee a latency limit and quality threshold. This can be achieved by deploying and activating more cloudlets, but this solution is expensive due to the cost of the physical servers. A better choice is to optimize the resources of the cloudlets through an intelligent choice of configuration that lowers delay and raises scalability. Thus, in this paper we propose an algorithm that utilizes Virtual Machine Migration and Transmission Power Control, together with a mathematical model of delay in Mobile Edge Computing and a heuristic algorithm called Particle Swarm Optimization, to balance the workload between cloudlets and consequently maximize cost-effectiveness. Our proposal is the first to simultaneously consider communication, computation, and migration at the assumed scale and, as a result, outperforms other conventional methods in terms of the number of serviced users.
08/07/2018 2:07 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2818144

An Erase Efficiency Boosting Strategy for 3D Charge Trap NAND Flash
https://www.computer.org/csdl/trans/tc/2018/09/08322264-abs.html
Owing to the fast-growing demand for larger and faster NAND flash devices, new manufacturing techniques have accelerated the down-scaling process of NAND flash memory. Among these new techniques, 3D charge trap flash is considered one of the most promising candidates for next-generation NAND flash devices. However, the long erase latency of 3D charge trap flash has become a critical issue. This issue is exacerbated because the distinct transient voltage shift phenomenon worsens as the number of program/erase (P/E) cycles increases. In contrast to existing works that aim to tackle the erase latency issue by reducing the number of block erases, we tackle it by utilizing the “multi-block erase” feature. In this work, an erase efficiency boosting strategy is proposed to boost the garbage collection efficiency of 3D charge trap flash by enabling multi-block erase inside flash chips. A series of experiments was conducted to demonstrate the capability of the proposed strategy to improve the erase efficiency and access performance of 3D charge trap flash. The results show that the erase latency of 3D charge trap flash memory is improved by 75.76 percent on average, even when the P/E cycle count reaches <inline-formula> <tex-math notation="LaTeX">$10^{4}$</tex-math><alternatives><inline-graphic xlink:href="chang-ieq1-2818118.gif"/> </alternatives></inline-formula>.
08/07/2018 2:08 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2818118

Advanced Compressor Tree Synthesis for FPGAs
https://www.computer.org/csdl/trans/tc/2018/08/08263391-abs.html
This work presents novel methods for the optimization of compressor trees for FPGAs, as required in many arithmetic computations. As demonstrated in recent work, key elements for the design of efficient but fast compressor trees are target-optimized 4:2 compressors as well as generalized parallel counters (GPCs). However, the optimization of a compressor tree for minimal resources using both compressors and GPCs has not been addressed so far. As this combined optimization is a non-trivial task, three methods are proposed to find the best solutions for a given problem size: 1) a heuristic that obtains compressor trees with typically fewer resources and fewer stages than state-of-the-art heuristics, 2) an integer linear programming (ILP)-based methodology that finds optimal compressor trees using the fewest stages possible, and 3) a combined approach that partially solves the problem heuristically to reduce the search space for the ILP-based method. In all methods, the cost of pipeline registers can be included. Synthesis experiments show that the proposed methods provide pipelined compressor trees with about 40 percent fewer LUTs compared to trees of 2-input adders, at the cost of being about 12-20 percent slower.
07/10/2018 1:00 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2795611

Exploring the Design Space of Fair Scheduling Supports for Asymmetric Multicore Systems
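The building blocks of the compressor trees above are bit counters: a generalized parallel counter (GPC) sums input bits of given weights into a small binary output. Below is a minimal sketch of that arithmetic contract only; the FPGA mapping, the heuristic, and the ILP formulation are out of scope.

```python
# Arithmetic contract of the compressor-tree building blocks
# (illustration only; FPGA mapping and the paper's optimization
# methods are not modeled). A (6;3) GPC counts six equal-weight
# input bits into a 3-bit binary number.

def gpc_6_3(bits):
    assert len(bits) == 6 and all(b in (0, 1) for b in bits)
    s = sum(bits)                                # 0..6 fits in 3 bits
    return (s & 1, (s >> 1) & 1, (s >> 2) & 1)   # output weights 1, 2, 4

# The smallest GPC is the full adder, a (3;2) counter:
def fa(a, b, c):
    s = a ^ b ^ c                                # sum bit (weight 1)
    carry = (a & b) | (a & c) | (b & c)          # carry bit (weight 2)
    return s, carry
```

A compressor tree applies such counters stage by stage until each bit column holds at most two bits, which a final 2-input adder then resolves.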
https://www.computer.org/csdl/trans/tc/2018/08/08265024-abs.html
Although traditional CPU scheduling efficiently utilizes multiple cores with equal computing capacity, the advent of multicores with diverse capabilities poses challenges to CPU scheduling. For such asymmetric multi-core systems, scheduling is essential to exploit the efficiency of core asymmetry by matching each application with the best core type. However, in addition to efficiency, an important aspect of CPU scheduling is fairness in CPU provisioning. Uneven core capability is inherently unfair to threads and causes performance variance, as applications running on fast cores receive more capability than applications on slow cores. Depending on co-running applications and scheduling decisions, the performance of an application may vary significantly. This study investigates the fairness problem in asymmetric multi-cores and explores the design space of OS schedulers supporting multiple fairness constraints. In this paper, we consider two fairness-oriented constraints: <italic>minimum fairness</italic> for the minimum guaranteed performance and <italic>uniformity</italic> for performance variation reduction. This study proposes four scheduling policies which guarantee a minimum performance bound while improving overall throughput and reducing performance variation. The proposed fairness-oriented schedulers are implemented for the Linux kernel with an online application monitoring technique. Using an emulated asymmetric multi-core with frequency scaling and a real asymmetric multi-core with the big.LITTLE architecture, the paper shows that the proposed schedulers can effectively support the specified fairness while improving overall system throughput.
07/10/2018 1:00 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2796077

Checkpointing Workflows for Fail-Stop Errors
https://www.computer.org/csdl/trans/tc/2018/08/08279499-abs.html
We consider the problem of orchestrating the execution of workflow applications structured as Directed Acyclic Graphs (DAGs) on parallel computing platforms that are subject to fail-stop failures. The objective is to minimize expected overall execution time, or makespan. A solution to this problem consists of a schedule of the workflow tasks on the available processors and of a decision of which application data to checkpoint to stable storage, so as to mitigate the impact of processor failures. To address this challenge, we consider a restricted class of graphs, Minimal Series-Parallel Graphs (<sc>M-SPGs</sc>), which is relevant to many real-world workflow applications. For this class of graphs, we propose a recursive list-scheduling algorithm that exploits the <sc>M-SPG</sc> structure to assign sub-graphs to individual processors, and uses dynamic programming to decide how to checkpoint these sub-graphs. We assess the performance of our algorithm for production workflow configurations, comparing it to an approach in which all application data is checkpointed and an approach in which no application data is checkpointed. Results demonstrate that our algorithm outperforms both the former approach, because of lower checkpointing overhead, and the latter approach, because of better resilience to failures.
07/10/2018 1:00 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2801300

Mitigating Observability Loss of Toggle-Based <italic>X</italic>-Masking via Scan Chain Partitioning
https://www.computer.org/csdl/trans/tc/2018/08/08280565-abs.html
Because the Toggle-based <italic>X</italic>-masking method permits only a single toggle at a given cycle, there is a chance that non-<italic>X</italic> values are also masked. This non-<italic>X</italic> value over-masking problem may cause fault coverage degradation. In this paper, a scan chain partitioning scheme is described to alleviate the non-<italic>X</italic> bit over-masking problem arising from the Toggle-based <italic>X</italic>-masking method. The scan chain partitioning method finds a scan chain combination that gives the fewest toggling conflicts. The experimental results show that the number of over-masked bits is significantly reduced, and it is reduced further when the proposed method is combined with the <italic>X</italic>-canceling method. However, as the number of scan chain partitions increases, the control data for the decoder increases. To reduce the control data overhead, this paper exploits Huffman-coding-based data compression. Assuming two partitions, the size of the control bits is even smaller than that of the conventional <italic>X</italic>-toggling method that uses only one decoder. In addition, selection rules for the <italic>X</italic>-bits delivered to the <italic>X</italic>-Canceling MISR are also proposed. With these selection rules, a significant test time increase can be prevented.
07/10/2018 1:00 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2801847

LEAD: An Adaptive 3D-NoC Routing Algorithm with Queuing-Theory Based Analytical Verification
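The control-data compression step in the scan-chain paper above relies on standard Huffman coding: frequent control symbols receive short codewords. Below is the generic construction; the partition-control symbol frequencies are invented for the example.

```python
# Standard Huffman code construction (generic algorithm; the control
# symbols and frequencies below are made up for illustration). Merging
# the two least-frequent subtrees repeatedly yields a prefix-free code
# in which skewed symbol distributions compress well.

import heapq

def huffman_codes(freqs: dict) -> dict:
    # Heap entries: (frequency, tiebreaker, {symbol: partial code}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f0, _, c0 = heapq.heappop(heap)
        f1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (f0 + f1, tie, merged))
        tie += 1
    return heap[0][2]

# Skewed control data compresses well: the frequent symbol "00" gets a
# 1-bit codeword, the rare symbol "11" a 3-bit one (average 1.6 bits
# versus 2 bits fixed-length).
codes = huffman_codes({"00": 60, "01": 20, "10": 15, "11": 5})
assert len(codes["00"]) == 1 and len(codes["11"]) == 3
```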
https://www.computer.org/csdl/trans/tc/2018/08/08283712-abs.html
2D-NoCs have been the mainstream approach used to interconnect multi-core systems. 3D-NoCs have emerged to compensate for deficiencies of 2D-NoCs such as long latency and power overhead. A low-latency routing algorithm for 3D-NoCs is designed to accommodate high-speed communication between cores. Both simulation and analytical models are applied to estimate the communication latency of NoCs. Generally, simulations are time-consuming and slow down the design process. Analytical models provide nearly exact results in a fraction of the time, which can be used alongside simulation to fine-tune the design. In this paper, a high-performance adaptive routing algorithm is proposed for partially connected 3D-NoCs. The latency of the routing algorithm under different traffic patterns, different numbers of elevators, and different elevator assignment mechanisms is reported. An analytical model, tailored to the adaptivity of the algorithm under low-traffic scenarios, has been developed, and its results have been verified by simulation. According to the results, simulation and analytical results are consistent within a 10 percent margin.
07/10/2018 1:00 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2801298

READ: Reliability Enhancement in 3D-Memory Exploiting Asymmetric SER Distribution
https://www.computer.org/csdl/trans/tc/2018/08/08283794-abs.html
3D-memory is one of the most promising applications of 3D-IC technology. With 3D integration technology, the effective density of memories can increase and the interconnect distance from processor to memory can be shortened. Due to the stacked structure, the upper dies behave as shields blocking outer particles from reaching lower dies, making the error rate of the top layer the largest among all layers. From a heat perspective, the lower dies suffer from reliability problems since they are placed on top of the logic die; heat dissipation influences lower dies more than upper dies. This creates an unequal reliability distribution across the layers of a 3D-memory. A novel ECC organization scheme for 3D-memory that secures reliable operations under such soft error rate (SER) profiles is introduced in this paper. The proposed scheme does not require additional redundant arrays. Instead, it utilizes unused spare columns of relatively reliable layers to store additional check-bits for less reliable layers. It forms a heterogeneous ECC organization across different layers which enhances ECC capabilities in less reliable layers. In addition, a redundancy sharing scheme for yield enhancement can be implemented with the proposed scheme. Experimental results show that a memory with the proposed method can tolerate a bit-error rate more than three times that of a conventional memory.
07/10/2018 1:01 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2801856

Energy Optimal Task Scheduling with Normally-Off Local Memory and Sleep-Aware Shared Memory with Access Conflict
https://www.computer.org/csdl/trans/tc/2018/08/08290729-abs.html
The rapid development of Real-Time and Embedded Systems (RTES) has increased the demands on the processing capabilities of sensors, mobile devices, and smart devices. Meanwhile, energy efficiency techniques are urgently needed, as most devices in RTES are battery powered. Following these trends, this work explores memory system energy efficiency for a general multi-core architecture. This architecture integrates a local memory in each processing core, with a large off-chip memory shared among multiple cores. Decisions need to be made on whether tasks will be executed with the shared memory or the local memory to minimize the total energy consumption within real-time constraints. This paper proposes optimal schemes as well as a polynomial-time approximation algorithm with a constant ratio. A problem complexity analysis for different task and system models is also presented. Experimental results show that the proposed approximation scheme performs close to the optimal solution on average.
07/10/2018 1:00 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2805337

adBoost: Thermal Aware Performance Boosting Through Dark Silicon Patterning
https://www.computer.org/csdl/trans/tc/2018/08/08292829-abs.html
The increasing power density of many-core systems leaves a fraction of on-chip resources inactive, referred to as dark silicon. Efficient management of critical interlinked parameters - power, performance and temperature - can improve resource utilization and mitigate dark silicon. In this paper, we present a run-time resource management system for thermal-aware performance boosting using a dark-silicon-aware run-time application mapping strategy. The mapping policy patterns inactive cores among active cores for a lower and more even distribution of operating temperatures. This provides enough thermal headroom for boosting the frequency of active cores upon performance surges and allows sustained boosting periods, improving performance further. We design a controller for thermal-aware performance boosting that decides on efficient allocation and utilization of the power budget and the thermal headroom obtained from patterning. Our strategy yields up to 37 percent better throughput, 29 percent lower waiting time and up to 2 <inline-formula><tex-math notation="LaTeX">$\times$</tex-math><alternatives> <inline-graphic xlink:href="kanduri-ieq1-2805683.gif"/></alternatives></inline-formula> longer boosting periods, in comparison with other state-of-the-art run-time mapping policies.07/10/2018 1:00 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2805683ARMOR: A Recompilation and Instrumentation-Free Monitoring Architecture for Detecting Memory Exploits
https://www.computer.org/csdl/trans/tc/2018/08/08295231-abs.html
Software written in programming languages that permit manual memory management, such as C and C++, is often littered with exploitable memory errors. These memory bugs enable attackers to leak sensitive information, hijack program control flow, or otherwise compromise the system, and are a critical concern for computer security. Many runtime monitoring and protection approaches have been proposed to detect memory errors in C and C++ applications; however, they require source code recompilation or binary instrumentation, creating compatibility challenges for applications using proprietary or closed-source code, libraries, or plug-ins. This paper introduces a new approach for detecting heap memory errors that does not require applications to be recompiled or instrumented. We show how to leverage the calling convention of a processor to track all dynamic memory allocations made by an application during runtime. We also present a transparent tracking and caching architecture to efficiently verify program heap memory accesses. Performance simulations of our architecture using SPEC benchmarks and real-world application workloads show our architecture achieves hit rates over 95 percent for a 256-entry cache, resulting in only 2.9 percent runtime overhead. Security analysis using a software prototype shows our architecture detects 98 percent of heap memory errors from selected test cases in the Juliet Test Suite and real-world exploits.07/10/2018 1:00 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2807818Towards Formal Evaluation and Verification of Probabilistic Design
https://www.computer.org/csdl/trans/tc/2018/08/08302961-abs.html
In the nanometer regime of integrated circuit fabrication, device variability imposes serious challenges to the design and manufacturing of reliable systems. A new computation paradigm of approximate and probabilistic design has been proposed recently to accept design imperfection as a resource for certain applications. Despite recent intensive study of approximate design, probabilistic design has received relatively little attention. This paper provides a general formulation for the evaluation and verification of probabilistic design. We establish their connection to stochastic Boolean satisfiability (SSAT), (weighted) model counting, and probabilistic model checking. Moreover, a novel SSAT solver based on binary decision diagrams (BDDs) is proposed, and a comparative experimental study is performed to contrast the strengths and weaknesses of different solutions. The proposed BDD-based SSAT solver obtains the best scalability among all techniques in our experiments. We also compare the BDD-based SSAT solver to a prior method based on Bayesian network modeling. Experimental results show that our method outperforms the prior method by orders of magnitude in both runtime and memory usage. Our work can be an essential step towards automated synthesis of probabilistic design.07/10/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2807431Metastability-Containing Circuits
https://www.computer.org/csdl/trans/tc/2018/08/08314764-abs.html
In digital circuits, <italic>metastability</italic> can cause deteriorated signals that are neither logical 0 nor logical 1, breaking the abstraction of Boolean logic. Synchronizers, the only traditional countermeasure, exponentially decrease the odds of maintained metastability over time. We propose a fundamentally different approach: It is possible to deterministically <italic>contain</italic> metastability by fine-grained logical masking so that it cannot infect the entire circuit. At the heart of our approach lies a time- and value-discrete model for metastability in synchronous clocked digital circuits, in which metastability is propagated in a worst-case fashion. The proposed model permits positive results and passes the test of reproducing Marino's impossibility results. We fully classify which functions can be computed by circuits with standard registers. Regarding masking registers, we show that more functions become computable with each clock cycle, and that masking registers permit exponentially smaller circuits for some tasks. Demonstrating the applicability of our approach, we present the first fault-tolerant distributed clock synchronization algorithm that deterministically guarantees correct behavior in the presence of metastability. As a consequence, clock domains can be synchronized without using synchronizers, enabling metastability-free communication between them.07/10/2018 1:00 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2808185Memory and Communication Profiling for Accelerator-Based Platforms
https://www.computer.org/csdl/trans/tc/2018/07/08234629-abs.html
The growing demand for processing power is being satisfied mainly by an increase in the number of homogeneous and heterogeneous computing cores in a system. Efficient utilization of these architectures demands analysis of the memory-access behaviour of applications and data-communication-aware mapping of applications onto these architectures. Appropriate tools are required to highlight memory-access patterns and provide detailed intra-application data-communication information to assist developers in porting existing sequential applications efficiently to these architectures. In this work, we present the design of an open-source tool which provides such a detailed profile for C/C++ applications. In contrast to prior work, our tool not only reports detailed information, but also generates this information with manageable overheads for realistic workloads. Comparison with the state-of-the-art shows that the proposed profiler has, on average, an order of magnitude less overhead than state-of-the-art data-communication profilers for a wide range of benchmarks. The experimental results show that our proposed tool generated profiling information for image processing applications which assisted in achieving speed-ups of <inline-formula><tex-math notation="LaTeX">$6.14\times$</tex-math><alternatives> <inline-graphic xlink:href="ashraf-ieq1-2785225.gif"/></alternatives></inline-formula> and <inline-formula> <tex-math notation="LaTeX">$2.75\times$</tex-math><alternatives><inline-graphic xlink:href="ashraf-ieq2-2785225.gif"/> </alternatives></inline-formula> for heterogeneous multi-core platforms containing an FPGA and a GPU as accelerators, respectively.06/07/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2785225A Compositional Approach for Verifying Protocols Running on On-Chip Networks
https://www.computer.org/csdl/trans/tc/2018/07/08239635-abs.html
In modern many-core architectures, advanced on-chip networks provide the means of communication for the cores. This greatly complicates the design and verification of the cache coherence protocols deployed by those cores. A common approach to deal with this complexity is to decompose the whole system into the protocol and the network. This decomposition is, however, not always possible. For example, unexpected deadlocks can emerge when a deadlock-free protocol and a deadlock-free network are combined. This paper proposes a compositional methodology: prove properties over a network, prove properties over a protocol, and infer properties over the system as a whole. Our methodology is based on theorems that show that such decomposition is possible by having sufficiently large local buffers at the cores. We apply this methodology to verify several protocols such as MI, MSI, MESI and MEUSI running on top of advanced interconnects with adaptive routing.06/07/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2786723Scheduling Weakly Consistent C Concurrency for Reconfigurable Hardware
https://www.computer.org/csdl/trans/tc/2018/07/08241825-abs.html
Lock-free algorithms, in which threads synchronise not via coarse-grained mutual exclusion but via fine-grained atomic operations (‘atomics’), have been shown empirically to be the fastest class of multi-threaded algorithms in the realm of conventional processors. This article explores how these algorithms can be compiled from C to reconfigurable hardware via <italic>high-level synthesis</italic> (HLS). We focus on the scheduling problem, in which software instructions are assigned to hardware clock cycles. We first show that typical HLS scheduling constraints are insufficient to implement atomics, because they permit some instruction reorderings that, though sound in a single-threaded context, demonstrably cause erroneous results when synthesising multi-threaded programs. We then show that correct behaviour can be restored by imposing additional <italic>intra-thread</italic> constraints among the memory operations. In addition, we show that we can support the pipelining of loops containing atomics by injecting further inter-iteration constraints. We implement our approach on two constraint-based scheduling HLS tools: LegUp 4.0 and LegUp 5.1. We extend both tools to support two memory models that are capable of synthesising atomics correctly. The first memory model only supports <italic>sequentially consistent</italic> (SC) atomics and the second supports <italic>weakly consistent</italic> (‘weak’) atomics as defined by the 2011 revision of the C standard. Weak atomics necessitate fewer constraints than SC atomics, but suffice for many multi-threaded algorithms. We confirm, via automatic model-checking, that we correctly implement the semantics in accordance with the C standard. 
A case study on a circular buffer suggests that on average circuits synthesised from programs that schedule atomics correctly can be 6x faster than an existing lock-based implementation of atomics, that weak atomics can yield a further 1.3x speedup, and that pipelining can yield a further 1.3x speedup.06/07/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2786249Scheduling Analysis of Imprecise Mixed-Criticality Real-Time Tasks
https://www.computer.org/csdl/trans/tc/2018/07/08247214-abs.html
In this paper, we study the scheduling problem of the <italic>imprecise mixed-criticality model</italic> (IMC) under <italic>earliest deadline first with virtual deadline</italic> (EDF-VD) scheduling upon uniprocessor systems. Two schedulability tests are presented. The first is a concise utilization-based test which can be applied to implicit-deadline IMC task sets. The suboptimality of the proposed utilization-based test is evaluated via a widely-used scheduling metric, <italic>speedup factors</italic>. The second is a more effective test with higher complexity, based on the concept of the demand bound function (DBF). The proposed DBF-based test is more generic and can be applied to constrained-deadline IMC task sets. Moreover, in order to address the high time cost of the existing deadline tuning algorithm, we propose a novel algorithm which significantly improves the efficiency of the deadline tuning procedure. Experimental results show the effectiveness of our proposed schedulability tests, confirm the theoretical suboptimality results with respect to speedup factors, and demonstrate the efficiency of our proposed algorithm over the existing deadline tuning algorithm. In addition, issues related to the implementation of the IMC model under EDF-VD are discussed.06/07/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2789879Aging-Aware Workload Management on Embedded GPU Under Process Variation
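The demand bound function underpinning the abstract's second test can be sketched for a standard (non-mixed-criticality) sporadic task set; the task parameters and names below are illustrative and not taken from the paper:

```python
from math import floor

def dbf(C, D, T, t):
    """Demand bound function of a sporadic task (WCET C, relative
    deadline D, period T): worst-case execution demand that must
    complete within any interval of length t."""
    if t < D:
        return 0
    return (floor((t - D) / T) + 1) * C

# Classic uniprocessor EDF test: the task set is schedulable if the
# total demand never exceeds the interval length at any point.
tasks = [(1, 3, 4), (2, 5, 6)]            # (C, D, T), illustrative
horizon = 24                               # check up to the hyperperiod
ok = all(sum(dbf(C, D, T, t) for C, D, T in tasks) <= t
         for t in range(1, horizon + 1))
assert ok
```

The IMC-specific test in the paper extends this idea with criticality modes and virtual deadlines; the sketch only shows the base DBF concept.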
https://www.computer.org/csdl/trans/tc/2018/07/08247279-abs.html
Graphics Processing Units (GPUs) have been employed in embedded systems to handle increased amounts of computation and to satisfy timing requirements. Due to small feature sizes, chip aging and within-die parameter variations are considered to be among the challenging problems for state-of-the-art processors, including GPUs. In order to deal with process variation, several processors use chip-level guardbanding, which runs every core at the lowest safe operating frequency and thus incurs a significant chip-level performance drop. Other processors improve their performance efficiency through core-level guardbanding, which may use a different operating frequency for each core. Existing aging management techniques are based on chip-level guardbanding and assign the same number of instructions to cores that have the same aging status. However, in the presence of process variation, existing aging management techniques are limited in minimizing the aging effect because each core experiences a different amount of stress for the same number of instructions. In order to tackle this problem, we propose a low-overhead aging- and process-variation-aware workload management technique for embedded GPUs. The proposed technique considers the process variation and the current aging status together, and assigns a different number of instructions to clusters to minimize the aging effect in the presence of process variation. Results show that our technique improves GPU aging in over 95 percent of cases, whereas the state-of-the-art compiler-based technique improves GPU aging in 72.25 percent of cases. Moreover, compared to the compiler-based technique, our technique reduces the performance overhead by 40 percent while achieving almost the same GPU aging improvement.06/07/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2789904PerfBound: Conserving Energy with Bounded Overheads in On/Off-Based HPC Interconnects
https://www.computer.org/csdl/trans/tc/2018/07/08248748-abs.html
Energy and power are key challenges in high-performance computing. System energy efficiency must be significantly improved, and this requires greater efficiency in all subcomponents. An important target of optimization is the interconnect, since network links are always on, consuming power even during idle periods. A large number of HPC machines have a primary interconnect based on Ethernet (about 40 percent of TOP500 machines), which, since 2010, has included support for saving power via Energy Efficient Ethernet (EEE). Nevertheless, it is unlikely that HPC interconnects would use these energy-saving modes unless the performance overhead is known and small. This paper presents PerfBound, a self-contained technique to manage on/off-based networks such as EEE, minimizing interconnect link energy consumption subject to a bound on the performance degradation. PerfBound does not require changes to the applications; it uses only local information already available at switches and NICs without introducing additional communication messages, and is also compatible with multi-hop networks. PerfBound is evaluated using traces from a production supercomputer. For twelve out of fourteen applications, PerfBound achieves high energy savings, up to 70 percent for only 1 percent performance degradation. This paper also presents DynamicFastwake, which extends PerfBound to exploit multiple low-power states. DynamicFastwake achieves an energy–delay product 10 percent lower than the original PerfBound technique.06/07/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2790394A Scheme to Design Concurrent Error Detection Techniques for the Fast Fourier Transform Implemented in SRAM-Based FPGAs
https://www.computer.org/csdl/trans/tc/2018/07/08258993-abs.html
Soft errors are an important issue for SRAM-based Field Programmable Gate Arrays (FPGAs), since they result in permanent alterations of the mapped circuit when they affect the configuration memory. Concurrent Error Detection (CED) techniques, such as Dual Modular Redundancy (DMR), are usually employed to detect errors that affect the performance of the circuit. The Parseval Sum of Squares (SoS) check is a widely used technique for detecting errors produced in the complex Fast Fourier Transform (FFT). In this paper, we present a scheme to implement CED techniques for the complex FFT implemented in SRAM-based FPGAs. These techniques perform checks based on relationships existing between one or more of the inputs and the outputs of the algorithm. Three examples of these techniques are provided to further clarify how to construct them. These techniques, along with DMR and SoS, have been tested through fault injection. An analysis of their error detection capabilities shows that they achieve high detection rates with much less resource usage than DMR and SoS. In addition, the number of false error detections for these techniques is lower than that of SoS, which leads to fewer unnecessary reconfigurations of the device.06/07/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2792445System-Wide Time versus Density Tradeoff in Real-Time Multicore Fluid Scheduling
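The Parseval SoS check mentioned in the abstract relies on the DFT preserving signal energy up to a factor of N; a minimal software sketch (naive DFT in pure Python, illustrative names, none of the paper's FPGA specifics) is:

```python
import cmath

def dft(x):
    """Naive complex DFT: X[k] = sum_n x[n] * exp(-2j*pi*k*n/N)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * k * j / n)
                for j in range(n))
            for k in range(n)]

def parseval_check(x, X, tol=1e-6):
    """Parseval Sum-of-Squares check: input energy must equal output
    energy divided by N, up to a numerical tolerance."""
    n = len(x)
    e_in = sum(abs(v) ** 2 for v in x)
    e_out = sum(abs(v) ** 2 for v in X) / n
    return abs(e_in - e_out) <= tol * max(e_in, 1.0)

x = [1+2j, -0.5+0j, 3-1j, 0.25+0.75j]
X = dft(x)
assert parseval_check(x, X)          # fault-free transform passes
X[2] += 0.1                          # inject an error in one output
assert not parseval_check(x, X)      # the SoS check flags it
```

In hardware the check is computed concurrently with the transform; the sketch only demonstrates the input/output relationship being checked.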
https://www.computer.org/csdl/trans/tc/2018/07/08259245-abs.html
Recent parallel programming frameworks such as OpenCL and OpenMP allow us to exploit parallelization freedom for real-time tasks. This parallelization freedom creates a time versus density tradeoff in fluid scheduling, i.e., more parallelization reduces thread execution times but increases the density. By exercising this tradeoff system-wide, we propose optimal parameter tuning of real-time tasks aimed at maximizing the schedulability of multicore fluid scheduling. Our experimental study, by both simulation and actual implementation, shows that the proposed approach balances the time and the density well, and results in up to 80 percent improvement in schedulability.06/07/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2793919wrJFS: A Write-Reduction Journaling File System for Byte-addressable NVRAM
https://www.computer.org/csdl/trans/tc/2018/07/08260880-abs.html
Non-volatile random-access memory (NVRAM) has become a mainstream storage device in embedded systems due to its favorable features, such as small size, low power consumption, and short read/write latency. Unlike dynamic random-access memory (DRAM), NVRAM has asymmetric performance and energy consumption for read and write operations: on NVRAM, a write operation generally consumes more energy and time than a read operation. Unfortunately, current mobile/embedded file systems, such as EXT2/3 and EXT4, are very unfriendly to NVRAM devices, because they employ a journaling mechanism to increase data reliability. Although a journaling mechanism raises the safety of data in a file system, it also repeatedly writes data to storage as data is committed and checkpointed. Though several related works have been proposed to reduce the amount of write traffic to NVRAM, they still cannot effectively minimize the write amplification of a journaling mechanism. These observations motivate us to design a two-phase write-reduction journaling file system called wrJFS. In the first phase, wrJFS classifies data into two categories: metadata and user data. As the size of metadata is usually very small (a few bytes), a byte-enabled journaling strategy handles metadata during the commit and checkpoint stages. In contrast, the size of user data is very large relative to metadata; thus, user data is processed in the second phase, where it is compressed by a hardware encoder to reduce the write size and managed by a compression-enabled journaling strategy to avoid write amplification on NVRAM. Moreover, we analyze the overhead of wrJFS and show that it is negligible. 
According to the experimental results, the proposed wrJFS outperforms other journaling file systems even though the experiments include the overhead of data compression.06/07/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2794440NV-Clustering: Normally-Off Computing Using Non-Volatile Datapaths
https://www.computer.org/csdl/trans/tc/2018/07/08263396-abs.html
With technology downscaling, static power dissipation presents a crucial challenge to multicore, many-core, and System-on-Chip (SoC) architectures due to the increased role of leakage currents in overall energy consumption and the need to support power-gating schemes. Herein, a Non-Volatile (NV) flip-flop design approach, referred to as NV-Clustering, is developed to realize middleware-transparent intermittent computing. First, a Logic-Embedded Flip-Flop (LE-FF) is developed to realize rudimentary Boolean logic functions along with an inherent state-holding capability within a compact footprint. Second, the NV-Clustering synthesis procedure and corresponding tool module are utilized to instantiate the LE-FF library cells within conventional Register Transfer Level (RTL) specifications. This selectively clusters together logic and NV state-holding functionality, based on energy and area minimization criteria. NV-Clustering is applied to a wide range of benchmarks including ISCAS-89, MCNS, and ITC-99 computational circuits using an LE-FF based on the Spin Hall Effect (SHE)-assisted Spin Transfer Torque (STT) Magnetic Tunnel Junction (MTJ). Simulation results validate functionality and demonstrate power dissipation, area, and delay benefits. For instance, results for ISCAS-89 benchmarks indicate a 15 percent area reduction on average, up to 22 percent reduction in energy consumption, and up to 14 percent reduction in delay as compared to alternative NV-FF based designs, as evaluated via SPICE simulation at the 45-nm technology node.06/07/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2795601Performance and Power-Efficient Design of Dense Non-Volatile Cache in CMPs
https://www.computer.org/csdl/trans/tc/2018/07/08265206-abs.html
In this paper, we present a novel cache design based on <italic>Multi-Level Cell Spin-Transfer Torque RAM</italic> (MLC STT-RAM) that can dynamically adjust the set capacity and associativity to efficiently use the full potential of MLC STT-RAM technology. We exploit the asymmetric nature of the MLC storage scheme to build cache lines featuring heterogeneous performance, that is, half of the cache lines are read-friendly, while the other half are write-friendly. Furthermore, we propose to opportunistically deactivate cache ways in underutilized sets to convert MLC to <italic>Single-Level Cell</italic> (SLC) mode, which features overall better performance and lifetime. Our ultimate goal is to build a cache architecture that combines the capacity advantages of MLC and the performance/energy advantages of SLC. Our experimental evaluations show an average improvement of 43 percent in the total number of conflict misses, 27 percent in memory access latency, 12 percent in system performance, and 26 percent in L3 access energy, with a slight degradation in lifetime (about 7 percent) compared to an SLC cache.06/07/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2796067FPGA-Based Data Storage System on NAND Flash Memory in RAID 6 Architecture for In-Line Pipeline Inspection Gauges
https://www.computer.org/csdl/trans/tc/2018/07/08315494-abs.html
In this manuscript, we present a redundant data storage system based on NAND flash memory chips for in-line Pipeline Inspection Gauges (PIGs). The system is the next step for a technique that reduces the data produced by the 80 transducers used for straight-beam ultrasonic inspection from 1,024 to 37 bytes. Each inspection is costly, because PIGs check pipelines up to 100 km long, collecting data every 3 mm and reaching speeds of 2 m/s. These conditions require that the storage system be redundant and able to maintain a minimum data flow, thus avoiding bottlenecks. To achieve this, we analyzed the variables that influence the inspection process in real time, and we structured our Flash Translation Layer (FTL) to eliminate the latencies generated by the computation of the Error Correcting Codes (ECC) and redundancy bytes. Our controller computes the ECC and redundancy bytes while it transfers the information to the cache register of the selected die in the memory chips. At the hardware level, we interleaved 8 NAND flash chips in a Redundant Array of Independent Disks (RAID) type-6 architecture. We tested the storage system considering the incorrect response of up to 2 chips, ensuring a throughput of up to 7.28 MB/s. Finally, we expanded the analysis of the data flow, showing that this system remains viable for different pipeline diameters or compression techniques.06/07/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2794986Lightweight Hardware Transactional Memory for GPU Scratchpad Memory
https://www.computer.org/csdl/trans/tc/2018/06/08119530-abs.html
Graphics Processing Units (GPUs) have become the accelerator of choice for data-parallel applications, enabling the execution of thousands of threads in a Single Instruction - Multiple Thread (SIMT) fashion. Using OpenCL terminology, GPUs offer a global memory space shared by all the threads in the GPU, as well as a local memory space shared by only a subset of the threads. Programmers can use local memory as a scratchpad to improve the performance of their applications due to its lower latency as compared to global memory. In the SIMT execution model, data locking mechanisms used to protect shared data limit scalability. To take full advantage of the lower latency that local memory affords, and to provide an efficient synchronization mechanism, we propose GPU-LocalTM as a lightweight and efficient transactional memory (TM) for GPU local memory. To minimize the storage resources required for TM support, GPU-LocalTM allocates transactional metadata in the existing memory resources. Additionally, GPU-LocalTM implements different conflict detection mechanisms that can be used to match the characteristics of the application. For the workloads studied in our simulation-based evaluation, GPU-LocalTM provides from 1.1X up to 100X speedup over serialized critical sections.05/08/2018 2:05 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2776908CRAT: Enabling Coordinated Register Allocation and Thread-Level Parallelism Optimization for GPUs
https://www.computer.org/csdl/trans/tc/2018/06/08119900-abs.html
The key to high performance on GPUs lies in massive multithreading, which enables thread switching to hide long latencies. GPUs are equipped with a large register file to enable fast context switches. However, thread throttling techniques, which are designed to mitigate cache contention, lead to under-utilization of registers. Register allocation is a significant factor for performance, as it not only determines single-thread performance but also indirectly affects the TLP. In this paper, we propose Coordinated Register Allocation and Thread-level parallelism (<italic>CRAT</italic>) to explore the optimization space of register allocation and TLP management on GPUs. CRAT employs both compile-time (CRAT-static) and run-time (CRAT-dyn) techniques to cover the design space. CRAT-static statically explores the TLP and register allocation trade-off, while CRAT-dyn exploits dynamic register allocation for further improvement. Experiments indicate that CRAT-static achieves an average 1.25X speedup over an existing TLP management technique. On four register-limited applications, CRAT-dyn further improves the speedup of CRAT-static from 1.51X to 1.70X.05/08/2018 2:05 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2776272Fast Bit-Parallel Binary Multipliers Based on Type-I Pentanomials
https://www.computer.org/csdl/trans/tc/2018/06/08125152-abs.html
In this paper, a fast implementation of bit-parallel polynomial basis (PB) multipliers over the binary extension field <inline-formula><tex-math notation="LaTeX">$GF(2^m)$</tex-math><alternatives> <inline-graphic xlink:href="imana-ieq1-2778730.gif"/></alternatives></inline-formula> generated by type-I irreducible pentanomials is presented. Explicit expressions for the coordinates of the multipliers and a detailed example are given. Complexity analysis shows that the multipliers presented here have the lowest delay in comparison to similar bit-parallel PB multipliers found in the literature based on this class of irreducible pentanomials. In order to confirm the theoretical complexities, hardware implementations on Xilinx FPGAs have also been performed. Experimental results show that the approach presented here exhibits the lowest delay with a balanced <inline-formula> <tex-math notation="LaTeX">$Area\times Time$</tex-math><alternatives> <inline-graphic xlink:href="imana-ieq2-2778730.gif"/></alternatives></inline-formula> complexity when compared with similar multipliers.05/08/2018 2:05 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2778730Sanitizer: Mitigating the Impact of Expensive ECC Checks on STT-MRAM Based Main Memories
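For reference, polynomial-basis multiplication modulo an irreducible pentanomial can be sketched bit-serially in software. The sketch below uses the AES field polynomial x^8 + x^4 + x^3 + x + 1 (a well-known irreducible pentanomial, though not of the type-I class the paper targets) rather than anything from the paper; the bit-parallel hardware multipliers in the abstract compute the same function combinationally:

```python
def gf_mul(a, b, m=8, poly=0x11B):
    """Bit-serial polynomial-basis multiplication in GF(2^m):
    shift-and-XOR, reducing modulo the irreducible polynomial
    whenever the running product reaches degree m."""
    result = 0
    for _ in range(m):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> m:            # degree reached m: reduce
            a ^= poly
    return result

# In GF(2^8) with the AES polynomial, x * x^7 = x^8 = x^4 + x^3 + x + 1
assert gf_mul(0x02, 0x80) == 0x1B
assert gf_mul(0x01, 0x5A) == 0x5A    # 1 is the multiplicative identity
```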
https://www.computer.org/csdl/trans/tc/2018/06/08126226-abs.html
DRAM density scaling has become increasingly difficult due to challenges in maintaining a sufficiently high storage capacitance and a sufficiently low leakage current at nanoscale feature sizes. Non-volatile memories (NVMs) have drawn significant attention as potential DRAM replacements because they represent information using resistance rather than electrical charge. Spin-torque transfer magnetoresistive RAM (STT-MRAM) is one of the most promising NVM technologies due to its relatively low write energy, high speed, and high endurance. However, STT-MRAM suffers from its own scaling problems. As the size of the storage element decreases with technology scaling, STT-MRAM retention error rates are expected to increase, which will require multi-bit error-correcting code (ECC) and periodic scrubbing. We introduce the <italic>Sanitizer</italic> architecture, which mitigates the performance and energy overheads of ECC and scrubbing in future STT-MRAM based main memories. To reduce the scrubbing rate, a coarse-grained, multi-bit ECC mechanism with a 12.5 percent storage overhead is used. To avoid fetching multiple blocks from memory and performing costly ECC checks on every read, the memory regions that will likely be accessed in the near future are predicted and proactively scrubbed. Compared to a conventional STT-MRAM system, Sanitizer improves performance by 1.22<inline-formula> <tex-math notation="LaTeX">$\times$</tex-math><alternatives><inline-graphic xlink:href="guo-ieq1-2779151.gif"/> </alternatives></inline-formula> and reduces end-to-end system energy by 22 percent.05/08/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2779151Optimizing Soft Error Reliability Through Scheduling on Heterogeneous Multicore Processors
https://www.computer.org/csdl/trans/tc/2018/06/08126846-abs.html
Reliability against soft errors is an increasingly important issue as technology continues to shrink. In this paper, we show that applications exhibit different reliability characteristics on big, high-performance cores versus small, power-efficient cores, and that there is significant opportunity to improve system reliability through reliability-aware scheduling on heterogeneous multicore processors. We monitor the reliability characteristics of all running applications, and dynamically schedule applications to the different core types in a heterogeneous multicore to maximize system reliability. Reliability-aware scheduling improves reliability by 25.4 percent on average (and up to 60.2 percent) compared to performance-optimized scheduling on a heterogeneous multicore processor with two big cores and two small cores, while degrading performance by only 6.3 percent. We also introduce a novel system-level reliability metric for multiprogram workloads on (heterogeneous) multicores. We provide a trade-off analysis among reliability-, power- and performance-optimized scheduling, and evaluate reliability-aware scheduling under performance constraints and for unprotected L1 caches. In addition, we extend our scheduling mechanisms to multithreaded programs. The hardware cost in support of our reliability-aware scheduler is limited to 296 bytes per core.05/08/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2779480StaleLearn: Learning Acceleration with Asynchronous Synchronization Between Model Replicas on PIM
https://www.computer.org/csdl/trans/tc/2018/06/08168367-abs.html
GPUs have become popular for machine learning thanks to the large amount of parallelism found in learning workloads. While the GPU has been effective for many learning tasks, many GPU learning applications still have low execution efficiency due to sparse data. Sparse data induces divergent memory accesses with low locality, so a large fraction of execution time is spent transferring data across the memory hierarchy. Although considerable effort has been devoted to reducing memory divergence, iterative-convergent learning provides a unique opportunity to achieve the full potential of modern GPUs, in that it allows different threads to continue computation using stale values. In this paper, we propose StaleLearn, a learning acceleration mechanism that reduces the memory divergence overhead of GPU learning by exploiting the stale value tolerance of iterative-convergent learning. Based on this tolerance, StaleLearn transforms the problem of divergent memory accesses into a synchronization problem by replicating the model, and reduces the synchronization overhead through asynchronous synchronization on Processor-in-Memory (PIM). The stale value tolerance also enables a clear task decomposition between the GPU and PIM, which can effectively exploit the parallelism between them. On average, our approach accelerates representative GPU learning applications by 3.17 times with existing PIM proposals.05/08/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2780237Analytic Multi-Core Processor Model for Fast Design-Space Exploration
https://www.computer.org/csdl/trans/tc/2018/06/08168422-abs.html
Simulators help computer architects optimize system designs, but their limited performance, even at moderate size and detail, makes them infeasible for design-space exploration of future exascale systems. Analytic models, in contrast, offer very fast turn-around times. In this paper we propose an analytic multi-core processor-performance model that takes as inputs <italic>a)</italic> a parametric, microarchitecture-independent characterization of the target workload, and <italic>b)</italic> a hardware configuration of the core and the memory hierarchy. The model considers instruction-level parallelism (ILP) per instruction type, models <italic>single instruction, multiple data</italic> (SIMD) features, and accounts for cache and memory-bandwidth contention between cores. We validate our model by comparing its performance estimates with measurements from hardware performance counters on Intel Xeon and ARM Cortex-A15 systems. We estimate multi-core contention with a maximum error of 11.4 percent. The average single-thread error increases from 25 percent for a state-of-the-art simulator to 59 percent for our model, but the correlation remains 0.8, a high relative accuracy, while we achieve a speedup of several orders of magnitude. With a much higher capacity than simulators and more reliable insights than back-of-the-envelope calculations, the model makes automated design-space exploration of exascale systems possible, which we show using a real-world case study from radio astronomy.05/08/2018 2:05 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2780239Design, Implementation and Verification of Cloud Architecture for Monitoring a Virtual Machine's Security Health
https://www.computer.org/csdl/trans/tc/2018/06/08169039-abs.html
Cloud customers need assurances regarding the security of their virtual machines (VMs) operating within an Infrastructure as a Service (IaaS) cloud system. This is complicated by the customer not knowing where their VM is executing, and by the semantic gap between what the customer wants to know and what can be measured in the cloud. We present <italic>CloudMonatt</italic>, an architecture for monitoring a VM's security health, and show a full prototype based on the OpenStack open source cloud software. We also conduct a systematic security verification of <italic>CloudMonatt</italic> to show that there are no security vulnerabilities that could allow an attacker to subvert its protection. We model and verify the network protocols within the distributed system, as well as the interactions of hardware/software modules inside the cloud server. Our results show that <italic>CloudMonatt</italic> is capable of delivering this monitoring and attestation service to customers in an unforgeable and reliable manner.05/08/2018 2:05 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2780823CLIM: A Cross-Level Workload-Aware Timing Error Prediction Model for Functional Units
https://www.computer.org/csdl/trans/tc/2018/06/08207606-abs.html
Timing errors, caused by timing violations of sensitized circuit paths, have emerged as an important threat to the reliability of synchronous digital circuits. To protect circuits from these timing errors, designers typically use a conservative timing margin, which leads to operational inefficiency. Existing adaptive approaches reduce such conservative margins by predicting timing errors in advance and adjusting the timing margin adaptively. However, these error prediction approaches overlook the impact of the input workload (i.e., operands) on path sensitization, resulting in a loss of accuracy. The diversity of input operands leads to complex path sensitization behaviors, which are hard to represent in timing error modeling. In this paper, we propose <monospace> <bold>CLIM</bold></monospace>, a cross-level workload-aware timing error prediction model for functional units (FUs). <monospace><bold>CLIM</bold></monospace> predicts whether there are timing errors in an FU at two levels: bit-level and value-level. At the bit level or value level, <monospace><bold>CLIM</bold></monospace> predicts each output bit or the entire output value, respectively, as one of two classes: <inline-formula><tex-math notation="LaTeX">$\lbrace$</tex-math> <alternatives><inline-graphic xlink:href="jiao-ieq1-2783333.gif"/></alternatives></inline-formula><italic>timing correct</italic>, <italic>timing erroneous</italic><inline-formula><tex-math notation="LaTeX">$\rbrace$</tex-math> <alternatives><inline-graphic xlink:href="jiao-ieq2-2783333.gif"/></alternatives></inline-formula>, as a function of the input workload and clock period. We apply supervised learning methods to construct <monospace><bold>CLIM</bold></monospace>, using input operands, computation history and circuit toggling as input features, and the outputs’ timing classes as labels. The training data are collected from gate-level simulations (GLS) of post place-and-route designs in a TSMC 45nm process. 
We evaluate <monospace><bold>CLIM</bold></monospace> prediction accuracy for various FUs and compare it with baseline models. On average, <monospace><bold>CLIM</bold></monospace> exhibits 95 percent prediction accuracy at the value level and 97 percent at the bit level, and executes 173X faster than GLS. We utilize <monospace><bold>CLIM</bold></monospace> to analyze the value-level and bit-level reliability of FUs under random and real-world application workloads. At the value level, <monospace><bold>CLIM</bold></monospace>-based reliability estimation deviates from detailed GLS ground truth by 2.8 percent on average. At the bit level, we introduce the concept of the <italic>bit-level reliability specification</italic> of error-tolerant applications and compare it with the <monospace><bold>CLIM</bold></monospace>-based bit-level reliability estimation. Based on this comparison, <monospace><bold>CLIM</bold></monospace> classifies the application quality into two classes: <inline-formula><tex-math notation="LaTeX">$\lbrace$</tex-math><alternatives> <inline-graphic xlink:href="jiao-ieq3-2783333.gif"/></alternatives></inline-formula><italic>acceptable</italic>, <italic>non-acceptable</italic><inline-formula><tex-math notation="LaTeX">$\rbrace$</tex-math><alternatives> <inline-graphic xlink:href="jiao-ieq4-2783333.gif"/></alternatives></inline-formula>. On average, 97 percent of the application quality classifications are consistent with GLS ground truth.05/08/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2783333STEM: A Thermal-Constrained Real-Time Scheduling for 3D Heterogeneous-ISA Multicore Processors
https://www.computer.org/csdl/trans/tc/2018/06/08214221-abs.html
Synergistic processing among multiple instruction set architecture (ISA) cores, together with the heat flows in a 3D heterogeneous multicore chip, exacerbates the complexity of thermal problems. To satisfy performance and temperature requirements, a thermal size ratio detection method is proposed to control the heat generated by task executions with consideration of synergistic processing. At run time, a Synthetic Thermal-Efficient Manager (STEM) is proposed to dispatch tasks and thermal sizes to cores and to adjust the heat generated in each core through dynamic voltage and frequency scaling. A schedulability test with a degradation factor for synergistic processing is first derived to guarantee that the timing and thermal constraints are met for all tasks in a 3D heterogeneous-ISA multicore processor. Finally, a series of simulations yields encouraging performance results for the proposed methodology, and a case study using commercially available technology validates the practicability of the proposed approach.05/08/2018 2:05 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2783941Contention and Locality-Aware Work-Stealing for Iterative Applications in Multi-Socket Computers
https://www.computer.org/csdl/trans/tc/2018/06/08214252-abs.html
Modern large-scale computers have shifted to Multi-Socket Multi-Core (MSMC) architectures, where multiple CPU chips are integrated into a machine as sockets and multiple memory nodes form the shared main memory (NUMA). To improve the hardware utilization of MSMC computers, multiple programs are often executed concurrently. However, most work-stealing schedulers are designed for single-socket architectures and contention-free scenarios. Work-stealing programs therefore suffer from very frequent remote memory accesses and serious interference from co-located programs on MSMC architectures, which significantly degrades their performance. To solve these two problems, we propose a Contention- and Locality-Aware Work-Stealing (CLAWS) scheduler. CLAWS first evenly distributes the data set of a program across all the memory nodes and allocates each task to the socket whose local memory node stores its data. Then, according to per-socket contention information collected at runtime, CLAWS dynamically migrates data and re-allocates the corresponding tasks to balance the workload and reduce remote memory accesses. Experimental results show that CLAWS improves the performance of memory-bound programs by 40.1 percent on average compared with traditional work-stealing schedulers, while also being more energy efficient.04/05/2018 2:05 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2783932A TMDTO Attack Against Lizard
https://www.computer.org/csdl/trans/tc/2018/05/08107499-abs.html
Lizard is a recently proposed lightweight stream cipher that claims 60-bit security against distinguishing attacks (related to state recovery) and 80-bit security against key recovery attacks. The cipher has a 121-bit state. In this paper, we first note that using <inline-formula><tex-math notation="LaTeX">$\psi$</tex-math><alternatives> <inline-graphic xlink:href="maitra-ieq1-2773062.gif"/></alternatives></inline-formula> key stream bits one can recover <inline-formula><tex-math notation="LaTeX">$\psi$</tex-math><alternatives> <inline-graphic xlink:href="maitra-ieq2-2773062.gif"/></alternatives></inline-formula> unknown bits of the state when <inline-formula><tex-math notation="LaTeX">$\tau$</tex-math><alternatives> <inline-graphic xlink:href="maitra-ieq3-2773062.gif"/></alternatives></inline-formula> state bits are fixed to a specific pattern. This is made possible by guessing the remaining state bits. We present values of <inline-formula><tex-math notation="LaTeX">$\psi, \tau$</tex-math><alternatives> <inline-graphic xlink:href="maitra-ieq4-2773062.gif"/></alternatives></inline-formula>, based on the state size, that help in mounting a generic conditional TMDTO attack following BSW sampling. For Lizard, we obtain a preprocessing complexity of <inline-formula><tex-math notation="LaTeX">$2^{67}$</tex-math><alternatives> <inline-graphic xlink:href="maitra-ieq5-2773062.gif"/></alternatives></inline-formula>, and a maximum of Data, Time and Memory complexity during the online phase of <inline-formula><tex-math notation="LaTeX">$2^{54}$</tex-math> <alternatives><inline-graphic xlink:href="maitra-ieq6-2773062.gif"/></alternatives></inline-formula>. 
The parameters in the online phase are thus significantly less than <inline-formula><tex-math notation="LaTeX">$2^{60}$</tex-math> <alternatives><inline-graphic xlink:href="maitra-ieq7-2773062.gif"/></alternatives></inline-formula>, the claimed security level against distinguishing attacks.04/05/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2773062NUDA: Non-Uniform Directory Architecture for Scalable Chip Multiprocessors
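As a back-of-the-envelope companion to the Lizard complexities above, the sketch below checks them against the classic Biryukov-Shamir tradeoff curve. The curve and the split of the online cost into separate time and memory exponents are illustrative assumptions (the paper's conditional BSW-sampling tradeoff may use a different parameterization); only the 121-bit state, the 2^67 preprocessing, and the 2^54 online maximum come from the abstract.

```python
# All complexities are expressed as base-2 exponents.
n = 121            # Lizard state size in bits
d = 54             # data (keystream) complexity exponent
p = n - d          # preprocessing: P = N / D
assert p == 67     # matches the quoted 2^67 preprocessing

t = 54                         # assume time equals the quoted online maximum
m = (2 * n - t - 2 * d) // 2   # solve T * M^2 * D^2 = N^2 for M
print(f"P = 2^{p}, T = 2^{t}, M = 2^{m}, D = 2^{d}")
assert max(t, m, d) == 54      # online phase stays at 2^54, below 2^60
```

Under these assumptions the memory exponent lands at 2^40, comfortably under the stated 2^54 online maximum.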
https://www.computer.org/csdl/trans/tc/2018/05/08107506-abs.html
Chip multiprocessors (CMPs) incur directory storage overhead if cache coherence is realized via sharer tracking. This work proposes a novel framework dubbed <underline>n</underline>on-<underline>u</underline>niform <underline>d</underline>irectory <underline>a</underline>rchitecture (NUDA), leveraging two insights: the number of “active” directory entries required to stay on chip is usually small within a short execution time window due to high directory locality, and the fraction of interrogated directory entries drops as the core count rises. Unlike earlier storage overhead reduction techniques, which require all cached LLC blocks to have their directory entries fully on chip, NUDA dynamically buffers only the most active <underline>d</underline>irectory <underline>v</underline>ectors (DVs) on chip while keeping the DVs of all LLC blocks in a backing store at a lower storage level. NUDA attains its superior efficiency via an inventive <underline>c</underline>riticality-<underline>a</underline>ware <underline>r</underline>eplacement <underline>p</underline>olicy (CARP) for on-chip buffer management and effective prefetching to <underline>p</underline>re-<underline>a</underline>ctivate <underline>ve</underline>ctors (PAVE) for upcoming coherence interrogations. We have evaluated NUDA by gem5 simulation for 64-core CMPs under PARSEC and SPLASH benchmarks, demonstrating that CARP and PAVE enhance on-chip directory storage efficiency significantly. NUDA with a small on-chip buffer for DVs exhibits negligible performance degradation (within 2.6 percent) compared to a full on-chip directory, while outperforming previous counterparts in directory area reduction when the on-chip directory budget is scarcely provisioned for high scalability.04/05/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2773061NV-Dedup: High-Performance Inline Deduplication for Non-Volatile Memory
https://www.computer.org/csdl/trans/tc/2018/05/08115169-abs.html
Byte-addressable non-volatile memory (NVM) is a promising medium for data storage, and NVM-oriented file systems have been designed to explore NVM's performance potential. Meanwhile, applications may write considerable duplicate data. For NVM, removing duplicate data can improve space efficiency and write endurance, and potentially improve performance by avoiding repeatedly writing the same data. However, we have observed severe performance degradation when implementing a state-of-the-art inline deduplication algorithm in an NVM-oriented file system. A quantitative analysis reveals that, with NVM, 1) the conventional way to manage deduplication metadata for block devices, particularly in light of consistency, is inefficient, and 2) performance with deduplication becomes more subject to fingerprint calculations. We hence propose a deduplication algorithm called <italic>NV-Dedup</italic>. NV-Dedup manages deduplication metadata in a fine-grained, CPU- and NVM-favored way, and preserves metadata consistency with a lightweight transactional scheme. It also performs workload-adaptive fingerprinting, based on an analytical model and a transition scheme among fingerprinting methods, to reduce calculation penalties. We have built a prototype of NV-Dedup in the Persistent Memory File System (PMFS). Experiments show that NV-Dedup not only substantially saves NVM space, but also boosts the performance of PMFS by up to 2.1<inline-formula> <tex-math notation="LaTeX">$\times$</tex-math><alternatives><inline-graphic xlink:href="wang-ieq1-2774270.gif"/> </alternatives></inline-formula>.04/05/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2774270Efficient Data-Allocation Scheme for Eliminating Garbage Collection During Analysis of Big Graphs Stored in NAND Flash Memory
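The core idea behind inline deduplication as described above can be sketched in a few lines: each written block is fingerprinted, and blocks whose fingerprint already exists are stored once and shared by reference. This is a generic toy illustration; NV-Dedup's actual metadata layout, consistency scheme, and adaptive fingerprinting are not reproduced here.

```python
import hashlib

class DedupStore:
    """Toy content-addressed store: one physical copy per unique block."""
    def __init__(self):
        self.table = {}  # fingerprint -> (block, refcount)

    def write(self, block: bytes) -> str:
        fp = hashlib.sha256(block).hexdigest()
        if fp in self.table:
            data, refs = self.table[fp]
            self.table[fp] = (data, refs + 1)   # duplicate: bump refcount only
        else:
            self.table[fp] = (block, 1)         # unique: store the data once
        return fp

store = DedupStore()
fps = [store.write(b) for b in (b"A" * 4096, b"B" * 4096, b"A" * 4096)]
assert fps[0] == fps[2]          # identical blocks share one fingerprint
assert len(store.table) == 2     # three writes, two physical blocks
```

The observation in the abstract, that performance becomes dominated by the fingerprint calculation (the `sha256` call here), is what motivates NV-Dedup's switching among cheaper and stronger fingerprinting methods.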
https://www.computer.org/csdl/trans/tc/2018/05/08118115-abs.html
A new control scheme for eliminating garbage collection during high-speed analysis of big-graph data stored in NAND flash memory is proposed and evaluated. During big-graph analysis, intermediate results are stored in NAND flash memory and updated repeatedly. Under a conventional control scheme, excessive data copying, called “garbage collection,” occurs because overwriting data in NAND flash memory is prohibited, and this copying degrades the performance of big-graph analysis. Under the proposed control scheme, in contrast, the controller of the NAND flash memory writes intermediate results that are updated at the same time to the same block, and the excessive data copying is eliminated completely because all the data in the block can be erased together before the intermediate results are updated. As a result, the proposed scheme shortens analysis time by 88 percent, increasing big-graph analysis speed by 8.7 times. The proposed scheme can be applied to three-dimensional NAND flash memory, where it increases analysis speed by 9.5 times, and to emerging high-density memories such as three-dimensional vertical chain-cell phase-change memory. These results show that the proposed control scheme enables high-speed analysis of big graphs.04/05/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2775624Advancing CMOS-Type Ising Arithmetic Unit into the Domain of Real-World Applications
https://www.computer.org/csdl/trans/tc/2018/05/08118124-abs.html
Solving combinatorial optimization problems is a great challenge for Von Neumann-architecture computing. Although the Ising model could provide promising solutions for such problems, existing Ising chips, including superconductive, optical and CMOS-type circuit implementations, cannot meet the precision requirements of real-world combinatorial optimization applications. To support real-world applications, we propose three improvements over existing CMOS-type Ising chips: suitably narrow bit-width memory cells with approximate multiply-adders, a double-random-source flipping method with cross random number generators, and a shared circuit design between adjacent spin nodes. With these improvements, we achieve high precision while maintaining the low-cost characteristic of CMOS-type Ising chips. When searching for the ground state of Ising models, our CMOS-type Ising chip improves precision to more than 99 percent, compared with about 93 percent for existing chips. Moreover, its hardware cost is only 32 percent of that of the common implementation achieving the same high precision. In particular, we have applied our Ising chip to image segmentation, a typical real-world application. The results show that, to find a segmentation of similar quality, our CMOS-type Ising chip can speed up segmentation by 1900× with only 0.017‰ of the energy consumption of approximate algorithms running on conventional computers.04/05/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2775618The Suboptimal Routing Algorithm for 2D Mesh Network
https://www.computer.org/csdl/trans/tc/2018/05/08118167-abs.html
Because the routing-algorithm search space for 2D mesh-based Network-on-Chip (NoC) is huge, a divide-and-conquer method is presented to explore it effectively. This method creates a large number of candidate routing algorithms, so to obtain final results in an acceptable time, a precise metric is needed to measure routing performance and discard poorly performing routings. In this paper, we propose a new routing performance metric, namely, network pressure. Network pressure has three advantages: (1) it measures the congestion state of the whole network; (2) the network pressure of a network and that of a partial component of it are highly related under the same routing; (3) it is closely related to routing performance. Based on network pressure and the divide-and-conquer method, high-performance routing can be achieved. The obtained routing is called suboptimal for two reasons: (1) there is only a small gap between its performance and that of fully adaptive routings under both transpose1 and transpose2 traffics; (2) the search space of routing algorithms is explored systematically and widely.04/05/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2775643Clockless Spintronic Logic: A Robust and Ultra-Low Power Computing Paradigm
https://www.computer.org/csdl/trans/tc/2018/05/08118192-abs.html
Asynchronous logic offers the advantages of no clock tree, robust circuit operation, avoidance of worst-case timing margins, and a reduced emission spectrum. Thus, computational paradigms are sought that attain the advantages of clockless logic by leveraging the complementary characteristics of emerging devices and CMOS transistors within novel circuit designs. This paper introduces Spin Torque Enabled NULL Convention Logic (STENCL), which exploits the physical characteristics of non-volatile Domain-Wall (DW) and memristive devices to realize the Quasi-Delay-Insensitive (QDI) NULL Convention Logic (NCL) asynchronous design methodology. First, a formal algorithm is developed to transform NCL-based threshold m-of-n gate realizations to STENCL, in order to generate the corresponding input memristance and NULL module memristance required for nominal currents achieving DW device biasing. Second, hysteresis and set/reset conditions are realized by determining the corresponding current fluctuations required to move the DW within each threshold logic gate to realize all 27 foundational NCL gate structures, which are then simulated to assess energy and delay metrics. Third, a case study of a four-stage pipelined 32-bit IEEE single-precision floating point co-processor implemented as a dual-rail STENCL architecture is compared to a conventional CMOS-based NCL design implemented in an IBM SOI1250 45nm CMOS process. Fourth, a sensitivity analysis is performed to assess the impact of write accuracy and drift on memristor and DW device operation. Results indicate that STENCL-based designs achieve 2-fold to 20-fold reductions in energy consumption with up to 8-fold reductions in area over an equivalent CMOS-based NCL design for 32-bit full adders. 
Comparisons for various four-stage pipelined 32-bit IEEE single-precision floating-point co-processors and ISCAS benchmarks further substantiate those benefits for operation within acceptable tolerances at identical process technology nodes.04/05/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2776139Response-Time Analysis of Engine Control Applications Under Fixed-Priority Scheduling
https://www.computer.org/csdl/trans/tc/2018/05/08119938-abs.html
Engine control systems include computational activities that are triggered at predetermined angular values of the crankshaft, and therefore generate a workload that tends to increase with the engine speed. To cope with overload conditions, a common practice adopted by the automotive industry is to design such angular tasks with a set of modes that switch at given rotation speeds to adapt the computational demand. This paper presents an exact response time analysis for engine control applications consisting of periodic and engine-triggered tasks scheduled by fixed priority. The proposed analysis explicitly takes into account the physical constraints of the considered systems and is based on the derivation of dominant speeds, which are particular engine speeds that are proved to determine the worst-case behavior of engine-triggered tasks from a timing perspective. Experimental results are finally reported to validate the proposed approach and compare it against an existing sufficient test.04/05/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2777826On-Chip Communication Network for Efficient Training of Deep Convolutional Networks on Heterogeneous Manycore Systems
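The speed-dependent workload described in the abstract above can be illustrated with a small sketch: an angular task released once per crankshaft revolution has an interarrival time of 60/rpm seconds, so its utilization grows linearly with engine speed until a mode switch substitutes a lighter implementation. All WCETs and switching speeds below are hypothetical, not taken from the paper.

```python
def interarrival_s(rpm):
    """Seconds between releases of a once-per-revolution angular task."""
    return 60.0 / rpm

# (wcet_seconds, upper_speed_limit_rpm) per mode: lighter modes take over
# at higher engine speeds to keep the computational demand bounded.
MODES = [(0.004, 3000), (0.002, 6000), (0.001, 9000)]

def utilization(rpm):
    for wcet, limit in MODES:
        if rpm <= limit:
            return wcet / interarrival_s(rpm)
    raise ValueError("engine speed above red line")

# Demand rises with speed within a mode, then drops at each mode switch.
assert utilization(3000) > utilization(3001)
assert utilization(9000) < 0.2
```

The peaks just below each switching speed are one reason the analysis in the paper derives dominant speeds: the worst case occurs at particular engine speeds, not necessarily at the maximum one.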
https://www.computer.org/csdl/trans/tc/2018/05/08119941-abs.html
Convolutional Neural Networks (CNNs) have shown a great deal of success in diverse application domains including computer vision, speech recognition, and natural language processing. However, as the size of datasets and the depth of neural network architectures continue to grow, it is imperative to design high-performance and energy-efficient computing hardware for training CNNs. In this paper, we consider the problem of designing specialized CPU-GPU based heterogeneous manycore systems for energy-efficient training of CNNs. It has already been shown that the typical on-chip communication infrastructures employed in conventional CPU-GPU based heterogeneous manycore platforms are unable to handle <italic>both</italic> CPU and GPU communication requirements efficiently. To address this issue, we first analyze the on-chip traffic patterns that arise from the computational processes associated with training two deep CNN architectures, namely, LeNet and CDBNet, to perform image classification. By leveraging this knowledge, we design a hybrid Network-on-Chip (NoC) architecture, which consists of both wireline and wireless links, to improve the performance of CPU-GPU based heterogeneous manycore platforms running the above-mentioned CNN training workloads. The proposed NoC achieves <bold>1.8×</bold> reduction in network latency and improves the network throughput by a factor of <bold>2.2</bold> for training CNNs, when compared to a highly-optimized wireline mesh NoC. For the considered CNN workloads, these network-level improvements translate into <bold>25</bold> percent savings in full-system energy-delay-product (EDP). This demonstrates that the proposed hybrid NoC for heterogeneous manycore architectures is capable of significantly accelerating training of CNNs while remaining energy-efficient.04/05/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2777863Optimization of Message Encryption for Real-Time Applications in Embedded Systems
https://www.computer.org/csdl/trans/tc/2018/05/08125122-abs.html
Today, security can no longer be treated as a secondary issue in embedded and cyber-physical systems. One of the main challenges in these domains is therefore the design of secure embedded systems under stringent resource constraints and real-time requirements. However, there exists an inherent trade-off between the security protection provided and the amount of resources allocated for this purpose: the more resources are used for security, the higher the protection, but the fewer applications can run on the platform and meet their timing requirements. This trade-off is important because embedded systems are often highly resource-constrained. In this paper, we propose an efficient solution to maximize confidentiality while also guaranteeing the timing requirements of real-time applications on shared platforms.04/05/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2778728A Fast Leakage-Aware Full-Chip Transient Thermal Estimation Method
https://www.computer.org/csdl/trans/tc/2018/05/08125149-abs.html
Accurate and fast thermal estimation is important for the runtime thermal regulation of modern microprocessors due to excessive on-chip temperatures. However, because of the nonlinear relationship between leakage power and temperature, full-chip thermal estimation methods suffer from slow speed and scalability issues once the increasing static leakage power is considered. In this work, we propose a new fast leakage-aware full-chip thermal estimation method. Unlike traditional methods, which use iteration to handle the nonlinear leakage-temperature dependency, the new method applies a dynamic linearization algorithm that adaptively transforms the original nonlinear thermal model into a number of local linear thermal models. To further improve efficiency, a specially-designed adaptive model order reduction method is integrated into the thermal estimation framework to generate local compact thermal models. Our numerical results show that the new method accurately and quickly estimates the full-chip transient temperature distribution while fully accounting for the nonlinear leakage-temperature dependency. 
On different chips with core counts ranging from 9 to 36, it achieved 85<inline-formula> <tex-math notation="LaTeX">$\times$</tex-math><alternatives><inline-graphic xlink:href="wang-ieq1-2778066.gif"/> </alternatives></inline-formula> to 589<inline-formula><tex-math notation="LaTeX">$\times$</tex-math><alternatives> <inline-graphic xlink:href="wang-ieq2-2778066.gif"/></alternatives></inline-formula> speedups on average over the traditional iteration-based method, with an average thermal estimation error of around 0.2<inline-formula> <tex-math notation="LaTeX">$^{\circ}\mathrm{C}$</tex-math><alternatives> <inline-graphic xlink:href="wang-ieq3-2778066.gif"/></alternatives></inline-formula>.04/05/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2778066Tight Bounds of Differentially and Linearly Active S-Boxes and Division Property of Lilliput
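The linearization idea described in the abstract above can be seen on a deliberately tiny example: a single-node RC thermal model with an exponential leakage curve (both hypothetical, not the paper's full-chip model). The traditional approach iterates the nonlinear heat balance to a fixed point; the linearized approach replaces the leakage curve by its tangent at the current temperature, solves the resulting linear balance exactly, and re-linearizes.

```python
import math

R = 2.0         # thermal resistance to ambient (K/W), illustrative
P_DYN = 5.0     # dynamic power (W), illustrative
T_AMB = 45.0    # ambient temperature (C)

def p_leak(t):
    """Nonlinear leakage power vs. temperature (illustrative exponential)."""
    return 0.5 * math.exp(0.02 * t)

def solve_fixed_point(tol=1e-6):
    """Traditional approach: iterate T = T_amb + R*(P_dyn + P_leak(T))."""
    t, steps = T_AMB, 0
    while True:
        t_new = T_AMB + R * (P_DYN + p_leak(t))
        steps += 1
        if abs(t_new - t) < tol:
            return t_new, steps
        t = t_new

def solve_linearized(tol=1e-6):
    """Linearize p_leak at t0 and solve the now-linear balance for T."""
    t, steps = T_AMB, 0
    while True:
        g = 0.02 * p_leak(t)  # dP_leak/dT at the current point
        # T = T_AMB + R*(P_DYN + p_leak(t) + g*(T - t))  is linear in T:
        t_new = (T_AMB + R * (P_DYN + p_leak(t) - g * t)) / (1.0 - R * g)
        steps += 1
        if abs(t_new - t) < tol:
            return t_new, steps
        t = t_new

t_fp, n_fp = solve_fixed_point()
t_lin, n_lin = solve_linearized()
assert abs(t_fp - t_lin) < 1e-2  # same steady temperature
assert n_lin <= n_fp             # linearized solves converge in fewer steps
```

Both solvers reach the same steady-state temperature, but the locally linear model converges in fewer iterations, which is the effect the full-chip method exploits at scale.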
https://www.computer.org/csdl/trans/tc/2018/05/08126222-abs.html
This paper provides a security analysis of a lightweight block cipher called <sc>Lilliput</sc>, which was proposed in IEEE Transactions on Computers in 2015. <sc>Lilliput</sc> adopts an extended generalized Feistel network (EGFN), which consists of non-linear, linear, and permutation layers; its linear layer updates only a part of the state, and only linearly, which raises several security concerns. Our first discovery is that the lower bounds on the number of differentially active S-boxes provided by the designers are incorrect. We therefore derive new bounds using mixed integer linear programming (MILP), applying a two-stage search procedure introduced by Sun et al. that yields tight bounds even for a large number of rounds. The search tool is then adapted for linear cryptanalysis. With these updates, the challenging problem of evaluating <sc>Lilliput</sc>'s security against differential and linear cryptanalysis is closed. Another contribution is the best third-party cryptanalysis to date. The designers expected the EGFN to efficiently enhance security against integral cryptanalysis; however, security is not enhanced as much as they expected. In fact, the division property finds a 13-round distinguisher that improves on the previous distinguisher by 4 rounds. The distinguisher is further extended to a 17-round key recovery that improves on the previous best attack by 3 rounds.
04/05/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2775640
Designing Checksums for Detecting Errors in Fast Unitary Transforms
https://www.computer.org/csdl/trans/tc/2018/04/08039530-abs.html
Parity computations (checksums) over the input and output data of fast unitary transforms are compared, down to roundoff-noise levels, to detect the effects of a single error on any one line between stages of the fast algorithm. Error spaces and their dual spaces guide the design process.
03/09/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2753774
<sc>Adas</sc> on <sc>Cots</sc> with OpenCL: A Case Study with Lane Detection
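As a concrete instance of checksum-protected fast transforms, two classical input/output parity relations for the DFT, the DC line X[0] = Σx[n] and Parseval's energy identity, can flag a single error injected on an output line. This is only a minimal textbook illustration, not the checksum design from the paper on fast unitary transforms:

```python
import cmath

def fft(x):
    """Radix-2 Cooley-Tukey FFT (len(x) must be a power of two)."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def checksum_ok(x, X, tol=1e-9):
    """Two parity relations derived from the transform itself:
    the DC output must equal the input sum, and energy must be
    conserved (Parseval); a corrupted line breaks at least one."""
    scale = max(1.0, sum(abs(v) for v in x))
    dc_ok = abs(X[0] - sum(x)) <= tol * scale
    energy_ok = abs(sum(abs(v) ** 2 for v in X) / len(x)
                    - sum(abs(v) ** 2 for v in x)) <= tol * scale * scale
    return dc_ok and energy_ok
```

Corrupting any single output bin changes the total energy, so the Parseval check detects it even when the DC check still passes.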
https://www.computer.org/csdl/trans/tc/2018/04/08057795-abs.html
The concept of autonomous cars is driving a boost in car electronics, and the automotive electronics market is foreseen to double in size by 2025. How to benefit from this boost is an interesting question. This article presents a case study testing the feasibility of using OpenCL as the programming language and <sc>Cots</sc> components as the underlying computing platforms for <sc>Adas</sc> development. As a representative <sc>Adas</sc> application, a scalable lane detection is developed that can tune the trade-off between detection accuracy and speed. Our OpenCL implementation is tested on 14 video streams from different data-sets with different road scenarios on 5 <sc>Cots</sc> platforms. We demonstrate that the <sc>Cots</sc> platforms can provide more than sufficient computing power for the lane detection, while our OpenCL implementation can exploit the massive parallelism provided by the <sc>Cots</sc> platforms.
03/09/2018 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2759203
MC-Fluid: Multi-Core Fluid-Based Mixed-Criticality Scheduling
https://www.computer.org/csdl/trans/tc/2018/04/08059775-abs.html
Owing to growing complexity and scale, safety-critical real-time systems are generally designed using the concept of mixed-criticality, wherein applications with different criticality or importance levels are hosted on the same hardware platform. To guarantee non-interference between these applications, the hardware resources, in particular the processor, are statically partitioned among them. To overcome the resource-utilization inefficiencies of such a static scheme, the concept of mixed-criticality real-time scheduling has emerged as a promising solution. Although there are several studies on such scheduling strategies for uniprocessor platforms, the problem of efficient scheduling for the multiprocessor case has largely remained open. In this work, we design a fluid-model-based mixed-criticality scheduling algorithm for multiprocessors, in which multiple tasks are allowed to execute on the same processor simultaneously. We derive an exact schedulability test for this algorithm and also present an optimal strategy for assigning fractional execution rates to tasks. Since fluid-model-based scheduling is not implementable on real hardware, we also present an algorithm that transforms a fluid schedule into a non-fluid one. We also show through experimental evaluation that the designed algorithms outperform existing scheduling algorithms in terms of their ability to schedule a variety of task systems.
03/09/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2759765
Mapping and Scheduling Mixed-Criticality Systems with On-Demand Redundancy
https://www.computer.org/csdl/trans/tc/2018/04/08066347-abs.html
Embedded systems in several domains, such as avionics and automotive, are subject to inspection by certification authorities. These authorities are interested in verifying the safety-critical aspects of a system and typically do not certify non-critical parts. The design of such Mixed-Criticality Systems (MCS) has received increasing attention in recent years. However, although MCS must be designed to overcome transient faults, their susceptibility to such faults is often overlooked. In this paper, we consider the problem of mapping and scheduling efficient, certifiable MCS that can survive transient faults. We generalize previous MCS models and analyses to support On-Demand Redundancy (ODR). A task-set transformation is proposed to generate a modified task set that supports various forms of ODR while satisfying reliability and certification requirements. The analysis is incorporated into a design-space exploration algorithm that supports a wide range of fault-tolerance mechanisms and heterogeneous platforms. Experiments show that ODR can improve the Quality of Service (QoS) provided to non-critical tasks by 29 percent on average, compared to lockstep execution. Moreover, combining several fault-tolerance mechanisms can lead to additional improvements in schedulability and QoS.
03/09/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2762293
Utilization-Based Scheduling of Flexible Mixed-Criticality Real-Time Tasks
https://www.computer.org/csdl/trans/tc/2018/04/08068215-abs.html
Mixed-criticality models are an emerging paradigm for the design of real-time systems because of their significantly improved resource efficiency. However, formal mixed-criticality models have traditionally been characterized by two impractical assumptions: once <italic>any</italic> high-criticality task overruns, <italic>all</italic> low-criticality tasks are suspended and <italic>all other</italic> high-criticality tasks are assumed to exhibit high-criticality behaviors at the same time. In this paper, we propose a more realistic mixed-criticality model, called the flexible mixed-criticality (FMC) model, in which these two issues are addressed in a combined manner. In this new model, only the overrunning task itself is assumed to exhibit high-criticality behavior, while the other high-criticality tasks remain in the same mode as before. The guaranteed service levels of low-criticality tasks are gracefully degraded with the overruns of high-criticality tasks. We derive a utilization-based technique to analyze the schedulability of this new mixed-criticality model under EDF-VD scheduling. At run time, the proposed test condition serves as an important criterion for dynamic service-level tuning, by means of which the maximum available execution budget for low-criticality tasks can be determined directly with minimal overhead while guaranteeing mixed-criticality schedulability. Experiments demonstrate the effectiveness of the FMC scheme compared with state-of-the-art techniques.
03/09/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2763133
Reliability Optimization on Multi-Core Systems with Multi-Tasking and Redundant Multi-Threading
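For context on the kind of utilization-based condition the FMC analysis generalizes, the classical uniprocessor EDF-VD test fits in a few lines. This sketches the standard textbook test, not the paper's FMC condition:

```python
def edf_vd_schedulable(u_lo_lo, u_hi_lo, u_hi_hi):
    """Classical EDF-VD schedulability test on a uniprocessor.
    u_lo_lo: total LO-mode utilization of LO-criticality tasks,
    u_hi_lo: total LO-mode utilization of HI-criticality tasks,
    u_hi_hi: total HI-mode utilization of HI-criticality tasks."""
    if u_lo_lo + u_hi_hi <= 1.0:
        return True                      # plain EDF already suffices
    if u_lo_lo >= 1.0:
        return False
    x = u_hi_lo / (1.0 - u_lo_lo)        # virtual-deadline scaling factor
    return x < 1.0 and x * u_lo_lo + u_hi_hi <= 1.0
```

The FMC model replaces the global mode switch implicit in this test with per-task overruns and graceful degradation of LO-task budgets.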
https://www.computer.org/csdl/trans/tc/2018/04/08094023-abs.html
Using Redundant Multithreading (RMT) for error detection and recovery is a prominent technique to mitigate soft-error effects in multi-core systems. Simultaneous Redundant Threading (SRT) on the same core or Chip-level Redundant Multithreading (CRT) on different cores can be adopted to implement RMT. However, only a few previously proposed approaches use adaptive CRT management at the system level, and none of them considers both SRT and CRT at the task level. In this paper, we propose using a combination of SRT and CRT, called Mixed Redundant Threading (MRT), as an additional option at the task level. In our coarse-grained approach, we consider SRT, CRT, and MRT at the system level simultaneously, whereas existing results apply either SRT or CRT at the system level, but not both. In addition, we consider further fine-grained task-level optimizations to improve system reliability under hard real-time constraints. To optimize system reliability, we develop several dynamic programming approaches to select the redundancy levels under Federated Scheduling. The simulation results illustrate that our approaches can significantly improve system reliability compared to state-of-the-art techniques.
03/09/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2769044
Efficient Scheduling for Multi-Block Updates in Erasure Coding Based Storage Systems
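The flavor of a dynamic program that selects per-task redundancy levels (e.g., none / SRT / CRT / MRT) can be sketched as a knapsack-style selection: each option has a processor-time cost and a log-reliability gain, subject to a shared budget. This is a hypothetical toy, not the paper's actual formulation under Federated Scheduling:

```python
def select_redundancy(task_options, budget):
    """task_options: per task, a list of (integer time cost, log-reliability)
    pairs, one per redundancy level.  Returns the best total
    log-reliability achievable within the time budget."""
    best = {0: 0.0}                      # time used -> best log-reliability
    for options in task_options:
        nxt = {}
        for used, logrel in best.items():
            for cost, gain in options:
                t = used + cost
                if t <= budget and logrel + gain > nxt.get(t, float("-inf")):
                    nxt[t] = logrel + gain
        best = nxt
    return max(best.values()) if best else None
```

Working in log-reliability makes the per-task gains additive, which is what allows the table-based DP.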
https://www.computer.org/csdl/trans/tc/2018/04/08094270-abs.html
This paper considers how to reduce the I/O overhead of data update operations in erasure-coding-based storage systems. To this end, we first analyze the I/O overhead of update operations under current update approaches. We find that the key to reducing this I/O overhead is a scheduling algorithm that constructs the sequence of update operations. Such an algorithm needs to execute within a time limit, since update requests work under a stringent latency constraint. To quickly schedule the order of update operations, we propose an efficient algorithm, named UCODR. Our theoretical analysis verifies that UCODR can effectively reduce the I/O overhead of update operations when multiple blocks are updated. To further confirm its effectiveness, we implement a prototype storage system to deploy UCODR with different erasure codes. Extensive experiments are conducted on the prototype with real-world traces. The experimental results show that UCODR can reduce the time of update operations by up to 35 percent and improve the throughput of the storage system by up to 67 percent, compared with state-of-the-art update approaches.
03/09/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2769051
Static Instruction Scheduling for High Performance on Limited Hardware
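The I/O cost that update schedulers such as UCODR optimize comes from read-modify-write parity updates. For the simplest erasure code, a single XOR parity, a delta-based block update looks like this (illustrative only; the paper targets general erasure codes):

```python
def xor_blocks(a, b):
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def update_block(parity, old_block, new_block):
    """Delta-based update: instead of re-reading every surviving data
    block, read only the old value of the updated block, fold the delta
    into the parity, and write the new block and new parity back.
    That is 2 reads + 2 writes regardless of stripe width."""
    delta = xor_blocks(old_block, new_block)
    return xor_blocks(parity, delta)
```

When multiple blocks in a stripe change, the order in which such deltas are read and applied determines how many disk seeks are shared, which is the scheduling freedom a multi-block update scheduler exploits.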
https://www.computer.org/csdl/trans/tc/2018/04/08094900-abs.html
Complex out-of-order (OoO) processors have been designed to overcome the restrictions of outstanding long-latency misses, at the cost of increased energy consumption. Simple, limited OoO processors are a compromise between energy consumption and performance, as they have fewer hardware resources to tolerate the penalties of long-latency loads. In the worst case, these loads may stall the processor entirely. We present Clairvoyance, a compiler-based technique that generates code able to hide memory latency and better utilize simple OoO processors. By clustering loads found across basic-block boundaries, Clairvoyance overlaps the outstanding latencies to increase memory-level parallelism. We show that these simple OoO processors, equipped with the appropriate compiler support, can effectively hide long-latency loads and achieve performance improvements for memory-bound applications. To this end, Clairvoyance tackles (i) statically unknown dependencies, (ii) insufficient independent instructions, and (iii) register pressure. Clairvoyance achieves a geomean execution-time improvement of 14 percent for memory-bound applications, on top of standard O3 optimizations, while maintaining the high performance of compute-bound applications.
03/09/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2769641
Thermal-Aware Application Mapping Strategy for Network-on-Chip Based System Design
https://www.computer.org/csdl/trans/tc/2018/04/08096996-abs.html
Rapid progress in technology scaling makes transistors smaller and faster over successive generations, and consequently the core count in a system increases. However, transistor power consumption no longer scales commensurately. Increased power density calls for better thermal safety of multi-core systems, in which a flexible and scalable packet-switched architecture, Network-on-Chip (NoC), is commonly used for communication among the cores. This paper proposes a strategy to increase the thermal safety of NoC-based systems via a graceful trade-off in communication cost, together with an Integer Linear Programming (ILP) formulation of the problem. To overcome the huge computational overhead of ILP, another solution strategy, based on the meta-heuristic Particle Swarm Optimization (PSO), is also proposed. Several innovative augmentations are introduced into the basic PSO to generate better-quality solutions. A thermal-aware mapping heuristic is proposed to generate some intelligent solutions, which become part of the initial population of the PSO. A trade-off is established between communication cost and peak temperature of the die. Experiments on big-data and graph-analytics workloads are reported; the results obtained are better than those of many contemporary approaches reported in the literature.
03/09/2018 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2770130
Simultaneous and Speculative Thread Migration for Improving Energy Efficiency of Heterogeneous Core Architectures
https://www.computer.org/csdl/trans/tc/2018/04/08097407-abs.html
This paper proposes a microarchitectural mechanism to minimize the latency of thread migration for a tightly-coupled heterogeneous core that has two execution backends (e.g., in-order and out-of-order execution pipelines). The proposed mechanism examines the dependencies between all in-flight instructions residing in one of the backend pipelines and allows both pipelines to perform instruction execution simultaneously. At the microarchitectural level, instruction dispatch and execution are performed seamlessly across a thread migration; this simultaneous backend execution can therefore accelerate program execution in a way that existing migration mechanisms cannot. Accelerating thread migration increases overall performance with low power overhead, providing high energy efficiency. Compared to a baseline heterogeneous core with an existing migration mechanism, simultaneous backend execution reduces total execution cycles by 8.2 percent and consumes 2.9 percent less total energy on average across the SPEC CPU2006 benchmarks, which results in a 10.9 percent improvement in energy efficiency in terms of the energy-delay product.
03/09/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2770126
Selective I/O Bypass and Load Balancing Method for Write-Through SSD Caching in Big Data Analytics
https://www.computer.org/csdl/trans/tc/2018/04/08100898-abs.html
Fast network-quality analysis in the telecom industry is an important means of providing quality service. SK Telecom, based in South Korea, built a Hadoop-based analytical system consisting of a hundred nodes, each of which contains only hard disk drives (HDDs). Because the analysis process is a set of parallel I/O-intensive jobs, adding solid state drives (SSDs) with appropriate settings is the most cost-efficient way to improve performance, as shown in previous studies. Therefore, we decided to configure SSDs as a write-through cache instead of increasing the number of HDDs. To improve the cost-per-performance of the SSD cache, we introduced a selective I/O bypass (SIB) method that redirects an automatically calculated number of read I/O requests from the SSD cache to idle HDDs when the SSDs are I/O over-saturated, i.e., when their disk utilization exceeds 100 percent. To calculate disk utilization precisely, we also introduced a combinational approach for SSDs, since the method currently used for HDDs cannot be applied to SSDs owing to their internal parallelism. In our experiments, the proposed approach achieved up to 2x faster performance than other approaches.
03/09/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2771491
Spectral Features of Higher-Order Side-Channel Countermeasures
https://www.computer.org/csdl/trans/tc/2018/04/08103813-abs.html
This brief deals with the problem of mathematically formalizing hardware circuits’ vulnerability to side-channel attacks. We investigate whether spectral analysis is a useful analytical tool for this purpose by building a mathematically sound theory of the vulnerability phenomenon. This research was originally motivated by the need for deeper, more formal knowledge around vulnerable nonlinear circuits. However, while building this new theoretical framework, we discovered that it can consistently integrate known results about linear ones as well. Eventually, we found it adequate to formally model side-channel leakage in several significant scenarios. In particular, we have been able to find the vulnerability perimeter of a known cryptographic primitive (i.e., Keccak <xref ref-type="bibr" rid="ref1">[1]</xref> ) and thus tackle the analysis of vulnerability when signal glitches are present. We believe the conceptual framework we propose will be useful for researchers and practitioners in the field of applied cryptography and side-channel attacks.
03/09/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2772231
Dynamic Scheduling with Service Curve for QoS Guarantee of Large-Scale Cloud Storage
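In this setting, "spectral analysis" of a Boolean function usually means its Walsh-Hadamard spectrum: a flat spectrum indicates high nonlinearity, while large coefficients expose linear structure that leakage models can exploit. A minimal generic transform (textbook material, not the brief's specific framework):

```python
def walsh_spectrum(f):
    """Walsh-Hadamard spectrum of an n-input Boolean function given as
    a truth table (list of 0/1 of length 2**n), via the in-place fast
    Walsh-Hadamard transform of the (-1)^f(x) sign vector."""
    W = [(-1) ** bit for bit in f]
    h = 1
    while h < len(W):
        for i in range(0, len(W), 2 * h):
            for j in range(i, i + h):
                W[j], W[j + h] = W[j] + W[j + h], W[j] - W[j + h]
        h *= 2
    return W
```

For the 2-input AND gate the spectrum is [2, 2, 2, -2]: every coefficient is nonzero, reflecting AND's correlation with all linear functions of its inputs.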
https://www.computer.org/csdl/trans/tc/2018/04/08107532-abs.html
With the growing popularity of cloud storage, more and more applications with diverse service level agreements (SLAs) are being accommodated in it. Quality of service (QoS) support for applications in a shared cloud storage system therefore becomes important. However, performance isolation, diverse performance requirements (especially strict latency guarantees), and high system utilization are all challenging yet desirable goals for QoS design. In this paper, we propose a service-curve-based QoS algorithm that supports latency-guarantee applications, IOPS-guarantee applications, and best-effort applications on the same storage system, which not only provides a QoS guarantee for applications but also pursues better system utilization. Three priority queues are exploited, and different service curves are applied to different types of applications. I/O requests from different applications are scheduled and dispatched among the three queues according to their service curves and I/O urgency status, so that the QoS requirements of all applications can be guaranteed on the shared storage system. Our experimental results show that our algorithm not only simultaneously guarantees the QoS targets of latency and throughput (IOPS) but also improves the utilization of storage resources.
03/09/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2773511
On Practical Discrete Gaussian Samplers for Lattice-Based Cryptography
https://www.computer.org/csdl/trans/tc/2018/03/07792671-abs.html
Lattice-based cryptography is one of the most promising branches of quantum-resilient cryptography, offering versatility and efficiency. Discrete Gaussian samplers are a core building block in most, if not all, lattice-based cryptosystems, and optimised samplers are desirable for both high-speed and low-area applications. Due to the inherent structure of existing discrete Gaussian sampling methods, lattice-based cryptosystems are vulnerable to side-channel attacks, such as timing analysis. In this paper, the first comprehensive evaluation of discrete Gaussian samplers in hardware is presented, targeting FPGA devices. Novel optimised discrete Gaussian sampler hardware architectures are proposed for the main sampling techniques. An independent-time design of each of the samplers is presented, offering security against side-channel timing attacks, including the first proposed constant-time Bernoulli, Knuth-Yao, and discrete Ziggurat sampler hardware designs. For a balanced performance, the Cumulative Distribution Table (CDT) sampler is recommended, with the proposed hardware CDT design achieving a throughput of 59.4 million samples per second for encryption, utilising just 43 slices on a Virtex 6 FPGA, and 16.3 million samples per second for signatures with 179 slices on a Spartan 6 device.
02/08/2018 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2016.2642962
Hardware/Software Co-Design of an Accelerator for FV Homomorphic Encryption Scheme Using Karatsuba Algorithm
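A CDT sampler tabulates the cumulative distribution; the timing-safe variants scan the entire table instead of binary-searching it, so the work performed never depends on the sampled value. A toy floating-point sketch of that idea (real designs use fixed-point tables and hardware randomness, and this is not the paper's architecture):

```python
import math

def build_cdt(sigma, tail=10):
    """Cumulative distribution table of a discrete Gaussian over the
    integers in [-tail*sigma, tail*sigma]."""
    bound = int(math.ceil(tail * sigma))
    support = list(range(-bound, bound + 1))
    probs = [math.exp(-z * z / (2.0 * sigma * sigma)) for z in support]
    total = sum(probs)
    cdt, acc = [], 0.0
    for p in probs:
        acc += p / total
        cdt.append(acc)
    return support, cdt

def sample(support, cdt, u):
    """Map a uniform u in [0,1) to a sample by scanning the *entire*
    table, so the number of comparisons never depends on the result
    (the timing countermeasure the constant-time designs implement)."""
    idx = 0
    for c in cdt:
        idx += (u >= c)                  # branch-free accumulate
    return support[min(idx, len(support) - 1)]
```

A binary search over the same table is faster but leaks the sampled value through its data-dependent access pattern and latency, which is exactly the vulnerability the paper's independent-time designs remove.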
https://www.computer.org/csdl/trans/tc/2018/03/07797469-abs.html
Somewhat Homomorphic Encryption (SHE) schemes allow operations to be carried out on data in the cipher domain. In a cloud computing scenario, personal information can thus be processed secretly, ensuring a high level of confidentiality. For many years, practical parameters of SHE schemes were overestimated, leading designers to consider only the FFT algorithm to accelerate SHE in hardware. Nevertheless, recent work demonstrates that parameters can be lowered without compromising security <xref ref-type="bibr" rid="ref1">[1]</xref> . Following this trend, this work investigates the benefits of using the Karatsuba algorithm instead of the FFT for the Fan-Vercauteren (FV) homomorphic encryption scheme. The proposed accelerator relies on a hardware/software co-design approach and is designed to perform fast arithmetic operations on degree-2,560 polynomials with 135-bit coefficients, allowing small algorithms to be computed homomorphically. Compared to a functionally equivalent design using the FFT, our accelerator performs a homomorphic multiplication in 11.9 ms instead of 15.46 ms, and halves the logic and register utilization on the FPGA.
02/08/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2016.2645204
Hardware-Based Trusted Computing Architectures for Isolation and Attestation
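Karatsuba replaces the four half-size products of schoolbook polynomial multiplication with three, at the cost of extra additions. A plain recursive sketch over integer coefficients (the accelerator itself works with large modular coefficients in hardware):

```python
def poly_add(a, b):
    n = max(len(a), len(b))
    a = a + [0] * (n - len(a))
    b = b + [0] * (n - len(b))
    return [x + y for x, y in zip(a, b)]

def poly_sub(a, b):
    return poly_add(a, [-x for x in b])

def karatsuba(a, b):
    """Multiply coefficient lists (lowest degree first) using three
    recursive half-size products: a*b = z0 + x^m*(z1-z0-z2) + x^2m*z2."""
    n = max(len(a), len(b))
    if n == 1:
        return [a[0] * b[0]]
    a = a + [0] * (n - len(a))
    b = b + [0] * (n - len(b))
    m = n // 2
    a0, a1, b0, b1 = a[:m], a[m:], b[:m], b[m:]
    z0 = karatsuba(a0, b0)                              # low  * low
    z2 = karatsuba(a1, b1)                              # high * high
    z1 = karatsuba(poly_add(a0, a1), poly_add(b0, b1))  # (a0+a1)(b0+b1)
    mid = poly_sub(poly_sub(z1, z0), z2)                # cross terms only
    out = [0] * (2 * n - 1)
    for i, c in enumerate(z0):
        out[i] += c
    for i, c in enumerate(mid):
        out[i + m] += c
    for i, c in enumerate(z2):
        out[i + 2 * m] += c
    return out
```

The recursion gives O(n^1.585) coefficient multiplications instead of O(n^2), and unlike the FFT it needs no special root-of-unity structure in the coefficient ring, which is part of the trade-off the accelerator exploits.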
https://www.computer.org/csdl/trans/tc/2018/03/07807249-abs.html
Attackers target many different types of computer systems in use today, exploiting software vulnerabilities to take over the device and make it act maliciously. Reports of numerous attacks have been published, against the constrained embedded devices of the Internet of Things, mobile devices like smartphones and tablets, high-performance desktop and server environments, as well as complex industrial control systems. Trusted computing architectures give users and remote parties, such as software vendors, guarantees about the behaviour of the software they run, protecting them against software-level attackers. This paper defines the security properties offered by such architectures and presents detailed descriptions of twelve hardware-based attestation and isolation architectures from academia and industry. We compare all twelve designs with respect to the security properties and architectural features they offer. The presented architectures have been designed for a wide range of devices, supporting different security properties.
02/08/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2647955
Bitstream Fault Injections (BiFI)–Automated Fault Attacks Against SRAM-Based FPGAs
https://www.computer.org/csdl/trans/tc/2018/03/07809042-abs.html
This contribution is concerned with the question of whether an adversary can automatically manipulate an unknown FPGA bitstream realizing a cryptographic primitive such that the underlying secret key is revealed. In general, if an attacker has full knowledge of the bitstream structure and can make changes to the target FPGA design, she can alter the bitstream so as to enable key recovery. However, this requires challenging reverse-engineering steps in practice. We argue that this is a major reason why bitstream fault injection attacks have been largely neglected in the past. In this paper, we show that malicious bitstream modifications are i) much easier to conduct than commonly assumed and ii) surprisingly powerful. We introduce a novel class of bitstream fault injection (BiFI) attacks which does <italic>not</italic> require any reverse-engineering. Our attacks can be mounted automatically without any detailed knowledge of either the bitstream format or the design of the crypto primitive being attacked. Bitstream encryption features do not necessarily prevent our attack if the integrity of the encrypted bitstream is not carefully checked. We have successfully verified the feasibility of our attacks in practice by considering several publicly available AES designs. As target platforms, we have conducted our experiments on Spartan-6 and Virtex-5 Xilinx FPGAs.
02/08/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2016.2646367
Hybrid Obfuscation to Protect Against Disclosure Attacks on Embedded Microprocessors
https://www.computer.org/csdl/trans/tc/2018/03/07809080-abs.html
The risk of code reverse-engineering is particularly acute for embedded processors, which often have limited resources available to protect program information. Previous efforts involving code obfuscation provide some additional security against reverse-engineering of programs, but the security benefits are typically limited and not quantifiable. Hence, new approaches to code protection and the creation of associated metrics are highly desirable. This paper makes two main contributions: we propose the first hybrid diversification approach for protecting embedded software, and we provide statistical metrics to evaluate the protection. Diversification is achieved by combining hardware obfuscation at the microarchitecture level with software-level obfuscation techniques tailored to embedded systems. Both measures are based on a compiler that generates obfuscated programs and an embedded processor implemented in an FPGA with a randomized Instruction Set Architecture (ISA) encoding to execute the hybrid obfuscated program. We employ a fine-grained, hardware-enforced access control mechanism for information exchange with the processor and hardware-assisted booby traps to actively counteract manipulation attacks. It is shown that our approach is effective against a wide variety of possible information-disclosure attacks in the case of a physically present adversary. Moreover, we propose a novel statistical evaluation methodology that provides a security metric for hybrid-obfuscated programs.
02/08/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2649520
GliFreD: Glitch-Free Duplication Towards Power-Equalized Circuits on FPGAs
https://www.computer.org/csdl/trans/tc/2018/03/07827086-abs.html
Designers of secure hardware are required to harden their implementations against physical threats, such as power-analysis attacks. In particular, cryptographic hardware circuits need to decorrelate their current consumption from the (secret) data being processed. A common technique to achieve this goal is the use of special logic styles that aim at equalizing the current consumption at each single processing step. However, since hiding techniques like Dual-Rail Precharge (DRP) were originally developed for ASICs, deploying such countermeasures on FPGA devices, with their fixed and predefined logic structure, poses a particular challenge. In this work, we propose and practically evaluate a new DRP scheme (GliFreD) designed exclusively for FPGA platforms. GliFreD overcomes the well-known early-propagation issue, prevents glitches, uses an isolated dual-rail concept, and mitigates imbalanced routings. With all these features, GliFreD significantly exceeds the level of physical security achieved by any previously reported, related countermeasure for FPGAs.
02/08/2018 2:02 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2651829
A Multiplexer-Based Arbiter PUF Composition with Enhanced Reliability and Security
https://www.computer.org/csdl/trans/tc/2018/03/08025790-abs.html
Arbiter Physically Unclonable Functions (APUFs), while relatively lightweight, are extremely vulnerable to modeling attacks. Hence, various compositions of APUFs, such as the XOR APUF and the Lightweight Secure PUF, have been proposed as secure alternatives. Previous research has demonstrated that PUF compositions face two major challenges: vulnerability to modeling and statistical attacks, and lack of reliability. In this paper, we introduce a multiplexer-based composition of APUFs, denoted MPUF, to overcome both challenges simultaneously. In addition to the basic MPUF design, we propose two MPUF variants, namely cMPUF and rMPUF, to improve robustness against cryptanalysis and reliability-based modeling attacks, respectively. The rMPUF demonstrates enhanced robustness against the reliability-based modeling attack, whereas even the well-known XOR APUF, otherwise robust to machine-learning-based modeling attacks, has been modeled with that technique in linear data and time complexity. The rMPUF provides a good trade-off between security and hardware overhead while maintaining a significantly higher reliability level than any practical XOR APUF instance. Moreover, the MPUF variants are, to the best of our knowledge, the first APUF compositions that can achieve the Strict Avalanche Criterion without using any additional input network (or hardware) for challenge transformation. Finally, we validate our theoretical findings using Matlab-based simulations of MPUFs.
02/08/2018 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2749226
Achieving Load Balance for Parallel Data Access on Distributed File Systems
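Under the standard additive-delay model, an APUF is a signed linear function of a parity-transformed challenge, and a multiplexer-based composition lets selector APUFs choose which data APUF answers. A toy simulation under these textbook assumptions (a generic MUX composition, not the paper's exact MPUF/cMPUF/rMPUF constructions):

```python
import random

def make_apuf(n, rng):
    """Additive delay model: an n-stage APUF is a weight vector w,
    responding with sign(w . phi(challenge))."""
    return [rng.gauss(0.0, 1.0) for _ in range(n + 1)]

def apuf_response(w, challenge):
    # standard parity feature map of the challenge bits
    phi, prod = [], 1
    for c in reversed(challenge):
        prod *= 1 - 2 * c
        phi.append(prod)
    phi = phi[::-1] + [1]
    return 1 if sum(wi * pi for wi, pi in zip(w, phi)) > 0 else 0

def mpuf_response(data_pufs, sel_pufs, challenge):
    """k selector APUFs drive the multiplexer's select lines; the MUX
    forwards the chosen data APUF's response (needs 2**k data APUFs)."""
    sel = 0
    for w in sel_pufs:
        sel = (sel << 1) | apuf_response(w, challenge)
    return apuf_response(data_pufs[sel], challenge)
```

Because the selector outputs are themselves challenge-dependent and hidden, an attacker cannot tell which data APUF produced a given response, which is the obstacle the composition places in front of modeling attacks.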
https://www.computer.org/csdl/trans/tc/2018/03/08027054-abs.html
The distributed file system HDFS is widely deployed as the bedrock for many parallel big-data analyses. However, when multiple parallel applications run over the shared file system, the data requests from different processes/executors are unfortunately served in a surprisingly imbalanced fashion on the distributed storage servers. These imbalanced access patterns among storage nodes arise for two reasons: (a) unlike conventional parallel file systems, which use striping policies to distribute data evenly among storage nodes, data-intensive file systems such as HDFS store each data unit, referred to as a chunk file, in several copies under a relatively random policy, which can result in an uneven data distribution among storage nodes; and (b) under the data retrieval policy of HDFS, the more data a storage node contains, the higher the probability that it is selected to serve the data. Therefore, on nodes serving multiple chunk files, the data requests from different processes/executors compete for shared resources such as the <italic>hard disk head</italic> and <italic>network bandwidth</italic>, resulting in degraded I/O performance. In this paper, we first conduct a complete analysis of how remote and imbalanced read/write patterns occur and how they are affected by the size of the cluster. We then propose novel methods, referred to as Opass, to optimize parallel data reads and to reduce the imbalance of parallel writes on distributed file systems. Our proposed methods can benefit parallel data-intensive analysis with various parallel data access strategies. Opass adopts new matching-based algorithms to match processes to data so as to achieve the maximum degree of data locality and balanced data access. Furthermore, to reduce the imbalance of parallel writes, Opass employs a heatmap to monitor the I/O status of storage nodes and applies an HM-LRU policy to select a locally optimal storage node for serving write requests.
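The intuition behind matching processes to replica holders can be shown with a greedy load-aware assignment: each process reads from the replica-holding node with the lightest current load. This is a hypothetical simplification for illustration, not Opass's actual matching-based algorithm:

```python
def assign_reads(process_replicas):
    """process_replicas: maps each process to the ordered list of nodes
    holding a replica of its chunk.  Greedily send each process to its
    least-loaded replica holder, keeping reads local *and* balanced."""
    load, assignment = {}, {}
    for proc, nodes in process_replicas.items():
        target = min(nodes, key=lambda n: load.get(n, 0))
        assignment[proc] = target
        load[target] = load.get(target, 0) + 1
    return assignment, load
```

A maximum-matching formulation (as in Opass) improves on this greedy sketch by optimizing all assignments jointly rather than one process at a time.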
Experiments are conducted on PRObE’s Marmot 128-node cluster testbed, and the results from both benchmarks and well-known parallel applications show the performance benefits and scalability of Opass.02/08/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2749229Randomized Mixed-Radix Scalar Multiplication
https://www.computer.org/csdl/trans/tc/2018/03/08031048-abs.html
A set of congruence relations is a <inline-formula><tex-math notation="LaTeX">$\mathbb {Z}$</tex-math><alternatives> <inline-graphic xlink:href="imbert-ieq1-2750677.gif"/></alternatives></inline-formula>-covering if each integer belongs to at least one congruence class from that set. In this paper, we first show that most existing scalar multiplication algorithms can be formulated in terms of covering systems of congruences. Then, using a special form of covering systems called exact <inline-formula><tex-math notation="LaTeX">$n$</tex-math><alternatives> <inline-graphic xlink:href="imbert-ieq2-2750677.gif"/></alternatives></inline-formula>-covers, we present a novel uniformly randomized scalar multiplication algorithm with built-in protections against most passive side-channel attacks. Our algorithm randomizes the addition chain using a mixed-radix representation of the scalar. Its reduced overhead and purposeful robustness could make it a sound replacement for several conventional countermeasures. In particular, it is significantly faster than Coron's scalar blinding technique for elliptic curves when the choice of a particular finite field tailored for speed compels doubling the size of the scalar, and hence the cost of the scalar multiplication.02/08/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2750677Towards the Design of Efficient and Consistent Index Structure with Minimal Write Activities for Non-Volatile Memory
https://www.computer.org/csdl/trans/tc/2018/03/08046077-abs.html
Index structures can significantly accelerate data retrieval operations in data-intensive systems, such as databases. Tree structures, such as the B<inline-formula><tex-math notation="LaTeX">$^{+}$</tex-math><alternatives> <inline-graphic xlink:href="zhuge-ieq1-2754381.gif"/></alternatives></inline-formula>-tree and its variants, are commonly employed as index structures; however, we found that the tree structure may not be appropriate for Non-Volatile Memory (NVM) in terms of the requirements for high performance and high endurance. This paper studies what the best index structure for NVM-based systems is and how to design such an index structure. The design of an NVM-friendly index structure faces several challenges. <italic>First</italic>, in order to prolong the lifetime of NVM, the write activities on NVM should be minimized. To this end, the index structure should be as simple as possible. The index proposed in this paper is based on the simplest data structure, i.e., the linked list. <italic>Second</italic>, the simple structure makes it challenging to achieve high-performance data retrieval operations. To overcome this challenge, we design a novel technique that explicitly builds a contiguous virtual address space on top of the linked list, such that efficient search algorithms can be performed. <italic>Third</italic>, we need to carefully consider data consistency issues in NVM-based systems, because the order of memory writes may be changed and the data content in NVM may become inconsistent due to write-back effects of the CPU cache. This paper devises a novel indexing scheme, called “<bold>V</bold>irtual <bold>L</bold>inear <bold>A</bold>ddressable <bold>B</bold>uckets” (VLAB). We implement VLAB in a storage engine and plug it into MySQL. Evaluations are conducted on an NVDIMM workstation using YCSB workloads and real-world traces. 
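The idea of a contiguous virtual address space layered over a linked list can be illustrated with a toy sketch (the names and structure are assumptions for illustration, not the paper's VLAB design): keep entries in a sorted singly linked list for write-light updates, and maintain a side array of node references, the "virtual addresses", so that ordinary binary search applies.

```python
import bisect

class Node:
    __slots__ = ("key", "value", "next")
    def __init__(self, key, value):
        self.key, self.value, self.next = key, value, None

class VirtuallyAddressedList:
    """Sorted singly linked list plus an array of node references.
    The array gives every node a contiguous 'virtual address', so lookups
    can binary-search the array instead of walking the list. Rebuilding
    the key list on each call keeps this sketch short, not fast."""
    def __init__(self):
        self.head = None
        self.addr = []   # addr[i] is the node at virtual address i

    def insert(self, key, value):
        node = Node(key, value)
        i = bisect.bisect_left([n.key for n in self.addr], key)
        prev = self.addr[i - 1] if i > 0 else None
        node.next = prev.next if prev else self.head
        if prev:
            prev.next = node
        else:
            self.head = node
        self.addr.insert(i, node)        # keep virtual addresses contiguous

    def search(self, key):
        i = bisect.bisect_left([n.key for n in self.addr], key)
        if i < len(self.addr) and self.addr[i].key == key:
            return self.addr[i].value
        return None

v = VirtuallyAddressedList()
for k, val in [(5, "e"), (3, "c"), (4, "d")]:
    v.insert(k, val)
# search() binary-searches the virtual address array, not the list itself.
```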
Results show that the write activities of state-of-the-art indexes are 6.98 times those of ours; meanwhile, VLAB achieves a 2.53 times speedup.02/08/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2754381Digit Serial Methods with Applications to Division and Square Root
https://www.computer.org/csdl/trans/tc/2018/03/08060979-abs.html
We present a generic digit serial method (DSM) to compute the digits of a real number <inline-formula> <tex-math notation="LaTeX">$V$</tex-math><alternatives><inline-graphic xlink:href="ferguson-ieq1-2759764.gif"/> </alternatives></inline-formula>. Bounds on these digits, and on the errors in the associated estimates of <inline-formula><tex-math notation="LaTeX">$V$</tex-math><alternatives> <inline-graphic xlink:href="ferguson-ieq2-2759764.gif"/></alternatives></inline-formula> formed from these digits, are derived. To illustrate our results, we derive such bounds for a parameterized family of high-radix algorithms for division and square root. These bounds enable a DSM designer to determine, for example, whether a given choice of parameters allows rapid formation and rounding of its approximation to <inline-formula><tex-math notation="LaTeX">$V$ </tex-math><alternatives><inline-graphic xlink:href="ferguson-ieq3-2759764.gif"/></alternatives></inline-formula>.02/08/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2759764Special Section on Secure Computer Architectures
https://www.computer.org/csdl/trans/tc/2018/03/08287086-abs.html
02/08/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2788658Start Simple and then Refine: Bias-Variance Decomposition as a Diagnosis Tool for Leakage Profiling
https://www.computer.org/csdl/trans/tc/2018/02/07990260-abs.html
Evaluating the resistance of cryptosystems to side-channel attacks is an important research challenge. Profiled attacks reveal the degree of resilience of a cryptographic device when an adversary examines its physical characteristics. So far, evaluation laboratories have launched several physical attacks (based on engineering intuition) in order to find one strategy that eventually extracts secret information (such as a secret cryptographic key). The certification step is a complex task because, in practice, the evaluators face tight memory and time constraints. In this paper, we propose a principled way of guiding the design of the most successful evaluation strategies, thanks to the (bias-variance) decomposition of a security metric of profiled attacks. Our results show that we can successfully apply our framework to unprotected and protected algorithms implemented in software and hardware.01/12/2018 2:07 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2731342On-Chip Fault Monitoring Using Self-Reconfiguring IEEE 1687 Networks
https://www.computer.org/csdl/trans/tc/2018/02/07990262-abs.html
Efficient handling of faults during operation is highly dependent on the interval (latency) from the time embedded monitoring instruments detect faults to the time when the fault manager localizes them. In this article, we propose a self-reconfiguring IEEE 1687 network in which all instruments that have detected faults are automatically included in the scan path, together with a hardware fault detection and localization module that detects the configuration of the network after self-reconfiguration and extracts the error codes reported by the fault monitoring instruments. To enable self-reconfiguration, we propose a modified segment insertion bit (SIB) compliant with IEEE 1687. We provide time analyses of fault detection and fault localization for single and multiple faults, and suggest how the self-reconfiguring IEEE 1687 network should be designed such that the time for fault detection and fault localization is kept low and deterministic. We show that, compared with previous schemes, our proposed network significantly reduces the fault localization time. For validation, we implemented a number of self-reconfiguring networks as well as their corresponding fault detection and localization modules in hardware, and performed post-layout simulations. We show that for a large number of instruments, implementing the fault detection and localization module in hardware results in less area than the corresponding software-based implementation. Another benefit of the hardware implementation over its software counterpart is that, to achieve the same fault detection and localization time, the hardware module can be clocked at a lower frequency than the core on which the corresponding software implementation would run.01/12/2018 2:08 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2731338Approximate DCT Image Compression Using Inexact Computing
https://www.computer.org/csdl/trans/tc/2018/02/07990539-abs.html
This paper proposes a new framework for digital image processing; it relies on inexact computing to address some of the challenges associated with discrete cosine transform (DCT) compression. The proposed framework has three levels of processing. The first level uses an approximate DCT for image compression, eliminating all computationally intensive floating-point multiplications and executing the DCT processing with integer additions and, in some cases, logical right/left shifts. The second level further reduces the amount of data (from the first level) that needs to be processed by filtering out those frequencies that cannot be detected by human senses. Finally, to reduce power consumption and delay, the third level introduces circuit-level inexact adders to compute the DCT. For assessment, a set of standardized images is compressed using the proposed three-level framework. Different figures of merit (such as energy consumption, delay, peak-signal-to-noise ratio, average difference, and absolute maximum difference) are compared to those of existing compression methods; an error analysis is also pursued, confirming the simulation results. Results show substantial reductions in energy and delay, while maintaining acceptable accuracy levels for image processing applications.01/12/2018 2:08 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2731770DuCNoC: A High-Throughput FPGA-Based NoC Simulator Using Dual-Clock Lightweight Router Micro-Architecture
https://www.computer.org/csdl/trans/tc/2018/02/08000664-abs.html
On-chip interconnections play an important role in multi-/many-processor systems-on-chip (MPSoCs). In order to achieve efficient optimization, each specific application must utilize a specific architecture, and consequently a specific interconnection network. For design space exploration and finding the best NoC solution for each specific application, a fast and flexible NoC simulator is necessary, especially for large design spaces. In this paper, we present an FPGA-based NoC co-simulator that can be configured via software. In our proposed NoC simulator, named <italic>DuCNoC</italic>, we implement a <italic>Dual-Clock</italic> router micro-architecture, which demonstrates a 75x<inline-formula><tex-math notation="LaTeX">$-$</tex-math><alternatives> <inline-graphic xlink:href="mardanikamali-ieq1-2735399.gif"/></alternatives></inline-formula>350x speed-up over BOOKSIM. Additionally, we implement a two-layer configurable global interconnection in our proposed architecture to (1) reduce virtualization time overhead, (2) make an efficient trade-off between the resource utilization and simulation time of the whole simulator, and especially (3) provide the capability of simulating irregular topologies. The migration of some important sub-modules, such as traffic generators (TGs) and traffic receptors (TRs), to the software side, and the implementation of dual-clock context switching in virtualization, are other major features of DuCNoC. 
Thanks to its dual-clock router micro-architecture, as well as the migration of TGs and TRs to the software side, DuCNoC can simulate a 100-node (10 <inline-formula><tex-math notation="LaTeX">$\times$</tex-math><alternatives> <inline-graphic xlink:href="mardanikamali-ieq2-2735399.gif"/></alternatives></inline-formula> 10) non-virtualized or a 2048-node virtualized mesh network on a Xilinx Zynq-7000.01/12/2018 2:07 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2735399Bubble Budgeting: Throughput Optimization for Dynamic Workloads by Exploiting Dark Cores in Many Core Systems
https://www.computer.org/csdl/trans/tc/2018/02/08006237-abs.html
Not all cores of a many-core chip can be active at the same time, due to reasons such as low CPU utilization in server systems and the limited power budget of the dark silicon era. These free cores (referred to as bubbles) can be placed near active cores for heat dissipation so that the active cores can run at a higher frequency level, boosting the performance of the applications that run on them. Budgeting inactive cores (bubbles) to applications to boost performance poses three challenges. First, the number of bubbles varies due to open workloads. Second, communication distance increases when a bubble is inserted between two communicating tasks (a task is a thread or process of a parallel application), leading to performance degradation. Third, budgeting too many bubbles as coolers to running applications leaves insufficient cores for future applications. In order to address these challenges, in this paper, a bubble budgeting scheme is proposed to budget free cores to each application so as to optimize the throughput of the whole system. The throughput of the system depends on the execution time of each application and the waiting time incurred by newly arrived applications. Essentially, the proposed algorithm determines the number and locations of bubbles to optimize the performance and waiting time of each application, after which the tasks of each application are mapped to a core region. A rollout algorithm is used to budget power to the cores as the last step. Experiments show that our approach achieves 50 percent higher throughput compared to state-of-the-art thermal-aware runtime task mapping approaches. The runtime overhead of the proposed algorithm is on the order of 1M cycles, making it an efficient runtime task management method for large-scale many-core systems.01/12/2018 2:08 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2735967Efficient Protection of the Register File in Soft-Processors Implemented on Xilinx FPGAs
https://www.computer.org/csdl/trans/tc/2018/02/08008792-abs.html
Soft-processors implemented on SRAM-based FPGAs are increasingly being adopted in on-board computing for space and avionics applications due to their flexibility and ease of integration. However, efficient component-level protection techniques against radiation-induced upsets are necessary for these processors, as system failures could otherwise manifest. The register file is one of the critical structures, storing vital information related to user computations and program execution. In this paper, we present a fault tolerance technique for the register file of a microprocessor implemented on Xilinx SRAM-based FPGAs. The proposed scheme leverages the inherent implementation redundancy created by the FPGA design automation tools when mapping the register file to on-chip distributed memory. Parity-based error detection and switching logic are added for fault masking against single-bit errors. The proposed scheme has been implemented and evaluated in lowRISC, a RISC-V ISA soft-processor implementation. The effectiveness of the proposed scheme was tested using fault injection. The fault masking overhead required in terms of FPGA resources was much lower than that of traditional Triple Modular Redundancy protection. Therefore, the proposed scheme is an interesting option for protecting the register file of soft processors implemented on Xilinx FPGAs.01/12/2018 2:07 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2737996Genetic Programming for Energy-Efficient and Energy-Scalable Approximate Feature Computation in Embedded Inference Systems
https://www.computer.org/csdl/trans/tc/2018/02/08008802-abs.html
With the increasing interest in deploying embedded sensors in a range of applications, there is also interest in deploying embedded inference capabilities. Doing so under the strict and often variable energy constraints of embedded platforms requires algorithmic, in addition to circuit and architectural, approaches to reducing energy. A broad approach that has recently received considerable attention in the context of inference systems is approximate computing. This stems from the observation that many inference systems exhibit various forms of tolerance to data noise. While some systems have demonstrated significant approximation-versus-energy knobs to exploit this, these have been applicable only to specific kernels and architectures; the more generally available knobs (e.g., voltage overscaling, bit-precision scaling) have been relatively weak, incurring large data noise for relatively modest energy savings. In this work, we explore the use of genetic programming (GP) to compute approximate features. Further, we leverage a method that enhances tolerance to feature-data noise through directed retraining of the inference stage. Previous work in GP has shown that it generalizes well to enable approximation of a broad range of computations, raising the potential for broad applicability of the proposed approach. The focus on feature extraction is deliberate because feature computations involve diverse, often highly nonlinear, operations, challenging the general applicability of energy-reducing approaches. We evaluate the proposed methodologies through two case studies, based on energy modeling of a custom low-power microprocessor with a classification accelerator. The first case study is on electroencephalogram-based seizure detection. 
We find that the choice of two primitive functions (square root, subtraction) out of seven possible primitive functions (addition, subtraction, multiplication, logarithm, exponential, square root, and square) enables us to approximate features in 0.41<inline-formula><tex-math notation="LaTeX">$mJ$</tex-math><alternatives> <inline-graphic xlink:href="lu-ieq1-2738642.gif"/></alternatives></inline-formula> per feature vector (FV), as compared to 4.79<inline-formula><tex-math notation="LaTeX">$mJ$</tex-math><alternatives> <inline-graphic xlink:href="lu-ieq2-2738642.gif"/></alternatives></inline-formula> per FV required for baseline feature extraction. This represents a feature extraction energy reduction of 11.68<inline-formula><tex-math notation="LaTeX"> $\times$</tex-math><alternatives><inline-graphic xlink:href="lu-ieq3-2738642.gif"/></alternatives></inline-formula>. The important system-level performance metrics for seizure detection are sensitivity, latency, and number of false alarms per hour. Our set of GP models achieves 100 percent sensitivity, 4.37 second latency, and 0.15 false alarms per hour. The baseline performance is 100 percent sensitivity, 3.84 second latency, and 0.06 false alarms per hour. The second case study is on electrocardiogram-based arrhythmia detection. In this case, just one primitive function (multiplication) suffices to approximate features in 1.13<inline-formula><tex-math notation="LaTeX">$\mu J$</tex-math> <alternatives><inline-graphic xlink:href="lu-ieq4-2738642.gif"/></alternatives></inline-formula> per FV, as compared to 11.69<inline-formula><tex-math notation="LaTeX">$\mu J$</tex-math><alternatives> <inline-graphic xlink:href="lu-ieq5-2738642.gif"/></alternatives></inline-formula> per FV required for baseline feature extraction. 
This represents a feature extraction energy reduction of 10.35<inline-formula><tex-math notation="LaTeX"> $\times$</tex-math><alternatives><inline-graphic xlink:href="lu-ieq6-2738642.gif"/></alternatives></inline-formula>. The important system-level metrics in this case are sensitivity, specificity, and accuracy. Our set of GP models achieves 81.17 percent sensitivity, 80.63 percent specificity, and 81.86 percent accuracy, whereas the baseline achieves 82.05 percent sensitivity, 88.12 percent specificity, and 87.92 percent accuracy. These case studies demonstrate the possibility of a significant reduction in feature extraction energy at the expense of a slight degradation in system performance.01/12/2018 2:07 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2738642Compact CA-Based Single Byte Error Correcting Codec
https://www.computer.org/csdl/trans/tc/2018/02/08010467-abs.html
Memory contents are usually corrupted by soft errors caused by external radiation, which reduces the reliability of memory systems. In order to enhance the reliability of memory systems, error correcting codes (ECC) are widely used to detect and correct errors. Single-bit error correcting, double-bit error detecting codes are generally used in memory systems. However, in the case of multiple-cell errors, these codes are unable to detect and correct the errors. Recently, single byte error correcting Reed-Solomon (SEC-RS) codes have been used to detect and correct single byte errors in memory systems. In this paper, a new single byte error correcting (SEC) code, termed CASEC, is proposed based on the concept of cellular automata. The main aim of this work is to reduce the area and power of the SEC encoder and decoder circuits without affecting delay. In this paper, CASEC(10,8,8), CASEC(18,16,8), 2xCASEC(10,8,4) and 2xCASEC(19,6,4) codecs are designed and implemented. The CASEC(18,16,8) codec has 67.79 percent lower hardware complexity than the existing design. The proposed codecs are simulated and synthesized for both FPGA and ASIC platforms. The speed of the proposed design is found to be almost equal to that of the existing design, while requiring less area and power. The area-delay products (ADP) of the proposed CASEC(10,8,8), CASEC(18,16,8), and 2xCASEC(10,8,4) codecs are better than those of the existing designs.01/12/2018 2:08 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2739726Bi-Objective Optimization of Data-Parallel Applications on Homogeneous Multicore Clusters for Performance and Energy
https://www.computer.org/csdl/trans/tc/2018/02/08013836-abs.html
Performance and energy are now the dominant objectives for optimization on modern parallel platforms composed of multicore CPU nodes. Existing intra-node and inter-node optimization methods employ a large set of decision variables but do not consider problem size as a decision variable, assuming a linear relationship between performance and problem size and between energy consumption and problem size. Using experiments with real-life data-parallel applications on modern multicore CPUs, we demonstrate that these relationships have complex (non-linear and even non-convex) properties and, therefore, that the problem size has become an important decision variable that can no longer be ignored. This key finding motivates our work. In this paper, we first formulate the bi-objective optimization problem for performance and energy (BOPPE) for data-parallel applications on homogeneous clusters of modern multicore CPUs. It contains only one, heretofore unconsidered, decision variable: the problem size. We then present an efficient and exact global optimization algorithm called <italic>ALEPH</italic> that solves the <italic>BOPPE</italic>. It takes as inputs discrete functions of performance and dynamic energy consumption against problem size, and outputs the globally Pareto-optimal set of solutions. The solutions are workload distributions that achieve inter-node optimization of data-parallel applications for performance and energy. While existing solvers for <italic>BOPPE</italic> give only one solution when the problem size and number of processors are fixed, our algorithm gives a diverse set of globally Pareto-optimal solutions. 
The algorithm has a time complexity of <inline-formula><tex-math notation="LaTeX">$O(m^2 \times p^2)$</tex-math><alternatives> <inline-graphic xlink:href="manumachu-ieq1-2742513.gif"/></alternatives></inline-formula>, where <inline-formula> <tex-math notation="LaTeX">$m$</tex-math><alternatives><inline-graphic xlink:href="manumachu-ieq2-2742513.gif"/> </alternatives></inline-formula> is the number of points in the discrete speed/energy function and <inline-formula> <tex-math notation="LaTeX">$p$</tex-math><alternatives><inline-graphic xlink:href="manumachu-ieq3-2742513.gif"/> </alternatives></inline-formula> is the number of available processors. We experimentally study the efficiency and scalability of our algorithm for two data-parallel applications, matrix multiplication and fast Fourier transform, on a modern multicore CPU and on homogeneous clusters of such CPUs. Based on our experiments, we show that the average and maximum sizes of the globally Pareto-optimal sets determined by our algorithm are 15 and 34 for the first application and 7 and 20 for the second. Compared with the load-balanced workload distribution solution, the average and maximum percentage improvements demonstrated for the first application are (13%, 97%) in performance and (18%, 71%) in energy. For the second application, these improvements are (40%, 95%) and (22%, 127%). Assuming a 5 percent performance degradation from the optimum is acceptable, the average and maximum improvements in energy consumption demonstrated for the two applications are 9 and 44 percent and 8 and 20 percent, respectively. Using the algorithm and its building blocks, we also present a study of the interplay between performance and energy. 
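The algorithm's output, a globally Pareto-optimal set over discrete (performance, energy) points, can be illustrated with a brute-force filter; this is a sketch of the concept only, not ALEPH's efficient exact procedure, and the sample points are invented for illustration:

```python
def pareto_front(points):
    """Return the Pareto-optimal subset of (time, energy) points,
    where lower is better in both objectives. Brute force, O(m^2)."""
    def dominates(a, b):
        # a dominates b if it is no worse in both objectives and differs
        return a[0] <= b[0] and a[1] <= b[1] and a != b
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Four candidate workload distributions: (execution time, dynamic energy).
candidates = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0)]
print(pareto_front(candidates))  # (3.0, 4.0) is dominated by (2.0, 3.0)
# -> [(1.0, 5.0), (2.0, 3.0), (4.0, 1.0)]
```

Each surviving point is a trade-off a designer might pick, which is exactly why a diverse Pareto set is more useful than a single solution.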
We demonstrate how <italic>ALEPH</italic> can be combined with <italic> DVFS</italic>-based Multi-Objective Optimization (MOP) methods to give a better set of (globally Pareto-optimal) solutions.01/12/2018 2:08 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2742513D$^{3}$ : A Dynamic Dual-Phase Deduplication Framework for Distributed Primary Storage
https://www.computer.org/csdl/trans/tc/2018/02/08015182-abs.html
Deploying deduplication for distributed primary storage is a sophisticated and challenging task, considering that the demands of low read/write latency, stable read/write performance, and efficient space saving are all of paramount importance. Unfortunately, existing schemes cannot present a satisfactory solution for the aforementioned requirements simultaneously. In this article, we propose D<inline-formula><tex-math notation="LaTeX">$^{3}$</tex-math><alternatives> <inline-graphic xlink:href="deng-ieq2-2743199.gif"/></alternatives></inline-formula>, a dynamic dual-phase deduplication framework for distributed primary storage. Several major innovations are established in D<inline-formula> <tex-math notation="LaTeX">$^{3}$</tex-math><alternatives><inline-graphic xlink:href="deng-ieq3-2743199.gif"/> </alternatives></inline-formula>. First, we formulate a deduplication-oriented taxonomy called <italic>Dedup-Type</italic>, to group data with similar deduplication-related characteristics into larger categories. It serves as a coarse-grained filter and as one of the prioritizing references in D<inline-formula><tex-math notation="LaTeX">$^{3}$ </tex-math><alternatives><inline-graphic xlink:href="deng-ieq4-2743199.gif"/></alternatives></inline-formula>. Second, D<inline-formula><tex-math notation="LaTeX">$^{3}$</tex-math><alternatives> <inline-graphic xlink:href="deng-ieq5-2743199.gif"/></alternatives></inline-formula> is a dual-phase framework—inline-phase and offline-phase deduplication processes work in concert with each other. Third, D<inline-formula><tex-math notation="LaTeX">$^{3}$</tex-math><alternatives> <inline-graphic xlink:href="deng-ieq6-2743199.gif"/></alternatives></inline-formula> operates in a dynamic manner. We design two critical mechanisms: <italic>context-aware threshold adjustment</italic> (CTA) for local inline-phase deduplication, and <italic>deferred priority-based enforcement</italic> (DPE) for global offline-phase deduplication. 
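A hypothetical miniature of the dual-phase idea (the names, threshold policy, and data structures here are illustrative assumptions, not the paper's CTA/DPE mechanisms): deduplicate chunks inline once they look sufficiently hot, and defer cold chunks to an offline pass.

```python
import hashlib
from collections import deque

class DualPhaseDedup:
    """Toy dual-phase deduplicator: chunks seen at least `threshold` times
    are deduplicated inline; first-time chunks are written through and
    queued for a later offline pass."""
    def __init__(self, threshold=2):
        self.threshold = threshold
        self.seen = {}             # fingerprint -> occurrence count
        self.store = {}            # fingerprint -> data (deduplicated store)
        self.offline_queue = deque()

    def write(self, data):
        fp = hashlib.sha256(data).hexdigest()
        self.seen[fp] = self.seen.get(fp, 0) + 1
        if fp in self.store:
            return fp, "inline-dedup"          # duplicate removed inline
        if self.seen[fp] >= self.threshold:
            self.store[fp] = data              # hot chunk: dedup from now on
            return fp, "stored"
        self.offline_queue.append((fp, data))  # cold chunk: defer to offline
        return fp, "deferred"

    def offline_pass(self):
        # Offline phase: fold the deferred chunks into the dedup store.
        while self.offline_queue:
            fp, data = self.offline_queue.popleft()
            self.store.setdefault(fp, data)

store = DualPhaseDedup(threshold=2)
store.write(b"A")   # first sight: written through and deferred to offline
store.write(b"A")   # reaches the threshold: stored for inline dedup
store.write(b"A")   # duplicate removed inline
```

The point of the split is visible even in the toy: the inline path stays cheap and selective for write latency, while the offline pass reclaims the space the inline path skipped.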
The CTA mechanism enables selective deduplication under a periodically updated threshold. Data skipped during the inline phase is regarded as a candidate for the offline phase and is handled in a prioritized order under the governance of the DPE mechanism. Evaluation results demonstrate that, compared with conventional inline and offline deduplication schemes, D<inline-formula><tex-math notation="LaTeX">$^{3}$</tex-math><alternatives> <inline-graphic xlink:href="deng-ieq7-2743199.gif"/></alternatives></inline-formula> achieves more efficient and more stable read/write performance with competitive space saving.01/12/2018 2:08 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2743199Principal Component Analysis Based Filtering for Scalable, High Precision k-NN Search
https://www.computer.org/csdl/trans/tc/2018/02/08024082-abs.html
Approximate <inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="eyers-ieq1-2748131.gif"/></alternatives></inline-formula> Nearest Neighbours (A <inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="eyers-ieq2-2748131.gif"/></alternatives></inline-formula>NN) search is widely used in domains such as computer vision and machine learning. However, A<inline-formula><tex-math notation="LaTeX">$k$ </tex-math><alternatives><inline-graphic xlink:href="eyers-ieq3-2748131.gif"/></alternatives></inline-formula>NN search in high-dimensional datasets does not scale well on multicore platforms, due to its large memory footprint. Parallel A <inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="eyers-ieq4-2748131.gif"/></alternatives></inline-formula>NN search using space subdivision for filtering helps reduce the memory footprint, but its loss of precision is unstable. In this paper, we propose a new data filtering method—PCAF—for parallel A<inline-formula><tex-math notation="LaTeX">$k$</tex-math> <alternatives><inline-graphic xlink:href="eyers-ieq5-2748131.gif"/></alternatives></inline-formula>NN search based on principal component analysis. PCAF improves on previous methods, demonstrating sustained, high scalability for a wide range of high-dimensional datasets on both Intel and AMD multicore platforms. Moreover, PCAF maintains highly precise A<inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives> <inline-graphic xlink:href="eyers-ieq6-2748131.gif"/></alternatives></inline-formula>NN search results.01/12/2018 2:08 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2748131ClusterFetch: A Lightweight Prefetcher for Intensive Disk Reads
https://www.computer.org/csdl/trans/tc/2018/02/08025580-abs.html
By overlapping disk accesses with computation-intensive operations, prefetching can reduce delays in launching an application and in loading significant amounts of data while the application is running. The key to effective prefetching is balancing the accuracy of selecting relevant blocks against the time taken to decide on those blocks. To address this problem, we propose a new prefetcher called <italic>ClusterFetch</italic>. In its learning mode, ClusterFetch detects periods of intensive disk accesses by monitoring the speed at which read requests are queued; it re-organizes these reads and locates the file opened by the application just before each such period. During subsequent runs of the same application, ClusterFetch prefetches the data associated with the opening of such a “trigger” file. Our experimental results show that ClusterFetch, implemented in Linux, can reduce application launch time by up to 41.3 percent and loading time by up to 38.2 percent, while taking up less than 200 KB of main memory.01/12/2018 2:08 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2748939A Generic Construction of Quantum-Oblivious-Key-Transfer-Based Private Query with Ideal Database Security and Zero Failure
https://www.computer.org/csdl/trans/tc/2018/01/07962191-abs.html
Higher security and a lower failure probability have always been the goals pursued in quantum-oblivious-key-transfer-based private query (QOKT-PQ) protocols since Jacobi et al. [<italic>Phys. Rev. A </italic> 83, 022301 (2011)] proposed the first protocol of this kind. However, higher database security generally has to be obtained at the cost of a higher failure probability, and vice versa. Recently, based on a round-robin differential-phase-shift quantum key distribution protocol, Liu et al. [<italic>Sci. China-Phys. Mech. Astron. </italic>, 58, 100301 (2015)] presented a private query protocol (the RRDPS-PQ protocol) utilizing an ideal single-photon signal, which realizes both ideal database security and zero failure probability. However, an ideal single-photon source is not available today, and for a large database the required pulse train is too long to implement. Here, we reexamine the security of the RRDPS-PQ protocol under an imperfect source and present an improved protocol using a special “low-shift and addition” (LSA) technique, which not only can be used to query a large database but also retains the features of “ideal database security” and “zero failure” even under a weak coherent source. Finally, we generalize the LSA technique and establish a generic QOKT-PQ model in which both “ideal database security” and “zero failure” are achieved with acceptable communication.12/11/2017 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2721404Type Information Elimination from Objects on Architectures with Tagged Pointers Support
https://www.computer.org/csdl/trans/tc/2018/01/07962268-abs.html
Implementations of object-oriented programming languages associate type information with each object to perform various runtime tasks such as dynamic dispatch, type introspection, and reflection. A common means of storing this relation is to insert a pointer to the associated type information into every object. Such an approach, however, introduces memory and performance overheads when compared with non-object-oriented languages. Recent 64-bit computer architectures have added support for <italic>tagged pointers</italic> by ignoring a number of bits of a memory address (the <italic>tag</italic>) during memory access operations and utilizing them for other purposes, mainly security. This paper presents the first investigation into how this hardware support can be exploited by a Java Virtual Machine to remove type information from objects. Moreover, we propose novel hardware extensions to the address generation and load-store units to achieve low-overhead type information retrieval and tagged object pointer compression-decompression. The evaluation has been conducted by integrating the Maxine VM with the ZSim microarchitectural simulator. The results, across the full DaCapo benchmark suite, pseudo-SPECjbb2005, SLAMBench, and GraphChi-PR executed to completion, show up to 26 and 10 percent geometric mean heap space savings, up to 50 and 12 percent geometric mean dynamic DRAM energy reduction, and up to 49 and 3 percent geometric mean execution time reduction, with no significant performance regressions.
12/11/2017 2:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2709739
Time-Triggered Co-Scheduling of Computation and Communication with Jitter Requirements
https://www.computer.org/csdl/trans/tc/2018/01/07967685-abs.html
The complexity of embedded application design is increasing with growing user demands. In particular, automotive embedded systems are highly complex in nature, and their functionality is realized by a set of periodic tasks. These tasks may have hard real-time requirements and communicate over an interconnect. The problem is to efficiently co-schedule task execution on cores and message transmission on the interconnect so that timing constraints are satisfied. Contemporary works typically deal with zero-jitter scheduling, which results in lower resource utilization but has lower memory requirements. This article focuses on jitter-constrained scheduling, which puts constraints on the tasks' jitter and thereby increases schedulability over zero-jitter scheduling. The contributions of this article are: 1) Integer Linear Programming and Satisfiability Modulo Theory models that exploit problem-specific information to reduce the formulations' complexity when scheduling small applications. 2) A heuristic approach employing three levels of scheduling that scales to real-world use-cases with 10,000 tasks and messages. 3) An experimental evaluation of the proposed approaches on a case study and on synthetic data sets, showing the efficiency of both zero-jitter and jitter-constrained scheduling. It shows that up to 28 percent higher resource utilization can be achieved, at the cost of up to 10 times longer computation time, when jitter requirements are relaxed.
12/11/2017 2:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2722443
PowerCool: Simulation of Cooling and Powering of 3D MPSoCs with Integrated Flow Cell Arrays
https://www.computer.org/csdl/trans/tc/2018/01/07967719-abs.html
Integrated Flow-Cell Arrays (FCAs) combine integrated liquid cooling with on-chip power generation, converting the chemical energy of flowing electrolyte solutions into electrical energy. The FCA technology provides a promising way to address both heat removal and power delivery issues in 3D Multiprocessor Systems-on-Chips (MPSoCs). In this paper, we motivate the benefits of FCAs in 3D MPSoCs via a qualitative analysis and explore the capabilities of the proposed technology using our extended PowerCool simulator. PowerCool is a tool that performs combined compact thermal and electrochemical simulation of 3D MPSoCs with inter-tier FCA-based cooling and power generation. We validate our electrochemical model against experimental data obtained using a micro-scale FCA, and extend PowerCool with a compact thermal model (3D-ICE) and subthreshold leakage estimation. We show the sensitivity of the FCA cooling and power generation to design-time (FCA geometry) and run-time (fluid inlet temperature, flow rate) parameters. Our results show that we can optimize the FCA to keep the maximum chip temperature below 95 °C for an average chip power consumption of 50 W/cm<sup>2</sup> while generating up to 3.6 W per cm<sup>2</sup> of chip area.
12/11/2017 2:04 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2695179
Efficient Detection for Malicious and Random Errors in Additive Encrypted Computation
https://www.computer.org/csdl/trans/tc/2018/01/07967774-abs.html
Although data confidentiality is the primary security objective in additive encrypted computation applications, such as the aggregation of encrypted votes in electronic elections, ensuring the trustworthiness of data is equally important. And yet, integrity protections are generally orthogonal to additive homomorphic encryption, which enables efficient encrypted computation, due to the inherent malleability of homomorphic ciphertexts. Since additive homomorphic schemes are founded on modular arithmetic, our framework extends residue numbering to support fast modular reductions and homomorphic syndromes for detecting random errors inside homomorphic ALUs and data memories. In addition, our methodology detects <italic>malicious modifications</italic> of memory data using keyed syndromes and block cipher-based integrity trees, which preserve the homomorphism of ALU operations while enforcing the non-malleability of memory data. Compared to traditional memory integrity protections, our tree-based syndrome generation and updating is parallelizable for increased efficiency, while requiring only a small Trusted Computing Base for secret key storage and block cipher operations. Our evaluation shows a more than 99.999 percent detection rate for random ALU errors, as well as a 100 percent detection rate for single bit-flips and clustered multiple bit upsets, for a runtime overhead between 1.2 and 5.5 percent and a small area penalty.
12/11/2017 2:05 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2017.2722440
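The homomorphic-syndrome idea in the last abstract can be illustrated with a minimal sketch (the modulus, the function names, and the error model below are illustrative assumptions, not taken from the paper): a residue check value travels alongside each operand and is updated by the same addition that updates the data, so a random error that breaks the residue relation is detected without re-reading the original operands.

```python
# Illustrative sketch of an additive homomorphic syndrome over residue
# arithmetic. CHECK_MOD and all names here are hypothetical choices for
# exposition, not the scheme from the paper.

CHECK_MOD = 251  # an odd prime, so no single bit-flip (a change of +/-2^k)
                 # can be a multiple of it and escape detection

def encode(x):
    """Attach a syndrome: the value's residue mod CHECK_MOD travels with it."""
    return (x, x % CHECK_MOD)

def add(a, b):
    """Homomorphic add: data and syndrome are updated independently,
    exploiting (x + y) mod m == (x mod m + y mod m) mod m."""
    return (a[0] + b[0], (a[1] + b[1]) % CHECK_MOD)

def check(v):
    """An error that changes the data by a non-multiple of CHECK_MOD
    (e.g., any single bit-flip) breaks the residue relation."""
    return v[0] % CHECK_MOD == v[1]
```

Because addition commutes with reduction mod the check modulus, the syndrome can be maintained inside the ALU datapath at low cost; the paper's keyed syndromes extend the same idea to make forged-but-consistent pairs infeasible for an attacker without the key.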