IEEE Transactions on Computers
https://www.computer.org/csdl/trans/tc/index.html
The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers, brief contributions, and comments on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.
IEEE Computer Society Digital Library: List of 100 recently published journal articles.
https://www.computer.org/csdl
GCMA: Guaranteed Contiguous Memory Allocator
https://www.computer.org/csdl/trans/tc/2019/03/08456561-abs.html
The importance of physically contiguous memory has increased in modern computing environments, including both low- and high-end systems. Existing physically contiguous memory allocators generally have critical limitations. For example, the most commonly adopted solution, the memory reservation technique, wastes a significant amount of memory space. Scatter/Gather direct memory access (DMA) and input-output memory management units (IOMMUs) avoid this problem by utilizing additional hardware for address space virtualization. However, additional hardware means an increase in cost and power consumption, which is especially disadvantageous for small systems, and these techniques do not provide truly contiguous memory. The Linux Contiguous Memory Allocator (CMA) aims to provide contiguous memory allocation while maximizing memory utilization through page migration, but it suffers from unpredictably long latency and a high probability of allocation failure. We therefore introduce a new solution to this problem, the guaranteed contiguous memory allocator (GCMA), which guarantees efficient memory space utilization, short latency, and successful allocation. GCMA uses a reservation scheme and increases memory utilization by sharing the reserved memory with immediately discardable data. Our evaluation of GCMA on a Raspberry Pi 2 finds an allocation latency 15-130 times lower than CMA's, and up to 10 times lower latency when taking a photo. With a large working set on a memory-fragmented high-end system, GCMA achieves a 2.27x speedup.
02/08/2019 2:06 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2869169
Content Aware Refresh: Exploiting the Asymmetry of DRAM Retention Errors to Reduce the Refresh Frequency of Less Vulnerable Data
https://www.computer.org/csdl/trans/tc/2019/03/08456584-abs.html
DRAM refresh is responsible for significant performance and energy overheads in a wide range of computer systems, from mobile platforms to datacenters <xref ref-type="bibr" rid="ref1">[1]</xref> . With the growing demand for DRAM capacity and the worsening retention time characteristics of deeply scaled DRAM, refresh is expected to become an even more pronounced problem in future technology generations <xref ref-type="bibr" rid="ref2">[2]</xref> . This paper examines <italic>content aware refresh</italic>, a new technique that reduces the refresh frequency by exploiting the unidirectional nature of DRAM retention errors: assuming that a logical <bold>1</bold> and <bold>0</bold> respectively are represented by the presence and absence of charge, <bold>1</bold>-to-<bold>0</bold> failures are much more likely than <bold>0</bold>-to-<bold>1</bold> failures. As a result, in a DRAM system that uses a block error correcting code (ECC) to protect memory, blocks with fewer <bold>1</bold>s can attain a specified reliability target (i.e., mean time to failure) with a refresh rate lower than that which is required for a block with all <bold>1</bold>s. Leveraging this key insight, and without compromising memory reliability, the proposed content aware refresh mechanism refreshes memory blocks with fewer <bold>1</bold>s less frequently. To keep the overhead of tracking multiple refresh rates manageable, refresh groups—groups of DRAM rows refreshed together—are dynamically arranged into one of a predefined number of refresh bins and refreshed at the rate determined by the ECC block with the greatest number of <bold>1</bold>s in that bin. By tailoring the refresh rate to the actual content of a memory block rather than assuming a worst case data pattern, content aware refresh respectively outperforms DRAM systems that employ RAS-only Refresh, all-bank Auto Refresh, and per-bank Auto Refresh mechanisms by 12, 8, and 13 percent. 
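As a rough illustration of the binning idea, the following sketch maps a refresh group to one of a few refresh bins by its worst-case (most 1s) ECC block; the function names and the even partition of the 1s-count range are our own simplification, not the paper's exact mechanism:

```python
def count_ones(block: bytes) -> int:
    """Number of 1 bits in an ECC block (1-to-0 failures dominate)."""
    return sum(bin(b).count("1") for b in block)

def assign_refresh_bin(ecc_blocks, num_bins=4):
    """Bin a refresh group by its worst-case ECC block: the block with the
    most 1s dictates the rate. Bin num_bins-1 is refreshed at the full
    worst-case (all-1s) rate; lower bins are refreshed less frequently."""
    total_bits = len(ecc_blocks[0]) * 8
    worst = max(count_ones(b) for b in ecc_blocks)
    return min(worst * num_bins // (total_bits + 1), num_bins - 1)
```

A group holding an all-1s block lands in the fastest bin, while an all-0s group can use the slowest rate without violating the reliability target.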
It also reduces DRAM system energy by 15, 13, and 16 percent as compared to these systems.
02/08/2019 2:06 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2868338
Optimizing File Systems with a Write-Efficient Journaling Scheme on Non-Volatile Memory
https://www.computer.org/csdl/trans/tc/2019/03/08466031-abs.html
Modern file systems employ journaling techniques to guarantee data consistency in case of unexpected system crashes or power failures. However, journaling file systems usually suffer performance degradation due to the extra journal writes. Emerging non-volatile memory (NVM) technologies have the potential to reduce this journaling overhead when deployed as journaling storage devices. However, traditional journaling techniques, which are designed for hard disks, fail to perform efficiently on NVM. To address this problem, we propose an NVM-based journaling scheme, called NJS. The basic idea behind NJS is to reduce the journaling overhead of traditional file systems while fully exploiting the byte-accessibility of NVM and alleviating its slow writes and limited endurance. NJS makes four major contributions: (1) To decrease the amount of journal writes, NJS writes only the file system metadata and over-write data to NVM as write-ahead logging, thus alleviating NVM's slow writes and limited endurance. (2) NJS adopts a wear-aware strategy for NVM journal block allocation in which blocks are evenly worn out, further extending the lifetime of NVM. (3) We propose a novel journaling update scheme in which journal data blocks are updated at byte granularity based on the difference between the old and new versions of journal blocks, fully exploiting the unique byte-accessibility of NVM. (4) NJS includes a garbage collection mechanism that absorbs redundant journal updates and actively delays checkpointing to the file system. Evaluation results show the efficiency and efficacy of NJS.
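Contribution (3), diff-based byte-granularity updates, can be illustrated with a small sketch. The helper names and the in-memory representation are our own; the paper's on-NVM layout is more involved:

```python
def journal_delta(old: bytes, new: bytes):
    """Byte-granularity diff of two equal-sized journal blocks: returns
    (offset, bytes) runs where the new version differs from the old, so
    only the changed bytes need to be written to NVM."""
    assert len(old) == len(new)
    deltas, i, n = [], 0, len(new)
    while i < n:
        if old[i] != new[i]:
            j = i
            while j < n and old[j] != new[j]:
                j += 1
            deltas.append((i, new[i:j]))
            i = j
        else:
            i += 1
    return deltas

def apply_delta(old: bytes, deltas) -> bytes:
    """Reconstruct the new block version by replaying the recorded runs."""
    buf = bytearray(old)
    for off, run in deltas:
        buf[off:off + len(run)] = run
    return bytes(buf)
```

When consecutive journal versions differ in only a few bytes, the delta is far smaller than a full block write, which is where the write-traffic savings come from.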
For example, compared with Ext4 with a ramdisk-based journaling device, the throughput improvement of Ext4 with our NJS is up to 131.4 percent.
02/08/2019 2:06 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2870380
Analysis, Modeling and Optimization of Equal Segment Based Approximate Adders
https://www.computer.org/csdl/trans/tc/2019/03/08468108-abs.html
Over the past decade, several approximate adders have been proposed in the literature based on the design concept of <italic>Equal Segment Adder</italic> (ESA). In this approach, an <inline-formula><tex-math notation="LaTeX">$N$</tex-math><alternatives><inline-graphic xlink:href="dutt-ieq1-2871096.gif"/></alternatives></inline-formula>-bit adder is segmented into several smaller and independent equally sized accurate sub-adders. An <inline-formula><tex-math notation="LaTeX">$N$</tex-math><alternatives><inline-graphic xlink:href="dutt-ieq2-2871096.gif"/></alternatives></inline-formula>-bit ESA has two primary design parameters: (i) Segment size (<inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives><inline-graphic xlink:href="dutt-ieq3-2871096.gif"/></alternatives></inline-formula>), which represents the maximum length of carry propagation; and (ii) Overlapping bits (<inline-formula><tex-math notation="LaTeX">$l$</tex-math><alternatives><inline-graphic xlink:href="dutt-ieq4-2871096.gif"/></alternatives></inline-formula>), which represents the minimum number of bits used in carry prediction, where <inline-formula><tex-math notation="LaTeX">$1 \leq k < N$</tex-math><alternatives><inline-graphic xlink:href="dutt-ieq5-2871096.gif"/></alternatives></inline-formula> and <inline-formula><tex-math notation="LaTeX">$0 \leq l < k$</tex-math><alternatives><inline-graphic xlink:href="dutt-ieq6-2871096.gif"/></alternatives></inline-formula>. 
Based on the combinations of <inline-formula><tex-math notation="LaTeX">$k$</tex-math><alternatives><inline-graphic xlink:href="dutt-ieq7-2871096.gif"/></alternatives></inline-formula> and <inline-formula><tex-math notation="LaTeX">$l$</tex-math><alternatives><inline-graphic xlink:href="dutt-ieq8-2871096.gif"/></alternatives></inline-formula>, an <inline-formula><tex-math notation="LaTeX">$N$</tex-math><alternatives><inline-graphic xlink:href="dutt-ieq9-2871096.gif"/></alternatives></inline-formula>-bit ESA has <inline-formula><tex-math notation="LaTeX">$N(N-1)/2$</tex-math><alternatives><inline-graphic xlink:href="dutt-ieq10-2871096.gif"/></alternatives></inline-formula> possible configurations. In this paper, we analyse ESAs and propose analytical models to estimate the accuracy, delay, power and area of ESAs. The key features of the proposed analytical models are that: (i) they are generalized, i.e., they work for all possible configurations of an <inline-formula><tex-math notation="LaTeX">$N$</tex-math><alternatives><inline-graphic xlink:href="dutt-ieq11-2871096.gif"/></alternatives></inline-formula>-bit ESA; and (ii) they are superior to (i.e., estimate more accurately than) or on par with existing analytical models. From the proposed analytical models, we observe that in an <inline-formula><tex-math notation="LaTeX">$N$</tex-math><alternatives><inline-graphic xlink:href="dutt-ieq12-2871096.gif"/></alternatives></inline-formula>-bit ESA, there exist multiple configurations which exhibit similar accuracy. However, these configurations exhibit different delay, power and area. Therefore, for a given accuracy, the configurations which provide minimal delay, power and/or area need to be known a priori for efficient, intelligent and goal-oriented implementations of ESAs.
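A behavioral sketch of an ESA makes the k/l parameters and the configuration count concrete. This is our own simplified software model (each segment's carry-in predicted from the l operand bits below it), not the authors' hardware description:

```python
def esa_add(a: int, b: int, n: int, k: int, l: int) -> int:
    """Equal Segment Adder model: n-bit operands split into independent
    k-bit sub-adders; each sub-adder's carry-in is predicted from the l
    operand bits just below its segment (l = 0 means no prediction)."""
    assert 1 <= k < n and 0 <= l < k
    mask_k, result = (1 << k) - 1, 0
    for lo in range(0, n, k):
        cin = 0
        if lo > 0 and l > 0:
            pa = (a >> (lo - l)) & ((1 << l) - 1)
            pb = (b >> (lo - l)) & ((1 << l) - 1)
            cin = (pa + pb) >> l  # carry out of the l prediction bits
        seg = ((a >> lo) & mask_k) + ((b >> lo) & mask_k) + cin
        result |= (seg & mask_k) << lo
    return result & ((1 << n) - 1)

def esa_configs(n: int):
    """All (k, l) configurations of an n-bit ESA: n(n-1)/2 of them."""
    return [(k, l) for k in range(1, n) for l in range(k)]
```

When no carry crosses a segment boundary the result is exact; a carry generated below the l prediction bits is missed, which is the source of the approximation error the analytical models characterize.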
In this regard, we present an optimization framework that exploits the proposed analytical models to find the optimal configurations of an <inline-formula><tex-math notation="LaTeX">$N$</tex-math><alternatives><inline-graphic xlink:href="dutt-ieq13-2871096.gif"/></alternatives></inline-formula>-bit ESA. Further, while the accuracy of an ESA does not depend on the adder architecture used to implement it, its delay, power and area do. Consequently, the optimal configurations vary with the adder architecture used to implement the ESA. In order to cover a wide range of adders, we consider three types of adder architecture in our analysis: (i) Architectures having smaller area (<inline-formula><tex-math notation="LaTeX">$O(N)$</tex-math><alternatives><inline-graphic xlink:href="dutt-ieq14-2871096.gif"/></alternatives></inline-formula>); (ii) Architectures having smaller delay (<inline-formula><tex-math notation="LaTeX">$O(\log_2 N)$</tex-math><alternatives><inline-graphic xlink:href="dutt-ieq15-2871096.gif"/></alternatives></inline-formula>); and (iii) Architectures having in-between delay (<inline-formula><tex-math notation="LaTeX">$O(N/4)$</tex-math><alternatives><inline-graphic xlink:href="dutt-ieq16-2871096.gif"/></alternatives></inline-formula>) and area (<inline-formula><tex-math notation="LaTeX">$O(2N)$</tex-math><alternatives><inline-graphic xlink:href="dutt-ieq17-2871096.gif"/></alternatives></inline-formula>).
02/08/2019 2:06 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2871096
Reducing Flash Memory Write Traffic by Exploiting a Few MBs of Capacitor-Powered Write Buffer Inside Solid-State Drives (SSDs)
https://www.computer.org/csdl/trans/tc/2019/03/08470129-abs.html
To mitigate the long write latency of NAND flash memory, solid-state drives (SSDs) typically use capacitor-powered SRAM or DRAM to realize internal nonvolatile write buffering. Due to the cost and size constraints, intra-SSD capacitors can only power a very small amount (e.g., 8 MB or 16 MB) of nonvolatile write buffer. As a result, most commercial SSDs simply use the few MBs of capacitor-powered write buffer in the first-in first-out (FIFO) manner without employing any advanced data eviction policy. This paper presents a set of design strategies across the application and storage device levels that can effectively leverage the very small intra-SSD write buffer to noticeably reduce the amount of data being physically written to NAND flash memory. These cross-layer design strategies are primarily geared towards mainstream applications (e.g., database and filesystem) that heavily involve logging/journaling operations. This paper discusses different strategies for realizing flash memory write traffic reduction through nominal application-level modifications, and presents solutions to accordingly manage the write buffer at small processing and memory resource usage inside SSDs. To evaluate the potential effectiveness, we carried out case studies based upon popular open-source relational databases and filesystem. With only 8 MB of intra-SSD capacitor-powered write buffer, our experimental results show that the developed design solutions can reduce up to 39.7, 36.5, and 52.9 percent of total NAND flash memory write traffic for MySQL, <italic>ext4</italic>, and PostgreSQL, respectively.
02/08/2019 2:06 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2871683
<italic>RT-ByzCast</italic>: Byzantine-Resilient Real-Time Reliable Broadcast
https://www.computer.org/csdl/trans/tc/2019/03/08470958-abs.html
Today's cyber-physical systems face various impediments to achieving their intended goals: communication uncertainties and faults, which accompany the increased integration of networked and wireless devices, hinder the synchronism needed to meet real-time deadlines. Moreover, being critical, these systems are also exposed to significant security threats. This threat combination increases the risk of physical damage. This paper addresses these problems by studying how to build the first real-time Byzantine reliable broadcast protocol (RTBRB) tolerating network uncertainties, faults, and attacks. Previous literature describes either real-time reliable broadcast protocols, or asynchronous (non real-time) Byzantine ones. We first prove that it is impossible to implement RTBRB using traditional distributed computing paradigms, e.g., where the error/failure detection mechanisms of processes are decoupled from the broadcast algorithm itself, even with the help of the most powerful failure detectors. We circumvent this impossibility by proposing <italic>RT-ByzCast</italic>, an algorithm based on aggregating digital signatures in a sliding time-window and on empowering processes with self-crashing capabilities to mask and bound losses. We show that <italic>RT-ByzCast</italic> (i) operates in real-time by proving that messages broadcast by correct processes are delivered within a known bounded delay, and (ii) is reliable by demonstrating that correct processes using our algorithm crash themselves with a negligible probability, even with message loss rates as high as 60 percent.
02/08/2019 2:06 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2871443
Dynamic Voltage and Frequency Scaling in NoCs with Supervised and Reinforcement Learning Techniques
https://www.computer.org/csdl/trans/tc/2019/03/08489913-abs.html
Network-on-Chips (NoCs) are the de facto choice for designing the interconnect fabric in multicore chips due to their regularity, efficiency, simplicity, and scalability. However, NoCs suffer from excessive static power and dynamic energy due to transistor leakage current and data movement between the cores and caches. Power consumption issues are only exacerbated by ever-decreasing technology sizes. Dynamic Voltage and Frequency Scaling (DVFS) is one technique that seeks to reduce dynamic energy; however, this often comes at the expense of performance. In this paper, we propose <italic>LEAD</italic>, Learning-enabled Energy-Aware Dynamic voltage/frequency scaling for multicore architectures, using both supervised learning and reinforcement learning approaches. LEAD groups a router and its outgoing links into the same V/F domain and implements proactive DVFS mode management strategies that rely on offline-trained machine learning models to provide optimal V/F mode selection among different voltage/frequency pairs. We present three supervised learning versions of LEAD based on buffer utilization, change in buffer utilization, and change in energy/throughput, which allow proactive mode selection based on accurate prediction of future network parameters. We then describe a reinforcement learning approach to LEAD that optimizes DVFS mode selection directly, obviating the need for label and threshold engineering. Simulation results using PARSEC and Splash-2 benchmarks on a 4 × 4 concentrated mesh architecture show that by using supervised learning, LEAD can achieve average dynamic energy savings of 15.4 percent for a throughput loss of 0.8 percent with no significant impact on latency. When reinforcement learning is used, LEAD increases average dynamic energy savings to 20.3 percent at the cost of a 1.5 percent decrease in throughput and a 1.7 percent increase in latency.
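The reinforcement learning variant can be sketched as tabular Q-learning over discretized network states. The state/action/reward choices below are generic assumptions for illustration, not LEAD's exact agent:

```python
import random

def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: state could be a discretized buffer
    utilization level, action a V/F mode, and reward a trade-off between
    energy savings and throughput loss (all assumed here)."""
    target = reward + gamma * max(Q[next_state])
    Q[state][action] += alpha * (target - Q[state][action])

def select_mode(Q, state, epsilon, rng):
    """Epsilon-greedy V/F mode selection from the learned Q-table."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q[state]))
    return max(range(len(Q[state])), key=lambda a: Q[state][a])
```

Directly optimizing the mode choice this way is what removes the label and threshold engineering that the supervised versions require.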
Overall, the more flexible reinforcement learning approach enables learning an optimal behavior for a wider range of load environments under any desired energy versus throughput tradeoff.
02/08/2019 2:06 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2875476
TA-LRW: A Replacement Policy for Error Rate Reduction in STT-MRAM Caches
https://www.computer.org/csdl/trans/tc/2019/03/08489928-abs.html
As the technology process node scales down, on-chip SRAM caches lose their efficiency because of their low scalability, high leakage power, and increasing rate of soft errors. Among emerging memory technologies, <italic><inline-formula><tex-math notation="LaTeX">$Spin$</tex-math><alternatives><inline-graphic xlink:href="asadi-ieq1-2875439.gif"/></alternatives></inline-formula></italic>-<italic><inline-formula><tex-math notation="LaTeX">$Transfer\; Torque\; Magnetic\; RAM$</tex-math><alternatives><inline-graphic xlink:href="asadi-ieq2-2875439.gif"/></alternatives></inline-formula></italic> (STT-MRAM) is known as the most promising replacement for SRAM-based cache memories. The main advantages of STT-MRAM are its non-volatility, near-zero leakage power, higher density, soft-error immunity, and higher scalability. Despite these advantages, the high error rate in STT-MRAM cells due to <italic><inline-formula><tex-math notation="LaTeX">$retention\; failure$</tex-math><alternatives><inline-graphic xlink:href="asadi-ieq3-2875439.gif"/></alternatives></inline-formula></italic>, <italic><inline-formula><tex-math notation="LaTeX">$write\; failure$</tex-math><alternatives><inline-graphic xlink:href="asadi-ieq4-2875439.gif"/></alternatives></inline-formula></italic>, and <italic><inline-formula><tex-math notation="LaTeX">$read\; disturbance$</tex-math><alternatives><inline-graphic xlink:href="asadi-ieq5-2875439.gif"/></alternatives></inline-formula></italic> threatens the reliability of cache memories built upon STT-MRAM technology. The error rate increases significantly at higher temperatures, which further affects the reliability of STT-MRAM-based cache memories.
The major source of heat generation and temperature increase in STT-MRAM cache memories is write operations, which are managed by the cache <italic><inline-formula><tex-math notation="LaTeX">$replacement\; policy$</tex-math><alternatives><inline-graphic xlink:href="asadi-ieq6-2875439.gif"/></alternatives></inline-formula></italic>. To the best of our knowledge, no previous study has attempted to mitigate the heat generation and high temperature of STT-MRAM cache memories through the replacement policy. In this paper, we first analyze cache behavior under the conventional <italic><inline-formula><tex-math notation="LaTeX">$Least$</tex-math><alternatives><inline-graphic xlink:href="asadi-ieq7-2875439.gif"/></alternatives></inline-formula></italic>-<italic><inline-formula><tex-math notation="LaTeX">$Recently\; Used$</tex-math><alternatives><inline-graphic xlink:href="asadi-ieq8-2875439.gif"/></alternatives></inline-formula></italic> (LRU) replacement policy and demonstrate that the majority of consecutive write operations (more than 66 percent) are committed to adjacent cache blocks. These adjacent write operations cause accumulated heat and increased temperature, which significantly increase the cache error rate.
To eliminate heat accumulation and the adjacency of consecutive writes, we propose a cache replacement policy, named <italic><inline-formula><tex-math notation="LaTeX">$Thermal$</tex-math><alternatives><inline-graphic xlink:href="asadi-ieq9-2875439.gif"/></alternatives></inline-formula></italic>-<italic><inline-formula><tex-math notation="LaTeX">$Aware\; Least$</tex-math><alternatives><inline-graphic xlink:href="asadi-ieq10-2875439.gif"/></alternatives></inline-formula></italic>-<italic><inline-formula><tex-math notation="LaTeX">$Recently\; Written$</tex-math><alternatives><inline-graphic xlink:href="asadi-ieq11-2875439.gif"/></alternatives></inline-formula></italic> (TA-LRW), to smoothly distribute the generated heat by directing consecutive write operations to distant cache blocks. TA-LRW guarantees a distance of at least three blocks between any two consecutive write operations in an 8-way associative cache. This distant write scheme reduces the temperature-induced error rate by 94.8 percent, on average, compared with the conventional LRU policy, which results in a 6.9x reduction in cache error rate. The implementation cost and complexity of TA-LRW are as low as those of the <italic><inline-formula><tex-math notation="LaTeX">$First$</tex-math><alternatives><inline-graphic xlink:href="asadi-ieq12-2875439.gif"/></alternatives></inline-formula></italic>-<italic><inline-formula><tex-math notation="LaTeX">$In,\; First$</tex-math><alternatives><inline-graphic xlink:href="asadi-ieq13-2875439.gif"/></alternatives></inline-formula></italic>-<italic><inline-formula><tex-math notation="LaTeX">$Out$</tex-math><alternatives><inline-graphic xlink:href="asadi-ieq14-2875439.gif"/></alternatives></inline-formula></italic> (FIFO) policy while providing near-LRU performance, combining the advantages of both replacement policies.
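The distance guarantee can be illustrated with a simplified victim-selection sketch. This is our own reading of the idea (pick the least-recently-written way that is physically far from the last write), not the paper's exact hardware policy:

```python
def talrw_victim(lrw_order, last_way, ways=8, min_dist=3):
    """Pick the least-recently-written way whose circular physical distance
    from the last written way is at least min_dist; fall back to plain
    least-recently-written if no way qualifies. lrw_order lists ways from
    least- to most-recently written."""
    for way in lrw_order:
        d = abs(way - last_way)
        if min(d, ways - d) >= min_dist:
            return way
    return lrw_order[0]
```

Spacing consecutive writes at least three ways apart is what prevents the heat of one write from accumulating on its neighbors before they cool.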
The significantly reduced error rate is achieved by imposing only 2.3 percent performance overhead compared with the LRU policy.
02/08/2019 2:06 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2875439
Redio: Accelerating Disk-Based Graph Processing by Reducing Disk I/Os
https://www.computer.org/csdl/trans/tc/2019/03/08489961-abs.html
Disk-based graph systems store part or all of graph data on external devices like hard drives or SSDs, achieving scalability without excessive hardware. However, massive expensive disk I/Os remain the major performance bottleneck of disk-based graph processing. In this paper, we propose Redio, a new approach to accelerating disk-based graph processing by reducing disk I/Os. First, Redio observes that it is feasible to accommodate all vertex states in main memory, which eliminates almost all vertex-related disk I/Os. Second, Redio introduces a dynamic selective scheduling scheme to identify inactive edges in each iteration and skip them when and only when such skipping can bring a performance benefit. To improve its effectiveness, Redio incorporates a compact edge storage to improve data locality and an indexed bitmap to minimize its memory and computation overheads. We have implemented a single-node prototype for Redio under the edge-centric computation model. Extensive experiments show that Redio consistently outperforms well-known edge-centric disk-based systems in all experiments, delivering an average speedup of <inline-formula><tex-math notation="LaTeX">$4.33\times$</tex-math><alternatives><inline-graphic xlink:href="zhang-ieq1-2875458.gif"/></alternatives></inline-formula> on HDDs and <inline-formula><tex-math notation="LaTeX">$5.33\times$</tex-math><alternatives><inline-graphic xlink:href="zhang-ieq2-2875458.gif"/></alternatives></inline-formula> on SSDs over the fastest among them (i.e., GridGraph).
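An indexed bitmap can be sketched as a one-level summary over 64-bit words, so wholly inactive regions are answered without touching the underlying words. This is an illustrative structure under our own assumptions, not Redio's exact layout:

```python
WORD = 64

class IndexedBitmap:
    """Bitmap with a one-level index: index bit w is set iff the w-th
    64-bit word is non-zero, so a query over an entirely inactive chunk
    is resolved from the index alone."""
    def __init__(self, n):
        self.words = [0] * ((n + WORD - 1) // WORD)
        self.index = 0

    def set(self, i):
        self.words[i // WORD] |= 1 << (i % WORD)
        self.index |= 1 << (i // WORD)

    def test(self, i):
        # Consult the summary first; skip the word lookup if the chunk is empty.
        if not (self.index >> (i // WORD)) & 1:
            return 0
        return (self.words[i // WORD] >> (i % WORD)) & 1
```

For sparse activity patterns, most queries terminate at the index, which is how the bitmap keeps memory and computation overheads low.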
Experimental results also show that Redio delivers an average speedup of <inline-formula><tex-math notation="LaTeX">$3.13\times$</tex-math><alternatives><inline-graphic xlink:href="zhang-ieq3-2875458.gif"/></alternatives></inline-formula> on HDDs and <inline-formula><tex-math notation="LaTeX">$1.28\times$</tex-math><alternatives><inline-graphic xlink:href="zhang-ieq4-2875458.gif"/></alternatives></inline-formula> on SSDs over the fastest among representative vertex-centric disk-based systems (i.e., FlashGraph).
02/08/2019 2:06 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2875458
ASAP: Accelerated Short-Read Alignment on Programmable Hardware
https://www.computer.org/csdl/trans/tc/2019/03/08490591-abs.html
The proliferation of high-throughput sequencing machines ensures rapid generation of up to billions of short nucleotide fragments in a short period of time. This massive amount of sequence data can quickly overwhelm today's storage and compute infrastructure. This paper explores the use of hardware acceleration to significantly improve the runtime of short-read alignment, a crucial step in preprocessing sequenced genomes. We focus on the Levenshtein distance (edit-distance) computation kernel and propose the ASAP accelerator, which utilizes the intrinsic delay of circuits for edit-distance computation elements as a proxy for computation. Our design is implemented on a Xilinx Virtex 7 FPGA in an IBM POWER8 system that uses the CAPI interface for cache coherence across the CPU and FPGA. Our design is <inline-formula><tex-math notation="LaTeX">$200\times$</tex-math><alternatives><inline-graphic xlink:href="banerjee-ieq1-2875733.gif"/></alternatives></inline-formula> faster than an equivalent Smith-Waterman C implementation of the kernel running on the host processor, <inline-formula><tex-math notation="LaTeX">$40-60\times$</tex-math><alternatives><inline-graphic xlink:href="banerjee-ieq2-2875733.gif"/></alternatives></inline-formula> faster than an equivalent Landau-Vishkin C++ implementation of the kernel running on the IBM POWER8 host processor, and <inline-formula><tex-math notation="LaTeX">$2\times$</tex-math><alternatives><inline-graphic xlink:href="banerjee-ieq3-2875733.gif"/></alternatives></inline-formula> faster for an end-to-end alignment tool for 120-150 base-pair short-read sequences.
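The edit-distance kernel that ASAP accelerates is, in software, the classic dynamic program (shown here as a row-at-a-time reference implementation, not the accelerator's delay-based circuit):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein (edit) distance: minimum number of insertions,
    deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to empty b
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution/match
        prev = cur
    return prev[-1]
```

The O(len(a) × len(b)) cell updates of this recurrence are exactly what the accelerator maps onto circuit delay rather than explicit arithmetic.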
Further, the design represents a <inline-formula><tex-math notation="LaTeX">$3760\times$</tex-math><alternatives><inline-graphic xlink:href="banerjee-ieq4-2875733.gif"/></alternatives></inline-formula> improvement over the CPU in performance/Watt terms.
02/11/2019 4:48 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2875733
CC Meets FIPS: A Hybrid Test Methodology for First Order Side Channel Analysis
https://www.computer.org/csdl/trans/tc/2019/03/08490742-abs.html
Common Criteria (CC) and FIPS 140-3 are two popular side channel testing methodologies. Test Vector Leakage Assessment Methodology (<italic>TVLA</italic>), a potential candidate for FIPS, can detect the presence of side-channel information in leakage measurements. However, <italic>TVLA</italic> results cannot be used to quantify side-channel vulnerability, and deriving its relationship with side channel attack success rate (<italic>SR</italic>), a common metric for CC, is an open problem. In this paper, we extend <italic>TVLA</italic> testing beyond its current scope. Precisely, we derive a concrete relationship between <italic>TVLA</italic> and signal to noise ratio (<italic>SNR</italic>). Linking the two metrics allows direct computation of success rate (<italic>SR</italic>) from <italic>TVLA</italic> for a given choice of intermediate variable and leakage model, thus unifying these popular side channel detection and evaluation metrics. An end-to-end methodology is proposed, which can be easily automated, to derive attack <italic>SR</italic> starting from <italic>TVLA</italic> testing. The methodology works in both univariate and multivariate settings and is capable of quantifying any first order leakage. Detailed experiments are provided using both simulated traces and real traces on the SAKURA-GW platform. Additionally, the proposed methodology is benchmarked against previously published attacks on <italic>DPA contest v4.0</italic> traces, followed by an extension to a jitter based countermeasure. The results show that the proposed methodology provides a quick estimate of <italic>SR</italic> without performing actual attacks, thus bridging the gap between CC and FIPS.
02/08/2019 2:06 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2875746
Affine Modeling of Program Traces
https://www.computer.org/csdl/trans/tc/2019/02/08408540-abs.html
A formal, high-level representation of programs is typically needed for the static and dynamic analyses performed by compilers. However, the source code of target applications is not always available in an analyzable form, e.g., to protect intellectual property. To reason about such applications, it becomes necessary to build models from observations of their execution. This paper presents an algebraic approach which, taking as input the trace of memory addresses accessed by a single memory reference, synthesizes an affine loop with a single perfectly nested statement that generates the original trace. This approach is extended to support the synthesis of unions of affine loops, useful for minimally modeling traces generated by automatic transformations of polyhedral programs, such as tiling. The resulting system is capable of processing hundreds of gigabytes of trace data in minutes, minimally reconstructing 100 percent of the static control parts in PolyBench/C applications and 99.9 percent in the Pluto-tiled versions of these benchmarks.
01/15/2019 2:03 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2853747
Sampled Simulation of Task-Based Programs
https://www.computer.org/csdl/trans/tc/2019/02/08424416-abs.html
Sampled simulation is a mature technique for reducing simulation time of single-threaded programs. Nevertheless, current sampling techniques do not take advantage of other execution models, like task-based execution, to provide both more accurate and faster simulation. Recent multi-threaded sampling techniques assume that the workload assigned to each thread does not change across multiple executions of a program. This assumption does not hold for dynamically scheduled task-based programming models. Task-based programming models allow the programmer to specify program segments as tasks which are instantiated many times and scheduled dynamically to available threads. Due to variation in scheduling decisions, two consecutive executions on the same machine typically result in different instruction streams processed by each thread. In this paper, we propose TaskPoint, a sampled simulation technique for dynamically scheduled task-based programs. We leverage task instances as sampling units and simulate only a fraction of all task instances in detail. Between detailed simulation intervals, we employ a novel fast-forwarding mechanism for dynamically scheduled programs. We evaluate different automatic techniques for clustering task instances and show that DBSCAN clustering combined with analytical performance modeling provides the best trade-off of simulation speed and accuracy. TaskPoint is the first technique combining sampled simulation and analytical modeling and provides a new way to trade off simulation speed and accuracy. Compared to detailed simulation, TaskPoint accelerates architectural simulation with 8 simulated threads by an average factor of 220x at an average error of 0.5 percent and a maximum error of 7.9 percent.
01/17/2019 12:14 pm PST
http://doi.ieeecomputersociety.org/10.1109/TC.2018.2860012
Phantasy: Low-Latency Virtualization-Based Fault Tolerance via Asynchronous Prefetching
https://www.computer.org/csdl/trans/tc/2019/02/08438984-abs.html
Fault tolerance has become increasingly critical for virtualized systems as a growing number of mission-critical applications are now deployed on virtual machines rather than directly on physical machines. However, prior hardware-based fault-tolerant systems require extensive modification to existing hardware, which makes them infeasible for industry practitioners. Although software-based techniques realize fault tolerance without any hardware modification, they suffer from significant latency overhead that is often orders of magnitude higher than acceptable. To realize practical low-latency fault tolerance in the virtualized environment, we first identify two bottlenecks in prior approaches, namely the overhead of tracking dirty pages in software and the long sequential dependency in checkpointing system states. To address these bottlenecks, we design a novel mechanism that asynchronously prefetches the dirty pages without disrupting the primary VM execution, shortening the sequential dependency. We then develop Phantasy, a system that leverages page-modification logging (PML) technology available on commodity processors to reduce the dirty page tracking overhead and asynchronously prefetches dirty pages through direct remote memory access via RDMA. Evaluated on 25 real-world applications, we demonstrate that Phantasy can significantly reduce the performance overhead by 38 percent on average, and further reduce the latency by 85 percent compared to a state-of-the-art virtualization-based fault-tolerant system.01/17/2019 12:14 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2865943Adaptive Partition Testing
https://www.computer.org/csdl/trans/tc/2019/02/08440117-abs.html
Random testing and partition testing are two major families of software testing techniques. They have been compared both theoretically and empirically in numerous studies for decades, and it has been widely acknowledged that they have their own advantages and disadvantages and that their innate characteristics are fairly complementary to each other. Some work has been conducted to develop advanced testing techniques through the integration of random testing and partition testing, attempting to preserve the advantages of both while minimizing their disadvantages. In this paper, we propose a new testing approach, <italic>adaptive partition testing</italic>, where test cases are randomly selected from some partition whose probability of being selected is adaptively adjusted throughout the testing process. In particular, we develop two algorithms, <italic>Markov-chain based adaptive partition testing</italic> and <italic>reward-punishment based adaptive partition testing</italic>, to implement the proposed approach. The former algorithm makes use of a Markov matrix to dynamically adjust the probability of a partition being selected for conducting tests, while the latter is based on a reward and punishment mechanism. We conduct empirical studies to evaluate the performance of the proposed algorithms using ten faulty versions of three large-scale open source programs. Our experimental results show that, compared with two baseline techniques, namely random partition testing (RPT) and dynamic random testing (DRT), our algorithms deliver higher fault-detection effectiveness with lower test case selection overhead. It is demonstrated that the proposed adaptive partition testing is an effective testing approach, taking advantage of both random testing and partition testing.01/17/2019 12:13 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2866040Efficient Design-for-Test Approach for Networks-on-Chip
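The abstract does not give the algorithms in detail; the reward-punishment idea can nevertheless be sketched generically. In this illustrative Python (the weight factors 1.5 and 0.8, the `oracle` interface, and the weighted selection are all assumptions, not the paper's parameters), a partition's selection weight is increased whenever a test drawn from it reveals a fault and decreased otherwise:

```python
import random

def adaptive_partition_test(partitions, oracle, budget, reward=1.5, punish=0.8):
    """Reward-punishment sketch: each partition carries a selection weight;
    finding a fault rewards that partition, a passing test punishes it."""
    weights = [1.0] * len(partitions)
    faults = []
    for _ in range(budget):
        total = sum(weights)
        probs = [w / total for w in weights]
        k = random.choices(range(len(partitions)), weights=probs)[0]
        t = random.choice(partitions[k])       # draw a test case from partition k
        if oracle(t):                          # oracle: True if the test fails
            faults.append(t)
            weights[k] *= reward
        else:
            weights[k] *= punish
    return faults, [w / sum(weights) for w in weights]
```

Over time, selection probability concentrates on fault-revealing partitions, which is the intended effect of the reward and punishment mechanism.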
https://www.computer.org/csdl/trans/tc/2019/02/08440731-abs.html
To achieve high reliability in on-chip networks, it is necessary to test the network continuously with Built-in Self-Tests (BIST) so that faults can be detected quickly and the number of affected packets can be minimized. However, BIST causes significant performance loss due to data dependencies. We propose EsyTest, a comprehensive test strategy with minimal influence on system performance. EsyTest tests the data path and the control path separately. The data path test starts periodically, but the actual tests are performed in free time slots to avoid deactivating the router for testing. A reconfigurable router architecture and an adaptive fault-tolerant routing algorithm are proposed to guarantee access to the processing core when the associated router is under test. During the whole test procedure of the network, all processing cores remain accessible, and thus the system performance is maintained during the test. At the same time, EsyTest provides full test coverage for the NoC and better hardware compatibility compared with existing test strategies. Under the PARSEC benchmarks and different test frequencies, the execution time increases by less than 5 percent at the cost of 9.9 percent more area and 4.6 percent more power in comparison with execution where no test procedure is applied.01/17/2019 12:13 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2865948Tolerating C Integer Error via Precision Elevation
https://www.computer.org/csdl/trans/tc/2019/02/08443077-abs.html
In C programs, integer errors are a common yet important kind of defect, caused by arithmetic operations that produce values unrepresentable in certain types. Integer errors are harbored in a wide range of applications and can lead to serious software failures and exploitable vulnerabilities. Due to the complicated semantics of C, manually preventing integer errors is challenging even for experienced developers. In this paper we propose a novel approach to automate C integer error repair by elevating the precision of arithmetic operations according to a set of code transformation rules. A large portion of integer errors can be repaired by recovering expected results (i.e., tolerance) instead of removing program functionality. Our approach is fully automatic without requiring code specifications. Furthermore, the transformed code is ensured to be well-typed and satisfies a conservativeness property with respect to the original code. Our approach is implemented as a prototype <sc>CIntFix</sc> which succeeds in repairing all the integer errors from 7 categories in NIST's Juliet Test Suite. Furthermore, <sc>CIntFix</sc> is evaluated on large code bases in SPEC CINT2000, scaling to 366 KLOC within 126 seconds while the transformed code incurs a 10.5 percent slowdown on average. The evaluation results substantiate the potential of our approach in real-world scenarios.01/17/2019 12:13 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2866388Microcontroller TRNGs Using Perturbed States of NOR Flash Memory Cells
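Precision elevation itself is a simple idea, even though CIntFix's actual transformation rules operate on C source code. The following Python model (hypothetical, not the tool's output) contrasts a C-style wrapping 32-bit multiply with its elevated counterpart, which computes in wider precision and checks representability before narrowing back:

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1

def int32(x):
    """Model two's-complement wraparound of a C 32-bit int."""
    return (x + 2**31) % 2**32 - 2**31

def mul_wrapping(a, b):
    """C-style multiply: silently wraps on overflow."""
    return int32(a * b)

def mul_elevated(a, b):
    """Precision-elevated multiply: compute in wider precision (Python
    ints model a 64-bit or bignum intermediate), then verify the result
    fits in the original type before narrowing."""
    r = a * b
    if not (INT_MIN <= r <= INT_MAX):
        raise OverflowError(f"{a} * {b} overflows int")
    return r
```

The elevated version either returns the mathematically expected result or flags the overflow, which mirrors the "tolerance" goal of recovering expected results rather than removing functionality.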
https://www.computer.org/csdl/trans/tc/2019/02/08443106-abs.html
This paper introduces a new technique that perturbs split-gate NOR Flash memory cells and extracts randomness from read noise to generate true random numbers. Flash memory cells exhibit threshold voltage fluctuations during read operations caused by thermal noise and random telegraph noise effects. Recent proposals demonstrate how these inherent properties of Flash memory cells can be used to create true random numbers in modern NAND Flash memories. However, they cannot be directly applied to NOR Flash memories in microcontrollers, which have a different architecture, improved data retention, and high endurance, and are not as susceptible to noise as high-density NAND Flash memories. The proposed technique is experimentally demonstrated and evaluated using a family of commercial microcontrollers. The evaluation shows that it enables extraction of high-throughput random sequences that pass the NIST statistical tests. The advantages of the proposed technique are as follows: (a) it does not require any special hardware and/or interface modifications, (b) it is robust, cost-effective, and high-throughput, (c) it is entirely implemented in software, and (d) it is flexible and can be tailored to work in low-end microcontrollers that are often resource- or cost-constrained.01/17/2019 12:13 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2866459A Theoretical Model to Link Uniqueness and Min-Entropy for PUF Evaluations
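The paper's exact extraction pipeline is not described in the abstract, but read-noise-based TRNGs commonly post-process the raw biased bits before use. One classic, easily verified debiasing step is the von Neumann extractor, sketched here in Python as an illustration (not necessarily what the authors use):

```python
def von_neumann_extract(bits):
    """Debias a stream of possibly biased but independent bits:
    the pair 01 emits 0, 10 emits 1, and 00/11 are discarded."""
    out = []
    for a, b in zip(bits[0::2], bits[1::2]):
        if a != b:
            out.append(a)
    return out
```

It outputs unbiased bits whenever the raw bits are independent and identically biased, at the cost of discarding all 00 and 11 pairs, which reduces throughput.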
https://www.computer.org/csdl/trans/tc/2019/02/08444682-abs.html
Physical unclonable functions (PUFs) are security primitives that enable the extraction of digital identifiers from electronic devices, based on the inherent silicon process variations between devices which occur during the manufacturing process. Due to the intrinsic and lightweight nature of a PUF, they have been proposed to provide security at a low cost for many applications, in particular for the internet of things (IoT). Many metrics have been proposed to evaluate the security and performance of PUF architectures, two of which are uniqueness and min-entropy. The uniqueness of a PUF response evaluates its ability to differentiate between different physical devices, while the min-entropy estimation is a measure of how much uncertainty the PUF response contains. The min-entropy is a lower bound on the real entropy. When the uniqueness of a PUF design is close to optimal, it is unclear whether this also implies that the design has significantly high entropy; hence it would be useful to ascertain the minimum uniqueness required to achieve a given entropy. To date, a thorough investigation of the relationship between uniqueness and entropy for PUF designs has not been conducted. In this paper, the relationship between uniqueness and entropy is explored and, for the first time to the authors’ knowledge, modeled. To verify this model, both simulations and hardware-based experiments are performed, with a test-bed containing 184 Xilinx Artix-7 FPGA based Basys3 boards providing a large data set for granular results. The experimental results demonstrate that the proposed model accurately estimates the relationship between uniqueness and min-entropy, with both the theoretical analysis and software simulations closely matching the experimental results.01/17/2019 12:13 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2866241An Energy-Efficient Accelerator Based on Hybrid CPU-FPGA Devices for Password Recovery
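The two metrics named in the abstract have standard formulations: uniqueness is the mean pairwise fractional Hamming distance between device responses (ideally 0.5), and per-bit min-entropy can be estimated from each bit position's bias across devices. A minimal Python sketch of these textbook definitions (not the paper's model linking them):

```python
import math
from itertools import combinations

def uniqueness(responses):
    """Mean pairwise fractional Hamming distance between device
    responses; the ideal value is 0.5."""
    n = len(responses[0])
    pairs = list(combinations(responses, 2))
    hd = sum(sum(a != b for a, b in zip(r1, r2)) for r1, r2 in pairs)
    return hd / (len(pairs) * n)

def min_entropy_per_bit(responses):
    """Estimate per-bit min-entropy from the across-device bias p of
    each bit position: H_min = -log2(max(p, 1 - p))."""
    m = len(responses)
    h = 0.0
    for pos in range(len(responses[0])):
        p = sum(r[pos] for r in responses) / m
        h += -math.log2(max(p, 1 - p))
    return h / len(responses[0])
```

With perfectly anti-correlated 2-bit responses both metrics reach their extremes, while identical responses drive both to zero, which is the intuition behind linking the two quantities.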
https://www.computer.org/csdl/trans/tc/2019/02/08453825-abs.html
Password recovery tools are needed to recover lost and forgotten passwords so as to regain access to valuable information. As the process of password recovery can be extremely compute-intensive, hardware accelerators are often needed to expedite the recovery process. This paper thus presents a high-performance, energy-efficient accelerator built upon modern hybrid CPU-FPGA SoC devices. The proposed password recovery accelerator relies on the development of a set of intellectual property (IP) cores implementing a variety of encryption algorithms with vastly different characteristics and complexities. To keep the resource requirements of each IP core running on a resource-strapped FPGA to a minimum, while achieving the highest throughput possible, the most performance-critical computational hash functions are mapped to the FPGA with two specific optimization techniques, namely fixed message padding for hashing and loop transformation for deep pipelining. The proposed password recovery accelerator implements a non-blocking deep pipeline design that does not incur any data or structural hazards, which is made possible by applying a task scheduling scheme through the use of block RAMs. Synchronization between tasks that are mapped to run separately on the CPU and FPGA is achieved through task reordering and a communication protocol for maximum parallelism and low overhead. The proposed design is evaluated on a Xilinx XC7Z030-3 device and compares favorably with other known implementations. The proposed hardware accelerator design is found to be 12.5 and 3.1 times more resource-efficient than pure FPGA-based password recovery accelerators for TrueCrypt and WPA-2, respectively.
The proposed implementation also shows more than 200 percent improvement in energy efficiency over a state-of-the-art implementation on NVIDIA GTX 750 Ti GPU.01/17/2019 12:13 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2868191RC-NVM: Dual-Addressing Non-Volatile Memory Architecture Supporting Both Row and Column Memory Accesses
https://www.computer.org/csdl/trans/tc/2019/02/08453833-abs.html
Although emerging non-volatile memories (NVMs) have been comprehensively studied to design next-generation memory systems, the symmetry of the crossbar structure adopted by most NVMs has not been addressed. In this work, we argue that crossbar-based NVMs can enable a dual-addressing memory architecture, i.e., RC-NVM, to support both row- and column-oriented memory accesses for workloads with different access patterns. Through circuit-level analysis, we first prove that such a dual-addressing architecture is only practical with crossbar-based NVMs rather than DRAM. Then, we introduce the RC-NVM architecture at the bank, chip, and module levels, and propose an RC-NVM-aware memory controller. We also address the challenges of implementing an end-to-end RC-NVM system. In particular, we design a novel protocol that solves the cache synonym problem with very little overhead. Finally, we introduce the deployment of RC-NVM for in-memory databases (IMDBs) and evaluate its performance with IMDBs and well-optimized general matrix multiply (GEMM) workloads. Experimental results show that with only 10 percent area overhead, 1) the memory access performance of IMDBs can be improved by up to 14.5X, and 2) for GEMM, RC-NVM naturally supports SIMD operations and outperforms the best tiled layout by 19 percent.01/24/2019 6:14 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2868368Automated Test Generation for Debugging Multiple Bugs in Arithmetic Circuits
https://www.computer.org/csdl/trans/tc/2019/02/08453904-abs.html
Optimized and custom arithmetic circuits are widely used in embedded systems such as multimedia applications, cryptography systems, signal processing, and console games. Debugging of arithmetic circuits is a challenge due to increasing complexity coupled with non-standard implementations. Existing algebraic rewriting techniques produce a remainder to indicate the presence of a potential bug. However, bug localization remains a major bottleneck. Simulation-based validation using random or constrained-random tests is not effective for complex arithmetic circuits due to bit-blasting. In this paper, we present an automated test generation and bug localization technique for debugging arithmetic circuits. This paper makes four important contributions. We propose an automated approach for generating directed tests by suitable assignments of input variables to make the remainder non-zero. The generated tests are guaranteed to activate bugs. We also propose an automatic bug fixing technique that utilizes the patterns of the remainder terms as well as analyzes the regions activated by the generated tests to detect and correct the error(s). We also propose an efficient debugging algorithm that can handle multiple dependent as well as independent bugs. Finally, our proposed framework, consisting of directed test generation, bug localization, and bug correction, is fully automated. In other words, our framework is capable of producing a corrected implementation of arithmetic circuits without any manual intervention. Our experimental results demonstrate that the proposed approach can be used for automated debugging of large and complex arithmetic circuits.01/17/2019 12:13 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2868362Coding for Write Latency Reduction in a Multi-Level Cell (MLC) Phase Change Memory (PCM)
https://www.computer.org/csdl/trans/tc/2019/02/08454732-abs.html
This paper presents a new write latency reduction scheme for a Phase Change Memory (PCM) made of Multi-Level Cells (MLCs). This scheme improves over an existing scheme found in the technical literature and known as CABS. The proposed scheme is based on the utilization of a new coding arrangement for the selection of candidate codewords. The code relies on the two-step feature found in the write operation of an MLC PCM and avoids the symbol that incurs the largest latency at a higher rate than CABS. A detailed simulation-based evaluation and comparison are also pursued; the proposed scheme accomplishes improvements in write latency (for parallel writing) as well as coding rate (16/17 for the proposed scheme versus 16/18 for CABS for 16 symbols, or a 32-bit word). As the proposed scheme utilizes novel selection criteria for the candidates, the design of the required circuitry (encoder and decoder) has also been changed with respect to CABS; in terms of hardware, the areas of the encoder and decoder for the proposed scheme are reduced by 73 and 56 percent, respectively, compared with CABS.01/17/2019 12:14 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2868928Enhancing Instruction TLB Resilience to Soft Errors
https://www.computer.org/csdl/trans/tc/2019/02/08488602-abs.html
A translation lookaside buffer (TLB) is a type of cache used to speed up the virtual-to-physical memory translation process. Instruction TLBs store virtual page numbers and their related physical page numbers for the last accessed pages of instruction memory. TLBs, like other memories, suffer from soft errors that can corrupt their contents. A false positive due to an error produced in a virtual page number stored in the TLB may lead to a wrong translation and, consequently, the execution of a wrong instruction that can lead to a program hard fault or to data corruption. Parity or error correction codes have been proposed to provide protection for the TLB, but they require additional storage space. This paper presents some schemes to increase the instruction TLB resilience to this type of error without requiring any extra storage space, by taking advantage of the spatial locality principle exhibited when executing a program.01/17/2019 12:13 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2874467On Improving the Write Responsiveness for Host-Aware SMR Drives
https://www.computer.org/csdl/trans/tc/2019/01/08387485-abs.html
This paper presents a Virtual Persistent Cache design to remedy the long-latency behavior and ultimately improve the write responsiveness of Host-Aware Shingled Magnetic Recording (HA-SMR) drives. Our design keeps the cost-effective model of existing HA-SMR drives, but at the same time relies on the host system to adaptively provide some computing and management resources to improve drive performance when needed. The technical contribution is to smartly reshape the access patterns issued to HA-SMR drives, so as to avoid the occurrence of long latencies in most cases and thus ultimately improve drive performance and responsiveness. We conduct experiments on real Seagate 8 TB HA-SMR drives to demonstrate the advantages of the Virtual Persistent Cache using real workloads from Microsoft Research Cambridge. The results show that the proposed design can remedy most of the long latencies and improve drive performance by at least 58.11 percent under the evaluated workloads.12/14/2018 2:07 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2845383Dynamic Guardband Selection: Thermal-Aware Optimization for Unreliable Multi-Core Systems
https://www.computer.org/csdl/trans/tc/2019/01/08395346-abs.html
Circuit aging has become the major reliability concern in current and upcoming technology nodes. For instance, Bias Temperature Instability (BTI) leads to an increase in the threshold voltage of a transistor. That, in turn, may prolong the critical path delay of the processor and eventually lead to timing errors. In order to avoid aging-induced timing errors, designers employ guardbands with respect to either voltage or frequency. State-of-the-art techniques determine a <italic>guardband type</italic> at the circuit level at design time, irrespective of the workload running at the system level. Our investigation revealed that the temperatures generated by a running workload have the potential to play a key role in determining the appropriate guardband type with respect to system performance. Therefore, we propose a paradigm shift in designing guardbands: selecting the guardband types on-the-fly with respect to the workload-induced temperatures, aiming to optimize performance under temperature and reliability constraints. Moreover, different guardband types can be selected simultaneously for different cores when multiple applications with diverse properties suggest this to be useful. Our dynamic guardband selection allows for higher performance compared to techniques that employ a fixed (at design time) guardband type throughout.12/11/2018 2:06 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2848276An Aging-Aware GPU Register File Design Based on Data Redundancy
https://www.computer.org/csdl/trans/tc/2019/01/08395355-abs.html
Nowadays, GPUs sit at the forefront of high-performance computing thanks to their massive computational capabilities. Internally, thousands of functional units, architected to be fed by large register files, fuel such performance. At deep nanometer technologies, the SRAM memory cells that implement GPU register files are very sensitive to the Negative Bias Temperature Instability (NBTI) effect. NBTI ages cell transistors by degrading their threshold voltage <inline-formula><tex-math notation="LaTeX">$V_{th}$</tex-math><alternatives> <inline-graphic xlink:href="valero-ieq1-2849376.gif"/></alternatives></inline-formula> over the lifetime of the GPU. This degradation, which manifests when a cell keeps the same logic value for a relatively long period of time, compromises the cell read stability and increases the transistor switching delay; these effects can lead to wrong read values and can cause the processor cycle time to be exceeded, respectively, resulting in faulty operation. This work proposes architectural mechanisms that leverage the redundancy of the data stored in GPU register files to mitigate NBTI aging. The proposed mechanisms are based on data compression, power gating, and register address rotation techniques. All these mechanisms working together balance the distribution of logic values stored in the cells over the execution time, reducing both the overall <inline-formula><tex-math notation="LaTeX">$V_{th}$</tex-math><alternatives> <inline-graphic xlink:href="valero-ieq2-2849376.gif"/></alternatives></inline-formula> degradation and the increase in the transistor switching delays. Experimental results show that a conventional GPU register file suffers the worst case for NBTI, since a significant fraction of the cells maintain the same logic value during the entire application execution (i.e., 100 percent ‘0’ and ‘1’ duty cycle distributions).
On average, the proposal reduces these distributions by 58 and 68 percent, respectively, which translates into <inline-formula> <tex-math notation="LaTeX">$V_{th}$</tex-math><alternatives><inline-graphic xlink:href="valero-ieq3-2849376.gif"/> </alternatives></inline-formula> degradation savings by 54 and 62 percent, respectively.12/11/2018 2:06 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2849376On Analysis of Lightweight Stream Ciphers with Keyed Update
https://www.computer.org/csdl/trans/tc/2019/01/08400392-abs.html
As the need for lightweight cryptography has grown even more due to the evolution of the Internet of Things, it has become a greater challenge for cryptographers to design ultra-lightweight stream ciphers in compliance with the rule of thumb that the internal state size should be at least twice the key size to defend against generic Time-Memory-Data Tradeoff (TMDT) attacks. However, in 2015, Armknecht and Mikhalev shed new light on designing keystream generators (KSGs), which in turn yield stream ciphers, with small internal states, called <italic> KSG with Keyed Update Function (KSG with KUF)</italic>, and gave a concrete construction named Sprout. Currently, however, security analysis of KSGs with KUF in a general setting is almost non-existent. Our contribution in this paper is two-fold. 1) We give a general mathematical setting for KSGs with KUF and, for the first time, analyze a class of such KSGs, called KSGs with Boolean Keyed Feedback Function (KSG with Boolean KFF), generically. In particular, we develop two generic attack algorithms applicable to any KSG with Boolean KFF having almost arbitrary output and feedback functions, where the only requirement is that the secret key incorporation is biased. We introduce an upper bound for the time complexity of the first algorithm. Our extensive experiments validate our algorithms and the assumptions made thereof. 2) We study Sprout to show the effectiveness of our algorithms in a practical instance. A straightforward application of our generic algorithm yields one of the most successful attacks on Sprout.12/11/2018 2:05 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2851239Disaggregated Cloud Memory with Elastic Block Management
https://www.computer.org/csdl/trans/tc/2019/01/08400405-abs.html
With the growing importance of in-memory data processing, cloud service providers have launched large-memory virtual machine services to accommodate memory-intensive workloads. Such large-memory services using low-volume scaled-up machines are far less cost-efficient than scaled-out services consisting of high-volume commodity servers. By exploiting memory usage imbalance across cloud nodes, disaggregated memory can scale up the memory capacity for a virtual machine in a cost-effective way. Disaggregated memory allows available memory in remote nodes to be used for a virtual machine requiring more memory than is locally available. It supports high performance with the faster direct memory while satisfying the memory capacity demand with the slower remote memory. This paper proposes a new hypervisor-integrated disaggregated memory system for cloud computing. The hypervisor-integrated design makes several new contributions in its disaggregated memory design and implementation. First, with the tight hypervisor integration, it investigates a new page management mechanism and policy tuned for disaggregated memory in virtualized systems. Second, it restructures the memory management procedures and relieves the scalability concern for supporting large virtual machines. Third, exploiting page access records available to the hypervisor, it supports application-aware elastic block sizes for fetching remote memory pages with different granularities. Depending on the degree of spatial locality for different regions of memory in a virtual machine, the optimal block size for each memory region is dynamically selected.
The experimental results with the implementation integrated into the KVM hypervisor show that disaggregated memory incurs only 6 percent performance degradation on average compared to the ideal local-memory-only machine, even though the direct memory capacity is only 50 percent of the total memory footprint.12/11/2018 2:05 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2851565Hardware Optimizations and Analysis for the WG-16 Cipher with Tower Field Arithmetic
https://www.computer.org/csdl/trans/tc/2019/01/08409309-abs.html
This paper explores tower field constructions and hardware optimizations for the WG-16 stream cipher. The constructions <inline-formula><tex-math notation="LaTeX">${\mathbb {F}}_{(((2^2)^2)^2)^2}$</tex-math><alternatives> <inline-graphic xlink:href="zidaric-ieq1-2854757.gif"/></alternatives></inline-formula> and <inline-formula> <tex-math notation="LaTeX">${\mathbb {F}}_{(2^{4})^4}$</tex-math><alternatives> <inline-graphic xlink:href="zidaric-ieq2-2854757.gif"/></alternatives></inline-formula> were chosen because their small subfields enable high speed arithmetic implementations and their regularity provides flexibility in pipeline granularity. A design methodology is presented where the tower field constructions guide how to proceed systematically from algebraic optimizations, through initial hardware implementation, selection of submodules, pipelining, and finally detailed hardware optimizations to increase clock speed. The highest frequency WG(16, 32) keystream generator, obtained for the 65 nm ASIC library, reached a clock speed of 2.44 GHz at 26.3 kGE, and the smallest area keystream generator achieved a clock speed of 0.33 GHz at 9.9 kGE. The highest frequency FPGA implementation on a Xilinx Spartan 6 reached a clock speed of 256 MHz using 631 slices. In addition, the paper demonstrates that LFSR feedback polynomials can be optimized to increase security without hurting performance, and retiming optimizations can be used to increase clock speed without increasing area.12/11/2018 2:05 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2854757Optimal Binning for Genomics
https://www.computer.org/csdl/trans/tc/2019/01/08410020-abs.html
Genome sequencing is expected to be the most prolific source of big data in the next decade; millions of whole genome datasets will open new opportunities for biological research and personalized medicine. Genome sequences are abstracted in the form of interesting regions, describing abnormalities of the genome. The parallel execution on the cloud of complex operations for joining and mapping billions of genomic regions is increasingly important. Genome binning, i.e., partitioning of the genome into small-size segments, adapts classic data partitioning methods to genomics; region distributions to bins must reflect operation-specific correctness rules. As a consequence, determining the optimal bin size for such operations is a complex mathematical problem, whose solution requires careful modeling. The main result of this paper is the mathematical formulation and solution of the optimal binning problem for join and map operations in the context of GMQL, a query language over genomic regions; the model is validated by experiments showing its accuracy and sensitivity to the variation of operations’ parameters. We also optimize sequences of operations by inheriting the binning between two consecutive operations and we show the deployment of GMQL and the tuning of the proposed model on different cloud computing systems.12/18/2018 2:51 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2854880An Efficient Method for Calculating the Error Statistics of Block-Based Approximate Adders
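As a concrete illustration of the operation-specific correctness rules mentioned above, here is a minimal Python sketch (not GMQL's implementation) of a binned overlap join: each region is replicated to every bin it overlaps, and a result pair is emitted only in the bin containing the leftmost overlapping position, so that no pair is reported twice:

```python
from collections import defaultdict

def bins_for_region(start, end, bin_size):
    """Replicate a half-open region [start, end) to every bin it overlaps."""
    return list(range(start // bin_size, (end - 1) // bin_size + 1))

def binned_join(regions_a, regions_b, bin_size):
    """Overlap join via binning. The duplicate-avoidance rule: a pair is
    emitted only in the bin of the leftmost overlapping position."""
    index = defaultdict(list)
    for r in regions_b:
        for b in bins_for_region(r[0], r[1], bin_size):
            index[b].append(r)
    out = []
    for a in regions_a:
        for b in bins_for_region(a[0], a[1], bin_size):
            for r in index[b]:
                lo = max(a[0], r[0])            # leftmost overlapping position
                if lo < min(a[1], r[1]) and lo // bin_size == b:
                    out.append((a, r))
    return out
```

Smaller bins increase parallelism but also increase region replication, which is exactly the trade-off the paper's optimal-bin-size model captures.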
https://www.computer.org/csdl/trans/tc/2019/01/08419308-abs.html
Adders are key building blocks of many error-tolerant applications. Recently, a number of approximate adders were proposed, many of which are block-based approximate adders. For approximate circuits, besides normal metrics such as area and delay, the other important design metrics are the various error statistics, such as error rate (ER), mean error distance (MED), and mean square error (MSE). Given the popularity of block-based approximate adders, in this work, we propose an efficient method to obtain their error statistics. We first show how to calculate the ER. Then, we demonstrate an approach to obtain the error distribution, which can be used to calculate other metrics, such as MED and MSE. Our method is applicable to an arbitrary block-based approximate adder. It is accurate for uniformly distributed inputs. Experimental results also demonstrate that it produces error metrics close to the accurate ones for various types of non-uniform input distributions. Compared to the state-of-the-art algorithm for obtaining the error distributions of block-based approximate adders, for the uniform input distribution, our method improves the runtime by up to <inline-formula><tex-math notation="LaTeX">$4.8\times 10^4$</tex-math><alternatives> <inline-graphic xlink:href="qian-ieq1-2859960.gif"/></alternatives></inline-formula> times with the same accuracy; for non-uniform input distributions, it achieves a speed-up of up to 400 times with very similar accuracy.12/11/2018 2:06 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2859960Scalable Byzantine Consensus via Hardware-Assisted Secret Sharing
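For orientation, the three error statistics named above can be computed exhaustively for a toy block-based approximate adder. In this illustrative Python sketch (the specific adder, which simply drops the carry between two blocks, is an assumption for demonstration, not a design from the paper), ER, MED, and MSE follow their standard definitions over uniformly distributed inputs:

```python
def approx_add(a, b, block_bits=2):
    """Toy block-based approximate adder: the carry between the low and
    high blocks is simply dropped (the sole source of errors here)."""
    mask = (1 << block_bits) - 1
    low = ((a & mask) + (b & mask)) & mask              # carry-out dropped
    high = ((a >> block_bits) + (b >> block_bits)) << block_bits
    return high | low

def error_stats(width=4, block_bits=2):
    """Exhaustive ER, MED, and MSE over all uniformly weighted inputs."""
    n = 1 << width
    errs = [abs((a + b) - approx_add(a, b, block_bits))
            for a in range(n) for b in range(n)]
    er = sum(e != 0 for e in errs) / len(errs)
    med = sum(errs) / len(errs)
    mse = sum(e * e for e in errs) / len(errs)
    return er, med, mse
```

For 4-bit operands split into two 2-bit blocks, the dropped inter-block carry (worth 4) is lost for 6 of the 16 low-block input pairs, giving ER = 0.375, MED = 1.5, and MSE = 6.0; the paper's contribution is computing such statistics analytically instead of by this brute-force enumeration.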
https://www.computer.org/csdl/trans/tc/2019/01/08419336-abs.html
The surging interest in blockchain technology has revitalized the search for effective Byzantine consensus schemes. In particular, the blockchain community has been looking for ways to effectively integrate traditional Byzantine fault-tolerant (BFT) protocols into a blockchain consensus layer, allowing various financial institutions to securely agree on the order of transactions. However, existing BFT protocols can only scale to tens of nodes due to their $O(n^2)$ message complexity. In this paper, we propose FastBFT, a fast and scalable BFT protocol. At the heart of FastBFT is a novel message aggregation technique that combines hardware-based trusted execution environments (TEEs) with lightweight secret sharing. Combining this technique with several other optimizations (optimistic execution, tree topology, and failure detection), FastBFT achieves low latency and high throughput even for large-scale networks. Via systematic analysis and experiments, we demonstrate that FastBFT has better scalability and performance than previous BFT protocols.
12/11/2018 2:06 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2018.2860009
Minimizing Retention Induced Refresh Through Exploiting Process Variation of Flash Memory
https://www.computer.org/csdl/trans/tc/2019/01/08423091-abs.html
Refresh schemes have been the default approach in NAND flash memory to avoid data loss. Their critical issue is that they introduce additional lifetime and performance costs. Recent work proposed to minimize the refresh costs by using uniform refresh frequencies based on the number of program/erase (P/E) cycles. However, our investigation finds that the refresh costs still place a heavy burden on lifetime and performance. In this paper, a novel refresh minimization scheme is proposed that exploits the process variation (PV) of flash memory. State-of-the-art flash memory always exhibits significant PV, which introduces large variations in the retention time of flash blocks. To reduce the refresh costs, we first propose a new refresh frequency determination scheme that detects the retention time each flash block can support: if the detected retention time is long, a low refresh frequency can be applied to minimize the refresh costs. Second, considering that the retention time requirements of different data vary, we further propose a scheme that matches data hotness to refresh frequency, allocating data to blocks whose supported retention time fits the data's requirements. Simulation studies show that lifetime and performance are significantly improved compared with state-of-the-art refresh schemes.
12/11/2018 2:05 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2018.2858771
2018 Reviewers List
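The matching idea in the flash-refresh abstract above can be sketched with a tiny greedy allocator. The data structures and the best-fit policy below are illustrative assumptions, not the paper's actual scheme: hot data is rewritten soon and thus needs little retention, so it can safely live in blocks whose detected retention time is short.

```python
# Toy sketch of matching data retention requirements to block retention
# capability. Names and the greedy best-fit policy are hypothetical.
def match_data_to_blocks(data, blocks):
    """data: (name, required_retention) pairs;
    blocks: (block_id, supported_retention) pairs.
    Greedy best fit: each datum takes the weakest block that still suffices,
    preserving long-retention blocks for the data that really needs them."""
    free = sorted(blocks, key=lambda b: b[1])          # weakest blocks first
    placement = {}
    for name, need in sorted(data, key=lambda d: d[1]):
        for i, (bid, supported) in enumerate(free):
            if supported >= need:
                placement[name] = bid
                free.pop(i)
                break
    return placement
```

When a block's data is naturally rewritten before the block's retention deadline, that block needs no refresh at all, which is the cost reduction the abstract targets.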
https://www.computer.org/csdl/trans/tc/2019/01/08572816-abs.html
12/11/2018 2:05 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2018.2880170
2018 Index IEEE Transactions on Computers Vol. 67
https://www.computer.org/csdl/trans/tc/2019/01/08572823-abs.html
12/11/2018 2:05 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2018.2882120
Editorial from the Incoming Editor-in-Chief
https://www.computer.org/csdl/trans/tc/2019/01/08573058-abs.html
12/11/2018 2:06 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2018.2879421
Thank-You State of the Journal Editorial by the Outgoing Editor-in-Chief
https://www.computer.org/csdl/trans/tc/2019/01/08573061-abs.html
12/11/2018 2:05 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2018.2879420
Test of Reconfigurable Modules in Scan Networks
https://www.computer.org/csdl/trans/tc/2018/12/08357895-abs.html
Modern devices often include several embedded instruments, such as BIST interfaces, sensors, and calibration facilities. New standards, such as IEEE Std 1687, provide vehicles to access these instruments. In approaches based on reconfigurable scan networks (RSNs), instruments are coupled with scan registers, connected into chains, and interleaved with reconfigurable modules. Such modules embed reconfigurable multiplexers that permit selective access to different parts of the chain. A similar scenario is also supported by IEEE Std 1149.1-2013. Testing an RSN for permanent faults requires shifting test vectors through a certain number of network configurations. This paper presents methodologies to select the list of configurations that completely test the reconfigurable modules of the RSN. In particular, one method is presented that, by construction, can be proved to apply the test in the minimum number of clock cycles. Other methods are sub-optimal in terms of test application time (TAT) but scale well to large circuits. Experimental results on benchmark RSNs are provided to compare the proposed methods.
11/11/2018 7:16 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2018.2834915
Cube Attacks on Non-Blackbox Polynomials Based on Division Property
https://www.computer.org/csdl/trans/tc/2018/12/08357953-abs.html
The cube attack is a powerful cryptanalytic technique, especially against stream ciphers. Since the complicated structure of a stream cipher is hard to analyze directly, the cube attack usually treats the cipher as a blackbox. It is therefore an experimental attack, and the security cannot be evaluated when the cube size exceeds the experimental range, e.g., 40. In this paper, we propose cube attacks on non-blackbox polynomials. Our attacks are developed by using the division property, which has recently been applied to various block ciphers. The clear advantage is that we can exploit large cube sizes because the cipher is never regarded as a blackbox. We apply the new cube attack to Trivium, Grain128a, ACORN, and Kreyvium. As a result, the secret keys of 832-round Trivium, 183-round Grain128a, 704-round ACORN, and 872-round Kreyvium are recovered. These are the current best key-recovery attacks against these ciphers.
11/11/2018 7:16 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2018.2835480
Towards Adaptive Parallel Storage Systems
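The non-blackbox idea in the cube-attack abstract above is easiest to see on a toy polynomial whose algebraic normal form is known: XOR-summing the output over a "cube" of public (IV) variables cancels everything except the superpoly, a simple function of the key. The polynomial below is invented for illustration; the paper's contribution is doing this symbolically, via the division property, for real ciphers whose polynomials are far too large to write down.

```python
# Toy cube attack on a known Boolean polynomial over GF(2).
from itertools import product

def f(v, k):
    # Hypothetical toy "cipher output": f = v0*v1*k0 + v0*k1 + v1*v2 + k0*k1 + v2
    v0, v1, v2 = v
    k0, k1 = k
    return (v0 & v1 & k0) ^ (v0 & k1) ^ (v1 & v2) ^ (k0 & k1) ^ v2

def cube_sum(cube_vars, fixed_iv, key):
    """XOR f over all assignments of the cube variables (other IVs fixed).
    Every monomial not divisible by the full cube sums to zero, so the
    result is the superpoly of the cube evaluated at the key."""
    acc = 0
    for assign in product([0, 1], repeat=len(cube_vars)):
        v = list(fixed_iv)
        for i, bit in zip(cube_vars, assign):
            v[i] = bit
        acc ^= f(v, key)
    return acc

# Summing over the cube {v0, v1} leaves the superpoly k0 for every key:
for key in product([0, 1], repeat=2):
    assert cube_sum([0, 1], [0, 0, 0], key) == key[0]
```

An attacker who can choose IVs thus learns one bit of key information (here, k0 directly) per well-chosen cube.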
https://www.computer.org/csdl/trans/tc/2018/12/08359084-abs.html
Disk I/O is a major bottleneck limiting the performance and scalability of data-intensive applications. A common way to address disk I/O bottlenecks is to use parallel storage systems and exploit the concurrent operation of independent storage components; however, achieving consistently high parallel I/O performance is challenging due to static configurations. Modern parallel storage systems, especially in the cloud, enterprise data centers, and scientific clusters, are commonly shared by various applications generating dynamic and coexisting data access patterns. Nonetheless, these systems generally utilize a one-layout-fits-all data placement strategy, frequently resulting in suboptimal I/O parallelism. Guided by association rule mining, graph coloring, bin packing, and network flow techniques, this paper proposes a general framework for adaptive parallel storage systems, with the goal of continuously providing a high degree of I/O parallelism. Evaluation results indicate that the proposed framework is highly successful in adjusting to skewed parallel access patterns for both hard disk drive (HDD) based traditional storage arrays and solid-state drive (SSD) based all-flash arrays. Beyond storage arrays, the proposed framework is sufficiently generic to be tailored to various other parallel storage scenarios, including but not limited to key-value stores, parallel/distributed file systems, and the internal parallelism of SSDs.
11/11/2018 7:16 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2018.2836426
Test Resource Reused Debug Scheme to Reduce the Post-Silicon Debug Cost
https://www.computer.org/csdl/trans/tc/2018/12/08359333-abs.html
In this paper, a design-for-debug (DFD) method that reuses test resources is proposed to reduce the debug cost in post-silicon validation. With the proposed method, the trace buffer is shared among embedded cores, capturing the signatures of each core concurrently by reusing the test access mechanism. The depth of the trace buffer allocated to each core is reconfigurable and varies according to the debug scheme. The experimental results indicate that the proposed DFD significantly reduces the debug time when the trace buffer is shared by cores in various debug cases.
11/11/2018 7:16 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2018.2835462
Subquadratic Space Complexity Multiplier Using Even Type GNB Based on Efficient Toeplitz Matrix-Vector Product
https://www.computer.org/csdl/trans/tc/2018/12/08359428-abs.html
Multiplication schemes based on the Toeplitz matrix-vector product (TMVP) have been proposed by many researchers. A TMVP can be computed using the recursive two-way and three-way split methods, which are composed of four blocks. Among them, we improve the space complexity of the component matrix formation (CMF) block, which in turn improves the TMVP-based multiplication schemes. We also present a subquadratic space complexity $GF(2^m)$ multiplier with even-type Gaussian normal basis (GNB). To design the multiplier, we formulate field multiplication as a sum of two TMVPs and compute the sum efficiently. As a result, for type 2 and 4 GNBs, the proposed multipliers outperform other similar schemes. The proposed type 6 GNB multiplier is the first subquadratic space complexity multiplier with an explicit complexity formula.
11/11/2018 7:16 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2018.2836425
Computation of 2D 8×8 DCT Based on the Loeffler Factorization Using Algebraic Integer Encoding
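The two-way split at the core of such TMVP schemes can be sketched in a few lines: an $n\times n$ Toeplitz product $T\cdot V$ is computed from three half-size Toeplitz products instead of four, giving the subquadratic recursion. The diagonal-array representation below is one common convention, shown over the integers for readability (the paper works over $GF(2^m)$, where additions and subtractions collapse to XOR).

```python
def tmvp2(d, v):
    """Two-way split Toeplitz matrix-vector product with 3 recursive products.
    The n x n Toeplitz matrix is given by its 2n-1 diagonals d, with
    T[i][j] = d[i - j + n - 1]; n must be a power of two."""
    n = len(v)
    if n == 1:
        return [d[0] * v[0]]
    h = n // 2
    # Block form T = [[T1, T0], [T2, T1]], each block Toeplitz of size h:
    t0 = d[0:2 * h - 1]          # top-right block
    t1 = d[h:3 * h - 1]          # top-left = bottom-right block
    t2 = d[2 * h:4 * h - 1]      # bottom-left block
    v0, v1 = v[:h], v[h:]
    p1 = tmvp2(t1, [a + b for a, b in zip(v0, v1)])      # T1 (V0 + V1)
    p2 = tmvp2([a - b for a, b in zip(t0, t1)], v1)      # (T0 - T1) V1
    p3 = tmvp2([a - b for a, b in zip(t2, t1)], v0)      # (T2 - T1) V0
    return ([a + b for a, b in zip(p1, p2)]              # = T1 V0 + T0 V1
            + [a + b for a, b in zip(p1, p3)])           # = T2 V0 + T1 V1
```

The block additions/subtractions act on the shared diagonal arrays, which is why the Toeplitz structure (2n-1 values instead of n^2) is what makes the three-product trick pay off in space as well as time.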
https://www.computer.org/csdl/trans/tc/2018/12/08360454-abs.html
This paper proposes a computational method for the 2D 8×8 DCT based on algebraic integers. The proposed algorithm is based on the Loeffler 1D DCT algorithm, and it is shown to operate with exact computation (i.e., error-free arithmetic) up to the final reconstruction step (FRS). The proposed algebraic integer architecture maintains error-free computation until an entire 8×8 block of DCT coefficients is computed, unlike algorithms in the literature that claim to be error-free but in fact introduce arithmetic errors between the column- and row-wise 1D DCT stages of a 2D DCT operation. Fast algorithms are proposed for the final reconstruction step using two approaches, namely the expansion factor and dyadic approximation. A digital architecture is also proposed for a particular FRS algorithm and is implemented on an FPGA platform for on-chip verification. The FPGA implementation operates at 360 MHz and is capable of a real-time throughput of $3.6\cdot 10^8$ 8×8 2D DCTs per second, with a corresponding pixel rate of $2.3\cdot 10^{10}$ pixels per second. The digital architecture is synthesized using 180 nm CMOS standard cells and shows a chip area of 7.41 mm$^2$. The CMOS design is predicted to operate at an 893 MHz clock frequency, with a dynamic power consumption of 13.22 mW/MHz$\cdot$V$_{sup}^2$.
11/11/2018 7:16 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2018.2837755
Contention-Aware Fair Scheduling for Asymmetric Single-ISA Multicore Systems
https://www.computer.org/csdl/trans/tc/2018/12/08360522-abs.html
Asymmetric single-ISA multicore processors (AMPs), which integrate high-performance big cores and low-power small cores, have been shown to deliver higher performance per watt than symmetric multicores. Previous work has demonstrated that the OS scheduler plays an important role in realizing the potential of AMP systems. While throughput optimization on AMPs has been extensively studied, delivering fairness on these platforms still constitutes an important challenge to the OS. To this end, the scheduler must be equipped with a mechanism enabling it to accurately track the progress that each application in the workload makes as it runs on the various core types throughout the execution. In turn, this progress largely depends on the benefit (or speedup) that an application derives from a big core relative to a small one, which may differ greatly across applications. While existing fairness-aware schedulers take application relative speedup into consideration when tracking progress, they do not account for the performance degradation that may occur naturally due to contention on resources shared among cores, such as the last-level cache or the memory bus. In this paper, we propose CAMPS, a contention-aware fair scheduler for AMPs that primarily targets long-running compute-intensive workloads. Unlike other schemes, CAMPS requires neither special hardware extensions nor platform-specific speedup-prediction models to function. Our experimental evaluation, which leverages real asymmetric hardware and scheduler implementations in the Linux kernel, demonstrates that CAMPS improves fairness by up to 11 percent with respect to a state-of-the-art fairness-aware OS-level scheme, while delivering better system throughput.
11/11/2018 7:16 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2018.2836418
Access Characteristic Guided Read and Write Regulation on Flash Based Storage Systems
https://www.computer.org/csdl/trans/tc/2018/12/08362696-abs.html
NAND flash memory is now used in various storage systems, such as embedded systems, personal computers, and web servers. Developments in bit density and technology scaling have reduced its price but worsened its reliability, leading to shortened lifetime and degraded access performance. This paper proposes to exploit the access characteristics of workloads to improve flash performance and lifetime. The basic idea is to regulate read and write operations based on the identified access characteristics. First, an access cost model is presented, which exposes a tradeoff between read and write time cost on NAND flash memory. Based on the access characteristics of workloads, read-only pages are written with high cost so that they can be read with low cost, while write-only pages are written with low cost. Second, the tradeoff between read cost and flash wearing is exploited for lifetime improvement: write requests on write-only data are processed with reduced wearing by regulating the program threshold voltage. Finally, as these approaches apply different write operations to write-only data for performance and lifetime improvement, respectively, a combined approach is proposed to satisfy both goals. Simulation results show that the proposed approaches improve performance and lifetime significantly with negligible overhead.
11/11/2018 7:16 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2018.2839671
FlinkCL: An OpenCL-Based In-Memory Computing Architecture on Heterogeneous CPU-GPU Clusters for Big Data
https://www.computer.org/csdl/trans/tc/2018/12/08362980-abs.html
Research on in-memory big data management and processing has been prompted by the increase in main memory capacity and the explosion in big data. By offering an efficient in-memory distributed execution model, existing in-memory cluster computing platforms such as Flink and Spark have proven outstanding for processing big data. This paper proposes FlinkCL, an OpenCL-based in-memory computing architecture on heterogeneous CPU-GPU clusters that enables Flink to utilize the GPU's massive parallel processing ability. The proposed architecture relies on four techniques: a heterogeneous distributed abstract model (HDST), a just-in-time (JIT) compilation scheme, a hierarchical partial reduction (HPR), and a heterogeneous task management strategy. Using FlinkCL, programmers only need to write Java code with simple interfaces; the Java code is compiled to OpenCL kernels and executed on CPUs and GPUs automatically. In the HDST, a novel memory mapping scheme avoids serialization and deserialization between Java Virtual Machine (JVM) objects and OpenCL structs. We have comprehensively evaluated FlinkCL with a set of representative workloads to show its effectiveness. Our results show that FlinkCL improves performance by up to $11\times$ for some computationally heavy algorithms and maintains minor performance improvements for an I/O-bound algorithm.
11/11/2018 7:16 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2018.2839719
An Online Learning Methodology for Performance Modeling of Graphics Processors
https://www.computer.org/csdl/trans/tc/2018/12/08365819-abs.html
Approximately 18 percent of the 3.2 million smartphone applications rely on integrated graphics processing units (GPUs) to achieve competitive performance. Graphics performance, typically measured in frames per second, is a strong function of the GPU frequency, which in turn has a significant impact on mobile processor power consumption. Consequently, dynamic power management algorithms have to assess the performance sensitivity to frequency accurately in order to choose the GPU operating frequency effectively. Since the impact of GPU frequency on performance varies rapidly over time, there is a need for online performance models that can adapt to varying workloads. This paper presents a lightweight adaptive runtime performance model that predicts the frame processing time of graphics workloads at runtime without a priori characterization. We employ this model to estimate the frame time sensitivity to the GPU frequency, i.e., the partial derivative of the frame time with respect to the GPU frequency. The proposed model does not rely on any parameter learned offline. Our experiments on commercial platforms with common GPU benchmarks show that the mean absolute percentage errors in frame time and frame time sensitivity prediction are 4.2 and 6.7 percent, respectively.
11/11/2018 7:16 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2018.2840710
FlexCL: A Model of Performance and Power for OpenCL Workloads on FPGAs
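To make the frame-time-sensitivity idea from the abstract above concrete, here is a minimal online model: fit frame time as $T(f) \approx a + b/f$ with recursive least squares, then report the sensitivity $\partial T/\partial f = -b/f^2$. This is an illustrative sketch under an assumed two-parameter model, not the paper's algorithm; the forgetting factor lets the fit track workload phase changes.

```python
# Online frame-time model: T(f) ~ a + b/f, updated by recursive least
# squares with a forgetting factor (illustrative sketch, not the paper's).
class FrameTimeModel:
    def __init__(self, lam=0.95):
        self.lam = lam                         # forgetting factor
        self.P = [[1e9, 0.0], [0.0, 1e9]]      # 2x2 covariance (vague prior)
        self.w = [0.0, 0.0]                    # parameters [a, b]

    def update(self, freq, frame_time):
        x = [1.0, 1.0 / freq]                  # regressor for T = a + b/f
        Px = [self.P[0][0] * x[0] + self.P[0][1] * x[1],
              self.P[1][0] * x[0] + self.P[1][1] * x[1]]
        denom = self.lam + x[0] * Px[0] + x[1] * Px[1]
        k = [Px[0] / denom, Px[1] / denom]     # gain vector
        err = frame_time - (self.w[0] * x[0] + self.w[1] * x[1])
        self.w = [self.w[0] + k[0] * err, self.w[1] + k[1] * err]
        for i in range(2):                     # P <- (P - k x^T P) / lam
            for j in range(2):
                self.P[i][j] = (self.P[i][j] - k[i] * Px[j]) / self.lam

    def predict(self, freq):
        return self.w[0] + self.w[1] / freq

    def sensitivity(self, freq):
        return -self.w[1] / freq ** 2          # dT/df for T = a + b/f
```

A governor would use `sensitivity` to judge how much frame time is bought (or lost) by the next frequency step, trading it against the power cost of running faster.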
https://www.computer.org/csdl/trans/tc/2018/12/08365849-abs.html
Hardware acceleration is a promising approach for energy- and thermally-constrained systems. The programmable nature of FPGAs allows them to deliver high-performance and energy-efficient solutions. Unfortunately, the traditional RTL-based synthesis flow of FPGAs has prevented their wide adoption. In response, the recent adoption of the OpenCL programming model has raised the possibility of programming FPGAs in a software manner. To harness the power of FPGAs under the OpenCL programming model, it is advantageous to have an analytical model for performance analysis and design space exploration that provides insight into performance bottlenecks. To this end, this paper presents FlexCL, an analytical performance and power model for OpenCL workloads on FPGAs. FlexCL leverages static analysis of the OpenCL kernels. For performance estimation, it first develops systematic computation models for processing elements, compute units, and kernels by modeling operation scheduling, work-item and work-group scheduling, and resource constraints. It then models different global memory access patterns. Finally, FlexCL estimates overall performance by tightly coupling the memory and computation models based on the communication mode. FlexCL can also be used to guide performance and power trade-off analysis. Experiments demonstrate that the average performance and power estimation errors of FlexCL are 9.5 and 12.6 percent, respectively, for the Rodinia suite. The OpenCL model on FPGAs also exposes a rich optimization design space; with FlexCL, this space can be explored rapidly with respect to both performance and power, within seconds instead of hours or days.
11/11/2018 7:15 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2018.2840686
Fast In-Place Suffix Sorting on a Multicore Computer
https://www.computer.org/csdl/trans/tc/2018/12/08371211-abs.html
Sorting all suffixes of an input string $X$ produces the suffix array, a fundamental data structure for full-text search on $X$. To utilize the parallel computing power of a multicore machine with shared memory, this article designs a fast linear-time and in-place parallel algorithm, called pSACAK, for sorting the suffixes of an input string over a constant alphabet. This algorithm is a parallel variant of the sequential suffix sorting algorithm SACAK, which improved the linear-time SAIS to be in-place for constant alphabets and hence requires only $\mathcal{O}(K)$ workspace for alphabet size $K$. While our recent work successfully designed a parallel variant of SAIS on a multicore machine, parallelizing SACAK remained a challenge due to the strong data dependencies caused by the in-place constraint. A number of new techniques are proposed here to overcome the difficulties in deriving pSACAK from the sequential SACAK. An experimental study evaluates the performance of pSACAK against other existing parallel suffix sorting algorithms; the results show that pSACAK is the most time- and space-efficient among all algorithms in the comparison. To the best of our knowledge, pSACAK is the only linear-time and in-place parallel suffix sorting algorithm for constant alphabets reported so far.
11/11/2018 7:16 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2018.2842050
Thermal-Aware Task Mapping on Dynamically Reconfigurable Network-on-Chip Based Multiprocessor System-on-Chip
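For readers unfamiliar with the object computed in the suffix-sorting abstract above: the suffix array of $X$ lists the starting positions of $X$'s suffixes in lexicographic order. The naive construction below makes the definition concrete; it takes $O(n^2 \log n)$ time and copies suffixes, precisely the time and workspace costs that SACAK and pSACAK eliminate.

```python
# Naive suffix array construction, for illustration only: sort suffix start
# positions by the suffixes themselves. SACAK/pSACAK obtain the same array
# in linear time with O(K) extra workspace and no suffix copies.
def naive_suffix_array(x):
    return sorted(range(len(x)), key=lambda i: x[i:])

# Suffixes of "banana": banana(0), anana(1), nana(2), ana(3), na(4), a(5).
sa = naive_suffix_array("banana")   # -> [5, 3, 1, 0, 4, 2]
```

With the suffix array in hand, any pattern can be located in $X$ by binary search over `sa`, which is what makes it a workhorse of full-text indexing.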
https://www.computer.org/csdl/trans/tc/2018/12/08373717-abs.html
Dark silicon is the phenomenon that a fraction of a many-core chip has to be turned off or run in a low-power state in order to maintain a safe chip temperature. System-level thermal management techniques normally map applications onto non-adjacent cores, which hurts communication efficiency among those cores on a conventional network-on-chip (NoC). Recently, the SMART NoC architecture was proposed, enabling single-cycle multi-hop bypass channels to be built between distant cores at runtime to reduce communication latency. However, the communication efficiency of SMART NoC is diminished by communication contention, which in turn decreases system performance. In this paper, we first propose an Integer Linear Programming (ILP) model that properly addresses the communication problem and generates optimal solutions with consideration of inter-processor communication. We further present a novel heuristic task mapping algorithm for dark silicon many-core systems, called TopoMap, on top of the SMART architecture, which effectively solves the communication contention problem in polynomial time. With fine-grained consideration of chip thermal reliability and inter-processor communication, the presented approaches control the reconfigurability of the NoC communication topology during task mapping and scheduling. A thermally safe system is guaranteed by physically decentralized active cores, and communication overhead is reduced by minimizing communication contention and maximizing bypass routing. Performance evaluation on PARSEC shows the applicability and effectiveness of the proposed techniques, which achieve on average 42.5 and 32.4 percent improvement in communication and application performance, respectively, and a 32.3 percent reduction in system energy consumption compared with state-of-the-art techniques. TopoMap introduces only a 1.8 percent performance difference compared to the ILP model and is more scalable to large NoCs.
11/11/2018 7:15 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2018.2844365
Reconfigurable Instruction-Based Multicore Parallel Convolution and Its Application in Real-Time Template Matching
https://www.computer.org/csdl/trans/tc/2018/12/08375740-abs.html
Convolution is widely used in scientific computing fields such as digital image processing and machine learning. However, these applications are difficult to execute in real time because they are computationally intensive. This paper introduces a high-speed convolution solution that runs on our self-developed multicore digital signal processor (DSP). To optimize convolution capability, we propose a convolution instruction and a convolution microarchitecture in the design of a subcore. As a coprocessor, the designed subcore is integrated into a network-on-chip (NoC) based multicore DSP. In the implementation of the multicore parallel convolution, an independent convolution task-partitioning and mapping scheme is proposed. Data-block storage and software prefetching mechanisms are used to hide the data transmission time during the calculation, improving computing efficiency. We also develop a data reuse strategy that effectively reduces the data bandwidth requirements of multicore parallel convolution. The proposed methods are applied to correlation-based template matching, with the results showing that our convolution computing approach greatly improves performance compared with the same operations run on a personal computer, a TMS320C6678 processor, and an NVIDIA Quadro 1000M graphics processing unit (GPU).
11/11/2018 7:16 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2018.2844351
A Faster Software Implementation of the Supersingular Isogeny Diffie-Hellman Key Exchange Protocol
https://www.computer.org/csdl/trans/tc/2018/11/08100879-abs.html
Since its introduction by Jao and De Feo in 2011, the supersingular isogeny Diffie-Hellman (SIDH) key exchange protocol has positioned itself as a promising candidate for post-quantum cryptography. One salient feature of the SIDH protocol is that it requires exceptionally short key sizes. However, the latency associated with SIDH is higher than the ones reported for other post-quantum cryptosystem proposals. Aiming to accelerate SIDH runtime performance, we present in this work several algorithmic optimizations targeting both elliptic-curve and field arithmetic operations. We introduce, in the context of the SIDH protocol, a more efficient approach for calculating the elliptic curve operation $P+[k]Q$. Our strategy achieves a factor-1.4 speedup compared with the popular variable-three-point ladder algorithm regularly used in the SIDH shared secret phase. Moreover, profiting from pre-computation techniques, our algorithm yields a factor-1.7 acceleration for the computation of this operation in the SIDH key generation phase. We also present an optimized evaluation of the point tripling formula, and discuss several algorithmic and implementation techniques that lead to faster field arithmetic computations. A software implementation of the above improvements on an Intel Skylake Core i7-6700 processor gives a factor-1.33 speedup against the state-of-the-art software implementation of the SIDH protocol reported by Costello-Longa-Naehrig in CRYPTO 2016.
10/08/2018 5:03 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2017.2771535
Post-Quantum Key Exchange on ARMv8-A: A New Hope for NEON Made Simple
https://www.computer.org/csdl/trans/tc/2018/11/08107588-abs.html
NewHope and NewHope-Simple are two recently proposed post-quantum key exchange protocols based on the hardness of the Ring-LWE problem. Due to their high security margins and performance, there have already been discussions and proposals for integrating them into Internet standards, like TLS, and anonymity network protocols, like Tor. In this work, we present constant-time and vector-optimized implementations of NewHope and NewHope-Simple for ARMv8-A 64-bit processors, targeting high-speed applications. This architecture is implemented in a growing number of smartphone and tablet processors, and features powerful 128-bit SIMD operations provided by the NEON engine. In particular, we propose the use of three alternative modular reduction methods. They allow NEON parallelism to be better exploited by avoiding larger data types during the Number Theoretic Transform (NTT), and they remove the need to transform input coefficients into the Montgomery domain during pointwise multiplications. The NEON-vectorized NTT uses a 16-bit unsigned integer representation and executes in only 18,909 clock cycles on an ARM Cortex-A53 core. Our implementation improves previous assembly-optimized results on ARM NEON platforms and outperforms the C reference implementation on the same platform by a factor of 8.3. The total time spent on the key exchange was reduced by more than a factor of 3.5 for both protocols.
10/08/2018 2:02 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2017.2773524
CDT-Based Gaussian Sampling: From Multi to Double Precision
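To make the NTT from the NewHope abstract above concrete, here is a minimal plain-Python version over NewHope's modulus $q = 12289$, with a toy size $n = 8$ and a naive $O(n^2)$ transform; it computes a plain cyclic NTT rather than NewHope's negacyclic variant. The paper's subject is vectorizing a *fast* NTT and choosing modular reductions that keep all coefficients in 16-bit SIMD lanes.

```python
# Minimal number theoretic transform (NTT) mod NewHope's q = 12289.
# Naive O(n^2) cyclic version for clarity; NewHope uses n = 1024 and a
# negacyclic fast NTT with specialized reductions.
Q = 12289
N = 8

def primitive_root_of_unity(n, q):
    """Find w with w^n = 1 and w^(n/2) != 1 mod q (n a power of two, n | q-1)."""
    assert (q - 1) % n == 0
    for g in range(2, q):
        w = pow(g, (q - 1) // n, q)
        if pow(w, n // 2, q) != 1:
            return w

W = primitive_root_of_unity(N, Q)

def ntt(a, w):
    return [sum(a[j] * pow(w, i * j, Q) for j in range(N)) % Q
            for i in range(N)]

def intt(a, w):
    n_inv = pow(N, Q - 2, Q)                       # N^-1 mod Q (Fermat)
    w_inv = pow(w, Q - 2, Q)                       # w^-1 mod Q
    return [x * n_inv % Q for x in ntt(a, w_inv)]
```

Pointwise multiplication in the transform domain then realizes cyclic polynomial convolution mod $q$, the core operation of Ring-LWE schemes.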
https://www.computer.org/csdl/trans/tc/2018/11/08295226-abs.html
The Rényi divergence is a measure of closeness of two probability distributions that has found several applications over the last years as an alternative to the statistical distance in lattice-based cryptography. A tight bound has recently been presented for the Rényi divergence of distributions that have a bounded relative error. We show that it can be used to bound the precision requirement in Gaussian sampling to the IEEE 754 floating-point standard double precision for usual lattice-based signature parameters by using a modified cumulative distribution table (CDT), which reduces the memory needed by CDT-based algorithms and makes their constant-time implementation faster and simpler. We then apply this approach to a variable-center variant of the CDT algorithm, which occasionally requires the online computation of the cumulative distribution function. As a result, the amount of costly floating-point operations is drastically decreased, which makes the constant-time and cache-resistant variants of this algorithm viable and efficient. Finally, we provide experimental results indicating that, compared to rejection sampling, our approach increases the GPV signature rate by a factor of 4 to 8, depending on the security parameter.
10/08/2018 5:02 pm PST http://doi.ieeecomputersociety.org/10.1109/TC.2018.2807839
Practical Randomized RLWE-Based Key Exchange Against Signal Leakage Attack
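The basic CDT sampler discussed in the abstract above is easy to sketch: precompute the cumulative distribution of a truncated discrete Gaussian, then map a uniform draw through a binary search. The toy version below uses ordinary floats and a variable-time search; the paper's point is precisely how much table precision is really needed (double suffices, via the Rényi-divergence bound) and how to make the lookup constant-time and cache-resistant.

```python
# Toy CDT-based discrete Gaussian sampler (floats, variable-time search;
# illustrative only -- a production lattice-crypto sampler needs fixed
# precision and constant-time table scans).
import bisect
import math
import random

def build_cdt(sigma, tau=10):
    """Cumulative table of the discrete Gaussian truncated to [-t, t]."""
    t = int(math.ceil(tau * sigma))
    support = list(range(-t, t + 1))
    w = [math.exp(-z * z / (2.0 * sigma * sigma)) for z in support]
    total = sum(w)
    cdt, acc = [], 0.0
    for x in w:
        acc += x / total
        cdt.append(acc)
    return support, cdt

def sample(support, cdt, rng=random.random):
    u = rng()
    i = bisect.bisect_right(cdt, u)          # first index with cdt[i] > u
    return support[min(i, len(support) - 1)]  # guard against u ~ 1.0
```

The table stores one cumulative value per support point, so shrinking the required precision (the paper's contribution) directly shrinks the table and the cost of scanning it in constant time.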
https://www.computer.org/csdl/trans/tc/2018/11/08300634-abs.html
Ring Learning With Errors (RLWE)-based key exchange is one of the most efficient and secure primitives for post-quantum cryptography. One common approach to achieving key exchange over RLWE is error reconciliation. Recently, an efficient attack against reconciliation-based RLWE key exchange protocols with reused keys was proposed. This attack can recover a long-term private key if a key pair is reused. In the real world, key reuse is commonly adopted in applications like the Transport Layer Security (TLS) protocol to improve performance. Directly motivated by this attack, we construct a new randomized RLWE-based key exchange protocol that resists it. Our lightweight approach incorporates an additional ephemeral public error term into the key exchange materials, so that the attack no longer works. By mounting the same attack against our protocol, we show in practice that its signal value is indistinguishable from uniform random. We explain how the attack fails, present a parameter choice offering 200-bit classical and 80-bit quantum security, and provide efficient implementations, comparisons, and discussion. Benchmarks show that our protocol is efficient and even faster than the related vulnerable protocols.10/08/2018 5:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2808527Practical Parameters for <italic>Somewhat</italic> Homomorphic Encryption Schemes on Binary Circuits
https://www.computer.org/csdl/trans/tc/2018/11/08302942-abs.html
Post-quantum cryptography has been receiving increasing attention lately, as we must prepare alternative cryptographic solutions that will resist attacks from quantum computers. A very large effort is being made to replace the usual primitives such as encryption, signature, or authentication. This effort also brings new cryptographic features such as Somewhat or Fully Homomorphic Encryption schemes, based on lattices. Since their introduction in 2009, much of the initial burden has been overcome and real applications are now becoming possible. However, many papers suffer from the fast, constant pace of evolution on the attack side, so their parameter analyses are often incomplete or obsolete. In this work we present a thorough study of two schemes that have proven their worth, FV and SHIELD, providing a deep analysis of how to set up and size their parameters to ensure both correctness and security. Our overall aim is to provide easy-to-use guidelines for implementation purposes.10/08/2018 5:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2808962Optimizing Polynomial Convolution for NTRUEncrypt
https://www.computer.org/csdl/trans/tc/2018/11/08303667-abs.html
<inline-formula><tex-math notation="LaTeX">$\sf{ NTRUEncrypt}$</tex-math><alternatives> <inline-graphic xlink:href="zhang-ieq1-2809723.gif"/></alternatives></inline-formula> is one of the most promising candidates for quantum-safe cryptography. In this paper, we focus on the <inline-formula><tex-math notation="LaTeX"> $\sf{ NTRU743}$</tex-math><alternatives><inline-graphic xlink:href="zhang-ieq2-2809723.gif"/></alternatives> </inline-formula> parameter set. We give a report on all known attacks against this parameter set and show that it delivers 256 bits of security against classical attackers and 128 bits of security against quantum attackers. We then present a parameter-dependent optimization using a tailored hierarchy of multiplication algorithms as well as the Intel AVX2 instructions, and show that this optimization is constant-time. Our implementation is two to three times faster than the reference implementation of <inline-formula><tex-math notation="LaTeX">$\sf{ NTRUEncrypt}$</tex-math> <alternatives><inline-graphic xlink:href="zhang-ieq3-2809723.gif"/></alternatives></inline-formula>.10/08/2018 5:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2809723Constant-Time Discrete Gaussian Sampling
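The polynomial convolution that the NTRUEncrypt abstract above optimizes is multiplication in the ring Z_q[x]/(x^N − 1). The sketch below is the naive schoolbook baseline that a tailored hierarchy of multiplication algorithms (as in the paper) would replace; the tiny N and q are illustrative, not the NTRU743 parameters.

```python
def ring_mul(a, b, N, q):
    """Schoolbook cyclic convolution in Z_q[x]/(x^N - 1): exponents wrap
    around modulo N, coefficients are reduced modulo q. O(N^2) work."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            c[(i + j) % N] = (c[(i + j) % N] + a[i] * b[j]) % q
    return c

# Multiplying by x rotates the coefficient vector (x * (2 + 3x + 4x^2)
# = 4 + 2x + 3x^2 once x^3 wraps to 1).
assert ring_mul([0, 1, 0], [2, 3, 4], 3, 17) == [4, 2, 3]
```

This quadratic loop is what Karatsuba-style splitting and AVX2 vectorization, as described in the abstract, speed up; the wrap-around indexing is the part that makes the convolution cyclic.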
https://www.computer.org/csdl/trans/tc/2018/11/08314133-abs.html
Sampling from a discrete Gaussian distribution is an indispensable part of lattice-based cryptography. Several recent works have shown that the timing leakage from a non-constant-time implementation of the discrete Gaussian sampling algorithm can be exploited to recover the secret. In this paper, we propose a constant-time implementation of the Knuth-Yao random walk algorithm for performing constant-time discrete Gaussian sampling. Since the random walk is dictated by a set of input random bits, we can express the generated sample as a function of the input random bits. Hence, our constant-time implementation expresses the unique mapping of the input random bits to the output sample bits as a Boolean expression of the random bits. We use bit-slicing to generate multiple samples in batches and thus increase the throughput of our constant-time sampling manyfold. Our experiments on an Intel i7-Broadwell processor show that our method can be as much as 2.4 times faster than the constant-time implementation of cumulative distribution table based sampling and consumes exponentially less memory than the Knuth-Yao algorithm with shuffling for a similar level of security.10/08/2018 5:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2814587A High-Performance and Scalable Hardware Architecture for Isogeny-Based Cryptography
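The Knuth-Yao random walk mentioned above can be sketched as follows. `P[i][j]` holds bit j of the binary expansion of probability p_i, and the walk consumes one random bit per column. This version has the data-dependent branches and early exit that leak timing; the paper's contribution is precisely to replace this control flow with a fixed Boolean expression evaluated in bit-sliced batches. The two-outcome distribution here is a toy example, not one of the paper's Gaussian tables.

```python
def knuth_yao(bits, P):
    """Variable-time Knuth-Yao DDG-tree walk. P[i][j] is bit j of the
    binary expansion of p_i; `bits` supplies the random walk bits."""
    d = 0
    it = iter(bits)
    for j in range(len(P[0])):
        d = 2 * d + next(it)          # descend one level of the DDG tree
        for i in reversed(range(len(P))):
            d -= P[i][j]
            if d < 0:                 # data-dependent early exit: the leak
                return i
    raise ValueError("ran out of probability bits")

# Toy distribution: p0 = 0.75 (binary .11), p1 = 0.25 (binary .01).
P = [[1, 1], [0, 1]]
assert knuth_yao([0], P) == 0         # walk terminates after one bit
assert knuth_yao([1, 0], P) == 1
assert knuth_yao([1, 1], P) == 0
```

Enumerating the bit strings shows outcome 0 occurs with probability 1/2 + 1/4 = 3/4 and outcome 1 with probability 1/4, matching the table; the varying number of consumed bits is exactly the timing side channel.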
https://www.computer.org/csdl/trans/tc/2018/11/08315051-abs.html
In this work, we present a high-performance and scalable architecture for isogeny-based cryptosystems. In particular, we use the architecture in a fast, constant-time FPGA implementation of the quantum-resistant supersingular isogeny Diffie-Hellman (SIDH) key exchange protocol. On a Virtex-7 FPGA, we show that our architecture is scalable by implementing it at the 83-, 124-, 168-, and 252-bit quantum security levels. This is the first SIDH implementation at close to the 256-bit quantum security level to appear in the literature. Further, our implementation completes the SIDH protocol 2 times faster than performance-optimized software implementations and 1.34 times faster than the previous best FPGA implementation, both running a similar set of formulas. Our implementation employs inversion-free projective isogeny formulas. By replicating multipliers and utilizing an efficient scheduling methodology, we can heavily parallelize quadratic extension field arithmetic and the isogeny evaluation stage of the large-degree isogeny computation. For a constant-time implementation of 124-bit quantum security SIDH on a Virtex-7 FPGA, we generate ephemeral public keys in 8.0 and 8.6 ms and generate the shared secret key in 7.1 and 7.9 ms for Alice and Bob, respectively. Finally, we show that this architecture could also be used to efficiently generate undeniable signatures and digital signatures based on supersingular isogenies.10/08/2018 5:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2815605HEPCloud: An FPGA-Based Multicore Processor for FV Somewhat Homomorphic Function Evaluation
https://www.computer.org/csdl/trans/tc/2018/11/08318681-abs.html
In this paper, we present an FPGA-based hardware accelerator ‘<inline-formula><tex-math notation="LaTeX"> $\mathsf{HEPCloud}$</tex-math><alternatives><inline-graphic xlink:href="sinharoy-ieq1-2816640.gif"/></alternatives> </inline-formula>’ for homomorphic evaluation of medium-depth functions, which has applications in cloud computing. Our <inline-formula><tex-math notation="LaTeX">$\mathsf{HEPCloud}$</tex-math><alternatives> <inline-graphic xlink:href="sinharoy-ieq2-2816640.gif"/></alternatives></inline-formula> architecture supports the polynomial-ring-based homomorphic encryption scheme FV for a ring-LWE parameter set of dimension <inline-formula> <tex-math notation="LaTeX">$2^{15}$</tex-math><alternatives><inline-graphic xlink:href="sinharoy-ieq3-2816640.gif"/> </alternatives></inline-formula>, a modulus size of 1,228 bits, and a standard deviation of 50. This parameter set offers a multiplicative depth of 36 and at least 85-bit security. The processor of <inline-formula><tex-math notation="LaTeX"> $\mathsf{HEPCloud}$</tex-math><alternatives><inline-graphic xlink:href="sinharoy-ieq4-2816640.gif"/></alternatives> </inline-formula> is composed of multiple parallel cores. To achieve fast computation time for such a large parameter set, various optimizations at both the algorithm and architecture levels are performed. For fast polynomial multiplications, we use CRT with NTT and achieve two-dimensional parallelism in <inline-formula> <tex-math notation="LaTeX">$\mathsf{HEPCloud}$</tex-math><alternatives> <inline-graphic xlink:href="sinharoy-ieq5-2816640.gif"/></alternatives></inline-formula>. We optimize the BRAM access, use a fast Barrett-like polynomial reduction method, optimize the cost of CRT, and design a fast divide-and-round unit. Besides parallel processing, we apply a pipelining strategy in several of the sequential building blocks to reduce the impact of sequential computations. 
Finally, we implement <inline-formula><tex-math notation="LaTeX"> $\mathsf{HEPCloud}$</tex-math><alternatives><inline-graphic xlink:href="sinharoy-ieq6-2816640.gif"/></alternatives> </inline-formula> on a medium-size Xilinx Virtex 6 ML605 FPGA board and measure its on-board performance. To store the ciphertexts during a homomorphic function evaluation, we use the large DDR3 memory of the ML605 board. Our FPGA-based implementation of <inline-formula><tex-math notation="LaTeX">$\mathsf{HEPCloud}$</tex-math><alternatives> <inline-graphic xlink:href="sinharoy-ieq7-2816640.gif"/></alternatives></inline-formula> computes a homomorphic multiplication in 26.67 s, of which the actual computation takes only 3.36 s and the rest is spent on off-chip memory access. It requires about 37,551 s to evaluate the SIMON-64/128 block cipher, but the per-block timing is only about 18 s because <inline-formula><tex-math notation="LaTeX">$\mathsf{HEPCloud}$</tex-math> <alternatives><inline-graphic xlink:href="sinharoy-ieq8-2816640.gif"/></alternatives></inline-formula> processes 2,048 blocks simultaneously. The results show that FPGA-based acceleration of homomorphic function evaluations is feasible, but a fast memory interface is crucial for the performance.10/08/2018 5:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2816640Loop-Abort Faults on Lattice-Based Signature Schemes and Key Exchange Protocols
https://www.computer.org/csdl/trans/tc/2018/11/08354897-abs.html
Although postquantum cryptography is of growing practical concern, not many works have been devoted to implementation security issues related to postquantum schemes. In this paper, we look in particular at fault attacks against implementations of lattice-based signatures and key exchange protocols. For signature schemes, we are interested both in Fiat–Shamir type constructions (particularly BLISS, but also GLP, PASSSign, and Ring-TESLA) and in hash-and-sign schemes (particularly the GPV-based scheme of Ducas–Prest–Lyubashevsky). For key exchange protocols, we study the implementations of NewHope, Frodo, and Kyber. These schemes form a representative sample of modern, practical lattice-based signatures and key exchange protocols, and achieve a high level of efficiency in both software and hardware. We present several fault attacks against those schemes that achieve full key recovery with only a few faulty executions (sometimes only one), show that those attacks can be mounted in practice based on concrete experiments in hardware, and discuss possible countermeasures against them.10/08/2018 5:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2833119Guest Editors’ Introduction to the Special Issue on Cryptographic Engineering in a Post-Quantum World: State of the Art Advances
https://www.computer.org/csdl/trans/tc/2018/11/08485531-abs.html
10/08/2018 5:02 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2869611ReDO: Cross-Layer Multi-Objective Design-Exploration Framework for Efficient Soft Error Resilient Systems
https://www.computer.org/csdl/trans/tc/2018/10/08323226-abs.html
Designing soft-error-resilient systems is a complex engineering task, which nowadays follows a cross-layer approach. It requires careful planning of different fault-tolerance mechanisms at different system layers, from the technology level up to the software domain. While these design decisions have a positive effect on the reliability of the system, they usually have a detrimental effect on its size, power consumption, performance, and cost. Design space exploration for cross-layer reliability is therefore a multi-objective search problem in which reliability must be traded off against other design dimensions. This paper proposes a cross-layer multi-objective design space exploration algorithm developed to help designers build soft-error-resilient electronic systems. The algorithm exploits a system-level Bayesian reliability estimation model to analyze the effect of different cross-layer combinations of protection mechanisms on the reliability of the full system. A new heuristic based on extremal optimization theory is used to efficiently explore the design space. Two exploration strategies are proposed. The first strategy aims at optimizing the reliability of the system alone. It is suited to cases in which reaching a given reliability target is the sole goal. It focuses on finding a reduced set of system components that, when protected, allow the designer to reach the desired reliability level. As a positive effect, by reducing the number of protected components, the overhead introduced by the fault-tolerance techniques is reduced as well. The second strategy jointly considers the effect that the introduced fault-tolerance mechanisms have on the execution time, power, hardware area, and software size. This strategy supports the exploration of the design space setting multiple objectives on different design dimensions. 
An extended set of simulations shows the capability of this framework when applied both to benchmark applications and realistic systems, providing optimized systems that outperform those obtained by applying state-of-the-art cross-layer reliability techniques.09/06/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2818735Elaborate Refresh: A Fine Granularity Retention Management for Deep Submicron DRAMs
https://www.computer.org/csdl/trans/tc/2018/10/08326524-abs.html
As the DRAM cell size continues to shrink, the proportion of leaky cells is increasing. As a result, the prior approaches, called retention-aware refresh, which skip unnecessary refresh operations for non-leaky cells, are unable to skip as many refresh operations as before. The large granularity of the DRAM refresh mechanism makes this problem more serious. Specifically, even when there are only a small number of leaky cells in a particular retention group, that group is classified as a leaky group. Because of that, many non-leaky cells that also belong to that group are refreshed at an unnecessarily frequent rate. The larger the retention-group granularity, the worse this inefficiency becomes. To solve this problem, we propose a novel retention-aware refresh approach called Elaborate Refresh, which reduces the granularity of the retention group further. The key idea of Elaborate Refresh is to store leaky row addresses per chip and to refresh a different leaky row in each chip simultaneously. By doing so, Elaborate Refresh reduces the overhead of leaky-group refresh by a factor of 16. In addition, Elaborate Refresh stores retention information in the DRAM chip, thus saving refresh energy even in the self-refresh mode, when the memory controller cannot control the DRAM.09/06/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2820052A Single and Adjacent Error Correction Code for Fast Decoding of Critical Bits
https://www.computer.org/csdl/trans/tc/2018/10/08329222-abs.html
Many systems have critical bits which must be decoded at high speed; for example, flags that mark the start and end of a packet (SOP and EOP) determine subsequent actions, so they must be decoded first and fast. This paper presents a new single and adjacent error correction (SAEC) code; as the codewords have critical bits, the proposed code achieves fast decoding for them. The proposed code is a systematic code and permits shortening. This is accomplished by reducing the information bits so that columns in the H matrix can be eliminated, while still keeping both the SAEC capability and the systematic feature; for an odd number of information bits, however, an adjustment step for the critical bits is required. It is shown that the check bit length of the proposed code is nearly the same as that of the traditional (optimal) Hamming SAEC code. The decoder of the proposed SAEC code is compared with the traditional Hamming SAEC decoder; this comparison shows that, on average, the delay time for the critical bits is reduced by 6 percent compared with the traditional Hamming SAEC code (the same reduction level as a previous SEC scheme for fast decoding of critical bits over a traditional SEC code). Also, the area and power consumption of the proposed decoder show average reductions of 12 percent and 10 percent compared with the decoder of a traditional SAEC code.09/06/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2821688Real-Power Computing
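Syndrome decoding with an H matrix, the mechanism underlying both the proposed SAEC code and its Hamming baseline, can be illustrated with the classic Hamming(7,4) single-error-correcting code. This sketch is the traditional SEC baseline only, not the proposed SAEC code: the syndrome of a received word directly names the flipped position.

```python
# Hamming(7,4) parity-check matrix: column i is the binary encoding of
# position i+1, so a single-bit error's syndrome equals its position.
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

def correct(word):
    """Correct at most one flipped bit in a 7-bit word via its syndrome."""
    syndrome = sum(
        (sum(h * c for h, c in zip(row, word)) % 2) << k
        for k, row in enumerate(H)
    )
    if syndrome:                      # nonzero syndrome = 1-based error position
        word = word[:]
        word[syndrome - 1] ^= 1
    return word

# Flipping bit 3 of the all-zero codeword yields syndrome 3, which is fixed.
assert correct([0, 0, 1, 0, 0, 0, 0]) == [0] * 7
```

The paper's point is that for codewords containing critical bits, the decoder's column arrangement can be chosen so the syndrome logic feeding those bits is shallower, shaving delay off exactly this kind of circuit.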
https://www.computer.org/csdl/trans/tc/2018/10/08330023-abs.html
The traditional goal in embedded systems is to minimize energy consumption subject to hard or soft real-time deadlines. The basic principle is to transform the uncertainties of task execution times in the <italic>real</italic> world into energy-saving opportunities. The energy saving is achieved by suitably controlling the reliable power supply at circuit or system level with the aim of minimizing slack times, while meeting the specified performance requirements. The computing paradigm of emerging ubiquitous systems, particularly energy-harvested ones, has clearly shifted from that of traditional systems. The energy supply of these systems can vary temporally and spatially within a dynamic range, making computation extremely challenging. Such a paradigm shift requires disruptive approaches to design computing systems that can provide continued functionality under an unreliable supply-power envelope and operate with autonomous survivability (i.e., the ability to automatically guarantee retention and/or completion of a given computation task). In this paper, we introduce <italic>Real-Power Computing</italic>, inspired by the above trends and tenets. We show how computation systems must be designed with power proportionality to achieve sustained computation and survivability when operating at extreme power conditions. We present extensive analysis of the need for this new computing approach using definitions, where necessary, coupled with detailed taxonomies, empirical observations, a review of relevant research works, and example scenarios using three case studies representing the proposed paradigm.09/06/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2822697Fast Data Delivery for Many-Core Processors
https://www.computer.org/csdl/trans/tc/2018/10/08331120-abs.html
Server workloads operate on large volumes of data. As a result, processors executing these workloads encounter frequent L1-D misses. In a many-core processor, an L1-D miss causes a request packet to be sent to an LLC slice and a response packet to be sent back to the L1-D, which results in high overhead. While prior work targeted response packets, this work focuses on accelerating the request packets. Unlike aggressive out-of-order (OoO) cores, the simpler cores used in many-core processors cannot hide the latency of L1-D request packets. We observe that the LLC slices that serve L1-D misses are strongly temporally correlated. Taking advantage of this observation, we design a simple and accurate predictor. Upon the occurrence of an L1-D miss, the predictor identifies the LLC slice that will serve the next L1-D miss, and a circuit is set up for the upcoming miss request to accelerate its transmission. When the upcoming miss occurs, the resulting request can use the already established circuit for transmission to the LLC slice. We show that our proposal outperforms data prefetching mechanisms in a many-core processor due to (1) higher prediction accuracy and (2) not wasting valuable off-chip bandwidth, while requiring significantly less overhead. Using full-system simulation, we show that our proposal accelerates the serving of data misses by 22 percent and leads to 10 percent performance improvement over the state-of-the-art network-on-chip.09/06/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2821144A GPU-Aware Parallel Index for Processing High-Dimensional Big Data
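The temporal-correlation idea in the abstract above can be illustrated with a toy last-pair predictor: remember which LLC slice followed each slice last time, and predict that pair will recur. The structure below is a hypothetical sketch for intuition, not the paper's predictor design.

```python
class SlicePredictor:
    """Toy last-pair predictor: table[s] records the slice observed
    immediately after slice s, exploiting temporal correlation of misses."""

    def __init__(self):
        self.table = {}
        self.last = None

    def predict(self):
        """Guess which slice will serve the next miss (None if unknown)."""
        return self.table.get(self.last)

    def update(self, slice_id):
        """Record the slice that actually served the current miss."""
        if self.last is not None:
            self.table[self.last] = slice_id
        self.last = slice_id

p = SlicePredictor()
for s in [1, 2, 3, 1]:                # a repeating miss pattern warms the table
    p.update(s)
assert p.predict() == 2               # after slice 1, slice 2 followed last time
```

In the paper's setting, a correct prediction lets the network set up a circuit before the next request arrives, hiding the request packet's traversal latency.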
https://www.computer.org/csdl/trans/tc/2018/10/08332539-abs.html
The curse of dimensionality has long been an open challenge for processing large high-dimensional datasets. Numerous research efforts have been proposed for improving query performance in high-dimensional space through hierarchical indexing using the R-tree or its variants and exploring parallel processing of the R-tree on GPUs. Despite these existing efforts, the curse of dimensionality remains a grand challenge, since existing methods deteriorate drastically as the dimensionality of datasets increases. To cope with this problem, we present a novel GPU-aware parallel indexing method called G-tree, which offers consistent and stable performance in high-dimensional space. The rationale of the G-tree is to combine the efficiency of the R-tree in low-dimensional space with the massive parallel processing potential of GPUs by introducing a new data structure and three new optimization techniques to better utilize the GPU memory structure for accelerating both index search and index node access on GPUs. The first two optimizations promote effective parallelism utilization in GPU memory access. We dedicate the third optimization to further speeding up the G-tree index by conducting progressive filtering using our dimension filters. We evaluate the validity of the G-tree approach by extensive experiments on high-dimensional datasets, showing that the G-tree outperforms the existing state-of-the-art techniques.09/06/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2823760First-Last: A Cost-Effective Adaptive Routing Solution for TSV-Based Three-Dimensional Networks-on-Chip
https://www.computer.org/csdl/trans/tc/2018/10/08333738-abs.html
3D integration opens up new opportunities for future multiprocessor chips by enabling fast and highly scalable 3D Network-on-Chip (NoC) topologies. However, to reduce the cost of through-silicon vias (TSVs), partially vertically connected NoCs, in which only a few vertical TSV links are available, have been gaining relevance. To reliably route packets under such conditions, we introduce a lightweight, efficient, and highly resilient adaptive routing algorithm targeting partially vertically connected 3D-NoCs, named First-Last. It requires a very low number of virtual channels (VCs) to achieve deadlock freedom (2 VCs in the East and North directions and 1 VC in all other directions), and guarantees packet delivery as long as one healthy TSV connecting all layers is available anywhere in the network. An improved version of our algorithm, named Enhanced-First-Last, is also introduced and shown to dramatically improve performance under low TSV availability while still using fewer virtual channels than state-of-the-art algorithms. A comprehensive evaluation of the cost and performance of our algorithms is performed to demonstrate their merits with respect to existing solutions.09/06/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2822269Thread Voting DVFS for Manycore NoCs
https://www.computer.org/csdl/trans/tc/2018/10/08338086-abs.html
We present a thread-voting DVFS technique for manycore networks-on-chip (NoCs). This technique has two remarkable features which differentiate it from conventional NoC DVFS schemes. (1) Not only network-level but also thread-level runtime performance indicators are used to guide DVFS decisions. (2) To resolve multiple, possibly conflicting performance indicators from many cores, it allows each thread to “vote” for a V/F level in its own performance interest, and a region-based V/F controller makes a dynamic per-region V/F decision according to the majority vote. We evaluate our technique on a 64-core CMP in the full-system simulation environment gem5 with both PARSEC and SPEC OMP2012 benchmarks. Compared to a network-metric (router buffer occupancy) based approach, it can improve the network energy efficiency measured in MPPJ (million packets per joule) by up to 22 percent for PARSEC and 20 percent for SPEC OMP2012, and the system energy efficiency measured in MIPJ (million instructions per joule) by up to 35 percent for PARSEC and 33 percent for SPEC OMP2012.09/06/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2827039Subpage-Aware Solid State Drive for Improving Lifetime and Performance
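The per-region decision rule described above, each thread votes for a V/F level and the region controller applies the majority, can be sketched in a few lines. This is a minimal illustration of the voting step only; how threads choose their votes and how regions are formed are the substance of the paper and are not modeled here.

```python
from collections import Counter

def region_vf_level(votes):
    """Resolve conflicting per-thread V/F votes for one region by
    taking the most common requested level (the majority vote)."""
    return Counter(votes).most_common(1)[0][0]

# Five threads in a region vote for V/F levels; level 3 wins 3-to-2.
assert region_vf_level([3, 3, 1, 2, 3]) == 3
```

A real controller would also need a tie-breaking policy (e.g., favoring the higher level to protect latency-sensitive threads); `Counter.most_common` alone does not encode one.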
https://www.computer.org/csdl/trans/tc/2018/10/08338090-abs.html
The manufacturers of NAND flash-based solid-state drives (SSDs) are increasing capacity and throughput by enlarging the page size, which is the minimum I/O unit in the NAND flash chips. Because the host and NAND flash chips have different I/O granularity units, the number of subpage requests increases. However, these subpage requests, especially writes, can cause internal fragmentation and endurance problems. Furthermore, subpage write requests inevitably involve read-modify-write (RMW) operations that increase the write response time because of the out-of-place update feature of the NAND flash chips. In this paper, we propose a subpage-aware SSD that increases lifetime and performance by reducing the number of NAND writes and eliminating unnecessary RMW operations. Our scheme attempts to merge subpage write requests into full-page write requests in the write buffer to reduce the number of NAND writes, and adds size information to the mapping table to detect unnecessary RMW operations. Our proposed scheme reduces the number of NAND writes by up to 30 percent (19 percent on average) and the write response time by up to 22 percent (13 percent on average).09/06/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2827033Robust Mixed-Criticality Systems
https://www.computer.org/csdl/trans/tc/2018/10/08352522-abs.html
Certification authorities require correctness and survivability. In the temporal domain this requires a convincing argument that all deadlines will be met under error-free conditions, and that when certain defined errors occur, the behaviour of the system is still predictable and safe. This means that occasional execution-time overruns should be tolerated, and where more severe errors occur, levels of graceful degradation should be supported. With mixed-criticality systems, fault tolerance must be criticality aware, i.e., some tasks should degrade less than others. In this paper a quantitative notion of robustness is defined, and it is shown how fixed priority-based task scheduling can be structured to maximise the likelihood of a system remaining fail operational or fail robust (the latter implying that an occasional job may be skipped if all other deadlines are met). Analysis is developed for fail-operational and fail-robust behaviour, optimal priority ordering is addressed, and an experimental evaluation is described. Overall, the approach presented allows robustness to be balanced against schedulability. A designer would thus be able to explore the design space so defined.09/06/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2831227A Cost-Effective Distribution-Aware Data Replication Scheme for Parallel I/O Systems
https://www.computer.org/csdl/trans/tc/2018/10/08352821-abs.html
As the data volumes of high-performance computing applications continuously increase, low I/O performance becomes a fatal bottleneck for these data-intensive applications. Data replication is a promising approach to improving parallel I/O performance. However, most existing strategies are designed based on the assumption that contiguous requests are served more efficiently than non-contiguous requests, which is not necessarily true in a parallel I/O system. The reason is that multi-server data distribution makes it indeterminate whether contiguous or non-contiguous requests are served more efficiently. In this study, we propose CEDA, a cost-effective distribution-aware data replication scheme to better support parallel I/O systems. As logical file-access information is insufficient for making replication decisions in a parallel environment, CEDA considers physical data accesses on servers in both data selection and data placement during a parallel replication process. Specifically, CEDA first proposes a distribution-aware cost model to evaluate the file request time with a given data layout, and then it carries out cost-effective data replication based on replication benefit analysis. We have implemented CEDA as a part of the MPI I/O library, for high portability, on top of the OrangeFS file system. By replaying representative benchmarks and a real application, we collected comprehensive experimental results on both HDD- and SSD-based servers and conclude that CEDA can significantly improve parallel I/O system performance.09/06/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2831689Unbiased Rounding for HUB Floating-Point Addition
https://www.computer.org/csdl/trans/tc/2018/09/08300633-abs.html
Half-Unit-Biased (HUB) is an emerging format based on shifting the represented numbers by half a unit in the last place (ULP). This format simplifies two's complement and round-to-nearest operations by preventing any carry propagation, which reduces power consumption, time, and area. Taking into account that the IEEE floating-point standard uses unbiased rounding as the default mode, this feature is also desirable for HUB approaches. In this paper, we study unbiased rounding for HUB floating-point addition, both as a standalone operation and within a fused multiply-add (FMA). We show two different alternatives to eliminate the bias when rounding the sum results, either partially or totally. We also present an error analysis and the implementation results of the proposed architectures to help designers decide which option is best for them.08/07/2018 2:07 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2807429A Simulation Analysis of Redundancy and Reliability in Primary Storage Deduplication
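The half-ULP shift at the heart of the HUB format can be checked numerically: a stored integer m with f fractional bits represents (m + 0.5)/2^f, so simply truncating low bits stays within half an ULP of the exact value (i.e., truncation acts like round-to-nearest), and over uniformly distributed operands the errors cancel. This is a toy check of the format's basic property, not the paper's adder or its bias-elimination techniques.

```python
def hub_value(m, f):
    """Value represented by HUB integer m with f fractional bits:
    the stored bits plus an implicit half-ULP offset."""
    return (m + 0.5) / 2**f

errors = []
for m in range(16):
    exact = hub_value(m, 4)           # 4 fractional bits
    trunc = hub_value(m >> 2, 2)      # drop 2 bits by plain truncation
    assert abs(trunc - exact) <= 0.5 / 2**2   # within half an ULP
    errors.append(trunc - exact)

assert sum(errors) == 0.0             # errors cancel: truncation is unbiased here
```

The cancellation holds because the per-group errors are (1.5 − r)/16 for r in {0, 1, 2, 3}, which sum to zero; the paper's concern is the residual bias that appears in HUB *addition*, where the discarded bits are not uniformly distributed.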
https://www.computer.org/csdl/trans/tc/2018/09/08300656-abs.html
Deduplication has been widely used to improve storage efficiency in modern primary and secondary storage systems, yet how deduplication fundamentally affects storage system reliability remains debatable. This paper aims to analyze and compare storage system reliability with and without deduplication in primary workloads, using public file system snapshots from two research groups. We first study the redundancy characteristics of the file system snapshots. We then propose a trace-driven, deduplication-aware simulation framework to analyze data loss at both the chunk and file levels due to sector errors and whole-disk failures. Compared to storage without deduplication, our analysis shows that deduplication consistently reduces the damage of sector errors due to intra-file redundancy elimination, but potentially increases the damage of whole-disk failures if the highly referenced chunks are not carefully placed on disk. To improve reliability, we examine a deliberate copy technique that stores the most referenced chunks in a small dedicated physical area (e.g., 1 percent of the physical capacity) and repairs them first, and we demonstrate its effectiveness through our simulation framework.08/07/2018 2:07 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2808496Towards a Cryptographic Minimal Design: The sLiSCP Family of Permutations
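The reliability trade-off in the abstract above stems from reference counting: after deduplication, one physical chunk may back many logical chunks, so losing it damages every file that references it. A minimal content-addressed sketch (hash choice and structure are illustrative, not the paper's simulator):

```python
import hashlib

def dedupe(chunks):
    """Content-addressed store: identical chunks share one physical copy;
    refs counts how many logical chunks point at each stored copy."""
    store, refs = {}, {}
    for c in chunks:
        h = hashlib.sha256(c).hexdigest()
        store.setdefault(h, c)        # keep one physical copy per content hash
        refs[h] = refs.get(h, 0) + 1  # count logical references to that copy
    return store, refs

store, refs = dedupe([b"aaa", b"bbb", b"aaa", b"aaa"])
assert len(store) == 2                 # only two unique chunks are stored
assert sorted(refs.values()) == [1, 3] # b"aaa" is a highly referenced chunk
```

The deliberate copy technique examined in the paper targets exactly the high-`refs` entries: placing them in a small dedicated area and repairing them first caps the blast radius of a whole-disk failure.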
https://www.computer.org/csdl/trans/tc/2018/09/08305605-abs.html
The security of highly resource-constrained applications is often viewed in the literature from the single aspect of a specific cryptographic primitive. More precisely, most proposed lightweight cryptographic primitives focus on providing a single functionality within the hardware area available for security purposes. In this paper, we argue that for such applications, a cryptographic primitive that follows the <italic>cryptographic minimal design </italic> strategy may be the only realistically adoptable security solution when there is a constrained gate-equivalent (GE) budget for all security functionalities. Indeed, it is reasonable, if not desirable, for the adopted cryptographic design to have well-justified building components and to provide minimal overhead for multiple cryptographic functionalities, including encryption, hashing, authentication, and pseudorandom bit generation. Following this strategy, we propose the <inline-formula><tex-math notation="LaTeX">$\sf{sLiSCP}$</tex-math><alternatives> <inline-graphic xlink:href="altawy-ieq1-2811467.gif"/></alternatives></inline-formula> family of lightweight cryptographic permutations, which employs two of the most hardware-efficient and extensively cryptanalyzed constructions, namely a 4-subblock Type-2 Generalized Feistel-like Structure (GFS) and round-reduced unkeyed Simeck. 
In addition to the hardware efficiency, we follow restrictive security design goals which enable us to provide resistance against differential and linear cryptanalysis, as well as guaranteed resistance to diffusion-based, algebraic, and self-symmetry distinguishers, and accordingly, we claim that there exist no structural distinguishers for <inline-formula><tex-math notation="LaTeX">$\sf{sLiSCP}$</tex-math><alternatives> <inline-graphic xlink:href="altawy-ieq2-2811467.gif"/></alternatives></inline-formula>-<inline-formula> <tex-math notation="LaTeX">$b$</tex-math><alternatives><inline-graphic xlink:href="altawy-ieq3-2811467.gif"/> </alternatives></inline-formula> with a complexity below <inline-formula><tex-math notation="LaTeX">$2^{b/2}$</tex-math> <alternatives><inline-graphic xlink:href="altawy-ieq4-2811467.gif"/></alternatives></inline-formula> where <inline-formula><tex-math notation="LaTeX">$b$</tex-math><alternatives> <inline-graphic xlink:href="altawy-ieq5-2811467.gif"/></alternatives></inline-formula> is the state size. Moreover, we present the <inline-formula><tex-math notation="LaTeX">$\sf{sLiSCP}$</tex-math><alternatives> <inline-graphic xlink:href="altawy-ieq6-2811467.gif"/></alternatives></inline-formula> duplex sponge mode to illustrate how the permutations can be used in a unified design that provides (authenticated) encryption, hashing, and pseudorandom bit generation functionalities. Finally, we report two efficient parallel hardware implementations for the <inline-formula><tex-math notation="LaTeX">$\sf{sLiSCP}$</tex-math><alternatives> <inline-graphic xlink:href="altawy-ieq7-2811467.gif"/></alternatives></inline-formula> unified duplex sponge mode when using <inline-formula><tex-math notation="LaTeX">$\sf{sLiSCP}$</tex-math><alternatives> <inline-graphic xlink:href="altawy-ieq8-2811467.gif"/></alternatives></inline-formula>-192 (resp. 
<inline-formula> <tex-math notation="LaTeX">$\sf{sLiSCP}$</tex-math><alternatives><inline-graphic xlink:href="altawy-ieq9-2811467.gif"/> </alternatives></inline-formula>-256) in CMOS <inline-formula><tex-math notation="LaTeX">$65$</tex-math><alternatives> <inline-graphic xlink:href="altawy-ieq10-2811467.gif"/></alternatives></inline-formula> nm ASIC with area of 2289 (resp. 3039) GE and a throughput of 29.62 (resp. 44.44) kbps, and their areas in CMOS <inline-formula> <tex-math notation="LaTeX">$130$</tex-math><alternatives><inline-graphic xlink:href="altawy-ieq11-2811467.gif"/> </alternatives></inline-formula> nm are 2498 (resp. 3319) GE.08/07/2018 2:08 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2811467Network Synthesis for Distributed Embedded Systems
https://www.computer.org/csdl/trans/tc/2018/09/08307094-abs.html
The remarkable proliferation of communication technologies for embedded systems opens the way for completely new applications but forces designers to adopt new methodologies to meet time-to-market constraints. Computer-Aided Design (CAD) has traditionally been applied to computers and embedded systems <italic>in isolation</italic>, without considering them as a global interconnected system. This paper helps fill this gap by proposing <italic>1) </italic> a communication-aware design flow for network-interconnected embedded systems and <italic>2)</italic> a formal framework to efficiently synthesize their network aspects by formulating and solving an optimization problem. The presented case studies show the potential of the proposed approach to address heterogeneous scenarios, ranging from smart spaces to the Internet of Things.08/07/2018 2:07 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2812797FFT-Based McLaughlin's Montgomery Exponentiation without Conditional Selections
https://www.computer.org/csdl/trans/tc/2018/09/08307235-abs.html
Modular multiplication forms the basis of many cryptographic functions such as RSA, Diffie-Hellman key exchange, and ElGamal encryption. For large RSA moduli, combining the fast Fourier transform (FFT) with McLaughlin's Montgomery modular multiplication (MLM) has been shown to offer cost-effective implementations. However, the conditional selections in McLaughlin's algorithm are considered inefficient and vulnerable to timing attacks, since extra long additions or subtractions may take place and the running time of MLM varies. In this work, we restrict the parameters of MLM with a set of new bounds and present a modified MLM algorithm involving no conditional selections. Compared to the original MLM algorithm, we eliminate the extra operations caused by the conditional selections and achieve constant running time for modular multiplications with different inputs. As a result, we improve both area-time efficiency and security against timing attacks. Based on the proposed algorithm, efficient FFT-based modular multiplication and exponentiation are derived. Exponentiation architectures with dual FFT-based multipliers are designed, yielding area-latency-efficient solutions. The results show that our work offers better efficiency than state-of-the-art works for operand sizes of 2,048 bits and above. For single FFT-based modular multiplication, we achieve constant running time and area-latency efficiency improvements of up to 24.3 percent for 1,024-bit and 35.5 percent for 4,096-bit operands, respectively.08/07/2018 2:08 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2811466A Hybrid Multicast Routing Approach with Enhanced Methods for Mesh-Based Networks-on-Chip
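Since the abstract above centers on removing MLM's data-dependent final correction, a textbook sketch may help: the snippet below shows generic word-level Montgomery reduction (REDC), including the conditional subtraction that constant-time variants such as the proposed algorithm eliminate. This is an illustrative sketch only, not McLaughlin's FFT-based algorithm; the function name and toy parameters are assumptions.

```python
# Generic Montgomery reduction (REDC): computes t * r^{-1} mod n, where
# r is a power of two coprime to n and n_prime = -n^{-1} mod r.
def montgomery_redc(t, n, r, n_prime):
    m = (t * n_prime) % r
    u = (t + m * n) // r          # exact division: t + m*n ≡ 0 (mod r)
    # The data-dependent branch below is the timing-attack concern that
    # the paper's modified bounds are designed to remove.
    if u >= n:
        u -= n
    return u

# Toy parameters: n odd, r = 2^7 > n (requires Python 3.8+ for pow mod-inverse).
n = 97
r = 1 << 7
n_prime = (-pow(n, -1, r)) % r

# Montgomery multiply a*b mod n: map into the Montgomery domain and back.
a, b = 40, 7
a_m = (a * r) % n
b_m = (b * r) % n
prod_m = montgomery_redc(a_m * b_m, n, r, n_prime)   # = a*b*r mod n
result = montgomery_redc(prod_m, n, r, n_prime)      # leave the domain
assert result == (a * b) % n
```

In hardware, avoiding the `if u >= n` correction (by keeping intermediate values inside tighter bounds, as the paper does for MLM) is what yields input-independent running time.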
https://www.computer.org/csdl/trans/tc/2018/09/08309347-abs.html
Multicast communication can greatly enhance the performance of Networks-on-Chip. Currently, most multicast routing algorithms are either tree-based or path-based. The former have low latency but need additional hardware resources to resolve multicast deadlocks. The latter avoid deadlocks easily but may require long routing paths. In this paper, we propose a hybrid multicast routing approach that combines the advantages of both path- and tree-based methods. The proposed approach ensures deadlock-free multicast routing without requiring additional virtual channels or large buffers to hold large packets. High routing performance is achieved using an adaptive routing strategy that considers the traffic load in nearby routers. Two techniques, node balancing and path balancing, are further developed to enhance this hybrid routing algorithm. Extensive experiments with different buffer sizes, packet sizes, and numbers of destinations per packet under random and Rent's-rule traffic at various traffic injection rates have been conducted. The results show that the average latency of our approach is lower than that of previous multicast routing algorithms in most cases, and that the saturation points of our approach always occur at much higher injection rates.08/07/2018 2:07 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2813394WASP: Selective Data Prefetching with Monitoring Runtime Warp Progress on GPUs
https://www.computer.org/csdl/trans/tc/2018/09/08309426-abs.html
This paper proposes a new data prefetching technique for Graphics Processing Units (GPUs) called Warp-Aware Selective Prefetching (WASP). The main idea of WASP is to dynamically select warps whose progress is slower than that of the current warp as prefetching target warps. Under the in-order instruction execution model of GPUs, these target warps will certainly execute the same load as the current warp. Exploiting this, WASP prefetches data for the target warps, allowing the prefetched data to be accessed accurately. To track warp progress simply, WASP monitors the count of dynamic load executions for each warp. When a warp executes a load, WASP finds the warps with lower load execution counts than the current warp and generates prefetch requests for them. In our evaluation, WASP achieves a 16.8 percent speedup over the baseline GPU.08/07/2018 2:08 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2813379Queuing-Based eDRAM Refreshing for Ultra-Low Power Processors
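The warp-selection step described above can be sketched in a few lines. The `Warp` class and `select_prefetch_targets` function below are hypothetical illustrations of comparing per-warp dynamic load counts, not the paper's hardware implementation.

```python
# Hypothetical model of WASP's target-warp selection: each warp tracks how
# many dynamic load instructions it has executed so far.
class Warp:
    def __init__(self, wid):
        self.wid = wid
        self.load_count = 0   # dynamic load executions so far

def select_prefetch_targets(current, warps):
    """When `current` executes a load, warps with a lower load count are
    lagging and, under the in-order execution model, will execute the same
    load later -- so their data can be prefetched now."""
    return [w for w in warps
            if w is not current and w.load_count < current.load_count]

warps = [Warp(i) for i in range(4)]
for w, n in zip(warps, (5, 3, 5, 2)):
    w.load_count = n

targets = select_prefetch_targets(warps[0], warps)
print(sorted(w.wid for w in targets))   # prints [1, 3]: warps 1 and 3 lag
```

Warp 2 is excluded because it has made the same progress as warp 0, so a prefetch for it would be redundant with warp 0's own demand access.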
https://www.computer.org/csdl/trans/tc/2018/09/08310027-abs.html
Ultra-low-power processors designed to work at very low voltage are the enablers of the Internet of Things (IoT) era. Their internal memories, usually implemented in static random access memory (SRAM) technology, stop functioning properly at low voltage. Some recent commercial products have replaced SRAM with embedded DRAM (eDRAM), in which stored data decay over time, thus requiring periodic refreshing that causes performance loss. This article presents a queuing-based opportunistic refreshing algorithm that eliminates most, if not all, of this performance loss and is shown to be optimal. The queues used for refreshing miss refreshing opportunities not only when they are saturated but also when they are empty, which increases the probability of performance loss. We examine the optimal policy for handling saturated and empty queues, and the ways in which system performance depends on queue capacity and memory size. This analysis yields a closed-form performance expression capturing read/write probabilities, memory size, and queue capacity, leading to optimization of the CPU-internal memory architecture.07/10/2018 2:07 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2811470Aging-Aware Boosting
https://www.computer.org/csdl/trans/tc/2018/09/08319494-abs.html
DVFS-based boosting techniques have been widely employed by commercial multi-core processors due to their superiority in improving performance. <italic>Boosting</italic>, however, particularly stresses circuits and hence significantly contributes to accelerated aging. Circuit aging has become a real reliability concern because it leads to an increase in transistor threshold voltage that may cause timing errors as a result of higher delays in critical paths. Thus, high performance is desirable but shortens circuit lifetime through aging, leaving a trade-off to be made. Besides the well-known long-term aging effects, recent research has also reported short-term aging effects. Our claim is that DVFS-based boosting techniques should consider both long- and short-term aging effects. These effects could be circumvented by wider timing guardbands, but that would be more expensive. The goal of this work is therefore to analyze and optimize <italic>boosting</italic> under specific consideration of long-term and short-term aging effects. As a result of our findings, we propose the first comprehensive, yet efficient, aging-aware boosting technique. The aging-aware cell libraries employed in this work are publicly available at <uri> http://ces.itec.kit.edu/dependable-hardware.php</uri>.08/07/2018 2:07 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2816014A Stochastic Computational Multi-Layer Perceptron with Backward Propagation
https://www.computer.org/csdl/trans/tc/2018/09/08319953-abs.html
Stochastic computation has recently been proposed for implementing artificial neural networks with reduced hardware and power consumption, but at decreased accuracy and processing speed. Most existing implementations are based on pre-training, such that the weights are predetermined for neurons at different layers; these implementations therefore lack the ability to update the values of the network parameters. In this paper, a stochastic computational multi-layer perceptron (SC-MLP) is proposed by implementing the backward propagation algorithm for updating the layer weights. Using extended stochastic logic (ESL), a reconfigurable stochastic computational activation unit (SCAU) is designed to implement different types of activation functions such as the <inline-formula><tex-math notation="LaTeX">$tanh$ </tex-math><alternatives><inline-graphic xlink:href="han-ieq1-2817237.gif"/></alternatives></inline-formula> and the rectifier function. A triple modular redundancy (TMR) technique is employed to reduce the random fluctuations inherent in stochastic computation. A probability estimator (PE) and a divider based on TMR and a binary search algorithm are further proposed with progressive precision, reducing the required stochastic sequence length. The latency and energy consumption of the SC-MLP are thereby significantly reduced. The simulation results show that the proposed design is capable of implementing both the training and inference processes. For the classification of nonlinearly separable patterns, with a slight accuracy loss of 1.32-1.34 percent, the proposed design requires only 28.5-30.1 percent of the area and 18.9-23.9 percent of the energy consumption of a design using floating-point arithmetic. Compared to a fixed-point implementation, the SC-MLP consumes a smaller area (40.7-45.5 percent) and less energy (38.0-51.0 percent) with a similar processing speed and a slight accuracy drop of 0.15-0.33 percent. 
The area and energy consumption of the proposed design are 80.7-87.1 percent and 71.9-93.1 percent, respectively, of those of a binarized neural network (BNN) with similar accuracy.08/07/2018 2:08 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2817237Cloudlets Activation Scheme for Scalable Mobile Edge Computing with Transmission Power Control and Virtual Machine Migration
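As background to the stochastic-computing idea this abstract relies on, the sketch below shows the classic unipolar encoding, in which a value in [0, 1] becomes a random bitstream and multiplication reduces to a bitwise AND. It is a generic illustration, not the paper's ESL/SCAU design; the function names are assumptions.

```python
# Unipolar stochastic computing: encode p in [0,1] as a bitstream whose
# fraction of 1s is p; an AND gate then acts as a multiplier, since
# P(x AND y = 1) = P(x=1) * P(y=1) for independent streams.
import random

def to_stream(p, length, rng):
    return [1 if rng.random() < p else 0 for _ in range(length)]

def stream_value(bits):
    return sum(bits) / len(bits)

rng = random.Random(42)        # seeded for reproducibility
length = 10_000                # longer streams -> lower random fluctuation
a = to_stream(0.6, length, rng)
b = to_stream(0.5, length, rng)
prod = [x & y for x, y in zip(a, b)]   # AND gate == multiplier
est = stream_value(prod)               # approximately 0.6 * 0.5 = 0.3
```

The "random fluctuations" the paper mitigates with TMR are exactly the sampling noise visible here: `est` only approaches 0.3 as `length` grows, which is why reducing the required sequence length (via the PE and progressive precision) cuts latency and energy.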
https://www.computer.org/csdl/trans/tc/2018/09/08322166-abs.html
Mobile devices have several restrictions due to design choices that guarantee their mobility. One way of overcoming such limitations is to utilize cloud servers, called cloudlets, at the edge of the network through Mobile Edge Computing. However, as the number of clients and devices grows, the service must also scale in order to guarantee a latency limit and a quality threshold. This can be achieved by deploying and activating more cloudlets, but that solution is expensive due to the cost of the physical servers. A better choice is to optimize the resources of the cloudlets through an intelligent choice of configuration that lowers delay and raises scalability. Thus, in this paper we propose an algorithm that uses Virtual Machine Migration and Transmission Power Control, together with a mathematical model of delay in Mobile Edge Computing and a heuristic algorithm called Particle Swarm Optimization, to balance the workload between cloudlets and consequently maximize cost-effectiveness. Our proposal is the first to simultaneously consider communication, computation, and migration at the scale we assume and, consequently, outperforms other conventional methods in the number of serviced users.08/07/2018 2:07 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2818144An Erase Efficiency Boosting Strategy for 3D Charge Trap NAND Flash
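Particle Swarm Optimization, the heuristic named above, can be sketched generically. The toy below minimizes a stand-in one-dimensional cost rather than the paper's delay model; every parameter value and function name is an assumption.

```python
# Minimal particle swarm optimization (PSO) sketch: particles move under
# inertia plus attraction to their own best and the swarm's best position.
import random

def pso(cost, lo, hi, n_particles=20, iters=100,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    pos = [rng.uniform(lo, hi) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    best = pos[:]                      # per-particle best position
    gbest = min(pos, key=cost)         # swarm-wide best position
    for _ in range(iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            vel[i] = (w * vel[i]
                      + c1 * r1 * (best[i] - pos[i])
                      + c2 * r2 * (gbest - pos[i]))
            pos[i] = min(max(pos[i] + vel[i], lo), hi)  # clamp to bounds
            if cost(pos[i]) < cost(best[i]):
                best[i] = pos[i]
                if cost(best[i]) < cost(gbest):
                    gbest = best[i]
    return gbest

# Toy "delay" curve with its minimum at x = 3, standing in for the
# cloudlet workload-balancing objective.
optimum = pso(lambda x: (x - 3.0) ** 2, lo=-10.0, hi=10.0)
```

In the paper's setting, the particle position would encode a cloudlet configuration (VM placement and transmission power) and the cost would be the modeled delay.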
https://www.computer.org/csdl/trans/tc/2018/09/08322264-abs.html
Owing to the fast-growing demand for larger and faster NAND flash devices, new manufacturing techniques have accelerated the down-scaling of NAND flash memory. Among these techniques, 3D charge trap flash is considered one of the most promising candidates for next-generation NAND flash devices. However, the long erase latency of 3D charge trap flash has become a critical issue, exacerbated because the distinct transient-voltage-shift phenomenon worsens as the number of program/erase (P/E) cycles increases. In contrast to existing works that aim to tackle the erase latency issue by reducing the number of block erases, we tackle it by utilizing the “multi-block erase” feature. In this work, an erase efficiency boosting strategy is proposed to boost the garbage collection efficiency of 3D charge trap flash by enabling multi-block erase inside flash chips. A series of experiments was conducted to demonstrate the capability of the proposed strategy to improve the erase efficiency and access performance of 3D charge trap flash. The results show that the erase latency of 3D charge trap flash memory is improved by 75.76 percent on average, even when the P/E cycle count reaches <inline-formula> <tex-math notation="LaTeX">$10^{4}$</tex-math><alternatives><inline-graphic xlink:href="chang-ieq1-2818118.gif"/> </alternatives></inline-formula>.08/07/2018 2:08 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2818118Advanced Compressor Tree Synthesis for FPGAs
https://www.computer.org/csdl/trans/tc/2018/08/08263391-abs.html
This work presents novel methods for the optimization of compressor trees for FPGAs, as required in many arithmetic computations. As demonstrated in recent work, key elements for the design of efficient but fast compressor trees are target-optimized 4:2 compressors as well as generalized parallel counters (GPCs). However, the optimization of a compressor tree for minimal resources using both compressors and GPCs has not been addressed so far. As this combined optimization is a non-trivial task, three methods are proposed to find the best solutions for a given problem size: 1) a heuristic that obtains compressor trees with typically fewer resources and fewer stages than state-of-the-art heuristics, 2) an integer linear programming (ILP)-based methodology that finds optimal compressor trees using the fewest stages possible, and 3) a combined approach that partially solves the problem heuristically to reduce the search space for the ILP-based method. In all methods, the cost of pipeline registers can be included. Synthesis experiments show that the proposed methods provide pipelined compressor trees with about 40 percent fewer LUTs than trees of 2-input adders, at the cost of being about 12-20 percent slower.07/10/2018 1:00 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2795611Exploring the Design Space of Fair Scheduling Supports for Asymmetric Multicore Systems
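To make the GPC terminology above concrete, the following sketch models a (6:3) generalized parallel counter, which compresses up to six equal-weight bits into their 3-bit population count. It is a behavioral illustration in Python, not one of the paper's target-optimized FPGA mappings.

```python
# Behavioral model of a (6:3) generalized parallel counter: up to six
# input bits of the same weight 2^k are reduced to three output bits of
# weights 2^k, 2^{k+1}, 2^{k+2} carrying their population count.
def gpc_6_3(bits):
    assert len(bits) <= 6 and all(b in (0, 1) for b in bits)
    count = sum(bits)                  # 0..6 fits in 3 output bits
    return (count & 1, (count >> 1) & 1, (count >> 2) & 1)

# Six partial-product bits of some weight 2^k ...
column = [1, 1, 0, 1, 1, 1]            # popcount = 5 = 0b101
s0, s1, s2 = gpc_6_3(column)
# ... become one bit each of weight 2^k, 2^{k+1}, 2^{k+2}.
assert (s0, s1, s2) == (1, 0, 1)
```

A compressor tree applies such counters (and 4:2 compressors) stage by stage until each bit column holds at most two bits, which a final carry-propagate adder then sums; the paper's methods choose which counters to instantiate where.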
https://www.computer.org/csdl/trans/tc/2018/08/08265024-abs.html
Although traditional CPU scheduling efficiently utilizes multiple cores with equal computing capacity, the advent of multicores with diverse capabilities poses challenges to CPU scheduling. For such asymmetric multi-core systems, scheduling is essential to exploit the efficiency of core asymmetry by matching each application with the best core type. However, in addition to efficiency, an important aspect of CPU scheduling is fairness in CPU provisioning. Uneven core capability is inherently unfair to threads and causes performance variance, as applications running on fast cores receive more computing capability than applications on slow cores. Depending on co-running applications and scheduling decisions, the performance of an application may vary significantly. This study investigates the fairness problem in asymmetric multi-cores and explores the design space of OS schedulers supporting multiple fairness constraints. In this paper, we consider two fairness-oriented constraints: <italic>minimum fairness</italic> for the minimum guaranteed performance, and <italic>uniformity</italic> for performance variation reduction. This study proposes four scheduling policies that guarantee a minimum performance bound while improving overall throughput and reducing performance variation. The proposed fairness-oriented schedulers are implemented in the Linux kernel with an online application monitoring technique. Using an emulated asymmetric multi-core with frequency scaling and a real asymmetric multi-core with the big.LITTLE architecture, the paper shows that the proposed schedulers can effectively support the specified fairness while improving overall system throughput.07/10/2018 1:00 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2796077Checkpointing Workflows for Fail-Stop Errors
https://www.computer.org/csdl/trans/tc/2018/08/08279499-abs.html
We consider the problem of orchestrating the execution of workflow applications structured as Directed Acyclic Graphs (DAGs) on parallel computing platforms that are subject to fail-stop failures. The objective is to minimize the expected overall execution time, or makespan. A solution to this problem consists of a schedule of the workflow tasks on the available processors and a decision about which application data to checkpoint to stable storage, so as to mitigate the impact of processor failures. To address this challenge, we consider a restricted class of graphs, Minimal Series-Parallel Graphs (<sc>M-SPGs</sc>), which is relevant to many real-world workflow applications. For this class of graphs, we propose a recursive list-scheduling algorithm that exploits the <sc>M-SPG</sc> structure to assign sub-graphs to individual processors, and uses dynamic programming to decide how to checkpoint these sub-graphs. We assess the performance of our algorithm for production workflow configurations, comparing it to an approach in which all application data is checkpointed and an approach in which no application data is checkpointed. Results demonstrate that our algorithm outperforms both the former approach, because of lower checkpointing overhead, and the latter approach, because of better resilience to failures.07/10/2018 1:00 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2801300Mitigating Observability Loss of Toggle-Based <italic>X</italic>-Masking via Scan Chain Partitioning
https://www.computer.org/csdl/trans/tc/2018/08/08280565-abs.html
Because the toggle-based <italic>X</italic>-masking method permits only a single toggle at a given cycle, there is a chance that non-<italic>X</italic> values are also masked. Hence, this non-<italic>X</italic>-value over-masking problem may degrade fault coverage. In this paper, a scan chain partitioning scheme is described to alleviate the non-<italic>X</italic>-bit over-masking problem arising from the toggle-based <italic>X</italic>-masking method. The scan chain partitioning method finds a scan chain combination that incurs the fewest toggling conflicts. The experimental results show that the amount of over-masked bits is significantly reduced, and it is reduced further when the proposed method is combined with the <italic>X</italic>-canceling method. However, as the number of scan chain partitions increases, the control data for the decoder increase. To reduce this control data overhead, this paper exploits Huffman-coding-based data compression. Assuming two partitions, the size of the control data is even smaller than that of the conventional <italic>X</italic>-toggling method, which uses only one decoder. In addition, selection rules for the <italic>X</italic>-bits delivered to the <italic>X</italic>-Canceling MISR are proposed. With these selection rules, a significant increase in test time can be prevented.07/10/2018 1:00 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2801847LEAD: An Adaptive 3D-NoC Routing Algorithm with Queuing-Theory Based Analytical Verification
https://www.computer.org/csdl/trans/tc/2018/08/08283712-abs.html
2D-NoCs have been the mainstream approach to interconnecting multi-core systems. 3D-NoCs have emerged to compensate for deficiencies of 2D-NoCs such as long latency and power overhead. A low-latency routing algorithm for 3D-NoCs is designed to accommodate high-speed communication between cores. Both simulation and analytical models are applied to estimate the communication latency of NoCs. Generally, simulations are time-consuming and slow down the design process. Analytical models provide, within a fraction of the time, nearly accurate results that can be used by simulation to fine-tune the design. In this paper, a high-performance adaptive routing algorithm is proposed for partially connected 3D-NoCs. The latency of the routing algorithm under different traffic patterns, different numbers of elevators, and different elevator assignment mechanisms is reported. An analytical model, tailored to the adaptivity of the algorithm under low-traffic scenarios, has been developed, and its results have been verified by simulation. According to the results, simulation and analytical results are consistent within a 10 percent margin.07/10/2018 1:00 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2801298READ: Reliability Enhancement in 3D-Memory Exploiting Asymmetric SER Distribution
https://www.computer.org/csdl/trans/tc/2018/08/08283794-abs.html
3D-memory is one of the most promising applications of 3D-IC technology. With 3D integration, the effective density of memories can increase and the interconnect distance from processor to memory can be shortened. Due to the stacked structure, the upper dies act as shields blocking outer particles from reaching the lower dies, which makes the error rate of the top layer the largest among all layers. From a heat perspective, the lower dies suffer from reliability problems since they are placed on top of the logic die; heat dissipation affects the lower dies more than the upper dies. This creates an unequal reliability distribution across the layers of a 3D-memory. A novel ECC organization scheme for 3D-memory that secures reliable operation under asymmetric soft error rate (SER) profiles is introduced in this paper. The proposed scheme does not require additional redundant arrays. Instead, it utilizes the unused spare columns of the relatively reliable layers to store additional check-bits for the less reliable layers. This forms a heterogeneous ECC organization across layers that enhances ECC capability in the less reliable layers. In addition, a redundancy sharing scheme for yield enhancement can be implemented with the proposed scheme. Experimental results show that a memory with the proposed method can tolerate a bit-error rate more than three times that of a conventional memory.07/10/2018 1:01 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2801856Energy Optimal Task Scheduling with Normally-Off Local Memory and Sleep-Aware Shared Memory with Access Conflict
https://www.computer.org/csdl/trans/tc/2018/08/08290729-abs.html
The rapid development of Real-Time and Embedded Systems (RTES) has increased the requirements on the processing capabilities of sensors, mobiles, smart devices, etc. Meanwhile, energy efficiency techniques are desperately needed, as most RTES devices are battery powered. Following these trends, this work explores the energy efficiency of the memory system for a general multi-core architecture. This architecture integrates a local memory in each processing core, with a large off-chip memory shared among multiple cores. Decisions need to be made on whether tasks will be executed with the shared memory or the local memory so as to minimize the total energy consumption within real-time constraints. This paper proposes optimal schemes as well as a polynomial-time approximation algorithm with a constant ratio. A problem complexity analysis for different task and system models is also presented. Experimental results show that the proposed approximation scheme performs close to the optimal solution on average.07/10/2018 1:00 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2805337adBoost: Thermal Aware Performance Boosting Through Dark Silicon Patterning
https://www.computer.org/csdl/trans/tc/2018/08/08292829-abs.html
The increasing power density of many-core systems leaves a fraction of on-chip resources inactive, referred to as dark silicon. Efficient management of the critical interlinked parameters - power, performance, and temperature - can improve resource utilization and mitigate dark silicon. In this paper, we present a run-time resource management system for thermal-aware performance boosting using a dark-silicon-aware run-time application mapping strategy. The mapping policy patterns inactive cores among active cores for a relatively lower and more even distribution of operating temperatures. This provides enough thermal headroom for boosting the frequency of active cores upon performance surges and allows sustained boosting periods, further improving performance. We design a controller for thermal-aware performance boosting that decides on efficient allocation and utilization of the power budget and the thermal headroom obtained from patterning. Our strategy yields up to 37 percent better throughput, 29 percent lower waiting time, and up to 2<inline-formula><tex-math notation="LaTeX">$\times$</tex-math><alternatives> <inline-graphic xlink:href="kanduri-ieq1-2805683.gif"/></alternatives></inline-formula> longer boosting periods in comparison with other state-of-the-art run-time mapping policies.07/10/2018 1:00 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2805683ARMOR: A Recompilation and Instrumentation-Free Monitoring Architecture for Detecting Memory Exploits
https://www.computer.org/csdl/trans/tc/2018/08/08295231-abs.html
Software written in programming languages that permit manual memory management, such as C and C++, is often littered with exploitable memory errors. These memory bugs enable attackers to leak sensitive information, hijack program control flow, or otherwise compromise the system, and they are a critical concern for computer security. Many runtime monitoring and protection approaches have been proposed to detect memory errors in C and C++ applications; however, they require source code recompilation or binary instrumentation, creating compatibility challenges for applications using proprietary or closed-source code, libraries, or plug-ins. This paper introduces a new approach for detecting heap memory errors that does not require applications to be recompiled or instrumented. We show how to leverage the calling convention of a processor to track all dynamic memory allocations made by an application at runtime. We also present a transparent tracking and caching architecture to efficiently verify program heap memory accesses. Performance simulations of our architecture using SPEC benchmarks and real-world application workloads show that it achieves hit rates over 95 percent with a 256-entry cache, resulting in only 2.9 percent runtime overhead. Security analysis using a software prototype shows that our architecture detects 98 percent of heap memory errors from selected test cases in the Juliet Test Suite and real-world exploits.07/10/2018 1:00 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2807818Towards Formal Evaluation and Verification of Probabilistic Design
https://www.computer.org/csdl/trans/tc/2018/08/08302961-abs.html
In the nanometer regime of integrated circuit fabrication, device variability imposes serious challenges on the design and manufacturing of reliable systems. A new computation paradigm of approximate and probabilistic design has recently been proposed to accept design imperfection as a resource for certain applications. Despite recent intensive study of approximate design, probabilistic design has received relatively little attention. This paper provides a general formulation for the evaluation and verification of probabilistic design. We establish its connection to stochastic Boolean satisfiability (SSAT), (weighted) model counting, and probabilistic model checking. Moreover, a novel SSAT solver based on binary decision diagrams (BDDs) is proposed, and a comparative experimental study is performed to contrast the strengths and weaknesses of the different solutions. The proposed BDD-based SSAT solver exhibits the best scalability among all techniques in our experiments. We also compare the BDD-based SSAT solver to a prior method based on Bayesian network modeling. Experimental results show that our method outperforms the prior method by orders of magnitude in both runtime and memory usage. Our work can be an essential step towards automated synthesis of probabilistic design.07/10/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2807431Metastability-Containing Circuits
https://www.computer.org/csdl/trans/tc/2018/08/08314764-abs.html
In digital circuits, <italic>metastability</italic> can cause deteriorated signals that are neither logical 0 nor logical 1, breaking the abstraction of Boolean logic. Synchronizers, the only traditional countermeasure, exponentially decrease the odds of sustained metastability over time. We propose a fundamentally different approach: it is possible to deterministically <italic>contain</italic> metastability by fine-grained logical masking so that it cannot infect the entire circuit. At the heart of our approach lies a time- and value-discrete model for metastability in synchronous clocked digital circuits, in which metastability is propagated in a worst-case fashion. The proposed model permits positive results and passes the test of reproducing Marino's impossibility results. We fully classify which functions can be computed by circuits with standard registers. Regarding masking registers, we show that more functions become computable with each clock cycle, and that masking registers permit exponentially smaller circuits for some tasks. Demonstrating the applicability of our approach, we present the first fault-tolerant distributed clock synchronization algorithm that deterministically guarantees correct behavior in the presence of metastability. As a consequence, clock domains can be synchronized without using synchronizers, enabling metastability-free communication between them.07/10/2018 1:00 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2018.2808185Memory and Communication Profiling for Accelerator-Based Platforms
https://www.computer.org/csdl/trans/tc/2018/07/08234629-abs.html
The growing demand for processing power is being satisfied mainly by an increase in the number of homogeneous and heterogeneous computing cores in a system. Efficient utilization of these architectures demands analyzing the memory-access behaviour of applications and performing data-communication-aware mapping of applications on these architectures. Appropriate tools are required to highlight memory-access patterns and provide detailed intra-application data-communication information to assist developers in porting existing sequential applications efficiently to these architectures. In this work, we present the design of an open-source tool that provides such a detailed profile for C/C++ applications. In contrast to prior work, our tool not only reports detailed information but also generates this information with manageable overheads for realistic workloads. Comparison with the state-of-the-art shows that the proposed profiler has, on average, an order of magnitude less overhead than state-of-the-art data-communication profilers for a wide range of benchmarks. The experimental results show that our proposed tool generates profiling information for image processing applications which assisted in achieving speed-ups of <inline-formula><tex-math notation="LaTeX">$6.14\times$</tex-math><alternatives> <inline-graphic xlink:href="ashraf-ieq1-2785225.gif"/></alternatives></inline-formula> and <inline-formula> <tex-math notation="LaTeX">$2.75\times$</tex-math><alternatives><inline-graphic xlink:href="ashraf-ieq2-2785225.gif"/> </alternatives></inline-formula> for heterogeneous multi-core platforms containing an FPGA and a GPU as accelerators, respectively.06/07/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2785225A Compositional Approach for Verifying Protocols Running on On-Chip Networks
https://www.computer.org/csdl/trans/tc/2018/07/08239635-abs.html
In modern many-core architectures, advanced on-chip networks provide the means of communication for the cores. This greatly complicates the design and verification of the cache coherence protocols deployed by those cores. A common approach to deal with this complexity is to decompose the whole system into the protocol and the network. This decomposition is, however, not always possible. For example, unexpected deadlocks can emerge when a deadlock-free protocol and a deadlock-free network are combined. This paper proposes a compositional methodology: prove properties over a network, prove properties over a protocol, and infer properties over the system as a whole. Our methodology is based on theorems that show that such decomposition is possible by having sufficiently large local buffers at the cores. We apply this methodology to verify several protocols such as MI, MSI, MESI and MEUSI running on top of advanced interconnects with adaptive routing.06/07/2018 2:04 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2786723Scheduling Weakly Consistent C Concurrency for Reconfigurable Hardware
https://www.computer.org/csdl/trans/tc/2018/07/08241825-abs.html
Lock-free algorithms, in which threads synchronise not via coarse-grained mutual exclusion but via fine-grained atomic operations (‘atomics’), have been shown empirically to be the fastest class of multi-threaded algorithms in the realm of conventional processors. This article explores how these algorithms can be compiled from C to reconfigurable hardware via <italic>high-level synthesis</italic> (HLS). We focus on the scheduling problem, in which software instructions are assigned to hardware clock cycles. We first show that typical HLS scheduling constraints are insufficient to implement atomics, because they permit some instruction reorderings that, though sound in a single-threaded context, demonstrably cause erroneous results when synthesising multi-threaded programs. We then show that correct behaviour can be restored by imposing additional <italic>intra-thread</italic> constraints among the memory operations. In addition, we show that we can support the pipelining of loops containing atomics by injecting further inter-iteration constraints. We implement our approach on two constraint-based scheduling HLS tools: LegUp 4.0 and LegUp 5.1. We extend both tools to support two memory models that are capable of synthesising atomics correctly. The first memory model only supports <italic>sequentially consistent</italic> (SC) atomics and the second supports <italic>weakly consistent</italic> (‘weak’) atomics as defined by the 2011 revision of the C standard. Weak atomics necessitate fewer constraints than SC atomics, but suffice for many multi-threaded algorithms. We confirm, via automatic model-checking, that we correctly implement the semantics in accordance with the C standard.
A case study on a circular buffer suggests that on average circuits synthesised from programs that schedule atomics correctly can be 6x faster than an existing lock-based implementation of atomics, that weak atomics can yield a further 1.3x speedup, and that pipelining can yield a further 1.3x speedup.06/07/2018 2:03 pm PSThttp://doi.ieeecomputersociety.org/10.1109/TC.2017.2786249