# Robust Sub-Powered Asynchronous Logic

Jiaoyan Chen<sup>#1</sup>, Arnaud Tisserand<sup>#2</sup>, Emanuel Popovici<sup>#3</sup>, Sorin Cotofana<sup>#1</sup> Department of Computer Engineering, TU Delft, Delft, the Netherlands<sup>#1</sup> CNRS, IRISA, INRIA, Univ. Rennes 1, Lannion, France<sup>#2</sup> Department of Electrical and Electronic Engineering, University College Cork, Cork, Ireland<sup>#3</sup>

Abstract— While MOSFET technology scaling provides substantial advantages in terms of Integrated Circuit (IC) speed and energy consumption, these come at the expense of a higher sensitivity to process, voltage, and temperature (PVT) variations. To alleviate this lack of robustness, which has become a critical issue in advanced deep sub-micron technologies, many mechanisms have been proposed at all abstraction levels, from device and circuit up to architecture and application software. Among those, a natural solution is to rely on an asynchronous logic design style, as by its nature it is less sensitive to delay variations, which are the "de facto" consequence of PVT variations. Several asynchronous logic families have been introduced: (i) single-rail logic, which is energy effective but still time-sensitive as it relies on delay elements, and (ii) dual-rail logic, which is robust but more power hungry. In this paper we introduce a robust asynchronous logic family which does not rely on timing assumptions and/or delay elements and can operate with sub-powered devices. The key element behind our proposal is a simplified completion detection mechanism which makes it substantially more energy effective when compared with other dual-rail approaches. A 32-bit Ripple Carry Adder (RCA) is implemented in 65nm and 45nm CMOS processes to evaluate the practicability of our approach. Firstly, the Optimal Energy Point (OEP) of the proposed RCA is investigated by scaling $V_{DD}$ from 0.4V to 0.2V (50mV interval), where the OEP occurs at 0.25V for both technologies. Secondly, when comparing the energy consumption with the corresponding single-rail benchmark at its OEP in a 65nm process, energy savings of 30% (34fJ, for 65nm) and 40% (54fJ, for 45nm after scaling) are achieved, respectively. More impressive (10x better) energy efficiency and reasonable performance are obtained over dual-rail counterparts. Finally, Monte Carlo simulation with process variations is executed to demonstrate the robustness of our methodology as well as to explore the response of the OEP, which remains unchanged at 0.25V.

Keywords—asynchronous logic; low power; robustness; near/sub-threshold; process variation

## I. INTRODUCTION

High demand for low power electronic products pushes the development of power-saving technologies and techniques to new boundaries. Among them, voltage scaling has been one of the most effective and straightforward methods to reduce the power/energy consumption of digital Integrated Circuits (ICs) [1]. Within the voltage scaling family of techniques, sub-threshold logic, which lowers the supply voltage to near or below the threshold voltage ($V_{th}$) of the MOSFET, has been proven to reduce power/energy consumption in ICs by at least one order of magnitude [2,3]. Several works based on this type of technique have been proposed during the last few years [4-7].

However, the penalty in performance is also significant. This also guides our work, namely to find the best trade-off between power savings and performance.

Meanwhile, as MOS transistor sizes reach tens of nanometers, Process, Voltage, and Temperature (PVT) variations complicate timing analysis/validation in synchronous designs. Moreover, the clock tree is already considered one of the major energy optimization bottlenecks in synchronous circuits (SYNC). With respect to these concerns, asynchronous logic (ASYNC) provides an interesting alternative. Its self-timed, clock-less nature makes ASYNC more tolerant to PVT variations, with potentially lower power/energy consumption than its synchronous counterparts. These features are exploited to aid the design of large SoCs using mixed logic (Globally Asynchronous Locally Synchronous, or GALS) [8].

Research on ASYNC designs operating at Ultra-Low Voltage (ULV) supply has drawn attention over the last few years. ASYNC is believed to gain further advantages under voltage scaling: at ULV, the drain current of MOSFETs is dominated by diffusion current, which is exponentially sensitive to PVT variations, so the fixed rate of a global clock signal becomes less practicable.

In ASYNC, there are mainly two protocols, namely bundled-data (single-rail) and dual-rail, each having its own advantages and disadvantages [8]. For bundled-data, static CMOS gates can be used, so it is easy to implement using Hardware Description Languages (HDLs) and/or CAD tools. Also, due to the single-rail property, the bundled-data protocol has higher power efficiency. On the other hand, the dual-rail family does not require the delay elements that are necessary in bundled-data, which makes the dual-rail protocol more tolerant to PVT variations and hence results in more reliable circuits.

In [9], an ASYNC bundled-data pipeline called MOUSETRAP is proposed, which uses local "clock" generators, i.e., replica delays, placed close to the logic blocks. The close placement provides tracking between the delay replica and the logic circuits. In other words, both the delay elements and the logic circuits suffer similar variations (such as temperature). Therefore MOUSETRAP is robust to systematic variations. However, when it comes to random PVT variations, the tracking ability of bundled-data pipelines becomes vulnerable, especially at ULV [10]. An improved version, called soft MOUSETRAP, uses a wider capturing window in the latches, which allows them to capture incoming data that are accidentally delayed by random PVT variations [5].

From our point of view, to further enhance the immunity against PVT variations, latch-less dual-rail pipelines have larger potential than bundled-data implementations. Unfortunately, conventional dual-rail logic families, such as Domino Differential Cascode Voltage Switch Logic (DDCVSL) [11] and Pre-Charged Half-Buffer (PCHB) [12], inevitably consume more power than static CMOS based bundled-data designs. Because of this, many ASYNC designers favor the bundled-data protocol even though it relies on timing/delay assumptions. Recently, an ASYNC Quasi-Delay-Insensitive (QDI) Static Logic Transistor-Level Implementation (SLTI) approach was proposed in [7], which uses dual-rail static logic to avoid delay assumptions. It also introduced a simplified completion detection circuit, which resulted in lower power dissipation and smaller area than its PCHB counterparts.

In this paper, we propose an asynchronous logic named Robust Sub-Powered Asynchronous Logic (RSPAL), which can be implemented in latch-less dual-rail pipelines and is energy efficient, especially in sub-powered circuits. RSPAL is developed on top of Asynchronous Charge Sharing Logic (ACSL) [13], with modifications that strengthen the reliability of circuits at ULV by eliminating the charge sharing and latches used in ACSL. Although the voltage scaling of ACSL is explored in [13], it lacks efficiency in the sub-powered domain. RSPAL inherits the intuitive completion detection protocol of ACSL, which spares a great deal of overhead in ASYNC designs. The power gating property is also maintained in RSPAL. As no latch is involved in RSPAL circuits, timing validation (e.g., hold time in MOUSETRAP [5,9]) is not necessary, which not only saves effort in designing replica delay elements but also increases the robustness against random PVT variations. The RSPAL pipelines can then run at full speed. In terms of power consumption, although RSPAL is literally a dual-rail dynamic logic, unlike conventional dynamic logic families it does not need pre-charging and is naturally power gated, resulting in low leakage power consumption as well.

This paper is organized as follows. In Section II, the soft MOUSETRAP pipelines (bundled-data) are introduced; then, the PCHB and SLTI approaches (dual-rail) are presented. Section III explores how RSPAL evolves from ACSL to meet the rigorous requirements of the near/sub-threshold region. Next, simulation results, including Monte Carlo (MC) simulation, are detailed and discussed in Section IV, where more than 40% energy reduction and reasonable performance are achieved. Finally, the paper is concluded in Section V.

## II. RELATED WORK: BUNDLED-DATA & DUAL-RAIL

In this section, two basic families in ASYNC are introduced along with several recent works and their advantages and disadvantages.



Fig. 1. Structure of soft MOUSETRAP pipeline [5]

## A. Bundled-data: Soft MOUSETRAP

For soft MOUSETRAP [5] to operate successfully at voltages as low as 0.3V, even with random PVT variations, two conditions are needed: a wider sampling window in the latches and a protocol modification. Fig.1 depicts the general structure of soft MOUSETRAP. Compared with the conventional MOUSETRAP, an extra delay, called $t_{soft}$, is introduced in the protocol path from the node *done<sub>N</sub>* to the node *en<sub>N</sub>*. This additional delay allows more time slack for the latches to open and sample the data. To determine the value of $t_{soft}$, intensive MC simulations have to be done, and careful calculation is also needed to meet the hold time constraints. Moreover, an alternative protocol is introduced to avoid the energy-consuming short-path padding that comes with a large $t_{soft}$.

In [5], 18.5% less delay and 14.5% lower energy consumption were reported for a 32-bit Ripple Carry Adder (RCA) in comparison with the conventional MOUSETRAP over supply voltages ranging from 0.3V to 0.5V. No lower supply voltages were applied in their simulation, and no optimal Power-Delay-Product (PDP)/energy point was investigated.

## B. Dual-rail: PCHB and SLTI

The dual-rail based asynchronous handshaking protocol provides delay-insensitivity to asynchronous circuits because it uses the changes of the data signals themselves (pre-charging and evaluation events) to communicate between stages [11,12]. No delay assumption is needed in this solution. The penalty is high power consumption, due to the higher switching activity of dual-rail logic compared with single-rail logic [11]. Additionally, the complex completion detectors (CDs) based on the complementary signals provide high robustness in dual-rail ASYNC at the expense of area and power dissipation overhead.
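To make the completion detection overhead concrete, the following is a minimal behavioural sketch of conventional dual-rail encoding, assuming the usual convention of one asserted rail per bit and an all-low spacer. The names `encode`, `word_valid`, and `word_empty` are illustrative and do not come from the cited works.

```python
from typing import List, Tuple

Rail = Tuple[int, int]  # (true_rail, false_rail) for one bit

SPACER: Rail = (0, 0)  # no data present (reset / pre-charged phase)


def encode(bit: int) -> Rail:
    """Dual-rail encode one bit: 1 -> (1, 0), 0 -> (0, 1)."""
    return (1, 0) if bit else (0, 1)


def word_valid(word: List[Rail]) -> bool:
    """Data is complete only when every bit has exactly one rail asserted."""
    return all(t ^ f for t, f in word)


def word_empty(word: List[Rail]) -> bool:
    """The spacer is complete only when every rail of every bit is low."""
    return all(t == 0 and f == 0 for t, f in word)
```

Both checks inspect every bit of the word, which is why the detector cost in such schemes grows with the number of signals.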

Several works are dedicated to simplifying these detectors. In [14], a power-efficient integrated input/output completion detection circuit for Pre-Charged Half-Buffer (PCHB) was proposed, with a 35% PDP improvement over the conventional PCHB for a 4x4 multiplier. A further simplification of the CD in dual-rail ASYNC circuits was introduced with the Static Logic Transistor-Level Implementation (SLTI) [7]. Unlike PCHB, SLTI requires only output completion detectors and uses static logic rather than dynamic logic, which enhances the feasibility of voltage scaling.



Fig. 2. Structures of (a) PCHB pipeline (b) SLTI pipeline [7]



Fig. 3. Signal transition diagram for (a) four-phase handshaking protocol (b) ACSL handshaking protocol [13]

The pipeline structures of these two approaches are illustrated in Fig.2, where *ICD* represents input completion detector and *OCD* indicates output completion detector. SLTI achieved around 51% power reduction at  $V_{DD}$ =0.2V when compared to PCHB counterparts for 32-bit arithmetic units, but no PVT variations are considered in those simulations.

## III. ROBUST SUB-POWERED ASYNCHRONOUS LOGIC

In this section, the reasoning behind the development of RSPAL from ACSL is explained, and the structure and operation of RSPAL are detailed.

## A. From ACSL to RSPAL

To improve the power efficiency of dual-rail logic, ACSL was proposed in [13]; it applies charge sharing to Positive Feedback Adiabatic Logic (PFAL). Logic blocks are built on PFAL with a specially designed communication protocol, shown in Fig.3 (b), that enables the charge sharing process and ensures correct evaluation. Fig.3 (a) is the standard four-phase handshaking protocol. Fig.4 presents simulation waveforms of ACSL, in particular the charge sharing of its Voltage Power Clock (VPC). In terms of PDP, an overall 28% reduction over DDCVSL is achieved thanks to charge sharing. In addition, more than 30% leakage power saving is gained against DDCVSL [13] owing to the inherent power gating property.

ACSL performs well for normal operation at nominal $V_{DD}$. However, when the supply voltage $V_{DD}$ falls to near or below the threshold voltage ($V_{GS} < V_{th}$), transistors are in weak-inversion mode, where evaluation relies on sub-threshold leakage current. The performance of MOSFETs degrades significantly while the power consumption is reduced considerably. The trade-off between delay and power as a function of $V_{DD}$ can result in an optimal PDP/energy dissipation point. Unfortunately, such a small leakage current limits the charge sharing efficiency and also makes the completion detection of the sharing operation in ACSL less accurate. Therefore, we removed the charge sharing operation and modified the handshaking protocol in RSPAL in order to focus on robustness against PVT variations. Even without sharing, RSPAL still consumes less power/energy than the other approaches introduced in the previous section; the simulation results are discussed in Section IV. Furthermore, unlike ACSL, RSPAL needs no latches between stages, again due to the absence of charge sharing.
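For reference, the weak-inversion behaviour mentioned in this subsection is commonly summarised by the textbook sub-threshold current model below. The symbols $I_0$ (a process-dependent current scale), $n$ (sub-threshold slope factor), and $V_T$ (thermal voltage) are not defined in this paper and are introduced here only for illustration:

$$I_D \approx I_0 \, e^{(V_{GS}-V_{th})/(n V_T)} \left(1 - e^{-V_{DS}/V_T}\right), \qquad V_T = \frac{kT}{q}$$

The exponential dependence on $V_{GS}-V_{th}$ is what makes delay so sensitive to threshold and supply variations in this region. With the energy per operation taken as the product of average power and delay,

$$E = P \cdot t_d ,$$

a falling $P$ and a rising $t_d$ under voltage scaling can produce a minimum of $E$ at some intermediate $V_{DD}$, which is the optimal energy point sought in Section IV.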

## B. RSPAL Structure

As in ACSL, evaluation blocks in RSPAL are based on PFAL, which has been demonstrated to be the most power-efficient logic in the adiabatic family [15]. Even in non-adiabatic mode, PFAL is a promising logic thanks to its high power efficiency, and it is well suited to ASYNC [13]. It should be noted that RSPAL does not belong to adiabatic logic, which requires a specially designed AC power supply. The generic schematic of a PFAL function block is depicted in Fig.5 (a); Fig.5 (b) and (c) show the PFAL-based sum and carry calculations of a one-bit full adder, respectively. All these gates can be dimensioned at minimal size. Instead of the relatively complicated VPC generator of ACSL, which requires three control signals [13] to evaluate the PFAL blocks, a simple buffer is sufficient in RSPAL; it is controlled by the C-elements [7] of each pipeline stage. C-elements are the basic circuits that control communication in ASYNC, based on the request signal (REQ) and acknowledge signal (ACK) from neighbouring stages. The structure is symmetric: both the N-MOS tree and the cross-coupled inverters are powered by VPC, and the outputs all follow VPC. When the inputs are available, VPC starts to evaluate the circuits by charging up to $V_{DD}$. Meanwhile, the differential outputs are set either high or low depending on the function of the n-tree. The generic structure of RSPAL is presented in Fig.6. It is worth mentioning that all the logic signals of PFAL blocks can be directly connected with each other (no latch needed) because the cross-coupled inverters are already embedded in the PFAL blocks. This is not only power efficient but also improves reliability at ULV. VPCs are used not only to power the function blocks but also to indicate completion, which takes only a simple OR gate. Compared with other completion detection schemes in ASYNC circuits [7, 14, 16], our solution is more efficient with respect to delay, power, and area. This advantage can become substantial when the size of the circuits (the number of signals) increases. Compared with ACSL, fewer control signals (less complexity) and fewer gates (smaller area) are involved in RSPAL. For each propagated signal, a latch can be spared, which also benefits the leakage power consumption.
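The structure described in this subsection can be illustrated with a purely behavioural sketch. It assumes that a PFAL block drives both output rails low while VPC is low and exactly one rail high once VPC is high, and that completion is taken as the OR of the two rails of an output; the paper states only that a simple OR gate suffices, so the exact wiring here is an assumption. The names `pfal_full_adder`, `done`, and `c_element` are illustrative.

```python
from typing import Tuple

Rail = Tuple[int, int]  # (out, out_bar)


def pfal_full_adder(vpc: int, a: int, b: int, cin: int) -> Tuple[Rail, Rail]:
    """Behavioural dual-rail full adder: returns (sum rails, carry rails).

    While vpc is 0 the block is unpowered and every output rail is low.
    While vpc is 1 exactly one rail of each output is high.
    """
    if not vpc:
        return (0, 0), (0, 0)
    s = a ^ b ^ cin
    c = (a & b) | (cin & (a ^ b))
    return (s, 1 - s), (c, 1 - c)


def done(out: Rail) -> int:
    """Assumed completion signal: one OR over the two rails of an output."""
    return out[0] | out[1]


def c_element(x: int, y: int, prev: int) -> int:
    """Muller C-element: follows the inputs when they agree, otherwise holds."""
    return x if x == y else prev
```

In this model a stage that is not powered reports `done == 0` for every output, which reflects the idea that the same VPC signal both gates the power and marks completion.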



Fig. 4. Simulation waveforms of VPCs and Control signals in ACSL







## C. RSPAL Operation

Although PFAL-based RSPAL belongs to the dual-rail dynamic logic family, it does not require pre-charging before evaluation, unlike PCHB or DDCVSL. In [13], PFAL was shown to have higher power efficiency than DDCVSL in ASYNC. The function blocks are powered only when their corresponding stage is activated and are discharged in standby mode. This power gating nature, which comes without any overhead, is another advantage of RSPAL and can bring the leakage power down to an extremely low level. This property has already been demonstrated in ACSL [13]. Unlike ACSL, RSPAL applies the conventional four-phase handshaking protocol, as shown in Fig.3 (a). The waveforms of the VPC and CTRL signals in RSPAL (shown in Fig.6) are displayed in Fig.7 at $V_{DD}=0.3V$. It can be seen that the VPCs always follow the changes of the CTRLs. Once the VPCs return to ground, the PFAL blocks are in standby mode with ultra-low leakage power dissipation. No timing assumption is required thanks to the structure of the PFAL function blocks and the dedicated VPC generation scheme. A PFAL function block is powered only once the previous stage has been evaluated, which means the input data of the current stage is ready to use, and metastability is thereby avoided. All ACK signals need to be reset to zero in order to trigger the pipeline initially.
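The four-phase (return-to-zero) handshake referred to above can be sketched as an event sequence on one REQ/ACK pair. This is a generic illustration of the protocol, with both wires starting low as the ACK reset requires; how REQ and ACK map onto the CTRL and VPC signals of Fig.6 is not modelled, and the function name `four_phase` is illustrative.

```python
from typing import Iterator, Tuple


def four_phase(cycles: int) -> Iterator[Tuple[str, int, int]]:
    """Yield (event, req, ack) for a return-to-zero handshake."""
    req, ack = 0, 0  # both wires start low
    for _ in range(cycles):
        req = 1
        yield ("req+", req, ack)  # sender: data is valid
        ack = 1
        yield ("ack+", req, ack)  # receiver: data has been taken
        req = 0
        yield ("req-", req, ack)  # sender: return to zero
        ack = 0
        yield ("ack-", req, ack)  # receiver: ready for the next token


trace = list(four_phase(2))
for event, req, ack in trace:
    print(f"{event}: REQ={req} ACK={ack}")
```

Every token costs four transitions, and each transition is caused by the previous one, so no delay element is needed to order them.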



Fig. 6. Structure of RSPAL pipeline



Fig. 7. Simulation waveforms of RSPAL (a) VPCs (b) CTRLs

## IV. SIMULATION RESULTS AND COMPARISONS

The delay and power consumption of the new technique are investigated in this section. We carried out HSPICE simulations, including Monte Carlo techniques, to compare RSPAL with the other approaches introduced in Section II using a 32-bit Ripple Carry Adder. We use 45nm and 65nm CMOS processes in our simulations, where the threshold voltages are $V_{thn}$=0.322V for 45nm nFETs, $V_{thn}$=0.294V for 65nm nFETs, $V_{thp}$=-0.302V for 45nm pFETs, and $V_{thp}$=-0.229V for 65nm pFETs. The supply voltage is swept from 0.4V down to 0.2V in 50mV steps. Gaussian Distribution (GD) based process variations with 10% deviation from the mean value in the effective length of MOSFETs ($L$), threshold voltages ($V_{th}$), and oxide thickness ($T_{ox}$) are applied, first to test RSPAL stability/robustness and second to check its performance under variations.

## A. Delay, Power of 32-bit RCA at ULV

A 45nm 32-bit RSPAL pipelined RCA (32 stages) has been built and powered at different supply voltages: 0.4V, 0.35V, 0.3V, 0.25V, and 0.2V. Both delay and power data are collected in order to find the minimum energy point. All data points are depicted in Fig.8 and summarized in Table I. As expected, delay increases and power decreases as the supply voltage drops. The minimum energy point occurs at $V_{DD}=0.25V$, with 49.4fJ. Relative to this minimum, the energy consumption is 5.6% higher at 0.3V and 11.2% higher at $V_{DD}=0.2V$ (Table I). It is worth mentioning that the minimum energy point for the 65nm RSPAL RCA is also at $V_{DD}=0.25V$, where the corresponding energy consumption is 91fJ. It is interesting to see whether $V_{DD}=0.25V$ remains the best operating point in MC simulations, which is investigated in the next subsection.
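The search for the minimum energy point can be reproduced from the Table I data with a few lines of Python. This is only a sketch over the published, rounded numbers, so the recomputed delay-power products and percentages may differ slightly from the values printed in the table; the names `TABLE_I` and `energy_fj` are illustrative.

```python
# VDD (V) -> (delay in ns, power in uW, reported energy in fJ), copied from Table I
TABLE_I = {
    0.20: (108.8, 0.5, 55.2),
    0.25: (42.2, 1.2, 49.4),
    0.30: (18.2, 2.9, 52.2),
    0.35: (9.3, 6.5, 60.2),
    0.40: (5.6, 12.5, 70.1),
}


def energy_fj(delay_ns: float, power_uw: float) -> float:
    """Energy per operation: ns * uW = 1e-9 s * 1e-6 W = 1e-15 J = 1 fJ."""
    return delay_ns * power_uw


# The optimal energy point is the supply voltage with the lowest reported energy.
oep_vdd = min(TABLE_I, key=lambda vdd: TABLE_I[vdd][2])
e_min = TABLE_I[oep_vdd][2]

for vdd, (delay, power, energy) in sorted(TABLE_I.items()):
    print(f"{vdd:.2f} V: delay*power = {energy_fj(delay, power):6.1f} fJ, "
          f"reported = {energy:5.1f} fJ, "
          f"vs. minimum = {100 * (energy / e_min - 1):+.1f}%")
print("optimal energy point:", oep_vdd, "V")
```

Because the table lists power rounded to one decimal place, the product of the delay and power columns matches the energy column only to within a few percent.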

In [5], a 32-bit RCA based on soft MOUSETRAP was pipelined in 4 stages (8 bits per stage) and built on 65nm CMOS technology. The voltage was scaled from 0.5V to 0.3V, with the lowest energy consumption at $V_{DD}=0.3V$. The PCHB [14] and SLTI [7] counterparts are also built on 65nm CMOS technology, as 32-bit Kogge-Stone adders, an architecture known for its high speed. Both of them can operate at $V_{DD}=0.2V$. Table II outlines the simulation results of the proposed RSPAL RCA and the three other designs at their lowest energy points. For a fair comparison, a 65nm RSPAL RCA was also simulated. According to the data, the bundled-data pipeline (soft MOUSETRAP) is superior to the dual-rail approaches, PCHB and SLTI, in terms of power/energy consumption. However, the latter two can operate at $V_{DD}=0.2V$, which has not been reported for soft MOUSETRAP. Our RSPAL achieves the best power/energy dissipation in both technologies: more than 60% (43% after scaling) energy saving against soft MOUSETRAP at 45nm, and about 30% at 65nm. Even at $V_{DD}=0.3V$, our adder still achieves more than 50% (40% after scaling) reduction in energy with almost the same operating speed at 45nm. Compared with the two other dual-rail pipelines, RSPAL has much lower power/energy consumption. Although both the PCHB and SLTI adders use the high-speed Kogge-Stone architecture, their operating speed is nearly the same as that of the RSPAL adder at 45nm, according to Tables I and II. Even at 65nm, our RCA can operate at 5.5MHz while consuming only 0.65uW, which corresponds to an energy consumption of 118fJ (not listed in the table). We chose the Kogge-Stone adders as benchmarks to show that our RSPAL RCA can run almost as fast as they do, despite having more stages, while consuming much less power/energy. Normally, the Kogge-Stone adder is much faster than an RCA of the same input width [17,18].

Fig. 8. Simulation data of 32-bit RSPAL RCA (a) delay (b) power consumption (c) energy dissipation

TABLE I. DELAY, POWER, ENERGY OF 32-BIT RSPAL RCA

| Supply Voltage (V) | Delay (ns) | Power (uW) | Energy (fJ) | Energy Difference |
|--------------------|------------|------------|-------------|-------------------|
| 0.2                | 108.8      | 0.5        | 55.2        | +11.2%            |
| 0.25               | 42.2       | 1.2        | 49.4        | /                 |
| 0.3                | 18.2       | 2.9        | 52.2        | +5.6%             |
| 0.35               | 9.3        | 6.5        | 60.2        | +21.9%            |
| 0.4                | 5.6        | 12.5       | 70.1        | +41.9%            |

TABLE II. DELAY, POWER, ENERGY COMPARISON AT ULV

| 32-bit Adder                | Soft MOUSETRAP [5]       | PCHB [14]               | SLTI [7]               | RSPAL                   | RSPAL                   |
|-----------------------------|--------------------------|-------------------------|------------------------|-------------------------|-------------------------|
| Process Technology          | 65nm                     | 65nm                    | 65nm                   | 65nm                    | 45nm                    |
| Logic Realization           | Single-rail Static Logic | Dual-rail Dynamic Logic | Dual-rail Static Logic | Dual-rail Dynamic Logic | Dual-rail Dynamic Logic |
| Timing Assumption           | Yes                      | No                      | No                     | No                      | No                      |
| Operating Speed             | 58MHz @0.3V              | 7.3MHz @0.2V            | 10MHz @0.2V            | 12MHz @0.25V            | 24MHz @0.25V            |
| Power Consumption           | 7.2uW @0.3V              | 16.2uW @0.2V            | 8uW @0.2V              | 1.1uW @0.25V            |                         |
| Optimum Energy Point        | 125fJ                    | 2219fJ                  | 800fJ                  | 91fJ                    | 49fJ                    |
| Scaled Optimum Energy Point | 125fJ                    | 2219fJ                  | 800fJ                  | 91fJ                    | 71fJ                    |
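The energy savings quoted against the benchmarks follow directly from the optimum energy points in Table II. The short calculation below is a sketch of that arithmetic; the constant names and the `saving` helper are illustrative.

```python
# Optimum energy per operation in fJ, copied from Table II
SOFT_MOUSETRAP = 125.0
PCHB = 2219.0
SLTI = 800.0
RSPAL_65NM = 91.0
RSPAL_45NM = 49.0
RSPAL_45NM_SCALED = 71.0  # 45nm result after scaling, as listed in Table II


def saving(reference_fj: float, proposed_fj: float) -> float:
    """Fractional energy saving of the proposed design over the reference."""
    return 1.0 - proposed_fj / reference_fj


print(f"65nm RSPAL vs soft MOUSETRAP:          {saving(SOFT_MOUSETRAP, RSPAL_65NM):.1%}")
print(f"45nm RSPAL vs soft MOUSETRAP:          {saving(SOFT_MOUSETRAP, RSPAL_45NM):.1%}")
print(f"45nm RSPAL (scaled) vs soft MOUSETRAP: {saving(SOFT_MOUSETRAP, RSPAL_45NM_SCALED):.1%}")
print(f"SLTI / 65nm RSPAL energy ratio: {SLTI / RSPAL_65NM:.1f}x")
print(f"PCHB / 65nm RSPAL energy ratio: {PCHB / RSPAL_65NM:.1f}x")
```

The absolute differences are 34fJ for the 65nm design and 54fJ for the scaled 45nm design, matching the figures given in the abstract; the 65nm saving computes to roughly 27%, which the text rounds to about 30%.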

## B. Monte Carlo Simulation on 32-bit RSPAL RCA

As mentioned earlier in this paper, variations become more significant for circuits working at ULV. They can affect the overall behavior and performance of the circuits. After exploring the voltage sweep within the near/sub-threshold region, process variations ($L$, $V_{th}$, $T_{ox}$) are applied at 45nm in this subsection. In [5], MC simulation was carried out to determine the delay value of $t_{soft}$ and to check the correctness of the adder, but no delay and power data of the adder under MC simulation were reported. For the PCHB and SLTI designs, no MC simulation was done. We believe it is worthwhile not only to run MC simulations on circuits but also to analyze the corresponding delay and power data, even though MC simulation is time-consuming. MC simulations on the 32-bit RSPAL RCA are executed 100 times with 10% deviation from the mean value over the three key parameters mentioned above, based on a Gaussian distribution. No error is produced across the one hundred MC iterations. Table III lists the mean values and corresponding standard deviations of delay and power, along with the energy consumption computed as the product of mean delay and mean power.
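The parameter sampling behind such an MC run can be sketched as follows. This illustrates only the sampling step, not the HSPICE flow used in the paper. It assumes that "10% deviation" means a standard deviation equal to 10% of the mean, and the nominal values for effective length and oxide thickness are placeholders chosen for illustration; only the 45nm nFET threshold voltage is taken from the text.

```python
import random

# vth_n is the 45nm nFET threshold quoted in Section IV.
# l_eff_nm and tox_nm are hypothetical placeholders, not values from the paper.
NOMINAL = {"vth_n": 0.322, "l_eff_nm": 45.0, "tox_nm": 1.0}
REL_SIGMA = 0.10  # assumption: "10% deviation" read as sigma = 10% of the mean
RUNS = 100        # number of MC iterations reported in the paper


def sample_process(rng: random.Random) -> dict:
    """Draw one set of process parameters from independent Gaussians."""
    return {name: rng.gauss(mean, REL_SIGMA * mean) for name, mean in NOMINAL.items()}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_process(rng) for _ in range(RUNS)]
mean_vth = sum(s["vth_n"] for s in samples) / RUNS
print(f"mean sampled Vth over {RUNS} runs: {mean_vth:.3f} V")
```

Each sampled parameter set would then parameterise one circuit simulation, from which delay and power are extracted and aggregated.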

TABLE III. DELAY, POWER, ENERGY OF 32-BIT RSPAL RCA UNDER PROCESS VARIATIONS

| Supply Voltage (V) | Mean Delay (ns) | Delay Standard Deviation (ns) | Mean Power (uW) | Power Standard Deviation (uW) | Energy (fJ), Mean Delay × Mean Power |
|--------------------|-----------------|-------------------------------|-----------------|-------------------------------|--------------------------------------|
| 0.2                | 123.6           | 71                            | 1.2             | 0.8                           | 148.3                                |
| 0.25               | 47.8            | 26.9                          | 2.9             | 1.7                           | 138.6                                |
| 0.3                | 16.4            | 8.9                           | 13.3            | 3.5                           | 218.1                                |
| 0.35               | 9.9             | 4.8                           | 28.4            | 8.3                           | 281.1                                |
| 0.4                | 5.9             | 2.6                           | 53.2            | 16.7                          | 313.9                                |
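The energy column of Table III and the location of the optimum energy point under variations can be checked with the sketch below, which works only from the published table values; the names `TABLE_III`, `energy`, and `cv_delay` are illustrative.

```python
# VDD (V) -> (mean delay ns, delay std ns, mean power uW, power std uW, energy fJ)
TABLE_III = {
    0.20: (123.6, 71.0, 1.2, 0.8, 148.3),
    0.25: (47.8, 26.9, 2.9, 1.7, 138.6),
    0.30: (16.4, 8.9, 13.3, 3.5, 218.1),
    0.35: (9.9, 4.8, 28.4, 8.3, 281.1),
    0.40: (5.9, 2.6, 53.2, 16.7, 313.9),
}

# Energy as the product of mean delay and mean power (ns * uW = fJ).
energy = {vdd: row[0] * row[2] for vdd, row in TABLE_III.items()}

# Relative spread of delay: standard deviation divided by the mean.
cv_delay = {vdd: row[1] / row[0] for vdd, row in TABLE_III.items()}

oep_vdd = min(energy, key=energy.get)

for vdd in sorted(TABLE_III):
    print(f"{vdd:.2f} V: energy = {energy[vdd]:6.1f} fJ, "
          f"delay std/mean = {cv_delay[vdd]:.0%}")
print("optimum energy point under variations:", oep_vdd, "V")
```

The recomputed products agree with the energy column, and the relative delay spread stays close to one half of the mean at every supply voltage.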






Fig. 9. Simulation data of 32-bit RSPAL RCA under process variations (a) delay (b) power consumption (c) energy dissipation

It can be observed that the standard deviations of delay are around 50% of their corresponding mean values. Comparing the mean delay in Table III with the delay data in Table I, the differences between the two sets are 11% at $V_{DD}=0.2V$ and 0.25V, while all others are below 8%. Hence it can be concluded that lowering the supply voltage leads to larger delay variation. When the same comparison is applied to the power data in Table I and Table III, the differences at $V_{DD}=0.2V$ and 0.25V are the smallest, at about 60%. Power consumption changes significantly under process variation when $V_{DD} \ge 0.3V$. Based on the data in Table III, the optimum energy point is still at 0.25V, with 138.6fJ. The comparison between the data (delay, power, and energy) with and without process variations is illustrated in Fig.9. The significant deviations in power consumption, especially when $V_{DD} > 0.25V$, result in considerable (about 4 times) increases in total energy consumption.

## V. CONCLUSION

In this work, we first review the suitability for power reduction of two types of approaches in asynchronous circuits, namely bundled-data and dual-rail. Afterwards, a low power/energy asynchronous logic called RSPAL, evolved from ACSL, is proposed. Its latch-less property and straightforward completion detection scheme benefit energy efficiency and performance. The optimum energy point is found at $V_{DD}=0.25V$ in the near/sub-threshold region using 45nm and 65nm technologies. Compared with its counterparts, our 45nm 32-bit RSPAL RCA saves more than 60% (43% after scaling) energy while operating at 24MHz, and a 30% energy reduction is obtained for the 65nm RCA. Monte Carlo simulation with process variations is also carried out. The resulting delay and power data are discussed and contrasted with the original data without variations. It is also shown that the proposed technique is robust against process variations at ULV, without any error occurrence. The optimal energy point remains around 0.25V when the deviations are taken into account.

## ACKNOWLEDGMENT

This work has been sponsored by the European Commission FP7 FET-Open iRISC (Innovative Reliable Chip Designs from Unreliable Components) project and SpiNaCH (CNRS PICS 6023) project.

## REFERENCES

- [1] Gonzalez, Ricardo, Benjamin M. Gordon, and Mark A. Horowitz. "Supply and threshold voltage scaling for low power CMOS." Solid-State Circuits, IEEE Journal of 32.8 (1997): 1210-1216.
- [2] Zhai, Bo, et al. "Theoretical and practical limits of dynamic voltage scaling." Proceedings of the 41st annual Design Automation Conference. ACM (2004): 868-873.
- [3] Wang, Alice, Benton Highsmith Calhoun, and Anantha P. Chandrakasan. Sub-threshold design for ultra low-power systems. Springer, 2006.
- [4] Soeleman, Hendrawan, Kaushik Roy, and Bipul Chandra Paul. "Robust subthreshold logic for ultra-low power operation." Very Large Scale Integration (VLSI) Systems, IEEE Transactions on 9.1 (2001): 90-99.
- [5] Liu, Jian, Steven M. Nowick, and Mingoo Seok. "Soft MOUSETRAP: A Bundled-Data Asynchronous Pipeline Scheme Tolerant to Random Variations at Ultra-Low Supply Voltages." Asynchronous Circuits and Systems (ASYNC), 2013 IEEE 19th International Symposium on. IEEE (2013): 1-7.
- [6] Jorgenson, R. D., Sorensen, L., Leet, D., Hagedorn, M. S., Lamb, D. R., Friddell, T. H., & Snapp, W. P. "Ultralow-power operation in subthreshold regimes applying clockless logic." Proceedings of the IEEE 98.2 (2010): 299-314.
- [7] Ho, Weng-Geng, et al. "Low power sub-threshold asynchronous QDI Static Logic Transistor-level Implementation (SLTI) 32-bit ALU." Circuits and Systems (ISCAS), 2013 IEEE International Symposium on. IEEE (2013): 353-356.
- [8] Nowick, Steven M., and Montek Singh. "High-performance asynchronous pipelines: an overview." Design & Test of Computers, IEEE 28.5 (2011): 8-22.
- [9] Singh, Montek, and Steven M. Nowick. "Mousetrap: High-speed transition-signaling asynchronous pipelines." Very Large Scale Integration (VLSI) Systems, IEEE Transactions on 15.6 (2007): 684-698.
- [10] Zhai, Bo, et al. "Analysis and mitigation of variability in subthreshold design." Proceedings of the 2005 international symposium on Low power electronics and design. ACM (2005): 20-25.
- [11] Weste, Neil, and David Harris. CMOS VLSI Design: A Circuits and Systems Perspective. Pearson Addison Wesley (2005).
- [12] Ozdag, Recep O., and Peter A. Beerel. "High-Speed QDI Asynchronous Pipelines." Asynchronous Circuits and Systems (ASYNC) (2002): 13-13.
- [13] Chen, J., Vasudevan, D., Schellekens, M., Popovici, E. "Ultra Low Power Asynchronous Charge Sharing Logic." Journal of Low Power Electronics 8.4 (2012): 526-534.
- [14] Ho, W. G., Chong, K. S., Gwee, B. H., Chang, J. S., & Yee, M. F. "A power-efficient integrated input/output completion detection circuit for asynchronous-logic quasi-delay-insensitive Pre-Charged Half-Buffer." Integrated Circuits (ISIC), IEEE 13th International Symposium on (2011): 376-379.
- [15] Amirante, E., Bargagli-Stoffi, A., Fischer, J., Iannaccone, G., & Schmitt-Landsiedel, D. "Variations of the power dissipation in adiabatic logic gates." Proceedings of the 11th International Workshop on Power And Timing Modeling, Optimization and Simulation, PATMOS (2001) (Vol. 1, pp. 9-1).
- [16] Lampinen, Harri, and Olli Vainio. "Dynamically biased current sensor for current-sensing completion detection." Electronics Letters 37.7 (2001): 408-409.
- [17] A. Tisserand. Low-Power Arithmetic Operators. Chapter 9 of Low Power Electronics Design. C. Piguet editor. CRC Press, 2004.
- [18] R. Zimmermann. Binary Adder Architectures for Cell-Based VLSI and their Synthesis. PhD Thesis, Swiss Federal Institute of Technology Zurich, 1998.