Abstract—Greedy arbiters and bundling merges compose simultaneous events in sequence and parallel respectively. Previous designs for these problems handle two to three inputs, and can be composed in a tree topology to handle more. In addition, they include subtle timing assumptions beyond the QDI delay model and undocumented timing assumptions in their arbiter's digital model. In this paper, we discuss two slightly different digital models that we call the ideal arbiter and buffered arbiter models, and match them to CMOS implementations. From CHP specifications of the greedy arbiter and bundling merge, we derive the Maybe Execute Element. We then show how it may be systematically composed to produce improved circuits for both which use a small number of simple gates, strictly abide by the QDI delay model, and gracefully scale to an arbitrary number of inputs. Finally, we touch on how this approach may be used to develop scalable circuit solutions to more sophisticated arbitration problems.
Keywords—arbitration, asynchronous, quasi-delay insensitive, qdi, non-deterministic, greedy arbiter, bundling merge
In asynchronous logic, the implementation of non-deterministic choice requires the use of two-input two-output mutual exclusion elements. A mutual exclusion element has the following behavior: if one of its inputs is asserted, then the corresponding output is asserted. However, if both inputs are asserted, then it asserts only one of the outputs, picking between the two arbitrarily. Because any circuit that implements this deceptively simple behavior can exhibit metastability, careful analog analysis is required to ensure correct operation .
These circuits may be divided into two classes: Synchronizers and Arbiters. Synchronizers handle the general case in which the asserted inputs may be unstable meaning that they may be deasserted before its corresponding output has been asserted . Arbiters handle the more specific case in which the asserted inputs remain stable. While synchronizers require fewer environmental restrictions, they are also a more complex circuit. Furthermore,  observed that it is possible to implement unstable conditions in certain practical cases by exploiting the assumption that a CMOS arbiter is in fact a fair circuit when both of its inputs are asserted. Hence, non-deterministic choices in asynchronous circuits almost always use standard arbiters as their building blocks.
One common use for an arbiter involves implementing mutually exclusive access to a shared resource. Two processes may request access to a shared resource through a third arbitrating process commonly referred to as a merge element or mixer . These elements have further been used in a variety of mutual exclusion problems. For example,  used the ideas from  to communicate pulses within a neuromorphic system.
The requesting processes, the shared resource, and the merge element all communicate across channels. Each channel is a pair of request/acknowledge wires on which two communicating processes execute a four phase handshake protocol. If either requesting process asserts its request, the merge element must first request a lock ensuring the availability of the shared resource. When the shared resource acknowledges that request with a grant, the merge element may dispatch that grant by acknowledging one of the input requests. Once the requesting process is done and lowers its request, the merge element unlocks the resource and the entire cycle begins again. The first half of the handshake is used to request the resource, and the second half is used to release the resource.
However, there are other scenarios that require more complex constraints for access to a shared resource. For example, two simultaneous requests could be granted sequentially before unlocking the resource as implemented by a greedy arbiter. This is used in  to optimize tree-arbitrated exclusive access of many channels to a single bus, in  to arbitrate data received from pixels in an event-driven image sensor, and in  for asynchronous address event communication in spiking neural networks.
Alternatively, both requests could be granted concurrently as implemented by a bundling merge in . An example of this involves using a counter to track the number of in-flight data items in a variable latency pipeline . When a data value enters the pipeline, the counter is incremented; when a value leaves the pipeline, the counter is decremented. If the increment and decrement requests occur simultaneously, a bundling merge could combine them and skip both operations. In the context of spiking neural networks with lossy communication, a bundling merge could be used to merge multiple near-simultaneous spikes into a single spike, reducing communication cost at the expense of fidelity. If you are designing an event-driven image sensor, events from multiple pixels in a column could be resolved by a read of the whole column.
One can imagine more complex arbitration problems that come from the interaction between these two. For example, a one-writer two-readers lock could mediate access to a shared memory with a write-only port and two read-only ports. Or, the counter in  could have a clear signal to execute a squash across the pipeline. This would have to be mutually exclusive from the execution of the bundled increment and decrement requests. These scenarios require generalized arbitration expressions. While this could be implemented by composing the greedy arbiter and bundling merge in various ways, arbitration trees are an inefficient way to solve the problem for a small to medium number of inputs.
Finally, there are scenarios that cannot be implemented via composition. Circling back to the counter in , suppose there are two entry and two egress points to the pipeline and the throughput of increment and decrement requests is above and beyond what the counter can manage. Then, you want to bundle any two requests allowing you to either skip the least significant bit of the counter or skip the counter altogether thereby reducing the throughput requirement on the counter by a factor of two.
In this paper, we re-examine the previous implementations of circuits for the greedy arbiter  and bundling merge , analyze their timing assumptions, and propose alternative templates for more robust arbitration expressions. In Section 2, we discuss the digital model used to analyze the arbiter's behavior, introducing a distinction between the ideal and buffered arbiters and their CMOS implementations. Then, in Section 3, we describe how the application of this model affects previously designed arbitration circuits. Section 4 proposes a new building block, the maybe execute element, as the basis for a family of arbitration problems which, in Section 5, is used to construct the bundling merge and greedy arbiter. In Section 6, we evaluate our design performance and compare it against the previous designs. Finally, the Appendix gives an overview of the program and circuit notation used in this paper.
Under the delay insensitive (DI) delay model, a circuit should
operate correctly independent of gate and wire delays. Correct operation means
that the circuit remains stable, non-interfering, and deadlock-free. An
instability, or glitch, can cause data-loss or lead to interference;
interference, or a short, can cause permanent circuit damage; and deadlock
halts the computation prematurely. To achieve this goal, every transition must
acknowledge every input to its driving gate. A transition
b if there is a causal sequence of
b that prevents
from firing until after
a has completed.
In order for this model to be Turing Complete, the quasi-delay insensitive (QDI) delay model makes one exception to acknowledgement called the Isochronic Fork Assumption. If there is a wire fork to multiple gates, and one of those gates does not acknowledge all of the transitions on that wire, then we assume that the delay from the driver to the non-acknowledging gate is bounded. In this model it is always safe to place an inverter before the wire fork. However, because gates have unbounded delay, placing an inverter after an isochronic fork and before the non-acknowledging gate can cause an instability. Because the isochronic fork timing assumption is easy to guarantee and maintain, real QDI circuits are robust by construction to temperature variation, process variation, sizing, noise, etc . For a more detailed discussion on the QDI model and this timing assumption, see , , and .
The standard arbiter circuit in Fig. 1 is responsible for ensuring mutually exclusive outputs given two inputs. In effect, it determines which of the two input requests arrived first. Ultimately, it is an analog circuit and must be carefully verified using analog analysis. Therefore, digital simulators must explicitly model the arbiter's behavior as a black-box. The implementations of the bundling merge in , and greedy arbiter in  along with many others in the literature assume an ideal arbiter in which the two outputs are always mutually exclusive high. This section examines the analog characteristics of the circuit used to implement an arbiter, and focuses on appropriate digital abstractions for different arbiter circuits.
The arbiter in Fig. 1 consists of two stages: an RS latch
followed by an instability filter (also called a metastability filter). If both
b transition to
simultaneously, then the RS-latch can become metastable with both
_v only partially transitioning. At
resolution, one of the two will finish its transition to
the other will return to
Vdd, causing a glitch . Therefore, the second stage implementing an
instability filter is required that converts the glitch caused by the
metastable latch into a delay. According to , an NMOS
form was first designed by Sutherland in 1976 which later appeared in  and . To our knowledge, the
CMOS form we use first appears in . This arbiter
will be expressed in circuit diagrams as a box labelled "Arb".
The inverters in the instability filter have unpredictable delay which depends on their load as determined externally to the arbiter. Because the downgoing transition of one output as highlighted blue in Fig. 1 is allowed to happen in parallel with the upgoing transition of the other as highlighted red, they can both be high at the same time, violating the blackbox model from the literature. This has practical implications, for example in the next section we analyze circuits in the literature that depend upon the outputs of the arbiter being mutually exclusive which is something that this circuit does not guarantee.
In practice, as long as the two outputs
drive similar capacitive loads, then the blackbox digital model holds true and
the two outputs remain mutually exclusive high. However, should they end up
driving different loads, that model will fail. Suppose
u is driving
a high capacitive load and
v is not. If
u are at
GND, then lowering
a could cause a transient
state in which both
v are high. A SPICE simulation in Fig. 2 replicates this behavior, showing both
v above the threshold voltage after 0.09 ns.
In effect, the blackbox model used in the literature has an extra timing
assumption. As suggested by , we could treat this
arbiter as a blackbox gate in which
treated as an isochronic fork, and guarantee the output loads in layout.
However, this is an assumption that is often lost when publishing circuits
which use arbiters.
Therefore, a more conservative model for an arbiter that more rigorously
implements the QDI delay model, and captures the impact of arbitrarily
different loads on the arbiter's outputs would include output buffers that have
an unbounded delay. We refer to this as the buffered arbiter model.
Specifically, the digital abstraction for the arbiter would artificially ensure
_v are mutually exclusive low, and model
v using ordinary inverters. The buffered
arbiter model includes a transient state where
v can both be high, consistent with Fig. 2.
Alternatively, an ideal arbiter that avoids the buffered arbiter model can
be implemented by folding a latch into the instability filter as shown in Fig. 3. This forces the pull up networks of the arbiter's
outputs to acknowledge the pull down network of the other, keeping them
mutually exclusive. Importantly, the pass transistor logic implementing the
instability filter has to remain intact. This is why
_v is still
the source node for the
u↾ rule and
_u for the
v↾ rule. If more drive strength is necessary, the output signals
may be further buffered while maintaining the ideal arbiter model by adding
more latches. This arbiter will be expressed in circuit diagrams as a box
Now in Fig. 4, the upgoing transition on
v waits for the downgoing transition on
u to pass the
The buffered and ideal arbiter models ultimately help to identify the timing assumptions beyond the strict QDI delay model that might be well-known by the authors of the work, but unclear to the readers. In subsequent sections, we will present designs that do not rely upon these timing assumptions for correct operation, thereby improving their reliability.
If we apply the buffered arbiter model to the bundling merge circuit from  in Fig. 5, it is possible for the
circuit to go through an entire cycle, asserting and deasserting the grant
high. This is because
w can be driven low following
Sa↾ and therefore can skip the check for
¬uA ∧ ¬uB.
As long as
uB continue to remain high,
Sa will continue to complete full
Ba to toggle without
Br ever switching.
However, the inverter that causes this failure cannot be removed. If it
were, then the decomposition of
c from the gate
Sr would create a second failure mode. After
Sr↾ followed by
the request on
Ar is acknowledged, eventually causing
uA⇂, another request can appear on
uB↾. This could cause a glitch on
w that can propagate out
Sr. The inverter and AND
w successfully covers this glitch
as long as the application of the atomic complex gate assumption to the driver
w is valid. This assumption requires the gate highlighted red
in Fig. 5 to behave as if it were a single gate.
Ultimately, this means that the inverter in that gate must transition before
We remark that since  already assumes a fast local
inverter through their application of the atomic complex gate assumption to the
w, the assumption of fast inverters in the
implementation of the arbiter is likely reasonable in practice as long as an
effort is made to implement that assumption in layout.
Applying the buffered arbiter model to the greedy arbiter from  in Fig. 6 breaks far more. This
is because the environment is allowed to bypass various stages of the internal
handshake. Suppose a request on
Ar has already been given the
Ar is subsequently lowered, but
Se are still high. Then, as
Sr is being lowered, a
Br can arrive causing a down-up glitch on
Br happens a little later, and
Sr subsequently transitions low, then the request on
Br will drive
v high causing a down-up glitch to
Br arrives even later and neither of these glitches
happen. This means
Sr remains low long enough for
Se to transition high. Then, as
high, an up-down glitch could be generated on
v since the request
Br has simultaneously lowered
The greedy arbiter from  in Fig. 7 already works under the buffered arbiter model, but
still has other environment-dependent timing assumptions. Suppose a request
Sr↾. Then, as
Se is lowered, another request arrives driving
and causing an up-down glitch on
bo, then as long as
_u has not transitioned low in response to the request on
Ar, the glitch on
bo could propagate through the
arbiter and out
Be. Alternatively if a request never arrives on
Br, then when
Ar is lowered,
be lowered and
Se raised. Then,
Se↾ can drive
_u ever transitioning. So, a new request
Ar could cause an up-down glitch on
Sr goes up and
Se down in response to this new
request, then that glitch will propagate out
Ae. It should be
noted that neither of these timing assumptions are hard to ensure in layout and
that the authors of  and 
made these assumptions intentionally to conserve transistors in the face of a
severe restriction due to their application requirements.
The behavior of the bundling merge can be succinctly described with a single
selection statement. The selection statement is ultimately symmetric across the
two input channels, meaning it behaves the same regardless of which channel
request arrives first. Given two channels
using four phase communication protocols and one channel
S using a
two phase communication protocol, the bundling merge is as follows.
∗[[A → S;A;S | B → S;B;S | A∧B → S;(A∥B);S]]
However, the behavioral description of the greedy arbiter implemented in  is quite a bit more complex since it tries to remain symmetric. Instead of implementing a symmetric greedy arbiter, we implement an asymmetric one that looks similar to the bundling merge so that we can build both circuits with common building blocks.
∗[[A → S;A;S | B → S;B;S | A∧B → S;(A;B);S]]
Now, the grants are always served in the same order regardless of which
request arrives first. This means that the only thing differentiating the
bundling merge and the greedy arbiter is the behavior when
B happen together. The bundling merge executes
while the greedy arbiter executes
A;B. So lets just look at one of
∗[[A → S;A;S | B → S;B;S | A∧B → S;(A∥B);S]]
The first thing we can do is pull
S out of the selection
statement since it behaves the same regardless of the condition. The first
S happens before any communication actions on
B but after at least one of their associated
probes. Therefore, we add an extra guard checking the probes of
B and then move the first action on
S so that it
precedes the selection statement. The second action on
after all communication actions on
we pull the second action on
S out to the other side of the
∗[[A∨B]; S;[A → A | B → B | A∧B → A∥B];S]
Next, we can add in a redundant branch to the selection statement in
preparation for later steps. The newly created branch
skip will never be executed due to the guard
beforehand. This has a subtle side-effect of making the guards
¬A∧B unstable, requiring a synchronizer
to implement the non-deterministic selection statement.
∗[[A∨B]; S; [ ¬A∧¬B → skip | A∧¬B → A | ¬A∧ B → B | A∧ B → A∥B ];S]
This allows us to factor the selection statement into two, one for the
A and one for
B. This effectively pulls the
parallel composition out of the selection statement. Note that the selection
statements must remain non-deterministic because the probes
¬B still may be unstable.
∗[[A∨B]; S; ([¬A → skip | A → A] ∥ [¬B → skip | B → B] );S]
Now, we can notice that
[¬A → skip | A → A] is effectively a
non-deterministic execute. If there is a request on
A. Otherwise skip. So, lets use process decomposition to
factor this out. However, in order to correctly decompose these processes, we
also have to keep the probes in
A∨B in mind. To isolate
B to the newly decomposed processes, we
create two state variables
uB that keep track
of the outcome of the non-deterministic selection statements. Then we can treat
them as shared variables, checking their value to ensure the non-deterministic
selection has resolved. To do this process decomposition, we introduce two new
Sb that use four phase communication
protocols. Finally, we can use the transformation described in  to implement the unstable guards on
¬B using the stable guards on
Sb. This allows us to use an arbiter to implement the
non-deterministic selection statements.
∗[[Sa → Sa | A → uA↾; [Sa]; A; uA⇂; Sa]] ∥ ∗[[Sb → Sb | B → uB↾; [Sb]; B; uB⇂; Sb]] ∥ ∗[[uA∨uB]; S;(Sa ∥ Sb);S]
Ultimately, the module we factored out
∗[[S → S | A → uA↾; [S]; A;
uA⇂; S]] can be implemented very succinctly. If
first, then we just do the complete handshake on
Ar arrives first, then we use a state variable
encode the output of the arbiter from the non-deterministic selection. Then, we
can wait for
Se to give us the grant, and proceed to execute
A by lowering
A has completed
its execution, as communicated by lowering
Ar, we can reset the
circuit, completing the handshakes on
∗[[ Se → Sr↾;[¬Se];Sr⇂ | Ar → uA↾;[Se];Ae⇂;[¬Ar];uA⇂;(Ae↾∥Sr↾;[¬Se];Sr⇂) ]]
This reshuffling leads to a very simple circuit using an ideal arbiter as
shown in Fig. 8. One should note that because of the
wire forks internal to the arbiter, any logic containing
The circuit in Fig. 8 is called the Maybe Execute
Element because when triggered on
S it only executes the
interfaced communication action on
A if that communication is
A is not ready, then the action is simply
skipped and the trigger on
S is acknowledged. Therefore, the action on
may be executed.
To make this circuit easier to navigate, we introduce some syntactic sugar for CHP using a re-write rule.
∗[[Sa → Sa | A →uA↾; [Sa]; A; uA⇂; Sa]] ∥ ∗[… [uA] … Sa …]
Specifically, the above CHP is rewritten as follows:
∗[… [A⸰] … A⸰ …]
This transforms the bundling merge specification to
∗[[A⸰∨B⸰]; S;(A⸰ ∥ B⸰);S]
and the greedy arbiter specification to
∗[[A⸰∨B⸰]; S;(A⸰; B⸰);S]
The parallel composition
A⸰∥B⸰ as seen in the bundling
merge specification, or the sequential composition
as seen in the greedy arbiter specification is now just as simple in Fig. 9 as it would be using syntax directed translation .
Its possible to implement the maybe execute circuit using buffered arbiters
as in Fig. 10. The added transient state in the
buffered arbiters desynchronizes
uA can go up before
goes down and
Sr can go up before
uA goes down. Now,
the first case is not a problem because
Ae is forced to wait for
Se anyway. However, the second case can cause an instability.
Therefore, we have to use an asymmetric C-element to force
uA to transition to
GND. This new C-element
uA⇂ can be easily folded into other logic in a
Therefore, the only difference between the state transition diagrams of the ideal arbiter maybe execute and the buffered arbiter maybe execute is the first case that we identified. This just creates an extra state in Fig. 11 highlighted in red beyond the ideal arbiter's state transition diagram. Take note that this and further diagrams are rendered using the standard state transition diagram notation from the asynchronous literature, as found in , and is not a conventional finite automaton even though the graphical notation is similar.
It is possible to optimize this further by replacing the gate driving
Ae with pass transistor logic. While it is more performant for
two inputs, it does not scale well beyond that.
Lastly, note that for efficiency, this implementation assumes that the
arbiter is fair as noted in . If the arbiter is not
fair, then it is possible for
A to execute multiple handshakes
Sr. The opportunistic merge in  makes a similar assumption.
Circling back to the bundling merge and greedy arbiter, this
transformation makes the specification for a bundling merge look very similar to a
deterministic merge and a greedy arbiter look very similar to an alternating
merge. The only departure from the deterministic specification is that we have
to actively check to make sure at least one of the input channels is requesting
[A⸰∨B⸰]. If not, then the circuit will
simply busy wait instead of blocking, repeatedly and unnecessarily executing
communication actions on
S, wasting a lot of energy along the
∗[[A⸰∨B⸰]; S;(A⸰ ∥ B⸰);S] ∗[[A⸰∨B⸰]; S;(A⸰; B⸰);S]
To see the busy-wait versions, look no further than back to the sequential
and parallel composition circuits in Fig. 9. However,
implementing the extra check for
[A⸰∨B⸰] which makes it
blocking requires adding only two extra transistors, as highlighted in Fig. 12, to what would otherwise just be an inverter.
The transistor network that drives
Sr in Fig. 12 and Fig. 14, is
ultimately a generalized C-element with a weak staticizer. The cross-coupled
inverters form a latch whose value is set by the pull-up and pull-down networks
of the C-element. The weak backwards inverter is a staticizer, designed to keep
the previous state of the C-element when both the pull-up and pull-down
networks are off.
The state transition diagram of this interface between
uB is shown in Fig. 13. Again, the extra states introduced in the buffered
arbiter implementations are highlighted in red.
And with some peephole circuit optimization, we get the circuits presented
in Fig. 14. Notice that no single gate in the
greedy arbiter has a transistor stack length that is dependant upon the number
of inputs. That means that one could scale it to an arbitrary number of inputs
without ever introducing any gate trees. For the bundling merge, the
generalized C-element driving
Sr does grow with the number of
inputs. However for the most part, that can be implemented as a tree of
standard C-elements. The final C-element at the root of the tree would need to
include the highlighted transistors attached to its ground node.
Furthermore, if the gates driving
Sr in  were merged together, the
Sa would no longer be necessary. With the inverter
removed, we would have successfully re-derived the bundling merge presented in
this work. This similarity shows the power of our systematic approach to
derive circuits of similar quality to carefully hand-crafted circuits in the
More generally, we can implement any arbiter of the form
expr];S;(compose expr);S] using the same basic template. For example, a
one-writer two-reader lock where the writer gets priority is
Alternatively, we could give the readers priority by modifying the composition
Or we could require both readers to make a request before either of them get
the grant by modifying the wait expression:
This last example demonstrates the power of this system to produce solutions to
arbitration problems that cannot be otherwise constructed with the circuits
found in the literature. In general, these types of solutions come from wait
expressions that are not simply an OR across all input request signals.
When optimizing the buffered arbiter design in Fig. 15, take note that we do not have to wait for
uA to go low before passing the grant off to the next request.
This is because
vA↾ requires that
Ar is lowered
thereby handing the grant back to us. We just need to make sure that
uB complete their transition before we
complete the handshake. This allows us to merge the check for
uB⇂ into the C-element driving
Sr making it a
symmetric generalized C-element.
The process of extending this circuit to many inputs is not altogether
obvious. There will still be a C-element tree driving
to the ideal arbiter version, with the special root node that checks the
upgoing transitions of
uB, etc. However, the
downgoing transitions of those signals must be checked at the leaves of the
tree. This prevents the transistor stacks at the root node from growing too
Aside from the bundling merge circuit in , the
circuits available in the literature rely upon hierarchical composition to
scale to many inputs. This connects the child-grant channel
B of one arbitration circuit to the parent-grant channel
S of another. Furthermore according to the authors, scaling the
flat composition in  above 3 inputs becomes dangerous
due to their timing assumptions, and the standard cell library is unlikely to
have the necessary gates for
w in Fig. 5.
We developed and evaluated all of these circuits using a set of in-house tools, a version of which is described in  and publicly available at . We also verified correctness for all of the elements described in this paper with a brute force switch-level simulation which identifies instability, interference, and deadlock across all states. No environmental assumptions were required since this was verified using simple sources and sinks on the input and output channels. We automatically translated these specifications into SPICE netlists and verified their analog properties using Synopsys's combined simulator with VCS, a Verilog simulator, to simulate the testbench and HSIM, a fast SPICE simulator, to report power and performance metrics.
To evaluate the energy per operation and throughput of these circuits, we
used a 1V 28nm process and protected each of the digitally driven channels with
a FIFO of three WCHB  buffers isolated to a different
power source to get more accurate results. All of the circuits are sized
minimally with a pn-ratio of
2, and we use weak feedback
staticizers for all of the C-elements in our designs. Circuitry necessary for
reset was not included in any of the above descriptions. The performance values
for the cited work is determined using the same method.
Fig. 11 shows the state transition diagram for
the maybe execute element, covering all combinations of possible signal
transitions under any possible timing. The figure was derived directly from the
circuit, demonstrating that the maybe execute element has no switching hazards
on any of the digital gates under any possible timing. Furthermore, the
A always waits for the parent grant from
S, and the parent grant is not returned until the handshake on
A has completed. This guarantees that child grants remain mutually
The circuit family we use guarantees that if two components are independently verified to be robust under all possible delay scenarios, and they only interact through delay insensitive handshake protocols, then their composition is also robust. Therefore, the compositions in Fig. 9 maintain timing robustness characteristics.
Finally, the state transition diagram in Fig. 13 was
similarly derived, demonstrating that the bundling merge and greedy arbiter
interface has no switching hazards on any of the digital gates under any
possible timing. Specifically, all upgoing transitions on
uB must happen after
Ce⇂ and all downgoing
transitions must happen after
Ce↾, keeping the gate driving
Therefore, our designs suffer from none of the subtle timing issues we highlighted in Section 3. Meanwhile, our designs expose a more general strategy for composing arbitrary non-deterministic selection expressions and show good performance with reasonable energy requirements. Furthermore, our designs only require 2 gates beyond those typically found in commercial standard-cell libraries, the arbiter and the C-element executing the wait expression. These two cells can be easily implemented using standard techniques or automated layout .
In a real system, arbitration circuits are likely to be few and far between, so any enhancements demonstrated by our circuits are likely to have little to no effect on the system performance. Therefore, the goal of our analysis is simply to ensure that these circuits do not end up bottlenecking the system performance beyond what is already in the literature. Any performance enhancements demonstrated by our circuits are not generally a result of the systematic strategy we developed, but are instead a result of the peephole optimizations that come after. To be fair, it is surprising that circuits which more strictly implement the QDI delay model can demonstrate any performance enhancement at all.
For the bundling merge
A∥B, our buffered arbiter approach, with
68 transistors, operates 5% faster and consumes 4% less energy per operation
than previous work  with 78 transistors. Our ideal
arbiter version, with 69 transistors, is slightly worse than the normal arbiter
design in every metric because it moves one of the acknowledgement checks
earlier in the handshake.
Fig. 16 shows the performance and Fig. 17 shows the energy of each bundling merge design as they scale to many inputs. The high activity (top) plot assumes that all inputs are making requests and the grants are servicing those requests at max possible throughput while the low activity (bottom) plot assumes that only one input is making requests while the others are inactive.
When scaling the designs beyond 2 inputs with high activity, the flat ideal arbiter design is quickly superior in both frequency and energy. This is primarily because all of the nodes in the tree composition are unbuffered, so a request has to be passed all the way up and the grant back down the tree before the requesting process can be serviced. This incurs a significant delay that the flat compositions easily avoid.
At lower activity, the tree compositions will start to perform better
because they have to check fewer nodes. More specifically, if
N nodes are making requests, the flat composition has to
N while the tree compositions only have to check
O(log2(N)) nodes. In the low activity energy plots in Fig. 17, this is apparent because the tree designs use
much less energy overall. However, the low activity frequency plots in Fig. 16 show that the flat ideal arbiter design remains
superior to the tree designs. In a real system, if access to a shared
resource is a bottleneck, and a bundling merge is used to arbitrate, then the
performance gains from the flat ideal bundling merge can ultimately lead to an
improvement in overall system performance.
For the greedy arbiter
A; B, our buffered arbiter version, with
59 transistors, and our ideal arbiter version, with 65 transistors, both operate
slower and consume more energy than  with 48
transistors and  with 35 transistors. This is to be
expected since  and  both
made significant timing assumptions, sacrificing reliability in order to fit
their power and throughput budget.
Fig. 18 shows the performance and Fig. 19 shows the energy of each greedy arbiter design as they scale to many inputs. Again, the high activity (top) plot assumes that all inputs are making requests and the grants are servicing those requests at max possible throughput while the low activity (bottom) plot assumes that only one input is making requests while the others are inactive.
As the greedy arbiter is scaled beyond two inputs with high activity, the
flat ideal arbiter design is still clearly superior to the tree designs in both
frequency and energy. However, the effect of lower activity is more
significant. While the bundling merges check all
N inputs in
parallel, the greedy arbiter does so one at a time. This means that the tree
composition will get a significant advantage to throughput beyond the flat
compositions at lower activity levels. It is not possible to include  in this comparison because it deadlocks all but one
input in this scenario.
In this paper, we presented a robust implementation for a non-deterministic channel action and showed how it could be easily composed to generate robust implementations for bundling merges and greedy arbiters along with any other desired locking mechanism. Along the way, we elaborated on the difference between the buffered arbiter and ideal arbiter models and how those differences affect their respective circuit implementations. Finally, we evaluated these designs, showing better performing bundling merge and a similarly performing greedy arbiter.
Overall, the bundling arbiter and greedy arbiter are different problems.  and  simply present circuit diagrams for greedy arbiters rather than an approach that could be used to design a bundling merge as well. This is also the case for , where a bundling merge circuit is presented rather than an approach that could also be used to design a greedy arbiter. This paper is the first to present a unified approach to both problems.
Communicating Hardware Processes (CHP) is a hardware description language used to describe clockless circuits derived from C.A.R. Hoare's Communicating Sequential Processes (CSP) . A full description of CHP and its semantics can be found in . Below is an informal description of that notation listed top to bottom in descending precedence. For a complete discussion of the interaction between the handshake expansions of channel actions like send and receive and the composition operators, see .
Dataless vs Datafull Dataless expressions operate on node voltages
while Datafull operate on delay insensitive encodings. Mixed expressions
implicitly cast the datafull to dataless using the encoding's validity.
Specifically, for a datafull expression
e its positive sense
e is cast to a validity check while its negative sense
¬e is cast to a neutrality check.
null is defined to
be a neutral state of an encoding.
X consists of a request
Xr and either an acknowledge
Xe. The acknowledge and enable serve the same
purpose, but have inverted sense. With these signals, a channel implements a
network protocol to transmit data from one QDI process to another.
skipdoes nothing and continues to the next command.
- Dataless Assignment
n↾sets the voltage of the node
n⇂sets it to
v := Ewaits until the datafull expression,
E, is valid, then assigns that value to the variable,
X!Ewaits until the datafull expression
Ehas a valid value, then sends that value across the channel
X. Ultimately, a send is expanded into a handshake on its underlying signals. The standard four phase send on channel
Xr := E; [Xa]; Xr := null; [¬Xa]for an acknowledge channel or
Xr := E; [¬Xe]; Xr := null; [Xe]for an enable channel.
X?vwaits until there is a valid value on the channel
X, then assigns that value to the variable
v. Ultimately, a receive is expanded into a handshake on its underlying signals. The standard four phase receive on channel
v := Xr; Xa↾; [¬Xr]; Xa⇂for an acknowledge channel or
v := Xr; Xe⇂; [¬Xr]; Xe↾for an enable channel.
- Dataless Channel Action If
Xis a dataless channel, then a send with an acknowledge channel is indistinguishable from a receive with an enable channel and a send with an enable channel is indistinguishable from a receive with an acknowledge channel. Therefore, we can simplify the syntax for the dataless send
X?is used determine if the channel is ready for a receive action, returning the value waiting on the request
Xrwithout executing the receive.
X!is used to determine if the channel is ready for a send action, expanding into either
¬Xagiven an acknowledge or
Xegiven an enable. For dataless channels, the syntax is simplified to
- Sequential Composition
S; Texecutes the programs
- Parallel Composition
S ∥ Texecutes the programs
Tin any order.
- Deterministic Selection
[G1 → S1▯…▯Gn → Sn]where
Gi, called a guard, is a dataless expression and
Siis a program. The selection waits until one of the guards,
Gi, evaluates to
Vdd, then executes the corresponding program,
Si. The guards must be stable and mutually exclusive. The notation
[G]is shorthand for
[G → skip].
- Non-Deterministic Selection
[G1→S1|…|Gn→Sn]is the same as Deterministic Selection except that the guards do not have to be stable or mutually exclusive. If two or more evaluate to
Vddsimultaneously, then one is picked arbitrarily (not necessarily random). In a circuit, this choice is implemented by a collection of arbiters and synchronizers. As discussed in Section 2, when two or more guards evaluate to
Vddsimultaneously, it can cause a metastable state in the arbiter or synchronizer. This metastable state then resolves non-deterministically, giving the grant to one of the branches of the selection statement. Therefore, the digital model of this selection statement is also non-deterministic in such a condition.
∗[G1 → S1▯…▯Gn → Sn]is similar to the selection statements. However, the action is repeated until no guard evaluates to
∗[S]is shorthand for
∗[Vdd → S].