Ode to a Vyukov Queue
Introduction
[Addendum: great comments about this post on Reddit]
This post celebrates an instance of simplicity in action: Dmitry Vyukov’s1 Multi-Producer Single-Consumer queue (DV-MPSC).
DV-MPSC is one of those rare solutions that addresses essential complexity and no more. The result is something easy to understand and easy to implement. Such clarity is uncommon in the concurrent algorithm space and has no doubt contributed to adoption of DV-MPSC in mainstream concurrent systems like Netty (via JCTools), Akka and Rust2.
Such elegance is not without tradeoffs. We’ll investigate two commonly discussed shortcomings–a blocking progress guarantee and merely serializable ordering–and review their impact.
Terminology Review
A quick review of important terms before diving into the details.
| Progress Condition3 | Definition | Think of it as |
| --- | --- | --- |
| wait-free | All threads complete their call in a finite number of steps. The strongest guarantee of progress. | Everybody is working |
| lock-free | At least one thread is always able to complete its call in a finite number of steps. Some threads may starve but overall system progress is guaranteed. | Somebody is working |
| blocking | Delay or error in one thread can prevent other threads from completing their call. Potentially all threads may starve and the system makes no progress. | Somebody is probably working unless bad things happen |

Additionally:
- Thread - An algorithm instance on a multiprocessor/multicore shared memory system. We assume multiple threads are able to execute simultaneously.
- Completes its call - An invocation of the `push` or `pop` method that returns control to the caller thread.
Algorithm Properties
Looking at DV-MPSC from an academic computer-science standpoint one might be a tad disappointed: DV-MPSC lacks properties typically associated with robust concurrent algorithms:
- Overall it’s a blocking algorithm. No {wait,lock,obstruction}-freedom here.
- The algorithm is not sequentially consistent nor linearizable, only serializable4.
Put another way, on paper DV-MPSC:
- Makes no guarantee that any work is going to get done, and
- Work that is lucky enough to get done has no overall deterministic ordering.
But all is not lost…read on.
Implementation
```cpp
/*
 * Minimal C++ implementation of the non-intrusive Vyukov MPSC queue.
 * Illustrates the spirit of the algorithm while being accessible
 * to most readers.
 *
 * Nodes are enqueued at the *tail* and removed from the *head*.
 */
#include <atomic>
#include <memory>

struct Node {
    std::atomic<Node*> next;
    SomeOpaqueType value;
};

class MpscQueue {
public:
    MpscQueue() : stub(new Node()), head(stub.get()), tail(stub.get()) {
        stub->next.store(nullptr);
    }

    void push(Node* node) {
        node->next.store(nullptr, std::memory_order_relaxed);
        Node* prev = tail.exchange(node, std::memory_order_acq_rel);
        prev->next.store(node, std::memory_order_release);
    }

    Node* pop() {
        Node* head_copy = head.load(std::memory_order_relaxed);
        Node* next = head_copy->next.load(std::memory_order_acquire);
        if (next != nullptr) {
            head.store(next, std::memory_order_relaxed);
            head_copy->value = next->value;
            return head_copy;
        }
        return nullptr;
    }

private:
    std::unique_ptr<Node> stub;
    std::atomic<Node*> head;
    std::atomic<Node*> tail;
};
```
(This is the non-intrusive variant5)
DV-MPSC boils down to a linked list + atomic operations. Additionally:
- `push` and `pop` employ a minimal number of atomic operations.
- There are no special cases to handle contended or partially-applied operations.
- There are also no compare-and-swap loops or retries.

These qualities make DV-MPSC more approachable than many other concurrent queue algorithms (compare with the familiar M&S queue, which has loops and needs at least two compare-and-swaps to enqueue a node6). Developers can lean on prior experience with linked lists to quickly grasp the basics and save the cognitive heavy-lifting for understanding how the atomic operations and corresponding memory orderings work.
Block me, maybe?
Generally, DV-MPSC operations are wait-free. Recall that both `push` and `pop` have no loops or special cases; they execute straight through every time.

However, there is a corner-case lurking in `push` that knocks DV-MPSC into the blocking category. Let’s examine `push` in more detail and see what is going on:
Blocking Corner-Case
Annotating `push`:
```cpp
void push(Node* node) {
    node->next.store(nullptr, std::memory_order_relaxed);
    Node* prev = tail.exchange(node, std::memory_order_acq_rel); // #1 Serialize producers
    // #2 DANGER ZONE
    prev->next.store(node, std::memory_order_release);           // #3 Serialize consumer
}
```
A new queue starts in this state:
```
Queue:
  head -> stub
  tail -> stub

List contents:
  stub
    next -> nullptr
```
A producer thread executes `push(a)` and the queue now looks like this:
```
Queue:
  head -> stub
  tail -> a

List contents:
  stub                a
    next -> a           next -> nullptr
```
Now a producer begins to execute `push(b)`, completing #1 but stopping before #3. The producer stopped (crashed, preempted, etc.) exactly at #2, leaving the queue in this state:
```
Queue:
  head -> stub
  tail -> b

List contents:
  stub                a                 |  b
    next -> a           next -> nullptr |    next -> nullptr
                                        |
               linked list is broken here ^
```
Uh-oh, the queue is inconsistent: `tail` points to `b`; however, `b` is unreachable starting from `head`.
What happens next? Additional calls to `push` will add more unreachable nodes to the list. If `push(c)` were invoked the queue would look like this:
```
Queue:
  head -> stub
  tail -> c

List contents:
  stub                a                 |  b              c
    next -> a           next -> nullptr |    next -> c      next -> nullptr
                                        |
```
And what about `pop()`? The first `pop` call will return `a`. However, subsequent `pop` calls will return `nullptr` as if the list were empty.
Revisiting the above definition of blocking:
…[p]otentially all threads may starve and the system makes no progress.
Now we see why DV-MPSC, taken as a whole, is blocking. If this inconsistent condition is encountered, the algorithm will make no progress: it will never deliver elements to the consumer again. Individual calls to `push` and `pop` will continue to execute in a finite number of steps, but the system overall fails to make progress.
Impact
For the above to occur, the enqueuing thread must halt and never resume precisely after exchanging `tail` but before storing the new value of `next`. On x86_64 this window of vulnerability can be as short as a single instruction.
The key is the never resume part. If the enqueuing thread is merely delayed–by garbage collection or OS preemption, for example–then the queue will become consistent again (and enqueued nodes visible to the consumer) once the thread resumes and completes its call.
In most practical uses of DV-MPSC this corner-case will be a non-issue.
Serializable Ordering
Imagine two threads invoking `push` simultaneously:
```
Producer 1   push(a)
Producer 2   push(b)
Consumer                      pop(?)  pop(?)

------------ program order -------------->
```
Later in program order the consumer executes two sequential `pop` operations and retrieves `[a, b]`. Intuitively this makes sense: the `push` operations were simultaneous, therefore we don’t know which producer “won” the race to enqueue an element first.

We can also accept that the two sequential `pop` operations could have instead produced `[b, a]` without violating our intuition (the empty result case is ignored in this example).
Now imagine three producers:
```
Producer 1   push(a)            push(d)
Producer 2            push(c)
Producer 3   push(b)
Consumer                                 pop(?)

------------ program order -------------->
```
What value will `pop` have? Ignoring the empty case, `pop` could return `a`, `b`, or `c` but not `d`. In fact, all of the lists below are valid orderings for sequential `pop` operations (non-exhaustive):
```
[a, b, c, d]
[b, c, a, d]
[b, a, c, d]
[c, a, d, b]
[a, b, d, c]
```
This is serializable ordering in action: individual producers are FIFO–there is no valid history where `d` is read before `a`–but no ordering between producers exists, nor does the ordering need to be consistent with program order.
Impact
The potential orderings admitted by serializability may be unintuitive to humans, but per-producer FIFO partial ordering is sufficient to implement powerful concurrent systems such as the actor model. So aside from bending your brain a bit it doesn’t get in the way of utility.
Closing Thoughts
This was a quick take on DV-MPSC and two of the tradeoffs involved in its design. Up for learning more? The footnotes will get you started and you can also take a look at what others have written about DV-MPSC.
- Dmitry’s site 1024cores.net accumulates years of his insight into synchronization algorithms, multicore/multiprocessor development, and systems programming. If nothing else, visit his queue taxonomy and stand humbled knowing Dmitry has probably forgotten more about each dimension in that design space than most people presently know. As of this writing Dmitry is at Google where he’s working on all manner of wicked-cool dynamic testing tools. [return]
- Akka and JCTools change the algorithm slightly to make it linearizable at the expense of a weaker progress condition in `pop`. This is necessary to conform to the `java.util.Queue` interface. Consult this post to start down that rabbit-hole (beware, it’s a deep one). [return]
- More details on progress conditions. [return]
- Discussions of these finer points here and here. [return]
- The version here differs from 1024cores in that elements are enqueued at the tail and removed from the head. The 1024cores implementation uses ‘head’ and ‘tail’ in the reverse manner. Additionally, performance sensitive applications would likely use the intrusive version, but the non-intrusive version is easier to explain. [return]
- I fully acknowledge that the M&S queue addresses the more challenging MPMC case. I highlight it because it is frequently encountered as an exemplar of concurrent queues in general. [return]