In true modern fashion, I have reached the conclusion that the world needs to hear from me, and that I need to butcher buzzwords into places they don’t belong (this isn’t really about eventual consistency, at least in the distributed systems sense).
What do I intend to foist upon the world? I love teasing every inch of mileage out of a data structure by matching its constraints to my constraints. I don’t see much consistent discussion on this, so I figure I may as well try my hand at it. Perhaps I can infect the world with some of my enthusiasm, and we can escape the pervasive fear of “rolling your own” data structure. Not that this is usually a good idea, but I would love to see a greater proliferation of varieties available for general use, so that we don’t perpetually use the hammer of an STL hash map, tree or queue for every screw we see.
OK, down to business. What are my constraints today?
I have a simple synchronization primitive I use called a WaitQueue. It boils down to a queue of threads waiting for a state change, and requires a collection supporting three basic operations: poll, append, and remove from an arbitrary but known position (to support cancellation).
Now, there are a lot of variants at this point that would be acceptable, and I have used a variety of implementations over the years. Frankly it makes little difference when you’re suspending and signalling threads, since these costs usually overshadow any queueing costs. But a good solution is more generally useful than this scenario, and a question on the concurrency-interest mailing list had me spin the latest variant into a generalised collection.
So, the ideal data structure? The obvious answer is a doubly-linked list.
No doubt there’s a catch?
The problem here is that we’re in a concurrent world, and this makes the problem hard because – at least until transactional memory appears – very few macro behaviours can be executed atomically. Performing the action in multiple atomic steps1, however, is a significant expense and error prone.
When developing boring old sequential algorithms, we’re used to maintaining an exact representation of our structure, and there’s a tendency to assume this should continue when we go multi-threaded. But this isn’t a requirement of the structure, we simply do it because it seems natural.
There is an alternative, though: We can permit certain kinds of corruption that can safely be repaired. Put another way, we impose much weaker constraints on modifications to the structure, so that any number of possible outcomes are acceptable, even though only one is desired. This can help us alleviate the necessity for (as much) atomicity, and rather unexpectedly make things easier to reason about.
How does this affect our doubly-linked list?
A widely used design is to permit our prev pointer to be inaccurate. Since we only really need it for deletion, its only purpose is to point somewhere behind us; so long as it really is behind us, it doesn’t matter how far. We can always walk forwards again to find our real predecessor before we perform our delete.
This permits us to maintain our next pointers accurately with a single atomic operation, and only make a best-effort on the prev pointers, tidying them up periodically – say, while iterating the contents.
However I want to take this one natural step further: I’m happy for both the prev and next pointers to be inaccurate, so that both can be updated unconditionally, avoiding any costly CAS operations. As icing on the cake, this permits removal to be wait-free in worst-case constant time.
Sounds like a recipe for disaster.
If you think that’s bad, we do the same for our head and tail pointers too!
We depend on some very simple constraints, and these compose to a natural proof of validity.
1. We only insert at the tail2
2. We never delete the tail node (though it may be marked deleted)
3. Any node ever in the list must always eventually lead to the next live node in the list by following its next pointer chain.
4. We impose the equivalent of (4) on prev also.
How do these help us?
(1) & (2) are both simple constraints, and combine to ensure that inserts never interfere with deletes, nor vice-versa. Once we have this, and combined with (3), lock-free insertion is trivial, and out of scope for this post.
(3) permits the head and tail pointers to be stale, and with (4) we can implement removal.
So how do we ensure (3) and (4)? Deletes can only interfere with each other, so this is actually a lot simpler than it might at first appear, and can be shown safe by induction. (1-4) hold on any list without deletions, so we just need to ensure they continue after.
– First we mark the node deleted
– We walk our prev chain backwards until we find a live node, and we walk our next chain forwards until we find a live node
– We then set the next of our predecessor and the prev of our successor to each other, unconditionally
– We leave the pointers of the deleted node untouched
Now, if deletes occur in isolation (i.e. are separated by a live node) then this just trivially works.
However, if we have a string of adjacent deletes, what happens? Well, perhaps the next and prev pointers of the range represent different lists, perhaps all of them point to a different node than they should. But they all must point to a node prior (or equal) to the live node in the direction of their travel, since all of the mutators stop when encountering a live node, and the property held before we started. Regardless of the interleaving and whose “live node” is persisted in any node en route, the worst we can do is insert a stale reference.
Wait, I thought you said this was guaranteed constant time?
I did. This can be made constant time by simply bounding the number of nodes you traverse to find the live neighbour – to, say, 10. If there are so many competing adjacent deletes that you exceed this number of traversals, you just stop on the 10th deleted node and modify it, editing out the following 10 deleted nodes. The upshot is a list that steadily shrinks, instead of immediately reflecting the correct state, amortizing the costs over multiple operations. This has a stronger impact than it might first appear, switching aggregate behaviour when competition gets too fierce: once we exceed 10 items, mutators start editing different nodes, doing so without competition, ensuring we tend rapidly back towards our ideal list. In reality these modifications occur in such a short space of time that only a small fraction fail to apply cleanly, so this only bounds worst case behaviour.
In summation, why do I care?
The list supports wait-free removal in guaranteed constant time, and as the list grows in size (relative to mutator count) its behaviour tends steadily towards that of a single-threaded structure. This is pretty unique to my knowledge. It can also be extended in a number of interesting ways, to support lock-free insertion to arbitrary positions in the list, or wait-free append if we do not care about traversal order. But this was no doubt exhausting to read, so I’ll save that for another time. In the meantime, take a look at the code here and here.
1 Atomicity is a spectrum, and here I’m really referring to read-modify-write atomicity, not the requirement that any single modification is visible atomically
2 Insertion at this point either needs mutual exclusion or a normal lock-free CAS-based append, and is out-of-scope for this article. By having one list per <em>inserting</em> thread we can support wait-free insertion and wait-free removal, but lose in-order traversal, and this will be the topic of a follow up article.