Understanding the LMAX Disruptor

For performance reasons, most modern CPUs employ optimizations that may result in out-of-order execution.

Let’s go back to our first example. In this example, it does not matter when the counter is updated, as no operation within the loop uses it.
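The original listing is not reproduced here; the following is a minimal sketch of the idea, with assumed names (sharedCounter, MAX):

```java
public class LoopExample {
    private static final int MAX = 1_000_000;

    public static void main(String[] args) {
        long sharedCounter = 0;
        // Nothing inside the loop reads sharedCounter, so the CPU (and the JIT)
        // is free to defer or reorder the individual updates.
        for (int i = 0; i < MAX; i++) {
            sharedCounter++;
        }
        System.out.println("counter=" + sharedCounter);
    }
}
```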

The CPU is free to reorder instructions to optimize execution performance.

This reordering is not a problem in the case of single-threaded execution, but it can become unpredictable in the context of concurrent execution.

This is the goal of memory barriers:

- Ensuring that all instructions on either side of the barrier appear in the correct order when observed from another CPU core.
- Making the memory visible by propagating data to the underlying caches.

There are different types of memory barriers:

- Read: gives a consistent view of the world for write operations ordered before the barrier.
- Write: gives an ordered view to the world of the store operations before the barrier.
- Full: a composition of the read and the write barriers.

In Java, using a volatile field inserts a write barrier instruction after you write to it, and a read barrier instruction before you read from it.

Meanwhile, a final field of a class is made visible using a write barrier once the constructor completes.
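For illustration (this snippet is not from the original post), both effects can be seen on a single class:

```java
public class Message {
    private final long id;          // made visible via a write barrier
                                    // once the constructor completes
    private volatile boolean ready; // write barrier after writes,
                                    // read barrier before reads

    public Message(long id) {
        this.id = id;
    }

    public void markReady() {
        ready = true;               // write barrier inserted after this store
    }

    public boolean isReady() {
        return ready;               // read barrier inserted before this load
    }

    public long getId() {
        return id;
    }
}
```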

It is also possible to access such instructions via the sun.misc.Unsafe class.

Let’s modify our multi-threaded implementation by using a read memory barrier before accessing sharedCounter and a write memory barrier after updating it. Then, let’s run this implementation again with MAX equal to 1 million.
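The original listing is not reproduced here; the following is a minimal sketch of what it could look like, assuming the fences come from sun.misc.Unsafe (the class name and thread setup are illustrative):

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class CounterWithFences {
    private static final Unsafe UNSAFE = getUnsafe();
    private static final int MAX = 1_000_000;
    private static long sharedCounter = 0;

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < MAX / 2; i++) {
                UNSAFE.loadFence();   // read barrier before accessing sharedCounter
                sharedCounter++;      // still a non-atomic read-modify-write
                UNSAFE.storeFence();  // write barrier after the update
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println("counter=" + sharedCounter);
    }

    // sun.misc.Unsafe is not meant for application code; reflection is
    // required to obtain the singleton instance.
    private static Unsafe getUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }
}
```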

Bear in mind, we still expect to print counter=1000000 at the end of the execution:

– First multi-threaded implementation, without a memory barrier:

counter=522388
counter=733903
counter=532331

– New implementation, with memory barriers:

counter=999890
counter=999868
counter=999770

As you can see, there is an undeniable impact here, as we are getting closer to the expected result due to the memory barriers.

Yet, the execution is still not deterministic: a memory barrier is not enough to prevent race conditions with non-atomic operations.

Traditional Queues

One alternative to sharing memory between threads is the message-passing paradigm: sharing memory by communicating.

It means we need something in between threads to handle communications.

One of the solutions is to use traditional queues like LinkedBlockingQueue or ArrayBlockingQueue in Java.

Yet, this does not solve our concurrency problems, as even a queue must ensure mutual exclusion and visibility of changes.

If we take a look at the put method of ArrayBlockingQueue, we can verify that both aspects are handled: access to the queue is guarded by a lock, and once an element has been added, a signal is sent to a waiting thread.
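For reference, here is the method, lightly abridged from OpenJDK:

```java
public void put(E e) throws InterruptedException {
    Objects.requireNonNull(e);
    final ReentrantLock lock = this.lock;
    lock.lockInterruptibly();        // mutual exclusion on every access
    try {
        while (count == items.length)
            notFull.await();         // block while the queue is full
        enqueue(e);                  // inserts the element and signals notEmpty
    } finally {
        lock.unlock();
    }
}
```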

Interestingly, the LMAX team noticed that, typically, queues tend to always be either close to full or close to empty, due to the difference in pace between consumers and producers.

This observation results in a high level of contention and/or expensive cache management.

If the queue is close to full, it will result in contention between the producers (leading to context switches and perhaps losing cached data and instructions).

Moreover, in a traditional queuing system, the producers claim the head of the queue while the consumers claim the tail.

If the queue is close to empty, it is very likely that the head, the tail, and the queue size will all belong to the same cache line, which may lead to the false sharing problem described above.

Disruptor

The creators of the LMAX Disruptor are also famous for having invented the concept of Mechanical Sympathy.

In a nutshell: understanding the underlying hardware should make you a better developer when it comes to designing algorithms, data structures, etc.

Based on this philosophy, the team has been able to produce this great library:

“The Disruptor has significantly less write contention, a lower concurrency overhead and is more cache friendly than comparable approaches, all of which results in greater throughput with less jitter at lower latency.”

Source: Disruptor technical paper

Let’s try to analyze the reasons.

First, the Disruptor is based on a ring buffer structure (also called a circular buffer), which is simply a single, fixed-size buffer, as if it were connected end-to-end:

[Figure: ring buffer structure]

On startup, the memory is allocated according to the provided size (which must be a power of 2), together with a factory to initialize events. Here, we allocate ringBufferSize instances of the MyEvent class on startup (see the sketch below).
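The original listing is not reproduced; below is a minimal sketch using the Disruptor DSL, where the MyEvent fields, the buffer size, and the chosen wait strategy are assumptions:

```java
import com.lmax.disruptor.BlockingWaitStrategy;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.dsl.ProducerType;
import java.util.concurrent.Executors;

public class DisruptorSetup {
    // Simple mutable carrier, pre-allocated once per slot and reused.
    public static class MyEvent {
        long value;
    }

    public static Disruptor<MyEvent> build() {
        int ringBufferSize = 1024; // must be a power of 2
        return new Disruptor<>(
                MyEvent::new,                      // event factory
                ringBufferSize,
                Executors.defaultThreadFactory(),  // threads for event processors
                ProducerType.SINGLE,               // single vs. multiple producers
                new BlockingWaitStrategy());       // how to handle slow subscribers
    }
}
```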

Meanwhile, we provided a factory to create threads for event processors and a wait strategy defining how to handle slow subscribers (those strategies are described here).

We’ll discuss the producer type a bit later.

The event instances will be reused and live for the duration of the Disruptor instance, eliminating garbage collection issues; in traditional queues, events may survive longer than they should.

Internally, the ring buffer is backed by an array of objects.

There is a good reason for that according to the creators.

At the hardware level, an array has a predictable pattern of access.

The entries can be pre-loaded, so the processor is not constantly going back to the main memory to load the next item in the ring.

This way, we can iterate way more efficiently than with a queue backed by a LinkedList for example.

Configuring a consumer is done by providing a lambda expression to handleEventsWith().
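A minimal sketch, reusing the MyEvent class assumed above (the handler body is illustrative):

```java
// The handler receives every event published to the ring buffer.
disruptor.handleEventsWith(
        (event, sequence, endOfBatch) ->
                System.out.println("sequence=" + sequence + ", value=" + event.value));
```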

This lambda has three inputs:

- The event itself
- A sequence identifier
- A boolean for batch management

In the case of multiple consumers, they are all going to receive every single event.

If we want to distribute the load, we can implement a sharding strategy based on the sequence identifier.

Something like this: each consumer only handles the events of its own shard (first sketch below). Then we can start the Disruptor, which returns a RingBuffer instance (second sketch). Once the Disruptor instance is started, we can publish an event as in the third sketch. The sequence identifier is the position of the event in the ring buffer structure.
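The following sketches illustrate these steps; the modulo-based sharding and the event field are assumptions, while handleEventsWith(), start(), next(), get(), and publish() are part of the Disruptor API. First, a hypothetical sharding between two consumers:

```java
// Both handlers receive every event; each one only processes
// the events belonging to its own shard.
disruptor.handleEventsWith(
        (event, sequence, endOfBatch) -> {
            if (sequence % 2 == 0) {
                // process the events of shard 0
            }
        },
        (event, sequence, endOfBatch) -> {
            if (sequence % 2 == 1) {
                // process the events of shard 1
            }
        });
```

Then, starting the Disruptor:

```java
RingBuffer<MyEvent> ringBuffer = disruptor.start();
```

Finally, publishing an event, following the claim/write/publish pattern from the Disruptor documentation (the value assigned is illustrative):

```java
long sequence = ringBuffer.next();            // claim the next slot
try {
    MyEvent event = ringBuffer.get(sequence); // fetch the pre-allocated event
    event.value = 42L;                        // fill it with data
} finally {
    ringBuffer.publish(sequence);             // make it visible to the consumers
}
```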

When we created a Disruptor instance, we also passed a producerType variable, which can be equal to either ProducerType.SINGLE or ProducerType.MULTI.

This indicates to the Disruptor whether we will have single or multiple producers.

In the case of a single producer, ringBuffer.next() is completely lock-free.

On the other hand, if we have multiple producers, this function relies on a CAS operation to provide the next available sequence identifier in the ring buffer.

The sequence identifier itself is managed by adding left and right padding to make sure it never shares a cache line with anything else (see the sketch below). Furthermore, publishing an event creates a memory barrier to make sure the caches are up to date with this event.
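Here is how that padding looks, abridged from the library’s Sequence class:

```java
// Seven longs (56 bytes) on each side keep the hot volatile value
// alone on its cache line.
class LhsPadding {
    protected long p1, p2, p3, p4, p5, p6, p7;
}

class Value extends LhsPadding {
    protected volatile long value;
}

class RhsPadding extends Value {
    protected long p9, p10, p11, p12, p13, p14, p15;
}

public class Sequence extends RhsPadding {
    // get/set/compareAndSet/incrementAndGet operations on 'value' ...
}
```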

Publishing this way allows adding an event to the ring buffer structure without any locking, which gives us a huge performance improvement.

One last point to mention: we said that the ring buffer was simply backed by an array.

It means it is up to the developer to make sure there is no false sharing between the events produced.

For example, in the case of a producer publishing multiple events simultaneously, if we configured two consumers, we need to make sure these events do not belong to the same cache line.
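One hypothetical way to achieve this (this snippet does not come from the original post) is to pad the event itself, in the same spirit as the Sequence class above:

```java
// The payload lives in the base class; the subclass appends seven longs
// (56 bytes) of padding so that two adjacent events are unlikely to share
// a 64-byte cache line. Inheritance is used because the JVM is free to
// reorder fields within a single class.
class EventPayload {
    long value;
}

public class PaddedEvent extends EventPayload {
    long p1, p2, p3, p4, p5, p6, p7;
}
```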

Let’s conclude this post by sharing performance results produced by the LMAX team when they compared the Disruptor with ArrayBlockingQueue.

Throughput performance testing in ops/sec (P stands for producer and C for consumer):

[Table: throughput results from the LMAX technical paper, not reproduced here]

Latency performance testing in ns:

[Table: latency results from the LMAX technical paper, not reproduced here]

Any suggestions or improvements related to this post are more than welcome!

Further Reading

- LMAX-Exchange/disruptor: High Performance Inter-Thread Messaging Library (github.com)
- teivah/disruptor-demo: demo code accompanying this post (github.com)
- Disruptor technical paper, by the LMAX team
- Dissecting the Disruptor: why it’s so fast (part one), Locks are bad, by Trisha Gee
- Dissecting the Disruptor: why it’s so fast (part two), Magic cache line padding, by Trisha Gee
- Dissecting the Disruptor: Demystifying memory barriers, by Trisha Gee
- Memory barriers/fences, by Martin Thompson
- The LMAX architecture, by Martin Fowler
- A port of the LMAX Disruptor to the Go language, by SmartyStreets
