
Much higher performance with these changes #9

Open · wants to merge 2 commits into master
Conversation

buybackoff

See alexandrnikitin#8. This is helpful to see the diff.
@@ -4,88 +4,113 @@

namespace MPMCQueue.NET
{
[StructLayout(LayoutKind.Explicit, Size = 192, CharSet = CharSet.Ansi)]
[StructLayout(LayoutKind.Explicit, Size = 336)]
Author:
typo, should be 384

buffer[index].Element = item;
Volatile.Write(ref buffer[index].Sequence, pos + 1);
cell.Element = item;
cell.Sequence = pos + 1;

Volatile.Write is now almost identical to direct reads/writes to volatile fields.

On line 62 cell.Sequence is now a volatile read, which is fairly slow on ARM.
I don't know if RyuJIT has changed its implementation to use ARMv8 half-barriers, but they are not available on ARMv7 anyway.

Author:

I am still only planning to buy a Raspberry Pi for testing :) Yes, good point. It's next to an Interlocked call anyway. I changed that mostly from an aesthetics perspective.
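
A minimal sketch (hypothetical Cell layouts, not the repo's exact code) contrasting the two forms discussed in this thread: an explicit Volatile.Write to a plain field versus an ordinary assignment to a field declared volatile.

using System.Threading;

// Old form: a plain int Sequence, ordered explicitly with Volatile.Write.
public struct CellExplicit
{
    public object Element;
    public int Sequence;
}

// New form: Sequence is declared volatile, so "cell.Sequence = pos + 1" is
// already a volatile (release) write, and plain reads of it are volatile reads.
public struct CellVolatile
{
    public object Element;
    public volatile int Sequence;
}

public static class VolatileSketch
{
    public static void WriteOld(CellExplicit[] buffer, int index, int pos, object item)
    {
        buffer[index].Element = item;
        Volatile.Write(ref buffer[index].Sequence, pos + 1);
    }

    public static void WriteNew(CellVolatile[] buffer, int index, int pos, object item)
    {
        ref var cell = ref buffer[index];
        cell.Element = item;
        cell.Sequence = pos + 1; // volatile semantics come from the field declaration
    }
}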

_enqueuePos = 0;
_dequeuePos = 0;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
@omariom omariom Mar 6, 2020
This is a significant change by itself.

Is the perf the same without AI?

Author:

Without AI it is slightly slower, of course. But the main point of the changes is multi-threaded scalability: spinning and two references to the same buffer. Not sure about the buffer bounds check with & buffer.Length - 1 - does it work now? I am still too lazy to learn assembly.


The check is still there, but the JIT reuses the load of Length for both the check and buffer.Length - 1.

@buybackoff buybackoff (Author) Mar 6, 2020

Is there any trick to avoid the bounds check here? To leverage the pow2 size somehow. I searched through coreclr issues but couldn't find any. Also, do you know if there is a single place listing all current optimizations for BC elimination?

@omariom omariom Mar 6, 2020

I don't think it can avoid the bounds check.
With int index = pos & (buffer.Length - 1), pos can be anything and buffer.Length can be zero.
The bounds check is always predictable and is not the only branch here, so it's unlikely to change much.
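
A minimal sketch (hypothetical helper, using object[] instead of the queue's Cell[]) of the indexing pattern discussed above and of why the bounds check remains:

public static class IndexSketch
{
    // When buffer.Length is a power of two, the mask keeps the index in range,
    // but the JIT cannot prove that here (as far as it knows, Length could be 0
    // and pos can be anything), so the array bounds check stays; the JIT merely
    // reuses the single load of Length for both the mask and the range check.
    public static object Load(object[] buffer, int pos)
    {
        int index = pos & (buffer.Length - 1);
        return buffer[index];
    }
}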

Owner:

I wouldn't put AggressiveInlining on public methods as we don't control the callers' code here. That could have some side effects: the caller's code gets bigger and won't be inlined itself, other methods in the caller's code won't be inlined, some optimizations in the caller can be missed, etc.
Yes, it's obviously faster in benchmarks, but I wouldn't add it anyway.

Author:

This is my bad habit in general. But for low-level building blocks that are intended to be used later it's often better to have it: you can always wrap the method in another one if you don't want inlining or prefer to let the JIT decide. Usually the caller decides this, not the callee (relevant discussion dotnet/corefxlab#2592 (comment)). Here I never compared perf to the original implementation in absolute numbers, only the multithreaded scalability of the same method, which doesn't depend on the attribute.
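
A minimal sketch (hypothetical QueueWrapper, assuming the original TryEnqueue(object) API) of the "wrap it in another method if you don't want inlining" approach mentioned above:

public class QueueWrapper
{
    private readonly MPMCQueue.NET.MPMCQueue _inner = new MPMCQueue.NET.MPMCQueue(1024);

    // No MethodImpl attribute here: even though the queue's own method asks for
    // aggressive inlining, the JIT decides on its own whether this wrapper gets
    // inlined into its callers.
    public bool TryEnqueue(object item) => _inner.TryEnqueue(item);
}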

omariom commented Mar 6, 2020

The code is already heavily optimized, so it would be interesting to compare the generated code:

Master (I added AggressiveInlining)

This PR

The diff

public bool TryDequeue(out object result)
{
result = null;

result is a ref; each access dereferences it.
That means you write to it twice on success.

Author:

Left over from my API changes, where I return object? with a local object result = null;

omariom commented Mar 6, 2020

@buybackoff
After all that's happened, you, as a decent ~~gentleman~~ perfman, must ~~marry her~~ provide proper benchmarks 😄

buybackoff (Author) commented Mar 6, 2020

@omariom See the link for numbers. Sorry, but I won't run BDN for half an hour.

The key point of these changes is multithreaded scalability. On 12 cores, adding spinning increases perf from 8 to 27 MOPS; adding the 2 buffer fields - from 27 to 38 MOPS. It's more about thread interaction than about optimizing instructions.

I did a quick count of spinner.NextSpinWillYield and it's quite rare. But given that a thread could be descheduled after the Interlocked call but before updating cell.Sequence, other threads would burn CPU without spinning. Since it is MPMC, more threads than cores is probably one of the intended use cases. There is a decent SPSC already, and Disruptor for MPSC. For low latency with a fixed number of threads we should add if(spinner.NextSpinWillYield) spinner.Reset() or just use Thread.SpinWait().

With Thread.SpinWait(10); the 4-thread performance is c.16 MOPS on the same bench, or 2x more than without spinning. With SpinWait(20) - 24 MOPS, with SpinWait(50) - 32 MOPS, with SpinWait(100) - 35 MOPS.
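
A minimal sketch (hypothetical helper, assuming the original TryDequeue(out object) API) of the spin-with-reset strategy described above:

using System.Threading;

public static class SpinDequeue
{
    // Spin without ever yielding: reset the SpinWait just before it would yield,
    // trading CPU time for lower latency with a fixed number of threads.
    public static object DequeueSpinning(MPMCQueue.NET.MPMCQueue queue)
    {
        var spinner = new SpinWait();
        object item;
        while (!queue.TryDequeue(out item))
        {
            if (spinner.NextSpinWillYield)
                spinner.Reset();
            spinner.SpinOnce();
        }
        return item;
    }
}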

buybackoff (Author):
Actually, latency analysis is more interesting than throughput. It feels like even though yields are rare, they contribute a lot to the throughput increase.

The SpinWait numbers in the last paragraph were for 4 threads. For 12 threads the pattern is similar, but slower.

@alexandrnikitin alexandrnikitin (Owner) left a comment

Thanks a lot for the improvements. The results are impressive. That's a great job indeed.

Let's move the discussion here since @omariom has already started it.

> Do spin with SpinWait, especially given that this queue is not lock-free per the author's description.
>
> I did a quick count of spinner.NextSpinWillYield and it's quite rare.

It's interesting that SpinWait gives such a great boost ("from 8 to 27 MOPS"). I'm wondering why. I would expect a fair number of yields then, which would explain it. Thread descheduling could probably explain it, and I'm really interested in metrics.

> Do not use _bufferMask, but use & _buffer.Length - 1. This probably removes bounds checks (already or likely in the future)

I don't think that pos & (buffer.Length - 1) removes bounds checks, as that is a complex optimization that crosses the boundaries of the method it's in. The compiler would need to know about the constructor: that the length is greater than 0 and a power of two. It's unlikely that the JIT will ever have that, as it's a rare case.
The op of reading the _bufferMask and &ing it with the pos (var index = pos & _bufferMask;) is essentially free.

> Use two fields for the buffer, which point to the same buffer. Enqueue and Dequeue load their own field and prefetch their position value. Most importantly, this avoids false sharing while accessing the buffer pointer.

That's an interesting optimization. I'm curious why it helps, because the two fields _buffer and _bufferMask are padded and readonly: we only read them and never change them. Their cache lines shouldn't be affected by _enqueuePos and _dequeuePos changes.

> Use var ref for Cell.

I don't think it gives anything, as the code already reads and writes the value directly. Anyway, I would compare the assembly code (I would need more time for that).

Unfortunately, I switched jobs and don't have access to the real-world data and scenarios I used this queue for, so I can't really compare beyond synthetic benchmarks. In general, the bottleneck of the algorithm is Interlocked.CompareExchange, which doesn't scale well with the number of CPUs.

Regardless of what I said, you have made awesome improvements. I'm glad that there are use cases for this code. Thank you for doing it and for raising the discussion.
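
A simplified sketch (padding and the dequeue path omitted; not the repo's exact code) of the enqueue path, showing the Interlocked.CompareExchange on the shared position identified above as the scaling bottleneck:

using System.Threading;

public class MpmcSketch
{
    private struct Cell
    {
        public int Sequence;
        public object Element;
    }

    private readonly Cell[] _buffer;
    private readonly int _bufferMask;
    private int _enqueuePos;

    public MpmcSketch(int bufferSize) // bufferSize must be a power of two
    {
        _buffer = new Cell[bufferSize];
        _bufferMask = bufferSize - 1;
        for (var i = 0; i < bufferSize; i++) _buffer[i].Sequence = i;
    }

    public bool TryEnqueue(object item)
    {
        do
        {
            var pos = Volatile.Read(ref _enqueuePos);
            var index = pos & _bufferMask;
            var seq = Volatile.Read(ref _buffer[index].Sequence);
            // The CAS below is the contended step: every producer races on the
            // same _enqueuePos cache line, so throughput flattens as cores grow.
            if (seq == pos &&
                Interlocked.CompareExchange(ref _enqueuePos, pos + 1, pos) == pos)
            {
                _buffer[index].Element = item;
                Volatile.Write(ref _buffer[index].Sequence, pos + 1);
                return true;
            }
            if (seq < pos) return false; // the queue is full
        } while (true);
    }
}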


buybackoff (Author):

> The op of reading the _bufferMask and &ing it with the pos (var index = pos & _bufferMask;) is essentially free.

I should try it again. I just didn't like the presence of a field that could be removed and hoped that the JIT is smart enough.

> That's an interesting optimization. I'm curious why it helps, because the two fields _buffer and _bufferMask are padded and readonly: we only read them and never change them.

Maybe it prefetches the next load of position.

> Their cache lines shouldn't be affected by _enqueuePos and _dequeuePos changes.

Yes, that was a wrong claim about false sharing. I tested after adding the 2 buffer fields and then using just one of them, without adjusting the offsets. So 27 MOPS is from when I had introduced false sharing myself. But if a single buffer field is correctly padded, then just spinning (without reset) gives 36.2 MOPS, and using 2 fields increases the number to 37.4.

I have added an additional benchmark that is more typical for queues (vs. the pool-like one I had before). The code is below.

Also, I added if(spinner.NextSpinWillYield) spinner.Reset(), since yielding is just batching, which has its own use case but makes the comparison less interesting.

Numbers are for 4 threads.

  • With spinning+reset and 2 buffer fields: 13.8 MOPS
  • No spinning and 2 buffer fields: 12.3 MOPS
  • With spinning+reset and single buffer field: 13 MOPS
  • No spinning and single buffer field: 12.3 MOPS

Interestingly, the effect of 2 buffer fields is more visible with spinning.

So far I have ended up not using MPMCQueue and will now stop doing additional benchmarking. I just wanted to share an interesting result found while playing with different pool implementations. Thanks for the discussion and feedback!

[MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.AggressiveOptimization)]
internal void MPMCBenchmark()
{
    var count = 2_000_000;
    var threads = 4;
    var oddQueue = new MPMCQueue(128);
    var evenQueue = new MPMCQueue(128);
    using (Benchmark.Run("MPMCQueue", count * threads)) // measures time, calculates MOPS
    {
        Task.WaitAll(Enumerable.Range(0, threads).Select(i =>
        {
            // pre-fill both queues so that consumers have items to dequeue
            for (int j = 0; j < 4; j++)
            {
                oddQueue.Enqueue(new object());
                evenQueue.Enqueue(new object());
            }

            if (i % 2 == 0)
            {
                return Task.Factory.StartNew(() =>
                {
                    // loop variable renamed to k: redeclaring i here would
                    // conflict with the enclosing lambda parameter i
                    for (int k = 0; k < count; k++)
                    {
                        object x1;
                        do
                        {
                            x1 = evenQueue.Dequeue();
                        } while (x1 == null);

                        while (!oddQueue.Enqueue(x1)) { }
                    }
                }, TaskCreationOptions.LongRunning);
            }
            else
            {
                return Task.Factory.StartNew(() =>
                {
                    for (int k = 0; k < count; k++)
                    {
                        object x1;
                        do
                        {
                            x1 = oddQueue.Dequeue();
                        } while (x1 == null);

                        while (!evenQueue.Enqueue(x1)) { }
                    }
                }, TaskCreationOptions.LongRunning);
            }
        }).ToArray());
    }
}

private readonly Cell[] _dequeueBuffer;

[FieldOffset(SAFE_CACHE_LINE * 2 + 8)]
private volatile int _dequeuePos;

Hello :)
I wonder if the observed performance improvement really comes from having two copies of the same read-only variable, or rather from the fact that the frequently read fields are no longer in the same cache line as the frequently modified fields.
I mean: I can see how constantly writing to a cache line can thrash it and cause false-sharing issues, but I fail to see how having multiple readers is a problem - the cache line should be in the (S)hared state in the MESI protocol, right?
I see that in the old code the read-only _bufferMask had offset 8 and the write-intensive _enqueuePos had offset 64, but does that really mean they are guaranteed to be in different cache lines?
What if the MPMCQueue instance itself is not aligned at a cache-line boundary?
Could it then be that _bufferMask is at the start of a cache line and _enqueuePos at its end?
And if Facebook's observation that we should care about 128-byte rather than 64-byte units is correct, then the same applies to _buffer and _enqueuePos (and other pairs of "read-only" and "write-heavy" fields, as many of them are closer than 128 bytes to each other).
In the PR description you give one more probable reason why the duplication helps: the data needed to enqueue is in one line and the data needed to dequeue is in another, which creates nice "packages" of self-sufficient data. I can see how that can help: it requires one fetch instead of two when the data was missing from the cache. But how frequent is a cache miss in the tight loop of a benchmark?
I would appreciate it if someone could educate me on the most probable explanation (ideally backed by an experiment) of why this really helps :)

Author:

> Could it then be that _bufferMask is at the start of a cache line and _enqueuePos at its end?

With probability 8/64, yes. .NET aligns objects to 8 bytes, not to a cache line. And yes, if anything on a cache line is modified, all other threads will need to reload that line, so modifying _enqueuePos could interfere with Dequeue reading _bufferMask with 8/64 probability.

buybackoff (Author) commented Mar 12, 2020

Looks like placing the mask in a separate cache line, in the same way as the two buffers, helps. I ran my own simple benchmark; the improvement is c.3-4%, with a 2.5% probability that there is no difference (2-tailed t-test on the measurements). The theory is that we can do 3 loads from the same cache line and avoid touching the cache line that stored the mask - even though it's never modified, it still needs to be loaded.

        [FieldOffset(Settings.SAFE_CACHE_LINE)]
        protected readonly Cell[] _enqueueBuffer;

        [FieldOffset(Settings.SAFE_CACHE_LINE + 8)]
        private int _enqueuePos;

        [FieldOffset(Settings.SAFE_CACHE_LINE + 16)]
        protected readonly int _enqueueMask;

        [FieldOffset(Settings.SAFE_CACHE_LINE * 2)]
        private readonly Cell[] _dequeueBuffer;
        
        [FieldOffset(Settings.SAFE_CACHE_LINE * 2 + 8)]
        private int _dequeuePos;

        [FieldOffset(Settings.SAFE_CACHE_LINE * 2 + 16)]
        protected readonly int _dequeueMask;

I probably need to run a proper BDN benchmark to finalize the conclusion. Is it now possible to get a t-test from BDN?

How would you benchmark code like this, which depends on thread interactions more than on instruction count and is hence inherently unstable? Even without spinning, with turbo boost off and minimal background activity, the numbers are quite noisy here. In contrast, absolute results for instruction-bound benchmarks are pretty stable on the same setup.
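
A minimal sketch (not tied to BDN or any particular harness) of one way to compute the t-statistic for two sets of MOPS measurements, as in the 2-tailed t-test mentioned above (Welch's variant, assumed here):

using System;
using System.Linq;

public static class TTest
{
    // Welch's t-statistic for two independent samples of benchmark results.
    // Compare it against the t-distribution (Welch-Satterthwaite degrees of
    // freedom) to get the 2-tailed probability that there is no difference.
    public static double WelchT(double[] a, double[] b)
    {
        double meanA = a.Average();
        double meanB = b.Average();
        double varA = a.Sum(v => (v - meanA) * (v - meanA)) / (a.Length - 1);
        double varB = b.Sum(v => (v - meanB) * (v - meanB)) / (b.Length - 1);
        return (meanA - meanB) / Math.Sqrt(varA / a.Length + varB / b.Length);
    }
}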
