
Much higher performance with these changes #9

Open · wants to merge 2 commits into master
Conversation

buybackoff

See alexandrnikitin#8. This is helpful to see the diff.
@@ -4,88 +4,113 @@

namespace MPMCQueue.NET
{
[StructLayout(LayoutKind.Explicit, Size = 192, CharSet = CharSet.Ansi)]
[StructLayout(LayoutKind.Explicit, Size = 336)]
Author:
typo, should be 384

buffer[index].Element = item;
Volatile.Write(ref buffer[index].Sequence, pos + 1);
cell.Element = item;
cell.Sequence = pos + 1;

Volatile.Write is now almost identical to direct reads/writes to volatile fields.

On line 62 cell.Sequence is now a volatile read, which is fairly slow on ARM.
I don't know if RyuJIT has changed its implementation to use ARMv8 half-barriers, but they are not available on ARMv7 anyway.

Author:

I am still only planning to buy a Raspberry Pi for testing :) Yes, good point. It's next to an Interlocked call anyway. I changed that mostly from an aesthetics perspective.
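
A minimal sketch (hypothetical Cell layouts, not the repo's exact code) contrasting the two forms discussed in this thread: an explicit Volatile.Write to a plain field versus an ordinary assignment to a field declared volatile.

using System.Threading;

// Old form: a plain int Sequence, ordered explicitly with Volatile.Write.
public struct CellExplicit
{
    public object Element;
    public int Sequence;
}

// New form: Sequence is declared volatile, so "cell.Sequence = pos + 1" is
// already a volatile (release) write, and plain reads of it are volatile reads.
public struct CellVolatile
{
    public object Element;
    public volatile int Sequence;
}

public static class VolatileSketch
{
    public static void WriteOld(CellExplicit[] buffer, int index, int pos, object item)
    {
        buffer[index].Element = item;
        Volatile.Write(ref buffer[index].Sequence, pos + 1);
    }

    public static void WriteNew(CellVolatile[] buffer, int index, int pos, object item)
    {
        ref var cell = ref buffer[index];
        cell.Element = item;
        cell.Sequence = pos + 1; // volatile semantics come from the field declaration
    }
}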

_enqueuePos = 0;
_dequeuePos = 0;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
@omariom omariom Mar 6, 2020
This is a significant change by itself.

Is the perf the same without AI?

Author:

Without AI it is slightly slower, of course. But the main point of the changes is multi-threaded scalability: spinning and two references to the same buffer. Not sure about the buffer bounds check with & buffer.Length - 1 - does it work now? I am still too lazy to learn assembly.


The check is still there, but the JIT reuses the load of Length for both the check and buffer.Length - 1.

@buybackoff buybackoff (Author) Mar 6, 2020

Is there any trick to avoid the bounds check here? To leverage the pow2 size somehow. I searched through coreclr issues but couldn't find any. Also, do you know if there is a single place listing all current optimizations for BC elimination?

@omariom omariom Mar 6, 2020

I don't think it can avoid the bounds check.
With int index = pos & (buffer.Length - 1), pos can be anything and buffer.Length can be zero.
The bounds check is always predictable and is not the only branch here, so it's unlikely to change much.
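
A minimal sketch (hypothetical helper, using object[] instead of the queue's Cell[]) of the indexing pattern discussed above and of why the bounds check remains:

public static class IndexSketch
{
    // When buffer.Length is a power of two, the mask keeps the index in range,
    // but the JIT cannot prove that here (as far as it knows, Length could be 0
    // and pos can be anything), so the array bounds check stays; the JIT merely
    // reuses the single load of Length for both the mask and the range check.
    public static object Load(object[] buffer, int pos)
    {
        int index = pos & (buffer.Length - 1);
        return buffer[index];
    }
}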

Owner:

I wouldn't put AggressiveInlining on public methods as we don't control the callers' code here. That could have some side effects: the caller's code gets bigger and won't be inlined itself, other methods in the caller's code won't be inlined, some optimizations in the caller can be missed, etc.
Yes, it's obviously faster in benchmarks, but I wouldn't add it anyway.

Author:

This is my bad habit in general. But for low-level building blocks that are intended to be used later it's often better to have it: you can always wrap the method in another one if you don't want inlining or prefer to let the JIT decide. Usually the caller decides this, not the callee (relevant discussion dotnet/corefxlab#2592 (comment)). Here I never compared perf to the original implementation in absolute numbers, only the multithreaded scalability of the same method, which doesn't depend on the attribute.
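
A minimal sketch (hypothetical QueueWrapper, assuming the original TryEnqueue(object) API) of the "wrap it in another method if you don't want inlining" approach mentioned above:

public class QueueWrapper
{
    private readonly MPMCQueue.NET.MPMCQueue _inner = new MPMCQueue.NET.MPMCQueue(1024);

    // No MethodImpl attribute here: even though the queue's own method asks for
    // aggressive inlining, the JIT decides on its own whether this wrapper gets
    // inlined into its callers.
    public bool TryEnqueue(object item) => _inner.TryEnqueue(item);
}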

omariom commented Mar 6, 2020

The code is already heavily optimized, so it would be interesting to compare the generated code:

Master (I added AggressiveInlining)

This PR

The diff

public bool TryDequeue(out object result)
{
result = null;

result is a ref; each access dereferences it.
That means you write to it twice on success.

Author:

Left over from my API changes, where I return object? with a local object result = null;

omariom commented Mar 6, 2020

@buybackoff
After all that's happened, you, as a decent ~~gentleman~~ perfman, must ~~marry her~~ provide proper benchmarks 😄

buybackoff (Author) commented Mar 6, 2020

@omariom See the link for numbers. Sorry, but I won't run BDN for half an hour.

The key point of these changes is multithreaded scalability. On 12 cores, adding spinning increases perf from 8 to 27 MOPS; adding the 2 buffer fields - from 27 to 38 MOPS. It's more about thread interaction than about optimizing instructions.

I did a quick count of spinner.NextSpinWillYield and it's quite rare. But given that a thread could be descheduled after the Interlocked call but before updating cell.Sequence, other threads would burn CPU without spinning. Since it is MPMC, more threads than cores is probably one of the intended use cases. There is a decent SPSC already, and Disruptor for MPSC. For low latency with a fixed number of threads we should add if(spinner.NextSpinWillYield) spinner.Reset() or just use Thread.SpinWait().

With Thread.SpinWait(10); the 4-thread performance is c.16 MOPS on the same bench, or 2x more than without spinning. With SpinWait(20) - 24 MOPS, with SpinWait(50) - 32 MOPS, with SpinWait(100) - 35 MOPS.
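
A minimal sketch (hypothetical helper, assuming the original TryDequeue(out object) API) of the spin-with-reset strategy described above:

using System.Threading;

public static class SpinDequeue
{
    // Spin without ever yielding: reset the SpinWait just before it would yield,
    // trading CPU time for lower latency with a fixed number of threads.
    public static object DequeueSpinning(MPMCQueue.NET.MPMCQueue queue)
    {
        var spinner = new SpinWait();
        object item;
        while (!queue.TryDequeue(out item))
        {
            if (spinner.NextSpinWillYield)
                spinner.Reset();
            spinner.SpinOnce();
        }
        return item;
    }
}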

buybackoff (Author):
Actually, latency analysis is more interesting than throughput. It feels like even though yields are rare, they contribute a lot to the throughput increase.

The SpinWait numbers in the last paragraph were for 4 threads. For 12 threads the pattern is similar, but slower.

@alexandrnikitin alexandrnikitin (Owner) left a comment

Thanks a lot for the improvements. The results are impressive. That's a great job indeed.

Let's move the discussion here since @omariom has already started it.

> Do spin with SpinWait, especially given that this queue is not lock-free per the author's description.
>
> I did a quick count of spinner.NextSpinWillYield and it's quite rare.

It's interesting that SpinWait gives such a great boost ("from 8 to 27 MOPS"). I'm wondering why. I would expect a fair number of yields then, which would explain it. Thread descheduling could probably explain it, and I'm really interested in metrics.

> Do not use _bufferMask, but use & _buffer.Length - 1. This probably removes bounds checks (already or likely in the future)

I don't think that pos & (buffer.Length - 1) removes bounds checks, as that is a complex optimization that crosses the boundaries of the method it's in. The compiler would need to know about the constructor: that the length is greater than 0 and a power of two. It's unlikely that the JIT will ever have that, as it's a rare case.
The op of reading the _bufferMask and &ing it with the pos (var index = pos & _bufferMask;) is essentially free.

> Use two fields for the buffer, which point to the same buffer. Enqueue and Dequeue load their own field and prefetch their position value. Most importantly, this avoids false sharing while accessing the buffer pointer.

That's an interesting optimization. I'm curious why it helps, because the two fields _buffer and _bufferMask are padded and readonly: we only read them and never change them. Their cache lines shouldn't be affected by _enqueuePos and _dequeuePos changes.

> Use var ref for Cell.

I don't think it gives anything, as the code already reads and writes the value directly. Anyway, I would compare the assembly code (I would need more time for that).

Unfortunately, I switched jobs and don't have access to the real-world data and scenarios I used this queue for, so I can't really compare beyond synthetic benchmarks. In general, the bottleneck of the algorithm is Interlocked.CompareExchange, which doesn't scale well with the number of CPUs.

Regardless of what I said, you have made awesome improvements. I'm glad that there are use cases for this code. Thank you for doing it and for raising the discussion.
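
A simplified sketch (padding and the dequeue path omitted; not the repo's exact code) of the enqueue path, showing the Interlocked.CompareExchange on the shared position identified above as the scaling bottleneck:

using System.Threading;

public class MpmcSketch
{
    private struct Cell
    {
        public int Sequence;
        public object Element;
    }

    private readonly Cell[] _buffer;
    private readonly int _bufferMask;
    private int _enqueuePos;

    public MpmcSketch(int bufferSize) // bufferSize must be a power of two
    {
        _buffer = new Cell[bufferSize];
        _bufferMask = bufferSize - 1;
        for (var i = 0; i < bufferSize; i++) _buffer[i].Sequence = i;
    }

    public bool TryEnqueue(object item)
    {
        do
        {
            var pos = Volatile.Read(ref _enqueuePos);
            var index = pos & _bufferMask;
            var seq = Volatile.Read(ref _buffer[index].Sequence);
            // The CAS below is the contended step: every producer races on the
            // same _enqueuePos cache line, so throughput flattens as cores grow.
            if (seq == pos &&
                Interlocked.CompareExchange(ref _enqueuePos, pos + 1, pos) == pos)
            {
                _buffer[index].Element = item;
                Volatile.Write(ref _buffer[index].Sequence, pos + 1);
                return true;
            }
            if (seq < pos) return false; // the queue is full
        } while (true);
    }
}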


buybackoff (Author):

> The op of reading the _bufferMask and &ing it with the pos (var index = pos & _bufferMask;) is essentially free.

I should try it again. I just didn't like the presence of a field that could be removed and hoped that the JIT is smart enough.

> That's an interesting optimization. I'm curious why it helps, because the two fields _buffer and _bufferMask are padded and readonly: we only read them and never change them.

Maybe it prefetches the next load of position.

> Their cache lines shouldn't be affected by _enqueuePos and _dequeuePos changes.

Yes, that was a wrong claim about false sharing. I tested after adding the 2 buffer fields and then using just one of them, without adjusting the offsets. So 27 MOPS is from when I had introduced false sharing myself. But if a single buffer field is correctly padded, then just spinning (without reset) gives 36.2 MOPS, and using 2 fields increases the number to 37.4.

I have added an additional benchmark that is more typical for queues (vs. the pool-like one I had before). The code is below.

Also, I added if(spinner.NextSpinWillYield) spinner.Reset(), since yielding is just batching, which has its own use case but makes the comparison less interesting.

Numbers are for 4 threads.

  • With spinning+reset and 2 buffer fields: 13.8 MOPS
  • No spinning and 2 buffer fields: 12.3 MOPS
  • With spinning+reset and single buffer field: 13 MOPS
  • No spinning and single buffer field: 12.3 MOPS

Interestingly, the effect of 2 buffer fields is more visible with spinning.

So far I have ended up not using MPMCQueue and will now stop doing additional benchmarking. I just wanted to share an interesting result found while playing with different pool implementations. Thanks for the discussion and feedback!

[MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.AggressiveOptimization)]
internal void MPMCBenchmark()
{
    var count = 2_000_000;
    var threads = 4;
    var oddQueue = new MPMCQueue(128);
    var evenQueue = new MPMCQueue(128);
    using (Benchmark.Run("MPMCQueue", count * threads)) // measures time, calculates MOPS
    {
        Task.WaitAll(Enumerable.Range(0, threads).Select(i =>
        {
            // pre-fill both queues so that consumers have items to dequeue
            for (int j = 0; j < 4; j++)
            {
                oddQueue.Enqueue(new object());
                evenQueue.Enqueue(new object());
            }

            if (i % 2 == 0)
            {
                return Task.Factory.StartNew(() =>
                {
                    // loop variable renamed to k: redeclaring i here would
                    // conflict with the enclosing lambda parameter i
                    for (int k = 0; k < count; k++)
                    {
                        object x1;
                        do
                        {
                            x1 = evenQueue.Dequeue();
                        } while (x1 == null);

                        while (!oddQueue.Enqueue(x1)) { }
                    }
                }, TaskCreationOptions.LongRunning);
            }
            else
            {
                return Task.Factory.StartNew(() =>
                {
                    for (int k = 0; k < count; k++)
                    {
                        object x1;
                        do
                        {
                            x1 = oddQueue.Dequeue();
                        } while (x1 == null);

                        while (!evenQueue.Enqueue(x1)) { }
                    }
                }, TaskCreationOptions.LongRunning);
            }
        }).ToArray());
    }
}

private readonly Cell[] _dequeueBuffer;

[FieldOffset(SAFE_CACHE_LINE * 2 + 8)]
private volatile int _dequeuePos;

Hello :)
I wonder if the observed performance improvement really comes from having two copies of the same read-only variable, or rather from the fact that the frequently read fields are no longer in the same cache line as the frequently modified fields.
I mean: I can see how constantly writing to a cache line can thrash it and cause false-sharing issues, but I fail to see how having multiple readers is a problem - the cache line should be in the (S)hared state in the MESI protocol, right?
I see that in the old code the read-only _bufferMask had offset 8 and the write-intensive _enqueuePos had offset 64, but does that really mean they are guaranteed to be in different cache lines?
What if the MPMCQueue instance itself is not aligned at a cache-line boundary?
Could it then be that _bufferMask is at the start of a cache line and _enqueuePos at its end?
And if Facebook's observation that we should care about 128-byte rather than 64-byte units is correct, then the same applies to _buffer and _enqueuePos (and other pairs of "read-only" and "write-heavy" fields, as many of them are closer than 128 bytes to each other).
In the PR description you give one more probable reason why the duplication helps: the data needed to enqueue is in one line and the data needed to dequeue is in another, which creates nice "packages" of self-sufficient data. I can see how that can help: it requires one fetch instead of two when the data was missing from the cache. But how frequent is a cache miss in the tight loop of a benchmark?
I would appreciate it if someone could educate me on the most probable explanation (ideally backed by an experiment) of why this really helps :)

Author:

> Could it then be that _bufferMask is at the start of a cache line and _enqueuePos at its end?

With probability 8/64, yes. .NET aligns objects to 8 bytes, not to a cache line. And yes, if anything on a cache line is modified, all other threads will need to reload that line, so modifying _enqueuePos could interfere with Dequeue reading _bufferMask with 8/64 probability.

buybackoff (Author) commented Mar 12, 2020

Looks like placing the mask in a separate cache line, in the same way as the two buffers, helps. I ran my own simple benchmark; the improvement is c.3-4%, with a 2.5% probability that there is no difference (2-tailed t-test on the measurements). The theory is that we can do 3 loads from the same cache line and avoid touching the cache line that stored the mask - even though it's never modified, it still needs to be loaded.

        [FieldOffset(Settings.SAFE_CACHE_LINE)]
        protected readonly Cell[] _enqueueBuffer;

        [FieldOffset(Settings.SAFE_CACHE_LINE + 8)]
        private int _enqueuePos;

        [FieldOffset(Settings.SAFE_CACHE_LINE + 16)]
        protected readonly int _enqueueMask;

        [FieldOffset(Settings.SAFE_CACHE_LINE * 2)]
        private readonly Cell[] _dequeueBuffer;
        
        [FieldOffset(Settings.SAFE_CACHE_LINE * 2 + 8)]
        private int _dequeuePos;

        [FieldOffset(Settings.SAFE_CACHE_LINE * 2 + 16)]
        protected readonly int _dequeueMask;

I probably need to run a proper BDN benchmark to finalize the conclusion. Is it now possible to get a t-test from BDN?

How would you benchmark code like this, which depends on thread interactions more than on instruction count and is hence inherently unstable? Even without spinning, with turbo boost off and minimal background activity, the numbers are quite noisy here. In contrast, absolute results for instruction-bound benchmarks are pretty stable on the same setup.
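
A minimal sketch (not tied to BDN or any particular harness) of one way to compute the t-statistic for two sets of MOPS measurements, as in the 2-tailed t-test mentioned above (Welch's variant, assumed here):

using System;
using System.Linq;

public static class TTest
{
    // Welch's t-statistic for two independent samples of benchmark results.
    // Compare it against the t-distribution (Welch-Satterthwaite degrees of
    // freedom) to get the 2-tailed probability that there is no difference.
    public static double WelchT(double[] a, double[] b)
    {
        double meanA = a.Average();
        double meanB = b.Average();
        double varA = a.Sum(v => (v - meanA) * (v - meanA)) / (a.Length - 1);
        double varB = b.Sum(v => (v - meanB) * (v - meanB)) / (b.Length - 1);
        return (meanA - meanB) / Math.Sqrt(varA / a.Length + varB / b.Length);
    }
}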
