
orleans 3.0 compatibility #5

Open · wants to merge 7 commits into master

Conversation

@justmara (Contributor)

Orleans 2.3 introduced a breaking change in SiloPersistentStreamConfigurator (in this PR), so older versions of this provider began throwing errors on startup:

System.MissingMethodException: Method not found: 'Void Orleans.Streams.SiloPersistentStreamConfigurator..ctor(System.String, Orleans.Hosting.ISiloHostBuilder, System.Func`3<System.IServiceProvider,System.String,Orleans.Streams.IQueueAdapterFactory>)'.
   at Orleans.Streaming.SiloRabbitMqStreamConfigurator`1..ctor(String name, ISiloHostBuilder builder)
   at Orleans.Hosting.SiloBuilderExtensions.AddRabbitMqStream[TSerializer](ISiloHostBuilder builder, String name, Action`1 configure)
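
For reference, this is the pre-2.3 shape of the base-class constructor, reconstructed from the MissingMethodException above; only the signature is grounded in this thread, and the class body is elided:

    using System;
    using Orleans.Hosting;
    using Orleans.Streams;

    // Reconstructed from the exception text: the constructor overload that
    // SiloRabbitMqStreamConfigurator`1 was compiled against and that Orleans 2.3
    // no longer provides, hence the Method-not-found error at silo startup.
    public class SiloPersistentStreamConfigurator
    {
        public SiloPersistentStreamConfigurator(
            string name,
            ISiloHostBuilder builder,
            Func<IServiceProvider, string, IQueueAdapterFactory> adapterFactory)
        {
            /* elided */
        }
    }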

@justmara (Contributor, Author)

Hmm, I will check those failing tests tomorrow.

@justmara changed the title from "orleans 2.3 compatibility" to "orleans 3.0 compatibility" on Oct 15, 2019
@justmara (Contributor, Author)

Something weird happens with the AppVeyor build:

OrleansGenerateCode:
  "dotnet" "C:\projects\orleans-streams-rabbitmqstreamprovider\packages\microsoft.orleans.codegenerator.msbuild\3.0.0-rc1\build\..\tasks\netcoreapp3.0\Orleans.CodeGenerator.MSBuild.dll" SourceToSource "C:\projects\orleans-streams-rabbitmqstreamprovider\Orleans.Streams.RabbitMqStreamProvider.Tests\obj\Release\netcoreapp2.1\RabbitMqStreamTests.orleans.g.args.txt"

The test project is compiled for netcoreapp2.1, but Orleans.CodeGenerator.MSBuild tries to use the netcoreapp3.0 task for some reason o_O
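
One plausible workaround, purely an assumption on my part and not something confirmed in this thread, is to align the test project's target framework with the netcoreapp3.0 codegen task that the 3.0.0-rc1 package selects, e.g. in the test project's .csproj:

    <!-- Hypothetical workaround, not from this thread: retarget the test project
         so it matches the netcoreapp3.0 task shipped by
         Microsoft.Orleans.CodeGenerator.MSBuild 3.0.0-rc1. -->
    <PropertyGroup>
      <TargetFramework>netcoreapp3.0</TargetFramework>
    </PropertyGroup>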

@zitmen (Owner) commented Oct 17, 2019

Hi, I will allocate some time today and try to resolve this.

@justmara (Contributor, Author)

No luck? Maybe I can help somehow?
The most confusing part is that it builds and passes the tests locally :) So I can't guess what's wrong with AppVeyor.

@zitmen (Owner) commented Oct 18, 2019

Not sure either, but it is not a good sign. I am not able to tell with confidence whether there is a concurrency issue in the silo, in the RmqStreamProvider, or in the tests; it is a red flag, so I cannot merge this until it is resolved.
The previous versions were battle-tested in a production environment, but I no longer have that luxury, so we have to rely solely on the tests.
I will continue with this later.

@zitmen (Owner) commented Oct 21, 2019

The latest stable (2.4.3) is available; v3 is still an RC, so I will wait on that... maybe by then the issues with the tests will be resolved. They are pretty random :( and seem like some issue with the Orleans cluster. I will resolve the conflicts later.

@justmara (Contributor, Author)

@zitmen, can you point me to the issues you are talking about so I can try checking them myself? I have run these tests locally many times without issues.

And regarding 3.0 being an RC: the release notes claim "This release is ready for production use." So there is little chance of further changes, and if any issues exist, better to try to fix them now =)

@zitmen (Owner) commented Oct 21, 2019

The following issue appears randomly on AppVeyor. I am not able to reproduce it locally. It fails 1 out of 3 times, which is not good. I am not sure whether it is caused by a timeout, by one of the silos crashing, or by something else. I hope to find some more time this week to look into this further. But feel free to investigate yourself. Thanks.

 X TestConcurrentProcessingOnFly [7s 249ms]
  Error Message:
   Orleans.Runtime.OrleansMessageRejectionException : Forwarding failed: tried to forward message Request S127.0.0.1:11111:309255876*cli/a519e186@34c4358a->S127.0.0.1:11111:309255876*grn/6C3F8DE7/00000000@126215b9 #589[ForwardCount=2]:  for 2 times after Non-existent activation to invalid activation. Rejecting now. 
  Stack Trace:
     at RabbitMqStreamTests.IntegrationTestHelpers.TestRmqStreamProviderOnFly(TestCluster cluster, Action`1 setupProxy, Int32 nMessages, Int32 itersToWait, RmqSerializer serializer) in C:\projects\orleans-streams-rabbitmqstreamprovider\Orleans.Streams.RabbitMqStreamProvider.Tests\IntegrationTestHelpers.cs:line 64
   at RabbitMqStreamTests.RmqIntegrationTests.TestConcurrentProcessingOnFly() in C:\projects\orleans-streams-rabbitmqstreamprovider\Orleans.Streams.RabbitMqStreamProvider.Tests\RmqIntegrationTests.cs:line 25
   at NUnit.Framework.Internal.TaskAwaitAdapter.GenericAdapter`1.BlockUntilCompleted() in D:\a\1\s\src\NUnitFramework\framework\Internal\TaskAwaitAdapter.cs:line 95
   at NUnit.Framework.Internal.AsyncToSyncAdapter.Await(Func`1 invoke) in D:\a\1\s\src\NUnitFramework\framework\Internal\AsyncToSyncAdapter.cs:line 60
   at NUnit.Framework.Internal.Commands.TestMethodCommand.Execute(TestExecutionContext context) in D:\a\1\s\src\NUnitFramework\framework\Internal\Commands\TestMethodCommand.cs:line 64
   at NUnit.Framework.Internal.Commands.BeforeAndAfterTestCommand.<>c__DisplayClass1_0.<Execute>b__0() in D:\a\1\s\src\NUnitFramework\framework\Internal\Commands\BeforeAndAfterTestCommand.cs:line 58
   at NUnit.Framework.Internal.Commands.BeforeAndAfterTestCommand.RunTestMethodInThreadAbortSafeZone(TestExecutionContext context, Action action) in D:\a\1\s\src\NUnitFramework\framework\Internal\Commands\BeforeAndAfterTestCommand.cs:line 73

https://ci.appveyor.com/project/zitmen65687/orleans-streams-rabbitmqstreamprovider/builds/28237257

@justmara (Contributor, Author) commented Oct 21, 2019

Why did you make your own IQueueAdapterCache? Is Orleans' default SimpleQueueAdapterCache not enough for your needs, or does it behave differently somehow? I've replaced your cache with Orleans' built-in one in my pull request. I don't really think it can be the root of this issue, but it needs mentioning.
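
For concreteness, the swap described here amounts to returning Orleans' built-in cache from the adapter factory. A minimal sketch, assuming the Orleans 2.x/3.x constructor shape; the method name and cache size are illustrative, not the PR's exact code:

    using Microsoft.Extensions.Logging;
    using Orleans.Configuration;
    using Orleans.Providers.Streams.Common;
    using Orleans.Streams;

    public static class CacheSketch
    {
        // Sketch: what IQueueAdapterFactory.GetQueueAdapterCache() can return
        // when the custom cache is replaced with the built-in one.
        public static IQueueAdapterCache CreateBuiltInCache(string providerName, ILoggerFactory loggerFactory)
        {
            var options = new SimpleQueueCacheOptions { CacheSize = 4096 }; // arbitrary example size
            return new SimpleQueueAdapterCache(options, providerName, loggerFactory);
        }
    }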

@zitmen (Owner) commented Oct 21, 2019

I don't know whether they have fixed it or not, but it was broken and unreliable. Too many race conditions.

@justmara (Contributor, Author)

Hmm, it has been unchanged since 2017, yet it is used in the Azure Queue, GCP, and AWS SQS stream providers...

@zitmen (Owner) commented Oct 21, 2019

That does not mean it is correct. I don't know what guarantees the other providers have, but I tested it extensively with ACKing and it did not work correctly. That was the reason this provider was created in the first place; otherwise I could have used the contrib one.

@justmara (Contributor, Author)

I didn't say whether it is correct, only that it is widely used without any reported issues. Maybe in real-world scenarios (inside the Orleans scheduler) all these races you're talking about do not exist at all?

@zitmen (Owner) commented Oct 22, 2019

You may be right.
I am just saying that the first version, v1.5.2, was tested a lot because it was used in production (maybe it still is, I don't know). I found many issues that were really hard to debug, and they all went away when I discovered some problems with the cache. I am not saying it cannot work with the other providers, but my RMQ provider uses ACKing, and in order to ACK a message reliably I have to wait until it has been processed by a grain; that is why there were issues with the cache and it had to be replaced.
That's the background story, simple as that.
I will test v3 both with and without the custom cache and we will see. I just don't want to rush this, because then it will become really difficult to find and fix the issues.
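
To make the constraint concrete: in an Orleans persistent stream provider, the adapter receiver's MessagesDeliveredAsync is the natural place to ACK back to RabbitMQ, and it only fires for batches the cache releases as delivered. The sketch below is illustrative only; the class, the tag bookkeeping, and the field names are assumptions, not the provider's actual code:

    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Orleans.Streams;
    using RabbitMQ.Client;

    // Sketch: ACK a RabbitMQ delivery only once Orleans reports the batch as
    // delivered (i.e., processed by a grain). A cache that never releases failed
    // batches, or releases them too early, makes this unreliable.
    public class AckingReceiverSketch
    {
        private readonly IModel channel;                              // open RabbitMQ channel (assumed)
        private readonly Dictionary<StreamSequenceToken, ulong> tags; // batch -> delivery tag (assumed bookkeeping)

        public AckingReceiverSketch(IModel channel, Dictionary<StreamSequenceToken, ulong> tags)
        {
            this.channel = channel;
            this.tags = tags;
        }

        // Mirrors IQueueAdapterReceiver.MessagesDeliveredAsync.
        public Task MessagesDeliveredAsync(IList<IBatchContainer> messages)
        {
            foreach (var batch in messages)
            {
                channel.BasicAck(tags[batch.SequenceToken], multiple: false);
            }
            return Task.CompletedTask;
        }
    }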

@justmara (Contributor, Author) commented Oct 22, 2019

From my side, I can tell you that we have been running a forked version of your provider (with fixed compatibility for 2.3+ and SimpleQueueAdapterCache) in production for about two months, under a load of around 500 msg/s, without any issues. Or maybe we're just missing some logs and don't know that we have issues? :)

@justmara (Contributor, Author)

Ah, I remember one more modification we made: that stupid 'fix', because we did not have time to dig down to its root cause. Without it, messages were delivered only to the first subscriber.
I'll try to investigate that now, too.

@justmara (Contributor, Author)

Re-checked that ShouldDeliver: it works fine now, so I removed our fix.
Tests pass locally, but AppVeyor still fails :(
All tests also pass fine, even with higher message counts, when using SimpleQueueAdapterCache.

@KevinCathcart

So I've looked into SimpleQueueCache vs. ConcurrentQueueCache.

I'm not seeing problematic race conditions, because SimpleQueueCache is supposed to be called only from the PersistentStreamPullingAgent's thread. It is also fully synchronous, so it is not like there can be asynchronous reentrancy race conditions either.

But I do see some things in SimpleQueueCache that could be problematic with respect to ack/nack'ing messages. On the other hand, ConcurrentQueueCache is not a conformant IQueueCache, and its behavior deviates from other providers (and Orleans' documented behavior) in a rather profound way in some circumstances.

Starting with SimpleQueueCache: the main problem is that if delivery of a message fails (or the receiver throws an exception), then after one minute (sooner in certain cases) the message is marked as having failed delivery. Such messages never get passed to MessagesDeliveredAsync, so they can never be acked or nacked. (It also looks to me like there might be a bug in this behavior where the wrong message gets marked as failed; see dotnet/orleans#6298.)

A lesser consideration is that it can only ack messages once they are no longer needed in the cache. Because messages from different virtual streams are combined into one cache, and it only releases whole cache buckets at a time, a slow consumer on one stream could substantially delay the release of messages for other streams, which in turn delays acking those messages back to RabbitMQ. I don't think this one is usually a terribly big deal, especially since Azure Queues acts the same way (although it never needs to nack messages, since they are requeued based on a timeout instead).

ConcurrentQueueCache, on the other hand, always lets messages be ack'ed or nack'ed as soon as they are first delivered. The problem is that it is written in such a way that when there are multiple subscribers, they receive messages round-robin style, with each message processed by only a single subscriber. While that can be useful (and mirrors how RabbitMQ queues work), multiple subscribers to an Orleans stream are each supposed to receive all messages in it. To quote the Orleans documentation:

An Orleans stream may have multiple producers and multiple consumers. A message published by a producer will be delivered to all consumers that were subscribed to the stream before the message was published.
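
A minimal sketch of the contract being described, using the Orleans 2.x/3.x IQueueCache members; the class, method, and parameter names here are illustrative:

    using System;
    using Orleans.Streams;

    public static class BroadcastContractSketch
    {
        // Each subscriber obtains its own independent cursor over the same cached
        // batches, so every subscriber is supposed to observe every message. A
        // cache that hands each message to exactly one cursor (round-robin, as
        // described above for ConcurrentQueueCache) breaks this contract.
        public static void ShowBroadcastContract(IQueueCache cache, IStreamIdentity stream, StreamSequenceToken from)
        {
            IQueueCacheCursor cursorA = cache.GetCacheCursor(stream, from); // subscriber A
            IQueueCacheCursor cursorB = cache.GetCacheCursor(stream, from); // subscriber B

            while (cursorA.MoveNext())
            {
                IBatchContainer batch = cursorA.GetCurrent(out Exception deliveryException);
                // ...subscriber A sees this batch; a conformant cache lets cursorB
                // enumerate the very same batch independently.
            }
        }
    }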

@zitmen (Owner) commented Feb 14, 2020

@KevinCathcart, you nailed it. My memory of all the reasoning is quite blurry because it was a long time ago, but you are right that the main issue with the simple cache was the (n)acking and the fact that it was very unreliable in our application. My implementation was intended to behave exactly like any other RMQ client, because I was porting an external service, which triggered silo processing based on RMQ notifications, into Orleans streaming. Also, the application was required to work as close to real time as possible, so this whole issue with buckets was significant.
I don't really want to change the behavior. Forking it like justmara did is a good option. By the way, the other RMQ provider in OrleansContrib uses the simple cache, which is why I could not use it.
