Things service cannot reconnect to mongo #1861
-
We deploy Ditto (tried 3.4.0 and 3.4.4) using the Helm chart. Sometimes, for example during a cluster upgrade, we face the following issue: the things service loses its connection to MongoDB (this is expected, since MongoDB might be temporarily unavailable during the upgrade), but later, when MongoDB is back online, the things service does not reconnect, and we keep seeing errors in the logs.
The only way we found to see this error (apart from the logs) is a request to http://ditto-things:8080/status/health (see the sketch after the logs below). The problem with this approach is that port 8080 is not declared in the Helm chart and is therefore not reachable for k8s health probes. Restarting the pod fixes the issue, but the restart has to be done manually, because k8s does not see any problem.
Logs from the things service that might be related:
{"@timestamp":"2024-01-11T12:52:00.028+01:00","@Version":"1","message":"Stream failed","logger_name":"org.eclipse.ditto.internal.utils.persistentactors.cleanup.PersistenceCleanupActor","thread_name":"ditto-cluster-pekko.actor.default-dispatcher-14","level":"ERROR","level_value":40000,"stack_trace":"java.util.concurrent.CompletionException: com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches ReadPreferenceServerSelector{readPreference=primary}. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.default-psmdb-db-rs0.databases.svc.cluster.local'., servers=[]\n\tat java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332)\n\tat java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347)\n\tat java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:636)\n\tat java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)\n\tat java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)\n\tat scala.concurrent.impl.FutureConvertersImpl$CF.apply(FutureConvertersImpl.scala:26)\n\tat scala.concurrent.impl.FutureConvertersImpl$CF.apply(FutureConvertersImpl.scala:23)\n\tat scala.concurrent.impl.Promise$Transformation.run(Promise.scala:484)\n\tat scala.concurrent.ExecutionContext$parasitic$.execute(ExecutionContext.scala:222)\n\tat scala.concurrent.impl.Promise$Transformation.submitWithValue(Promise.scala:429)\n\tat scala.concurrent.impl.Promise$DefaultPromise.submitWithValue(Promise.scala:338)\n\tat scala.concurrent.impl.Promise$DefaultPromise.tryComplete0(Promise.scala:285)\n\tat scala.concurrent.impl.Promise$Transformation.run(Promise.scala:504)\n\tat scala.concurrent.ExecutionContext$parasitic$.execute(ExecutionContext.scala:222)\n\tat scala.concurrent.impl.Promise$Transformation.submitWithValue(Promise.scala:429)\n\tat scala.concurrent.impl.Promise$DefaultPromise.submitWithValue(Promise.scala:338)\n\tat scala.concurrent.impl.Promise$DefaultPromise.tryComplete0(Promise.scala:285)\n\tat scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:278)\n\tat scala.concurrent.Promise.tryFailure(Promise.scala:117)\n\tat scala.concurrent.Promise.tryFailure$(Promise.scala:117)\n\tat scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:104)\n\tat org.apache.pekko.stream.impl.SeqStage$$anon$3.onUpstreamFailure(Sinks.scala:274)\n\tat org.apache.pekko.stream.impl.fusing.GraphInterpreter.processEvent(GraphInterpreter.scala:537)\n\tat 
org.apache.pekko.stream.impl.fusing.GraphInterpreter.execute(GraphInterpreter.scala:400)\n\tat org.apache.pekko.stream.impl.fusing.GraphInterpreterShell.runBatch(ActorGraphInterpreter.scala:662)\n\tat org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter$SimpleBoundaryEvent.execute(ActorGraphInterpreter.scala:71)\n\tat org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter$SimpleBoundaryEvent.execute$(ActorGraphInterpreter.scala:67)\n\tat org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter$BatchingActorInputBoundary$OnError.execute(ActorGraphInterpreter.scala:92)\n\tat org.apache.pekko.stream.impl.fusing.GraphInterpreterShell.processEvent(ActorGraphInterpreter.scala:637)\n\tat org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter.org$apache$pekko$stream$impl$fusing$ActorGraphInterpreter$$processEvent(ActorGraphInterpreter.scala:813)\n\tat org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter$$anonfun$receive$1.applyOrElse(ActorGraphInterpreter.scala:831)\n\tat org.apache.pekko.actor.Actor.aroundReceive(Actor.scala:547)\n\tat org.apache.pekko.actor.Actor.aroundReceive$(Actor.scala:545)\n\tat org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter.aroundReceive(ActorGraphInterpreter.scala:729)\n\tat org.apache.pekko.actor.ActorCell.receiveMessage(ActorCell.scala:590)\n\tat org.apache.pekko.actor.ActorCell.invoke(ActorCell.scala:557)\n\tat org.apache.pekko.dispatch.Mailbox.processMailbox(Mailbox.scala:280)\n\tat org.apache.pekko.dispatch.Mailbox.run(Mailbox.scala:241)\n\tat kamon.instrumentation.executor.ExecutorInstrumentation$InstrumentedForkJoinPool$TimingRunnable.run(ExecutorInstrumentation.scala:698)\n\tat org.apache.pekko.dispatch.ForkJoinExecutorConfigurator$PekkoForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:57)\n\tat java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)\n\tat java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)\n\tat java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)\n\tat java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)\n\tat java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)\nCaused by: com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches ReadPreferenceServerSelector{readPreference=primary}. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.default-psmdb-db-rs0.databases.svc.cluster.local'., servers=[]\n\tat com.mongodb.internal.connection.BaseCluster.createTimeoutException(BaseCluster.java:430)\n\tat com.mongodb.internal.connection.BaseCluster.handleServerSelectionRequest(BaseCluster.java:325)\n\tat com.mongodb.internal.connection.BaseCluster.access$700(BaseCluster.java:69)\n\tat com.mongodb.internal.connection.BaseCluster$WaitQueueHandler.run(BaseCluster.java:505)\n\tat java.base/java.lang.Thread.run(Thread.java:840)\n","sourceThread":"Thread-37","pekkoAddress":"pekko://ditto-cluster@192.168.40.197:2551","pekkoSource":"pekko://ditto-cluster/user/thingsRoot/persistenceCleanup"} {"@timestamp":"2024-01-11T12:52:00.029+01:00","@Version":"1","message":"Stream failed or shutdown. 
Next stream in starting from <>","logger_name":"org.eclipse.ditto.internal.utils.persistentactors.cleanup.PersistenceCleanupActor","thread_name":"ditto-cluster-pekko.actor.default-dispatcher-14","level":"INFO","level_value":20000,"sourceThread":"ditto-cluster-pekko.actor.default-dispatcher-11","pekkoAddress":"pekko://ditto-cluster@192.168.40.197:2551","pekkoSource":"pekko://ditto-cluster/user/thingsRoot/persistenceCleanup"} {"@timestamp":"2024-01-11T12:52:07.879+01:00","@Version":"1","message":"Sending the health of this system as requested: StatusInfo [status=DOWN, details=[StatusDetailMessage [level=ERROR, message=\"See detailed messages for: persistence.\"]], children=[StatusInfo [status=DOWN, details=[StatusDetailMessage [level=ERROR, message=\"com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.default-psmdb-db-rs0.databases.svc.cluster.local'., servers=[]\"]], children=[], label=persistence]], label=null]","logger_name":"org.eclipse.ditto.internal.utils.health.status.StatusSupplierActor","thread_name":"ditto-cluster-pekko.actor.default-dispatcher-6","level":"INFO","level_value":20000,"sourceThread":"Thread-38","pekkoAddress":"pekko://ditto-cluster@192.168.40.197:2551","pekkoSource":"pekko://ditto-cluster/user/statusSupplier"} {"@timestamp":"2024-01-11T12:52:12.009+01:00","@Version":"1","message":"Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.default-psmdb-db-rs0.databases.svc.cluster.local'., servers=[]","logger_name":"org.eclipse.ditto.internal.utils.persistence.mongo.MongoHealthChecker","thread_name":"ditto-cluster-pekko.actor.default-dispatcher-11","level":"ERROR","level_value":40000,"stack_trace":"com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.default-psmdb-db-rs0.databases.svc.cluster.local'., servers=[]\n\tat com.mongodb.internal.connection.BaseCluster.createTimeoutException(BaseCluster.java:430)\n\tat com.mongodb.internal.connection.BaseCluster.handleServerSelectionRequest(BaseCluster.java:325)\n\tat com.mongodb.internal.connection.BaseCluster.access$700(BaseCluster.java:69)\n\tat com.mongodb.internal.connection.BaseCluster$WaitQueueHandler.run(BaseCluster.java:505)\n\tat java.base/java.lang.Thread.run(Thread.java:840)\n","sourceThread":"Thread-39","pekkoAddress":"pekko://ditto-cluster@192.168.40.197:2551","pekkoSource":"pekko://ditto-cluster/user/thingsRoot/healthCheckingActor/persistence"} |
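For reference, a minimal sketch (not part of Ditto) of the manual check described above: polling the things service's /status/health endpoint and treating a DOWN status in the response as unhealthy. The in-cluster host name and the exact response body format are assumptions based on the logs; until port 8080 is declared in the chart, something like this would have to run from a pod inside the cluster.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ThingsHealthCheck {

    public static void main(final String[] args) throws Exception {
        final HttpClient client = HttpClient.newHttpClient();
        final HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://ditto-things:8080/status/health")) // in-cluster service name, assumed
                .GET()
                .build();

        final HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // Treat a non-2xx response or a "DOWN" status in the JSON body as unhealthy
        // (the body format is assumed from the StatusInfo entries in the logs above).
        final boolean unhealthy =
                response.statusCode() >= 300 || response.body().contains("\"DOWN\"");
        System.out.println(unhealthy
                ? "things service reports DOWN: " + response.body()
                : "things service reports UP");
    }
}
```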
Replies: 1 comment 2 replies
-
Oh, interesting, thanks for sharing @dimabarbul
We run MongoDB Atlas and never faced such an issue during a cluster upgrade in Mongo.
I think the best way to fix this would be that `/alive` would also know about this, so that k8s could restart the pod. This should be possible via Pekko Management: https://pekko.apache.org/docs/pekko-management/1.0//healthchecks.html#defining-a-health-check - via the config `pekko.management.health-checks.liveness-checks` and adding a custom implementation for that.
All `/status/health` routes are exposed in the Helm chart (or should be) via `{DITTO_ENDPOINT}/overall/status/health` - there you should also be able to see the result of http://ditto-things:8080/status/health.
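As a minimal sketch of the suggested approach (this is not an existing Ditto class; the package, class name and the way MongoDB reachability is tracked are assumptions for illustration), a custom Pekko Management health check is, per the linked docs, a class with a constructor taking the ActorSystem that implements `java.util.function.Supplier<CompletionStage<Boolean>>`:

```java
package org.example;

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

import org.apache.pekko.actor.ActorSystem;

public final class MongoLivenessCheck implements Supplier<CompletionStage<Boolean>> {

    // Flag that whatever component observes MongoDB connectivity (e.g. the
    // persistence health check) would flip; kept as a simple static flag here.
    private static final AtomicBoolean MONGO_REACHABLE = new AtomicBoolean(true);

    private final ActorSystem system;

    // Pekko Management instantiates checks via a no-args constructor or one
    // taking the ActorSystem.
    public MongoLivenessCheck(final ActorSystem system) {
        this.system = system;
    }

    /** Hypothetical hook for the component that monitors MongoDB connectivity. */
    public static void reportMongoReachable(final boolean reachable) {
        MONGO_REACHABLE.set(reachable);
    }

    @Override
    public CompletionStage<Boolean> get() {
        // Completing with false makes /alive report unhealthy, so a k8s liveness
        // probe pointed at the Pekko Management port would restart the pod.
        return CompletableFuture.completedFuture(MONGO_REACHABLE.get());
    }
}
```

Registered via configuration, e.g. `pekko.management.health-checks.liveness-checks.mongo-liveness = "org.example.MongoLivenessCheck"` (key and class name are illustrative), such a check would make `/alive` fail while MongoDB is unreachable, so k8s could restart the pod automatically instead of requiring a manual restart.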