Things service cannot reconnect to mongo #1861
-
We deploy Ditto (tried 3.4.0 and 3.4.4) using the Helm chart. Sometimes, for example during a cluster upgrade, we face the following issue: the things service loses its connection to MongoDB (this is expected, since MongoDB might be temporarily unavailable during the upgrade), but later, when MongoDB is back online, the things service does not reconnect, and we keep seeing errors in the logs.
The only way we found to see this error (apart from the logs) is a request to http://ditto-things:8080/status/health (see the sketch after the logs below). The problem with this approach is that port 8080 is not declared in the Helm chart and is therefore not reachable for k8s health probes. Restarting the pod fixes the issue, but the restart has to be done manually, because k8s does not see any problem.
Logs from the things service that might be related:
{"@timestamp":"2024-01-11T12:52:00.028+01:00","@Version":"1","message":"Stream failed","logger_name":"org.eclipse.ditto.internal.utils.persistentactors.cleanup.PersistenceCleanupActor","thread_name":"ditto-cluster-pekko.actor.default-dispatcher-14","level":"ERROR","level_value":40000,"stack_trace":"java.util.concurrent.CompletionException: com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches ReadPreferenceServerSelector{readPreference=primary}. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.default-psmdb-db-rs0.databases.svc.cluster.local'., servers=[]\n\tat java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332)\n\tat java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347)\n\tat java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:636)\n\tat java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)\n\tat java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)\n\tat scala.concurrent.impl.FutureConvertersImpl$CF.apply(FutureConvertersImpl.scala:26)\n\tat scala.concurrent.impl.FutureConvertersImpl$CF.apply(FutureConvertersImpl.scala:23)\n\tat scala.concurrent.impl.Promise$Transformation.run(Promise.scala:484)\n\tat scala.concurrent.ExecutionContext$parasitic$.execute(ExecutionContext.scala:222)\n\tat scala.concurrent.impl.Promise$Transformation.submitWithValue(Promise.scala:429)\n\tat scala.concurrent.impl.Promise$DefaultPromise.submitWithValue(Promise.scala:338)\n\tat scala.concurrent.impl.Promise$DefaultPromise.tryComplete0(Promise.scala:285)\n\tat scala.concurrent.impl.Promise$Transformation.run(Promise.scala:504)\n\tat scala.concurrent.ExecutionContext$parasitic$.execute(ExecutionContext.scala:222)\n\tat scala.concurrent.impl.Promise$Transformation.submitWithValue(Promise.scala:429)\n\tat scala.concurrent.impl.Promise$DefaultPromise.submitWithValue(Promise.scala:338)\n\tat scala.concurrent.impl.Promise$DefaultPromise.tryComplete0(Promise.scala:285)\n\tat scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:278)\n\tat scala.concurrent.Promise.tryFailure(Promise.scala:117)\n\tat scala.concurrent.Promise.tryFailure$(Promise.scala:117)\n\tat scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:104)\n\tat org.apache.pekko.stream.impl.SeqStage$$anon$3.onUpstreamFailure(Sinks.scala:274)\n\tat org.apache.pekko.stream.impl.fusing.GraphInterpreter.processEvent(GraphInterpreter.scala:537)\n\tat 
org.apache.pekko.stream.impl.fusing.GraphInterpreter.execute(GraphInterpreter.scala:400)\n\tat org.apache.pekko.stream.impl.fusing.GraphInterpreterShell.runBatch(ActorGraphInterpreter.scala:662)\n\tat org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter$SimpleBoundaryEvent.execute(ActorGraphInterpreter.scala:71)\n\tat org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter$SimpleBoundaryEvent.execute$(ActorGraphInterpreter.scala:67)\n\tat org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter$BatchingActorInputBoundary$OnError.execute(ActorGraphInterpreter.scala:92)\n\tat org.apache.pekko.stream.impl.fusing.GraphInterpreterShell.processEvent(ActorGraphInterpreter.scala:637)\n\tat org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter.org$apache$pekko$stream$impl$fusing$ActorGraphInterpreter$$processEvent(ActorGraphInterpreter.scala:813)\n\tat org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter$$anonfun$receive$1.applyOrElse(ActorGraphInterpreter.scala:831)\n\tat org.apache.pekko.actor.Actor.aroundReceive(Actor.scala:547)\n\tat org.apache.pekko.actor.Actor.aroundReceive$(Actor.scala:545)\n\tat org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter.aroundReceive(ActorGraphInterpreter.scala:729)\n\tat org.apache.pekko.actor.ActorCell.receiveMessage(ActorCell.scala:590)\n\tat org.apache.pekko.actor.ActorCell.invoke(ActorCell.scala:557)\n\tat org.apache.pekko.dispatch.Mailbox.processMailbox(Mailbox.scala:280)\n\tat org.apache.pekko.dispatch.Mailbox.run(Mailbox.scala:241)\n\tat kamon.instrumentation.executor.ExecutorInstrumentation$InstrumentedForkJoinPool$TimingRunnable.run(ExecutorInstrumentation.scala:698)\n\tat org.apache.pekko.dispatch.ForkJoinExecutorConfigurator$PekkoForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:57)\n\tat java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)\n\tat java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)\n\tat java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)\n\tat java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)\n\tat java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)\nCaused by: com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches ReadPreferenceServerSelector{readPreference=primary}. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.default-psmdb-db-rs0.databases.svc.cluster.local'., servers=[]\n\tat com.mongodb.internal.connection.BaseCluster.createTimeoutException(BaseCluster.java:430)\n\tat com.mongodb.internal.connection.BaseCluster.handleServerSelectionRequest(BaseCluster.java:325)\n\tat com.mongodb.internal.connection.BaseCluster.access$700(BaseCluster.java:69)\n\tat com.mongodb.internal.connection.BaseCluster$WaitQueueHandler.run(BaseCluster.java:505)\n\tat java.base/java.lang.Thread.run(Thread.java:840)\n","sourceThread":"Thread-37","pekkoAddress":"pekko://ditto-cluster@192.168.40.197:2551","pekkoSource":"pekko://ditto-cluster/user/thingsRoot/persistenceCleanup"} {"@timestamp":"2024-01-11T12:52:00.029+01:00","@Version":"1","message":"Stream failed or shutdown. 
Next stream in starting from <>","logger_name":"org.eclipse.ditto.internal.utils.persistentactors.cleanup.PersistenceCleanupActor","thread_name":"ditto-cluster-pekko.actor.default-dispatcher-14","level":"INFO","level_value":20000,"sourceThread":"ditto-cluster-pekko.actor.default-dispatcher-11","pekkoAddress":"pekko://ditto-cluster@192.168.40.197:2551","pekkoSource":"pekko://ditto-cluster/user/thingsRoot/persistenceCleanup"} {"@timestamp":"2024-01-11T12:52:07.879+01:00","@Version":"1","message":"Sending the health of this system as requested: StatusInfo [status=DOWN, details=[StatusDetailMessage [level=ERROR, message=\"See detailed messages for: persistence.\"]], children=[StatusInfo [status=DOWN, details=[StatusDetailMessage [level=ERROR, message=\"com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.default-psmdb-db-rs0.databases.svc.cluster.local'., servers=[]\"]], children=[], label=persistence]], label=null]","logger_name":"org.eclipse.ditto.internal.utils.health.status.StatusSupplierActor","thread_name":"ditto-cluster-pekko.actor.default-dispatcher-6","level":"INFO","level_value":20000,"sourceThread":"Thread-38","pekkoAddress":"pekko://ditto-cluster@192.168.40.197:2551","pekkoSource":"pekko://ditto-cluster/user/statusSupplier"} {"@timestamp":"2024-01-11T12:52:12.009+01:00","@Version":"1","message":"Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.default-psmdb-db-rs0.databases.svc.cluster.local'., servers=[]","logger_name":"org.eclipse.ditto.internal.utils.persistence.mongo.MongoHealthChecker","thread_name":"ditto-cluster-pekko.actor.default-dispatcher-11","level":"ERROR","level_value":40000,"stack_trace":"com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.default-psmdb-db-rs0.databases.svc.cluster.local'., servers=[]\n\tat com.mongodb.internal.connection.BaseCluster.createTimeoutException(BaseCluster.java:430)\n\tat com.mongodb.internal.connection.BaseCluster.handleServerSelectionRequest(BaseCluster.java:325)\n\tat com.mongodb.internal.connection.BaseCluster.access$700(BaseCluster.java:69)\n\tat com.mongodb.internal.connection.BaseCluster$WaitQueueHandler.run(BaseCluster.java:505)\n\tat java.base/java.lang.Thread.run(Thread.java:840)\n","sourceThread":"Thread-39","pekkoAddress":"pekko://ditto-cluster@192.168.40.197:2551","pekkoSource":"pekko://ditto-cluster/user/thingsRoot/healthCheckingActor/persistence"} |
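For reference, a minimal sketch (not part of Ditto) of the manual check described above: polling the things service's /status/health endpoint and treating a DOWN status in the response as unhealthy. The in-cluster host name and the exact response body format are assumptions based on the logs; until port 8080 is declared in the chart, something like this would have to run from a pod inside the cluster.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ThingsHealthCheck {

    public static void main(final String[] args) throws Exception {
        final HttpClient client = HttpClient.newHttpClient();
        final HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://ditto-things:8080/status/health")) // in-cluster service name, assumed
                .GET()
                .build();

        final HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // Treat a non-2xx response or a "DOWN" status in the JSON body as unhealthy
        // (the body format is assumed from the StatusInfo entries in the logs above).
        final boolean unhealthy =
                response.statusCode() >= 300 || response.body().contains("\"DOWN\"");
        System.out.println(unhealthy
                ? "things service reports DOWN: " + response.body()
                : "things service reports UP");
    }
}
```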
Replies: 1 comment 2 replies
-
Oh, interesting, thanks for sharing @dimabarbul
We run MongoDB Atlas and never faced such an issue during a cluster upgrade in Mongo.
I think the best way to fix this would be that `/alive` would also know about this, so that k8s could restart the pod. This should be possible via Pekko Management: https://pekko.apache.org/docs/pekko-management/1.0//healthchecks.html#defining-a-health-check - via the config `pekko.management.health-checks.liveness-checks` and adding a custom implementation for that.
All `/status/health` routes are exposed in the Helm chart (or should be) via `{DITTO_ENDPOINT}/overall/status/health` - there you should also be able to see the result of http://ditto-things:8080/status/health.
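As a minimal sketch of the suggested approach (this is not an existing Ditto class; the package, class name and the way MongoDB reachability is tracked are assumptions for illustration), a custom Pekko Management health check is, per the linked docs, a class with a constructor taking the ActorSystem that implements `java.util.function.Supplier<CompletionStage<Boolean>>`:

```java
package org.example;

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

import org.apache.pekko.actor.ActorSystem;

public final class MongoLivenessCheck implements Supplier<CompletionStage<Boolean>> {

    // Flag that whatever component observes MongoDB connectivity (e.g. the
    // persistence health check) would flip; kept as a simple static flag here.
    private static final AtomicBoolean MONGO_REACHABLE = new AtomicBoolean(true);

    private final ActorSystem system;

    // Pekko Management instantiates checks via a no-args constructor or one
    // taking the ActorSystem.
    public MongoLivenessCheck(final ActorSystem system) {
        this.system = system;
    }

    /** Hypothetical hook for the component that monitors MongoDB connectivity. */
    public static void reportMongoReachable(final boolean reachable) {
        MONGO_REACHABLE.set(reachable);
    }

    @Override
    public CompletionStage<Boolean> get() {
        // Completing with false makes /alive report unhealthy, so a k8s liveness
        // probe pointed at the Pekko Management port would restart the pod.
        return CompletableFuture.completedFuture(MONGO_REACHABLE.get());
    }
}
```

Registered via configuration, e.g. `pekko.management.health-checks.liveness-checks.mongo-liveness = "org.example.MongoLivenessCheck"` (key and class name are illustrative), such a check would make `/alive` fail while MongoDB is unreachable, so k8s could restart the pod automatically instead of requiring a manual restart.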