Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Context deadline exceed when starting go-relayer with Osmosis node started more than 1 hr ago #1389

Open
jununifi opened this issue Jan 30, 2024 · 14 comments
Assignees

Comments

@jununifi
Copy link

This issue is typical to Osmosis node and for other nodes didn't detect such problem.

Way to reproduce:

  • Start Osmosis node and keep it open for 1 hour
  • Start go relayer connecting to Osmosis node
Jan 29 16:40:01 vultr rly[191642]: ts=2024-01-29T16:40:01.948613Z lvl=warn msg="Relayer start error" error="error querying channels: post failed: Post \"http://127.0.0.1:26657\": context deadline exceeded"
Jan 29 16:40:01 vultr rly[191642]: Error: error querying channels: post failed: Post "http://127.0.0.1:26657": context deadline exceeded

The way of fixing this was only restarting Osmosis node and starting go-relayer.
The problem here is that, once go-relayer stops, it can't restart automatically without Osmosis node restart.

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda2       600G  279G  296G  49% /

$ free -m
               total        used        free      shared  buff/cache   available
Mem:           32086       11559         586           0       19940       20060
Swap:           8191        2967        5224

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         40 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               GenuineIntel
  Model name:            Intel Core Processor (Broadwell, no TSX, IBRS)
    CPU family:          6
    Model:               61
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            2
    BogoMIPS:            4788.90
@jununifi jununifi changed the title Context deadline exceed issue when starting go-relayer with Osmosis node started more than 1 hr ago Context deadline exceed when starting go-relayer with Osmosis node started more than 1 hr ago Jan 30, 2024
@jtieri
Copy link
Member

jtieri commented Jan 30, 2024

Thanks for opening the issue @jununifi.

Can you share what version of Osmosis you are using, as well as the relayer version?
It may also be helpful if we can see how much memory the Osmosis node is using at startup when the relayer seems to work fine, as well as the memory consumption of the Osmosis node at the time you start to see this issue.

@danbryan
Copy link
Contributor

in the past, I have had luck changing this setting in config.toml when dealing with Context deadline exceed.

    [rpc]
    timeout_broadcast_tx_commit = "120s"

@jununifi
Copy link
Author

jununifi commented Feb 1, 2024

Thanks for opening the issue @jununifi.

Can you share what version of Osmosis you are using, as well as the relayer version? It may also be helpful if we can see how much memory the Osmosis node is using at startup when the relayer seems to work fine, as well as the memory consumption of the Osmosis node at the time you start to see this issue.

At the moment, we are using v22.0.1 version of Osmosis. But this used to happen in v20.x.x as well.
The context deadline exceed issue would be related to memory consumption?

@jununifi
Copy link
Author

jununifi commented Feb 1, 2024

in the past, I have had luck changing this setting in config.toml when dealing with Context deadline exceed.

    [rpc]
    timeout_broadcast_tx_commit = "120s"

Thanks for your information, we will try updating the configuration. We are using v2.3.1 version of go-relayer.

@jtieri
Copy link
Member

jtieri commented Feb 1, 2024

Thanks for opening the issue @jununifi.
Can you share what version of Osmosis you are using, as well as the relayer version? It may also be helpful if we can see how much memory the Osmosis node is using at startup when the relayer seems to work fine, as well as the memory consumption of the Osmosis node at the time you start to see this issue.

At the moment, we are using v22.0.1 version of Osmosis. But this used to happen in v20.x.x as well. The context deadline exceed issue would be related to memory consumption?

Thanks for the additional information! My coworker had the thought that since the issue only occurs after the Osmosis node has been running for some time, perhaps there is a memory leak in the Osmosis binary that causes a longer than usual response when the relayer attempts to query the channels on startup

@dylanschultzie
Copy link

I experienced this same issue, but with cosmoshub rather than osmosis.

One thing I'd argue is that this shouldn't close down the rly process. Stop attempting to run against that node, sure, but why kill the entire process which has other working channels?

@jtieri
Copy link
Member

jtieri commented Feb 1, 2024

I experienced this same issue, but with cosmoshub rather than osmosis.

One thing I'd argue is that this shouldn't close down the rly process. Stop attempting to run against that node, sure, but why kill the entire process which has other working channels?

interesting 🤔

i do agree that the current behavior is a bit extreme. i think addressing #1268 by only killing the PathProcessor related to the problematic node is a much better design and should be something we address in the next release. implementing support for multiple RPC endpoints to be configured would help alleviate some of this as well

@jtieri
Copy link
Member

jtieri commented Feb 2, 2024

One thing that we noticed is that we were using a single 60s timeout for the entire paginated query used for querying all channels. For chains like Osmosis and Cosmos Hub, that have a ton of channels in state, this could potentially be problematic. In #1395 I've made changes to use a 10s timeout per paginated query request so this should alleviate some of the issues.

Will also prioritize only killing PathProcessors related to a problematic node vs. killing the entire rly process.

@akc2267
Copy link
Contributor

akc2267 commented Feb 12, 2024

this should be resolved in 2.5.1

@akc2267 akc2267 closed this as completed Feb 12, 2024
@ertemann
Copy link
Contributor

@akc2267 @jtieri we still have this constantly even on 2.5.1

Both larger config and smaller config instances just go to an infinite loop where it reports:

2024-02-22T01:11:44.059393Z	warn	Relayer start error	{"error": "error querying channels: rpc error: code = Unknown desc = Ibc Error: Invalid query path"}
Error: error querying channels: rpc error: code = Unknown desc = Ibc Error: Invalid query path```

This it reports constantly for 10s of channels and chains. it stalls after ~1-2s and restarts the process going into the same error.

Our configs are correct and we can curl the URLs from the server proving it has access to the nodes

@jtieri jtieri reopened this Feb 22, 2024
@ertemann
Copy link
Contributor

funny thing is though that although its in a restart loop, we are processing packets.

@jtieri
Copy link
Member

jtieri commented Feb 26, 2024

@akc2267 @jtieri we still have this constantly even on 2.5.1

Both larger config and smaller config instances just go to an infinite loop where it reports:

2024-02-22T01:11:44.059393Z	warn	Relayer start error	{"error": "error querying channels: rpc error: code = Unknown desc = Ibc Error: Invalid query path"}
Error: error querying channels: rpc error: code = Unknown desc = Ibc Error: Invalid query path```

This it reports constantly for 10s of channels and chains. it stalls after ~1-2s and restarts the process going into the same error.

Our configs are correct and we can curl the URLs from the server proving it has access to the nodes

I would not expect to see these Ibc Error: Invalid query path logs for the originally described issue, this seems to be an entirely different issue altogether.

Do you know which chain is reporting these errors? I will likely need to do some local testing to sus out what the issue is with regards to that specific error

@ertemann
Copy link
Contributor

ertemann commented Mar 7, 2024

@jtieri it is indeed a separate issues, smaller configs now start.

The error logging shows no exact chain though which is causing this and i cant seem to distill that without doing an elimination test on the config which supports > 50 chains.

@ertemann
Copy link
Contributor

ertemann commented Mar 7, 2024

happy to share the config though privately if that helps

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants