Stuck Threads when creating a continous replication #80

Open
traugott opened this issue May 31, 2017 · 13 comments

traugott commented May 31, 2017

I have multiple Android devices. One device creates a LiteListener, and the other devices replicate to it using continuous replication.

In the class com.couchbase.lite.router.Router at line 1787 (version 1.4.0 from the Maven repo), the Status is set to zero. In this case, the method com.couchbase.lite.router.Router.start doesn't call sendResponse(), so the callback block doesn't get called.

The callback block is defined in com.couchbase.lite.listener.LiteServlet. It counts down the doneSignal. The thread is only returned to the thread pool once this doneSignal has been counted down.

As a result, the Acme thread pool runs out of threads and no further replication occurs.

MODIFIED: The callback also doesn't get called when the JVM that replicates terminates.
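
The failure mode described above can be sketched in isolation. This is a minimal, hypothetical stand-in and not Couchbase Lite code: `StuckLatchDemo` and `canServeNewRequest` are invented names, and the real doneSignal handling in LiteServlet may differ in detail. Each simulated request occupies one pool thread and waits on its own `CountDownLatch`; only some of the callbacks ever fire.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; not taken from the listener or router sources.
final class StuckLatchDemo {

    /**
     * Fills a fixed pool with "requests" that each wait on their own doneSignal,
     * fires only the first callbacksFired callbacks, and reports whether the pool
     * can still run one more request within one second.
     */
    static boolean canServeNewRequest(int poolSize, int callbacksFired) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            CountDownLatch allStarted = new CountDownLatch(poolSize);
            List<CountDownLatch> doneSignals = new ArrayList<>();
            for (int i = 0; i < poolSize; i++) {
                CountDownLatch doneSignal = new CountDownLatch(1);
                doneSignals.add(doneSignal);
                pool.execute(() -> {
                    allStarted.countDown();
                    try {
                        doneSignal.await(); // released only when the callback fires
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            allStarted.await(); // every worker thread is now occupied
            for (int i = 0; i < callbacksFired; i++) {
                doneSignals.get(i).countDown(); // simulate a callback being invoked
            }
            Future<?> probe = pool.submit(() -> { }); // the "next replication request"
            try {
                probe.get(1, TimeUnit.SECONDS);
                return true;
            } catch (TimeoutException e) {
                return false; // no free worker picked it up in time
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow(); // interrupt workers that are still waiting
        }
    }

    public static void main(String[] args) {
        System.out.println("no callback fired:  " + canServeNewRequest(20, 0));
        System.out.println("one callback fired: " + canServeNewRequest(20, 1));
    }
}
```

With no callback fired, the probe request should time out because every worker is still waiting; once a single callback fires, a worker becomes free and the probe runs.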


hideki commented May 31, 2017

@traugott

In the class com.couchbase.lite.router.Router in line 1787 (version 1.4.0 from maven repo) the Status is set to zero. In this case the Method

Can you provide a link to the code? Line 1787 on the release/1.4.0 branch might not point to the correct line.


traugott commented Jun 1, 2017

Sorry for writing the wrong line number; it should be 1878 in the Router class. I downloaded the source from Maven with the NetBeans "download source" button.

I also added a test project which shows the stuck threads:

https://github.com/traugott/CouchbaseLiteListenerTest

Short snippet around 1878:

        } else {
            if (timeout == 0)
                timeout = DEFAULT_CHANGES_TIMEOUT;
        }

        startTimeout();

        // Don't close connection; more data to come
        return new Status(0);  // Line 1878
    } else {
        if (options.isIncludeConflicts()) {
            connection.setResponseBody(new Body(responseBodyForChangesWithConflicts(changes, since)));
        } else {
            connection.setResponseBody(new Body(responseBodyForChanges(changes, since)));
        }
        return new Status(Status.OK);



hideki commented Jun 1, 2017

@traugott,
The /_changes REST call is a longpoll, so it intentionally blocks if there are no changes in the listener-side database.


traugott commented Jun 1, 2017

Right, /_changes is a longpoll. But if the client disconnects, the thread in the thread pool stays busy. Once all threads in the thread pool are busy (20 threads is the default), no new replications occur. I agree with blocking while the client stays connected, but not after the client has been closed.


hideki commented Jun 1, 2017

I guess that the Listener-side sockets cannot detect the disconnection when a client closes a connection or a network error occurs, so the Listener-side threads stay blocked. The client, however, sends new requests. As a result, all threads on the Listener side end up blocked. Can you capture the HTTP requests between the client and the listener?

By the way, how many client devices are you using for your tests?
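
Whether a server-side socket notices a closed peer depends on the server actually touching the socket. The sketch below is a hypothetical, plain-Java illustration on the loopback interface (`DisconnectProbe`, `probe`, and `demo` are invented names, and nothing here is taken from the TJWS or listener sources). It assumes the server has already consumed the request and expects no further bytes, so a timed `read()` either returns -1 (orderly close by the peer) or times out (peer still connected and idle).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical illustration only; not taken from the listener sources.
final class DisconnectProbe {

    /** Returns "closed" on end-of-stream, "idle" on read timeout, "data" otherwise. */
    static String probe(Socket socket, int timeoutMillis) throws IOException {
        socket.setSoTimeout(timeoutMillis); // bound the blocking read
        try {
            int b = socket.getInputStream().read();
            return b == -1 ? "closed" : "data";
        } catch (SocketTimeoutException e) {
            return "idle"; // nothing arrived, connection may still be alive
        }
    }

    /** Connects a loopback client to a throwaway server and probes the accepted socket. */
    static String demo(boolean clientCloses) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            if (clientCloses) {
                client.close(); // orderly shutdown: the server side sees end-of-stream
            }
            return probe(accepted, 200);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("client closed:    " + demo(true));
        System.out.println("client connected: " + demo(false));
    }
}
```

This only covers an orderly close. A peer that vanishes without closing the connection (for example after a network drop) would look "idle" here and is typically only noticed when a later write fails or a timeout expires.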


traugott commented Jun 1, 2017

The listener runs on an Android device. The replicators run on an Android device or on Linux. In my tests I only use the Android device with the listener and one client on my Linux machine.

As mentioned earlier, I added a test project, available at
https://github.com/traugott/CouchbaseLiteListenerTest

It is simple and runnable under Linux. It starts two managers for the listener and the replicators, and it shows the same effect (stuck threads).

At the moment I have no infrastructure for capturing the HTTP requests. Would you prefer Wireshark? I can only capture the HTTP requests on the Linux side.


hideki commented Jun 1, 2017

We have tested Android to Android, and we have confirmed that this architecture works under normal conditions.

  • Wireshark data might give us a hint about what causes the problem.

  • Maybe you could run the following command on both Android and Linux to check the socket status.
    netstat -an | grep <port no>

  • Also, a thread dump from Android can confirm whether all threads are in the running state.
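
If an external thread dump is hard to obtain, the thread states can also be summarized from inside the process. The following is a minimal sketch using only `Thread.getAllStackTraces()`; the class name `ThreadStateSummary` and the idea of filtering by a name prefix are assumptions for illustration, and the actual names of the listener's worker threads are not stated in this thread.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper; the thread-name prefix to pass in is an assumption.
final class ThreadStateSummary {

    /** Counts live threads per state, restricted to names starting with namePrefix. */
    static Map<Thread.State, Integer> summarize(String namePrefix) {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.getName().startsWith(namePrefix)) {
                counts.merge(t.getState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // An empty prefix matches every live thread in this JVM.
        System.out.println(summarize(""));
    }
}
```

Worker threads that are parked on a latch would be expected to show up as WAITING (or TIMED_WAITING if the wait has a timeout) rather than RUNNABLE.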


traugott commented Jun 2, 2017

At the moment I was only able to connect from a Windows machine to the Android device. The Listener port on the Android device is 5177.

The Android device has the IP 192.168.43.1, the Windows machine 192.168.43.73.

I recorded the communication between the Android device and the Windows machine. The netstat command was run after the client on the Windows machine was closed.

Because my Android device is not rooted, I used the adm tool to get a thread dump and copied it via the clipboard to a text file. I copied the thread dump after the client disconnected.

I zipped all three and attached them.

By the way, the same problem is shown by
https://github.com/traugott/CouchbaseLiteListenerTest
which is easy to run on your PC.

File: couchbaselite.zip

djpongh added this to the 1.4.1 milestone Jun 5, 2017
djpongh added the icebox label Jun 9, 2017

NitzDKoder commented Sep 17, 2017

@hideki
With reference to #83:

  1. The Listener is started on LOCALHOST/127.0.0.1, like a LAN, on an Android phone.

  2. The app UI (JS/HTML) makes REST requests using modules.js:

https://github.com/couchbaselabs/TodoLite-PhoneGap/blob/f1a810f9408a9807f3bf25ae69ba1d6a6aaee71c/js/modules.js

  3. As the Listener is running on LOCALHOST, like a LAN, on the Android device, will the remote LTE network going up/down cause any issues?

Please confirm.

Problem:

Suddenly none of the requests reach the Listener. We get xhr.statusCode == 0 for all XMLHttpRequest requests.

https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest/status

https://github.com/couchbaselabs/TodoLite-PhoneGap/blob/f1a810f9408a9807f3bf25ae69ba1d6a6aaee71c/js/modules.js#L1361

Why do we get xhr.statusCode == 0 from the WebView?


hideki commented Sep 17, 2017

@NitzDKoder
What requests does your app send? The _changes REST API with continuous=true?
If the network signal goes on and off frequently, I assume the Listener cannot detect the connection failure. It then keeps consuming a thread until the socket timeout occurs. That might cause the problem you are facing.

@NitzDKoder

@hideki the app sends

  1. replication as longpoll/continuous, that's for remote sync.
  2. database changes as longpoll/continuous for database changes.

How does the LTE network going up/down affect the local socket that was created?

Example:
1) The local server is running on, say, port 44444.
2) The app uses a socket on port 33333 and connects to 127.0.0.1:44444.

Where is the LTE up/down involved?

Thanks


hideki commented Sep 18, 2017

Hi @NitzDKoder

  1. replication as longpoll/continuous, that's for remote sync.

The Replicator does not connect to the Listener, so this should not affect the Listener.

  2. database changes as longpoll/continuous for database changes.

I believe this means the application's JavaScript sends a _changes request to the Listener with continuous=true. That is a local client connecting to a local server. Theoretically, this should not cause a socket (network) failure.

So LTE going up and down should not cause the problem that you are observing.

Do you see LTE going up and down affect local-to-local communication? If not, the problem is somewhere else.

djpongh modified the milestones: 1.4.1, 1.4.2 Sep 18, 2017

NitzDKoder commented Sep 19, 2017

@hideki

  1. We control remote replication by REST via the listener.
    Ex: Stop a continuous replication.

Stop Sync = {"target":"ptxdata","source":{"url":"https://AAAAAAAAAAAA:XXXXXXXXXXXXXXXXX@mydomain/data"},"continuous":true,"filter":"sync_gateway/bychannel","query_params":{"channels":"PTX-919480109390-meta-CH,919480109390-CH"}}
[08/28 17:40:08:013] [HTML5LOG] ReSTCBL ::: Request started::{"id":"139: POST https://:@localdomain:14480/_replicate"}

  2. We also listen to local db changes by REST via the listener.
    Ex: longpoll _changes.

Line 2050: [08/28 17:54:22:685] [HTML5LOG] ReSTCBL ::: Request started::{"id":"142: GET https://:@localdomain:14480/data/_changes?since=97&include_docs=true&style=main_only&heartbeat=300000&feed=longpoll"}

  3. Local database view queries are done using REST requests via the listener.
    Ex: view query.

Line 2553: History url = https://:@localdomain:14480/data/_design%2FgetHistory/_view/getHistory?key=%22tel%3A%2B919663733552%22&reduce=false&group=false

All of these requests sometimes fail on network transitions.
We see some exceptions in Serve.java (socket closed, socket timeout, and broken pipe).

Line 38690: [09/19 16:29:44:748] [13043] Listener: Serve Run parseRequest IOException Socket[address=localhost/localdomain,port=57987,localPort=24089]
Line 38691: [09/19 16:29:44:748] [13043] Listener: Serve Run parseRequest IOException
Line 38692: [09/19 16:29:44:749] [13043] Listener: Serve Run parseRequest SocketTimeoutException Socket[address=localhost/localdomain,port=57987,localPort=24089]
Line 38693: [09/19 16:29:44:750] [13043] Listener: Serve Run parseRequest SocketTimeoutException

[09/19 13:07:30:338] [5415] Listener: Serve Run parseRequest Exception socket Socket[address=localhost/localdomain,port=35280,localPort=41283]
[09/19 13:07:30:339] [5415] Listener: TJWS: IO error: javax.net.ssl.SSLException: Write error: ssl=0x94cba400: I/O error during system call, Broken pipe in processing a request /ptxdata/_changes from localhost/localdomain:41283 / com.android.org.conscrypt.OpenSSLSocketImpl
[09/19 13:07:30:339] [5415] Listener: Serve Run parseRequest Exception Write error: ssl=0x94cba400: I/O error during system call, Broken pipe

https://github.com/couchbase/couchbase-lite-java-listener/blob/master/vendor/tjws/src/java/Acme/Serve/Serve.java#L2130

Workaround:
Currently we restart the listener on some request failures.

TODO:
Understand how the client socket behaves during network transitions. Is the server socket stable across network transitions? Maybe Android invalidates all sockets.

Will update more.

Thanks
Nithin

djpongh modified the milestones: 1.4.2, 1.4.x Dec 7, 2017