Skip to content

0300 Intermittent or Periodic Network Issue

Malcolm Stewart edited this page Oct 27, 2020 · 12 revisions

0300 Intermittent or Periodic Network Issue

in progress

The network is mostly stable, but there are occasional issues with reliability (intermittent) or issues may happen at about the same time each day, week, or month (periodic).

0300.1 Typical Error Messages

  • Communication Link Failure
  • Connection timeout expired
  • General Network Error
  • Transport-level Error
  • The specified network name is unavailable
  • Semaphore timeout
  • The wait operation timed out
  • A fatal error occurred while reading from the input stream
  • Protocol error in TDS stream
  • The server was not found or was not accessible
  • SQL Server does not exist or access denied (this can also be an authentication error)

In the context of the Workflow, the word "Client" refers to the immediate client to SQL Server, e.g. in a 3-tier application, the client could be a web server.

0300.2 Moving Parts

The initial goal is to try to isolate which of the moving parts is causing the problem. High level isolation will be Client, Server, or Network.

Moving Parts

0300.3 Appropriate Expectations

0300.3.1 This issue may take a while to resolve depending on the frequency that the problem occurs, whether it happens to one of many clients vs. all clients vs. application server, and whether it happens more often at set times of day, e.g. busy periods or during backups or reindexing.

0300.3.2 The most common issues are related to packet drop due to Antivirus, Network optimizations, older network drivers, bad router or switch, and not pooling connections in the application. Some causes, such as Antivirus can be difficult to prove, but are common nonetheless. You may have to uninstall and reboot the computer to prove this without hard proof. Creating an exception may work, also. But turning the AV off generally won't work because the network filter drivers are still being loaded even if they aren't being monitored.

0300.3.3 For intermittent connections to non-Microsoft databases, such as Oracle, DB/2, or MySQL, you may need to engage the appropriate vendor. Microsoft can only provide limited support in this area, such as substituting a different application or driver for the vendor's one and comparing results. Do not ask Microsoft to debug a vendor's product as that may reveal proprietary information. The vendor should be the first point of contact for issues with their product, and can work together with Microsoft on a case, should that be necessary.

0300.3.4 This workflow is designed for SQL Server client/server connections. Other communications, such as SQL Server Mirroring, Always-On, and Service Broker synchronization traffic over port 5022, are not addressed.

0300.4 Pre-Work

Please perform the initial data collection and narrowing steps: 0100 Initial Data Collection and Scoping Questions. This will help get a macro perspective of the scope of an issue, such as whether the issue affects multiple computers or just one, or whether only those computers in a specific data center are facing issues. This can help focus the troubleshooting steps. It will also make you prepared for discussing the issue with Microsoft Support should you choose to do so.

Review the public troubleshooting documents listed in section 0015 Self-Help Articles.

0300.5 Order of Troubleshooting

In general, troubleshooting should be data driven, which may give way to empirical tests in a more focused context. If the issue is very intermittent and network traces will be difficult to capture, then the empirical methods may be applied first.

0300.5.1 Collect Network Traces on Client and Server Collect a Network Trace

0300.5.2 Run SQL Server Network Analyzer SQL Network Analyzer (SQLNA) SQL Network Analyzer UI (SQLNAUI)

Process client and server traces separately. If you have chained traces, process these at the same time if you have enough memory (files equal to max 80% of computer memory).

This will generate a report of suspected problems and a CSV file you can explore in Excel for alternate research.

Try to locate matched conversations in the client trace and the server trace. Generally, the IP addresses and port #s match. However, if the connections go through any sort of Network Address Translation or port mapping, it may be a bit more difficult and you may have to line up using IPV4 packet IDs and comparing payloads.

Examine how the conversations end in NETMON or WireShark. Do the client and server agree on the same thing or do they tell a different story?

0300.5.3 Connection Closed During SSL Handshake In the ServerHello packet, if the Cipher Suite used is a Diffie-Hellman suite (TLS_DHE...) and traffic is between Windows 2012 or earlier and Windows 2016 or later, then there was a change in this algorithm starting with Windows 2016 security patches. You should disable this group of cipher suites.
https://docs.microsoft.com/en-us/troubleshoot/windows-server/identity/apps-forcibly-closed-tls-connection-errors

If the connection is closed after the ClientHello, then check if there is a TLS 1.0/TLS 1.2 mismatch between client and server. If they are the same check the enabled Cipher Suites on both machines and enabled hashes.

0300.5.4 Dropped Packets Look at the tail end of the matched conversations. If one has a number of retransmitted packets (or 10 Keep-Alive packets 1-second apart) followed by an ACK+RESET and the other does not or one reports a timely response and the other sees it delayed and closes or resets the conversation, this is indicative of a network device having a problem and dropping or delaying packets. You may also see the client report indicate the server reset the conversation and the server report indicate the client reset the conversation. This is due to a bad switch or router closing the connection from the middle, and they can sometimes be configured to do so if they detect the connection has been idle for a while - often ignoring Keep-Alive packets.

0300.5.5 Both the Server Trace and the Client Trace agree the issue is on the client If both traces show a delay or non-response on the client, or if the client issues an ACK+RESET after acknowledging a server response or otherwise closes the connection early part way through the login sequence, then you will want to take a BID trace and a NETSH trace on the client to see inside the TCP/IP stack and what the driver is thinking. This is common if the antivirus or other network filter driver is delaying the packet being received or the reply being sent. Connection timeouts could also be due to a slow DNS response or slow security API that was called before the initial SYN packet is sent over the wire.

Check the Ephemeral Ports report and make sure the client is not running out of outbound ports.

If the client has a long delay before sending the SYN packet, you may see a pattern showing just the TCP 3-way opening handshake followed immediately, or sometimes after sending the PreLogin packet, by an ACK+FIN originating from the client.

Suggested NETSH command for second level isolation. Note this has a circular buffer so must be terminated immediately after the error occurs.

NETSH TRACE START SCENARIO=NETCONNECTION CAPTURE=YES TRACEFILE=c:\temp\mycap.etl FILEMODE=CIRCULAR MAXSIZE=2048 PACKETTRUNCATEBYTES=180

Suggested BID Trace commands can be found in the Cookbook section. Remember to run the NETSH trace and the BID trace at the same time.

0300.5.6 Both the SQL trace and the Client trace agree the issue is on the server If both issues show a delay or non-response on the server, or if the server closes the connection at an unexpected point in the login sequence, or if the server closes a number of connections at the same time, then this indicates there are some problems on the server. The most likely cause is poor server performance or high MAXDOP and large, parallel queries and blocking. These can cause thread starvation, preventing a login request being handled in a timely manner, especially if you see a lot of connection timeouts end at the same time and the LoginAck column says 'Late'. The SQL ERRORLOG file may show IO operations taking longer than 15 seconds, which is another indicator of performance issues. In the network trace, you may also see a number of conversations in the Reset report of 6 frames or fewer, indicating perhaps the TCP 3-way handshake may not have completed.

Clone this wiki locally