Alois Kraus


Many developers mainly know their application frameworks, which lets them rapidly deliver business driven applications. This is a good thing, but there are still things like the OS, CPU, memory and the network which are critical to application performance.

The network is often viewed as a bidirectional connection between two computers.

 

image

 

This is all well until things start to fail. The network topology can then quickly change from two arrows to a black hole which tends to suck up all information and never gives it back.

Side Note: Black holes are not entirely black and will eventually fully disclose their stored information as an exploding white hole or a dim radiation originating from quantum effects where e.g. a virtual electron positron pair is separated by the gravitational force. One particle lands in the black hole and the other one has no choice but to become a real particle since there is no anti partner available anymore. No one knows whether this effect is real, but it is an interesting idea. Search for Hawking radiation if you want to know more.

Sudden hangs in network connections are exceptionally hard to analyze since they involve two computers and the magic between them.

 

image

In that case you need to check both computers and the network in one run to fully analyze the issue. To do that you need to use WPT (Windows Performance Toolkit) on both machines and Wireshark on one machine to capture all in/out packets. It is difficult to correlate both ETW traces with the Wireshark capture.

Things become a lot easier if you have a remote interactive application like an online game where the server reacts to user input. In that case you can simply capture all mouse/keyboard events, save them to ETW locally, send them over the wire on a dedicated port and save them to ETW on the remote machine as well. Then it is much easier to correlate network traces with ETW events because you have the keyboard/mouse events as markers in both your ETW and your network event stream. I had some time to play with this idea and I tried to learn something new during Christmas. In my case it was WPF, so I created a small tool to put this idea to work.

Currently it looks like this.

image

Here I can capture mouse and keyboard events and save them to ETW and send them over the network, so they are recorded on both machines. At the same time I can start ETW traces on both machines.

image

I think this is a powerful capability to correlate network and ETW events captured from different machines. You perhaps noticed the Slow button. It defines a hot key which logs an additional Slow ETW event into the stream to mark an interesting condition, e.g. a small hang. That makes it easy to correlate perceived user experience with a good marker event in the Wireshark and ETW traces. You can apply this concept to distributed systems in general if you know which action on one machine triggers operations on one or more remote machines. If you log the start requests into your ETW and network stream (e.g. over a diagnostics port, unencrypted to make reading easier) there is no room for dark spots. The downside of course is that this type of analysis triples the amount of work since you need to check two machines and one network trace. Reading TCP traces is not the most natural thing for a developer but it is an interesting undertaking.
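To make the marker idea more concrete, here is a minimal sketch in Python. It is not the code of the WPF tool. The port number, the JSON record layout and all function names are assumptions made for illustration, and a plain list stands in for the local ETW log because the Python standard library has no ETW writer. The point is only that the very same bytes are logged locally and sent unencrypted over a dedicated port, so they are easy to find again in a Wireshark capture.

```python
import json
import socket
import time

MARKER_PORT = 47001  # hypothetical diagnostics port, not taken from the original tool


def make_marker(seq, kind, detail):
    """Encode one input event as a single plain-text JSON line."""
    record = {"seq": seq, "kind": kind, "detail": detail, "sent_at": time.time()}
    return (json.dumps(record) + "\n").encode("utf-8")


def send_marker(sock, address, payload, local_log):
    """Log the marker locally and send the identical bytes over the wire."""
    local_log.append(payload)      # stand-in for writing a local ETW event
    sock.sendto(payload, address)  # unencrypted, so it is readable in a packet capture


def receive_marker(sock, remote_log):
    """Receive one marker on the remote side and log it there as well."""
    payload, _sender = sock.recvfrom(4096)
    remote_log.append(payload)     # stand-in for writing a remote ETW event
    return json.loads(payload)


def demo_roundtrip():
    """Send one 'Slow' marker over the loopback device and return both logs."""
    local_log, remote_log = [], []
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        receiver.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port for the demo
        receiver.settimeout(2.0)
        send_marker(sender, receiver.getsockname(), make_marker(1, "hotkey", "Slow"), local_log)
        receive_marker(receiver, remote_log)
    finally:
        sender.close()
        receiver.close()
    return local_log, remote_log
```

The sequence number is the important part: it lets you match a marker in the local trace, the network capture and the remote trace even when the clocks of both machines are not in sync.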

If you want to know more about TCP, HTTP, SPDY, … you should have a look at High Performance Browser Networking, which explains things from the bits up to the high level HTTP protocol and shows where you can tune things. Since the author is a performance engineer at Google he knows what he is talking about.

More details are the topic of a future post when I publish the code on CodePlex.

But back to the original topic of network hangs and the Silly Window Syndrome (SWS). If you are e.g. sending every keystroke over the network and you are a fast typist, you can trigger the SWS since a lot of very small packets are going over the wire. In that case SWS avoidance algorithms kick in. They delay the communication on purpose to force the receiver or the sender to generate bigger packets so the TCP window size can be increased from 0 bytes to a reasonable value. On Windows a Silly Window Syndrome avoidance timer is used for this. The receiver of a small packet needs to ACK the small packet. The server will wait up to 5s for the next packet when the client does not respond to the packet with an ACK and an updated, bigger TCP receive window size.
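A quick back-of-the-envelope calculation shows why one packet per keystroke is so wasteful. The sketch below assumes IPv4 and TCP headers of 20 bytes each without options and ignores the Ethernet frame, so real numbers are somewhat worse. The function names are made up for this example.

```python
IPV4_HEADER_BYTES = 20  # minimum IPv4 header, no options
TCP_HEADER_BYTES = 20   # minimum TCP header, no options


def payload_efficiency(payload_bytes):
    """Fraction of the bytes on the wire that is actual payload."""
    return payload_bytes / (payload_bytes + IPV4_HEADER_BYTES + TCP_HEADER_BYTES)


def wire_bytes(payload_bytes, segments):
    """Total IP bytes needed to ship the payload split over the given number of segments."""
    return payload_bytes + segments * (IPV4_HEADER_BYTES + TCP_HEADER_BYTES)


# 100 keystrokes of one byte each, sent one per segment or all in one segment
one_per_segment = wire_bytes(100, segments=100)  # 4100 bytes
single_segment = wire_bytes(100, segments=1)     # 140 bytes
```

With one byte per segment only about 2.4% of the transferred bytes are payload. This is the inefficiency the avoidance algorithms try to prevent by holding data back.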

This is all deep network stuff. Why should I care on my local machine and the loopback device? Well, it happens that there is a bug in Windows 7 which sometimes causes your network connection to stall for 5s. The issue is that the SWS timer is started AFTER the packet has been sent. If the ACK is received before the timer has been started, the ACK cannot cancel the timer. The sender then waits the full five seconds for an ACK which will never come before it sends the next packet.
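The race can be written down as a tiny model. This is only a sketch of the behavior as described in this post, not of the real Windows TCP stack. The function name and the example timings are assumptions. All times are measured in seconds from the moment the packet was sent.

```python
SWS_TIMEOUT_S = 5.0  # the 5s avoidance timer described above


def stall_seconds(ack_arrival_s, timer_armed_s, timeout_s=SWS_TIMEOUT_S):
    """Extra delay before the next packet is sent, under the simplified model.

    ack_arrival_s: when the ACK for the small packet is processed
    timer_armed_s: when the SWS avoidance timer is actually started
    """
    if ack_arrival_s < timer_armed_s:
        # The ACK was processed before the timer existed. Nothing is left to
        # cancel the timer, so it runs until it expires.
        return timeout_s
    # Normal case: the timer is already armed and the ACK cancels it.
    return 0.0


# Hypothetical timings: the loopback ACK arrives after 10 microseconds,
# the timer is armed after 50 microseconds.
loopback_stall = stall_seconds(ack_arrival_s=0.00001, timer_armed_s=0.00005)

# Hypothetical timings for a slower path where the ACK needs 200 microseconds.
slower_path_stall = stall_seconds(ack_arrival_s=0.0002, timer_armed_s=0.00005)
```

In this model the loopback case stalls for the full 5s while the slower path does not stall at all. The faster the ACK comes back, the more likely you are to lose the race, which is why the loopback device is hit the hardest.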

This has been known since 2011 but it took many support requests to MS until a hotfix for Windows 7 and Server 2008 R2 was released in July 2013. Windows is used in many mission critical soft real time systems where inter process communication via sockets can break at any time for 5s. This issue has certainly caused much pain in the wild and very few people will have found the root cause of intermittent loopback device hangs.

If you have software communicating over the loopback device you should really get this hotfix to get rid of these sporadic errors. A workaround is to plug in a real loopback network cable, which gives you enough delay so that the timer is started before the ACK is received. I am quite sure that this issue is the source of many failed UI automation tests where the UI usually reacted to user input within the expected time frame (e.g. 5s) but sometimes took much longer. It could be a good idea to add this fix to your test machines as well.

How does this issue look in reality? Let's have a look at some busy processes communicating with each other:

clip_image001

The region should be red, but the color you see is how WPA shows a highlighted region. All processes are busy but the whole machine hangs for 5s. There is one innocent looking event named TcpSWSAvoidanceEnd which signals that it has done its full 5s wait and has fixed the non existing SWS issue. The TcpSwsAvoidanceBegin events are quite common in this scenario. Out of 200K packets, 40K are TcpSwsAvoidanceBegin events. Each of them spins up a 5s timer which is cancelled when the ACK from the client is received. Due to the heavy network load the race condition has a good chance to happen and it will happen from time to time.

Today is my side tracking day. I have another network hang which is not fully understood so far. If you enable Automatic Proxy Detection in Internet Explorer, your HTTP requests are routed by a JavaScript file downloaded from a proxy server. The mechanism behind this is the Web Proxy Autodiscovery Protocol (WPAD). It uses your fully qualified computer name as the starting point to find the server which hosts the HTTP URL routing script. Your computer has e.g. the name

myComputer512.department.SuperDepartment.Company.com

The browser will try to download the proxy redirection JavaScript from the following URLs:

WPAD.department.SuperDepartment.Company.com/wpad.dat

WPAD.SuperDepartment.Company.com/wpad.dat

WPAD.Company.com/wpad.dat

WPAD.com/wpad.dat
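The candidate list above can be derived mechanically from the fully qualified computer name. The following Python sketch follows the list in this post: it drops the host name and then strips one leading domain label at a time. The function name is made up, wpad.dat is the script name conventionally used by WPAD, and real clients may stop before they reach the top level domain.

```python
def wpad_candidate_urls(fqdn, script_name="wpad.dat"):
    """Return the WPAD URLs derived from a fully qualified computer name."""
    labels = fqdn.lower().rstrip(".").split(".")
    domain_labels = labels[1:]  # drop the host name itself
    urls = []
    while domain_labels:
        urls.append("http://wpad." + ".".join(domain_labels) + "/" + script_name)
        domain_labels = domain_labels[1:]  # walk one level up the DNS hierarchy
    return urls


candidates = wpad_candidate_urls("myComputer512.department.SuperDepartment.Company.com")
# candidates is now:
# http://wpad.department.superdepartment.company.com/wpad.dat
# http://wpad.superdepartment.company.com/wpad.dat
# http://wpad.company.com/wpad.dat
# http://wpad.com/wpad.dat
```

The last entries are the interesting ones. They no longer belong to your own organization, so whoever owns that domain decides what the script does.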

If you have enabled this it can happen that some HTTP requests take a long time for no apparent reason. The fix is to disable automatic proxy detection and configure the proxy manually. This is a good thing anyway since, for some reason only known to the NSA, a wpad.de server exists. This means that if your company has no WPAD server configured, you are downloading JavaScript code from a server someone else has control over, regardless of the web page you are trying to access …

Freaky.

posted on Wednesday, January 8, 2014 10:32 AM