Random/Intermittent Service Unavailable - IIS7.5

Question 1

We have finally solved this problem. As mentioned previously, we noticed that the IIS logs contained an sc-win32-status 64 error whenever we experienced the Service Unavailable problem in the browser, which happened when (and only when) our site was using the Load Balancer.

To help look into this further, we did a network capture of the traffic on the Load Balancer while testing. We reproduced the random Service Unavailable problem, saw the associated win32-status 64 error in the IIS logs, and identified the specific packet of traffic on the network capture for this event.

Using Wireshark, we followed the TCP stream and noticed that the TCP connection was reset by the Load Balancer immediately after this packet. We reproduced the problem three times and every time there was a TCP reset immediately afterwards.

Walking backwards through the TCP stream, we noticed in all three instances a packet for HTTP/1.1 200 (application/octet-stream) and, prior to that, a request to download a document (e.g. .pdf, .xlsx or .docx) from one of our sites. The server that contains all our documents is not a web server and does not have the IIS role active, so it has no way to define the content/media type for the document being downloaded. Hence the generic (application/octet-stream) packet in the network capture. The Load Balancer treated the request for a document as potentially malicious and reset the TCP connection as soon as another request was made on it. To fix the problem, we added a content type library function to our application using this post as a guide. Sorted!

In Summary:

1.) A document was requested from our document server via our web application
2.) The document was sent back to the user with a generic content type (application/octet-stream)
3.) The Load Balancer flagged this activity as potentially malicious
4.) Another request was made within the same TCP connection
5.) The Load Balancer reset the TCP connection
6.) The user saw a Service Unavailable error

Lesson Learned:

Always define your content/media types if you are serving content from a non-web server or from a web server running an IIS version earlier than 7 (Heaven forbid).
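For anyone wanting a starting point, here is a minimal sketch of a content type lookup of the kind described above. It is written in Python purely for illustration, and `CONTENT_TYPES`, `DEFAULT_CONTENT_TYPE` and `content_type_for` are hypothetical names, not the function from the original fix.

```python
import os

# Hypothetical lookup table; extend it with whatever file types
# your document server actually holds.
CONTENT_TYPES = {
    ".pdf": "application/pdf",
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xls": "application/vnd.ms-excel",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}

# Generic type used only when the extension is unknown.
DEFAULT_CONTENT_TYPE = "application/octet-stream"


def content_type_for(filename):
    """Return a specific content type for filename, or the generic fallback."""
    extension = os.path.splitext(filename)[1].lower()
    return CONTENT_TYPES.get(extension, DEFAULT_CONTENT_TYPE)
```

The idea is to set the returned value as the Content-Type header of the download response, so the generic application/octet-stream type is only sent for file types that are genuinely unknown.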

Question 2

A UC Certificate was originally meant for Microsoft Exchange, but it can also be used to cover multiple domains. We use one and it covers 60+ domains (actually 4 or 5 domains with lots of subdomains). We also apply the certificate to a load balancer and two web servers, and we have multiple sites. As far as I can tell the certificate operates as expected: you can view it from any of the 60+ domains. One odd thing about our setup is that in the IIS UI you can't bind the same certificate to more than one site, so we had to use the appcmd command line interface to bind multiple sites to the same certificate.

Question 3

After looking more closely at our IIS logs, it appears that there is indeed something that coincides with this behavior. The affected entries are logged as 200 0 64, that is, sc-status 200, sc-substatus 0 and sc-win32-status 64: "the specified network name is no longer available".
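To find these entries without scanning the logs by hand, something like the following sketch could work. It assumes the log is in the W3C extended format with a `#Fields:` header line and that sc-win32-status is among the logged fields; the function name `find_win32_status` and the sample log lines are made up for illustration and are not taken from our logs.

```python
def find_win32_status(log_lines, status="64"):
    """Yield one dict per log entry whose sc-win32-status equals status."""
    fields = []
    for line in log_lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # The header names the space-separated columns that follow.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#"):
            # Skip blank lines and other directives such as #Software.
            continue
        entry = dict(zip(fields, line.split()))
        if entry.get("sc-win32-status") == status:
            yield entry


# Hypothetical sample data for illustration only.
SAMPLE_LOG = """\
#Software: Microsoft Internet Information Services 7.5
#Fields: date time cs-method cs-uri-stem sc-status sc-substatus sc-win32-status
2012-01-01 10:00:00 GET /popup.aspx 200 0 0
2012-01-01 10:00:05 GET /popup.aspx 200 0 64
""".splitlines()

matches = list(find_win32_status(SAMPLE_LOG))
for entry in matches:
    print(entry["date"], entry["time"], entry["cs-uri-stem"])
```

Matching the timestamps of the returned entries against the moments users report the error is what lets you tell whether the two really coincide.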

Now, our 2 IIS servers are hosted in the cloud on SunGard, and we are using a load balancer that they set up for us. Our theory was that the load balancer was "losing" the user's session id when this 64 error occurred and then had no idea where the request was supposed to go.

We ran some controlled tests. We took one group OFF the load balancer and sent them directly to one of the servers, while another group used the load balancer but made sure to connect to the same server. Both groups tried to reproduce the error (which is to say we clicked a popup on the site over and over).

The results were interesting. The group that was NOT on the load balancer NEVER received the "Service Unavailable" error! BUT the logs showed 45 instances of the 64 error for them. The group that WAS on the load balancer was able to produce the "Service Unavailable" message twice, and the logs confirmed that there were exactly 2 instances of the 64 error, coinciding with the exact moments that the errors were observed.

So what does this mean?
1.) The load balancer has some setting (sticky sessions?) that isn't keeping sessions pinned correctly, but we can't find the right settings. It's not even our load balancer, it's SunGard's. Does anyone have any advice on these settings for ASP.NET?

2.) 64 errors are a part of web life? We gave more CPU power to one of our virtual IIS servers and received fewer 64 errors. This is all I can come up with. We've sunk too much time and money into trying to solve this, but it appears that I at least have the option of taking people off the load balancer and routing them to one server or the other, and in addition I can beef up the server to handle more traffic and reduce the 64 errors.