Question

We recently deployed a new ASP.NET site to our web servers to replace our old Classic ASP site (both servers run Windows 2008 R2 with IIS 7.5). They sit behind a load balancer.

This one .NET WebForms application serves approximately 30 clients, each with its own URL (client1.mysite.biz, client2.mysite.biz, etc.).

Our original plan was to deploy our new application into 3 "WebSites", each with its own app pool, and BIND the clients to the relevant website.

When binding, we bound both HTTP and HTTPS for each URL (we have certificates for each of the sites).

INITIAL PROBLEM: We noticed that after we bound more than half the sites and tested, we were suddenly being greeted with "Service Unavailable. Service is Temporarily Unavailable" (NO NUMBER just the words) every time. We unbound everything and tried again (meticulously testing each time we bound a site). Each time after binding a certain number of sites the same thing happened.

We ran out of down time and went to Plan B. We put the whole thing in the "Default Website" as a virtual directory (No bindings) (This is how the Classic ASP site was setup)

OUR PROBLEM NOW: Occasionally we get the same dreaded white screen with "Service Unavailable. Service is Temporarily Unavailable" (NO NUMBER just the words). It seems to happen randomly (not load or time dependent as far as we can tell). When the request is made via AJAX, it simply lands in the "Error" handler of the AJAX code, but I believe it is the same problem. The error occurs INSTANTLY when it does happen. If the user repeats the action that caused the problem, everything is fine (they are not logged out and they proceed on their way).

However this is happening MULTIPLE times a day and it's across ALL of our sites (not just this new one).

One more item of great importance. This appears to be happening to ALL of our sites (Virtual Directories and custom WebSites on BOTH of our web servers). That seems to rule out a "bad" server (both are in the cloud did I mention?) and it also "seems" to rule out App Pool settings but what do I know?

About our IIS servers: We have multiple application pools running multiple different instances of websites (different code). Some are testing sites. Some use Classic ASP and others use ASP.NET.

What we've tried: We scoured the web looking for answers and have edited our machine.config file to increase all manner of things such as "Threads, Max-Connections etc...". We've edited our App Pool settings by increasing our Queue Length and turning on ALL the logs.

Has anyone seen anything like this before? My theory is that it has something to do with the bindings, and that the frequency of the error increases with each binding I add, but that is difficult to test when it happens only on my production servers.

Solution

We have finally solved this problem. As mentioned previously, the Service Unavailable problem appeared in the browser when (and only when) the site was accessed through the Load Balancer, and each occurrence coincided with a sc-win32-status 64 entry in the IIS logs.

To help look into this further, we did a network capture of the traffic on the Load Balancer while testing. We reproduced the random Service Unavailable problem, saw the associated win32-status 64 error in the IIS logs, and identified the specific packet of traffic on the network capture for this event.

Using Wireshark, we followed the TCP stream and noticed that the TCP connection was reset by the Load Balancer immediately after this packet. We reproduced the problem three times and every time there was a TCP reset immediately afterwards.

Walking backwards through the TCP stream, we noticed in all three instances a packet for HTTP/1.1 200 (application/octet-stream) and, prior to that, a request to download a document (e.g. .pdf, .xlsx or .docx) from one of our sites. The server that holds all our documents is not a web server and does not have the IIS role active, so it has no way to define the content/media type for the document being downloaded. Hence the generic (application/octet-stream) packet in the network capture. The Load Balancer treated the document response as potentially malicious and reset the TCP connection as soon as another request was made on it. To fix the problem, we added a content type library function to our application using this post as a guide. Sorted!
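The original fix was written for our ASP.NET application and is not reproduced here. As a rough, language-neutral illustration of the idea, here is a minimal Python sketch that maps a file extension to a specific content type instead of letting the response fall back to application/octet-stream. The table `CONTENT_TYPES` and the functions `content_type_for` and `download_headers` are hypothetical names made up for this example, and the table only covers a few common document types.

```python
import os

# Hypothetical lookup table; extend it with whatever document types you serve.
CONTENT_TYPES = {
    ".pdf": "application/pdf",
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xls": "application/vnd.ms-excel",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}

# Generic fallback, used only when the extension is not in the table.
DEFAULT_CONTENT_TYPE = "application/octet-stream"


def content_type_for(filename):
    """Return the content type for a file name, based on its extension."""
    extension = os.path.splitext(filename)[1].lower()
    return CONTENT_TYPES.get(extension, DEFAULT_CONTENT_TYPE)


def download_headers(filename):
    """Build the response headers for sending a document to the browser."""
    name = os.path.basename(filename)
    return {
        "Content-Type": content_type_for(filename),
        "Content-Disposition": 'attachment; filename="{}"'.format(name),
    }
```

In the web application, the equivalent of `download_headers` would be applied to the response before the document bytes are written, so that the Load Balancer sees a specific content type for each download.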

In Summary:

  1. A document was requested from our document server via our web application
  2. The document was sent back to the user with a generic content type = application/octet-stream
  3. The Load Balancer flagged this activity to be potentially malicious
  4. Another request within this TCP connection was made
  5. The Load Balancer reset the TCP connection
  6. This results in a Service Unavailable

Lesson Learned:

Always define your content/media types if you are serving content from a non web server or a web server running an IIS version less than 7 (Heaven forbid).

OTHER TIPS

A UC Certificate was originally meant for Microsoft Exchange, but it can also be used to cover multiple domains. We use one and it covers 60+ domains (actually 4 or 5 domains with lots of subdomains). We also apply the certificate to a load balancer and two web servers, and we have multiple sites. As far as I can tell the certificate operates as expected: you can view it from any of the 60+ domains. One odd thing about our setup is that the IIS UI won't let you bind the same certificate to more than one site, so we had to use the appcmd command line interface to bind multiple sites to the same certificate.

After looking more closely at our IIS logs, it appears that there is indeed something that coincides with this behavior. The affected requests are logged as 200 0 64, that is sc-status 200, sc-substatus 0 and sc-win32-status 64: "the specified network name is no longer available".

Now our 2 IIS servers are hosted in the cloud on SunGard, and we are using a load balancer that they set up for us. Our theory was that the load balancer "loses" the user's session id when this 64 error occurs and has no idea where the request was supposed to go.
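For anyone who wants to look for the same pattern, here is a minimal Python sketch that scans an IIS W3C log for entries with sc-win32-status 64. It assumes the log contains a `#Fields:` header that includes sc-win32-status; the function name `find_status_64` and the sample lines in `SAMPLE_LOG` are made up for illustration and are not taken from our real logs.

```python
def find_status_64(log_lines):
    """Return the log entries whose sc-win32-status field is 64."""
    fields = []
    hits = []
    for line in log_lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # The header names the columns of the entries that follow it.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            continue
        values = line.split()
        if len(values) != len(fields):
            continue  # skip malformed or truncated lines
        entry = dict(zip(fields, values))
        if entry.get("sc-win32-status") == "64":
            hits.append(entry)
    return hits


# Illustrative sample only; these lines are invented for the example.
SAMPLE_LOG = """#Software: Microsoft Internet Information Services 7.5
#Fields: date time cs-method cs-uri-stem sc-status sc-substatus sc-win32-status
2013-01-01 10:00:00 GET /popup.aspx 200 0 0
2013-01-01 10:00:05 GET /popup.aspx 200 0 64
"""

for entry in find_status_64(SAMPLE_LOG.splitlines()):
    print(entry["date"], entry["time"], entry["cs-uri-stem"])
```

Comparing the timestamps this prints against the moments when users saw the error is how the log entries and the "Service Unavailable" screens can be matched up.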

We ran some controlled tests. We took one group OFF the load balancer and sent them directly to one of the servers, while another group used the load balancer but made sure to connect to the same server. Both groups tried to reproduce the error (which is to say we clicked a popup on the site over and over).

The results were interesting. The group that was NOT on the load balancer NEVER received the "Service Unavailable" error, BUT the logs show the 64 error 45 times for them. The group that WAS on the load balancer produced the "Service Unavailable" message twice, and the logs confirmed exactly 2 instances of the 64 error, coinciding with the exact moments the errors were observed.

So what does this mean?
1.) The load balancer has some setting ("Sticky Sessions"?) that isn't keeping the sessions pinned correctly, but we can't find the right settings. It's not even our load balancer, it's SunGard's. Does anyone have advice on these settings for ASP.NET?

2.) 64 errors are a part of web life? We gave more CPU power to one of our virtual IIS servers and received fewer 64 errors. This is all I can come up with. We've sunk too much time and money into trying to solve this, but it appears that I at least have the option of taking people off the load balancer and routing them to one server or the other, and I can also beef up the servers to handle more traffic and reduce the 64 errors.

Licensed under: CC-BY-SA with attribution