Question

My company is using an Azure Service Bus Relay to aggregate summaries of sensitive data into an Azure-hosted application. On a pre-production server we have noticed that, after the first few requests are processed, CPU utilization by the process hosting the ServiceHost instance jumps to 70-90% and stays there. The ServiceHost is typically self-hosted in a Windows service, but we also have a WPF app that hosts it for various setup and testing scenarios, and we can reproduce the behavior in both. We have not been able to reproduce it in our development environment.

I've reviewed the code and compared it to the samples on MSDN, and to me they look equivalent. Here is the condensed version:

// Let the Service Bus client choose between TCP and HTTP connectivity.
ServiceBusEnvironment.SystemConnectivity.Mode = ConnectivityMode.AutoDetect;
this.serviceBusUri = ...;
// Shared-secret credentials for the relay.
TransportClientEndpointBehavior sharedSecretServiceBusCredential = new TransportClientEndpointBehavior();
sharedSecretServiceBusCredential.TokenProvider = TokenProvider.CreateSharedSecretTokenProvider(...,...);
ContractDescription contractDescription = ContractDescription.GetContract(typeof(IOurServiceProxy), typeof(OurServiceProxy));
NetTcpRelayBinding binding = new NetTcpRelayBinding(EndToEndSecurityMode.Transport, RelayClientAuthenticationType.RelayAccessToken, true);
binding.ConnectionMode = TcpRelayConnectionMode.Relayed;
this.serviceEndpoint = new ServiceEndpoint(contractDescription);
this.serviceEndpoint.Address = new EndpointAddress(this.serviceBusUri);
this.serviceEndpoint.Binding = binding;
this.serviceEndpoint.Behaviors.Add(sharedSecretServiceBusCredential);
this.host = new ServiceHost(typeof(OurServiceProxy), this.serviceBusUri);
this.host.Description.Endpoints.Add(this.serviceEndpoint);
// Subscribe before Open() so that a fault raised while opening is not missed.
this.host.Faulted += OnFaulted;
this.host.Open();

We never see the OnFaulted event handler triggered, and requests continue to be processed after the CPU jumps. The WPF version of the host app has a button that disconnects from the service bus by calling this.host.Close(); as soon as it disconnects, the CPU goes back to idle.

I've attached a trace listener, but the only messages relate to the auto-detection of SystemConnectivity.Mode when the ServiceHost starts up. The fault location in the stack is a descendant of a call to Microsoft.ServiceBus.NetworkDetector.DetectInternalConnectivityModeForAutoDetect(Uri uri). The fault itself is caught by the Microsoft.ServiceBus layers and never bubbles up to my company's code. The specific exception message captured by the trace was

Could not connect to net.tcp://[name_redacted].servicebus.windows.net:9350/. The connection attempt lasted for a time span of 00:00:01.1856021. TCP error code 10061: No connection could be made because the target machine actively refused it [ip_redacted]:9350.

And here are the settings I used for the trace:

   <system.diagnostics>
      <sources>
         <source name="System.ServiceModel"
                 switchValue="Warning, Error, Critical"
                 propagateActivity="true">
            <listeners>
               <add name="traceListener"
                    type="System.Diagnostics.XmlWriterTraceListener"
                    initializeData="C:\Temp\Traces.svclog" />
            </listeners>
         </source>
      </sources>
   </system.diagnostics>
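
The refused connection on port 9350 reported in the trace can be checked independently of the Service Bus client with a plain TCP probe. The following is a minimal sketch, not part of the original code: the PortProbe class and its TryConnect method are hypothetical names, the host name must be supplied as an argument, and probing port 443 as well is an assumption about which port the HTTP-based connectivity mode would use.

```csharp
using System;
using System.Net.Sockets;

public static class PortProbe
{
    // Returns true if a TCP connection to host:port is established within the timeout.
    public static bool TryConnect(string host, int port, TimeSpan timeout)
    {
        using (var client = new TcpClient())
        {
            try
            {
                var pending = client.ConnectAsync(host, port);
                // Wait returns false on timeout and throws AggregateException
                // if the connection attempt itself failed (for example, error 10061).
                return pending.Wait(timeout) && client.Connected;
            }
            catch (AggregateException) { return false; }
            catch (SocketException) { return false; }
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length < 1)
        {
            Console.WriteLine("usage: PortProbe <namespace>.servicebus.windows.net");
            return;
        }

        // 9350 is the port named in the trace; 443 is an assumed fallback port.
        foreach (int port in new[] { 9350, 443 })
        {
            bool ok = TryConnect(args[0], port, TimeSpan.FromSeconds(5));
            Console.WriteLine("{0}:{1} {2}", args[0], port, ok ? "reachable" : "refused or timed out");
        }
    }
}
```

If the probe also fails on 9350 from the affected server, the traced exception is probably just the auto-detection step finding that direct TCP is blocked, rather than a fault in the running host.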

Next I tried to analyze which threads were consuming all of the CPU. I started with a memory dump of the process, but decided that a single snapshot couldn't tell me enough about what was going on over time, so I found Sam Saffron's blog post about CPU analysis for a production .NET application. We grabbed the latest version of the source for cpu-analyzer and ran it on the server in question. All of the most expensive stacks had System.Threading._IOCompletionCallback.PerformIOCompletionCallback at the base. My understanding was that no Service Bus calls came into the process during the capture, so I am not sure what these threads would have been doing.
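
A lighter-weight way to see which threads are burning CPU over an interval, without capturing stacks, is to take two snapshots of per-thread processor time and compare them. The sketch below is an illustration only and is not how cpu-analyzer works: ThreadCpuSampler, Snapshot, and Busiest are hypothetical names, and the five-second interval is an arbitrary choice. It relies on System.Diagnostics.ProcessThread.TotalProcessorTime and reports operating-system thread IDs, which then have to be matched against a dump or debugger to find the corresponding stacks.

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Diagnostics;
using System.Linq;
using System.Threading;

public static class ThreadCpuSampler
{
    // Records the total CPU time consumed so far by each thread of the process.
    public static Dictionary<int, TimeSpan> Snapshot(Process process)
    {
        var result = new Dictionary<int, TimeSpan>();
        process.Refresh();
        foreach (ProcessThread thread in process.Threads)
        {
            try { result[thread.Id] = thread.TotalProcessorTime; }
            catch (InvalidOperationException) { } // thread exited during enumeration
            catch (Win32Exception) { }            // access denied for this thread
        }
        return result;
    }

    // Returns the threads present in both snapshots, ordered by CPU time used in between.
    public static List<KeyValuePair<int, TimeSpan>> Busiest(
        IDictionary<int, TimeSpan> before, IDictionary<int, TimeSpan> after, int top)
    {
        return after
            .Where(kv => before.ContainsKey(kv.Key))
            .Select(kv => new KeyValuePair<int, TimeSpan>(kv.Key, kv.Value - before[kv.Key]))
            .OrderByDescending(kv => kv.Value)
            .Take(top)
            .ToList();
    }

    public static void Main(string[] args)
    {
        if (args.Length < 1)
        {
            Console.WriteLine("usage: ThreadCpuSampler <pid>");
            return;
        }

        using (Process process = Process.GetProcessById(int.Parse(args[0])))
        {
            var before = Snapshot(process);
            Thread.Sleep(TimeSpan.FromSeconds(5)); // arbitrary sampling interval
            var after = Snapshot(process);

            foreach (var entry in Busiest(before, after, 10))
            {
                Console.WriteLine("thread {0}: {1:F0} ms CPU", entry.Key, entry.Value.TotalMilliseconds);
            }
        }
    }
}
```

Running this against the host process while it sits at 70-90% CPU should show whether the load comes from a handful of I/O completion threads or is spread across many.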

Our next step is to run a perfmon capture on the server and see whether anything obvious stands out in the results. I do not have direct access to the server, so I need to schedule time with a SysAdmin to do any hands-on analysis.

Does anyone have ideas as to what might be causing this hidden CPU spike? Is anything in either Azure Service Bus Relay or WCF known to cause this behavior? Any suggestions would be greatly appreciated.


Solution

It turns out that the high CPU is triggered by an unexpected FIN/ACK packet. We suspect that the firewall is what actually sends it, in an attempt to close the external connection. We were able to recreate the issue on other devices simply by injecting the rogue FIN/ACK packet.

We are following up with the Microsoft Azure team to get the unexpected packet handled more gracefully. We will also follow up with the network firewall team to isolate the packet and stop it from being sent.

Licensed under: CC-BY-SA with attribution