Diagnosing a disk I/O bottleneck?

https://stackoverflow.com/questions/17665246

03-06-2022
|

Domanda

I have a .NET service installed on a Windows 2008 server R2 running in a VDI with 6 Xeon cores. CPU utilization for this service was averaging around 10-20% during the time this issue occurred, although there were DB timeouts occurring intermittently.

My app is really the only thing on the box and CPU usage leads me to believe the following is not a CPU bottleneck or time/slicing issue.

We use log4net for logging

<appender name="LogFileAppender" type="log4net.Appender.RollingFileAppender">
      <File value="C:\app\service.log" />
      <AppendToFile value="true" />
      <rollingStyle value="Date" />
      <datePattern value="yyyyMMdd" />
      <layout type="log4net.Layout.PatternLayout">
        <param name="ConversionPattern" value="%date{yyyy-MM-dd hh:mm:ss:fff} %-5level %logger - %message%newline" />
      </layout>
    </appender>

I have the following code with the actual long running pieces removed

DateTime mst = DateTime.Now;

//...really long running code

DateTime st = DateTime.Now;

//...more long running code

DateTime et = DateTime.Now;
LogHelper.LogMessage("REGISTRATION: " + subscriber.Principal.Identity.Name + " packaging complete in " + (et - st).TotalMilliseconds + "ms");

et = DateTime.Now;
LogHelper.LogMessage("REGISTRATION: " + subscriber.Principal.Identity.Name + " Registration complete in " + (et - mst).TotalMilliseconds + "ms");

normally I get the following log entries

2013-07-15 07:00:02:238 DEBUG (T:118)Register - REGISTRATION: CORP\user packaging complete in 15.6001ms
2013-07-15 07:00:02:238 DEBUG (T:118)Register - REGISTRATION: CORP\user Registration complete in 1294.8083ms

but today I got the following

2013-07-15 11:35:02:498 DEBUG (T:70)Register - REGISTRATION: CORP\user packaging complete in 12589.2807ms
2013-07-15 11:35:04:386 DEBUG (T:70)Register - REGISTRATION: CORP\user Registration complete in 56768.7639ms

In looking at the time stamps, the two log entries are written out 1888ms apart, longer than the entire duration of normal execution.

How can I measure whether this is actually caused by a disk I/O bottleneck?
What else could explain the 1888 milliseconds it took between log entries where normally I get 0 ?

I'm considering implementing this solution to help eliminate I/O as a possible cause.

Soluzione

We ended up using Red-Gate's ANTS profile to identify a memory leak during a 'soft restart' of the service (dispose/reallocate objects). Normal flow of operation had no issues, flat memory profile, clean bill of health.

Only after we introduced a timer to simulate a series of unexpected issues that would result in soft restarts were we able to identify an increasing number of lingering objects and reproduce the issue resulting in massive memory usage and horrible amounts of page faults with deteriorating performance.

My theory is that all this paging to disk impacted both our ability to write to the log file and also to allocate new objects.

Autorizzato sotto: CC-BY-SA insieme a attribuzione

Non affiliato a StackOverflow