The hard drive is rarely considered the primary cause in bottleneck cases; we tend to suspect the applications installed on the server first.
People often think disk-related performance issues come down to either disk corruption or insufficient disk space, but Physical Disk: % Disk Time and Physical Disk: Current Disk Queue Length are equally important metrics, and they should be read together. There are a few other ways to detect hard drive problems using other metrics, but for now I will focus only on these two performance counters.
Physical Disk: % Disk Time monitors the percentage of time that the disk is busy servicing requests. If it stays above 90%, the disk is struggling to keep up.
Physical Disk: Current Disk Queue Length indicates both the number of requests being served and the number currently waiting for disk access. This number should fluctuate, and should not exceed 1.5 to 2 times the number of spindles [1] that make up the physical disk.
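To make the spindle rule concrete, here is a minimal Python sketch of the arithmetic. The function name `queue_length_limits` is made up for illustration; the only thing taken from the guideline above is the 1.5x to 2x multiplier.

```python
def queue_length_limits(spindles):
    """Return the (lower, upper) guideline for Current Disk Queue Length.

    Applies the rule of thumb of 1.5 to 2 times the number of spindles.
    The function name is hypothetical, not part of any Windows tool.
    """
    if spindles < 1:
        raise ValueError("a physical disk has at least one spindle")
    return 1.5 * spindles, 2.0 * spindles


# Print the guideline for a single disk and for small multi-spindle arrays.
for n in (1, 2, 4):
    low, high = queue_length_limits(n)
    print(f"{n} spindle(s): queue length should not exceed {low:g} to {high:g}")
```

So a single disk should stay under roughly 1.5 to 2 outstanding requests, while a four-spindle array can tolerate roughly 6 to 8.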
Figure 1 shows a healthy hard drive. Notice that the Current Disk Queue Length (green line) is sometimes high, but it’s not an indication of a bottleneck since the % Disk Time (red line) is below 90%.
Figure 2: When % Disk Time stays pinned at a high value, e.g., above 90% (the vertical red line), you must also check the Current Disk Queue Length (red circle). If the queue length exceeds roughly twice the number of spindles (2 for a single spindle, 4 for two), this is a good indication of a bottleneck.
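The way the two counters are read together can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DiskSample` type, the function name, and the `min_fraction` parameter (how much of the sampling window must be "hot" before we call it sustained) are my own inventions; the 90% and 2x-spindles thresholds come from the discussion above.

```python
from dataclasses import dataclass


@dataclass
class DiskSample:
    disk_time_pct: float  # Physical Disk: % Disk Time
    queue_length: float   # Physical Disk: Current Disk Queue Length


def looks_like_bottleneck(samples, spindles=1, busy_pct=90.0, min_fraction=0.8):
    """Flag a likely disk bottleneck from a window of counter samples.

    A sample counts as "hot" only when BOTH conditions hold:
    % Disk Time is above busy_pct AND the queue length is above
    2 x spindles. min_fraction is an assumed cutoff for "sustained".
    """
    if not samples:
        return False
    queue_limit = 2 * spindles
    hot = [
        s for s in samples
        if s.disk_time_pct > busy_pct and s.queue_length > queue_limit
    ]
    return len(hot) / len(samples) >= min_fraction


# Like Figure 1: the queue spikes, but % Disk Time stays under 90%.
healthy = [DiskSample(40, 5), DiskSample(55, 1), DiskSample(30, 6)]
# Like Figure 2: % Disk Time is pinned above 90% and the queue stays long.
stalled = [DiskSample(95, 3), DiskSample(98, 4), DiskSample(93, 5)]

print(looks_like_bottleneck(healthy))
print(looks_like_bottleneck(stalled))
```

Requiring both conditions at once is the point: a long queue on its own (as in Figure 1) is not treated as a problem.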
Solution:
If you confirm that the hard drive is having issues, here are some steps to follow:
1- Run a defrag on the server; it is strongly recommended to do this during off-hours
2- Move some heavily used files and folders, such as log files and the mail queue (if possible), to another physical disk (not just another partition) or to another server
3- Run the command CHKDSK (without /F, so it only reports and does not attempt repairs) to see if you have any disk problems
4- If you are using RAID, make sure that you are using write-back mode
5- Or, finally, get a new hard drive with a higher rpm (10000 or 15000 rpm).
[1] The spindle is a shaft that holds the hard disk assembly and rotates the platter(s) at a speed that ranges from 5400 to 15000 rpm.
Indeed, Al. How many times do Support teams spend hours debugging a case, running through all the applications, only to find the issue is a faulty drive? And worse, the faults occur unpredictably and are very hard to reproduce. Prevention, through regular maintenance of drive health, is the key.
How about a post on those cases?