Ever have one of those days when you feel like you know absolutely nothing? That pretty much sums up the past week for me. I’ve been fighting with a performance problem and have come up empty on all fronts.
We’re in the middle of a project to upgrade one of our financial databases. The database itself is moving from SQL 2005 on Windows 2003 to SQL 2008R2 on Windows 2008 R2. We’re also upgrading the application to the latest version. Due to the project timeline and some delays in hardware delivery, we were forced to build out our QA database on a VM guest. That guest had 8 processors and 32GB of memory assigned. Disk-wise we were sharing space on an old SAN with pretty much every other application in that environment.
The permanent hardware we ordered was designed to replicate what we would eventually have in Production. We built out a 2-node cluster on IBM System x3850 X5 with 40 logical processors (2 10-core) and 128GB of RAM. The database is housed on a brand new EMC VNX 5700. So far, we’re the only application running on this SAN. On paper, this system should fly. It doesn’t. It’s actually slower than the VM guest.
When I was initially made aware of the disparity, I immediately set up Perfmon to start recording stats on the system while I poked around and ran some of my own tests. I used sys.dm_io_virtual_file_stats to check disk latency on both systems. The new server was showing ~70ms latency for the datafiles, while the VM system said ~20ms. That didn’t look good at all, so I ran a series of SQLIO tests on both systems. I expected the SQLIO results to echo what I was seeing in sys.dm_io_virtual_file_stats, but they didn’t. In fact, as you would expect from a new SAN, the MB/sec and IO/sec were about 4 times what the old SAN was capable of with lower latency. So obviously the new SAN was capable of much higher throughput than what we were seeing with SQL Server.
Thinking that the stats might be out of date (even though the new database was a backup/restore of the old one) I ran update statistics on the entire database with fullscan. Unfortunately, this didn’t have any effect on my test query performance.
We have MAXDOP set to 8 on the new server (4 on the VM server). My highest wait was, of course, CXPACKET. The next highest was EXECSYNC. I expect to see CXPACKET as the top wait on databases using parallelism. The EXECSYNC, though, caught my eye. Normally that’s farther down on the wait list and the average wait is much lower. On this new system it’s number 2 and the average wait is ~30 seconds. Yeah, not 30 milliseconds. 30 seconds. Unfortunately, I haven’t been able to find out a whole lot about EXECSYNC waits and what they actually mean.
Knowing that CXPACKET waits sometimes hide an underlying problem, I set MAXDOP to 1 and reran some of my tests. Now the highest wait was PAGEIOLATCH_SH, but the total wait time for this was only 4 seconds in a 15 minute query. The average wait was 4ms. So it doesn’t appear that disk latency is our problem here.
Throughout our tests, CPU utilization never went higher than 40%. Minimal SOS_SCHEDULER_YIELD waits, and the runnable task count in sys.dm_os_schedulers is never > 0.
Our page life expectancy is over 30,000. Buffer Cache hit ratio is 99%. 0 Free List stalls/sec and free pages is ~1.7 million. No memory grants pending, either.
What puzzles me is that parallelism seems to do nothing to help performance. On other systems, like that VM server, running a large query with MAXDOP 1 vs MAXDOP 4 has a huge effect. On this new server, it has very little effect. The query runs in about the same time, the cpu utilization doesn’t change. It does have a detrimental impact on sys.dm_io_virtual_file_stats. When the system is run with MAXDOP 1, latency according to the DMV is under 15ms. When we turn parallelism back on, it jumps to ~70ms. However the Queued IOs counter in EMC Powerpath never goes above 0. So again, we’re not waiting on actual IO. I always thought that sys.dm_io_virtual_file_stats was a true measure of disk latency, but obviously there are other factors that play into that measurement.
So I’m back to the EXECSYNC waits and trying to figure out why they’re so much longer on this system. And trying to figure out why SQL Server doesn’t seem to be fully utilizing the processing power available to it. I’ve started calling this new system my lazy server.
If anyone has any insight, advice, or wild guesses I’d love to hear them.