A developer came to me last week with a problem. A relatively simple query within his application that normally takes under a second to run, is intermittently taking upwards of 30 seconds. This, in turn, is causing his application to time out. He said he could increase the timeout on his application to work around the problem, but wanted to know if I could help fix it, first.
As I said, the query itself is a very simple one, joining 3 tables, the largest of which only contains about a half a million records. And the query itself only returns about a thousand rows. So I started going through the usual suspects:
The Code: The first thing I did was check out the query itself to ensure the developer wasn’t doing anything silly like applying functions to columns in the where clause, etc. I also saw that he was using literal values, thereby ruling out parameter sniffing.
Blocking: Given the intermittent nature of the problem, blocking initially looked like a viable cause. However, the developer is using nolock on all tables in the query. To make doubly sure, I used the blocked process threshold in conjunction with a server-side trace to track any blocking on the server longer than 10 seconds. There was no blocking at all occurring in the related database.
Statistics/Query plan: Statistics are updated nightly on this database. During the day, the data is static, so stats shouldn’t be the problem. Another server-side trace confirmed that the query was using the exact same query plan for good runs as it did for bad runs.
Waits: That brings me to waits. I decided to use sp_WhoIsActive to try to trap the problem when it happens and see what the process is waiting for. The problem was actually catching it in the act. Up to this point, by the time we were aware of the problem, it was gone. To catch it, I’d need to have the procedure already running in the background, recording data.
My first step was using the following script to create a table to hold the output.
DECLARE @wia_table VARCHAR(4000) ; DECLARE @schema_out VARCHAR(4000) ; SET @wia_table = 'DBA_Temp.dbo.WhoIsActive_' + CONVERT(VARCHAR, GETDATE(), 112) ; EXEC sp_WhoIsActive @get_transaction_info = 1, @get_plans = 1, @get_additional_info = 1, @get_task_info = 2, @RETURN_SCHEMA = 1, @SCHEMA = @schema_out OUTPUT ; SET @schema_out = REPLACE(@schema_out, '<table_name>', @wia_table) ; EXEC(@schema_out) ;
The next part was a script to record the sp_WhoIsActive output for the application login every 10 seconds. Given that the bad executions run over 30 seconds, I figured that every 10 seconds would be sufficient to catch the problem and minimize overhead. The script ends itself at 5pm.
DECLARE @wia_table VARCHAR(4000) , @stoptime DATETIME SET @wia_table = 'DBA_temp.dbo.WhoIsActive_' + CONVERT(VARCHAR, GETDATE(), 112) SET @stoptime = cast(CONVERT(VARCHAR, GETDATE(), 112) + ' 17:00:00' as datetime ) WHILE GETDATE() < @stoptime BEGIN EXEC sp_WhoIsActive @get_transaction_info = 1, @get_plans = 1, @get_additional_info = 1, @get_task_info = 2, @filter = 'my_user', @filter_type = 'login', @destination_table = @wia_table; WAITFOR DELAY '00:00:10' END ; GO
And finally, I created a job to run both of these scripts, scheduling it to run daily at 7am. Between this job, the server-side trace, and the blocking trace, I think I have all my bases covered. The only thing left now is to wait… (Get it? Wait??)
What did you find?
The suspense is killing me…
I wish I had an answer for you, Alex. I left this tracing in place for 2 months and the problem never reared its ugly head. I finally disabled the tracing. If it ever starts happening again I’ll be sure to post a follow-up.