This article is related to the article I wrote a while back on parameter sniffing. We recently had another problem with a production application that started to experience poor performance. The C# web application using a SQL 2005 database starting timing out. The timeout was set to 30 seconds. I wasn’t seeing deadlocks on the database and we had some blocking but it wasn’t extreme. So this problem lasted over about two weeks. Here are the steps taken to troubleshoot.
1) Began running SQL Profiler. I started tracing only RPC calls using a filter on the trace for duration > 30000 (30 seconds). My first thoughts were I better figure out which procs are taking the time. The trace showed one particular proc which was executed many times and always took alot of time on the DB. This procedure was also blocking others. We spent some time tuning this procedure and testing it and pushed it into production.
2) The tuned procedure reduced the timeouts from around 150 a day to about 50 a day. A huge improvement but still not a great user experience. I changed my trace for duration > 10,000 and monitored it for many days. I started seeing that one proc to fetch information used for a drop down list was taking more and more time each day. If I took the call from the trace and executed it from SSMS the query would execute in 1 second. If I ran the same call from OSQL on a different machine it would take 30 seconds. The application timeout setting was also increased to 60 seconds.
3) Next I checked my indexes for fragmentation and rebuilt them old style. I mean I dropped and recreated them I didn’t use DBCC commands.
4) I increased the number of times update statistics with full scan from once a week to twice a week.
We still had issues with this one proc and it is now causing the majority of the timeouts. When I looked at the query plan from SSMS in the GUI format it’s doing two index seeks. But when I look at the plan generated from OSQL it’s doing an index scan and generating over 3 million reads on a table that has only about 400K rows in it.
5) At this point an advisory case with Microsoft was opened. My gut was telling me that it has to be related to the query plan but I couldn’t understand why SSMS is picking an efficient plan and OSQL is not.
I checked the connection settings for the database properties coming from my application and I compared these to the settings used by SSMS and OSQL. They are all the same. To check the login set options you use profiler to trace the Audit:Login event. Microsoft told me that these settings can affect your query plan.
6) Next thing to do is start looking at the trace when I execute the stored procedure call from SSMS and OSQL. What I found partially answered my question as to why the procedure call executes quickly from SSMS. I traced all the events under Stored Procedures for my spid and I found that whenever I execute the proc from SSMS it was doing a cache miss. AKA not finding it in the procedure cache. Ok Why? When I do a call from OSQL it’s also doing a cache miss but then picking an inefficient plan. Well at least I can recreate the issue….
7) Next thing is to start looking at is the plans that are actually in the cache. You can do this by using the Dynamic management views. Create the function and view below for an easy look into the cache.
create function SqlAndPlan(@handle varbinary(max))
returns table
as
return select sql.text, cp.usecounts,cp.cacheobjtype,
cp.objtype, cp.size_in_bytes,
qp.query_plan
from
sys.dm_exec_sql_text(@handle) as sql cross join
sys.dm_exec_query_plan(@handle) as qp
join sys.dm_exec_cached_plans as cp
on cp.plan_handle = @handle;
create view PlanCache
as
select sp.* from sys.dm_exec_cached_plans as cp
cross apply SqlAndPlan(cp.plan_handle) as sp
8 ) When viewing the results you will get the XMLversion of your query plan. If you click on the results you can open it and save it as .sqlplan and when you open it again it will be in the graphical format. However the XML Plan gives you views into some things that the graphical plan does not. Some of this information is extremely useful. The thing I personally thought is the greatest is that I can actually see what parameters are used for my procedure execution when the plan is stored in the cache. You can always reverse engineer your procedure calls from this. To find the parameter values search the xml plan for parameter list. You will see something similar to the following:
<ParameterList>ColumnReference Column=”@Default” ParameterCompiledValue=”(1)” />
ColumnReference Column=”@Role” ParameterCompiledValue=”(8)” />
ColumnReference Column=”@Page” ParameterCompiledValue=”Start” />
ColumnReference Column=”@user” ParameterCompiledValue=”(1)” />
…. n
From this information I can see if my plan is an atypical call. For example if 90% of my users have role 1 and the plan cached is for role 8 perhaps you should remove this plan from the cache and replace it with a more efficient one. This happened to be the situation in my case. To change the value in the cache you need to recompile the stored proc and execute it with different set of parameters then verify that is stored in your cache. Another useful bit of information in the cache is the compile time. The proc which was causing us issues took .001 seconds to compile the plan. We decided with the advice of Microsoft to add the recompile option to this proc to ensure the correct plan is always chosen. The over head in this case was .001 seconds. I don’t recommend adding this option all the time and this should be carefully weighed based on the stored procedure and other tuning options available.