Better Living Thru Powershell: Update Statistics in Parallel


As part of a project I’m currently involved with, we’re migrating a database to SQL Server 2008 and performing some internal updates to the schema and data. Once the database portion of the upgrade is complete, we’d like to update all database statistics before continuing.

On a large database, updating all statistics with a full scan can take a looooong time. It takes that much longer when you’re waiting for it to finish. To help expedite the process, I started looking for a way to divide the work into parallel threads. The quick and dirty solution would be to use SQL to generate an update statistics statement for each of my tables, manually split that out into separate scripts and run each script in a separate session. Bo-ring.

That approach is fine if you only ever want to do this once and you enjoy spending your time manually performing such mundane tasks. Me on the other hand, I’d prefer to spend several hours designing, scripting, and refining an automated method. Even if I only ever get to use it once, it’s still time well spent in my book.

Ideally, I wanted

  1. the ability to run this against any database on any instance,
  2. minimal footprint; I don’t want to have to create jobs or objects, etc on a target server for this to work,
  3. the number of threads should be dynamic; a server with a lot of horsepower will be able to run more concurrent threads than some dinky dev server.

So I started thinking about tools I could use. As I’ve made clear many times in the past, I’m not a developer. I’m sure if I were, I’d have a number of weapons in my arsenal. I, however, am a lowly DBA, so I need to keep things simple. Some ideas I batted around:

  1. Powershell
  2. Powershell
  3. …Powershell!

I’m pretty sure you can see where I’m going with this.

My general approach was still to use SQL to script out the update stats commands, write the commands out to n separate .sql files (n being the number of threads) and then execute all of the scripts concurrently.  Again, keeping it simple.

Distributing the workload

The first task was to generate the update stats statements and divide them into groups. I used the ROW_NUMBER() command with a modulus operator to generate group numbers.

 declare @p_threads smallint

SET @p_threads = 4

select 'UPDATE STATISTICS '+quotename(s.name, '[')+'.'+quotename(o.name, '[')+' WITH FULLSCAN;' as sqlstmt , ((ROW_NUMBER() OVER (order by s.name, o.name )) % @p_threads) as threadnum
FROM
		sys.objects o
		join sys.schemas s on o.schema_id = s.schema_id and o.type in ('U', 'V')
		group by s.name, o.name
		order by threadnum, o.name 

I really could have stopped there. That was a perfectly good solution. There was just one teensy detail that bothered me. Ideally, the workload should be evenly distributed among all of the threads. I shouldn’t have one thread that finished in a minute and while the rest chugged along for another 2 hours. This query did nothing to achieve any sort of balance.

So I started trying different methods. (Warning: I spent way too much time on this part. Feel free to skim.) I tried using the reserved field in sysindexes to spread out the load by size. Then I decided that that wasn’t perfect because there might be a big table with only a couple stats on it, but a somewhat smaller table with a whole bunch of stats will, overall, take longer to process.

Taking this into consideration, I then switched to using row counts multiplied by the number of stats on the table. I sorted them in descending order based on this new “size” measurement, and striped them across the groups. 0,1,2,3,0,1,2,3, etc. But then I got to thinking that the first few threads will always have the most work to do, because I was distributing work like this, picture each block as a table:

What I really should be doing is striping back and forth, like so:

Yeah, like I said, I spent a lot of time on this. But here’s what I ended up with, incorporated into Powershell:

 ##### Generate update stats statements #####

$connection = new-object system.data.sqlclient.sqlconnection( `
    "Data Source=$p_instance;Initial Catalog=$p_database;User Id=$p_userid; Password=$p_passwd;");
$connection.Open()
$cmd = $connection.CreateCommand()

$query = ";with statsCTE as (
		select 'UPDATE STATISTICS '+quotename(s.name, '[')+'.'+quotename(o.name,'[')+' WITH FULLSCAN;' as sqlstmt
			, (ROW_NUMBER() OVER (order by (max(i.rowcnt)*count(i.id)) desc)-1) % $p_threads as asc_threadnum
			, ($p_threads - 1) - ((ROW_NUMBER() OVER (order by (max(i.rowcnt)*count(i.id)) desc) -1 ) % $p_threads) as desc_threadnum
			, ((ROW_NUMBER() OVER (order by (max(i.rowcnt)*count(i.id)) desc)-1) / $p_threads) %2 as odd_even
		from sysindexes i
		join sys.objects o on o.object_id = i.id and o.type in ('U', 'V')
		join sys.schemas s on s.schema_id = o.schema_id
		group by s.name, o.name
	)
	select s.sqlstmt
		 ,case(s.odd_even)
			when 0 then s.asc_threadnum
			when 1 then s.desc_threadnum
		end as threadnum
	from statsCTE s
	order by threadnum, s.sqlstmt"

$cmd.CommandText = $query
$reader = $cmd.ExecuteReader() 

Creating the sql files

Over-engineer much? Now that that was finally settled, I could move on to writing those statements to their respective files. Just in case I wanted to run this job against multiple databases, I used the instance and database name in each of the sql filenames, along with the thread number, i.e. MyInstance_MyDatabase_updstats_1.sql. For each record I read in, I check to see if the script file it belongs to exists. If it does, I append the current command. If not, I create it and writing the current command.

 ##### Write commands to script files #####

$jobname = "$instance"+"_"+"$p_database"+"_updstats"

$outfile = "$pwd\$jobname"+".log"
$ofile = New-Item -type file $outfile -force
add-content $outfile "$(get-date) : Building SQL Scripts"

while($reader.Read()) {

	$sqlstmt = $reader['sqlstmt']
	$threadnum = $reader['threadnum']

	$statsfile = $jobname+"_"+$threadnum+".sql"

	if (Test-Path $pwd\$statsfile)
	{
		add-content $pwd\$statsfile $sqlstmt
	}
	else
	{
		$file = New-Item -type file $pwd\$statsfile
		add-content $file "SET NOCOUNT ON;"
		add-content $file $sqlstmt
	}

} 

Executing the threads

Now I just needed a mechanism to actually execute all of these scripts. I didn’t know exactly how to do it in Powershell, but I was pretty darn sure it could be done. I commenced to Googling. What I was looking for was something akin to using “nohup some_command &” in UNIX. I found that in the Start-Job command. Start-Job starts a background Powershell job on the local server. I can use it to kick off another process (like a block of update stats commands) and immediately continue with my next task (another block of update stats commands). Perfect.

 ##### Now run the scripts #####

$files = Get-ChildItem $pwd -filter "$jobname*.sql"

foreach( $file in $files)
{

	$file = "$pwd\$file"

	Start-Job -filepath  "$pwd\update-stats.ps1" -ArgumentList @($p_instance, $p_database, $p_userid, $p_passwd, $file, $outfile) -name $jobname

} 

Here’s where my lack of Powershell expertise comes in. No matter what I tried, I could not get it all to work from that single Start-Job command. The best I could do was have Start-Job execute another Powershell script that, in turn, runs the sql script. Not a big deal, it allowed me to add in some logging statements. That update-stats.ps1 script:

 # update-stats.ps1

param(
	[string]$p_instance=$null,
	[string]$p_database=$null,
	[string]$p_userid=$null,
	[string]$p_passwd=$null,
	[string]$p_file=$null,
	[string]$p_output=$null
     )

add-content $p_output "$(get-date) : $p_file starting."
sqlcmd -S $p_instance -U $p_userid -P $p_passwd -d $p_database -i $p_file #-o $outfile

add-content $p_output "$(get-date) : $p_file complete.  Removing SQL file. "
remove-item $p_file -force 

Putting it all together

Ok. I was pretty sure I had everything I needed. My final product:

 # do-UpdateStats.ps1
# usage: ./do-UpdateStats.ps1 <Instance name> <Database name> <Login> <Password> <Threads>

param(
	[string]$p_instance=$null,
	[string]$p_database=$null,
	[string]$p_userid=$null,
	[string]$p_passwd=$null,
	[int]$p_threads=$null
     )

[System.Reflection.Assembly]::LoadWithPartialName('Microsoft.SqlServer.SMO') | out-null

[System.Reflection.Assembly]::LoadWithPartialName('Microsoft.SqlServer.ConnectionInfo') | out-null

##### Separate out the instance name if not the default instance #####

$inst = $p_instance.Split("\")
if ($inst.Length -eq 1)
{
	$instance = $inst[0]
}
else
{
	$instance = $inst[1]
}

##### Generate update stats statements #####

$connection = new-object system.data.sqlclient.sqlconnection( `
    "Data Source=$p_instance;Initial Catalog=$p_database;User Id=$p_userid; Password=$p_passwd;");
$connection.Open()
$cmd = $connection.CreateCommand()

$query = ";with statsCTE as (
		select 'UPDATE STATISTICS '+quotename(s.name, '[')+'.'+quotename(o.name,'[')+' WITH FULLSCAN;' as sqlstmt
			, (ROW_NUMBER() OVER (order by (max(i.rowcnt)*count(i.id)) desc)-1) % $p_threads as asc_threadnum
			, ($p_threads - 1) - ((ROW_NUMBER() OVER (order by (max(i.rowcnt)*count(i.id)) desc) -1 ) % $p_threads) as desc_threadnum
			, ((ROW_NUMBER() OVER (order by (max(i.rowcnt)*count(i.id)) desc)-1) / $p_threads) %2 as odd_even
		from sysindexes i
		join sys.objects o on o.object_id = i.id and o.type in ('U', 'V')
		join sys.schemas s on s.schema_id = o.schema_id
		group by s.name, o.name
	)
	select s.sqlstmt
		 ,case(s.odd_even)
			when 0 then s.asc_threadnum
			when 1 then s.desc_threadnum
		end as threadnum
	from statsCTE s
	order by threadnum, s.sqlstmt"

$cmd.CommandText = $query
$reader = $cmd.ExecuteReader()

##### Write commands to script files #####

$jobname = "$instance"+"_"+"$p_database"+"_updstats"

$outfile = "$pwd\$jobname"+".log"
$ofile = New-Item -type file $outfile -force
add-content $outfile "$(get-date) : Building SQL Scripts"

while($reader.Read()) {

	$sqlstmt = $reader['sqlstmt']
	$threadnum = $reader['threadnum']

	$statsfile = $jobname+"_"+$threadnum+".sql"

	if (Test-Path $pwd\$statsfile)
	{
		add-content $pwd\$statsfile $sqlstmt
	}
	else
	{
		$file = New-Item -type file $pwd\$statsfile
		add-content $file "SET NOCOUNT ON;"
		add-content $file $sqlstmt
	}

}

##### Now run the scripts #####

$files = Get-ChildItem $pwd -filter "$jobname*.sql"

foreach( $file in $files)
{

	$file = "$pwd\$file"

	Start-Job -filepath  "$pwd\update-stats.ps1" -ArgumentList @($p_instance, $p_database, $p_userid, $p_passwd, $file, $outfile) -name $jobname

} 

Minor snag

I tested running it against my project database. Using 8 threads I was able to take a several hour job and complete it in under 90 minutes. It ran like a champ. Except for one thing: I couldn’t schedule it. My threads would start and immediately die. You see, Start-Job isn’t quite like “nohup some_command &”. With nohup, that some_command will continue to run even if your session ends. With Start-Job, your background jobs end when the calling session ends. So my scheduled job would fire off all the background jobs and then quit, thus terminating the background jobs.

The answer: Wait-Job. I’d given each of those jobs a name that started with MyServer_MyDatabase_updstats. Wait-Job would wait for each of those jobs to finish before ending the calling script.

Wait-Job -name $jobname 

Geekdom realized

After all that, I believe I have lived up to the Geek standard.

Source: Bruno Oliveira

Also recommended:


Leave a comment

Your email address will not be published.

14 thoughts on “Better Living Thru Powershell: Update Statistics in Parallel

  • Michael John

    Am I missing something? (Answer: yes, likely something dumb)

    I stumbled upon this script a few weeks back. Just what the doctor ordered!! I was jumping for joy!

    So, I finally get around to testing this, and it doesn’t appear to be doing anything.

    I can see the threads running in Task Manager, I know that I picked the right server, database, etc.

    But when I run this:

    SELECT COUNT(*),
    CONVERT(datetime, CONVERT(varchar(11), STATS_DATE(I.id, I.indid)))
    FROM sys.sysindexes I
    INNER JOIN sys.[tables] T ON I.[id] = T.[object_id]
    INNER JOIN sys.schemas AS S ON T.[schema_id] = S.[schema_id]
    WHERE T.[type] = ‘U’
    GROUP BY CONVERT(datetime, CONVERT(varchar(11), STATS_DATE(I.id, I.indid)))
    ORDER BY CONVERT(datetime, CONVERT(varchar(11), STATS_DATE(I.id, I.indid))) DESC

    or sp_autostats “table name” I do not see the dates changing.

    Got any ideas?

    • Colleen M. Morrow Post author

      Hi Michael – I ran the query you provided on my system and the results definitely changed pre and post-execution. You mention seeing the threads in Task Manager, but can you see them in your target instance, using something like sp_whoisactive?

  • Michael John

    The powershell commands appear in task manager, however, the only connection I see in SQL is the connection that builds the .sql files.

    Also, this line throws an error:
    Start-Job -filepath “$pwd\UpdateStats.ps1” -ArgumentList @($p_instance, $p_database, $p_userid, $p_passwd, $file, $outfile) -name $jobname
    Error:
    Start-Job : Cannot convert value “H:\\._FSSConnect_test_updstats_0.sql” to type “System.Int32”. Error: “Input string was not in a correct format.”

    I added the threads parm and it ran successfully
    Start-Job -filepath “$pwd\UpdateStats.ps1” -ArgumentList @($p_instance, $p_database, $p_userid, $p_passwd, $p_threads, $file, $outfile) -name $jobname

  • Colleen M. Morrow Post author

    Michael – Based on the error and your fix, it doesn’t look like you’re using the 2 scripts in my post, update-stats.ps1 and do-UpdateStats.ps1. Update-stats.ps1 (which is called by do-UpdateStats.ps1) shouldn’t expect or be passed a threads parameter.

  • mbourgon

    Colleen, this is really quite clever, and I think we’re going to look at using this. I do have a couple questions/requests.

    1) Have you considered using the rowmodctr to make sure the stats needed updating?
    1a) Along those lines, you could narrow things down more by using the same algorithm as trace flag 2371. Granted, you’re updating everything so it doesn’t apply, but might be worth investigating.
    1b) Even if not, change from FULLSCAN to a percentage/rownumber based on the size of the table. FULLSCAN won’t work well on my 1tb DB. : )

    2) Would that get all stats? It looks like it would miss any non-index stats, if you’re just using sysindexes and not sys.stats.

    3) One final way to speed it up further: stripe it not just among the stats, but potentially across drives/files/filegroups (not sure if you’d see a boost if all the same files are on the same drive, but you definitely would across drives). Granted, your normal databases probably aren’t doing this, but the databases you’d be likely to use this on probably are.
    4) Are the scripts actually necessary, or could you pass them via variables?

    Thanks!

  • Colleen

    Hi mbourgon –
    1) Since the script is already using sysindexes to help generate the UPDATE STATISTICS commands, it would be very simple to add a where clause limiting it to tables where the rowmodctr > 0. For the application I originally wrote this for, we wanted to include all tables. On the same token, if you wanted to use a percentage or percentage algorithm such as trace 2371 uses, I don’t see why you couldn’t use a CASE statement in place of the ‘WITH FULLSCAN’ literal string. If I have some time later, I’ll play with this. Or if you come up with something in the meantime, please feel free to leave it in the comments.

    2) The script executes UPDATE STATISTICS [tablename] WITH FULLSCAN, so all index and column stats will be updated. And remember, heaps are represented in sysindexes, too, indid = 0.

    3) Can you describe this some more? I currently break out the threads based on rowcount * number of stats, which is how I’m measuring the amount of work involved to update stats on a particular table. If I have 3 filegroups and I specify 10 threads, what algorithm would you use to break out the threads?

    4) If you’ve read my blog enough, you’ll know I’m not a powershell expert. 🙂 I learn enough to do the things I need to do, so this probably isn’t the only way to write this script. But it’s the way I knew how to write it. 🙂

  • mbourgon

    1) Yup. Just figured I’d toss it out there, in case you get to it before I do.
    2) Ah! You’re doing it table-by-table, not stat-by-stat. I thought you were doing it one statistics at a time. Nevermind.
    3) Good question. My initial thought was to use the number of threads to do a OVER/PARTITION that could force each threads to do a different drive letter, at least at the beginning. I’ll have to think that one over some more.
    4) Just subscribed to it today after coming across this article (not sure how I found it). I’m at the hunt-and-peck level of powershell myself, and I’ll have to bug my cohort who knows more. I wasn’t sure if there was a reason you did it that way, something about how PS operates that wouldn’t allow that method that I didn’t know about.

    Anyhow, thanks again!

    BTW – another coworker walked by, saw your chart at the bottom, and asked if that was my blog, given I have the same “geek standard”. : )

  • Melvin

    Have you ever considered publishing an e-book or guest authoring
    on other websites? I have a blog based on the same information you discuss and would love to have you share some stories/information.
    I know my visitors would enjoy your work. If you are even remotely interested, feel free to shoot me an e mail.