Capacity Planning and Linear Regression

Raymond Gallagher

Principal Business Intelligence Engineer at Aspen Tech.

When using Idera SQL Diagnostic Manager to monitor your SQL Server environment, you’ll probably find yourself using their built in reports and dashboards for capacity planning. Although they are good, I found myself building some additional reports to better answer the questions I was after. Specifically, when will a drive fail due to an out of space condition, and how much space should I add?

In a given environment, you may have hundreds of SQL Servers to monitor, each with their own set of drives and databases. Performing capacity planning exercises can be overwhelming in this case, and an automated process is needed. Below I will walk through the process I developed and provide the code to create your own solution. The code uses linear regression to calculate expected values in the future based on information already gathered. It also calculates correlation coefficients to better estimate how accurate or not accurate your regression is against your current data. Finally, it summarizes the information in tables and e-mails a report to you on a chosen schedule, highlighting drives that need attention. Let’s get started!

Firstly, Idera SQL Diagnostic Manager is required, as the code uses data gathered from the monitoring tool. You’ll also need to make sure you have data gathered for some period of time.

CreateTable_CapacityPlanning

We’ll need to create the table to store the final results. I use a utility database called DBAManagement and I’ve created a scheme for this process called Idera.

CREATE TABLE [Idera].[CapacityPlanning]
    (
      [ServerName] [NVARCHAR](50) NULL ,
      [DriveLetter] [NVARCHAR](1) NULL ,
      [CurrentDriveSizeGB] [DECIMAL](10, 2) NULL ,
      [ThresholdDate] [DATETIME] NULL ,
      [OneYearGrowthGB] [DECIMAL](10, 2) NULL ,
      [CurrentDate] [DATETIME] NULL ,
      [CorrelationCoefficient] [DECIMAL](10, 2) NULL ,
      [RSquared] [DECIMAL](10, 2) NULL ,
      [EstimateError] [DECIMAL](10, 2) NULL
    )
ON  [PRIMARY];

GO

sp_PopulateCapacityPlanning

The stored procedure to populate the table takes two parameters. The first parameter is the ServerID as it is stored in the Idera database. The second parameter is the drive letter. Both of these parameters will be provided in the code within the SQL Agent job that we’ll look at next.

Besides declaring some variables, the first thing the procedure does is create a temporary table to store results and insert data from the DiskUsageForecast table in the SQLDMRepository database, which is the Idera backend database for SQL Diagnostic Manager. By default I only load data from January 1, 2015 onward, but this can be changed. The variable @getdate is by default GETDATE().

--Create a temporary table to store the results
CREATE TABLE #DiskUsageForecast
    (
      x INT IDENTITY(1, 1) ,
      Records INT ,
      LastCollectionInterval DATETIME ,
      DriveName NVARCHAR(256) ,
      y DECIMAL(12, 2) ,
      TotalSizeGB DECIMAL(12, 2) ,
      DividedSizeGB AS y / TotalSizeGB
    );

--Insert data from the DiskUsageForecast table in Idera
INSERT  INTO #DiskUsageForecast
        EXEC SQLDMRepository.dbo.p_DiskUsageForecast @ServerID = @p_ServerID,
            @UTCStart = '2015-01-01 00:00:00', @UTCEnd = @getdate,
            @UTCOffset = -240, @Interval = 2, @DriveName = @p_DriveLetter;

Often times you will make some changes to the drive, whether it be to add space, decrease the database size or move the database to another drive. For our purposes, I made it simple and calculated the last negative jump between values. We only want to focus on upward trends for the model, or it will have a skewed result.

--Calculate the difference for values between two rows so we can figure out where
--the last negative jump is (i.e. focus only on the upward trends)
SELECT  * ,
        ( LEAD(DividedSizeGB, 1, 0) OVER ( ORDER BY x ) ) - DividedSizeGB AS DividedSizeGB_Difference
INTO    #DiskUsageForecast_Difference
FROM    #DiskUsageForecast
ORDER BY x;

--Return the top two rows with a negative value, indicating where the upward trend ends
--The first row will be discarded
SELECT TOP 2
        *
INTO    #DiskUsageForecast_Final
FROM    #DiskUsageForecast_Difference
WHERE   DividedSizeGB_Difference < 0
ORDER BY x DESC;

--There will always be a first row, so if a negative value is found in the second row,
--find the offset and delete up to that offset
IF ( @@ROWCOUNT = 2 )
    BEGIN
        SELECT TOP 1
                @offsetx = ( x + 1 )
        FROM    #DiskUsageForecast_Final
        ORDER BY x;

        DELETE  FROM #DiskUsageForecast
        WHERE   x < @offsetx;
    END;

Next we need to prepare for the linear regression calculation by declaring needed variables and calculating averages and sums for the x and y values. Here x is your datetime and y is your value. The full script is linked below, so I won’t include all the variable declares here.

--Calculate the average and the sums for the x and y values
SELECT  @n = COUNT(*) ,
        @average_x = AVG(x) ,
        @average_y = AVG(y) ,
        @sumX = SUM(x) ,
        @sumY = SUM(y) ,
        @sumXX = SUM(x * x) ,
        @sumYY = SUM(y * y) ,
        @sumXY = SUM(x * y)
FROM    #DiskUsageForecast;

For those of you who want to do some heavy reading, this is the link to the Wikipedia article on simple linear regression, which is what we are using. For everyone else, the calculations are below. The slope value (@m) and the y-intercept (@b) are calculated using the @n variable (the count of all values in the dataset) along with the x and y values.

The correlation coefficient, r-squared and error estimation values are calculated as well. In the calculations, we have to avoid divide by zero errors if they exist across the datasets, and that is the purpose of the IF statement. Calculations will be skipped if the slope, @Sxx or @Syy values are 0.

--Calculate the linear regression and the correlation coefficients
SELECT  @m = ( ( @n * SUM(x * y) ) - ( SUM(x) * SUM(y) ) ) / NULLIF(( ( @n
                                                              * SUM(POWER(x, 2)) )
                                                              - POWER(SUM(x),
                                                              2) ), 0) ,
        @b = AVG(y) - ( ( @n * SUM(x * y) ) - ( SUM(x) * SUM(y) ) )
        / NULLIF(( ( @n * SUM(POWER(x, 2)) ) - POWER(SUM(x), 2) ), 0) * AVG(x)
FROM    #DiskUsageForecast;

SET @Sxx = @sumXX - ( @sumX * @sumX ) / @n;
SET @Syy = @sumYY - ( @sumY * @sumY ) / @n;
SET @Sxy = @sumXY - ( @sumX * @sumY ) / @n;

IF ( @m != 0
     AND @Sxx != 0
     AND @Syy != 0
   )
    BEGIN
        SET @cc = @Sxy / SQRT(@Sxx * @Syy);
        SET @r_squared = ( @b * @sumY + @m * @sumXY - POWER(@sumY, 2) / @n )
            / ( @sumYY - POWER(@sumY, 2) / @n );

        SELECT  @estimateerror = SQRT(SUM(POWER(y - ( @m * x + @b ), 2)) / @n)
        FROM    #DiskUsageForecast;
    END;

With the calculations done, we have some final steps to get the data ready for reporting. First we need to populate #DiskUsageForecast for the last year. I picked a year’s worth of data as an arbitrary value. You can change it if you’d like. Second, we need to select the instance name for the ServerID, the threshold date (the data when the drive will run out of space) and the current drive size. Third, we need to calculate the growth needed for a year’s worth of data.

--Populate #DiskUsageForecast for the last year of data
WHILE ( @count <= 366 )
    BEGIN
        INSERT  INTO #DiskUsageForecast
                SELECT  0 ,
                        DATEADD(d, @count, @lastdate) ,
                        @p_DriveLetter ,
                        0 ,
                        @lasttotalsizegb;
        SET @count = @count + 1;
    END;

--Get the instance name for the ServerID
SELECT  @instancename = InstanceName
FROM    SQLDMRepository.dbo.MonitoredSQLServers
WHERE   SQLServerID = @p_ServerID;

--Get the threshold date and current drive size
SELECT TOP 1
        @thresholddate = LastCollectionInterval ,
        @currentdrivesize = TotalSizeGB
FROM    #DiskUsageForecast
WHERE   ( TotalSizeGB - ( @m * x + @b ) ) < 0
ORDER BY ( TotalSizeGB - ( @m * x + @b ) ) DESC;

--Calculate the linear regression growth year
SELECT TOP 1
        @growthyear = ( @m * x + @b )
FROM    #DiskUsageForecast
ORDER BY x DESC;

--Select the growth year
SELECT  @growthfinal = ( @growthyear - ( @m * x + @b ) )
FROM    #DiskUsageForecast
WHERE   LastCollectionInterval = @lastdate;

--Insert values gathered into the Idera.CapacityPlanning table
INSERT  INTO Idera.CapacityPlanning
VALUES  ( @instancename, @p_DriveLetter, @currentdrivesize, @thresholddate,
          @growthfinal, CONVERT(DATE, GETDATE()), @cc, @r_squared,
          @estimateerror );

All of this data will be inserted into the CapacityPlanning table. One row will be created per combination of server and drive. You will end up with:

The instance name
The drive letter
The current drive size
The date when the drive will run out of space (i.e. where the linear regression line reaches 0)
The growth needed for a year’s worth of space
The date the analysis was completed
The correlation coefficient, r squared and error estimation values for the analysis

Now let’s take a look at the SQL agent job that will be used to call this stored procedure.

CreateAgentJob_PopulateCapacityPlanning

The agent job contains three steps. It is scheduled to run weekly and you can adjust the schedule as you need. The first step is to restore the SQLDMRepository database to your working SQL instance. This is not required if your utility database is on the same instance as the SQLDMRepository database, and your code is located on the same instance.

The second step is to loop through all your servers and drives stored and perform the analysis. For this step, I use a cursor. I select the SQLServerID and the DriveName from the DiskDrives table where the drive is not a C, D or E drive. Again this can be changed if you need. Using the cursor, I call the sp_PopulatedCapacityPlanning stored procedure for each combination.

Since I’m only interested in the current analysis, I truncate the CapacityPlanning table before each run.

USE DBAManagement

TRUNCATE TABLE DBAManagement.Idera.CapacityPlanning

DECLARE @SQLServerID NVARCHAR(5)
DECLARE @DriveName NVARCHAR(1)
DECLARE @sqltext NVARCHAR(4000)

DECLARE vcursor CURSOR
FOR
    SELECT DISTINCT
            SQLServerID ,
            DriveName
    FROM    SQLDMRepository.dbo.DiskDrives
    WHERE   DriveName NOT IN ( 'C', 'D', 'E' )
    ORDER BY SQLServerID ,
            DriveName

OPEN vcursor
FETCH NEXT FROM vcursor INTO @SQLServerID, @DriveName

WHILE @@FETCH_STATUS = 0
    BEGIN
        SET @sqltext = 'EXEC DBAManagement.Idera.sp_PopulateCapacityPlanning @p_ServerID = '
            + @SQLServerID + ', @p_DriveLetter = ' + @DriveName + ''
        EXEC sp_executesql @sqltext
        FETCH NEXT FROM vcursor INTO @SQLServerID, @DriveName
    END

CLOSE vcursor
DEALLOCATE vcursor

For the final step, I use the results in the CapacityPlanning table and send an HTML formatted e-mail using Database Mail to the desired e-mail account. You will need to have Database Mail set up and configured on your SQL Server instance.

Although I store the S drive in the table, I don’t report it. This can be changed if needed.

DECLARE @tableHTML NVARCHAR(MAX);

SET @tableHTML = N'<H1>Capacity Planning</H1>'
    + N'<table border="1" cellpadding="1">'
    + N'<tr><th>ServerName</th><th>DriveLetter</th><th>CurrentDriveSizeGB</th>'
    + N'<th>ThresholdDate</th><th>OneYearGrowthGB</th>'
    + N'<th>CurrentDate</th><th>Corr.Coeff.</th></tr
    + CAST(( SELECT td = ServerName ,
                    td = DriveLetter ,
        td = CurrentDriveSizeGB ,
                    td = ThresholdDate ,
                    td = OneYearGrowthGB ,
                    td = CurrentDate ,
                    td = CorrelationCoefficient
             FROM   DBAManagement.Idera.CapacityPlanning
             WHERE  ThresholdDate IS NOT NULL
             AND DriveLetter != 'S'
             AND OneYearGrowthGB > 0
             ORDER BY ThresholdDate
             FOR XML RAW('tr'), ELEMENTS
           ) AS NVARCHAR(MAX)) + N'</table>';

EXEC msdb.dbo.sp_send_dbmail @profile_name = 'xxxxx',
    @recipients = 'xxxxx', @body_format = 'HTML',
    @subject = 'Capacity Planning', @body = @tableHTML;',
    @database_name = N'master

sp_QueryCapacityPlanning

If you analyze the results in the report and want to look deeper into the data, sp_QueryCapacityPlanning will come in handy. This code is very similar, but will instead output the data in Management Studio so you can look at the numbers and import to Excel, to create charts, for example.

Most of the time I want to just look at the numbers and see what’s going on with the data. If I have a very slow growing drive and then all of a sudden there’s a huge jump, I’ll be able to see that. The other common case is a drive will increase very rapidly, but then be flat for a period of time. This case will give the indication that there is a point in the future when the drive will run out, but since the growth is recently flat, it never will.

The procedure takes two parameters. The first parameter is the server name and the second parameter is the drive letter. Note these parameters are case sensitive. In the future if I make them non-case sensitive, I’ll post the updated versions here. Below is the code to return information from #DiskUsageForecast to the user. Other than the code to return information, the nuts and bolts of the procedure are the same as the previous one.

--Return information from #DiskUsageForecast
SELECT  x AS ID ,
        Records ,
        LastCollectionInterval ,
        DriveName ,
        y AS TotalUsedGB ,
        TotalSizeGB ,
        ( TotalSizeGB - ( @m * x + @b ) ) AS RemainingGB ,
        DividedSizeGB
FROM    #DiskUsageForecast
ORDER BY LastCollectionInterval;

CreateTable_CapacityPlanningOverProvisioned and sp_PopulateCapacityPlanningOverProvisioned

This is a separate table and stored procedure from the above, but very similar. The difference here is this calculates drives that are overprovisioned instead of focusing on drives that will run out of space.

Here is the code to create the table.

CREATE TABLE [Idera].[CapacityPlanningOverProvisioned]
    (
      [ServerName] [NVARCHAR](50) NULL ,
      [DriveLetter] [NVARCHAR](1) NULL ,
      [ThresholdDate] [DATETIME] NULL ,
      [DriveSize] [DECIMAL](10, 2) NULL ,
      [CurrentDate] [DATETIME] NULL ,
      [CorrelationCoefficient] [DECIMAL](10, 2) NULL ,
      [RSquared] [DECIMAL](10, 2) NULL ,
      [EstimateError] [DECIMAL](10, 2) NULL
    )
ON  [PRIMARY];

GO

The stored procedure inputs records into the CapacityPlanningOverProvisioned table as before, but this time the threshold date represents the date in the future (and it can be way in the future) that the drive will run out. The CapacityPlanning table will NULL this value if it is too far out. This will allow you to see drives that have a threshold date of 15, 20 and even 30 years in the future. Those are good candidates for review.

That about wraps it up. The code for the stored procedures, table and SQL Agent jobs are available here.

May 11, 2018

Subscribe to this blog

You can also find Raymond Gallagher on GitHub and LinkedIn. Get in touch by Email.

Capacity Planning and Linear Regression

Share this article with friends