OVERachieving Progressive Sums in T-SQL

Emmanouil Gimissis MSc ICSD

Rate me:

5.00/5 (3 votes)

21 Apr 2016Apache3 min read

113

Calculating progressive sums in T-SQL

In this article, you will see how to calculate progressive sums in T-SQL.

Introduction

A problem that a business developer often encounters is the creation of a progressive sum, over a number of cash inflows/outflows by date. By progressive sum, we mean the sum of cash inflows / outflows, such as:

Date	Inflow/Outflow	Amount	Progressive Sum
1/1/2016	Inflow	10000	10000
3/1/2016	Outflow	5000	5000
3/1/2016	Inflow	3000	8000
4/1/2016	Outflow	12000	-4000

Setup

For example, suppose we create the following Moves table in SQL Server:

SQL

/****** Object:  Table [dbo].[Moves]    Script Date: 12/Απρ/2016 6:37:36 μμ ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Moves](
                [RecID] [int] IDENTITY(1,1) NOT NULL,
                [MoveDate] [datetime] NULL,
                [MoveDescr] [nvarchar](2000) NULL,
                [MoveSignID] [int] NULL,
                [MoveProjValue] [decimal](18, 4) NULL,
                [MoveActualValue] [decimal](18, 4) NULL,
                [MoveComments] [nvarchar](2000) NULL,
 CONSTRAINT [PK_Moves] PRIMARY KEY CLUSTERED
(
                [RecID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, _
 IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
/****** Object:  Index [MovesIndex]    Script Date: 12/Απρ/2016 6:37:42 μμ ******/
CREATE NONCLUSTERED INDEX [MovesIndex] ON [dbo].[Moves]
(
                [MoveDate] ASC,
                [MoveSignID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, _
 SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, _
 ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

MoveSignID has the value of 1 for inflow (i.e., cash input, for example, cash paid to us by a customer), 2 for outflows (i.e., cash output, for example, paying a supplier).

Let’s then create some test data using create sample Moves.zip query, which will create a test load of 40.000 records, with random values, in order to test with a realistic load of roughly 6 years data, such as:

RecID    MoveDate      MoveSignID    MoveProjValue
1        2016-04-17         1          6776.8625
2        2016-04-17         1          7399.8839
...
10       2016-04-17         1          7423.4752
11       2016-04-18         1          6769.2806
...
20001    2016-04-17         2          237.0888
20002    2016-04-17         2         7865.3622
...
20011    2016-04-18         2         1348.9914
20012    2016-04-18         2         5248.4128

Processing

Suppose now, that we want to calculate the running total of these data. The most obvious query will be:

SQL

SELECT a.moveDate, a.moveprojValue, CASE WHEN a.moveSignID=1 _
THEN SUM(b.moveprojValue)  ELSE SUM(-b.moveprojValue) END AS RunningTotal
FROM moves a CROSS JOIN moves b
WHERE (b.moveDate <= a.moveDate)
GROUP BY a.moveDate,a.moveprojValue,a.moveSignID
ORDER BY a.moveDate

The query takes roughly 29 seconds to complete on my W2012R2 virtual machine with SQL Server 2014 Express, with the following statistics:

SQL Server parse and compile time:
   CPU time = 0 ms, elapsed time = 0 ms.
 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.
 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.
(40000 row(s) affected)
Table 'Worktable'. Scan count 1, logical reads 804070, 
physical reads 0, read-ahead reads 0, lob logical reads 0, 
lob physical reads 0, lob read-ahead reads 0.
Table 'Workfile'. Scan count 0, logical reads 0, 
physical reads 0, read-ahead reads 0, lob logical reads 0, 
lob physical reads 0, lob read-ahead reads 0.
Table 'Moves'. Scan count 2, logical reads 540, 
physical reads 0, read-ahead reads 0, lob logical reads 0, 
lob physical reads 0, lob read-ahead reads 0.
 SQL Server Execution Times:
   CPU time = 28422 ms,  elapsed time = 29155 ms.

Of course, there are more optimised solutions, like the following:

SQL

SELECT T1.*,
(SELECT SUM(CASE WHEN MoveSignID=1 THEN MoveProjValue ELSE -MoveProjValue END )
FROM Moves T2
WHERE T2.MoveDate <= T1.MoveDate) AS RunningTotal
FROM Moves T1
ORDER BY MoveDate
GO

The query takes roughly 12-13 seconds to complete on the same VM, with the following statistics:

(40000 row(s) affected)
Table 'Worktable'. Scan count 44000, logical reads 449343, 
physical reads 0, read-ahead reads 0, lob logical reads 0, 
lob physical reads 0, lob read-ahead reads 0.
Table 'Moves'. Scan count 2, logical reads 82890, 
physical reads 0, read-ahead reads 0, lob logical reads 0, 
lob physical reads 0, lob read-ahead reads 0.
 SQL Server Execution Times:
   CPU time = 8890 ms,  elapsed time = 13407 ms.

Not very satisfactory, it still takes a noticeable amount of time to produce a result that seems trivial.

Maybe we can try a cursor solution?

For example:

SQL

CREATE TABLE #Moves (MoveDate datetime, Amount decimal(18,4), RunningTotal decimal(18,4))
DECLARE @MoveDate datetime, @SignID int, @Amount decimal(18,4), @RunningTotal decimal(18,4)
SET @RunningTotal = 0
DECLARE rt_cursor CURSOR FOR SELECT MoveDate, MoveSignID, _
MoveProjValue FROM Moves ORDER BY MoveDate,MoveSignID
OPEN rt_cursor
FETCH NEXT FROM rt_cursor INTO @MoveDate, @SignID, @Amount
WHILE @@FETCH_STATUS = 0
BEGIN
IF @SignID=1
                SET @RunningTotal = @RunningTotal + @Amount
ELSE SET @RunningTotal = @RunningTotal - @Amount
INSERT #Moves VALUES (@MoveDate, @Amount,@RunningTotal)
FETCH NEXT FROM rt_cursor INTO @MoveDate, @SignID, @Amount
END
CLOSE rt_cursor
DEALLOCATE rt_cursor
SELECT * FROM #Moves ORDER BY MoveDate
DROP TABLE #Moves

This produces results in 9 seconds, statistics are not clear, since we are doing an insert for every day. However, it’s still not optimum and has a great disadvantage, that it cannot be incorporated in a view, since cursors are only allowed in stored procedures/functions.

If we have access to an SQL 2012 or later, we can try using the OVER clause.

https://msdn.microsoft.com/en-us/library/ms189461.aspx

SQL

SELECT T1.*,
   SUM(CASE WHEN T1.MoveSignID=1 THEN T1.MoveProjValue ELSE -T1.MoveProjValue END )
   OVER(ORDER BY moveDate) AS RunningTotal
FROM Moves T1
order by T1.MoveDate

The OVER operator has been defined in SQL in order to define a window or user specified set of rows, over which to apply an aggregate function such as sum, avg, etc. The query above completes in only 0.6 seconds, nearly 15 times faster than the next best solution, the cursor query and 20 times faster than the obvious subquery solution, presented above. See also the relevant statistics.

SQL Server parse and compile time:
   CPU time = 0 ms, elapsed time = 0 ms.
 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.
 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.
(40000 row(s) affected)

Table 'Worktable'. Scan count 2001, logical reads 174260, 
physical reads 0, read-ahead reads 0, lob logical reads 0, 
lob physical reads 0, lob read-ahead reads 0.

Table 'Moves'. Scan count 1, logical reads 82611, 
physical reads 0, read-ahead reads 0, lob logical reads 0, 
lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 375 ms,  elapsed time = 588 ms.

Additionally, it uses a much more elegant syntax, extremely clear and consise. The main disadvantage is that the syntax used is only available on SQL Server 2012 or later, for example, even SQL 2008 R2 will produce a syntax error. So, either you have to upgrade your SQL Server or use a less efficient solution.

Final Results

To sum up in a table:

Solution        SQL compatibility    CPU time (ms)    Elapsed time (ms)
CROSS JOIN              All              28422            29155
SUBQUERY                All              8890             13407
CURSOR                  All               N/A             ~9000
OVER             SQL 2012 +               375               588

All comments are welcome, hope some people will find it useful!

Using the Code

Just create a blank database, run Moves.SQL, and then create sample moves.SQL.

After that, you can use any sample query provided and especially progressive sum partition order by - only for SQL2012+.SQL.

Points of Interest

SQL Server and TSQL keep improving. However, it still lacks a particular mission critical feature, scaling out...

Copyright [2016] [by E. Gimissis]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.

History

17^th April, 2016: Initial version

License

This article, along with any associated source code and files, is licensed under The Apache License, Version 2.0

Written By

Emmanouil Gimissis MSc ICSD

Technical Lead P GIMISIS SA

Greece

I am a passionate programmer/developer since 1983, especially in distributed and niche applications, using whatever tools suits best for the job. Have vast experience with Microsoft tools and technologies, especially VB (since 1991), SQL Server (since 1997) and .NET (since 2000) but constantly love to learn and develop my skills.
I hold an MSc in Interactive Computer System Design from Loughborough Univ. of Technology UK and various Microsoft certifications.
When I am not developing, I prefer to delve into history, swim/bike and going out.