The Product Aggregate in T-SQL Versus the CLR

Scott Burkow

Rate me:

5.00/5 (13 votes)

22 Feb 2013CPOL20 min read

63.4K

203

An exercise in algorithm analysis and design.

Download sample code - 13.5 KB

Introduction
The Basic Aggregate in the CLR
The Aggregate in T-SQL

The Sample Table
Three T-SQL Solutions

CLR SIMULATION
RECURSION
EXP LOG

Choosing Invariance To Nulls

Invariance: CLR
Invariance: CLR SIMULATION
Invariance: RECURSION
Invariance: EXP LOG
Option: Keyword DISTINCT

Okay (oh no) Looping
A Performance Comparison--and Caveats

Execution Plans: CLR, CLR SIMULATION, and EXP LOG
Execution Plans: RECURSION
RECURSION, Looping, and Safety

Safe Tactic: CTE
Performance Tactic: Table Variable

Grouping

Success with EXP LOG
Aggregate Window Functions: A Simple Fix
Poor Grouping Choices

Accuracy
T-SQL Solutions and Aggregate Functions

A Read-Write Conflict Example
Extending the Module

The Report Card
Finally
Afterword

Introduction

Noticeably absent from the SQL Server built-in aggregate functions is product—the multiplication of numbers in a set of values. Perhaps one reason is because it can so often fail; I demonstrate this with a simple loop, meant to simulate the product over 309 values in a column:

SQL

DECLARE @product float = 1.79, @loopRowPosition int = 1;
while( @loopRowPosition <= 308 ) 
begin
    select @product *= 10, @loopRowPosition += 1;
end

The result is 1.79E+308, the maximum positive number for the float data type. But change the start value to 1.798 and you get an arithmetic overflow. Put another way: multiply small numbers in a small column expression and meltdown.

Still the aggregate is useful in limited situations, and so we’ll develop it the expected way—in the CLR—but then also several ways in T-SQL. The ground rule for T-SQL versions is simple: no loops.

To make the solutions in both environments more flexible, we’ll extend them to support a choice over invariance to nulls.

In our analysis, topics we’ll cover include the following:

How they work with grouping
How they rate as to accuracy, safety, and performance
What happens when we try to make generic aggregate-simulation functions from ad hoc code

This article is as much about good algorithm development as it is about the product aggregate. As our tactics evolve, at any point a wrong turn can harm performance or introduce subtle errors.

The Basic Aggregate in the SQL CLR

Below are the four methods of the product solution in C#.

private SqlDouble product;
private SqlInt16  countOfMultipliers;
        
public void Init()       
{
    product = 1;
    countOfMultipliers = 0;
}

public void Accumulate(SqlDouble multiplier)
{
    product            *= (multiplier.IsNull) ? 1 : multiplier;
    countOfMultipliers += (SqlInt16)((multiplier.IsNull) ? 0 : 1);
}

public void Merge(ProductAggregate mergeProductAggregate)       
{           
    Accumulate(mergeProductAggregate.product);       
}
       
public SqlDouble Terminate()       
{           
    return (0 == countOfMultipliers) ? SqlDouble.Null : product;
}

The Accumulate method is invoked once for each number in the input column and computes the product. The ternary operator, which tests for null values, gives us our invariance to nulls property—i.e. the ability to ignore nulls and return a result over non-null numbers, as do the built-in aggregates.

The Merge method is called when the number set is partitioned and the product computed over multiple threads.

The result is returned in the Terminate method. The data member countOfMultipliers is incremented during accumulation using the same test for nulls done for the product member, allowing it to return null when all input values are null or the input is empty.

C’est simple, ce n’est pas?

The Aggregate in T-SQL

Remember the rule for our T-SQL versions: no loops. We can devise quicker, more elegant solutions.

Here is the second and final rule: the code must always return a value, possibly null, just as the system aggregates do.

The Sample Table

This is the sample table for all code examples. The first five columns are populated from system view sys.messages. Table population is limited to 2000 rows—much more and arithmetic overflow occurs for my sample values. I added four columns, whose meanings are as follows:

multiplier. Our multiplicands of random values in the range 0.214-2.382, scale 15.
groupcol. A computed, persisted column of values either ‘a’, ‘b’ or ‘c’ for group testing.
yearcol. A computed, persisted column of values ‘2001’ – ‘2004’ also for group testing.
ID. An IDENTITY surrogate key that aids the performance of one of our query forms.

Three T-SQL Solutions

The three basic T-SQL solutions follow. I’ll refer to them throughout by the labels given.

CLR Simulation

The first solution is a kind of mirror of the CLR code: where data member product accumulates the result, we’ll employ a scalar variable, also initialized to 1.

SQL

DECLARE @product AS float= 1;
SELECT  @product *= multiplier FROM aggr.T_ProductTest;

Hardly a hat trick—until we consider that the product will be one when aggr.T_ProductTest has no rows (we want a null result). Where the CLR solution solves the problem by setting and later testing a second data member, countOfMultipliers, we’ll substitute an outer join on a derived table, which I prefer over introducing another variable outside the central query:

SQL

select
    @product *= ( prod.multiplier * onz.one )
from
    aggr.T_ProductTest prod 
    RIGHT OUTER JOIN
    (
    select 1 as one
    ) onz
    on
    1 = 1; -- any tautology will do

When the table has rows, each number is multiplied by one. When it doesn’t, prod.multiplier isn’t even a null value for the multiplication, so the action in the SELECT statement cannot be applied. In this case the right outer join forces the SELECT clause to be evaluated over one row and @product becomes null because multiplicand prod.multiplier is now null.

Recursion

Starting with our second technique we make a clean break from the CLR approach. The float variable is not needed, although we still need to accommodate an empty result set:

SQL

WITH cteRecursiveProduct( level, product ) as 
(
    select level = ID, product = multiplier
    from   aggr.T_ProductTest
    where  ID  = ( select max( ID ) from aggr.T_ProductTest )

       
           UNION ALL 


    select level = prodCTE.level - 1, product = ( prodCTE.product * prod.multiplier )
    from   cteRecursiveProduct prodCTE
           inner join 
           aggr.T_ProductTest prod
           on
           prodCTE.level - 1 = prod.ID
)
select product = product * onz.one
from   cteRecursiveProduct
      RIGHT OUTER JOIN
       (
      select 1 as one
      ) onz
      on
      isnull(level, 1) = onz.one
OPTION ( MAXRECURSION 0 );

What is required in many cases is that we put more than 100 frames on the call stack—the maximum allowed by default—and so we allow unlimited frames with the query hint MAXRECURSION 0. In recursion, there is one row returned for each intermediate product for the multiplicands seen so far, so we get the final product at level 1 in the reverse ID order strategy. The join condition on onz is modified to return the level 1 value regardless of whether it exists.

EXP LOG

Our third solution involves an arithmetic trick, but shares with the recursive technique the advantage of being able to be placed wholly within a larger query.

Let f(x) be a function that transforms each multiplier in a column into a common base number and sums their logs:

f(x) = sum( log( multiplier ) )

In this case, SQL Server system function log() uses the number e as the base by default. If the multipliers are 8 and 10, for example, they would be represented as e^2.0794 and e^2.3026 to four decimal places, and f(x) would return 4.382.

Let's extend the composition:

g(f(x)) = e^f(x)

The number e raised to the f(x) can of course be represented as a decimal number, and in our example, e^4.382 = 80 (adjusted for rounding error); system function exp(), which is the inverse of log() (exp(log(x)) = log(exp(x)) = x), does this:

product = exp( sum( log( multiplier ) ) )

This all works because of the laws of exponents (multiplication case):

x^a * x^b * … * xⁿ = x^a+b+…+n

If the explanation is a little dense, don’t worry. What we do need to worry about is finding a negative number or zero in the input column, because the log for these is undefined. Here is the error you get when you try a log(0) or log(-5) operation:

Msg 3623, Level 16, State 1, Line 2
An invalid floating point operation occurred.

We circumvent the problem by adding the nullif() function to substitute nulls for zeros and the abs() function to ensure that all numbers are positive, but we’ll need additional code to get the correct answer—zero whenever zeros occur in the column expression and a negative value when the count of negative multipliers is odd:

SQL

exp( sum( log( nullif( abs( multiplier ), 0 ) ) ) )
*
iif( sum( iif( multiplier = 0, 1, null ) ) > 0, 0, 1 )
*
iif( sum( iif( multiplier < 0, 1, 0 ) ) % 2 = 1, -1, 1 )

Note that unlike the other solutions, EXP LOG doesn’t require the outer join or other strategy to return a null on null input.

Now we’ll explore an option not available to the built-in aggregates.

Choosing Invariance to Nulls

The popular built-in aggregates count, sum, min, max, avg, and the others are invariant to nulls, meaning that having nulls in the column expressions over which they operate does not affect the outcome. The user cannot change this property. We, however, can, and so let’s see how it would be done, starting with the CLR.

CLR

Here is the line in the Accumulate method from the sample code above that does the computation:

SQL

product *= (multiplier.IsNull) ? 1 : multiplier;

The ternary operator throws out nulls from the input, making the implementation invariant to nulls. Remove the operator and the product is null whenever the input has a null value, making it variant to nulls. Because aggregates—system or CLR—don’t expose a parameterized constructor, a single struct cannot give the user the option; simply introduce a second struct.

Set the property IsInvariantToNulls on required attribute class SqlUserDefinedAttribute to true on one and false on the other, keeping in mind that enforcement is up to you.

By contrast, were we to place the T-SQL implementations into aggregate-simulating functions—more on this later—a bit parameter specifying invariance would obviate the need for duplication of code/effort.

You may have realized that the code snippets from the previous section differ in variance, the first and second being variant, and the last invariant (because the log() function itself is invariant to nulls). Let’s start with the code that flips the property for the CLR simulation:

CLR Simulation

SQL

DECLARE @product float = 1;
      
select @product *= ( isnull( prod.multiplier, 1 ) * niladj.adjustor )
from   aggr.T_ProductTest prod
       RIGHT OUTER JOIN
        (   
        select iif( count( * ) > 0, 1, null )     
       from   aggr.T_ProductTest
       where  multiplier is not null           
       ) niladj( adjustor )
       on
       1 = 1  -- any tautology will do
       
select @product;

This rewrite of the derived table is correct over two boundary cases: when aggr.T_ProductTest is empty—as before; and when the multiplier column in all rows has null values. But the code is not optimal because the count() aggregate requires a full (table or index) scan.

SQL

DECLARE @product float = 1, @countOfMultipliers smallint = 0; 

select @product            *= isnull( multiplier, 1 ),
      @countOfMultipliers += iif( multiplier is not null, 1, 0 )
from   aggr.T_ProductTest;


select product = @product * iif( 0 = @countOfMultipliers, null, 1 );

The solution now more closely simulates the CLR implementation. Remove the isnull() function and it becomes an alternate solution for the variant case.

Recursion

Because the recursion strategy doesn't depend upon outside variables, it must use the derived table or equivalent CTE to achieve invariance, and pay the performance penalty.

SQL

select level = ID, product = isnull( multiplier, 1 )...             -- anchor
select ..., product = ( prodCTE.product * isnull( prod.multiplier, 1 ) ) -- recursive
...
select product = product * niladj.adjustor  
from   cteRecursiveProduct                                                           
       RIGHT OUTER JOIN 
      (
       select iif( count( * ) > 0, 1, null )
       from   aggr.T_ProductTest
       where  multiplier is not null
       )
       niladj( adjustor )
      on 
       level = 1

EXP LOG

Flipping the property in the opposite direction for EXP LOG means one more multiplier:

SQL

exp( sum( log( nullif( abs( multiplier ), 0 ) ) ) ) -- no log( <= 0 )
*
iif( sum( iif( multiplier = 0, 1, null ) ) > 0, 0, 1 ) -- 0-adjust
*
iif( sum( iif( multiplier < 0, 1, 0 ) ) % 2 = 1, -1, 1 ) -- neg no adjust
*
iif( sum( iif( multiplier is null, 1, 0 ) ) > 0, null, 1 ) -- null-variant

DISTINCT Keyword

Finally, another important option we could implement is the DISTINCT keyword. I won’t expand on it, but suffice it to say that the CLR would need more effort: e.g., a data member vector could buffer all numbers in Accumulate and the vector could be sorted to bypass duplicates during multiplication in Terminate.

Extensions to the T-SQL strategies vary in complexity and performance. Do you see the error in this code? (Hint: let the column contain values 5 and -5.)

SQL

exp( sum( DISTINCT log( nullif( abs( multiplier ), 0 ) ) ) )

Okay (oh no) Looping

While the looping tactic is verboten, I reference its metrics as the baseline in the section on performance next, so here is one optimal form.

SQL

DECLARE @product float = 1.0, @countOfMultipliers smallint = 0,
        @next_row smallint = 1, @max_row smallint;

select @max_row = max( ID ) from aggr.T_ProductTest;

while( @next_row <= @max_row )
begin
    DECLARE @next_multiplier float;
      
    select @next_multiplier = multiplier
    from   aggr.T_ProductTest
    where  ID = @next_row;
      
    select @product            *= isnull( @next_multiplier, 1 ),
           @countOfMultipliers += iif( @next_multiplier is null, 0, 1 );
      
    SET @next_row += 1;
end
 
select product = @product * iif( @countOfMultipliers = 0, null, 1 );

For your eyes only. Destroy the code after reading.

A Performance Comparison—and Caveats

The test I performed for all solutions in their invariant to null forms was a product over the multiplier column for all 2,000 rows in the aggr.T_ProductTest table. I ran each solution code 100 times at an interval 1/20 second apart to get the logical disk reads and rough averages over CPU and duration from the Profiler.

The average times clearly show that the recursive and looping techniques are not viable. The first pair of numbers for their measurements are the values for the code as was displayed, and the second, for their safer versions, to be explained.

The first cost for recursion is very good, and for looping, exceptional. So why the disconnect between the optimizer's estimates and actual performance?

Execution Plans: CLR, CLR SIMULATION, and EXP LOG

Above is the execution plan for the SQL CLR aggregate. If you add one Compute Scalar operator on each side of the Stream Aggregate, you essentially have the plan for the EXP LOG code; subtract the Stream Aggregate, the CLR SIMULATION. In all cases, the index scan on the clustered primary key is known to the optimizer to return a fixed number of rows--the Estimated Number of Rows = the Actual Number of Rows = 2000--and so it can make accurate estimated costs.

Execution Plans: RECURSION

The graphic above depicts the operators that start one branch of the recursive part in the estimated execution plan followed by those that start in the actual execution plan. Where the other strategies employ a one-pass index scan, recursion and also looping must seek on the same index to get multipliers from successive IDs, once for each recursion/iteration. This accounts for more page touches in looping, and for recursion, seek must re-fetch all the multipliers seen so far plus the current one for each stack frame, so the reads skyrocket.

By visually inspecting the code, we can see that the anchor gets the multiplier at ID 2000 (the last row in the sample table), and each recursion operates at the next lower contiguous ID stopping at ID one for a total of 2000. In fact, in the actual plan, the Actual Number of Rows returned by the seek operator is 1999, as indicated by the much thicker outbound arrow.

But the optimizer can't deduce the row count from the recursive definition, and so it puts in a placeholder value of one for Estimated Number of Rows, as indicated by the thin arrow. It is for this reason that the optimizer cannot give a reasonable estimated cost for recursion or for looping as well. (The estimated branch cost may accurately reflect the effort to get the multipliers at IDs 2000 and 1999 or just one multiplier.)

Recursion, Looping, and Safety

When I introduced the T-SQL solutions, I noted that I added column ID as an int IDENTITY clustered primary key to aid the performance of one of the solutions (the compact natural key is message_id, which otherwise would be clustered). That solution of course is RECURSION (add looping). But for this to work, we must guarantee the following: 1) that the minimum ID is one; and 2) that there are no gaps in the ID sequence, such as those resulting from row deletions and rolled-back transactions. And often this is not the case.

Safe Tactic: CTE

Let’s add a CTE that gives us our contiguous IDs starting at one:

SQL

WITH cteMultiplierRank( rankNo, multiplier ) as 
(
select CAST( ROW_NUMBER( ) OVER( ORDER BY ID ) as int ), multiplier
from   aggr.T_ProductTest
),

The recursive CTE is rewritten to reference this CTE instead of the sample table, and the execution plan shows that this CTE as well as the anchor and recursive parts of the recursive CTE all share the starting operators below. The graphic is the start of the recursive branch of the actual execution plan:

In the estimated plan, each operator for all branches outputs 2,000 rows, but run-time information shows that the recursive branch operators each produced 4,000,000 actual rows—2,000 sample table rows times 2,000. The operators essentially set up 2,000 groups of all (rankNo, multiplier) pairings, and a Filter operator to come applies the recursive condition to determine the set of multipliers to use for each group. The logical disk reads go from 22,010 to 156,147, the estimated cost balloons to an unacceptable (untrustworthy! but still...) 2.3517, and the user experience degrades proportionately. This next attempt fares better.

Performance Tactic: Table Variable

SQL

DECLARE @tblMultiplier TABLE( rowno int IDENTITY PRIMARY KEY, multiplier float NULL );

After we rewrite the recursive CTE to reference the table variable, the execution plan is identical to the original and the cost comes back to a healthy 0.016537. But in practice 85% of the total cost comes from populating the table variable from the sample table, bringing the cost to 0.1077. The second sets of numbers in RECURSION and Looping show the additional cost of this safety.

Safer certainly--but is it safe now? Our revised solutions share with the CLR SIMULATION a potential problem inherent in its not being expressible in one atomic statement—a problem to be addressed in the section T-SQL Solutions and Aggregate Functions.

Grouping

Success with EXP LOG

Built-in and user-defined CLR aggregates can be used in SELECT, HAVING, and ORDER BY clauses. Of our T-SQL solutions, only EXP LOG is a wholly self-contained expression, and so it too can be used in these clauses:

SQL

select   groupcol, yearcol, 
	 product = exp( sum( log( nullif( abs( multiplier ), 0 ) ) ) )
from     aggr.T_ProductTest
group by groupcol, yearcol
having   exp( sum( log( nullif( abs( multiplier ), 0 ) ) ) ) > 0
order by exp( sum( log( nullif( abs( multiplier ), 0 ) ) ) );

I’ve left off the part of the calculation that adjusts for zeros and negative numbers for brevity.

Aggregate Window Functions: A Simple Fix

SQL

product_by_year_exp = sum( exp( sum( log( multiplier ) ) ) )
                      OVER( PARTITION BY yearcol ),
 
product_by_year_clr = sum( aggr.PRODUCT( multiplier ) ) 
                      OVER( PARTITION BY yearcol )

if we add these column expressions to the select list in the query, exp is rejected by the compiler as a window function because it is not an aggregate or other acceptable function type. But sum() is, and we can use it as the outer function to get the intended result. Function aggr.PRODUCT is also an aggregate—it is the local name for the CLR aggregate—but this too is rejected (for an unknown reason), and so we reuse the trick.

Poor Grouping Choices

Neither of our remaining strategies, RECURSION or the CLR SIMULATION, is suitable for grouping. For our sample query, either we would need to know in advance the (groupcol, yearcol) paired values of interest—or employ more code to get the pairings—and windowing makes no sense. In particular, recursion is not a solution for grouping.

The CLR SIMULATION, with its individual variable technique, is marginally better but not necessarily safe. Here products for several years are set in one SELECT clause, invariant to nulls form; grouping is implied in the SELECT clause:

SQL

select
    @product2001 *= iif( '2001' = niladj.yearcol, isnull( multiplier, 1 ), 1 ) * iif( '2001' = niladj.yearcol, niladj.adjustor, 1 ),   
    @product2002 *= iif( '2002' = niladj.yearcol, isnull( multiplier, 1 ), 1 ) * iif( '2002' = niladj.yearcol, niladj.adjustor, 1 ),
    @product2003 *= iif( '2003' = niladj.yearcol, isnull( multiplier, 1 ), 1 ) * iif( '2003' = niladj.yearcol, niladj.adjustor, 1 ),
    @product2004 *= iif( '2004' = niladj.yearcol, isnull( multiplier, 1 ), 1 ) * iif( '2004' = niladj.yearcol, niladj.adjustor, 1 )
from
    aggr.T_ProductTest proTest
    RIGHT OUTER JOIN
    (  
    select niladj.yearcol, cnt_year_non_null.cnt
    from
    (
       select yearcol, iif( count( * ) > 0, 1, null )
       from   aggr.T_ProductTest
       where  multiplier is not null
       group by yearcol
    ) cnt_year_non_null( yearcol, cnt )
    RIGHT OUTER JOIN
    (
       select yearcol, nilAdj = null
       from
       (
       select [2001] = 1, [2002] = 1, [2003] = 1, [2004] = 1
       ) p
       UNPIVOT
      (
       nilAdj FOR yearcol in( [2001], [2002], [2003], [2004] )
       ) as unpvt
    ) niladj( yearcol, adjustor )
    on
    cnt_year_non_null.yearcol = niladj.yearcol
    ) 
    niladj( yearcol, adjustor )
    on
    proTest.yearcol = niladj.yearcol;

Oh myyyyyyyyyy! The derived table must now do its own outer join on an unpivot relational operator or employ a similar strategy (think UNION ALL in the second derived table) to ensure that each year has its own adjustor row (with possibly null adjustor), not just those years having rows in the sample table. Of course, the more optimal second form for invariance should have been used, but any developer may decide against using one @countOfMultiplier variable per year while not thinking of the derived table problem. Increment the failure point column.

Accuracy

All product strategies except one agree that the total non-grouped product for the sample data is 6.04851066640848E-310. The exception is LOG EXP, which evaluates to 6.04851066640616E-310. This is a small difference over a tiny number—clearly a rounding error in exp or log or both. In other testing with very small numbers, including grouping, sometimes it matched exactly with the others and sometimes not. With limited testing over small samples and larger numbers, it always agreed. You make the call.

T-SQL Solutions and Aggregate Functions

An aggregate is a scalar function whose input is a column expression of suitable data type. CLR solutions are certainly that, and though we can put any of our T-SQL solutions into scalar functions, they are certainly not.

SQL

CREATE TYPE aggr.tblMultiplier AS TABLE( multiplier float NULL );

select 
    product_clr = aggr.PRODUCT( multiplier ),
    product_tsql = aggr.sf_PRODUCT( cast( multiplier as aggr.tblMultiplier ) )…

In the above, the compiler recognizes the first function, aggr.PRODUCT, as a user-defined aggregate built from a .NET object, but no sleight of hand can make the compiler accept the second function, aggr.sf_PRODUCT, written in T-SQL, as an aggregate, or allow a column to be cast as a user-defined table type. If we want to use the T-SQL function as a generic aggregate, we must, for each product desired, fill a table variable (of type aggr.tblMultiplier) and set a variable to its return in a separate statement. Aside from being inefficient and inelegant, this opens up the door to problems arising from unrepeatable reads.

A Read-Write Conflict Example

Let’s look at a basic problematic scenario, keeping in mind that the error is less likely to happen with aggregates, system or user-defined in the SQL CLR, because they can always be placed into larger (atomic) statements (and locks are held for the duration of the statement).

SQL

-- T1
begin tran;

DECLARE @tblMultiplier aggr.tblMultiplier, @product float, @count int;

INSERT INTO @tblMultiplier
      
    select multiplier from aggr.T_ProductTest where groupcol = 'a';
 
select @product = aggr.sf_PRODUCT( @tblMultiplier, 1 ); -- 1 is 'invariant to nulls'
 
<context switch to T2: INSERT a row having groupcol value = ‘a’>
 
select @count = count( * ) from aggr.T_ProductTest where groupcol = 'a'; 
...

The non-serializable schedule represents a READ-WRITE transaction conflict, and is demonstrated by the sample code. After T1 reads the rows in aggr.T_ProductTest falling under groupcol ‘a,’ T2 commits a row to the group, making T1’s second read a phantom read (a type of unrepeatable read). Without the context switch, (@product, @count) is <1.3322913590615E-104, 675>, but with it, the values are <3.86364494127789E-104, 676>, making the scalar variable values <1.3322913590615E-104, 676>, out of sync with each other.

To prevent the phantom read, we could up the transaction isolation level to SERIALIZABLE or force serializable access to aggr.T_ProductTest only with an appropriate table hint (TABLOCKX e.g.). But that tactic—pessimistic locking—potentially decreases concurrency and increases the likelihood of deadlocks. For this particular example, it would be better to get the count from the table variable; in practice, subtle errors are made.

Extending the Module

If we persist in the code module strategy, we should optionally make it support grouping as well as the HAVING and ORDER BY clauses. For example, if we group by groupcol and yearcol, we would want the result set in one invocation rather than one for each (groupcol, yearcol) pairing, with possible constraints on the groupcol/yearcol groups. Reducing the number of calls lessens the risk of unrepeatable reads but doesn't eliminate it.

For it to be generic, it must work for grouping over any column list from any table with a numeric column. Another desideratum is that it determine the grouping columns without needing a parameter. You may have noticed: the function has morphed into a stored procedure using dynamic SQL.

Finally, as per the code sample above, it should have a parameter for specifying invariance to null behavior. My solution is in the download in the Product Aggregate Generic Procedure folder.

Without a mechanism to pass columns as parameters, the problems—all puns intended—multiply, and our attempts to make aggregate-simulating functions from some good ad-hoc T-SQL code are for naught.

The Report Card

<p[>If the grading seems arbitrary, think of me as being some of the teachers you had when you were in school.

Finally

The MSDN library demonstrates the CLR SQL Server user-defined aggregates with an example that counts the number of vowels in a column of strings (http://msdn.microsoft.com/en-us/library/91e6taax(v=vs.90).aspx). I've written a T-SQL scalar function that counts the values for one input string; the code is in the download. Following is its invocation that matches the CLR functionality along with performance metrics when run over the text column from our sample table:

SQL

select cntVowels = sum( aggr.sf_CountTheVowels( <some_string_column> ) )...

Function aggr.sf_CountTheVowels approximates the Accumulate method, is easy to write, and doesn’t try to be an aggregate—the sum() is the aggregate--making it safe by our standard. This time the optimizer knows up front to expect 2000 rows from the index scan, but probably because of the logical disk reads involved, gives a better cost to the T-SQL solution even though it runs 10 times slower. As the rule, CLR code gives better performance; in the product aggregate example we may have hit a rare exception.

In a solution I did for a recent client, I needed a product aggregate but the client didn’t want to enable the CLR, so I used my logarithm-based technique. But whether we write in a .NET language or T-SQL, rushing in without forethought could be costly. As we well know.

Afterword

After the article appeared, I verified that the search engines would find it. They do—and they also list another Code Project article (Tip/Trick) that discusses computing the product with logarithms (click here[^] ). The tip, by Dr. Alexander Bell, references an earlier work of his that details his research (ours are independent); read the tip’s referenced article for a second view of the problem.

In that latter article, he discusses the performance-universality dilemma, which means that the more cases handled by a solution, the costlier it is. For EXP LOG that implies addressing nulls, zeros, and negative numbers that may occur in the column expression, which we did. I’ll call the code that doesn’t bare bones.<o:p>

Further analysis verified the added cost: although the query optimizer uses the same execution plan and assigns the same final cost to both the bare bones and full EXP LOG solutions, and the page reads are the same, EXP LOG adds machine cycles. Using the same testing method as described in the performance section, I found that CPU usage went from an average of one millisecond in bare bones to two milliseconds in EXP LOG, and duration jumped tenfold (by more than 10 milliseconds). So know your data, and take what you need.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By

Scott Burkow

United States

Scott is a data architect/database developer specializing in SQL Server. He holds an Ms. C.S. from UCSB engineering, and is MCITP Database Developer in 2005 and 2008.

Based in San Francisco under the incorporated name Ziron Systems, he works with clients to analyze, design, and develop database systems as well as mentor team members and troubleshoot issues.

His passion is working on complex problems, and at the moment, writing articles for advanced practitioners that pose problems and explore solutions.

Reach him at scott.burkow@zironsystems.com or (310)403-1137.

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

The Product Aggregate in T-SQL Versus the CLR

Contents

Introduction

The Basic Aggregate in the SQL CLR

The Aggregate in T-SQL

Three T-SQL Solutions

CLR Simulation

Choosing Invariance to Nulls

CLR

CLR Simulation

Recursion

EXP LOG

DISTINCT Keyword

Okay (oh no) Looping

A Performance Comparison—and Caveats

Recursion, Looping, and Safety

Grouping

Success with EXP LOG

Aggregate Window Functions: A Simple Fix

Poor Grouping Choices

Accuracy

T-SQL Solutions and Aggregate Functions

A Read-Write Conflict Example

Extending the Module

The Report Card

Finally

Afterword

License

Comments and Discussions