## Contents

## Introduction

Noticeably absent from the SQL Server built-in aggregate
functions is `product`

—the
multiplication of numbers in a set of values. Perhaps one reason is because it can so often fail; I demonstrate this
with a simple loop, meant to simulate the product over 309 values in a column:

DECLARE @product float = 1.79, @loopRowPosition int = 1;
while( @loopRowPosition <= 308 )
begin
select @product *= 10, @loopRowPosition += 1;
end

The result is 1.79E+308, the maximum positive number for the
float data type. But change the
start value to 1.798 and you get an arithmetic overflow. Put another way: multiply small numbers in a small column expression and
meltdown.

Still the aggregate is
useful in limited situations, and so we’ll develop it the expected way—in the
CLR—but then also several ways in T-SQL. The ground rule for T-SQL versions is simple: no loops.

To make the solutions in both environments more flexible,
we’ll extend them to support a choice over *
invariance to nulls*.

In our analysis, topics we’ll cover include
the following:

- How they work with grouping
- How they rate as to accuracy, safety, and
performance
- What happens when we try to make generic
aggregate-simulation functions from ad hoc code

This article is as much about good algorithm
development as it is about the product aggregate. As our tactics evolve, at any point a wrong turn
can harm performance or introduce subtle errors.

## The Basic Aggregate in the SQL CLR

Below are the four methods of the product solution in C#.

private SqlDouble product;
private SqlInt16 countOfMultipliers;
public void Init()
{
product = 1;
countOfMultipliers = 0;
}
public void Accumulate(SqlDouble multiplier)
{
product *= (multiplier.IsNull) ? 1 : multiplier;
countOfMultipliers += (SqlInt16)((multiplier.IsNull) ? 0 : 1);
}
public void Merge(ProductAggregate mergeProductAggregate)
{
Accumulate(mergeProductAggregate.product);
}
public SqlDouble Terminate()
{
return (0 == countOfMultipliers) ? SqlDouble.Null : product;
}

The `Accumulate`

method is invoked once for each number in
the input column and computes the product. The ternary operator, which tests for null values, gives us our
*invariance to nulls* property—i.e. the
ability to ignore nulls and return a result over non-null numbers, as do the
built-in aggregates.

The `Merge`

method is called when the number set is
partitioned and the product computed over multiple threads.

The result is returned in the `Terminate`

method.
The data member
`countOfMultipliers`

is incremented during accumulation using the same
test for nulls done for the `product`

member, allowing it to return null when all input values are null or the input
is empty.

C’est simple, ce n’est pas?

## The Aggregate in T-SQL

Remember the rule for our T-SQL versions: no loops. We can devise quicker, more elegant solutions.

Here is the second and final rule: the code must always
return a value, possibly null, just as the system aggregates do.

The Sample Table

This is the sample table for all code examples. The first five columns are populated from system view
*sys.messages*.
Table population is limited to 2000
rows—much more and arithmetic overflow occurs for my sample values. I added four columns, whose meanings are as follows:

*multiplier*. Our multiplicands of random values in the range 0.214-2.382, scale
15. *groupcol*. A computed, persisted column of values either ‘a’, ‘b’ or ‘c’ for group
testing. *yearcol*. A computed, persisted column of values ‘2001’ – ‘2004’ also for group
testing. *ID*. An IDENTITY surrogate key that aids the performance of one of our query
forms.

### Three T-SQL Solutions

The three basic T-SQL solutions follow. I’ll refer to them throughout by the labels given.

#### CLR Simulation

The first solution is a kind of mirror of the CLR code:
where data member *product* accumulates
the result, we’ll employ a scalar variable, also initialized to 1.

DECLARE @product AS float= 1;
SELECT @product *= multiplier FROM aggr.T_ProductTest;

Hardly a hat trick—until we consider that the product will
be one when `aggr.T_ProductTest`

has no rows (we want a null result). Where the CLR solution solves the problem by setting and later testing a
second data member, `countOfMultipliers`

, we’ll substitute an outer join on a derived
table, which I prefer over introducing another variable outside the central
query:

select
@product *= ( prod.multiplier * onz.one )
from
aggr.T_ProductTest prod
RIGHT OUTER JOIN
(
select 1 as one
) onz
on
1 = 1;

When the table has rows, each number is multiplied by
one. When it doesn’t, `prod.multiplier`

isn’t even a null
value for the multiplication, so the action in the `SELECT`

statement cannot be
applied. In this case the right
outer join forces the `SELECT`

clause to be evaluated over one row and @*product* becomes null because multiplicand
`prod.multiplier`

is now null.

**Recursion**

Starting with our second technique we make a clean break
from the CLR approach. The float
variable is not needed, although we still need to accommodate an empty result
set:

WITH cteRecursiveProduct( level, product ) as
(
select level = ID, product = multiplier
from aggr.T_ProductTest
where ID = ( select max( ID ) from aggr.T_ProductTest )
UNION ALL
select level = prodCTE.level - 1, product = ( prodCTE.product * prod.multiplier )
from cteRecursiveProduct prodCTE
inner join
aggr.T_ProductTest prod
on
prodCTE.level - 1 = prod.ID
)
select product = product * onz.one
from cteRecursiveProduct
RIGHT OUTER JOIN
(
select 1 as one
) onz
on
isnull(level, 1) = onz.one
OPTION ( MAXRECURSION 0 );

What is required in many cases is that we put
more than 100 frames on the call stack—the maximum allowed by default—and so we
allow unlimited frames with the query hint

`MAXRECURSION 0`

.
In recursion, there is one row returned for each intermediate product for
the multiplicands seen so far, so we get the final product at

* level *1 in the reverse ID order
strategy. The join condition on

`onz`

is modified to return the

*
level *1 value regardless of
whether it exists.

**EXP LOG**

Our third solution involves an arithmetic trick, but shares
with the recursive technique the advantage of being able to be placed wholly
within a larger query.

Let *f(x)* be a
function that transforms each multiplier in a column into a common base number
and sums their logs:

f(x) =
sum( log( multiplier ) )

In this case, SQL Server system function
*log()* uses the number
*e* as the base by default. If the multipliers are 8 and 10, for example, they would be represented
as e^{2.0794} and e^{2.3026} to four decimal places, and f(x)
would return 4.382.

Let's extend the composition:

g(f(x)) = e^{f(x)}

The number e raised to the f(x) can of course be
represented as a decimal number, and in our example, e^{4.382} = 80
(adjusted for rounding error); system function *exp()*, which is the
inverse of l*og*() (exp(log(x)) = log(exp(x)) = x), does this:

product =
exp( sum( log( multiplier ) ) )

This all works because of the
*laws of exponents* (multiplication case):

x^{a} * x^{b} * … * x^{n} = x^{a+b+…+n}

If the explanation is a little dense, don’t worry. What we *do* need to worry about
is finding a negative number or zero in the input column, because the log for these is
undefined. Here is the error you get
when you try a log(0) or log(-5) operation:

Msg 3623, Level 16, State 1, Line 2
An invalid floating point operation occurred.

We circumvent the problem by adding the `nullif()`

function to substitute nulls for zeros and the `abs()`

function to ensure that all numbers are positive, but we’ll need additional code to get the correct
answer—zero whenever zeros occur in the column expression and a negative value when the count of negative multipliers is odd:

exp( sum( log( nullif( abs( multiplier ), 0 ) ) ) )
*
iif( sum( iif( multiplier = 0, 1, null ) ) > 0, 0, 1 )
*
iif( sum( iif( multiplier < 0, 1, 0 ) ) % 2 = 1, -1, 1 )

Note that unlike the other solutions, EXP LOG doesn’t
require the outer join or other strategy to return a null on null input.

Now we’ll explore an option not available to the
built-in aggregates.

## Choosing Invariance to Nulls

The popular built-in aggregates
*count*, *sum*,
*min*,
*max*,
*avg*, and the others are*
invariant to nulls*, meaning that having nulls in the column expressions over
which they operate does not affect the outcome.
The user cannot change this property. We, however, can, and so let’s see how it would be done, starting with
the CLR.

#### CLR

Here is the line in the `Accumulate`

method from the sample
code above that does the computation:

product *= (multiplier.IsNull) ? 1 : multiplier;

The ternary operator throws out nulls from the input,
making the implementation invariant to nulls. Remove the operator and the product is null whenever the input has a null
value, making it variant to nulls. Because
aggregates—system or CLR—don’t expose a parameterized constructor, a single
struct cannot give the user the option; simply introduce a second struct.

Set the property `IsInvariantToNulls`

on required attribute
class `SqlUserDefinedAttribute`

to true
on one and false on the other, keeping in mind that enforcement is up to you.

By contrast, were we to place the T-SQL implementations
into aggregate-simulating
functions—more on this later—a bit parameter specifying invariance
would obviate the need for duplication of code/effort.

You may have realized that the code snippets from the previous
section differ in variance, the first and second being variant, and the last
invariant (because the `log()`

function itself is invariant to nulls). Let’s start with the code that flips the property for the CLR simulation:

#### CLR Simulation

DECLARE @product float = 1;
select @product *= ( isnull( prod.multiplier, 1 ) * niladj.adjustor )
from aggr.T_ProductTest prod
RIGHT OUTER JOIN
(
select iif( count( * ) > 0, 1, null )
from aggr.T_ProductTest
where multiplier is not null
) niladj( adjustor )
on
1 = 1
select @product;

This rewrite of the derived table is correct over two boundary cases: when
`aggr.T_ProductTest`

is empty—as before; and when the
*multiplier* column in all rows has null
values. But the code is not optimal because the `count()`

aggregate requires a
full (table or index) scan.

DECLARE @product float = 1, @countOfMultipliers smallint = 0;
select @product *= isnull( multiplier, 1 ),
@countOfMultipliers += iif( multiplier is not null, 1, 0 )
from aggr.T_ProductTest;
select product = @product * iif( 0 = @countOfMultipliers, null, 1 );

The solution now more closely simulates the CLR implementation. Remove the
`isnull()`

function and it becomes an alternate solution for the variant case.

#### Recursion

Because the recursion strategy doesn't depend upon outside
variables, it must use the derived table or equivalent CTE to achieve
invariance, and pay the performance penalty.

select level = ID, product = isnull( multiplier, 1 )...
select ..., product = ( prodCTE.product * isnull( prod.multiplier, 1 ) )
...
select product = product * niladj.adjustor
from cteRecursiveProduct
RIGHT OUTER JOIN
(
select iif( count( * ) > 0, 1, null )
from aggr.T_ProductTest
where multiplier is not null
)
niladj( adjustor )
on
level = 1

#### EXP LOG

Flipping the property in the opposite direction for EXP LOG means one more multiplier:

exp( sum( log( nullif( abs( multiplier ), 0 ) ) ) )
*
iif( sum( iif( multiplier = 0, 1, null ) ) > 0, 0, 1 )
*
iif( sum( iif( multiplier < 0, 1, 0 ) ) % 2 = 1, -1, 1 )
*
iif( sum( iif( multiplier is null, 1, 0 ) ) > 0, null, 1 )

#### DISTINCT Keyword

Finally, another important option we
could implement is the
`DISTINCT`

keyword. I won’t expand on it, but
suffice it to say that the CLR would need more effort: e.g., a data member vector
could buffer all numbers in `Accumulate`

and the vector could be sorted to bypass
duplicates during multiplication in `Terminate`

.

Extensions to the T-SQL strategies vary in complexity and
performance. Do you see the error in this code? (Hint: let the column contain values 5 and -5.)

exp( sum( DISTINCT log( nullif( abs( multiplier ), 0 ) ) ) )

## Okay (oh no) Looping

While the looping tactic is verboten, I reference its metrics as the baseline
in the section on performance next, so here is one optimal form.

DECLARE @product float = 1.0, @countOfMultipliers smallint = 0,
@next_row smallint = 1, @max_row smallint;
select @max_row = max( ID ) from aggr.T_ProductTest;
while( @next_row <= @max_row )
begin
DECLARE @next_multiplier float;
select @next_multiplier = multiplier
from aggr.T_ProductTest
where ID = @next_row;
select @product *= isnull( @next_multiplier, 1 ),
@countOfMultipliers += iif( @next_multiplier is null, 0, 1 );
SET @next_row += 1;
end
select product = @product * iif( @countOfMultipliers = 0, null, 1 );

For your eyes only. Destroy the code after
reading.

## A Performance Comparison—and Caveats

The test I performed for all solutions in their
*invariant to null* forms was a product
over the *multiplier* column for all
2,000 rows in the `aggr.T_ProductTest`

table. I ran each solution code 100
times at an interval 1/20 second apart to get the logical disk reads and rough
averages over CPU and duration from the Profiler.

The average times clearly show that the recursive and
looping techniques are not viable. The first pair of numbers for their
measurements are the values for the code as was displayed, and the second, for
their *safer* versions, to be explained.

The first cost for recursion is very good, and for
looping, exceptional. So why the disconnect between the optimizer's
estimates and actual performance?

Execution Plans: CLR, CLR SIMULATION, and EXP LOG

Above is the execution plan for the SQL CLR aggregate.
If you add one Compute Scalar operator on each side of the Stream Aggregate, you
essentially have the plan for the EXP LOG code; subtract the Stream Aggregate,
the CLR SIMULATION. In all cases, the index scan on the clustered primary
key is known to the optimizer to return a fixed number of rows--the *
Estimated Number of Rows* = the *Actual Number of Rows*
= 2000--and so it can make accurate estimated costs.

Execution Plans: RECURSION

The graphic above depicts the operators that start one
branch of the recursive part in the estimated execution plan followed by those
that start in the actual execution plan. Where the other strategies employ
a one-pass index *scan*, recursion and also looping must *seek*
on the same index to get multipliers from successive IDs, once for each
recursion/iteration. This accounts for more page touches in looping, and
for recursion, seek must re-fetch all the multipliers seen so far plus the
current one for each stack frame,
so the reads skyrocket.

By visually inspecting the code, we can see that the anchor
gets the multiplier at ID 2000 (the last row in the sample table), and each
recursion operates at the next lower contiguous ID stopping at ID
one for a total of 2000. In fact, in the actual plan, the* *Actual
Number of Rows returned by the seek operator is 1999, as indicated by the much
thicker outbound arrow.

But the optimizer can't deduce the row
count from the recursive definition, and so it
puts in a placeholder value of one for Estimated Number of Rows, as indicated by
the thin arrow. It is for this reason that the optimizer cannot give a
reasonable estimated cost for recursion or for looping as well. (The
estimated branch cost may accurately reflect the effort to get the multipliers
at IDs 2000 and 1999 or just one multiplier.)

#### Recursion, Looping, and Safety

When I introduced the T-SQL solutions, I noted that I added
column ID as an `int IDENTITY`

clustered primary key to aid the performance of one
of the solutions (the compact natural key is message_id, which otherwise would
be clustered). That solution of
course is
RECURSION (add looping). But for this to
work, we must guarantee the following: 1) that the minimum ID is one; and 2)
that there are no gaps in the ID sequence, such as those resulting from row
deletions and rolled-back transactions. And often this is not the case.

**Safe Tactic: CTE**

Let’s add a CTE that gives us our contiguous IDs
starting at one:

WITH cteMultiplierRank( rankNo, multiplier ) as
(
select CAST( ROW_NUMBER( ) OVER( ORDER BY ID ) as int ), multiplier
from aggr.T_ProductTest
),

The
recursive CTE is rewritten to reference this CTE instead of the sample table,
and the execution plan shows that this CTE as well as the anchor and recursive
parts of the recursive CTE all share the starting operators below. The graphic is the start of the recursive branch of the
*actual* execution plan:

In
the estimated plan, each operator for all branches outputs 2,000 rows, but
run-time information shows that the recursive branch operators each produced
4,000,000 *actual* rows—2,000 sample
table rows times 2,000. The operators essentially set up 2,000 groups of
all (rankNo, multiplier) pairings, and a Filter operator to come applies the
recursive condition to determine the set of multipliers to use for
each group. The logical disk reads go from 22,010 to 156,147, the
estimated cost balloons to an unacceptable (untrustworthy! but still...) 2.3517, and the user experience
degrades proportionately. This next
attempt fares better.

**Performance Tactic: Table Variable**

DECLARE @tblMultiplier TABLE( rowno int IDENTITY PRIMARY KEY, multiplier float NULL );

After we rewrite the recursive CTE to reference the table
variable, the execution plan is
identical to the original and
the cost comes back to a healthy 0.016537. But in practice 85% of the total cost comes from populating the table variable
from the sample table, bringing the cost to 0.1077. The second sets of numbers in RECURSION and Looping show the additional cost of
this
safety.

Safer certainly--but is it *safe* now? Our
revised solutions share with the
CLR SIMULATION a potential problem inherent in its not being expressible in one
atomic statement—a problem to be addressed in the section
*T-SQL Solutions and Aggregate Functions*.

## Grouping

#### Success with EXP LOG

Built-in and user-defined CLR aggregates can be used in
`SELECT`

, `HAVING`

, and `ORDER BY`

clauses. Of our T-SQL solutions, only EXP LOG is a wholly self-contained
expression, and so it too can be used in these clauses:

select groupcol, yearcol,
product = exp( sum( log( nullif( abs( multiplier ), 0 ) ) ) )
from aggr.T_ProductTest
group by groupcol, yearcol
having exp( sum( log( nullif( abs( multiplier ), 0 ) ) ) ) > 0
order by exp( sum( log( nullif( abs( multiplier ), 0 ) ) ) );

I’ve left off the part of the calculation that adjusts for zeros and negative numbers for brevity.

#### Aggregate Window Functions: A Simple Fix

product_by_year_exp = sum( exp( sum( log( multiplier ) ) ) )
OVER( PARTITION BY yearcol ),
product_by_year_clr = sum( aggr.PRODUCT( multiplier ) )
OVER( PARTITION BY yearcol )

if we add these column expressions to the select list in
the query, *exp* is rejected by the
compiler as a window function because it is not an aggregate or other acceptable
function type. But
`sum()`

is, and we can use it as the outer
function to get the intended result.
Function `aggr.PRODUCT`

is also an
aggregate—it is the local name for the CLR aggregate—but this too is rejected
(for an unknown reason), and so we reuse the trick.

####
Poor Grouping Choices

Neither of our remaining strategies, RECURSION or the CLR
SIMULATION, is suitable for grouping. For our sample query, either we would need
to know in advance the (*groupcol, yearcol*)
paired values of interest—or employ more code to get the pairings—and windowing
makes no sense. In particular,
recursion is not a solution for grouping.

The CLR SIMULATION, with its individual variable technique,
is marginally better but not necessarily safe. Here
products for several years are set in one SELECT clause, invariant to
nulls form; grouping is implied in the `SELECT`

clause:

select
@product2001 *= iif( '2001' = niladj.yearcol, isnull( multiplier, 1 ), 1 ) * iif( '2001' = niladj.yearcol, niladj.adjustor, 1 ),
@product2002 *= iif( '2002' = niladj.yearcol, isnull( multiplier, 1 ), 1 ) * iif( '2002' = niladj.yearcol, niladj.adjustor, 1 ),
@product2003 *= iif( '2003' = niladj.yearcol, isnull( multiplier, 1 ), 1 ) * iif( '2003' = niladj.yearcol, niladj.adjustor, 1 ),
@product2004 *= iif( '2004' = niladj.yearcol, isnull( multiplier, 1 ), 1 ) * iif( '2004' = niladj.yearcol, niladj.adjustor, 1 )
from
aggr.T_ProductTest proTest
RIGHT OUTER JOIN
(
select niladj.yearcol, cnt_year_non_null.cnt
from
(
select yearcol, iif( count( * ) > 0, 1, null )
from aggr.T_ProductTest
where multiplier is not null
group by yearcol
) cnt_year_non_null( yearcol, cnt )
RIGHT OUTER JOIN
(
select yearcol, nilAdj = null
from
(
select [2001] = 1, [2002] = 1, [2003] = 1, [2004] = 1
) p
UNPIVOT
(
nilAdj FOR yearcol in( [2001], [2002], [2003], [2004] )
) as unpvt
) niladj( yearcol, adjustor )
on
cnt_year_non_null.yearcol = niladj.yearcol
)
niladj( yearcol, adjustor )
on
proTest.yearcol = niladj.yearcol;

Oh myyyyyyyyyy! The derived table must now do its own
outer join on an unpivot relational operator or employ a similar strategy (think
UNION ALL in the second derived table) to ensure that each year has its
own adjustor row (with possibly null adjustor), not just those years having rows
in the sample table. Of course, the more optimal second form for
invariance should have been used, but any developer may decide against using one
`@countOfMultiplier`

variable per year while not thinking of the derived table
problem. Increment the failure point column.

## Accuracy

All product strategies except one agree that
the total non-grouped product for the sample data is
6.04851066640848E-310. The exception is LOG EXP, which evaluates to 6.04851066640616E-310. This is a small difference over a tiny number—clearly a rounding error in
*exp* or
*log* or both. In other testing with very small numbers, including grouping, sometimes
it matched exactly with the others and sometimes not. With limited testing over small samples and larger numbers, it always
agreed. You make the call.

## T-SQL Solutions and Aggregate Functions

An aggregate is a scalar function whose input is a column
expression of suitable data type. CLR solutions are certainly that, and though we can put any of our T-SQL
solutions into scalar functions, they are certainly not.

CREATE TYPE aggr.tblMultiplier AS TABLE( multiplier float NULL );
select
product_clr = aggr.PRODUCT( multiplier ),
product_tsql = aggr.sf_PRODUCT( cast( multiplier as aggr.tblMultiplier ) )…

In the above, the compiler recognizes the first
function, `aggr.PRODUCT`

, as a
user-defined aggregate built from a .NET object, but no sleight of hand can make
the compiler accept the second function, `aggr.sf_PRODUCT`

, written in T-SQL, as an aggregate, or allow a
column to be cast as a user-defined table type. If we want to use the T-SQL function as a generic aggregate, we must, for
each product desired, fill a table variable (of type
`aggr.tblMultiplier`

) and set a variable to its return in a separate
statement. Aside from being
inefficient and inelegant, this opens up the door to problems arising from
unrepeatable reads.

#### A Read-Write Conflict Example

Let’s look at a basic problematic scenario, keeping in mind that the error is less likely to happen with aggregates, system
or user-defined in the SQL CLR, because they can always be placed into larger (atomic) statements (and locks are held for the duration of the statement).

begin tran;
DECLARE @tblMultiplier aggr.tblMultiplier, @product float, @count int;
INSERT INTO @tblMultiplier
select multiplier from aggr.T_ProductTest where groupcol = 'a';
select @product = aggr.sf_PRODUCT( @tblMultiplier, 1 );
<context switch to T2: INSERT a row having groupcol value = ‘a’>
select @count = count( * ) from aggr.T_ProductTest where groupcol = 'a';
...

The non-serializable schedule represents a READ-WRITE
transaction conflict, and is demonstrated by the sample code. After T1 reads the rows in `aggr.T_ProductTest`

falling under *
groupcol* ‘a,’ T2 commits a row to the group, making T1’s second read a
phantom read (a type of unrepeatable read). Without the context switch, (@product, @count) is <1.3322913590615E-104,
675>, but with it, the values are <3.86364494127789E-104,
676>, making the scalar variable values <1.3322913590615E-104,
676>, out of sync with each other.

To prevent the phantom read, we could up the transaction isolation level to
`SERIALIZABLE`

or force
serializable access to `aggr.T_ProductTest`

only with an appropriate table hint (TABLOCKX e.g.). But that tactic—pessimistic
locking—potentially decreases concurrency and increases the likelihood of
deadlocks. For this particular
example, it would be better to get the count from the table variable; in
practice, subtle errors are made.

#### Extending the Module

If we persist in the code module strategy, we should optionally make it support
grouping as well as the HAVING and
ORDER BY clauses. For example, if we group by groupcol and yearcol, we would
want the result set in one invocation rather than one for each (groupcol,
yearcol) pairing, with possible constraints on the groupcol/yearcol groups.
Reducing the number of calls lessens the risk of unrepeatable reads but doesn't
eliminate it.

For it to be generic, it must work for grouping over
any column list from any table with a numeric column. Another desideratum
is that it determine the grouping columns without needing a parameter.
You may have noticed: the function has morphed into a stored procedure
using dynamic SQL.

Finally, as per the code sample above, it should have a parameter for
specifying invariance to null behavior. My solution is in the download in
the Product Aggregate Generic Procedure folder.

Without a mechanism to pass columns as parameters, the problems—all puns
intended—multiply, and our attempts to make aggregate-simulating functions from
some good ad-hoc T-SQL code are for naught.

## The Report Card

<p[>If the grading seems arbitrary, think of me as being some of the teachers you had when you were in school.

## Finally

The MSDN library demonstrates the CLR SQL Server
user-defined aggregates with an example that counts the number of vowels in a
column of strings (http://msdn.microsoft.com/en-us/library/91e6taax(v=vs.90).aspx). I've written a T-SQL scalar function that counts the values for one input
string; the code is in the download. Following is its invocation that
matches the CLR functionality along with performance metrics when run over the *text* column
from our sample table:

select cntVowels = sum( aggr.sf_CountTheVowels( <some_string_column> ) )...

Function `aggr.sf_CountTheVowels`

approximates the `Accumulate`

method, is easy to write, and doesn’t try to be an aggregate—the `sum()`

is the aggregate--making it safe
by our standard. This time the optimizer knows up front to expect 2000
rows from the index scan, but probably because of the logical disk reads
involved, gives a better cost to the T-SQL solution even though it runs 10 times
slower. As the rule, CLR code gives better performance; in the product
aggregate example we may have hit a rare exception.

In a solution I did for a recent client, I needed a product aggregate but the
client didn’t want to enable the CLR, so I used my logarithm-based
technique. But whether we write in a .NET
language or T-SQL, rushing in without forethought could be costly. As we well know.

## Afterword

After the article appeared, I verified that the search
engines would find it. They do—and they also
list another Code Project article (Tip/Trick) that discusses computing the product
with logarithms (click here[^]
). The tip,
by Dr. Alexander Bell, references an earlier work of his that details his
research (ours are independent); read the tip’s referenced article for a second
view of the problem.

In that latter article, he discusses the *performance-universality dilemma*, which means
that the more cases handled by a solution, the costlier it is. For EXP LOG that implies addressing nulls,
zeros, and negative numbers that may occur in the column expression, which we
did. I’ll call the code that doesn’t *bare bones*.<o:p>

Further analysis verified the added cost:
although the query optimizer uses the same execution plan and assigns the same
final cost to both the bare bones and full EXP LOG solutions, and the page
reads are the same, EXP LOG adds machine cycles. Using the same testing method as described in
the performance section, I found that CPU usage went from an average of one millisecond in bare
bones to two milliseconds in EXP LOG, and duration jumped tenfold (by more than
10 milliseconds). So know your data, and
take what you need.

Scott is a data architect/database developer specializing in SQL Server. He holds an Ms. C.S. from UCSB engineering, and is MCITP Database Developer in 2005 and 2008.

Based in San Francisco under the incorporated name Ziron Systems, he works with clients to analyze, design, and develop database systems as well as mentor team members and troubleshoot issues.

His passion is working on complex problems, and at the moment, writing articles for advanced practitioners that pose problems and explore solutions.

Reach him at scott.burkow@zironsystems.com or (310)403-1137.