How to Use columnstore Indexes in SQL Server

rusanu

0/5 (0 vote)

Jul 15, 2011

Apache

3 min read

16289

A very simplified explanation of how column store indexes can be used

Column oriented storage is the data storage of choice for data warehouse and business analysis applications. Column oriented storage allows for a high data compression rate and as such it can increase processing speed, primarily by reducing the IO needs. Now SQL Server allows for creating column oriented indexes (called COLUMNSTORE indexes) and thus brings the benefits of these highly efficient BI oriented indexes in the same engine that runs the OLTP workload. The syntax for creating columnstore indexes is described in MSDN at CREATE COLUMNSTORE INDEX. Let's walk through a very simple example of how to create and use a columnstore index. First, let's have a dummy sales table:

create partition function pf (date) as range left for values
  ('20110712', '20110713', '20110714', '20110715', '20110716');
go

create partition scheme ps as  partition pf all to ([PRIMARY]);
go

create table sales (
    [id] int not null identity (1,1),
    [date] date not null,
    itemid smallint not null,
    price money not null,
    quantity numeric(18,4) not null)
    on ps([date]);
go

create unique clustered index cdx_sales_date_id on sales ([date], [id]) on ps([date]);
go

Notice how I created this table on a partitioning scheme that has one partition a day. See my follow up article How to update a table with a columnstore index to understand why I chose this particular arrangement. For now, let's populate the table with 1 million ‘sales’ facts:

set nocount on;
go

declare @i int = 0;
begin transaction;
while @i < 1000000
begin
    declare @date date = dateadd(day, @i /250000.00, '20110712');
    insert into sales ([date], itemid, price, quantity)
        values (@date, rand()*10000, rand()*100 + 100, rand()* 10.000+1);
    set @i += 1;
    if @i % 10000 = 0
    begin
        raiserror (N'Inserted %d', 0, 1, @i);
        commit;
        begin tran;
    end
end
commit;
go

If we look now at the structure of the sales table, we see that each partition has 250K rows spread along 1089 pages:

select * from sys.system_internals_partitions p
    where p.object_id = object_id('sales');

select au.* from sys.system_internals_allocation_units au
    join sys.system_internals_partitions p
        on p.partition_id = au.container_id
    where p.object_id = object_id('sales');
go

If we now run a BI type of query like get the number of sales facts and the total sales for a day, the query would have to scan an entire partition, generating 1089 logical reads:

set statistics io on;
select count(*), sum(price*quantity) from sales where date = '20110713'
set statistics io off;
go

Table 'sales'. Scan count 1, logical reads 1089, physical reads 0,...

So let's create a columnstore index on this table:

create columnstore index cs_sales_price on sales ([date], price, quantity) on ps([date]);
go

If we look at the structure of the columnstore index, we'll see that it has a much smaller footprint, only 362 pages:

select * from sys.system_internals_partitions p
    where p.object_id = object_id('sales')
    and index_id = 2;

select au.* from sys.system_internals_allocation_units au
    join sys.system_internals_partitions p
        on p.partition_id = au.container_id
    where p.object_id = object_id('sales')
        and index_id = 2;
go

Note how the columnstore index has no pages allocated for the IN_ROW_DATA allocation unit, but instead has pages allocated to the LOB_DATA allocation unit. So a columnstore index has no rows, instead it uses the BLOB storage to store the column 'segments'. Due to compression possible with column oriented storage, it needs only about one third of the pages needed by the clustered index, although it contains the same columns and the same number of sales facts. If we run again the very same query as before, we'll see how it uses the columnstore index and generates less IO:

set statistics io on;
select count(*), sum(price*quantity) from sales where date = '20110713'
set statistics io off;
go

Table 'sales'. Scan count 1, logical reads 358, physical reads 0, read-ahead reads 0, ...

This article is just a very, very simplified explanation of how column store indexes can be used. Column oriented storage is one of the major features that ships with SQL Server 11 and there is much more we could talk about it, but I only wanted to give a short introduction. You should look into column oriented storage for BI and data warehousing projects, where a columnstore index could significantly speed up certain types of analytic queries, specially those that use aggregate functions.

On a final note, you have to understand the restrictions that columnstore indexes have; these restrictions are described in detail in the MSDN Columnstore Indexes article. The most severe restriction, by far, is the fact that a table that has columnstore indexes cannot be updated, it becomes read-only. For the specific DW and BI scenarios that columnstore indexes addresses, this is actually not such a hard restriction, as the ETL process can easily circumvent this problem by using staging tables and partitioning. More on this in the next article: How to update a table with a columnstore index.