Click here to Skip to main content
15,886,573 members
Articles / Database Development / SQL Server

How to Use columnstore Indexes in SQL Server

Rate me:
Please Sign up or sign in to vote.
0.00/5 (No votes)
15 Jul 2011Apache3 min read 16K  
A very simplified explanation of how column store indexes can be used

Column oriented storage is the data storage of choice for data warehouse and business analysis applications. Column oriented storage allows for a high data compression rate and as such it can increase processing speed, primarily by reducing the IO needs. Now SQL Server allows for creating column oriented indexes (called COLUMNSTORE indexes) and thus brings the benefits of these highly efficient BI oriented indexes in the same engine that runs the OLTP workload. The syntax for creating columnstore indexes is described in MSDN at CREATE COLUMNSTORE INDEX. Let's walk through a very simple example of how to create and use a columnstore index. First, let's have a dummy sales table:

SQL
create partition function pf (date) as range left for values
  ('20110712', '20110713', '20110714', '20110715', '20110716');
go

create partition scheme ps as  partition pf all to ([PRIMARY]);
go

create table sales (
    [id] int not null identity (1,1),
    [date] date not null,
    itemid smallint not null,
    price money not null,
    quantity numeric(18,4) not null)
    on ps([date]);
go

create unique clustered index cdx_sales_date_id on sales ([date], [id]) on ps([date]);
go

Notice how I created this table on a partitioning scheme that has one partition a day. See my follow up article How to update a table with a columnstore index to understand why I chose this particular arrangement. For now, let's populate the table with 1 million ‘sales’ facts:

SQL
set nocount on;
go

declare @i int = 0;
begin transaction;
while @i < 1000000
begin
    declare @date date = dateadd(day, @i /250000.00, '20110712');
    insert into sales ([date], itemid, price, quantity)
        values (@date, rand()*10000, rand()*100 + 100, rand()* 10.000+1);
    set @i += 1;
    if @i % 10000 = 0
    begin
        raiserror (N'Inserted %d', 0, 1, @i);
        commit;
        begin tran;
    end
end
commit;
go

If we look now at the structure of the sales table, we see that each partition has 250K rows spread along 1089 pages:

SQL
select * from sys.system_internals_partitions p
    where p.object_id = object_id('sales');

select au.* from sys.system_internals_allocation_units au
    join sys.system_internals_partitions p
        on p.partition_id = au.container_id
    where p.object_id = object_id('sales');
go

Image 1

If we now run a BI type of query like get the number of sales facts and the total sales for a day, the query would have to scan an entire partition, generating 1089 logical reads:

SQL
set statistics io on;
select count(*), sum(price*quantity) from sales where date = '20110713'
set statistics io off;
go

Table 'sales'. Scan count 1, logical reads 1089, physical reads 0,...

So let's create a columnstore index on this table:

SQL
create columnstore index cs_sales_price on sales ([date], price, quantity) on ps([date]);
go

If we look at the structure of the columnstore index, we'll see that it has a much smaller footprint, only 362 pages:

SQL
select * from sys.system_internals_partitions p
    where p.object_id = object_id('sales')
    and index_id = 2;

select au.* from sys.system_internals_allocation_units au
    join sys.system_internals_partitions p
        on p.partition_id = au.container_id
    where p.object_id = object_id('sales')
        and index_id = 2;
go

Image 2

Note how the columnstore index has no pages allocated for the IN_ROW_DATA allocation unit, but instead has pages allocated to the LOB_DATA allocation unit. So a columnstore index has no rows, instead it uses the BLOB storage to store the column 'segments'. Due to compression possible with column oriented storage, it needs only about one third of the pages needed by the clustered index, although it contains the same columns and the same number of sales facts. If we run again the very same query as before, we'll see how it uses the columnstore index and generates less IO:

SQL
set statistics io on;
select count(*), sum(price*quantity) from sales where date = '20110713'
set statistics io off;
go

Table 'sales'. Scan count 1, logical reads 358, physical reads 0, read-ahead reads 0, ...

This article is just a very, very simplified explanation of how column store indexes can be used. Column oriented storage is one of the major features that ships with SQL Server 11 and there is much more we could talk about it, but I only wanted to give a short introduction. You should look into column oriented storage for BI and data warehousing projects, where a columnstore index could significantly speed up certain types of analytic queries, specially those that use aggregate functions.

On a final note, you have to understand the restrictions that columnstore indexes have; these restrictions are described in detail in the MSDN Columnstore Indexes article. The most severe restriction, by far, is the fact that a table that has columnstore indexes cannot be updated, it becomes read-only. For the specific DW and BI scenarios that columnstore indexes addresses, this is actually not such a hard restriction, as the ETL process can easily circumvent this problem by using staging tables and partitioning. More on this in the next article: How to update a table with a columnstore index.

License

This article, along with any associated source code and files, is licensed under The Apache License, Version 2.0


Written By
bugcollect.com
United States United States
Remus Rusanu is a developer with the SQL Server team.

Comments and Discussions

 
-- There are no messages in this forum --