
Very fast test data generation using exponential INSERT

28 Aug 2009 · CPOL · 1 min read
Instead of generating test data with a row-by-row sequential INSERT, this method effectively copies the existing data multiple times, doubling the table on each pass.

Introduction

This article (my first) describes an algorithm that generates a large amount of test data very quickly using a SQL query. The test data can be static or incremental, such as the “Item Name” and “Item ID” columns, respectively, shown below:

[Image: sample table with an incrementing “Item ID” column and a static “Item Name” column]

Background

One of the tasks in a project of mine involved generating a test table with 103,680,000 records. The conventional method of data generation would have taken a month, so a faster method of insertion was required. The new method took only 5 hours.

Using the code

Conventional method – Sequential INSERT

The conventional way of generating a list of numbers from 1 to 100,000 is a loop around a single-row INSERT statement, as follows:

SQL
CREATE TABLE #tempTable([Item ID] [bigint], [Item Name] nvarchar(30))
DECLARE @counter int
SET @counter = 1
-- One INSERT per row: 100,000 iterations for 100,000 rows
WHILE (@counter <= 100000)
BEGIN
        INSERT INTO #tempTable VALUES (@counter, 'Hammer')
        SET @counter = @counter + 1
END
SELECT * FROM #tempTable
DROP TABLE #tempTable

Let's call this method of data generation “Sequential INSERT”.

New method – Exponential INSERT

The new method effectively makes a copy of the existing data and appends it as new data, doubling the table on each pass, and repeats until the desired amount of data has been generated.

[Image: the table doubling on each pass as the existing rows are copied and appended]

Here is the code for the exponential INSERT:

SQL
CREATE TABLE #tempTable([Item ID] [bigint], [Item Name] nvarchar(30))
INSERT INTO #tempTable VALUES (1, 'Hammer')
-- Each pass copies every existing row, offsetting [Item ID] by the
-- current row count, so the table doubles until it exceeds 100,000 rows
WHILE ((SELECT COUNT(*) FROM #tempTable) < 100000)
BEGIN
    INSERT INTO #tempTable ([Item ID], [Item Name])
        SELECT [Item ID] + (SELECT COUNT(*) FROM #tempTable),
               'Hammer'
        FROM #tempTable
END
SELECT * FROM #tempTable
DROP TABLE #tempTable

Points of interest

The WHILE loop's condition, (SELECT COUNT(*) FROM #tempTable), is re-evaluated on every pass, which takes a long time. A faster approach is to calculate in advance how many doublings are needed: starting from a single row, 17 doublings produce 2^17 = 131,072 rows, the first power of two above the 100,000 records we want, so the loop can simply stop after the 17th iteration.
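As a quick sanity check, the iteration count can be computed directly. Here is a minimal sketch (not from the original article; note that T-SQL's LOG() is the natural logarithm):

SQL
-- Number of doublings needed to reach at least @target rows from one seed row
DECLARE @target int
SET @target = 100000
SELECT CEILING(LOG(@target * 1.0) / LOG(2.0)) AS iterations_needed  -- returns 17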

With the COUNT(*) condition, generating the numbers from 1 to 100,000 took 4 seconds; the fixed-iteration version below took two seconds:

SQL
CREATE TABLE #tempTable([Item ID] [bigint], [Item Name] nvarchar(30))
INSERT INTO #tempTable VALUES (1, 'Hammer')
DECLARE @counter int
SET @counter = 1
-- 17 doublings of the single seed row yield 2^17 = 131,072 rows
WHILE (@counter <= 17)
BEGIN
    INSERT INTO #tempTable ([Item ID], [Item Name])
        SELECT [Item ID] + (SELECT COUNT(*) FROM #tempTable),
               'Hammer'
        FROM #tempTable
    SET @counter = @counter + 1
END
SELECT * FROM #tempTable
DROP TABLE #tempTable

Also, this technique is not limited to incrementing a number field; it can be applied to datetime fields as well, as sketched below.
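For example, here is a minimal sketch (not from the article) that applies the same doubling trick to a datetime column, shifting each copied date forward by the current row count in days:

SQL
CREATE TABLE #dates([Event Date] datetime)
INSERT INTO #dates VALUES ('2009-01-01')
DECLARE @counter int
SET @counter = 1
WHILE (@counter <= 17)
BEGIN
    -- Offset every existing date by the current row count, in days,
    -- producing one continuous run of consecutive dates
    INSERT INTO #dates ([Event Date])
        SELECT DATEADD(day, (SELECT COUNT(*) FROM #dates), [Event Date])
        FROM #dates
    SET @counter = @counter + 1
END
SELECT * FROM #dates
DROP TABLE #dates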

History

  • This is v1.0.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).


Written By
United Kingdom

Comments and Discussions

 
General: My vote of 2
Aborq, 31-Aug-09 21:50
General: Re: My vote of 2
weilidai2001, 1-Sep-09 0:03
General: Re: My vote of 2
Aborq, 3-Sep-09 3:50
First of all, and as I've already stated, I don't understand the need to create a row-by-row solution when a set-based solution will give you the exact same result, with simpler code and while putting less strain on the db. (Especially since set-based operations are what the db is designed to do best in the first place.) Generating the ~65,000 rows while running the profiler yielded the following for the two solutions respectively:

              Row-by-row    Set-based
              ==========    =========
CPU                  562           46
Reads             273618          890
Writes              1108          271
Duration (ms)       1721          998


Second, predictability. You say in the article that it takes you "only 5 hours" to generate 100 million rows of data with the new method, and that your exponential insert took two seconds for 100,000 rows. If 100,000 rows take two seconds, then 100 million rows should take 2,000 seconds, give or take. But that's just a little more than half an hour... Where does the remaining time go? If your code doesn't scale and is only designed to solve your immediate need, you are bound to run into trouble sooner or later. I'm currently trying to fix stuff that runs perfectly in testing, runs perfectly with a medium-sized set of data, but is utterly useless when put into production with a real set of customers running against it.

So third, no, I don't currently have a personal need for a table with 100 million rows, but yes, I do have a use case for you: generating 100 million rows of test data. I ran the set-based version and it took my machine (an ordinary, average laptop CPU with 2 GB RAM and a local SQL Server installation) less than 7 minutes. Our production box did this in 2½ minutes. Effectively one line of code (two, if you include the slightly modified CREATE TABLE), and it returns *exactly* the amount of rows I need in one go. I'm sorry, but I really don't see a single good argument for going row-by-row in this case, even if the difference is slight when working with 100k rows of data.
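A minimal sketch of the kind of set-based generator described above; the exact query is not quoted in the thread, so this is one plausible form, assuming that cross-joining the sys.all_columns system view three times yields well over 100 million candidate rows:

SQL
CREATE TABLE #tempTable([Item ID] [bigint], [Item Name] nvarchar(30))
INSERT INTO #tempTable ([Item ID], [Item Name])
SELECT TOP (100000000)
       ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),  -- sequential Item ID
       'Hammer'
FROM sys.all_columns a          -- cross join three system views to get
CROSS JOIN sys.all_columns b    -- far more candidate rows than needed
CROSS JOIN sys.all_columns c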
Question: Why go row-by-row?
Aborq, 31-Aug-09 21:37
Answer: Re: Why go row-by-row?
weilidai2001, 1-Sep-09 0:04
Question: Could a variation be used to speed up creation of non-trivial records?
supercat9, 26-Aug-09 12:10
Answer: Re: Could a variation be used to speed up creation of non-trivial records?
weilidai2001, 26-Aug-09 13:12
General: Re: Could a variation be used to speed up creation of non-trivial records?
supercat9, 27-Aug-09 5:23
General: Re: Could a variation be used to speed up creation of non-trivial records?
Md. Marufuzzaman, 27-Aug-09 10:26
