Tuesday, April 08, 2008

Generating Test Data

I'm an advocate of testing things - it can be a great way to discover different behaviors and learn how things work (or perhaps why they didn't work the way you thought they might!). When it comes to testing database applications, having some test data is certainly helpful. Ideally, I like to have a fully-loaded production database from which to draw this data into a test environment. Sometimes, however, this is not possible. In cases like these I use the "trick" described below to create my own test data.

When I need to generate test data I frequently call upon a seemingly little-known PL/SQL Package supplied with Oracle Database called "dbms_random". As you may guess this package can be used to generated random data. Rather than explain the package details (they are short and you can read them using the link), I'll just present a quick way to generate some data using SQL*Plus and the dbms_random package.

First, I'll need a table:

create table test_data
  id number,
  first_name varchar2(16),
  last_name varchar2(24),
  amount number(6,2),
  purchase_date date

As you can see, this table is nothing to write home about, but it does mirror what a real-world table would look like in a lot of cases.

When I am creating test data I generally prefer the data to be reproducible. That is, if I execute the process more than once I like to get the same data each time. To accomplish this with the dbms_random package I call the "seed" procedure and provide an initial seed value:

exec dbms_random.seed(42);

Let's say I wanted to populate this table with 10,000 rows of data. I use the following to do this:

insert into test_data
  to_date('01-JAN-2008', 'DD-MON-YYYY') + dbms_random.value(-100,100)
  (select level from dual connect by level <= 10000);

Starting at the bottom of the SQL text is a neat trick using "dual" and "connect by" to essentially create an "unlimited dual" for generating lots of rows. This trick was (I believe) originally put forth by Mikito Harakiri and I discovered it via an Ask Tom thread.

The remainder of the text is fairly straight-forward in its use of the dbms_random package. I use the "string" function to generate a random, lowercase value (which is subsequently passed to "initcap" to capitalize the first character of each string), the "value" function is used to create a random numeric value (which is passed to "round" to make it look like a purchase amount), and then I use a fixed-date to which I add (possibly a negative value) to create a set of valid dates within a range.

The first 10 rows of this data when selected from the table look like:

--- ---------------- ------------------------ ---------- --------------------
  1 Oqq              Mxsazbwyx                    521.33 22-MAR-2008 16:49:40
  2 Jjgqrywtxbdn     Fwwbzshhkbqzb                921.47 04-OCT-2007 09:10:00
  3 Zxflhufls        Mstwydowbaogeyyjiles         172.34 20-MAR-2008 10:22:05
  4 Zjjxtyysitsog    Zxrzqeflxgo                  882.16 26-DEC-2007 18:56:44
  5 Kjmuvbrqx        Hfu                          742.61 16-OCT-2007 14:35:27
  6 Oywaibiyp        Angvlehlmeujfdlhdmtt          664.5 29-FEB-2008 12:50:40
  7 Uhwyvla          Nhbwcv                       168.99 27-DEC-2007 22:29:59
  8 Kpdiqafanbvzt    Phjeqwelyugrmahybocwbhvp     813.81 01-MAR-2008 09:15:59
  9 Tvezuvrgnzzqkpq  Pjyygoqx                     880.09 21-NOV-2007 00:42:07
10 Olchylbeft       Nflaxjqfkmkgt                847.71 07-DEC-2007 16:53:23

10 rows selected.

Can this technique always be used? No, probably not. For example, the names are not exactly what you might call "proper" names. However, I do find that this technique can be useful when I need to create some data to perform some testing with. Perhaps it will be helpful to you too if you experiment with it and find the right combination of values to use in your tests.