Tuesday, April 08, 2008

Generating Test Data

I'm an advocate of testing things - it can be a great way to discover different behaviors and learn how things work (or perhaps why they didn't work the way you thought they might!). When it comes to testing database applications, having some test data is certainly helpful. Ideally, I like to have a fully-loaded production database from which to draw this data into a test environment. Sometimes, however, this is not possible. In cases like these I use the "trick" described below to create my own test data.

When I need to generate test data I frequently call upon a seemingly little-known PL/SQL package supplied with Oracle Database called "dbms_random". As you may guess, this package can be used to generate random data. Rather than explain the package details (they are short and well worth a read in the Oracle documentation), I'll just present a quick way to generate some data using SQL*Plus and the dbms_random package.
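To give a flavor of what the package offers, here are a few of the dbms_random functions I use most often, all callable from a single select (the column aliases are just my own; the values returned will of course vary):

select dbms_random.value          val_0_1,   -- random number in [0, 1)
       dbms_random.value(1, 100)  val_1_100, -- random number in [1, 100)
       dbms_random.string('l', 8) lower_str, -- 8 random lowercase characters
       dbms_random.normal         norm       -- normally-distributed random number
  from dual;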

First, I'll need a table:

create table test_data (
  id number,
  first_name varchar2(16),
  last_name varchar2(24),
  amount number(6,2),
  purchase_date date
);

As you can see, this table is nothing to write home about, but it does mirror what a real-world table would look like in a lot of cases.

When I am creating test data I generally prefer the data to be reproducible. That is, if I execute the process more than once I like to get the same data each time. To accomplish this with the dbms_random package I call the "seed" procedure and provide an initial seed value:

exec dbms_random.seed(42);
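Seeding makes the sequence of generated values repeatable. A quick way to convince yourself of this (I'm not showing output, since the actual values can differ by database version and platform):

exec dbms_random.seed(42);
select dbms_random.value from dual;

exec dbms_random.seed(42);
select dbms_random.value from dual;

Both selects return the same value, because the second seed call restarts the generator from the same starting point.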

Let's say I want to populate this table with 10,000 rows of data. I use the following to do this:

insert into test_data
select rownum,
       initcap(dbms_random.string('l', round(dbms_random.value(3, 16)))),
       initcap(dbms_random.string('l', round(dbms_random.value(3, 24)))),
       round(dbms_random.value(1, 1000), 2),
       to_date('01-JAN-2008', 'DD-MON-YYYY') + dbms_random.value(-100, 100)
  from (select level from dual connect by level <= 10000);

Starting at the bottom of the SQL text is a neat trick using "dual" and "connect by" to essentially create an "unlimited dual" for generating lots of rows. This trick was (I believe) originally put forth by Mikito Harakiri and I discovered it via an Ask Tom thread.
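If you want to see the row generator in isolation, it is simply:

select level from dual connect by level <= 10;

which returns ten rows numbered 1 through 10 — change the limit on "level" to produce as many rows as you need.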

The remainder of the text is fairly straightforward in its use of the dbms_random package. I use the "string" function to generate a random lowercase value (which is subsequently passed to "initcap" to capitalize the first character of each string), the "value" function to create a random numeric value (which is passed to "round" to make it look like a purchase amount), and a fixed date to which I add a random number of days (possibly negative) to create a set of valid dates within a range.
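Each of those expressions can be tried on its own against dual; for example (the output will vary with your seed):

select initcap(dbms_random.string('l', 10)) random_name,
       round(dbms_random.value(1, 1000), 2) random_amount,
       to_date('01-JAN-2008', 'DD-MON-YYYY')
         + dbms_random.value(-100, 100)     random_date
  from dual;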

The first 10 rows of this data when selected from the table look like:

 ID FIRST_NAME       LAST_NAME                    AMOUNT PURCHASE_DATE
--- ---------------- ------------------------ ---------- --------------------
  1 Oqq              Mxsazbwyx                    521.33 22-MAR-2008 16:49:40
  2 Jjgqrywtxbdn     Fwwbzshhkbqzb                921.47 04-OCT-2007 09:10:00
  3 Zxflhufls        Mstwydowbaogeyyjiles         172.34 20-MAR-2008 10:22:05
  4 Zjjxtyysitsog    Zxrzqeflxgo                  882.16 26-DEC-2007 18:56:44
  5 Kjmuvbrqx        Hfu                          742.61 16-OCT-2007 14:35:27
  6 Oywaibiyp        Angvlehlmeujfdlhdmtt          664.5 29-FEB-2008 12:50:40
  7 Uhwyvla          Nhbwcv                       168.99 27-DEC-2007 22:29:59
  8 Kpdiqafanbvzt    Phjeqwelyugrmahybocwbhvp     813.81 01-MAR-2008 09:15:59
  9 Tvezuvrgnzzqkpq  Pjyygoqx                     880.09 21-NOV-2007 00:42:07
 10 Olchylbeft       Nflaxjqfkmkgt                847.71 07-DEC-2007 16:53:23

10 rows selected.
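Once the table is loaded, a quick sanity check is worthwhile — for instance, confirming that every date landed within 100 days of 01-JAN-2008 and that amounts fall in the expected range (the column aliases here are just my own):

select count(*)           row_count,
       min(purchase_date) earliest,
       max(purchase_date) latest,
       min(amount)        min_amt,
       max(amount)        max_amt
  from test_data;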

Can this technique always be used? No, probably not. For example, the names are not exactly what you might call "proper" names. However, I do find this technique useful when I need to create some data to test with. Perhaps it will be helpful to you too if you experiment with it and find the right combination of values for your tests.

Sunday, April 06, 2008

Applications and Data == Importance Envy?

I was recently reading an interesting posting (and comments) over on Doug Burns' Blog.

Doug's blog is a great resource and I encourage you to check it out if you have a few spare minutes.

I found this interesting because the "debate" around which is more important -- the application or the data -- raises its head every now and then. Now, my being a DBA in my "day job" may automatically bias me in this context; however, I was a developer before I was a DBA. This, of course, does not mean I am bias-free, but I do hope that I am not so closed-minded that I cannot see both sides of a discussion.

When I hear this question (which is more important? the application? or the data? OK, so that is three questions, but you understand.) I immediately think of a question I was posed long ago in school - which is more important, the heart or the blood? Of course there is not a perfect correlation between biology and applications and data, but I think that is not an unreasonable analogy. While not exactly easy, it is possible to find other mechanisms to transport the blood, but it is not so easy to replace the blood itself.

Clearly both applications and data are important. However, rather than necessarily declare one to be more important than the other, my inclination is to view them in terms of life span. Often I hear something along the lines of: An application is used to access the data. To me this implies that there may be more than one application or a series of applications over time whereas the data itself is not really viewed as having multiple incarnations. If I may borrow a quote from The Kinks: "Rocks bands will come and rock bands will go, but Rock 'n Roll will go on forever!" (From a live performance of "All Day and All of the Night" as I recall).