DataGen: Generate Large Test Data Files - Like A Boss

A couple of months ago I was doing some volume and performance testing against an application that was expecting a 500% data growth, which meant I had to generate lots and lots of dummy data to test whether the storage would hold up and whether the application itself would still perform well.

I quickly came up with a script that loops N times, generates dummy data, and writes it to an XML file. I left the script running while I worked on something else in parallel, and it finished after a couple of hours.

There were three issues with this approach:

1) It surely wasn’t going to be the last time I had to generate large test data. Next time, I would much prefer something simpler than writing a script.
2) A couple of hours was too long. I wanted a solution that cut down the data generation time.
3) The script ran on an i7 with multiple cores, but only one core was being utilised. Plain wrong.

That’s why I wrote DataGen.

Use npm to install:

npm install -g datagen

Ease of use

You don’t need to know any scripting language. You only need to create templates that construct your test data in this structure:

header
segment 1
segment 2
...
segment N (number of segments)
footer

Start by creating example header, segment, and footer templates:

datagen init

Example header:

Example segment:

{gen_id}-{worker_id}-{segment_id} {first_name()} {last_name()} {date('dd-mm-yyyy')}

Example footer:

The above templates can be used to generate an XML file like this:

1-1-1 Niels Bryant 12-08-1992
1-1-2 John Bohr 01-11-1970

If you set the segment flag (-s/--num-segments) to 10 million in the above example, DataGen will generate 10 million segments as the XML body, each containing a sequential ID, a random name, and a random date of birth.
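To make the template mechanics a bit more concrete, here is a rough Python sketch of how placeholders such as {gen_id} and {first_name()} could be substituted into a segment. This is not DataGen's actual implementation (DataGen is an npm package); the function names, the sample name lists, and the date range are all assumptions made up for illustration.

```python
import random
import re
from datetime import date, timedelta

# Hypothetical sample data; DataGen's real name lists are not shown in this post.
FIRST_NAMES = ["Niels", "John", "Marie", "Ada"]
LAST_NAMES = ["Bryant", "Bohr", "Curie", "Lovelace"]

# Matches anything wrapped in a single pair of curly braces.
PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def random_date(rng):
    """Return a random date between 1950-01-01 and 1999-12-31 as dd-mm-yyyy."""
    start = date(1950, 1, 1)
    span_days = (date(1999, 12, 31) - start).days
    day = start + timedelta(days=rng.randint(0, span_days))
    return day.strftime("%d-%m-%Y")


def render_segment(template, gen_id, worker_id, segment_id, rng=random):
    """Replace each {placeholder} in the template with a generated value."""
    values = {
        "gen_id": lambda: str(gen_id),
        "worker_id": lambda: str(worker_id),
        "segment_id": lambda: str(segment_id),
        "first_name()": lambda: rng.choice(FIRST_NAMES),
        "last_name()": lambda: rng.choice(LAST_NAMES),
        "date('dd-mm-yyyy')": lambda: random_date(rng),
    }

    def substitute(match):
        key = match.group(1)
        # Leave unknown placeholders untouched rather than guessing.
        return values[key]() if key in values else match.group(0)

    return PLACEHOLDER.sub(substitute, template)


template = "{gen_id}-{worker_id}-{segment_id} {first_name()} {last_name()} {date('dd-mm-yyyy')}"
for segment_id in (1, 2):
    print(render_segment(template, 1, 1, segment_id))
```

Each call produces one segment line with a sequential ID followed by a random name and date, in the same shape as the example output above.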

Check out the DataGen README for a list of available template parameters and functions, which can generate sequential numbers, random numbers, random names, random dates, random words, random emails, and random phone numbers.

Better performance

To reduce test data generation time and put those spare CPU cores to use, DataGen makes it easy to spawn multiple processes that generate multiple test data files at the same time: you specify how many workers (-w/--num-workers) DataGen should spawn. Each worker runs in its own process. Data value generation is CPU bound, while file streaming is IO bound.
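The one-process-per-worker idea can be sketched in a few lines of Python. Again, this is only an illustration under my own assumptions, not how DataGen is implemented: the function names, the output file naming, and the placeholder segment content are hypothetical. As in the benchmarks in this post, the segment count is per worker, and each worker writes its own file.

```python
import os
import tempfile
from multiprocessing import Pool


def write_worker_file(args):
    """Run inside a worker process: stream header, segments, and footer to one file."""
    out_dir, worker_id, num_segments = args
    path = os.path.join(out_dir, f"data-{worker_id}.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as out:
        out.write("header\n")
        for segment_id in range(1, num_segments + 1):
            # Value generation (CPU bound) would happen here,
            # followed by the write to the file (IO bound).
            out.write(f"1-{worker_id}-{segment_id}\n")
        out.write("footer\n")
    return path


def generate_parallel(out_dir, segments_per_worker, num_workers):
    """Spawn one process per worker; each produces its own output file."""
    jobs = [(out_dir, worker_id, segments_per_worker)
            for worker_id in range(1, num_workers + 1)]
    with Pool(processes=num_workers) as pool:
        return pool.map(write_worker_file, jobs)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as out_dir:
        for path in generate_parallel(out_dir, segments_per_worker=1000, num_workers=4):
            print(path, os.path.getsize(path), "bytes")
```

Because every worker owns its output file, the processes never have to coordinate writes, which is what lets the work spread across cores.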

As an example, I tested generating 10 million segments using just one worker, where each segment used every template function available in DataGen (random numbers, random dates, all of them, to simulate the most resource-intensive processing). It generated a single test data file, 13 GB in size. Here’s how long it took to finish:

$ time datagen gen -s 10000000 -w 1
real 47m17.610s
user 40m26.612s
sys 9m13.167s

Now compare that with the result of generating 10 million segments over 10 workers, 1 million segments each. This generated 10 test data files, 1.3 GB each.

$ time datagen gen -s 1000000 -w 10
real 10m51.262s
user 61m24.546s
sys 16m38.534s

That’s 36 minutes and 26.348 seconds faster, a reduction of roughly 77% in wall-clock time.
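If you want to check the arithmetic, the comparison can be reproduced from the two "real" timings above with a few lines of Python (the helper name is mine):

```python
def to_seconds(minutes, seconds):
    """Convert a `time`-style XmY.ZZZs reading into seconds."""
    return minutes * 60 + seconds


single = to_seconds(47, 17.610)    # real time with 1 worker
parallel = to_seconds(10, 51.262)  # real time with 10 workers

saved = single - parallel
print(f"saved: {int(saved // 60)}m {saved % 60:.3f}s")
print(f"reduction: {saved / single:.1%}")
print(f"speed-up: {single / parallel:.2f}x")
```

On these figures the saving is 36m 26.348s, which is about 77% less wall-clock time, or a speed-up of roughly 4.4x.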

The above result came from running DataGen on a quad-core machine, so 10 processes might involve too many CPU context switches. Total CPU time (user plus sys) also rose from about 50 minutes to about 78 minutes, which would be consistent with that overhead. I’m assuming the best performance would come from having the number of workers equal to or slightly more than the number of cores. I need to run more tests to verify this.
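If you want to try that assumption yourself, a small Python helper can build the command with the worker count matched to the machine's cores. The helper name and the optional "extra" parameter are my own invention, and whether this is actually the sweet spot is exactly the untested part.

```python
import os


def suggested_workers(extra=0):
    """Return the core count plus an optional few extra workers."""
    cores = os.cpu_count() or 1  # cpu_count() can return None
    return cores + extra


# Print the command to run, with 1 million segments per worker.
print(f"datagen gen -s 1000000 -w {suggested_workers()}")
```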

I’ve since used DataGen to generate more test data files. It is simple and easy to use, and it puts those spare cores to work. Give DataGen a try and let me know what you think.
