8 Jul 2012, 12:17pm

  • DataGen: Generate Large Test Data Files – Like A Boss

    A couple of months ago I was doing some volume and performance testing against an application that was expected to see 500% data growth, which meant I had to generate lots of dummy data to test whether the storage would hold up and whether the application itself would still perform well.

    I quickly came up with a script that loops N times, generates dummy data, and creates an XML file. I left the script running while I worked on something else in parallel, and it finished after a couple of hours.

    There were 3 issues with this approach:

    1) It sure wasn’t going to be the last time I had to generate large test data. The next time I had to do something similar, I would much prefer something simpler than writing a script.
    2) A couple of hours was too long. I wanted a solution that cut down data generation time.
    3) The script ran on an i7 with multiple cores, but only 1 core was being utilised. Plain wrong.

    That’s why I wrote DataGen.

    Use npm to install:

    npm install -g datagen

    Ease of use

    You don’t need to know any scripting language. You only need to create templates that construct your test data in this structure:

    header
    segment 1
    segment 2
    ...
    segment N (number of segments)
    footer
    

    Start by creating example header, segment, and footer templates:

    datagen init

    Example header:

    <?xml version="1.0" encoding="UTF-8"?>
    <data>
    

    Example segment:

    <segment>
      <id>{gen_id}-{worker_id}-{segment_id}</id>
      <name>{first_name()} {last_name()}</name>
      <dob>{date('dd-mm-yyyy')}</dob>
    </segment>
    

    Example footer:

    </data>
    

    The above templates can be used to generate an XML like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <data>
    <segment>
      <id>1-1-1</id>
      <name>Niels Bryant</name>
      <dob>12-08-1992</dob>
    </segment>
    <segment>
      <id>1-1-2</id>
      <name>John Bohr</name>
      <dob>01-11-1970</dob>
    </segment>
    </data>
    

    If you set the segment flag (-s/--num-segments) to 10 million in the above example, then DataGen will generate 10 million segments as the XML body, containing sequential IDs, random names, and random dates of birth.
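    To make the template idea concrete, here is a minimal Python sketch of how placeholders like the ones in the example segment could be filled in. It is not DataGen's implementation (DataGen is a Node.js tool). The `render` and `generate` helpers, the name lists, and the date range are all assumptions made up for illustration.

```python
import random
import re
from datetime import date, timedelta

# Stand-in name lists (assumption; DataGen ships its own data).
FIRST_NAMES = ["Niels", "John", "Ada"]
LAST_NAMES = ["Bryant", "Bohr", "Hopper"]


def random_date(fmt):
    """Return a random date rendered with a 'dd-mm-yyyy'-style format."""
    day = date(1970, 1, 1) + timedelta(days=random.randrange(365 * 40))
    strftime_fmt = fmt.replace("dd", "%d").replace("mm", "%m").replace("yyyy", "%Y")
    return day.strftime(strftime_fmt)


# Template functions available to this sketch (hypothetical registry).
FUNCTIONS = {
    "first_name": lambda: random.choice(FIRST_NAMES),
    "last_name": lambda: random.choice(LAST_NAMES),
    "date": random_date,
}

# Matches {name} (a parameter) or {name(args)} (a function call).
PLACEHOLDER = re.compile(r"\{(\w+)(?:\((.*?)\))?\}")

SEGMENT = """<segment>
  <id>{gen_id}-{worker_id}-{segment_id}</id>
  <name>{first_name()} {last_name()}</name>
  <dob>{date('dd-mm-yyyy')}</dob>
</segment>
"""


def render(template, params):
    """Replace every placeholder in `template` with a parameter or function result."""
    def substitute(match):
        name, args = match.group(1), match.group(2)
        if args is None:  # plain parameter such as {segment_id}
            return str(params[name])
        call_args = [a.strip().strip("'\"") for a in args.split(",")] if args else []
        return str(FUNCTIONS[name](*call_args))
    return PLACEHOLDER.sub(substitute, template)


def generate(num_segments, gen_id=1, worker_id=1):
    """Build header + N rendered segments + footer as one string."""
    parts = ['<?xml version="1.0" encoding="UTF-8"?>\n<data>\n']
    for segment_id in range(1, num_segments + 1):
        params = {"gen_id": gen_id, "worker_id": worker_id, "segment_id": segment_id}
        parts.append(render(SEGMENT, params))
    parts.append("</data>\n")
    return "".join(parts)


if __name__ == "__main__":
    print(generate(2))
```

    The sketch builds the whole document in memory, which is fine for two segments but not for 10 million. A real generator would stream each rendered segment straight to the output file.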

    Check out DataGen README for a list of available template parameters and functions which can be used to generate sequential number, random number, random name, random date, random word, random email, and random phone number.

    Better performance

    To reduce test data generation time and utilise those spare CPU cores, DataGen provides an easy way to spawn multiple processes that generate multiple test data files at the same time: specify how many workers (-w/--num-workers) DataGen should spawn. Each worker runs in its own process. Data value generation is CPU bound, while file streaming is IO bound.
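    The one-process-per-worker, one-file-per-worker idea can be sketched in Python roughly as follows. This is an illustration only, not how DataGen does it internally. The `run_workers` helper, the `data-N.xml` file names, and the trimmed-down segment body are assumptions.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Code run by each worker process (hypothetical): writes one file with
# a header, `num_segments` segments, and a footer.
WORKER_CODE = """
import sys
worker_id, num_segments, out_path = int(sys.argv[1]), int(sys.argv[2]), sys.argv[3]
with open(out_path, "w") as out:
    out.write("<data>\\n")
    for segment_id in range(1, num_segments + 1):
        out.write(f"<segment><id>1-{worker_id}-{segment_id}</id></segment>\\n")
    out.write("</data>\\n")
"""


def run_workers(num_workers, segments_per_worker, out_dir):
    """Spawn one OS process per worker and wait for all of them to finish."""
    procs, paths = [], []
    for worker_id in range(1, num_workers + 1):
        path = Path(out_dir) / f"data-{worker_id}.xml"
        paths.append(path)
        procs.append(subprocess.Popen(
            [sys.executable, "-c", WORKER_CODE,
             str(worker_id), str(segments_per_worker), str(path)]
        ))
    for proc in procs:
        if proc.wait() != 0:
            raise RuntimeError("a worker process failed")
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in run_workers(4, 1000, tmp):
            print(path.name, path.stat().st_size, "bytes")
```

    Because every worker is a separate process writing to its own file, the operating system is free to schedule the workers on different cores, and no locking is needed around the output.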

    As an example, I tested generating 10 million segments where each segment contained all template functions available in DataGen (random numbers, random dates, all of them, to simulate the most resource-intensive processing), using just one worker. It generated a single test data file, 13 GB in size. Here’s how long it took to finish:

    $ time datagen gen -s 10000000 -w 1
    
    real 47m17.610s
    user 40m26.612s
    sys 9m13.167s
    

    Now compare that with the result of generating 10 million segments over 10 workers, 1 million segments each. This generated 10 test data files, 1.3 GB each.

    $ time datagen gen -s 1000000 -w 10
    
    real 10m51.262s
    user 61m24.546s
    sys 16m38.534s
    

    That’s 36 minutes and 26.348 seconds faster, roughly a 77% reduction in wall-clock time.
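    For anyone who wants to check the arithmetic, here is a small Python sketch that works from the two `real` values reported above. The `to_seconds` helper is just something written for this example.

```python
import re


def to_seconds(stamp):
    """Convert a `time`-style stamp such as '47m17.610s' to seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m([\d.]+)s", stamp).groups()
    return int(minutes) * 60 + float(seconds)


one_worker = to_seconds("47m17.610s")   # real time with -w 1
ten_workers = to_seconds("10m51.262s")  # real time with -w 10

saved = one_worker - ten_workers
reduction_pct = 100 * saved / one_worker
speedup = one_worker / ten_workers

print(f"saved:     {saved:.3f} s")
print(f"reduction: {reduction_pct:.1f} %")
print(f"speedup:   {speedup:.2f}x")
```

    The saving is 2186.348 seconds, which is about 77% of the single-worker time, or a speedup of roughly 4.4 times.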

    The above result came from running DataGen on a quad-core machine, so 10 processes might involve too many CPU context switches. My assumption is that the best performance comes from having the number of workers equal to, or slightly more than, the number of cores. I need to run more tests to verify this.
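    If you want to try the one-worker-per-core idea, a small Python sketch like this can work out the flags to pass. It relies on the behaviour shown in the runs above, where -s is the number of segments per worker. The `plan_run` and `command` helpers are hypothetical, and whether one worker per core is actually optimal is still the untested assumption.

```python
import os


def plan_run(total_segments, num_workers=None):
    """Pick a worker count (default: one per core) and the segments each needs."""
    workers = num_workers if num_workers else (os.cpu_count() or 1)
    # Ceiling division, so the combined output is never short of the target.
    per_worker = -(-total_segments // workers)
    return workers, per_worker


def command(total_segments, num_workers=None):
    """Format the datagen invocation for the planned run."""
    workers, per_worker = plan_run(total_segments, num_workers)
    return f"datagen gen -s {per_worker} -w {workers}"


if __name__ == "__main__":
    # One worker per core on whatever machine this runs on.
    print(command(10_000_000))
    # The 10-worker run from the benchmark above.
    print(command(10_000_000, num_workers=10))
```

    When the total does not divide evenly, the sketch rounds up, so the combined files contain slightly more segments than requested rather than fewer.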

    I’ve since used DataGen to generate more test data files. It is simple and easy to use, and it puts those spare cores to work. Give DataGen a try and let me know what you think.


    [...] DataGen (GitHub: cliffano / datagen, License: MIT, npm: datagen) by Cliffano Subagio is a multi-process test data file generator. It can be used to generate files in various formats, including CSV and JSON, based on template files that describe the output. Random numbers, dates, and strings can be generated. [...]

     
