DataGen Workers Optimisation

I released DataGen v0.0.9 during my lunch break yesterday. This version adds support for limiting how many workers can run concurrently, something I’ve wanted to add since day one. I finally got the time to do it last weekend, and it turned out to be an easy task thanks to Rod Vagg’s worker-farm module.

Why is this necessary?

The problem with previous versions of DataGen was that when you want to generate 20 data files, then 20 worker processes will be created and run concurrently. It’s obviously not a great idea to have 20 processes fighting over 2 CPUs.

With v0.0.9, you can specify this limit using the new -m/--max-concurrent-workers flag (if unspecified, it defaults to the number of CPUs):

datagen gen -w 20 -m 2

When I first wrote about DataGen last year, I mentioned that I still needed to run some tests to verify my assumption about the optimal number of workers. So here it is one year later…

The first test is on a Linux box with 8 cores, where each data file contains 500,000 segments, each segment contains a segment ID, 6 strings, and 3 dates.

The second test is on an OSX box with 2 cores, where each data file contains 500,000 segments, but this time each segment only contains a segment ID.

As you can see, performance is almost always best when the number of concurrently running worker processes is limited to the number of available CPUs (8 max concurrent workers on the first chart, and 2 on the second chart).

When you specify 20 workers and your laptop only has 2 CPUs, only 2 workers will generate the data file concurrently at any time, and you can be sure that it will be faster than having 20 workers generating 20 data files at the same time. And that’s why DataGen’s default setting allows as many concurrent workers as the available CPUs.
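The idea of capping in-flight work at the CPU count can be sketched in a few lines of Node.js. This is only an illustration, not DataGen’s or worker-farm’s actual code: runLimited and its arguments are hypothetical names, and the tasks here are promise-returning functions rather than real worker processes.

```javascript
const os = require('os');

// Run task functions (each returning a promise) with at most `limit` in flight.
// `limit` defaults to the CPU count, mirroring the default described above.
// Hypothetical helper: it is not part of DataGen or worker-farm.
function runLimited(tasks, limit = os.cpus().length) {
  return new Promise((resolve, reject) => {
    const results = new Array(tasks.length);
    let next = 0;     // index of the next task to start
    let active = 0;   // tasks currently running
    let finished = 0; // tasks completed so far

    if (tasks.length === 0) return resolve(results);

    function launch() {
      // Start tasks until the limit is reached or nothing is left to start.
      while (active < limit && next < tasks.length) {
        const index = next++;
        active++;
        tasks[index]().then((value) => {
          results[index] = value;
          active--;
          finished++;
          if (finished === tasks.length) resolve(results);
          else launch(); // a slot just freed up, so start the next task
        }, reject);
      }
    }

    launch();
  });
}
```

With 20 tasks and a limit of 2, only two tasks are ever in flight; each time one finishes, the next one starts.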

Jenkins Build Status On Ninja Blocks RGB LED

Nestor v0.1.2 is out, and one of its new features is nestor ninja, which monitors Jenkins and displays the latest build status on a Ninja Blocks RGB LED device (if you have a block, it’s the ninja’s eyes).

Here’s a usage example:
export JENKINS_URL=<url>
export NINJABLOCKS_TOKEN=<token_from_>
nestor ninja

Red for build failure, green for build success, yellow for build warning, and white for unknown status. The yellow light looks quite similar to green, and the white one does look blue-ish.
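The colour scheme boils down to a small lookup with a fallback. Here’s a minimal sketch of that mapping; the status names and hex values are assumptions made for illustration and are not taken from Nestor’s source.

```javascript
// Hypothetical status names and hex values, chosen for illustration only.
const STATUS_COLOURS = {
  failure: 'FF0000', // red
  success: '00FF00', // green
  warning: 'FFFF00', // yellow
};

// Any status that isn't recognised falls back to white.
function statusToColour(status) {
  return Object.prototype.hasOwnProperty.call(STATUS_COLOURS, status)
    ? STATUS_COLOURS[status]
    : 'FFFFFF';
}

console.log(statusToColour('failure'));
console.log(statusToColour('something-else'));
```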

And the best place to run nestor ninja? On the block itself of course!

ssh ubuntu@ninjablock.local
apt-get install upstart
npm install -g nestor
cat /usr/lib/node_modules/nestor/conf/ninja_upstart.conf > /etc/init/nestor_ninja.conf
vi /etc/init/nestor_ninja.conf # and change JENKINS_URL and NINJABLOCKS_TOKEN values
shutdown -r now

Log messages will then be written to /var/log/nestor_ninja.log

Sapi, A Node.js Client For Sensis API

We had a hack day at Sensis a couple of days ago where I ended up writing Sapi, a Node.js client for Sensis API. This module is now available on NPM.

The latest version (v0.0.5) was tested against Sensis API open beta version ob-20110511. I think I’ve got all of its documented features covered. Let me know if I missed anything.

The Sapi module provides a chainable interface for constructing the endpoint parameters. Here’s an example:

  .search(function (err, result) {

The above snippet performs a search for restaurants in Melbourne. It uses Sensis API search endpoint with query and location parameters.

If you want to add more parameters, e.g. for postcode and state filtering, just keep chaining them like this:

  .search(function (err, result) {

Simple enough? Check out Sapi README for installation and further usage guide.
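If you’re curious how a chainable interface like this can be put together, here’s a rough sketch of the pattern. It is not Sapi’s implementation: the EndpointQuery class, its method names, and the example.com base URL are all made up for illustration. The trick is simply that every setter returns this.

```javascript
// Hypothetical builder illustrating the chaining pattern (not Sapi's real code).
class EndpointQuery {
  constructor(baseUrl, key) {
    this.baseUrl = baseUrl;
    this.params = { key: key };
  }

  // Store one parameter and return `this` so calls can be chained.
  param(name, value) {
    this.params[name] = value;
    return this;
  }

  query(value) { return this.param('query', value); }
  location(value) { return this.param('location', value); }
  postcode(value) { return this.param('postcode', value); }
  state(value) { return this.param('state', value); }

  // Build the final URL for an endpoint such as "search".
  toUrl(endpoint) {
    const qs = new URLSearchParams(this.params).toString();
    return this.baseUrl + '/' + endpoint + '?' + qs;
  }
}

const url = new EndpointQuery('http://api.example.com', 'my-key')
  .query('restaurants')
  .location('Melbourne')
  .toUrl('search');
console.log(url);
```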

So, if you would like to write an application which requires Australian business listings information, check out Sensis API and apply for a key. And if you’re a Node.js coder, give Sapi a try.

Note: I know there’s also node-sapi module published to NPM, but it seems to be just a placeholder as the source code currently only implements API key checking.

DataGen: Generate Large Test Data Files – Like A Boss

A couple of months ago I was doing some volume and performance testing against an application that was expecting a 500% data growth, which meant I had to generate lots and lots of dummy data to test whether the storage would hold up and whether the application itself would still perform well.

I quickly came up with a script that loops N times, generates dummy data, and creates an XML file. I left the script running while working on something else in parallel, and it finished after a couple of hours.

There were 3 issues with this approach:

1. It sure wasn’t going to be the last time I had to generate large test data. The next time I had to do something similar, I would much prefer a simpler solution than scripting.
2. A couple of hours was too long. I wanted a better solution to cut down data generation time.
3. The script ran on an i7 with multiple cores, but only 1 core was being utilised. Plain wrong.

That’s why I wrote DataGen.

Use npm to install:

npm install -g datagen

Ease of use

You don’t need to know any scripting language. You only need to create templates to construct your test data in this structure:

header
segment 1
segment 2
...
segment N (number of segments)
footer

Start by creating example header, segment, and footer templates:

datagen init

Example header:

<?xml version="1.0" encoding="UTF-8"?>

Example segment:

  <name>{first_name()} {last_name()}</name>

Example footer:


The above templates can be used to generate an XML like this:

<?xml version="1.0" encoding="UTF-8"?>
  <name>Niels Bryant</name>
  <name>John Bohr</name>

If you set the segment flag (-s/--num-segments) to 10 million in the above example, DataGen will generate 10 million segments as the XML body, containing sequential IDs, random names, and random dates of birth.

Check out DataGen README for a list of available template parameters and functions which can be used to generate sequential number, random number, random name, random date, random word, random email, and random phone number.
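To give a feel for how placeholders like {first_name()} can be expanded, here’s a minimal sketch of a template renderer. It is not DataGen’s actual implementation: the render function and the tiny name lists are assumptions for illustration only.

```javascript
// Replace every {name()} placeholder with the result of calling functions[name].
// Unknown placeholders are left untouched. Hypothetical helper, not DataGen's code.
function render(template, functions) {
  return template.replace(/\{(\w+)\(\)\}/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(functions, name)
      ? String(functions[name]())
      : match
  );
}

// Pick a random element from a list.
function pick(list) {
  return list[Math.floor(Math.random() * list.length)];
}

// Illustrative value generators; the real ones are listed in the README.
const functions = {
  first_name: () => pick(['Niels', 'John']),
  last_name: () => pick(['Bryant', 'Bohr']),
};

console.log(render('<name>{first_name()} {last_name()}</name>', functions));
```

Each call to render evaluates the functions again, so rendering the same segment template N times yields N different segments.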

Better performance

To reduce the test data generation time and to utilise those spare CPU cores, DataGen provides an easy way to generate multiple test data files at the same time: specify how many workers (-w/--num-workers) DataGen should spawn. Each worker runs in its own process. Data value generation is CPU bound, while file streaming is IO bound.

As an example, I tested generating 10 million segments where each segment contained all template functions available in DataGen (random number, random dates, all of them, to simulate the most resource-intensive processing), using just one worker. It generated a single test data file, 13GB in size. Here’s how long it took to finish:

$ time datagen gen -s 10000000 -w 1

real 47m17.610s
user 40m26.612s
sys 9m13.167s

Now compare that with the result of generating 10 million segments over 10 workers, 1 million segments each. This generated 10 test data files, 1.3GB each.

$ time datagen gen -s 1000000 -w 10

real 10m51.262s
user 61m24.546s
sys 16m38.534s

That’s 36 minutes and 26.348 seconds faster, roughly a 77% reduction in wall-clock time.
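For anyone who wants to check the arithmetic, the two "real" timings above can be compared with a few lines of Node.js. The toSeconds helper is just a convenience written for this sketch.

```javascript
// Convert a `time` value such as "47m17.610s" into seconds.
function toSeconds(value) {
  const match = /^(\d+)m(\d+(?:\.\d+)?)s$/.exec(value);
  if (!match) throw new Error('Unexpected time format: ' + value);
  return Number(match[1]) * 60 + Number(match[2]);
}

const single = toSeconds('47m17.610s');   // one worker, 10 million segments
const parallel = toSeconds('10m51.262s'); // ten workers, 1 million segments each
const saved = single - parallel;
const improvement = (saved / single) * 100;
const speedup = single / parallel;

console.log(saved.toFixed(3) + 's saved');
console.log(improvement.toFixed(1) + '% less wall-clock time');
console.log(speedup.toFixed(2) + 'x faster');
```

Note that the user and sys times went up in the 10-worker run, which is expected: the total CPU work is spread over several cores rather than reduced.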

The above result came from running DataGen on a quad core machine, so 10 processes might involve too many CPU context switches. I’m assuming that the best performance would be to have the number of workers equal to or slightly more than the number of cores. I need to run more tests to verify this.

I’ve since used DataGen to generate more test data files. It is simple and easy to use, and it puts those spare cores to work. Give DataGen a try and let me know what you think.

Publishing Node.js Module To Ivy Repository

Let me guess what you’re going to say in 3… 2… 1…

WTF??? Why would anyone want to do that?


Some of us don’t have the luxury of a local NPM repository, while some others have their delivery pipeline tightly integrated to an Ivy repository.

So, for those few who are stuck with the unholy union of Node.js and Apache Ivy, you can publish your Node.js module to an Ivy repository using Bob. Here’s how:

Update (23/08/2012): The instructions below are for Bob v0.4.x or older. If you’re using Bob v0.5.x or newer, please scroll further down for the updated instructions.

1. Create an ivy.xml file template in your Node.js module project directory

<?xml version="1.0" encoding="ISO-8859-1"?>
<ivy-module version="2.0" xmlns:xsi="" xsi:noNamespaceSchemaLocation="">
    <info organisation="" module="modulename" status="integration" revision="${version}" publication="${now('yyyymmddHHMMss')}"/>
        <conf name="default" visibility="public" description="..." extends="runtime,master"/>
        <artifact name="modulename" type="tar.gz" conf="default"/>

2. Create a .bob.json file in your project directory, specifying ivy.xml location and details of the Ivy repository server

  "packagemeta": {
    "dir": "path/to",
    "file": "ivy.xml"
  "template": [
  "deploy": {
    "user": "username",
    "key": "path/to/keyfile",
    "host": "hostname",
    "port": portnumber,
    "dir": "/path/to/ivy/repo/${name}/${version}"

3. Run Bob

bob template package package-meta ssh-mkdir deploy

(Yup, this could’ve been simpler. It’s on my TODO list.)

This will create /path/to/ivy/repo/modulename/version/ directory with the following files:

  • modulename.tar.gz
  • modulename.tar.gz.md5
  • modulename.tar.gz.sha1
  • ivy.xml
  • ivy.xml.md5
  • ivy.xml.sha1
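The .md5 and .sha1 files hold hex digests of their companion files. If you ever need to verify one by hand, the digests can be computed with Node’s built-in crypto module. This is a small sketch; the checksums helper is a hypothetical name and is not how Bob itself produces the files.

```javascript
const crypto = require('crypto');

// Return hex digests corresponding to the .md5 and .sha1 companion files.
// `data` can be a string or a Buffer (e.g. from fs.readFileSync).
function checksums(data) {
  return {
    md5: crypto.createHash('md5').update(data).digest('hex'),
    sha1: crypto.createHash('sha1').update(data).digest('hex'),
  };
}

console.log(checksums('hello'));
```

For a real artifact you would pass the file contents, e.g. checksums(require('fs').readFileSync('modulename.tar.gz')).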

This module can then be referenced and used just like any other Ivy artifact.

If the repository is accessible via HTTP, then you can also specify the Ivy artifact as a dependency of another Node.js module in its package.json file:

  "dependencies": {
    "modulename": "http://ivyserver/com/company/modulename/version/modulename-version.tar.gz",

Update (23/08/2012): The instructions below are for Bob v0.5.x or newer.

1. Create an ivy.xml file template in your Node.js module project’s root directory
If you’re upgrading from Bob v0.4.x, all you need to do is remove the $ from the parameter syntax.

<?xml version="1.0" encoding="ISO-8859-1"?>
<ivy-module version="2.0" xmlns:xsi="" xsi:noNamespaceSchemaLocation="">
    <info organisation="" module="modulename" status="integration" revision="{version}" publication="{now('yyyymmddHHMMss')}"/>
        <conf name="default" visibility="public" description="..." extends="runtime,master"/>
        <artifact name="modulename" type="tar.gz" conf="default"/>

2. Create a .bob.json file in your project directory, specifying ivy.xml location and details of the Ivy repository server
If you’re upgrading from Bob v0.4.x, you need to move ivy.xml to the project’s root directory and modify .bob.json by removing packagemeta, changing the template structure, renaming deploy to publish, and adding publish.type: ivy.

  "template": [".bob/artifact/ivy.xml"],
  "publish": {
    "type": "ivy",
    "user": "username",
    "key": "path/to/keyfile",
    "host": "hostname",
    "port": portnumber,
    "dir": "/path/to/ivy/repo/${name}/${version}"

3. Run Bob
If you’re upgrading from Bob v0.4.x, simply replace the template package package-meta ssh-mkdir deploy targets with package publish.

bob package publish

This will create /path/to/ivy/repo/modulename/version/ directory with the following files:

  • modulename.tar.gz
  • modulename.tar.gz.md5
  • modulename.tar.gz.sha1
  • ivy.xml
  • ivy.xml.md5
  • ivy.xml.sha1

This module can then be referenced and used just like any other Ivy artifact.

If the repository is accessible via HTTP, then you can also specify the Ivy artifact as a dependency of another Node.js module in its package.json file:

  "dependencies": {
    "modulename": "http://ivyserver/com/company/modulename/version/modulename-version.tar.gz",

Despite my dislike of XML configuration files, Ivy has worked just fine all these years and I’ve been using it to store various types of artifacts. Even though it’s mainly popular within the Java community, you can store pretty much anything there (YMMV).

Bob itself is still at an early stage, and there are lots of things I want to improve. I just need that mythical spare time :).

Australia According To NodeUp

I’m a fan of NodeUp, a podcast about all things Node.js, and a great source of thoughts and opinions from the who’s who of the Node.js community.

Putting the serious stuff aside, the show has a running joke where the hosts put on their best effort to prop up Bislr, one of the show’s sponsors, by saying hilarious things about Australia. And it actually worked, us Australians (at least myself and those I know) love it, and I sure won’t forget the name Bislr for at least the next couple of years.

Here are some interesting facts about Australia… (Note: some of these are actually true)

Ep 37:

  • They’ve imported a bunch of kangaroos and maybe a wallaby or two.
  • They’ve bought these special toilets so that they free bidet with your thing, for washing your backside.
  • Australian toilets are like Portuguese toilets.

Ep 36:

  • They brought kangaroos, and boomerangs.
  • They’re issuing boomerangs to all new hires.
  • They smuggled the boomerangs in the kangaroo pouches, in the kangaroos. And probably some beer.

Ep 35:

  • Great Aussies, jumpin’ around on backs of kangaroo.
  • The Australian office is great and we love talking about Australia…
  • I’m sure that they brought at least one kangaroo with them. So there’s probably a kangaroo,  just hopin’ around the office and shitin’ everywhere.
  • You can hang out with that kangaroo in San Francisco or jump around with them in the outback in Australia.

Ep 34:

  • See what it likes to hang out with kangaroos.
  • Literally, quite literally, all baristas in Europe are from Australia.
  • With a couple sneaking in from New Zealand, but I think by way of Australia.
  • If you didn’t want to hang out with kangaroos, I don’t know why…


  • They have imported their kangaroos all the way back from Australia to San Francisco.

Ep 16:

  • It is sunny as always.
  • It’s in the southern hemisphere, the water spirals in a different direction in your toilet, it spirals upwards rather than downwards.
  • It’s f***** messy, it really is, you got to bring a hose.
  • Kangaroos, best transport ever.
  • It’s warm ever.
  • Boomerang, it’s pretty much how you grab things at the bar, like beer.
  • You throw boomerang, it comes back, it brings things with it.
  • Boomerang only does one point of damage, you don’t want to use it on any kind of large enemies, plus the larger enemies have shields, so it bounces right off.

Ep 15:

  • More interesting, always sunny, always beer.
  • Fairly laid back working environment.
  • They also have marshmallows.
  • Kangaroos, 1980 fashion.
  • The kangaroos bring the marshmallows for your beer while you’re at the beach.
  • It’s a small island. Its own continent by some definition.
  • Inhabitable, spiders, and snakes. Large poisonous things.

Ep 14:

  • It’s sunny, in the 80s, and there is infinite beer.

Ep 13:

  • It’s always sunny.
  • The 1980s are still going strong after 30 odd years.
  • You can ride a kangaroo to work.

Ep 12:

  • Not Austria, that’s the one with Hitler.
  • The other one, the south one, the good one, the kangaroo.
  • Hang out at the beach, ride around in kangaroos, drink beer out of garbage cans.

Ep 11:

  • We have learned about Australia from Looney Tunes.
  • You order a fish, they put a shark on the table, and a giant bucket of beer.
  • They do have buckets of bad beer at all the bars.
  • Getting a visa to Australia is incredibly easy.

Ep 10:

  • It’s just daylight all day, all day long, and all night, and it never stops.
  • It’s really warm, and there is kangaroos.

Ep 9:

  • It’s sunny, and they play table tennis down there, and drink beer.

Ep 8:

  • M****f***** Australia, kangaroos!
  • In Australia, it’s still in the 80s. So… neon.
  • The movies are still a little more awesome.
  • Sunny, warm there, good food actually.
  • Be in Australia, it’s a positive thing.
  • You can just go to Australia, it’s populated by criminals, they have kangaroos there.
  • It has fewer criminals than the United States.
  • They are commonwealth, still technically part of the theocracy.
  • They don’t like paying taxes to a country that they have nothing to do with, they are very upset about that.
  • There, they really hate the English.
  • There is a pretty active node community down there.
  • They speak English, which is not the norm for other countries.

Ep 7:

  • It’s apparently very nice there, and usually it’s sunny, in the 80s.
  • Beer flowing and table tennis in the afternoon.
  • They are pretty cool people.
  • It’s really an awesome place, specially if you like surfing or good weather.

Ep 6:

  • Beer kinda sucks in Australia, but they have kangaroo.
  • They bring you the huge giant trash can, as seen on the commercial.
  • They don’t drink Foster’s.
  • You can’t go wrong with the kangaroo.
  • They import beer from other places too, they do have Budweiser too in Australia.

Ep 4:

  • Australia is awesome.
  • They have kangaroos there, actual kangaroos.

TODO: episode 5

NOTE: No sponsor(s) on Ep 1-3.

This post will be updated with future episodes. NodeUp hosts, please keep telling the world about how awesome Australia is! :D

Nestor – A Faster And Simpler CLI For Jenkins

It all started because at one point I was using a rather resource-challenged machine running Windows and an Ubuntu VM at the same time, and Firefox froze every so often, rendering Jenkins BuildMonitor and Jenkins web interface useless most of the time. So I looked for an alternative and gave Jenkins CLI a go.

Like most Java applications, Jenkins’ built-in CLI suffers from slow startup time (flame suit: ON) due to core Java libraries loading (Kohsuke later told me on #jenkins that there’s also a handshaking process involved). This led me to try the Jenkins Remote Access API with curl, which performed significantly faster than Jenkins CLI.

So that’s great, but I have another issue: Jenkins CLI’s commands start with “java -jar jenkins-cli.jar …”. That’s a finger twister right there, and a lengthy curl + URL obviously doesn’t help.

Enter Nestor, a Jenkins CLI written in Node.js that aims to be a faster and simpler alternative to the existing solutions. The catch? Node.js and npm support on Windows is not there yet, so if you manage to run Nestor on Windows please let me know about it. Nestor has been tested and used daily on OS X and Linux.

Simple setup

Install Nestor using npm install -g nestor

Configure the Jenkins instance you want to use using export JENKINS_URL=http://user:pass@host:port/path
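A URL in this form carries both the Jenkins location and the credentials. As a rough sketch of how such a value can be split apart in Node.js (not Nestor’s actual code; parseJenkinsUrl is a hypothetical helper), the WHATWG URL class does most of the work:

```javascript
// Split a JENKINS_URL-style value into a credential-free base URL and a
// Basic auth header value. Hypothetical helper, not taken from Nestor.
function parseJenkinsUrl(value) {
  const url = new URL(value);
  let authorization = null;
  if (url.username) {
    const credentials =
      decodeURIComponent(url.username) + ':' + decodeURIComponent(url.password);
    authorization = 'Basic ' + Buffer.from(credentials).toString('base64');
  }
  // Remove the credentials so they don't leak into logs or requests.
  url.username = '';
  url.password = '';
  return {
    baseUrl: url.toString().replace(/\/$/, ''),
    authorization: authorization,
  };
}

console.log(parseJenkinsUrl('http://user:pass@host:8080/path'));
```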

Simple usage

Nestor commands are simple: it’s always nestor <action> <param>

To trigger a build

> nestor build studio-bob
Job was started successfully

To view a job status

> nestor job studio-bob
Status: OK
No xml report files found for checkstyle
Build stability: 3 out of the last 5 builds failed.

To list the executors

> nestor executor
* master
39%	studio-bob

To view the queue

> nestor queue
Queue is empty

To view all jobs status on the dashboard

> nestor dashboard
WARN	blojsom-bloojm
OK	jenkins-buildmonitor
FAIL	studio-ae86
OK	studio-bob
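Jenkins’ JSON API reports each job’s state as a color value (blue, yellow, red, and so on, with an _anime suffix while a build is running). A dashboard like the one above can be derived from that with a small mapping. The sketch below is an assumption about how this could be done, not Nestor’s source, and the sample job list is made up.

```javascript
// Map a Jenkins job "color" to a short status label.
// The OK/WARN/FAIL labels mirror the dashboard output; the mapping is assumed.
function colorToStatus(color) {
  const base = String(color).replace(/_anime$/, ''); // "_anime" marks a build in progress
  switch (base) {
    case 'blue': return 'OK';
    case 'yellow': return 'WARN';
    case 'red': return 'FAIL';
    default: return 'UNKNOWN';
  }
}

// Format jobs as tab-separated "STATUS<TAB>name" lines.
function formatDashboard(jobs) {
  return jobs
    .map((job) => colorToStatus(job.color) + '\t' + job.name)
    .join('\n');
}

// Made-up sample in the shape of the `jobs` array from Jenkins' /api/json.
console.log(formatDashboard([
  { name: 'studio-ae86', color: 'red' },
  { name: 'studio-bob', color: 'blue_anime' },
]));
```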

Check out Nestor’s GitHub README page for the full list of available commands.

Hopefully that’s simple enough.

Note: The name Nestor was inspired by Captain Haddock’s butler at Marlinspike Hall, not the Argonaut one.

Node.js Presentations

I gave two Node.js-related talks within the past week.

The first one was titled “From Java To Node.js”, at Shine Technologies’ developers meeting on August 5th, 2011.

The second one was titled “JavaScript Everywhere From Nose To Tail”, at the Melbourne JavaScript user group on August 10th, 2011, with Carl Husselbee from Sensis.

Happy with the positive feedback from the audience of both talks, thanks folks, much appreciated!

Update (08/09/2011):

And here’s the video from the second talk…

JavaScript Everywhere – From Nose To Tail from Benjamin Pearson on Vimeo.

Using Node.js To Discover Jenkins On The Network

I’ve just added a new feature to Nestor to discover Jenkins on the network, and as it turned out, it’s pretty simple to do thanks to the Node.js datagram sockets API (hat tip Paul Querna).

Jenkins has a discovery feature as part of its remote access API where it listens on UDP port 33848, and whenever it receives a message, Jenkins will respond with an XML message containing the instance’s URL, version number, and slave port information.

So how do you send a UDP message using Node.js?
Here’s a sample function adapted from Nestor’s lib/service.js:

function sendUdp(message, host, port, cb) {
    var socket = require('dgram').createSocket('udp4'),
        buffer = Buffer.from(message); // new Buffer() is deprecated
    // pass socket errors to the callback
    socket.on("error", function (err) {
        cb(err);
    });
    // pass any response back to the caller
    socket.on("message", function (data) {
        cb(null, data);
    });
    socket.send(buffer, 0, buffer.length, port, host, function (err) {
        if (err) {
            cb(err);
        }
    });
}

For Jenkins discovery purposes, send any message to any hostname on port 33848:

sendUdp('Long live Jenkins!', 'localhost', 33848, function () { ... });

and if there’s any Jenkins instance running on localhost, it will respond with an XML like this:



konan cliffano$ nestor discover
Jenkins 1.414 running at http://localhost:8080/
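Turning the XML reply into a line like the one above only takes a little string handling. The sketch below assumes the reply wraps the version and URL in <version> and <url> elements; those element names, the sample string, and the helper functions are assumptions for illustration, so check them against what your Jenkins instance actually sends.

```javascript
// Pull the text of a single element out of a small XML string.
// Good enough for a flat reply; use a real XML parser for anything more complex.
function element(xml, name) {
  const match = new RegExp('<' + name + '>([^<]*)</' + name + '>').exec(xml);
  return match ? match[1] : null;
}

// Build a one-line summary, or return null if the reply is missing either field.
function describeInstance(xml) {
  const version = element(xml, 'version');
  const url = element(xml, 'url');
  if (!version || !url) return null;
  return 'Jenkins ' + version + ' running at ' + url;
}

// Assumed reply shape, for illustration only.
const sample = '<hudson><version>1.414</version><url>http://localhost:8080/</url></hudson>';
console.log(describeInstance(sample));
```

In practice the data argument passed to the sendUdp callback is a Buffer, so call data.toString() before handing it to describeInstance.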