I released DataGen v0.0.9 during lunch break yesterday. This version includes the support to limit how many workers can run concurrently, which is something that I’ve always wanted to add since day one. I finally got the time to do it last weekend, and it turned out to be an easy task thanks to Rod Vagg‘s worker-farm module.
Why is this necessary?
The problem with previous versions of DataGen was that when you want to generate 20 data files, then 20 worker processes will be created and run concurrently. It’s obviously not a great idea to have 20 processes fighting over 2 CPUs.
With v0.0.9, you can specify this limit using the new -m/–max-concurrent-workers flag: (if unspecified, it will default to the number of CPUs)
datagen gen -w 20 -m 2
When I first wrote about DataGen last year, I mentioned that I still needed to run some tests to verify my assumption about the optimal number of workers. So here it is one year later…
The first test is on a Linux box with 8 cores, where each data file contains 500,000 segments, each segment contains a segment ID, 6 strings, and 3 dates.
The second test is on an OSX box with 2 cores, where each data file contains 500,000 segments, but this time each segment only contains a segment ID.
As you can see, the performance is almost always best when the concurrent running worker processes are limited to the number of available CPUs (8 max concurrent workers on the first chart, and 2 on the second chart).
When you specify 20 workers and your laptop only has 2 CPUs, only 2 workers will generate the data file concurrently at any time, and you can be sure that it will be faster than having 20 workers generating 20 data files at the same time. And that’s why DataGen’s default setting allows as many concurrent workers as the available CPUs.