[ale] [Sorta-OT] Some interesting stats, and a lesson...

Michael B. Trausch mike at trausch.us
Thu Oct 14 14:00:35 EDT 2010


So today I needed to generate a list of invalid SSNs in order to create
a test database that includes SSN data.  After 2011, the government is
widening the range of allowed SSNs, such that they must satisfy the
following invariants:

  * Numbers starting with 000, 666, and 900-999 are invalid.
  * Numbers with only zeroes in any segment are invalid.
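
The two rules above can be sketched as a small C predicate.  This is a
minimal illustration, not the program described in this message: the
function name and the three-integer signature are assumptions, and the
caller is assumed to pass segments already in range (0-999, 0-99, and
0-9999).

```c
#include <stdbool.h>

/* Returns true when area-group-serial breaks either rule listed above.
 * Hypothetical helper: the name and signature are illustrative only. */
bool ssn_is_invalid(int area, int group, int serial)
{
    /* Rule 1: the first segment is 000, 666, or 900-999. */
    if (area == 0 || area == 666 || area >= 900)
        return true;

    /* Rule 2: some segment consists only of zeroes.  An all-zero first
     * segment was already caught above, so check the other two. */
    if (group == 0 || serial == 0)
        return true;

    return false;
}
```

For example, ssn_is_invalid(123, 0, 4567) is true under the second
rule, while ssn_is_invalid(123, 45, 6789) is false.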

So, I thought I would use the shell to generate this list.  After all,
it's just three loops and a printf command, so it should be pretty fast,
right?

Nope.  After waiting 45 minutes or so, I decided to write a C program to
do it.  Twice, because I wanted to see the difference between the size
of the data file in an unpacked binary format vs. a formatted ASCII
format.  The shell loop still wasn't finished when I got the results
from the C programs, so I killed it.

It took 14.963 seconds wall-clock time for a C program to generate a
binary file with every invalid SSN in it, and that file is 555,349,000
bytes long (each "record" is fixed-length, a uint16_t, a uint8_t, and a
uint16_t).  Of course, if I were going to use this in a real
application, I would probably pack the values better and eliminate a lot
of the redundant data, ensuring that the file were ordered, indexed, and
so forth, so it could potentially be a lot smaller.  Note that this does
not include _every_ potential invalid 
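
The record layout described above can be sketched in C as follows.
This is an illustration under stated assumptions, not the original
program: the function names are hypothetical, and the fields are
written one at a time because a struct holding a uint16_t, a uint8_t,
and a uint16_t is typically padded to 6 bytes, while the reported file
size divides evenly by 5 (555,349,000 / 5 = 111,069,800 records).  The
second function shows one possible tighter packing, a single 32-bit
integer, which works because 999,999,999 is below 2^30.

```c
#include <stdint.h>
#include <stdio.h>

/* Writes one fixed-length 5-byte record (uint16_t, uint8_t, uint16_t)
 * field by field, so no struct padding ends up in the file.  Values are
 * written in the host's byte order.  Returns 1 on success, 0 on error.
 * Hypothetical helper, named for illustration only. */
int write_record(FILE *out, uint16_t area, uint8_t group, uint16_t serial)
{
    return fwrite(&area, sizeof area, 1, out) == 1
        && fwrite(&group, sizeof group, 1, out) == 1
        && fwrite(&serial, sizeof serial, 1, out) == 1;
}

/* One tighter alternative: store the nine digits as a single number.
 * 123-45-6789 becomes 123456789, which fits in a uint32_t (4 bytes). */
uint32_t pack_ssn(uint16_t area, uint8_t group, uint16_t serial)
{
    return (uint32_t)area * 1000000u + (uint32_t)group * 10000u + serial;
}
```

At 4 bytes per record, the same 111,069,800 records would occupy
444,279,200 bytes, before any ordering or indexing tricks.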

It took 50.945 seconds (again, wall-clock time) for a C program to
generate a formatted ASCII file with every such number in it, and that
file is 1,332,837,600 bytes long.  Again, this is wildly inefficient:
it's just a list of "%03d-%02d-%04d\n" entries.
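
To make the size arithmetic concrete: each entry in that format is
3 + 1 + 2 + 1 + 4 characters plus a newline, i.e. 12 bytes, so the
reported size works out to 1,332,837,600 / 12 = 111,069,800 entries.
A minimal C sketch of formatting one entry follows; the function name
and the constant are assumptions made for illustration.

```c
#include <stdio.h>

/* Length of one formatted entry: "AAA-GG-SSSS" plus the newline. */
enum { SSN_LINE_LEN = 12 };

/* Formats one entry into buf, which should hold at least
 * SSN_LINE_LEN + 1 bytes.  Returns what snprintf returns: the number
 * of characters the full entry needs, excluding the terminating NUL.
 * Hypothetical helper, named for illustration only. */
int format_ssn(char *buf, size_t size, int area, int group, int serial)
{
    return snprintf(buf, size, "%03d-%02d-%04d\n", area, group, serial);
}
```

Compared with a 5-byte binary record, the 12-byte text line accounts
for the 2.4x difference between the two file sizes.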

The lesson, of course, is to not use the shell to do a job that would be
much better suited to a simple C99 program...

Honestly, I would not have expected this result.  Even though I know
that it takes the shell 7 children to do what I asked it to do, I
figured that I'd be able to generate the sample dataset in a reasonable
amount of time.  I guess I was wrong.  And the shell command wasn't even
generating as comprehensive a list: it was omitting the numbers that are
invalid only because of the second rule listed above.

	--- Mike


