[ale] coding practices

Michael B. Trausch mike at trausch.us
Wed Mar 3 23:45:49 EST 2010


On 03/03/2010 05:31 PM, Jim Kinney wrote:
> Third party also wants the total row count in the data file appended to
> the data file.
>
> This is where I disagree. I would far rather put the additional data in
> the done file and not alter the output in any way.
>
> Granted, adding the row count is trivial (wc -l filename >> filename)
> and that last line will be nothing like the actual data lines. It does
> make reprocessing the data files more complicated as they have to be
> checked for the presence of the row count on the last line before
> rerunning the import process again.
>
> Other views?

I don't like that idea at all.  If I am writing something to process 
data, I treat the data in question as immutable.  I would expect that 
they should have no problem using a metadata file.
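
A minimal sketch of the metadata-file approach, in Python.  The ".done" 
suffix, the "rows=" key and the function names are made up for 
illustration; nothing here is something the third party has specified.  
The data file is only ever opened for reading:

```python
def count_rows(data_path):
    """Count newline characters in the file, the same thing `wc -l` counts."""
    rows = 0
    with open(data_path, "rb") as f:
        # Read in 1 MiB chunks so large exports do not have to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            rows += chunk.count(b"\n")
    return rows


def write_done_file(data_path):
    """Write a companion file holding the row count; never modify the data."""
    done_path = data_path + ".done"  # hypothetical naming convention
    with open(done_path, "w") as f:
        f.write("rows=%d\n" % count_rows(data_path))
    return done_path


def read_row_count(done_path):
    """Return the row count recorded in the companion file."""
    with open(done_path) as f:
        for line in f:
            key, _, value = line.strip().partition("=")
            if key == "rows":
                return int(value)
    raise ValueError("no row count recorded in " + done_path)
```

Reprocessing then needs no special case: the import reads the data file 
exactly as it was produced and checks the count against the companion file.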

If it absolutely _must_ be in the same file, I would consider using a 
structured storage file format (essentially, a dynamically growing, 
file-oriented pseudo-filesystem) of some sort, or a ZIP file.  I would 
lean towards the former, as opposed to the latter, though, because of 
the lower amount of overhead incurred to read and update the contents of 
a structured storage file.  Of course, neither of those options leaves 
you with a single file that is plain-text.

I can think of a few viable options.  You could use DBM files, assuming 
that they permit arbitrarily-sized values for a given key.  You could 
also use XML, if you're not allergic to it.  I'd probably not, myself, 
since most data has to be transformed in order to be embedded properly 
in XML.
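
For the DBM route, a rough sketch using Python's standard dbm module.  
The key names "data" and "row_count" and both function names are 
assumptions for illustration.  Whether large values are accepted depends 
on which DBM backend the module picks on a given system, so that still 
needs checking before relying on it:

```python
import dbm


def store_payload(db_path, data, row_count):
    """Store the raw data and its row count under separate keys."""
    with dbm.open(db_path, "c") as db:  # "c": create the database if missing
        db[b"data"] = data
        db[b"row_count"] = str(row_count).encode("ascii")


def load_payload(db_path):
    """Return (data, row_count) as stored by store_payload()."""
    with dbm.open(db_path, "r") as db:
        return db[b"data"], int(db[b"row_count"].decode("ascii"))
```

The data bytes go in and come out unchanged; the count lives beside them 
rather than inside them.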

You could use something like a compound document format, though I'd say 
that the best one to use in this case would likely be a lightweight, 
home-brew sort of system that doesn't utilize any sort of compression or 
other sorts of overhead-producing things.  The added bonus there is that 
you could use it like a primitive stream- or block-oriented key/value 
storage system and add additional metadata to the file if needed.
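
One way such a home-brew container could look, as a sketch only.  The 
layout (a four-byte tag followed by length-prefixed key/value records), 
the "KVC1" tag and the function names are all invented here; this is not 
an existing format:

```python
import os
import struct

MAGIC = b"KVC1"                 # made-up format tag
_HEADER = struct.Struct(">II")  # key length, value length (big-endian uint32)


def append_record(path, key, value):
    """Append one key/value record; bytes already in the file are not rewritten."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "ab") as f:
        if is_new:
            f.write(MAGIC)
        f.write(_HEADER.pack(len(key), len(value)))
        f.write(key)
        f.write(value)


def read_records(path):
    """Yield (key, value) pairs in the order they were appended."""
    with open(path, "rb") as f:
        if f.read(len(MAGIC)) != MAGIC:
            raise ValueError("not a container file: " + path)
        while True:
            header = f.read(_HEADER.size)
            if not header:
                return  # clean end of file
            if len(header) != _HEADER.size:
                raise ValueError("truncated record header")
            key_len, value_len = _HEADER.unpack(header)
            key = f.read(key_len)
            value = f.read(value_len)
            if len(key) != key_len or len(value) != value_len:
                raise ValueError("truncated record")
            yield key, value
```

The data would go in as one record, say under the key b"data", and the 
row count or any later metadata as further records appended after it.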

Of course, whether or not you can do any of those things is up to the 
people you're working for.  I would at the _very_ least force the issue 
that the original data should be immutable and that there should be some 
other means of storing the metadata.  The most important thing in doing 
so is that there is a convention for doing it; the technical details as 
to how don't matter as much.

	--- Mike

-- 
Michael B. Trausch                                    ☎ (404) 492-6475

