[ale] Dealing with really big log files....

Kenneth Ratliff lists at noctum.net
Tue Mar 24 16:12:14 EDT 2009




On Mar 23, 2009, at 2:41 PM, Michael B. Trausch wrote:
> Possibly.  Depends on how heavily loaded the system is from an I/O
> standpoint---you've got the advantage of readahead caching if you're
> scanning sequentially, and that advantage doesn't seem like much until
> the system is really heavily bogged down.  I wasn't making the
> assumption that this work was being done on a lightly-loaded desktop
> machine.  A very heavily loaded machine would really suck to be doing a
> binary search on, since you'd likely be I/O bound at every iteration.
>
> Anyway, tomato/tomahto.  They're both valid approaches, but the right
> one would (as is always the case) depend on more variables than were
> ever discussed in the first place.
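
As an aside, for anyone who wants to try the binary-search route being
discussed, a rough sketch in shell might look like the following. The
file name, cutoff timestamp, and timestamp format are all made up for
illustration; it assumes each line starts with a lexically sortable
timestamp, which a real MySQL log won't give you without some massaging.

    FILE=mysql.log               # hypothetical file name
    TARGET="2009-03-23 14:00"    # hypothetical cutoff timestamp
    lo=0
    hi=$(stat -c %s "$FILE")     # file size in bytes
    while [ $((hi - lo)) -gt 65536 ]; do
        mid=$(( (lo + hi) / 2 ))
        # seek to the midpoint, drop the partial line, examine the next one
        line=$(tail -c +"$((mid + 1))" "$FILE" | head -n 2 | tail -n 1)
        if [[ "${line:0:16}" < "$TARGET" ]]; then
            lo=$mid              # cutoff is further along in the file
        else
            hi=$mid              # cutoff is earlier in the file
        fi
    done
    tail -c +"$((lo + 1))" "$FILE" > rough_cut.log

Each iteration only touches a few kilobytes, which is the appeal on a
box that isn't already I/O bound; on a heavily loaded one, every seek
still has to hit the disk, which is the point being made above.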


In this particular case, the machine was a 64-bit Debian box with 8 GB
of RAM and a fairly light load. It had two slaves, so the only real
operations it was performing were updates, inserts, and replication;
the majority of the workload was on the two slaves, which serve about
10 web frontends.

I knew roughly where the data I needed was: it was at the end of the
log file, which was about 4 billion lines long, and I needed roughly
the 57 million lines at the end. It wasn't an extremely time-sensitive
issue; I just needed to pare the log file down far enough that I could
process it with my normal selection of grep and awk to extract the
connections, along with the UPDATEs and INSERTs that went with them,
and pass the result along to the customer. Once I knew what line the
data I needed started on and the total number of lines in the file,
tail -n was enough to get me the rough cut, bringing the log file down
to 2.4 GB. From there I was able to easily cut it down to 1.6 GB,
extract the information the customer wanted, and pass that along
together with the raw log for the time period they were interested in,
in case they wanted to parse it themselves. All in all, the log file
growing as much as it did added about 7 hours of latency to filling
the customer's request. The underlying problem was that MySQL was no
longer rotating its log files properly, which was directly responsible
for the added latency. Made damn sure that particular problem was fixed.
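
For what it's worth, the rough cut didn't involve anything fancier than
something along these lines (file names and the grep pattern here are
made up for illustration):

    # the log was ~4 billion lines and the window I wanted was roughly
    # the last 57 million, so the rough cut was just a big tail:
    tail -n 57000000 mysql.log > rough_cut.log
    # knowing only the starting line and the total line count, the same
    # thing works out as:
    #   tail -n "$((TOTAL_LINES - START_LINE + 1))" mysql.log
    # then the usual grep/awk pass for the connections and the UPDATEs
    # and INSERTs that go with them:
    grep -E 'Connect|UPDATE|INSERT' rough_cut.log > for_customer.log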



