[ale] How to debug a program that just goes away
Jim Lynch
ale_nospam at fayettedigital.com
Fri Feb 26 07:23:07 EST 2010
I have a multi-threaded C++ program that occasionally just stops
running. At the time it stops it is usually not doing anything. Every
thread is either waiting on a semaphore or sleeping (Thread::sleep).
It's event driven and no events have arrived for some time. I have lots
of prints to be able to tell where it is and what it's doing. No core
file generated. No strange messages in any log file, either system or
application. No rogue processes killing it off.
The program runs successfully on multiple other machines but not this
one. It's a newer system than the others. I recompiled on this system,
thinking it might help, but it didn't. Access to this system is limited to two
people, myself and one other. I trust him since he's got more to lose
than I do if it doesn't work. I can work around it with a wrapper,
restarting when it fails, but I'd really like to understand how it's
happening.
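A restart wrapper can also record why the program went away, since the wait status says whether the child exited on its own or was killed by a signal. The following is only a minimal sketch under that assumption; run_once and describe_status are hypothetical names, not part of the program described above.

```cpp
#include <cerrno>
#include <sstream>
#include <string>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Start argv[0] with the null-terminated argument list argv, wait for it
// to finish, and store the raw wait status. Returns false if fork() or
// waitpid() failed.
bool run_once(char* const argv[], int& status) {
    pid_t pid = fork();
    if (pid < 0) return false;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);  // exec failed; same exit code the shell uses
    }
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) return false;  // retry only on interruption
    }
    return true;
}

// Turn a wait status into a line suitable for the wrapper's log.
std::string describe_status(int status) {
    std::ostringstream out;
    if (WIFEXITED(status)) {
        out << "exited with code " << WEXITSTATUS(status);
    } else if (WIFSIGNALED(status)) {
        int sig = WTERMSIG(status);
        out << "killed by signal " << sig << " (" << strsignal(sig) << ")";
#ifdef WCOREDUMP
        if (WCOREDUMP(status)) out << ", core dumped";
#endif
    } else {
        out << "unrecognized wait status " << status;
    }
    return out.str();
}
```

The wrapper would call run_once in a loop and log describe_status(status) before each restart. An "exited with code" line points at a normal exit path inside the program (exit(), or a return from main), while a "killed by signal" line names the signal that ended it.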
I have ulimit -c 50000 in the script that runs it, so a core will be
generated if it aborts. I trap SIGHUP, SIGINT, SIGCHLD and SIGQUIT and
will see something in the log file if a signal is trapped. It's on a
CentOS 4.7 system. Same OS as the other running systems. The only
difference is that this is a newer dual core system. Considerably
faster also. I've run both a conventional kernel and an OpenVZ kernel.
I'm compiling with "-g -O2" flags.
I have no idea how to proceed from here. Can anyone suggest something I
could do to find out what's the cause?
Thanks,
Jim.