[ale] How to debug a program that just goes away

Jim Lynch ale_nospam at fayettedigital.com
Sat Feb 27 06:49:32 EST 2010


David Tomaschik wrote:
> Jim Lynch wrote:
>   
>> I have a multi-threaded C++ program that occasionally just stops 
>> running.  At the time it stops it is usually not doing anything.  Every 
>> thread is either waiting on a semaphore or sleeping (Thread::sleep).  
>> It's event driven and no events have arrived for some time.  I have lots 
>> of prints to be able to tell where it is and what it's doing.  No core 
>> file generated.  No strange messages in any log file, either system or 
>> application.  No rogue processes killing it off. 
>>
>> The program runs successfully on multiple other machines but not this 
>> one.  It's a newer system than the others.  I recompiled on this system, 
>> thinking it may help but no.  Access to this system is limited to two 
>> people, myself and one other.  I trust him since he's got more to lose 
>> than I do if it doesn't work.  I can work around it with a wrapper, 
>> restarting when it fails, but I'd really like to understand how it's 
>> happening. 
>>
>> I have ulimit -c 50000 in the script that runs it, so a core will be 
>> generated if it aborts.  I trap SIGHUP, SIGINT, SIGCHLD and SIGQUIT and 
>> will see something in the log file if a signal is trapped.  It's on a 
>> CentOS 4.7 system.  Same OS as the other running systems.  The only 
>> difference is that this is a newer dual core system.  Considerably 
>> faster also.  I've run both a conventional kernel and an openvz kernel.  
>> I'm compiling with " -g -O2" flags.
>>
>> I have no idea how to proceed from here.  Can anyone suggest something I 
>> could do to find out what's the cause?
>>
>> Thanks,
>> Jim.
>>   
>>     
> Have you considered running the app through gdb?  Multi-threaded apps
> are a bit more difficult to debug in gdb than single-threaded ones,
> but not impossibly so.
>
> David
>
>   
I have, but I didn't learn anything.  When it dies nothing is available 
to inspect.  While it's running everything seems to be OK.  There are 7 
permanent threads, plus one or two transient threads that are run and 
then destroyed. 
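
Since nothing is left to inspect after the process is gone, one option is to let the restart wrapper mentioned in the quoted message also record how the program ended. The following is only a sketch under assumptions: the names describe_status and run_and_wait are hypothetical, and it relies on nothing beyond the POSIX fork/execvp/waitpid calls. It distinguishes a normal exit (with its exit code) from death by a signal, which would include uncatchable ones such as SIGKILL.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cerrno>
#include <cstring>
#include <sstream>
#include <string>

// Turn a waitpid() status word into a readable line for the log.
// (Hypothetical helper, not part of the original program.)
std::string describe_status(int status) {
    std::ostringstream out;
    if (WIFEXITED(status)) {
        // The program called exit() or returned from main().
        out << "exited with code " << WEXITSTATUS(status);
    } else if (WIFSIGNALED(status)) {
        // The program was terminated by a signal it did not handle.
        int signo = WTERMSIG(status);
        out << "killed by signal " << signo;
        const char* name = strsignal(signo);
        if (name != 0) out << " (" << name << ")";
#ifdef WCOREDUMP
        if (WCOREDUMP(status)) out << ", core dumped";
#endif
    } else {
        out << "unknown wait status " << status;
    }
    return out.str();
}

// Fork, exec argv[0] with argv, and block until the child ends.
// Returns the raw waitpid() status, or -1 if fork or waitpid failed.
int run_and_wait(char* const argv[]) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);  // only reached if exec itself failed
    }
    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) return -1;  // retry only on interruption
    }
    return status;
}
```

A wrapper built on this would call run_and_wait with the program's command line, write describe_status(status) plus a timestamp to the log, and then restart. An "exited with code N" line would point at a code path that calls exit(), while a "killed by signal N" line would point at something outside the handlers already installed.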

This program runs fine for many days on other hardware.  It's just these 
two systems that give it grief.  I run it locally on a dual core system 
and it works fine here.  They have two identical systems and it fails on 
both.  I've run both the standard CentOS kernel and the openvz kernel.  
Nothing seems to affect it.
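
The signals trapped in the quoted message are SIGHUP, SIGINT, SIGCHLD and SIGQUIT. Several other signals (SIGTERM, SIGPIPE, SIGALRM, SIGUSR1, SIGUSR2) terminate a process by default without producing a core file, which would look like the program just going away. A minimal diagnostic sketch, assuming it is acceptable to catch those signals temporarily while hunting the problem, could widen the logging as below. The names log_signal, install_signal_logger and last_signal are hypothetical, and the message goes to stderr rather than the application's own log.

```cpp
#include <signal.h>
#include <unistd.h>

#include <cstddef>
#include <cstring>

// Last signal seen by the handler; sig_atomic_t is safe to set in a handler.
volatile sig_atomic_t last_signal = 0;

// Handler restricted to async-signal-safe work: it stores the signal
// number and emits a fixed-size message with write(2).
extern "C" void log_signal(int signo) {
    last_signal = signo;
    char msg[] = "caught signal NN\n";
    msg[14] = static_cast<char>('0' + (signo / 10) % 10);
    msg[15] = static_cast<char>('0' + signo % 10);
    ssize_t n = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)n;  // nothing useful can be done about a failed write here
}

// Install log_signal for a wider set of catchable signals and return how
// many handlers were installed. SIGKILL and SIGSTOP cannot be caught.
int install_signal_logger() {
    const int signals[] = {SIGHUP, SIGINT, SIGQUIT, SIGTERM,
                           SIGPIPE, SIGALRM, SIGUSR1, SIGUSR2};
    struct sigaction sa;
    std::memset(&sa, 0, sizeof(sa));
    sa.sa_handler = log_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;  // restart interrupted system calls

    int installed = 0;
    for (std::size_t i = 0; i < sizeof(signals) / sizeof(signals[0]); ++i) {
        if (sigaction(signals[i], &sa, 0) == 0) ++installed;
    }
    return installed;
}
```

Calling install_signal_logger() once at startup, before the threads are created, would be enough for a test run. Note that catching these signals changes behaviour: the process will log and keep running instead of terminating, so this is meant for diagnosis only.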

Jim.


