[ale] How to debug a program that just goes away

Jim Lynch ale_nospam at fayettedigital.com
Mon Mar 1 06:00:21 EST 2010


David Ritchie wrote:
>>>>> I have a multi-threaded C++ program that occasionally just stops
>>>>> running.  At the time it stops it is usually not doing anything.  Every
>>>>> thread is either waiting on a semaphore or sleeping (Thread::sleep).
>>>>> It's event driven and no events have arrived for some time.  I have lots
>>>>> of prints to be able to tell where it is and what it's doing.  No core
>>>>> file generated.  No strange messages in any log file, either system or
>>>>> application.  No rogue processes killing it off.
>>>>>
>>>>>           
>
> Have you thought about running the sar data collector on a 5 minute
> interval, and running 'date >>log; ps -ef | grep process name >>log'
> every minute or so? If you do that, you would have some idea
> when the process is dying, its size (depending on the ps options you
> pass), and overall system memory usage. This might give you an idea
> whether the OOM killer is causing the problem. Does this get
> better if the machine has more memory? Also, are you catching all
> signals in the application
> so that you can log them as they occur?
>
> Just a few thoughts...
>
>   
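The signal-logging idea quoted above could look roughly like the sketch below. It is only an illustration using plain POSIX sigaction, not code from the program in question; the names install_signal_logger, log_signal, and last_logged_signal are made up for the example, and the list of signals is an arbitrary choice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t g_last_signal = 0;  // last signal seen by the handler
static int g_log_fd = STDERR_FILENO;             // descriptor the handler writes to

// Handler: sticks to async-signal-safe operations (no printf, no malloc).
static void log_signal(int signo) {
    g_last_signal = signo;
    char buf[32] = "caught signal ";
    std::size_t n = 0;
    while (buf[n] != '\0') ++n;            // find the end of the prefix
    char digits[12];
    int d = 0;
    int v = signo;
    do {                                   // render the signal number by hand
        digits[d++] = static_cast<char>('0' + v % 10);
        v /= 10;
    } while (v > 0);
    while (d > 0) buf[n++] = digits[--d];
    buf[n++] = '\n';
    ssize_t rc = write(g_log_fd, buf, n);
    (void)rc;                              // nothing useful to do on failure here
}

// Hypothetical helper: installs the logger for a few catchable signals
// and returns how many handlers were installed successfully.
int install_signal_logger(int log_fd) {
    assert(log_fd >= 0);
    g_log_fd = log_fd;
    const int signals[] = {SIGHUP, SIGINT, SIGTERM, SIGPIPE, SIGUSR1, SIGUSR2};
    struct sigaction sa;
    std::memset(&sa, 0, sizeof(sa));
    sa.sa_handler = log_signal;
    sigemptyset(&sa.sa_mask);
    int installed = 0;
    for (int s : signals) {
        if (sigaction(s, &sa, nullptr) == 0) ++installed;
    }
    return installed;
}

// Hypothetical accessor so normal (non-handler) code can report the last signal.
int last_logged_signal() { return g_last_signal; }
```

Two caveats. As written the handler swallows the signal, so a real program would probably restore the default action and re-raise after logging. And SIGKILL cannot be caught at all, so a kill by the Linux OOM killer would never show up in a log like this; the kernel log is the place to look for that.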
Thanks all for the suggestions.  This weekend I tweaked the code,
changing sleeps, doing threads differently, and using semaphores more
intelligently.  The result is that it no longer dies without comment;
instead it just comes to a screeching halt.  I can bring it down with
a killall -SEGV, which gives me a core file.  Inspecting that core tells
me there are 8 threads.  The first one is waiting on a join, which is
normal.  Four of them are waiting on a semaphore, which is also where
they spend most of their lives.  Three of them are in nanosleep.  None
of those threads is supposed to sleep for more than 10 seconds, but
they never wake up.  I'm using the Thread::sleep() static method from
the GNU Common C++ library, which is supposed to be thread safe.  I'm
about to take out all the sleeps except one and convert the others to
waiting on a semaphore which is triggered by a single thread.
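
For illustration, the "one sleeper, everyone else waits on a semaphore" arrangement might be sketched as follows. This uses the standard C++ library rather than GNU Common C++, so the Semaphore class and run_ticker_demo are stand-ins invented for the example, not the library's API or the actual program.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Minimal counting semaphore built from a mutex and a condition variable.
class Semaphore {
public:
    void post() {
        {
            std::lock_guard<std::mutex> lock(m_);
            ++count_;
        }
        cv_.notify_one();
    }
    void wait() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    int count_ = 0;
};

// Hypothetical demo: the ticker is the only thread that sleeps. Each worker
// blocks on its own semaphore and is woken once per tick.
// Returns the total number of worker wakeups.
int run_ticker_demo(std::size_t workers, int ticks,
                    std::chrono::milliseconds period) {
    assert(workers > 0 && ticks > 0);
    std::vector<Semaphore> sems(workers);
    std::atomic<int> wakeups{0};

    std::vector<std::thread> pool;
    for (std::size_t i = 0; i < workers; ++i) {
        pool.emplace_back([&sems, &wakeups, i, ticks] {
            for (int t = 0; t < ticks; ++t) {
                sems[i].wait();   // no sleep here, just block until posted
                ++wakeups;
            }
        });
    }

    std::thread ticker([&sems, ticks, period] {
        for (int t = 0; t < ticks; ++t) {
            std::this_thread::sleep_for(period);  // the single sleep
            for (auto& s : sems) s.post();
        }
    });

    ticker.join();
    for (auto& th : pool) th.join();
    return wakeups.load();
}
```

Calling run_ticker_demo(3, 4, std::chrono::milliseconds(5)) starts three workers and wakes each of them four times. Because the semaphore counts, a post that arrives before a worker gets back to wait() is not lost.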

I sometimes really hate computers.  ;)

Thanks,
Jim.

