[ale] How to debug a program that just goes away

Doug McNash dmcnash at charter.net
Sun Feb 28 15:49:31 EST 2010


---- Jim Lynch <ale_nospam at fayettedigital.com> wrote: 
> David Tomaschik wrote:
> > Jim Lynch wrote:
> >   
> >> I have a multi-threaded C++ program that occasionally just stops 
> >> running.  At the time it stops it is usually not doing anything.  Every 
> >> thread is either waiting on a semaphore or sleeping (Thread::sleep).  
> >> It's event driven and no events have arrived for some time.  I have lots 
> >> of prints to be able to tell where it is and what it's doing.  No core 
> >> file generated.  No strange messages in any log file, either system or 
> >> application.  No rogue processes killing it off. 
> >>
> >> The program runs successfully on multiple other machines but not this 
> >> one.  It's a newer system than the others.  I recompiled on this system, 
> >> thinking it may help but no.  Access to this system is limited to two 
> >> people, myself and one other.  I trust him since he's got more to lose 
> >> than I do if it doesn't work.  I can work around it with a wrapper, 
> >> restarting when it fails, but I'd really like to understand how it's 
> >> happening. 
> >>
> >> I have ulimit -c 50000 in the script that runs it, so a core will be 
> >> generated if it aborts.  I trap SIGHUP, SIGINT, SIGCHLD and SIGQUIT and 
> >> will see something in the log file if a signal is trapped.  It's on a 
> >> CentOS 4.7 system.  Same OS as the other running systems.  The only 
> >> difference is that this is a newer dual core system.  Considerably 
> >> faster also.  I've run both a conventional kernel and an OpenVZ kernel.  
> >> I'm compiling with " -g -O2" flags.
> >>
> >> I have no idea how to proceed from here.  Can anyone suggest something I 
> >> could do to find out what's the cause?
> >>
> >> Thanks,
> >> Jim.
> >>   
> >>     
> > Have you considered running the app through gdb?  Multi-threaded apps
> > are a bit more difficult to debug than single-threaded ones, but not impossibly so.
> >
> > David
> >
> >   
> I have, but I didn't learn anything.  When it dies nothing is available 
> to inspect.  While it's running everything seems to be OK.  There are 7 
> permanent threads  with one or two transient threads run and destroyed. 
> 
> This program runs fine for many days on other hardware.  It's just these 
> two systems that give it grief.  I run it locally on a dual core system 
> and it works fine here.  They have two identical systems and it fails on 
> both.  I've run both the standard CentOS kernel and the OpenVZ kernel.  
> Nothing seems to affect it.
> 
> Jim.
> 

Since it's not leaving a core and none of the signals you trap are showing up in the log, the only mechanism I know of that's left is the OOM (out of memory) killer.  It kills with SIGKILL, which can't be caught and doesn't produce a core.  If, as you say, it is a large program sitting idle, that would tend to raise its /proc/<pid>/oom_score.  You can exempt it with echo -17 > /proc/<pid>/oom_adj, after which it will never be picked by the OOM killer.  If you do that, either you will get a panic or some other process will have to give up its life.
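
If you'd rather watch the score from a script than poke at /proc by hand, something like the following might do.  It is only a rough sketch in Python: it assumes the kernel on that box actually exposes oom_score and oom_adj under /proc/<pid> (newer kernels prefer oom_score_adj instead), and the function names and the proc_root parameter are made up for illustration.  Lowering oom_adj normally needs root.

```python
import os
from pathlib import Path

# Value from the message above; exempts the process on kernels that honor oom_adj.
OOM_DISABLE = -17


def read_oom_score(pid, proc_root="/proc"):
    """Return the kernel's current OOM score for pid as an int."""
    return int(Path(proc_root, str(pid), "oom_score").read_text().strip())


def set_oom_adj(pid, value=OOM_DISABLE, proc_root="/proc"):
    """Write value to <proc_root>/<pid>/oom_adj (assumes the file is supported)."""
    Path(proc_root, str(pid), "oom_adj").write_text(f"{value}\n")


if __name__ == "__main__":
    # Demonstrate on this script's own pid; substitute the real program's pid.
    pid = os.getpid()
    print("oom_score:", read_oom_score(pid))
```

Logging read_oom_score() for the program's pid every so often would show whether the score really does creep up while it sits idle.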

see:
http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_MRG/1.0/html/Realtime_Tuning_Guide/sect-Realtime_Tuning_Guide-General_System_Tuning-Swapping_and_Out_Of_Memory_Tips.html
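
Since you mentioned a restart wrapper, one way to test the theory might be to have the wrapper record how the program died.  The OOM killer uses SIGKILL, so the parent would see "killed by signal 9" even though the program itself never gets a chance to log anything.  Below is a minimal sketch in Python, not a drop-in replacement for your script; describe_exit and run_and_report are names I made up, and the program to run is whatever you pass on the command line.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a readable cause of death."""
    if returncode < 0:
        # subprocess reports death by signal N as a return code of -N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {-returncode} ({name})"
    return f"exited with status {returncode}"


def run_and_report(argv):
    """Run argv to completion and describe how it ended."""
    result = subprocess.run(argv)
    return describe_exit(result.returncode)


if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("usage: wrapper.py program [args...]")
    print(run_and_report(sys.argv[1:]), file=sys.stderr)
```

If the report comes back as SIGKILL, that points at the OOM killer (or someone's kill -9); a plain exit status would mean the program is leaving on its own.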

--
doug mcnash


