|
Hi,
we have freeradius 2.1.8 running on a couple of servers and are very happy with it. But every few days FR crashes on one of the servers (a random one, not always the same). The load is significant (average 150 requests/s per server, 400/s peak) but surely not too high. So everything seems to run fine besides the annoying crashes, which alarm people and make the weekly availability reports look bad (even though FR is restarted automatically, of course). The previous 1.1.8 installation we upgraded from 6 months ago did not have this problem.

Anyway, I really want to find out what's going wrong, so I wanted to get core dumps of these crashes. Only I just don't get them.

- radiusd.conf has allow_core_dumps = yes (and FR says "Info: Core dumps are enabled." at startup)
- /proc/sys/kernel/core_pattern is set to '/tmp/core.%t.%e.%p', so core dumps can be written to disk (tested with a little program that forces a segfault)
- I put "ulimit -c unlimited" in the startup script. cat /proc/$(pidof freeradius)/limits shows "unlimited" for soft and hard limit of "Max core file size"

So what's missing? The only indication of the crash is this line in syslog:

> Apr 10 17:57:19 xxxxxxxx kernel: [12268615.000288] freeradius[14846]: segfault at 73818 ip 00007f0cb40e875e sp 00007fff9c6304c0 error 4 in libfreeradius-radius-2.1.8.so[7f0cb40d1000+1f000]

(This is debian lenny x86_64, btw.)

Any hints? I even thought about running FR as a foreground process or even with gdb, but I wanted to check here first.

Regards and thanks in advance,
Jakob

-
List info/subscribe/unsubscribe? See http://www.freeradius.org/list/users.html
|
Jakob Hirsch wrote:
> we have freeradius 2.1.8 running on a couple of servers and are very
> happy with it. But every few days FR crashes on one of the servers (a
> random one, not always the same). The load is significant (average 150
> requests/s per server, 400/s peak) but surely not too high. So
> everything seems to run fine besides the annoying crashes, which alarm
> people and make the weekly availability reports look bad (even though FR
> is restarted automatically, of course). The previous 1.1.8 installation
> we upgraded from 6 months ago did not have this problem.

Hmm... I've run it at 20K pps for *days*....

> Anyway, I really want to find out what's going wrong, so I wanted to
> get core dumps of these crashes. Only I just don't get them.
> - radiusd.conf has allow_core_dumps = yes (and FR says "Info: Core dumps
> are enabled." at startup)
> - /proc/sys/kernel/core_pattern is set to '/tmp/core.%t.%e.%p', so core
> dumps can be written to disk (tested with a little program that forces
> a segfault)
> - I put "ulimit -c unlimited" in the startup script.
> cat /proc/$(pidof freeradius)/limits shows "unlimited" for soft and hard
> limit of "Max core file size"

Often 'root' can't core dump, and programs that change uid can't core dump. It's hard to know what's going on with the OS.

> So what's missing? The only indication of the crash is this line in syslog:
>
>> Apr 10 17:57:19 xxxxxxxx kernel: [12268615.000288] freeradius[14846]: segfault at 73818 ip 00007f0cb40e875e sp 00007fff9c6304c0 error 4 in libfreeradius-radius-2.1.8.so[7f0cb40d1000+1f000]
>
> (This is debian lenny x86_64, btw.)
>
> Any hints?

doc/bugs. You'll need symbols to find out what's going on.

Alan DeKok.
|
Hi all,
I have a similar problem on a machine with CentOS 5 Update 4. The freeradius packages I use are taken from the jdennis repository (http://people.redhat.com/jdennis/freeradius-rhel-centos/).

Package versions are:

freeradius2-utils-2.1.8-2.el5
freeradius2-2.1.8-2.el5
freeradius2-mysql-2.1.8-2.el5
freeradius2-perl-2.1.8-2.el5

I configured two different freeradius servers on the same machine. Compared to the default configuration, the first radius adds sqlippool, and the second radius calls an external perl script which in turn connects via ssh to another server and then runs some other scripts on it.

The first radius, with sqlippool, has been running for more than 2 months without any problem at all. The second one, which calls the external perl script, hangs after a few hours. When I issue the status command the result is

> # /etc/init.d/radiusd status
> radiusd dead but pid file exists

I configured freeradius to call the script with the perl module as well as with the exec module, with identical results. After the radius stops, I see that the perl script log file always stops at the point where it tries to ssh to the other server. The other server's statistics don't show anything unusual (e.g. high cpu), and there is no error in its ssh log file.

Of course, the issue is not that there is some problem with the perl script, the ssh command or the remote server, but that the radius hangs when the external script it calls hangs.

Note that this behavior can be reproduced by calling an external script like the following

> #!/usr/bin/perl
>
> use strict;
>
> my $rc=system("ssh 1.1.1.1");
> exit($rc);

1.1.1.1 is just an IP address that would cause the ssh to time out. Note that the freeradius server does not hang when started in debug mode.

We have used exactly the same perl script for the last few years without any problem on another machine which runs freeradius version 1.0.1.
Regards,
Stylianos
|
Stylianos Stylianou wrote:
> The second one which calls an external perl script hangs after a few hours.
> When I issue the status command the result is
>> # /etc/init.d/radiusd status
>> radiusd dead but pid file exists

Whoops.

> Note that this behavior can be reproduced by calling an external script
> like the following

It's always good to have a test case.

> We use exactly the same perl script for the last few years without any
> problem on another machine which runs freeradius version 1.0.1.

Hm... the problem code seems to be the same in all versions of the server. However, in older versions, the server core would notice, and kill the thread. This worked around the problem without fixing it.

In any case... a fix will be in 2.1.9. Until it's released, you could try grabbing the v2.1.x branch from git.freeradius.org.

The problem is that the server was blocked trying to read STDOUT of the child. The solution is to not block, and give up reading if it takes too long.

Alan DeKok.
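
Editor's sketch of the "don't block, give up if it takes too long" idea described above. This is not the FreeRADIUS fix (which lives in the server's C code); it is only a minimal Python illustration, and the name `run_with_deadline` is hypothetical. It assumes a POSIX system. The child is started in its own session so that a grandchild holding the pipe open (like the `ssh` spawned by the perl test script) can be killed along with it.

```python
import os
import signal
import subprocess


def run_with_deadline(argv, timeout_s):
    """Run argv and collect its stdout, but never wait longer than timeout_s.

    Returns (exit_code, stdout_bytes) on normal completion, or
    (None, partial_stdout_bytes) if the deadline expired and the child's
    whole process group had to be killed.
    """
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # child becomes leader of a new process group
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
        return proc.returncode, out
    except subprocess.TimeoutExpired:
        try:
            # Kill the child and anything it spawned, so the pipe gets closed.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group vanished between the timeout and the kill
        out, _ = proc.communicate()  # reap the child, keep whatever was read
        return None, out
```

With something like `run_with_deadline(["ssh", "1.1.1.1"], 5)` the caller would get control back after roughly five seconds instead of waiting for the ssh connection attempt to time out.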
|
On Fri, Apr 16, 2010 at 12:05:38PM +0200, Alan DeKok wrote:
> Jakob Hirsch wrote:
> > Anyways, I really want to find out what's going wrong, so I wanted to
> > get core dumps of these crashes. Only that I just don't get them.
> >
> > So what's missing? The only indication of the crash is this line in syslog:
> >
> >> Apr 10 17:57:19 xxxxxxxx kernel: [12268615.000288] freeradius[14846]: segfault at 73818 ip 00007f0cb40e875e sp 00007fff9c6304c0 error 4 in libfreeradius-radius-2.1.8.so[7f0cb40d1000+1f000]
> >
> > (This is debian lenny x86_64, btw.)
> >
> > Any hints?
>
> doc/bugs. You'll need symbols to find out what's going on.

For Debian users you can recommend installing the symbols from the package freeradius-dbg.

See also http://packages.debian.org/freeradius-dbg

--
2. That which causes joy or happiness.
|
Alan DeKok, 2010-04-16 12:05:
> Often 'root' can't core dump, and programs that change uid can't core
> dump. It's hard to know what's going on with the OS.

ok, I dug deeper into this and made some tests:

- no core dump with kill -11
- /proc/sys/fs/suid_dumpable is 0, set it to 1 and restart FR
- kill -11 -> core dump, yeah!

So it's probably a problem with the uid change disabling the process' dumpability (I found nothing in /proc/[pid]/* where I can see this). So we now have all machines running with /proc/sys/fs/suid_dumpable set to 1. Strange thing is, this should not be necessary with the prctl(PR_SET_DUMPABLE, 1) in mainconfig.c:698.

Anyway, I'm now looking forward to FR crashing :)

>> Any hints?
> doc/bugs. You'll need symbols to find out what's going on.

I know, and I have them (in the -dbg package), but they are useless without a core dump :)

Maybe the info about /proc/sys/fs/suid_dumpable should be added to doc/bugs...

Thanks for your input!

Regards,
J
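
Editor's sketch: the settings checked in this thread (core size limit, core_pattern, suid_dumpable) can be gathered in one place. The following is a hedged Python illustration for Linux, not part of FreeRADIUS; the function names are hypothetical. Note that it reports the limits of the process running the script, so for a daemon the /proc/<pid>/limits check mentioned earlier in the thread is still needed.

```python
import resource


def read_proc(path):
    """Return the stripped contents of a /proc file, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def core_dump_report():
    """Collect the core-dump related settings discussed in the thread."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    return {
        "rlimit_core_soft": soft,  # resource.RLIM_INFINITY means "unlimited"
        "rlimit_core_hard": hard,
        "core_pattern": read_proc("/proc/sys/kernel/core_pattern"),
        "suid_dumpable": read_proc("/proc/sys/fs/suid_dumpable"),
    }


def problems(report):
    """Return human-readable reasons why a core dump may not be written."""
    issues = []
    if report["rlimit_core_soft"] == 0:
        issues.append("soft core size limit is 0 (try 'ulimit -c unlimited')")
    if report["suid_dumpable"] == "0":
        issues.append(
            "fs.suid_dumpable is 0: a process that changed its UID "
            "will not dump core unless it re-enables dumping itself"
        )
    return issues
```

Printing `problems(core_dump_report())` from the daemon's startup script would have flagged the suid_dumpable setting that turned out to matter here.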
|
Hi,
> Maybe the info about /proc/sys/fs/suid_dumpable should be added to
> doc/bugs...

to quote the man page:

/proc/sys/fs/suid_dumpable (since Linux 2.6.13)
    The value in this file determines whether core dump files are produced for set-user-ID or otherwise protected/tainted binaries. Three different integer values can be specified:

    0 (default)
        This provides the traditional (pre-Linux 2.6.13) behavior. A core dump will not be produced for a process which has changed credentials (by calling seteuid(2), setgid(2), or similar, or by executing a set-user-ID or set-group-ID program) or whose binary does not have read permission enabled.

    1 ("debug")
        All processes dump core when possible. The core dump is owned by the file system user ID of the dumping process and no security is applied. This is intended for system debugging situations only. Ptrace is unchecked.

    2 ("suidsafe")
        Any binary which normally would not be dumped (see "0" above) is dumped readable by root only. This allows the user to remove the core dump file but not to read it. For security reasons core dumps in this mode will not overwrite one another or other files. This mode is appropriate when administrators are attempting to debug problems in a normal environment.

I don't think this got enough coverage in most information outlets. In fact 2.6.13 has been around for a while, but today was the first time I learnt of that behaviour.

Maybe the FreeRADIUS code could be updated to detect this value... and if it's set to 0 then it could mention it in the debug output? ;-)

alan
|
Alan Buxey, 2010-04-19 16:43:
>> Maybe the info about /proc/sys/fs/suid_dumpable should be added to
>> doc/bugs...
> to quote the man page:
> /proc/sys/fs/suid_dumpable (since Linux 2.6.13)
...
> I don't think this got enough coverage in most information outlets. In fact
> 2.6.13 has been around for a while, but today was the first time I learnt of
> that behaviour.

I agree, even though it's mentioned in the CORE(5) man page.

> Maybe the FreeRADIUS code could be updated to detect this value... and if it's set to 0
> then it could mention it in the debug output? ;-)

Maybe, but with calling prctl(PR_SET_DUMPABLE, 1) this should not be necessary any more. I tried this with a small test program and it worked as specified, but I still won't get a core dump of the FR process unless I set suid_dumpable to 1.

So after some debugging I got to the root cause of this: The process's dumpable flag is reset every time the UID is changed. FR does this several times with fr_suid_up() and fr_suid_down() after switch_users() is run, e.g. in listen_bind(). So I guess we have to change the fr_suid_* functions to always set the dumpable flag after setting the uid.

btw, I wonder why prctl() is not called when debug_flag is set. I would have thought that one would want to get a core dump especially when running in debug mode.
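
Editor's sketch of the proposal above: set the dumpable flag again right after every UID switch. The real change would go into FreeRADIUS's C functions (fr_suid_up() / fr_suid_down()); this Python version only illustrates the ordering of the two calls. It is Linux-only, reaches prctl(2) through ctypes, and the helper names are hypothetical. The constants 3 and 4 are PR_GET_DUMPABLE and PR_SET_DUMPABLE from <linux/prctl.h>.

```python
import ctypes
import os

PR_GET_DUMPABLE = 3  # from <linux/prctl.h>
PR_SET_DUMPABLE = 4  # from <linux/prctl.h>

# Load the C library of the running process to reach prctl(2).
_libc = ctypes.CDLL(None, use_errno=True)
_libc.prctl.argtypes = [ctypes.c_int] + [ctypes.c_ulong] * 4
_libc.prctl.restype = ctypes.c_int


def get_dumpable():
    """Return the process's current dumpable flag (0, 1 or 2)."""
    return _libc.prctl(PR_GET_DUMPABLE, 0, 0, 0, 0)


def set_dumpable(on=True):
    """Set or clear the dumpable flag, raising OSError on failure."""
    if _libc.prctl(PR_SET_DUMPABLE, 1 if on else 0, 0, 0, 0) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def seteuid_keep_dumpable(uid):
    """Switch the effective UID, then re-enable core dumps.

    The kernel resets the dumpable flag to the suid_dumpable value when the
    effective UID changes, so the flag has to be set after the switch,
    not only once at startup.
    """
    os.seteuid(uid)
    set_dumpable(True)
```

Calling `get_dumpable()` before and after a plain `os.seteuid()` as root, with suid_dumpable at 0, is one way to observe the reset described in this message.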
|
Jakob Hirsch wrote:
> So after some debugging I got to the root cause of this:
> The process's dumpable flag is reset every time the UID is changed. FR
> does this several times with fr_suid_up() and fr_suid_down() after
> switch_users() is run, e.g. in listen_bind().
> So I guess we have to change the fr_suid_* functions to always set the
> dumpable flag after setting the uid.

Ah... OK. That can be fixed for 2.1.9.

> btw, I wonder why prctl() is not called when debug_flag is set. I
> would have thought that one would want to get a core dump especially
> when running in debug mode.

It doesn't switch UIDs when in debug mode. So it inherits whatever core dump policy you set in the shell.

Alan DeKok.
|
Alan DeKok, 2010-04-20 10:54:
>> So after some debugging I got to the root cause of this:
>> The process's dumpable flag is reset every time the UID is changed. FR
>> does this several times with fr_suid_up() and fr_suid_down() after
>> switch_users() is run, e.g. in listen_bind().
>> So I guess we have to change the fr_suid_* functions to always set the
>> dumpable flag after setting the uid.
> Ah... OK. That can be fixed for 2.1.9.

Excellent! :) Any idea when it will be released?

>> btw, I wonder why prctl() is not called when debug_flag is set. I
>> would have thought that one would want to get a core dump especially
>> when running in debug mode.
> It doesn't switch UIDs when in debug mode. So it inherits whatever

AFAICS it does when starting it as root (check in mainconfig.c:532). I'd say a quite common case for debugging is to run freeradius -X as root...
|
Jakob Hirsch wrote:
> Any idea when it will be released?

In the next month or so.

>>> btw, I wonder why prctl() is not called when debug_flag is set. I
>>> would have thought that one would want to get a core dump especially
>>> when running in debug mode.
>> It doesn't switch UIDs when in debug mode. So it inherits whatever
>
> AFAICS it does when starting it as root (check in mainconfig.c:532). I'd
> say a quite common case for debugging is to run freeradius -X as root...

OK.

Alan DeKok.
|
Alan DeKok, 04/20/2010 06:21 PM:
>>>> btw, I wonder why prctl() is not called when debug_flag is set. I
>>>> would have thought that one would want to get a core dump especially
>>>> when running in debug mode.
>>> It doesn't switch UIDs when in debug mode. So it inherits whatever
>> AFAICS it does when starting it as root (check in mainconfig.c:532). I'd
>> say a quite common case for debugging is to run freeradius -X as root...
> OK.

This will become a non-issue when the prctl() calls are moved into the fr_suid_* functions. :)

Would you like me to prepare a patch for that, or would you rather do that yourself?

Anyway, here's the aftermath: I got my core dump, finally, and it turns out that we are probably hit by the notorious bug #35 (as I half feared, half hoped :). I will try the fix for list_delete() you proposed if I can get to it...
|
Jakob Hirsch wrote:
> This will become a non-issue when the prctl() calls are moved into the
> fr_suid_* functions. :)
> Would you like me to prepare a patch for that or would you rather do
> that yourself?

Patch, please. It's just easier.

> Anyway, here's the aftermath: I got my core dump, finally, and it turns
> out that we are probably hit by the notorious bug #35 (as I half feared,
> half hoped :).
> I will try the fix for list_delete() you proposed if I can get to it...

I'm not sure that will help.

<sigh> It's happened enough that I know it's real. But I have *no* idea why it's happening:

- there is ONE location in the code where entries get added to the cache
- there is ONE location where they're looked up
- there is ONE location where they're deleted
- all this is done from ONE thread

So if the request is in the cache, the packet pointer *cannot* be NULL. So it's likely not a race condition between threads. It's not a mismanagement issue. It's not a "use after free" memory issue.

<sigh> I'll put a fix into 2.1.9 which works around the issue. It's better than having the server crash.

If you don't mind trying things, I can send you some patches which might help track it down.

Alan DeKok.
