Status-Server requests are blocked if an Access-Request is waiting for downstream service to respond


Ignacio Arces
Hello,

I'm running a containerized FreeRADIUS server v3.0.19 with a custom
authentication module, written in C, that authenticates users through an
HTTP API. I also configured the Status server as detailed in
https://wiki.freeradius.org/config/Status and I use it as a health check
for my server by sending Status-Server requests with radclient.
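
For readers who want to see what such a health check puts on the wire, here is a minimal sketch of a Status-Server probe. It is not the radclient implementation. The packet layout follows RFC 5997 (code 12, with a Message-Authenticator attribute computed as HMAC-MD5 over the packet with the attribute value zeroed). The function names, the host, the port 18121 and the secret "adminsecret" are placeholders chosen for illustration, and the reply is not validated.

```python
import hashlib
import hmac
import os
import socket
import struct

STATUS_SERVER = 12          # RADIUS packet code for Status-Server (RFC 5997)
MESSAGE_AUTHENTICATOR = 80  # attribute type (RFC 2869)


def build_status_server(secret: bytes, identifier: int = 0) -> bytes:
    """Build a Status-Server packet carrying only Message-Authenticator."""
    authenticator = os.urandom(16)  # random Request Authenticator
    attr_header = struct.pack("!BB", MESSAGE_AUTHENTICATOR, 18)
    length = 20 + 18                # 20-byte header + one 18-byte attribute
    header = struct.pack("!BBH", STATUS_SERVER, identifier, length) + authenticator
    # HMAC-MD5 over the whole packet, with the attribute value set to zeros.
    digest = hmac.new(secret, header + attr_header + bytes(16), hashlib.md5).digest()
    return header + attr_header + digest


def probe(host: str, port: int, secret: bytes, timeout: float = 2.0) -> bool:
    """Return True if any reply arrives within `timeout` seconds.

    Hypothetical helper: it does not verify the Response Authenticator.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(build_status_server(secret), (host, port))
            sock.recvfrom(4096)
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Placeholder address and secret; adjust to your status listener.
    print("healthy" if probe("127.0.0.1", 18121, b"adminsecret") else "unhealthy")
```

The probe timeout matters: a health check that waits forever has the same problem as the module it is checking.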

We recently experienced an outage in the auth API and, since we didn't have
timeouts properly configured in the curl calls in our custom C module, the
requests were hanging indefinitely. When this happened, we also noticed
that Docker restarted our containerized server because the container was
marked "Unhealthy", meaning the health checks were failing.
While troubleshooting the health checks, we found that Status-Server
requests were not being answered while the auth request was hanging,
waiting for the auth API to respond.

Now that we have a 10s timeout properly configured in our curl requests, we
have mitigated the undesired restarts, but we still can't understand why
even a single stuck auth request is impacting Status-Server requests.
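
As an illustration of the timeout fix described above, here is a minimal sketch in Python rather than the C/libcurl code from the module (in libcurl the corresponding options are CURLOPT_TIMEOUT and CURLOPT_CONNECTTIMEOUT). The function names and the "/auth" path are hypothetical. The demo opens a local socket that accepts connections but never answers, standing in for a hung auth API, and shows that the call returns after the timeout instead of blocking forever.

```python
import socket
import time
import urllib.request

# Ignore any proxy settings from the environment so the demo stays local.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def call_backend(url: str, timeout: float):
    """Return the HTTP status, or None if the back end fails or times out."""
    try:
        with _opener.open(url, timeout=timeout) as resp:
            return resp.status
    except OSError:  # covers URLError, refused connections, and timeouts
        return None


def demo(timeout: float = 0.5):
    """Call a local listener that never replies; return (result, elapsed)."""
    silent = socket.socket()
    silent.bind(("127.0.0.1", 0))  # any free port
    silent.listen(1)               # connections complete, nothing is ever sent
    port = silent.getsockname()[1]
    started = time.monotonic()
    result = call_backend(f"http://127.0.0.1:{port}/auth", timeout)
    elapsed = time.monotonic() - started
    silent.close()
    return result, elapsed


if __name__ == "__main__":
    print(demo())
```

With a bounded timeout, a worker thread is tied up for at most that long per request, which limits how quickly a slow back end can exhaust the thread pool.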
-
List info/subscribe/unsubscribe? See http://www.freeradius.org/list/users.html
Re: Status-Server requests are blocked if an Access-Request is waiting for downstream service to respond

Alan DeKok-2
On Nov 12, 2020, at 1:23 AM, Ignacio Arces <[hidden email]> wrote:
>
> I'm running a containerized FreeRADIUS server v3.0.19 with a custom
> authentication module written in C language that authenticates users
> through a HTTP API.

  v3 has rlm_rest, which should be good enough for most purposes.

> We recently experienced an outage in the auth API and since we didn't have
> timeouts properly configured in the curl calls in our custom C module, the
> requests were hanging indefinitely.

  Yes, that's the downside of a blocking design.  :(

> When this happened, we also noticed
> that our containerized server was restarted by Docker as the container was
> set to "Unhealthy" state, so the health checks were failing.
> Troubleshooting the health checks we found that Status-Server requests were
> not responding while the auth request was hanging waiting for the auth API
> to respond.

  Yes.  That's how it works.  The Status-Server packets are processed by the same threads which process the Access-Requests.  So if all of those threads are blocked, then Status-Server packets are also blocked.

> Now that we have a 10s timeout properly configured in our curl requests, we
> have mitigated the undesired restarts, but we still can't understand why
> even a single stuck auth request is impacting Status-Server requests.

  If *one* Access-Request packet is blocked, then other threads can still process Status-Server.  So no, you don't see a "single stuck auth request impacting Status-Server".
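
  The difference between "one thread blocked" and "all threads blocked" can be shown with a small simulation. This is only a sketch using a generic Python thread pool, not FreeRADIUS code; the function name and the sleep that stands in for a stuck back-end call are assumptions made for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def status_latency(workers: int, block_seconds: float = 0.5) -> float:
    """Seconds a status request waits while one auth request is blocked."""

    def access_request():
        # Stands in for a module call stuck on a slow back end.
        time.sleep(block_seconds)

    def status_server():
        # Trivial work: just record when a worker picked the request up.
        return time.monotonic()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        started = time.monotonic()
        pool.submit(access_request)  # occupies one worker
        handled_at = pool.submit(status_server).result()
    return handled_at - started


if __name__ == "__main__":
    print("1 worker :", round(status_latency(1), 2), "s")
    print("4 workers:", round(status_latency(4), 2), "s")
```

  With a single worker, the status request queues behind the blocked request. With several workers, it is handled right away, until enough requests block to occupy every worker.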

  The goal of Status-Server is to see if the server is up and *working*.  Maybe the server is running, but is unable to process any packets.  In that case, yes, you *do* want it to stop processing Status-Server.

  This situation also falls into the standard design requirements for RADIUS: If the RADIUS server is critical, then _any_ system which is used by RADIUS is also critical.  Make sure that those systems are (a) up, and (b) responsive.

  It makes zero sense to have a back-end database (or REST API) take 10 seconds to respond to a request.  The solution here isn't to hack up the RADIUS server to do something magical.  The solution is to make the back-end system *not* crap.

  Alan DeKok.



Re: Status-Server requests are blocked if an Access-Request is waiting for downstream service to respond

Ignacio Arces
> v3 has rlm_rest, which should be good enough for most purposes.

We ended up using rlm_c since our custom authentication requires a couple
of API calls and generates random correlation/request IDs.

> Yes.  That's how it works.  The Status-Server packets are processed by
> the same threads which process the Access-Requests.  So if all of those
> threads are blocked, then Status-Server packets are also blocked.

This was our understanding as well, which is why we didn't expect a
single stuck request to block status requests.

> If *one* Access-Request packet is blocked, then other threads can still
> process Status-Server.  So no, you don't see a "single stuck auth request
> impacting Status-Server".

We confirmed this scenario in our test env. We forced the request handler
in our auth API to sleep for 60 seconds and then performed a simple
Access-Request with radtest. As expected, this single Access-Request was
blocked for 60s (we removed the curl timeouts and the container health
check for this test), and during this time all the Status-Server requests
we sent were blocked and returned only after the Access-Request completed.

> It makes zero sense to have a back-end database (or REST API) take 10
> seconds to respond to a request.  The solution here isn't to hack up the
> RADIUS server to do something magical.  The solution is to make the
> back-end system *not* crap.

Agreed. Our current focus is to improve our auth API. Nonetheless, I don't
think we are trying to hack up RADIUS; we just want to understand why it's
not working the way it's supposed to work. Maybe we have misconfigured
something that's causing this behavior.

Re: Status-Server requests are blocked if an Access-Request is waiting for downstream service to respond

Alan DeKok-2
On Nov 12, 2020, at 1:26 PM, Ignacio Arces <[hidden email]> wrote:

>> If *one* Access-Request packet is blocked, then other threads can still
>> process Status-Server.  So no, you don't see a "single stuck auth request
>> impacting Status-Server".
>
> We confirmed this scenario in our test env. We forced the request handler
> in our auth API to sleep for 60 seconds and then perform a simple
> Access-Request with radtest. As expected, this single Access-Request were
> blocked for 60s (we removed the curl timeouts and the container health
> check for this test) and during this time all Status-Server request we sent
> got blocked and returned only after the Access-Requests completed.

  Let me guess.  You're running in debug mode?  i.e. with only one thread?

  If you have *multiple* threads, then one stuck Access-Request will not block Status-Server packets.

  Which is why I mentioned thread*S* above.  Not "one thread".

> Agree. Our current focus is to improve our auth API. Nonetheless, I don't
> think we are trying to hack up RADIUS, we just want to understand why it's
> not working the way it's supposed to work. Maybe, we have
> misconfigured something that's causing this behavior.

  Yes.  You're only using one thread.

  Alan DeKok.


Re: Status-Server requests are blocked if an Access-Request is waiting for downstream service to respond

Ignacio Arces
> Let me guess.  You're running in debug mode?  i.e. with only one thread?

Thanks so much, Alan. That was indeed the issue in our test environment.
When this happened in our prod environment, it was because all threads
were blocked; we don't run FreeRADIUS in debug mode in prod.
