Cannot control attribute ordering via "rlm_perl"


claude.brown
Hi,

First, the version I'm using:

# freeradius -v
freeradius: FreeRADIUS Version 2.1.8, for host x86_64-pc-linux-gnu, [...]


I'm trying to control the attribute-ordering when using "rlm_perl". Thus far my experience is that this is not possible. My theory is that this is due to the hash-tables used as the interface between the C and Perl worlds.

Here is a small example that demonstrates the problem. I've turned on the "users" and "perl" modules in the authorize section (in that order). These are the important bits from the "users" file and the "example.pl" file.

(from the "users" file)
steve   Cleartext-Password := "testing"
        Service-Type = Framed-User,
        Framed-Protocol = PPP,
        Framed-IP-Address = 172.16.3.33,
        Framed-IP-Netmask = 255.255.255.0,
        Framed-Routing = Broadcast-Listen,
        Framed-Filter-Id = "std.ppp",
        Framed-MTU = 1500,
        Framed-Compression = Van-Jacobson-TCP-IP,
        WiMAX-Packet-Data-Flow-Id = 1,
        WiMAX-Service-Data-Flow-Id = 1,
        WiMAX-Service-Profile-Id = 2

(from the "example.pl")
sub authorize
{
   return RLM_MODULE_NOOP;
}


The debug log of the server is below. The interesting bits are (a) the "rlm_perl: Added pair" lines and (b) the attribute order in the packet that the server sends in reply - the order has changed.

The ordering is important to me as I want those three WiMAX attributes packed inside a parent attribute "WiMAX-Packet-Flow-Descriptor". If I turn off the "perl" module (or place it before the "files" module) the packing works as expected.

I put all this down to the attribute-list being rebuilt (by rlm_perl) from the %RAD_REPLY table. The hash-table has no concept of ordering, so it ends up randomised.

The above is a contrived example - what I really want to do is add those three WiMAX attributes in my perl script.  But due to the ordering problems I think I am wasting my time and need to come up with another solution.

Can anyone see how I can control the ordering of attributes coming out of the perl script?

Thanks,

Claude Brown.
Vividwireless.



rad_recv: Access-Request packet from host 127.0.0.1 port 50265, id=2, length=63
        User-Name = "steve"
        User-Password = "testing"
        Message-Authenticator = 0xc8b10e777a7ea53a261c855029fd0b58
+- entering group authorize {...}
++[preprocess] returns ok
++[chap] returns noop
++[mschap] returns noop
[suffix] No '@' in User-Name = "steve", looking up realm NULL
[suffix] No such realm "NULL"
++[suffix] returns noop
[eap] No EAP-Message, not doing EAP
++[eap] returns noop
++[unix] returns notfound
[files] users: Matched entry steve at line 76
++[files] returns ok
GOT CLONE -1588651264 0x1a0e160
rlm_perl: Added pair User-Name = steve
rlm_perl: Added pair User-Password = testing
rlm_perl: Added pair NAS-IP-Address = 127.0.0.1
rlm_perl: Added pair Message-Authenticator = 0xc8b10e777a7ea53a261c855029fd0b58
rlm_perl: Added pair WiMAX-Service-Data-Flow-Id = 1
rlm_perl: Added pair Service-Type = Framed-User
rlm_perl: Added pair Framed-Routing = Broadcast-Listen
rlm_perl: Added pair WiMAX-Packet-Data-Flow-Id = 1
rlm_perl: Added pair Framed-Protocol = PPP
rlm_perl: Added pair Framed-Filter-Id = std.ppp
rlm_perl: Added pair Framed-IP-Address = 172.16.3.33
rlm_perl: Added pair Framed-IP-Netmask = 255.255.255.0
rlm_perl: Added pair Framed-Compression = Van-Jacobson-TCP-IP
rlm_perl: Added pair WiMAX-Service-Profile-Id = 2
rlm_perl: Added pair Framed-MTU = 1500
rlm_perl: Added pair Cleartext-Password = testing
++[perl] returns noop
++[expiration] returns noop
++[logintime] returns noop
++[pap] returns updated
Found Auth-Type = PAP
+- entering group PAP {...}
[pap] login attempt with password "testing"
[pap] Using clear text password "testing"
[pap] User authenticated successfully
++[pap] returns ok
Login OK: [steve] (from client localhost port 0)
+- entering group post-auth {...}
++[exec] returns noop
Sending Access-Accept of id 2 to 127.0.0.1 port 50265
        WiMAX-Service-Data-Flow-Id = 1
        Service-Type = Framed-User
        Framed-Routing = Broadcast-Listen
        WiMAX-Packet-Data-Flow-Id = 1
        Framed-Protocol = PPP
        Framed-Filter-Id = "std.ppp"
        Framed-IP-Address = 172.16.3.33
        Framed-IP-Netmask = 255.255.255.0
        Framed-Compression = Van-Jacobson-TCP-IP
        WiMAX-Service-Profile-Id = 2
        Framed-MTU = 1500
Finished request 0.


Re: Cannot control attribute ordering via "rlm_perl"

Alan DeKok-2
Claude Brown wrote:
> I'm trying to control the attribute-ordering when using "rlm_perl". Thus far my experience is that this is not
> possible. My theory is that this is due to the hash-tables used as the
> interface between the C and Perl worlds.

  Quite possibly.

> The ordering is important to me as I want those three WiMAX attributes packed inside a parent attribute "WiMAX-Packet-Flow-Descriptor". If I turn off the "perl" module (or place it before the "files" module) the packing works as expected.

  Yeah.  The server really needs to have a better way of handling nested
attributes.  Suggestions are welcome...

> Can anyone see how I can control the ordering of attributes coming out of the perl script?

  Not using Perl.

  Alan DeKok.

Re: Cannot control attribute ordering via "rlm_perl"

claude.brown
In the end we implemented our solution as a new C module rather than Perl code called via "rlm_perl".  Thanks.

Our new module was designed to replace "rlm_sql" and meet these goals:
- Be roughly equivalent to "rlm_files" in terms of speed
- Utilise all the features of "rlm_files" - avoid re-inventing that wheel
- Allow high rate of user-by-user updates; i.e. avoid config re-write as per "rlm_fastfile"
- Simple for stability: no shared in-memory state (avoid locking and races)
- Simple for stability: avoid complex on-disk structures like databases with dubious libraries
- Simple for stability: easy mechanism to re-write entire config (say daily) to iron out errors
- Simple for stability: re-use as much of FreeRADIUS as possible; avoid writing lots of new code

We achieved all these goals and can now bring all our customers back onto our service in about five minutes. The price is a lot of inodes - we end up with one file per user in a dir tree.

With "rlm_sql" it would take an hour or two, and only then with careful (and human-driven) rate management.
The main issues driving this delay were:
- "rlm_sql" calls during EAP negotiation instead of just at the end of EAP
- Performance issues on our MySQL backend that we didn't have budget to resolve
- Thread lock-ups inside the MySQL library even though no MySQL server queries were active

If this module is of interest to the community we are happy to contribute it.

Cheers,

Claude.


Re: Cannot control attribute ordering via "rlm_perl"

Alan DeKok-2
claude.brown wrote:
> Our new module was designed to replace "rlm_sql" and meet these goals:
> - Be roughly equivalent to "rlm_files" in terms of speed
> - Utilise all the features of "rlm_files" - avoid re-inventing that wheel
> - Allow high rate of user-by-user updates; i.e. avoid config re-write as per
> "rlm_fastfile"

  ?  The "fastusers" module is deprecated, because the "files" module is
just as fast.  The "files" module also can be HUP'd, so it can be
reloaded on the fly.

  Just use: radmin -e "hup files"

  and it will reload *only* the "files" module.  I've tested it at
loading 100K+ users/s off of disk.

> - Simple for stability: no shared in-memory state (avoid locking and races)

  The server core takes care of that when the "files" module is reloaded.

> - Simple for stability: avoid complex on-disk structures like databases with
> dubious libraries
> - Simple for stability: easy mechanism to re-write entire config (say daily)
> to iron out errors

  Daily config reloads are easy.

> - Simple for stability: re-use as much of freeRADIUS as possible; avoid
> writing lots of new code
>
> We achieved all these goals and can now bring all our customers back
> onto our service in about
> five minutes. The price is a lot of inodes - we end up with one file per
> user in a dir tree.

  5 minutes for what, exactly?

  Say you have a format similar to the "users" file, with one user per
file.  Loading 100K users will mean 100K file reads, and that can take a
long time.  So, do that in a "cron" job.  Have it collect the individual
user files into one large file.  That might take 5 minutes, but who
cares?  It's once a day.

  Then, point the "files" module at the collected file.  It shouldn't
take longer than a second or two to reload it.

> With "rlm_sql" it would take an hour or two, and only then with careful (and
> human-driven) rate management.

  I'm not sure what that means.  An hour or two to load SQL?  What is it
doing?

> The main issues driving this delay were:
> - "rlm_sql" calls during EAP negotiation instead of just at the end of EAP

  That can be fixed without a new module.

> - Performance issues on our MySQL backend that we didn't have budget to
> resolve
> - Thread lock-ups inside the MySQL library even though no MySQL server queries were
> active

  I've seen lots of people running MySQL with 300K+ users, and no
problems.  The system needs to be designed carefully, but it *does* work.

> If this module is of interest to the community we are happy to contribute
> it.

  I'd first want to know how many users you have.  And why it's taking
so long to get a system up and running.  It sounds like something is
seriously wrong.

  What does it mean to "bring customers into service in 5 minutes"?
With SQL, you should be able to keep the RADIUS server at 100% uptime.
Then, re-write individual user entries via another administration
process.  Rewriting one user entry should take ~10ms at MOST with any
SQL server.  And when the server starts up, it just connects to SQL.  It
doesn't need to read all of the users from SQL.

  So there's no reason for any downtime, and having 10 users in SQL is
just as fast as having 10M users in SQL.

  It really sounds like your *architecture* is wrong.  Find that and fix
it.  Writing a new module should *not* be necessary.

  Alan DeKok.

Re: Cannot control attribute ordering via "rlm_perl"

Bjørn Mork
Alan DeKok <[hidden email]> writes:
> claude.brown wrote:
>
>> - Performance issues on our MySQL backend that we didn't have budget to
>> resolve
>> - Thread lock-up's inside MySQL library yet no MySQL server queries were
>> active
>
>   I've seen lots of people running MySQL with 300K+ users, and no
> problems.  The system needs to be designed carefully, but it *does* work.

You don't even need to be that careful.  Just run a read-only mysql
slave instance locally on the radius server and all mysql-related
performance problems will vanish.

If you do mysql accounting: use buffered-sql aka decoupled-accounting.
It won't fix the performance issues on your accounting mysql-server, but
it will decouple the radius server from any such problems.
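
For reference, the shape of that setup is roughly as below. See raddb/sites-available/buffered-sql and raddb/sites-available/decoupled-accounting for the real, commented examples; the instance names and exact option names below are from memory and may differ slightly in your version:

    # a detail "writer" instance (name is arbitrary)
    detail detail_sql {
        detailfile = ${radacctdir}/detail-for-sql
    }

    # live server: accounting only appends to the detail file
    accounting {
        detail_sql
    }

    # separate virtual server: replays the detail file into SQL at its own pace
    server buffered-sql {
        listen {
            type = detail
            filename = ${radacctdir}/detail-for-sql
            load_factor = 10
        }
        accounting {
            sql
        }
    }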




Bjørn


RE: Cannot control attribute ordering via "rlm_perl"

claude.brown
In reply to this post by Alan DeKok-2
Alan,

My original reply was confusingly brief. I've clarified below, and I've also put the module we wrote into github in case it helps:

https://github.com/claudebrown/freeradius-server/compare/master...rlm_tagfiles

(about 60 lines of C beyond the usual module plumbing; 250 lines in total)


Alan DeKok wrote:
>
> > - Allow high rate of user-by-user updates; i.e. avoid config re-write as
> per
> > "rlm_fastfile"
>
>   ?  The "fastusers" module is deprecated, because the "files" module is
> just as fast.  The "files" module also can be HUP'd, so it can be
> reloaded on the fly.

We avoided both "fastfile" and reloading "files" on the fly because of the number of updates we have to our user setup.  The rate of change to our customers would require a reload every few seconds during most of the day.

We had concerns in two areas:
- The time to re-write the config and then re-load so frequently. This may become a performance problem as our user base grows out to 250K
- The risk of using the reload mechanism in a way that didn't seem consistent with its design intent, or the likely usage pattern of reloads every day or every few hours.

> > - Simple for stability: no shared in-memory state (avoid locking and
> races)
>
>   The server core takes care of that when the "files" module is reloaded.
>

These "Simple for stability" points were goals for our code. They weren't something we were worried about in the existing code-base.

FreeRADIUS core is very stable. But MySQL adds instability we have been unable to identify or reproduce in our environment.

A crucial success factor for us was to ensure our module code was so simple it was very easy to be confident that stability was maintained. The strategy was to minimise the amount of software outside FreeRADIUS core.

>
>   Daily config reloads are easy.
>

Agreed. If we only needed daily, the "files" module would be perfect.

>   Say you have a format similar to the "users" file, with one user per
> file.  Loading 100K users will mean 100K file reads, and that can take a
> long time.

The module doesn't re-implement the "users" format or have a "users" file for every user.  It does not read 100K (or even 10) files at start-up.

The "files" module is used directly with a single normal "users" file just as per any normal FreeRADIUS deployment.


> > We achieved all these goals and can now bring all our customers
> > back onto our service in about five minutes.
>
>   5 minutes for what, exactly?
>

When large parts of our WiMAX network are restarted due to maintenance or failure the customer devices re-join the network. Whilst this doesn't happen often, when it does as many as 50K devices will simultaneously ask to rejoin the network.  We need to service this sudden and dramatic backlog as quickly as possible.

With the "files" module this is a breeze with a single server.  It just eats it up and everything comes back in a few minutes. Importantly, our testing shows the design goal of 250K users would also be met with one server.

But with "rlm_sql" and MySQL we could not do it. The radiusd would start slowly grinding to a halt roughly as we reached 200 auths per sec (with EAP, this is about 30 devices per sec).  The radiusd log reported "Unresponsive child" in a MySQL module and gradually all the database concurrency would disappear as those threads were lost for further work.

After a lot of effort testing and experimenting with all sorts of things to isolate or avoid this problem, we did get a lot of improvement. But mostly what we achieved was a drop in the probability of losing threads. Inevitably the next larger network-outage event would re-trigger the issue.

With our new far simpler approach, all of this has gone away because we are now using the "files" module and "users" file directly. The speed of authentication is essentially as per that module.

Our new module adds an extra attribute to the Access-Request prior to it being processed by the "files" module.  The extra attribute can be any text attribute (we use "Reply-Message" to be perverse) and can have any value.  Normal "files" matching (typically using DEFAULT entries) is then used to determine the attributes in the Access-Accept.

The value of the extra attribute is in essence obtained like this:
1. Format a filename such as "/blah/%{Username}"
2. Read a line from this file

We only have about 10 different values in these files: things like "voip-customer", "payment-overdue", "gold-customer", "exceeded-download-limit", etc.  The value is used to select a DEFAULT entry in the "users" file that builds the reply attributes needed to configure the customer's service.
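
To make that concrete, the matching entries in the "users" file look roughly like this (the attribute values here are illustrative, not copied from our production config):

   DEFAULT Reply-Message == "voip-customer"
           Service-Type = Framed-User,
           Framed-Protocol = PPP

   DEFAULT Reply-Message == "payment-overdue"
           Service-Type = Framed-User,
           Session-Timeout = 3600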

This adds marginal overhead, so performance is barely different to a vanilla "files" setup.  The cost is one inode per customer and a few hundred lines of C code. We are more than happy with that cost.

Apart from calls into FreeRADIUS code, the module pretty much just calls "fopen", "fgets" and "fclose". So it's dreadfully simple and doesn't have any concerns with thread safety, locking, race conditions, etc.

>
> > With "rlm_sql" it would take an hour or two, and only then with careful (and
> > human-driven) rate management.
>
>   I'm not sure what that means.  An hour or two to load SQL?  What is it
> doing?
>

This happens when we have a major network event that causes lots of devices to simultaneously request authentication. Due to the unpredictable loss of threads, we have to manually manage the rate of the incoming authentications by slowly starting small sections of the network at a time.

This process takes us hours of careful (manual) rate management.


> > The main issues driving this delay were:
> > - "rlm_sql" calls during EAP negotiation instead of just at the end of
> EAP
>
>   That can be fixed without a new module.
>

Possibly, but we couldn't find a way. We would be keen to understand the fix for this.


> > - Performance issues on our MySQL backend that we didn't have budget to
> > resolve
> > - Thread lock-ups inside the MySQL library even though no MySQL server queries were
> > active
>
>   I've seen lots of people running MySQL with 300K+ users, and no
> problems.  The system needs to be designed carefully, but it *does* work.
>

We had no problem during normal operation.  It was only when large numbers of devices (typically 10K or more) simultaneously needed to re-join the network for some reason.

Do you know if these other sites have those kinds of events?

>
> It really sounds like your *architecture* is wrong.  Find that and fix it.

I don't agree. We are not simply hitting a performance limit. That did happen, but it was resolved by using:
- proxy FreeRADIUS instances to do hash-based load-balancing
- separate auth and acct servers
- mysql index, query & deployment tuning

The performance achieved was acceptable (but nowhere near "files").

However, the stability issue would never go away. To me it smells of a race condition somewhere in the MySQL library. As we could only ever reproduce it by cycling 10K or more users, it was proving very difficult to debug.

> Writing a new module should *not* be necessary.
>

Possibly agree.  Finding and fixing the bug that caused threads to disappear would probably have been better.

But we spent far less time coding & testing a few hundred lines of C code than all the effort over the previous 18 months trying to reproduce, isolate or work around the MySQL problem.  We gave up.

A nice bonus is that we can now head towards a single server configuration with a file-system database. This will allow us to retire a raft of servers doing proxying, multiple radiusd, and multiple MySQL instances.

Cheers,

Claude.



RE: Cannot control attribute ordering via "rlm_perl"

claude.brown
In reply to this post by Bjørn Mork
Bjorn,

Thanks.

>
> You don't even need to be that careful.  Just run a read-only mysql
> slave instance locally on the radius server and all mysql-related
> performance problems will vanish.
>

We didn't try this.

Our design goal is:
- 250K users all needing to get on the network at the same time
- each user performing 7 authentications during EAP negotiation
- one hour duration to get everyone sorted

This is about 486 authentications per second. I'm sure that a MySQL configuration can be constructed to achieve this, but I'm not confident it would be a simple setup.  In contrast, the "files" module easily does this with a trivial configuration.

In any case, assuming MySQL can be configured appropriately, I believe the thread-loss stability issue we experienced with high authentication rates would remain.  See my longer reply to Alan for more details.

>
> If you do mysql accounting: use buffered-sql aka decoupled-accounting.
> It won't fix the performance issues on your accounting mysql-server, but
> it will decouple the radius server from any such problems.
>

Yes, we did use this feature to move the accounting backlog from the radius clients into the on-disk buffer.

However, as you note it doesn't solve the accounting performance issues on the database. This was a significant issue for us as we are only able to learn the customer's IP address (needed for many business processes) from the "accounting start" request.  If this is delayed due to an avalanche of requests it affects customers in certain business states.

We were able to gain a significant performance improvement over and above "rlm_sql" accounting by writing the essential data to a flat-file and then batch-loading that into the SQL database.

The improvement came down to SQL transactions - the batch load only created one transaction for 1000's of accounting events rather than one transaction per event.  

Cheers,

Claude.



Re: Cannot control attribute ordering via "rlm_perl"

Fajar A. Nugraha-2
On Tue, Jan 24, 2012 at 9:53 AM, Claude Brown
<[hidden email]> wrote:
> Our design goal is:
> - 250K users all needing to get on the network at the same time
> - each user performing 7 authentications during EAP negotiation
> - one hour duration to get everyone sorted
>
> This is about 486 authentications per second. I'm sure that a MySQL configuration can be constructed to achieve this, but I'm not confident it would be a simple setup.  In contrast, the "files" module easily does this with a trivial configuration.

So to confirm, your new module is basically the files module, but it does
NOT cache anything from the directory, and instead re-reads the files on
disk for every request - is that correct?

>
> In any case, assuming MySQL can be configured appropriately, I believe the thread-loss stability issue we experienced with high authentication rates would remain.  See my longer reply to Alan for more details.

It's possible to work around that, but you need deep enough knowledge
of SQL performance tuning.

For example, in our setup the number of concurrent authentication
requests that can go through while still leaving MySQL responding in a
timely manner is about 128, so we limit the number of SQL threads to
that number.

Using unlang, we then create a failsafe scenario: if a concurrent
request arrives that exceeds the maximum number of SQL threads, it is
automatically accepted (i.e. basically Auth-Type = Accept), but with a
low timeout (e.g. 1 hour). That way the user can connect, but will
reconnect and reauthenticate later when the system is (hopefully)
not so busy.
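
Roughly, the unlang looks something like this (simplified; it assumes the
sql module returns "fail" when no free connection is available, and the
values are just examples):

   authorize {
       ...
       redundant {
           sql
           group {
               # sql could not be reached in time: let the user in anyway,
               # but force an early re-authentication
               update control {
                   Auth-Type := Accept
               }
               update reply {
                   Session-Timeout := 3600
               }
           }
       }
       ...
   }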

> The improvement came down to SQL transactions - the batch load only created one transaction for 1000's of accounting events rather than one transaction per event.

Interesting. I wonder if we can hack a detail reader to behave similarly, e.g.:
- send "start transaction"
- read lines from detail file
- every 10 seconds or before deleting the detail file, send a "commit"

--
Fajar


Re: Cannot control attribute ordering via "rlm_perl"

A.L.M.Buxey
In reply to this post by claude.brown
Hi,

> - each user performing 7 authentications during EAP negotiation

ummm, why? With a correctly configured server and 'protection' of the authentication
type, you should only hit your authentication server just once inside the
EAP tunnel, when the identity is set/known.

alan

Re: Cannot control attribute ordering via "rlm_perl"

Alan DeKok-2
In reply to this post by claude.brown
Claude Brown wrote:
> My original reply was confusingly brief. I've clarified below, and I've also put the module we wrote into github in case it helps:
>
> https://github.com/claudebrown/freeradius-server/compare/master...rlm_tagfiles

  OK.  It's... odd.

> We avoided both "fastfile" and reloading "files" on the fly because of the number of updates we have to our user setup.  The rate of change to our customers would require a reload every few seconds during most of the day.

  I'd normally just put users into SQL.

> We had concerns in two areas:
> - The time to re-write the config and then re-load so frequently. This may become a performance problem as our user base grows out to 250K
> - The risk of using the reload mechanism in a way that didn't seem consistent with its design intent, or the likely usage pattern of reloads every day or every few hours.

  OK.  Reloads don't work for you.

> FreeRADIUS core is very stable. But MySQL adds instability we have been unable to identify or reproduce in our environment.

  That's odd.  While MySQL isn't perfect, I have successfully used it in
systems with 100's of transactions/s.  There was a VoIP provider ~8
years ago using it with ~1K authentications/s.

> When large parts of our WiMAX network are restarted due to maintenance or failure the customer devices re-join the network. Whilst this doesn't happen often, when it does happen we need to get as many as 50K devices will simultaneously ask to rejoin the network.  We need to service this sudden and dramatic backlog as quickly as possible.

  Yup.

> With the "files" module this is a breeze with a single server.  It just eats it up and everything comes back in a few minutes. Importantly, our testing shows the design goal of 250K users would also be met with one server.
>
> But with "rlm_sql" and MySQL we could not do it. The radiusd would start slowly grinding to a halt roughly as we reached 200 auths per sec (with EAP, this is about 30 devices per sec).  The radiusd log reported "Unresponsive child" in a MySQL module and gradually all the database concurrency would disappear as those threads were lost for further work.

  MySQL does have concurrency issues.  But if you split it into
auth/acct, most of those go away.  i.e. use one SQL module for
authentication queries.  Use a *different* one for accounting inserts.

  If you also use the decoupled-accounting method (see
raddb/sites-available), MySQL gets even faster.  Having only one process
doing inserts can speed up MySQL by 3-4x.
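
  i.e. something like this in the module configuration (the instance names
are up to you; the connection details are whatever your site uses):

   # one rlm_sql instance for authorization lookups...
   sql sql_auth {
       # ... connection and query settings for the user database
   }

   # ...and a completely separate instance for accounting inserts
   sql sql_acct {
       # ... connection and query settings for the accounting database
   }

   # then reference them in different sections:
   #   authorize  { ... sql_auth ... }
   #   accounting { ... sql_acct ... }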

> With our new far simpler approach, all of this has gone away because we are now using the "files" module and "users" file directly. The speed of authentication is essentially as per that module.

  OK.

> The value of the extra attribute is in essence obtained like this:
> 1. Format a filename such as "/blah/%{Username}"
> 2. Read a line from this file

  Using a database WILL be faster than reading the file system.

> We only have about 10 different values in these files: things like "voip-customer", "payment-overdue", "gold-customer", "exceeded-download-limit", etc.  The value is used to select a DEFAULT entry in the "users" file that builds the reply attributes needed to configure the customer's service.

  You can do the same kind of thing with SQL.  Simply create a table,
and do:

   update request {
      My-Magic-Attr = "%{sql: SELECT .. from ..}"
   }

  Have the table contain the mapping of User-Name --> "voip-customer".
You should be able to get very high performance.  Then, use that
attribute to do the mappings in the "users" file, just like you do today.

> This happens when we have a major network event that causes lots of devices to simultaneously request authentication. Due to the unpredictable loss of threads, we have to manually manage the rate of the incoming authentications by slowly starting small sections of the network at a time.
>
> This process takes us hours of careful (manual) rate management.

  That's just weird.  SQL should be fine, *if* you design the system
carefully.  That's the key.

> Possibly, but we couldn't find a way. We would be keen to understand the fix for this.

  See above.

> We had no problem during normal operation.  It was only when large numbers of devices (typically 10K or more) simultaneously needed to re-join the network for some reason.
>
> Do you know if these other sites have those kinds of events?

  *Everyone* has this happen.  There's really no need for a new module.

> However, the stability issue would never go away. To me it smells of a race condition somewhere in the MySQL library. As we could only ever reproduce it by cycling 10K or more users, it was proving very difficult to debug.

  It's not a race condition, it's lock contention.

> But we spent far less time coding & testing a few hundred lines of C code than all the effort over the previous 18 months trying to reproduce, isolate or work around the MySQL problem.  We gave up.
>
> A nice bonus is that we can now head towards a single server configuration with a file-system database. This will allow us to retire a raft of servers doing proxying, multiple radiusd, and multiple MySQL instances.

  If it works for you...

  But it's really just a re-implementation of a simple SQL table.  It's
a solution which is specific to your environment.

  The more generic solution is:

- custom tables
- split auth/acct
- decouple acct from the "live" server

  You should be able to get a very high performance with that.  The
benefit is you'll be using real databases, which is usually a good idea.

  Alan DeKok.

Re: Cannot control attribute ordering via "rlm_perl"

Alan DeKok-2
In reply to this post by claude.brown
Claude Brown wrote:
> We didn't try this.

  That would fix it.

> Our design goal is:
> - 250K users all needing to get on the network at the same time
> - each user performing 7 authentications during EAP negotiation

  That should be fixed, too.  There is NO NEED to do 7 SQL queries.  You
can put pretty much everything into "post-auth".
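
  e.g. do the lookup once, after authentication has succeeded, instead of
in "authorize" for every inner EAP round (the query and the reply attribute
here are just placeholders):

   post-auth {
       update reply {
           Session-Timeout = "%{sql: SELECT .. from ..}"
       }
   }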

> - one hour duration to get everyone sorted

  Which works out to ~70 auths/s.  That's trivial.

> This is about 486 authentications per second. I'm sure that a MySQL configuration can be constructed to achieve this, but I'm not confident it would be a simple setup.

  It's dead easy.  Create a custom table as I said before.  SELECT on
that.  Do the select in the post-auth section.  Separate auth from acct.

  70 queries/s is a small system.  I've built MySQL systems which do
sustained hundreds of queries per second, and hundreds of accounting
packets/s.  It's pretty simple.

> In any case, assuming MySQL can be configured appropriately, I believe the thread-loss stability issue we experienced with high authentication rates would remain.

  I don't see why.  I've never seen any "thread loss" issue.

  If you use the same SQL module for auth & acct, you WILL run into
contention, locks, and massive delays.  The solution for that is above.

> However, as you note it doesn't solve the accounting performance issues on the database. This was a significant issue for us as we are only able to learn the customer's IP address (needed for many business processes) from the "accounting start" request.  If this is delayed due to an avalanche of requests it affects customers in certain business states.

  Wild.

> We were able to gain a significant performance improvement over and above "rlm_sql" accounting by writing the essential data to a flat-file and then batch-loading that into the SQL database.

  That can help, yes.

> The improvement came down to SQL transactions - the batch load only created one transaction for 1000's of accounting events rather than one transaction per event.  

  Well... I've run MySQL systems with 100's of accounting inserts/s,
using pretty much a stock config.  It should be possible.

  Alan DeKok.

RE: Cannot control attribute ordering via "rlm_perl"

claude.brown
In reply to this post by Fajar A. Nugraha-2
>
> So to confirm, your new module is basically the files module, but it does
> NOT cache anything from the directory, and instead re-reads the files on
> disk for every request - is that correct?
>

Roughly correct.  No caching, and reads from disk for every request. We rely on the block-buffer cache to make it go blisteringly quick.

The "rough" part is that it isn't a replacement for the "files" module as such.  Instead, it is a module you list in the config before "files". Our module then sets a value for use *by* the "files" module. The "files" module is still used exactly as is.

I think Alan summarised it best as being similar to this:

   update request {
      My-Magic-Attr = "%{sql: SELECT .. from ..}"
   }

Then in the "users" file the value of "My-Magic-Attr" can be used to select particular DEFAULT entries to return attributes.

What is different about our module is that "update request" above would look something like this:

   update request {
      My-Magic-Attr = "%{read-line-from-file: /blah/%{Username}}"
   }

But we wanted to avoid SQL and move to a file-based system as we had reached the end of our tether on SQL optimisation, budget, debugging, etc.

>
> Using unlang, we then create a failsafe scenario: if a concurrent
> request arrives that exceeds the maximum number of SQL threads, it is
> automatically accepted (i.e. basically Auth-Type = Accept), but with a
> low timeout (e.g. 1 hour). That way the user can connect, but will
> reconnect and reauthenticate later when the system is (hopefully)
> not so busy.

This is a very good idea.  Note that our problem was more about stability than raw performance. We still don't really know *why* we had the stability issues and are now relaxing with a beer because it's all gone away now :)

Lazy? Yes.  Happy?  Very :)


>
> Interesting. I wonder if we can hack a detail reader to behave similarly,
> e.g.:
> - send "start transaction"
> - read lines from detail file
> - every 10 seconds or before deleting the detail file, send a "commit"
>

I suspect this would give all the benefits we gained by writing the events to a file and batch loading. Simpler too.





RE: Cannot control attribute ordering via "rlm_perl"

claude.brown
In reply to this post by A.L.M.Buxey
>
> > - each user performing 7 authentications during EAP negotiation
>
> ummm, why? With a correctly configured server and 'protection' of the
> authentication type, you should only hit your authentication server just
> once inside the EAP tunnel, when the identity is set/known.
>

I'm not across the details - it was another person working on this who couldn't get the result we needed.

We also tried moving rlm_sql from "authorize" to "postauth", but that didn't work because (as far as we could see) rlm_sql didn't have an option for setting reply attributes during "postauth".

We have now worked around our 7x EAP problem by using a module configuration that allows us to place all our processing into "postauth".




RE: Cannot control attribute ordering via "rlm_perl"

claude.brown
In reply to this post by Alan DeKok-2
>
>   I'd normally just put users into SQL.
>

Yes - this was our default approach.  It made the most sense to us initially.

rlm_sql
- highly dynamic
- need non-trivial skills to config "right" for performance
- need extra h/w to scale for white-hot performance

rlm_files
- not so dynamic (regular reloads provide a reasonable trade-off)
- trivial to config for performance
- unlikely to need extra h/w for white-hot performance

Our custom module
- highly dynamic
- trivial to config for performance
- unlikely to need extra h/w for white-hot performance

:)


>
> > FreeRADIUS core is very stable. But MySQL adds instability we have been
> unable to identify or reproduce in our environment.
>
>   That's odd.  While MySQL isn't perfect, I have successfully used it in
> systems with 100's of transactions/s.  There was a VoIP provider ~8
> years ago using it with ~1K authentications/s.
>

I too am surprised that we had these stability issues.  We use MySQL for pretty much everything we do - web-sites, customer management, inventory, real-time network analytics, usage accounting, etc. etc.  Some of these systems have significant load - possibly more than radius.  And we have had no stability problems.


>
>   MySQL does have concurrency issues.  But if you split it into
> auth/acct, most of those go away.  i.e. use one SQL module for
> authentication queries.  Use a *different* one for accounting inserts.
>

Yes, we did this.  We have two authentication servers and two accounting servers, with hash-based load-balancing proxies at the front.

This gave us a huge performance boost - and met our performance goals. But the stability issue just never went away.

>   If you also use the decoupled-accounting method (see
> raddb/sites-available), MySQL gets even faster.  Having only one process
> doing inserts can speed up MySQL by 3-4x.

Yes, we did this too. It also helped, but it was getting too far behind in some situations.  We've replaced it with batch-loading from flat files.

>
>   Using a database WILL be faster than reading the file system.
>

Nah :)

Ignoring the effort to configure the DB correctly, you still have most DBMSs using a separate server process. This requires IPC (probably over a network) and OS context-switching between processes.
 
In contrast, fopen, fgets, fclose only requires a switch into kernel mode and back again.  If the block-buffer pool has the data (it usually will) then the process may not even go into a wait-state.

>
>   You can do the same kind of thing with SQL.  Simply create a table,
> and do:
>
>    update request {
>       My-Magic-Attr = "%{sql: SELECT .. from ..}"
>    }
>

This is a very cool idea - I wish we had tried it!  This would have allowed us to put the rlm_sql processing into "postauth" and that may have made a huge difference.


>
>   That's just weird.  SQL should be fine, *if* you design the system
> carefully.  That's the key.
>

Yes. I contrast that with a trivial "C" module and some local files.

>
>   It's not a race condition, it's lock contention.
>

I don't *know* if it was a race condition, but I know it wasn't lock contention. As the threads gradually got lost we would look in the database to find corresponding stalled queries. None would be present.

>
>   If it works for you...
>
>   But it's really just a re-implementation of a simple SQL table.

Functionally, yes.  The benefits are more in simplicity of configuration, and performance per server-dollar.  Plus, for us, it makes a stability issue go away.

Thanks for all your help.

Cheers,

Claude.


