Thursday, March 26, 2009

failure rates and the law of truly large numbers

probably the best known metric for measuring the effectiveness of an anti-virus product is the detection rate... it's something that's been around for a long time and there are frequent attempts to measure it...

i'm not about to try to suggest something to supplant that metric, just so you know - the detection rate metric has served us well for many years, even if there's a bit of confusion over what actually constitutes a detection rate, and even though there's controversy over how it should be measured and who is qualified to do so...

no, although failure rates are much more straightforward to measure, i'm not about to suggest that they're better in any way - i only want to bring them up in order to provoke thought...

let's say you're an average user and you have anti-virus protecting your computer (possibly at many different levels, like a desktop client, an email gateway scanner run by your email provider, etc)... let's further say that on average your anti-virus product fails to prevent your computer from getting compromised once every 5 years...

now, since it is possible to have anti-virus protecting you at multiple layers and with different optimizations at each layer, it's important to define a failure as a piece of malware slipping past all of those layers... maybe that means incident response is required, or maybe you've got some other preventative control that stopped it after (not before) it slipped past your final layer of av prevention (ex. maybe you're a little less average than most and are actually running without the admin privileges that the malware needed to do its dirty deed)...

i know what you security practitioners are probably thinking - 'where can i get this magical av product that only fails once every 5 years?'... i'm sure that would make your lives easier, wouldn't it... well right now it's just a hypothetical av product but later on we'll see what we can do...

so now let's say you're a security practitioner, you're part of the IT department of some company... let's further say that at this company you and your immediate coworkers are supporting approximately 2,000 of those same average users and you're using the same anti-virus technology... guess what - you can now expect to see an anti-virus failure roughly once a day! (by the way, when one compromised machine goes out and leads to 5/10/50 more machines on your network getting the same malware, i call that the same event - it's the same failure)...
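(for the curious, here's the back-of-the-envelope arithmetic as a quick python sketch - the 5-year figure and the 2,000 users are just the assumed inputs from the hypothetical scenario above, not real measurements...)

    # back-of-the-envelope: scale a hypothetical per-user failure rate
    # up to an enterprise of 2,000 users (assumed inputs, not measurements)
    years_between_failures_per_user = 5
    users = 2000

    # per-user rate expressed in failures per day (using ~365 days/year)
    per_user_daily_rate = 1 / (years_between_failures_per_user * 365)

    # expected failures per day across all users, treating exposures as independent
    expected_daily_failures = users * per_user_daily_rate
    print(f"{expected_daily_failures:.2f} failures/day")           # ~1.10
    print(f"{1 / expected_daily_failures:.2f} days between them")  # ~0.91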

did the av somehow magically become less effective? no, of course not, it's the same technology - but in this enterprise scenario there are 2,000 times as many opportunities to fail per unit of time as there were in the single user scenario because there are 2,000 times as many users... malware compromise depends in part on decisions made by the user (which should be equally good/bad between the two scenarios) but also on exposure to the malware in the first place which, while it may be regular or even frequent, is an inherently random event... that means even if you could guarantee perfectly predictable (not necessarily correct, just predictable) decision-making from the users (note: you can't actually guarantee this), anti-virus failure events are still at least partially random...

high school level math tells us that a pair of flipped coins is more likely to have at least one turn up heads (a 75% chance) than a single flipped coin is (a 50% chance), and the same principle applies to anti-virus failure events - more trials means a higher overall probability of the failure event occurring, and more concurrent trials means less expected waiting time between failure events...
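if you want to see that principle in numbers, here's a minimal python sketch - the per-trial probabilities are just the illustrative figures from this post (a fair coin, and the hypothetical once-every-5-years failure rate), not measurements...

    # probability of at least one "hit" (heads, or an av failure) among
    # n independent trials with per-trial probability p
    def p_at_least_one(p: float, n: int) -> float:
        return 1 - (1 - p) ** n

    # one coin vs a pair of coins
    print(p_at_least_one(0.5, 1))   # 0.5
    print(p_at_least_one(0.5, 2))   # 0.75 - more trials, higher probability

    # per-user chance of an av failure on any given day (once every 5 years)
    p_daily = 1 / (5 * 365)
    print(p_at_least_one(p_daily, 1))     # ~0.0005 for one lone user
    print(p_at_least_one(p_daily, 2000))  # ~0.67 for 2,000 users, same day

notice that the enterprise-scale probability is nearly two thirds on any given day even though each individual user's daily probability is tiny...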

now let's think about this in the real world - you security practitioners out there are in a perfect position to know how often your company's anti-virus fails and how many people you're supporting, so what's the per-person failure rate of your av?... at my previous employer we had (to my knowledge) one notable failure in a period of 2 years for a company size of about 20 people (and i helped clean it up)... that's a per-person failure rate of once every 40 years!... now i'm willing to bet that we were an anomaly, a statistical outlier, and that the true per-person failure rate is more frequent than once in four decades, but i'm also willing to bet that larger companies with 2,000 people in them do not suffer a new failure every single day - which means their anti-virus has a per-person failure rate that is actually lower (and therefore better) than the magical example of once every 5 years...
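(if you want to run the same sanity check on your own environment, the conversion to person-years is trivial - here's a sketch with my old employer's anecdotal figures plugged in as the example...)

    # estimate a per-person failure rate from observed incident counts
    def person_years_per_failure(failures: int, people: int, years: float) -> float:
        return (people * years) / failures

    # the anecdotal figures from above: 1 failure, ~20 people, 2 years
    print(person_years_per_failure(1, 20, 2))  # 40.0 - once every 40 person-years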

so take an honest look at how often your anti-virus really fails on a per-person basis... one of the things i've noticed is that a lot of the people who are convinced that av isn't doing a good job anymore are drawing on enterprise experience for their anecdotal evidence - not realizing that the more users you bring into the picture the more the law of truly large numbers works against you... it's not that av actually fails often, it's that failure scales up (and that's true for all failure, not just av failure)...

2 comments:

Anonymous said...

There is a flaw in your reasoning.

You assume that these 2000 users will be attacked by 2000 different threats. In reality, especially if they are all from the same company, they are likely to be attacked by pretty much the same threat(s).

So, in the real world, a product like that won't fail once a day for one user - it will still fail about once every 5 years, but when it fails, it will fail for pretty much every single one of those 2000 users.

kurt wismer said...

actually i'm not assuming they'll each be attacked by 2000 different threats... what i'm assuming is that their individual exposures to malware will largely be independent events (because they converse with different people, visit different websites, etc)...

there will be additional exposure events that are not independent from the exposures of their coworkers (malware spreading within the network or the company getting targeted), but each enterprise user will bring the same number of independent exposure events to the table as a user in the single home user scenario would, because they're fundamentally the same people...

as such, 2000 times as many people translates into 2000 times as many independent exposure events - an unknown fraction of which will turn into successful attacks and thus become av failure events...
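to illustrate (not prove) that scaling, here's a little monte carlo sketch in python - the daily per-user failure probability is a made-up number derived from the hypothetical once-every-5-years rate in the post, and the only thing that matters is that each user's events are drawn independently...

    import random

    # monte carlo sketch: independent daily exposure/failure events per user
    # (made-up probability: the hypothetical once-every-5-years rate)
    P_DAILY_FAILURE = 1 / (5 * 365)
    DAYS = 5 * 365  # simulate a 5 year window

    def failures_over_period(users: int, seed: int = 42) -> int:
        rng = random.Random(seed)
        count = 0
        for _ in range(DAYS * users):  # every user-day is an independent trial
            if rng.random() < P_DAILY_FAILURE:
                count += 1
        return count

    print(failures_over_period(1))     # ~1 failure in 5 years for a lone user
    print(failures_over_period(2000))  # ~2000 failures - roughly one a day

run it a few times with different seeds and the enterprise count barely budges relative to its size while the single user count jumps around - that's the law of truly large numbers at work...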