0xdabbad00 - Stop trying to use fancy AI on malware

04 Apr 2012

Mathematicians and CS kids who have taken an AI class tend to end up thinking that applying different classification algorithms to .exe files to identify malware is a great idea, and that they are breaking new ground. This is the case with The H's post "Adobe open sources Malware Classifier tool". Someone combined a PE file parser with some machine learning and some malware data sets and released the tool.

Let me start from a high level and explain that this work is not necessary, because you should just use whitelisting. Obtain your .exe files from trusted sources; anything that you didn't obtain from a trusted source shouldn't be there. I understand that this isn't a perfect solution because hey, this is real life and sometimes you end up with stuff that didn't come directly from microsoft.com or wherever. And sure, even in those cases you can't be 100% certain that someone didn't fiddle with it as it came across the wire, or that your trusted source hadn't been compromised, etc.

So your next best option is to scan it with some A/V, for example through VirusTotal. But you're looking for 0-day, so you think that won't work. Except that shortly after it's scanned, maybe a day later, it will probably be detected, because the sample gets passed from VirusTotal to the A/V vendors, who run it through legit automated processes that detect packers, run it in virtualized environments, use more sophisticated techniques, and even bring in human experts to determine if it's malware. But, no, you're right, you took an AI class, and with a basic understanding of a "problem" you don't actually understand, you created something revolutionary! As the author states, "this research uses machine-learning techniques, which are seemingly underutilized by industry to solve security problems but that are used by other computing disciplines with success."

It's not that security folk didn't think of it, or even that it's not used, but it's not used directly. It might be one variable of many that helps determine if something is malware. Here is the problem: the average Windows system is going to have around 10,000 executables on it. The malware classifier mentioned above has a relatively good 98% success rate at detecting malware considering how simple the technique is (by simple I mean that it only uses 8 features of the binary, I'm not insulting your fancy AI smarts), but it also has a 5.68% false positive rate! That means 500+ of those legit executables (5.68% of 10,000 is 568) are going to be marked as malware and you're going to scream "OMG this is the worst infection ever!". That's the problem. Real antivirus software has to have a false positive rate of effectively 0%. That's why some stuff slips through. It's a lot better from the vendors' perspective to take maybe 24 hours to determine if something is really malware than to label svchost.exe as malware, delete it from the system, and brick the box.
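To make that arithmetic concrete, here is a rough sketch in Python. The three rates and counts come from the paragraph above; the helper name `expected_alerts` and the "one infected file" scenario are assumptions added purely for illustration, not anything measured.

```python
# Numbers quoted in the post.
EXECUTABLES_ON_SYSTEM = 10_000
DETECTION_RATE = 0.98          # share of real malware that gets flagged
FALSE_POSITIVE_RATE = 0.0568   # share of clean files that get flagged anyway


def expected_alerts(total_files, infected_files):
    """Return (expected true positives, expected false positives)."""
    clean_files = total_files - infected_files
    true_positives = infected_files * DETECTION_RATE
    false_positives = clean_files * FALSE_POSITIVE_RATE
    return true_positives, false_positives


# Clean system: every single alert is a false alarm.
tp_clean, fp_clean = expected_alerts(EXECUTABLES_ON_SYSTEM, 0)
print(f"clean system: about {fp_clean:.0f} false alarms")

# ASSUMPTION (hypothetical scenario): exactly one file is really malware.
tp_one, fp_one = expected_alerts(EXECUTABLES_ON_SYSTEM, 1)
precision = tp_one / (tp_one + fp_one)
print(f"one infected file: {precision:.2%} of alerts point at real malware")
```

On a clean box that works out to roughly 568 alerts, all of them wrong. In the assumed one-infection scenario, well under 1% of the alerts point at the real thing, which is why the false positive rate matters far more here than the detection rate.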

That's the problem you're trying to solve, and a couple hundred lines of machine-generated code that look like the following (taken from this malware classifier) are not going to solve that problem:

if input.IatRVA <= 94208:
  if input.NumberOfSections <= 5:
    if input.ExportSize <= 0:
      if input.NumberOfSections <= 4:
        if input.IatRVA <= 13504:
          if input.ImageVersion <= 353:
            if input.NumberOfSections <= 3:
              if input.IatRVA <= 6144:
                if input.IatRVA <= 2048:
                  if input.ResourceSize <= 934:
                    isDirty = 1

Take a look at that code, and you'll understand why this is a bad attempt at a solution. It hasn't identified any detection mechanism that makes sense. That's the problem with these black box solutions. Sure, they seem to work for your data set, but that introduces our next problem, which in this case is simply that they are using old, simple malware samples. Try to detect fancy malware, like legit executables that have been trojaned in some way. Against malware that was constructed craftily enough, this machine learning approach falls apart.
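To see how little the quoted path actually encodes, note that the nested checks collapse to five plain thresholds. The sketch below is not the tool's code: `PEFeatures` is a hypothetical container whose field names are copied from the excerpt, `quoted_path_flags` is a made-up name, and the sample values are invented rather than taken from any real file.

```python
from dataclasses import dataclass


@dataclass
class PEFeatures:
    # Hypothetical container; field names copied from the quoted excerpt.
    IatRVA: int
    NumberOfSections: int
    ExportSize: int
    ImageVersion: int
    ResourceSize: int


def quoted_path_flags(pe: PEFeatures) -> bool:
    """True when a file falls down the single quoted path to isDirty = 1.

    Nested checks on the same field keep only the tightest bound,
    e.g. IatRVA <= 94208, 13504, 6144 and 2048 reduces to IatRVA <= 2048.
    """
    return (
        pe.IatRVA <= 2048
        and pe.NumberOfSections <= 3
        and pe.ExportSize <= 0
        and pe.ImageVersion <= 353
        and pe.ResourceSize <= 934
    )


# ASSUMPTION: invented header values for a small tool with no exports
# and no resources. Not measured from any real executable.
tiny_tool = PEFeatures(IatRVA=2048, NumberOfSections=3, ExportSize=0,
                       ImageVersion=0, ResourceSize=0)
print(quoted_path_flags(tiny_tool))
```

Nothing on this path looks at what the file does. It only says "small header fields, no exports, few resources", which any small program with those properties would satisfy, malicious or not.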

If you're looking for 0-day malware, use recent 0-days in your sample set: not the sloppily written stuff, but something good. Don't use some viruses written in the 90's. I bet half the goodware from the 90's will be misclassified by this malware classifier. </rant>
