Introduction

For this bad malware analysis, I thought I would continue the theme of counting letters … that way I could use most of my old code :)

Today, I decided to hash each file using sha512. The output of a hash is supposed to look completely random, so this is almost a test of that as well. I used around 3000 malicious samples and 1800 benign, so let's get started.

Why Hash, Why sha512

Hashing binaries is done all the time to verify downloads, check for changes, provide signatures, provide low hanging fruit for malware signatures, and for many more purposes. It is so widely used that I was wondering if it was possible to use the hash itself as a flag for whether a file could be malware (beyond just looking it up in a hash table).

In reality, using a hash as a signature like this should not work. Hashes are meant to look random, thus there should be no discernible pattern, right? Let's see what I found out.

I decided to do the letter count on hashes for two reasons: 1) it pseudo tests the randomness of hashing and 2) I just thought it would be interesting.

The reason I picked sha512 is also twofold: 1) it’s long, so it’ll provide some of the most data and 2) sha in general is one of the most accepted families of hashing algorithms, so I went with that.

What Was My Result

Surprising! There seems to be a pattern in which characters show up most in hashes for malware.

What!

Yep, it appears that if you see around 3% more f’s and 1% more 7’s and 5’s in your sha512 hash, then you might have some malware.

That Can’t be Right!

Hard to believe, but that is what it seems like: ‘f’, ‘7’, and ‘5’ show up more, and ‘e’ and ‘6’ show up about 1% less, in malware hashes.

Ok, So How Was it Done

Where are the Samples

As with my string analysis, I pulled down around 500 samples of malware from theZoo and dasMalwerk for this hash analysis. For samples of benign software I grabbed all of /bin on Fedora and 200 libraries from the C:/Windows directory.

How was it Analysed

I modified my program from doing string checks to perform the hash analysis. Now, instead of running strings on each of the files, it computes a sha512 hash. I then averaged the number of times each character was seen per file. This means I counted the number of ‘1’s seen across all malicious file hashes, then divided by the total number of malicious files.
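A minimal sketch of that step in Python, not my original program. The function names and the one-directory-per-class layout are assumptions made for illustration; only hashlib.sha512 and the averaging idea come from the description above.

```python
import hashlib
from collections import Counter
from pathlib import Path

HEX_CHARS = "0123456789abcdef"


def sha512_hex(path):
    """Return the SHA-512 digest of a file as 128 lowercase hex characters."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large binaries do not need to fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def average_char_counts(hex_digests):
    """Average number of times each hex character appears per digest."""
    digests = list(hex_digests)
    if not digests:
        raise ValueError("need at least one digest")
    totals = Counter()
    for value in digests:
        totals.update(value)  # counts every character in the digest string
    return {char: totals[char] / len(digests) for char in HEX_CHARS}


def average_for_directory(directory):
    """Hypothetical helper: hash every regular file directly inside `directory`."""
    files = [p for p in Path(directory).iterdir() if p.is_file()]
    return average_char_counts(sha512_hex(p) for p in files)
```

A sha512 digest is 128 hex characters long, so the 16 averages always add up to 128, and a perfectly even spread would put each one at 8.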

This was done for every character, for both the malicious and the benign binaries. After that I subtracted the benign averages from the malicious averages and divided by the original value to get a percentage difference.
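The comparison can be sketched as below. Note the assumption: the text does not say which average counts as the "original value", so this sketch uses the benign average as the baseline. The function name and the example numbers are made up for illustration.

```python
def relative_difference_percent(malicious_avg, benign_avg):
    """Percentage by which each malicious average differs from the benign one.

    Assumption: the benign average is used as the baseline (the denominator).
    """
    result = {}
    for char, baseline in benign_avg.items():
        if baseline == 0:
            raise ValueError(f"baseline for {char!r} is zero")
        result[char] = (malicious_avg[char] - baseline) / baseline * 100
    return result


# Made-up averages, only to show the shape of the calculation.
example = relative_difference_percent(
    {"f": 8.24, "e": 7.92},
    {"f": 8.0, "e": 8.0},
)
print(example)  # 'f' comes out near +3, 'e' near -1
```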

Why?

So a difference of 1 - 2% is not that much, but 3% seems more significant. This shouldn’t happen; all characters should show up about evenly. It can probably be accounted for by the particular samples that I had chosen. Choose a different set of 1000 binaries and the results could be different.
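To get a rough feel for how much of a gap the choice of samples alone can produce, here is a back-of-the-envelope sketch. It assumes every hex character in a digest is independent and uniformly distributed, which is an idealisation rather than something measured here. Under that assumption the count of any one character in a 128-character digest is binomial, with mean 8 and variance 7.5. The function name is hypothetical.

```python
import math

HASH_LENGTH = 128    # hex characters in a sha512 digest
ALPHABET_SIZE = 16   # possible hex characters


def expected_noise_percent(n_malicious, n_benign):
    """One standard error of the gap between the two class averages,
    as a percentage of the expected count, assuming ideal random digests."""
    p = 1 / ALPHABET_SIZE
    mean = HASH_LENGTH * p                 # 8 occurrences per digest
    variance = HASH_LENGTH * p * (1 - p)   # 7.5 for a binomial count
    # Standard error of the difference between two independent sample means.
    std_error = math.sqrt(variance / n_malicious + variance / n_benign)
    return std_error / mean * 100


print(round(expected_noise_percent(500, 500), 2))
print(round(expected_noise_percent(3000, 1800), 2))
```

With 500 files in each group this comes out at a bit over 2%, and with 3000 and 1800 files at about 1%. Since 16 characters are being compared at once, a gap of a few percent for one or two of them is not far from what chance alone would give.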

If this wasn’t bad malware analysis, I wouldn’t stop here. I’d download some more samples and continue … but well it is bad, so make a signature in your IDS for sha512, and if you see more f’s in the hash, then you might have some malware.