Code deobfuscation and pattern recognition are as much an art as a science. In the past, we’ve talked about automating many aspects of proactive detection, such as through delta analysis, scripts, or crawling the web for exploits. These are exceedingly useful, and take me back to my days of hard drive (HDD) forensics and analysis, which was much more manual than automated. Back then, suspicious artifact discovery and identification was also more manual. Some artifacts came from the efforts of first line net defenders, others came from incident responders, while others still were obtained through HDD forensics of compromised systems. Regardless of the source, artifacts of suspicion were uncovered. Some were readable, while others were encoded, obfuscated or encrypted.
In general, during active investigations involving post-mortem (or dead box) forensics, it’s not always immediately known what produced the artifact in question, nor what algorithm or encoding scheme produced the artifact. Despite the deadline or urgency of the mission, obfuscated artifacts often stalled investigations or were completely cast aside because nothing immediately appeared malicious.
Current day hunters undoubtedly need to deobfuscate or decode a particular artifact – quite possibly without the luxury of having the artifact’s parent binary. But how does one go about figuring out that ‘A’ decodes to ‘B’ by basically eyeballing an artifact? The key is pattern recognition. Pattern recognition is especially important when the original parent malware is not available. I’ll walk through some key elements and tips I’ve developed to expedite the manual pattern recognition and deobfuscation process, and conclude with a brief overview of some of the automated pattern recognition techniques, which are enhanced when combined with expertise gained through the manual deep dives into the artifacts.
VISUAL PATTERN RECOGNITION
Let’s warm up with a few quick examples where the obfuscation technique should be readily apparent. Since encrypted data is visually indistinguishable from random data, I’ll be focusing on non-encrypted data, starting with how it pertains to Windows portable executable (PE) files. As a refresher, the diagram below is a useful reference for locating the various headers within a PE file. As we proceed, keep in mind there are a lot of null bytes in a PE file, specifically in between the MZ header and DOS stub.
Image source: http://marcoramilli.blogspot.com/2010/12/windows-pe-header.html
For this reason, XOR keys are relatively easy to spot at times, as any key byte XOR’d with a null byte (0x00) will produce the same key byte (e.g., 0x85 XOR’d with 0x00 will produce 0x85).
The four diagrams below reflect four different hex editor views of the same PE file, with each of these keys easily seen at offset 0x20.
The Original
1-byte XOR key applied
2-byte XOR key applied
4-byte XOR key applied
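To make the null-byte property concrete, here is a minimal Python sketch. The helper name xor_repeat and the sample keys are illustrative assumptions and are not taken from the files pictured above.

```python
from itertools import cycle


def xor_repeat(data: bytes, key: bytes) -> bytes:
    """XOR data with a repeating key of any length."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


# A run of null bytes, like the padding found in a PE header.
nulls = bytes(8)

# XORing nulls with a key reproduces the key itself, over and over.
one_byte = xor_repeat(nulls, b"\x85")
four_byte = xor_repeat(nulls, b"\x42\x76\xe9\x16")

print(one_byte.hex())   # 8585858585858585
print(four_byte.hex())  # 4276e9164276e916
```

Because XOR is its own inverse, calling xor_repeat a second time with the same key restores the original bytes.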
Not everything in need of deobfuscation will be a PE file though. Take a look at the following ‘.dat’ file for example. Does anything jump out?
If the first thought was that ‘.dat’ was encoded with the two-byte XOR key [0x88, 0x85], that would be a good thought. Let’s pursue that and see what we get (using the first 16 bytes).
0000000: 8885 8885 b7ad abac ada8 abb4 a8ab ae9b ..............
XOR with [0x88, 0x85] becomes
0000000: 0000 0000 3f28 2329 252d 2331 202e 261e ..............
Nothing apparent comes to light with this line of thought. So let’s refer back to the previous figure and see where the 2-byte pattern (0x88 0x85) occurs. We see it at offsets 0x00, 0x02, 0x81, 0x83, 0xb6, 0xc2, 0xc4, 0xc6 and 0xe3. One question at this point is, “what 2-byte pattern can exist once, twice, thrice, etc?” Having encountered my fair share of keylogged data files, my thoughts turn toward the 2-byte pattern of the ‘carriage return / newline’ or [0x0d, 0x0a]. Could this be what [0x88, 0x85] decodes to? In order to pursue this line of thought, simply XOR 0x88 with 0x0d (resulting in 0x85), and XOR 0x85 with 0x0a (resulting in 0x8f). This gives us our potential XOR key. If correct, the 2-byte XOR key is [0x85, 0x8f]. Let’s check it out.
0000000: 8885 8885 b7ad abac ada8 abb4 a8ab ae9b ..............
XOR with [0x85, 0x8f] becomes
0000000: 0d0a 0d0a 3222 2e23 2827 2e3b 2d24 2b14 ....2".#('.;-$+
Something still isn’t correct. There is some ASCII readable data, but nothing useful. A rolling XOR key or byte-for-byte substitution scheme doesn’t seem to be the ticket, but I still like the [0x0d, 0x0a] theory. What else can get [0x88, 0x85] to our ‘carriage return / newline’ or [0x0d, 0x0a]? Typical encoding schemes include XOR, rotating bits left or right (ROL/ROR), and addition/subtraction, so let’s explore whether a different math function is at work. Subtracting 0x0d from 0x88 equals 0x7b, but will taking 0x7b away from our next byte of 0x85 equal 0x0a? You bet it does. I think we’re onto something now. It turns out this particular data block was encoded by adding 0x7b to every byte, so subtracting 0x7b from every byte decodes it, as shown below.
0000000: 8885 8885 b7ad abac ada8 abb4 a8ab ae9b ..............
subtract [0x7b] from every byte becomes
0000000: 0d0a 0d0a 3c32 3031 322d 3039 2d30 3320 ....<2012-09-03
Here’s the decoded ‘.dat’ segment. As we look at the ASCII representation, we can see the output includes the date and name of the application currently being displayed or used by the user. This suggests the work of a keylogger.
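The following is my Python script used to do the math.
# Decode '.dat' by subtracting 0x7b from every byte (modulo 256).
with open('.dat', 'rb') as f:
    contents = f.read()

with open('.dat_decoded', 'wb') as f:
    # Iterating over bytes yields ints in Python 3, so ord()/chr() are not needed.
    f.write(bytes((b - 0x7b) % 256 for b in contents))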
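The known-plaintext reasoning used in this example can also be sketched in a few lines. This is only an illustration: the variable names are mine, the 16 sample bytes are copied from the hex dump above, and the CR/LF guess is the same assumption made in the text.

```python
from itertools import cycle

# First 16 bytes of the '.dat' file, copied from the hex dump above.
sample = bytes.fromhex("88858885b7adabacada8abb4a8abae9b")

# Assumed plaintext for the recurring pair [0x88, 0x85]: CR/LF.
known_cipher = sample[:2]
known_plain = b"\r\n"

# Candidate XOR key: cipher XOR plain, byte by byte.
xor_key = bytes(c ^ p for c, p in zip(known_cipher, known_plain))

# Candidate additive key: cipher minus plain, modulo 256.
sub_key = bytes((c - p) % 256 for c, p in zip(known_cipher, known_plain))

xor_decoded = bytes(b ^ k for b, k in zip(sample, cycle(xor_key)))
sub_decoded = bytes((b - k) % 256 for b, k in zip(sample, cycle(sub_key)))

print(xor_key.hex(), xor_decoded)
print(sub_key.hex(), sub_decoded)
```

The XOR candidate comes out as two different bytes (0x85, 0x8f), while the subtraction candidate is the same byte twice (0x7b, 0x7b). A constant difference across both known bytes is a useful hint that a single-byte additive scheme is in play.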
DOUBLE OBFUSCATION
At times a defender/hunter may obtain a suspicious artifact that contains an embedded (and obfuscated) executable binary. There are various tools available to extract binaries from files such as DOCs, XLSs, and PDFs, but I’ve come across some artifacts where a double layer of obfuscation was used. This forced me to deobfuscate it manually through "eyeballing" and deductive reasoning. Let’s look at this excerpt from such a malicious file.
At first glance, a possible XOR key of 0x87 jumps out as we see it repeated many times over between offset 0x8630 and 0x8680. In an ideal world, we would see an MZ header with a DOS stub staring at us after we XOR decode the segment with 0x87. That’s not what we get though, is it?
It’s important to recognize the visual ASCII pattern of an MZ header and DOS stub. With this visual in mind, look over the ASCII display above, which includes all of the bytes that have been XOR’d with 0x87. Do your eyes fixate on offset 0x8637 (first three bytes: 0xD4 0xA5 0x09)? Even though this segment is not flush against the left margin of the ASCII view, the visual pattern of a Windows PE file is still present. Unfortunately, something is still amiss. There doesn’t appear to be an MZ header anywhere in sight, nor does there appear to be a DOS stub. Or is there? Let’s carve out this segment into its own file and inspect it a little further. Below is the beginning portion of the carved out file, which is still obfuscated.
Look at the first two bytes. Do they look familiar? They are in fact nibble swapped. A nibble is half a byte, so with the byte 0xD4, ‘D’ is one nibble and ‘4’ is the other. Swapping those nibbles results in 0x4D. Continuing with the next byte 0xA5, a nibble swap results in 0x5A, and 0x09 becomes 0x90, etc. Let’s now nibble swap each byte of our sample. What we’re left with is the immediately recognizable MZ header / DOS stub below.
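For readers who want to script the two layers just described, here is a minimal sketch. The function names nibble_swap and decode_two_layers are hypothetical, and the layer order (XOR with 0x87 first, then the nibble swap) simply follows the walkthrough above.

```python
def nibble_swap(data: bytes) -> bytes:
    """Swap the high and low 4 bits of every byte (0xD4 -> 0x4D)."""
    return bytes(((b << 4) & 0xF0) | (b >> 4) for b in data)


def decode_two_layers(data: bytes, xor_key: int = 0x87) -> bytes:
    """Undo the single-byte XOR first, then the nibble swap."""
    return nibble_swap(bytes(b ^ xor_key for b in data))


# The three bytes at offset 0x8637 after the XOR layer is removed.
print(nibble_swap(bytes([0xD4, 0xA5, 0x09])))  # b'MZ\x90'
```

A nibble swap leaves 0x00 unchanged, which is why the null padding still shows up as a clean run of 0x87 in the raw file even though a second layer is present.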
ANOTHER DOUBLE OBFUSCATION EXAMPLE (because they’re fun)
This next technique involves a double layer of obfuscation, but with a few more interesting twists. The segment below consists of 272 bytes extracted from a nefarious looking file. As you can plainly see, the embedded binary begins at offset 0x6A16 with byte 0x3b, right?
Well, okay, maybe there isn’t anything ‘plain to see’ about this. In fact, it probably just looks like a bunch of gibberish, but let’s look for a pattern anyway. Do you see the repetitiveness of [0x61, 0x83, 0xa5, 0xc7]? This appears to be a good candidate for a 4-byte XOR key so let’s use that and see what happens.
It seems to have that ‘ASCII visual’ I look for - the general appearance or shape of an MZ header and DOS stub, but not necessarily the content. There isn’t much to go on though, so let’s review it byte by byte, keeping in mind that the MZ header is ‘MZ’ and a common DOS stub is ‘This program cannot be run in DOS mode.’ Notice the bytes at offsets 0x6a16, 0x6a18 and 0x6a1a. They are 0x5a, 0x4d, and 0x90. This should definitely jump out as the first three bytes of the MZ header, but they’re out of order. Now look at the ASCII text beginning at offset 0x6a63. You can see all the letters of the DOS stub, but definitely out of sequence.
To put this ‘humpty dumpty’ back together again, a "circular shift" loop of sorts is necessary. Treat offset 0x6a16 as byte(0). Pop the byte at relative offset 0x02 and shift it left two bytes (that is, insert it at byte(0)), which shifts the bytes it passes over to the right by one. Next, pop the byte at offset 0x04 and shift it left two bytes, then pop the byte at offset 0x06 and shift it left two bytes. Repeat this loop until the executable is completely reassembled. For visualization purposes, here is the play-by-play depiction of the first eight loop iterations. Now that we have the complete binary, reverse engineering (RE) efforts can begin.
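The reordering loop can be sketched as follows. This is a literal reading of the steps described above rather than the original tooling: the function name unshuffle is hypothetical, and the scrambled sample is a made-up header fragment built to match the 0x5a, 0x4d, 0x90 positions noted earlier.

```python
def unshuffle(data: bytes) -> bytes:
    """Pop the byte at each even offset (2, 4, 6, ...) and reinsert it two bytes to the left."""
    buf = bytearray(data)
    for k in range(2, len(buf), 2):
        b = buf.pop(k)        # remove the byte at offset k
        buf.insert(k - 2, b)  # reinsert it two positions earlier
    return bytes(buf)


# Hypothetical XOR-decoded fragment: 'Z' at 0, 'M' at 2, 0x90 at 4.
scrambled = bytes([0x5A, 0x00, 0x4D, 0x00, 0x90, 0x00, 0x03, 0x00, 0x00, 0x00])

print(unshuffle(scrambled).hex())  # 4d5a9000030000000000
```

Popping and inserting on a bytearray is quadratic in the buffer size, which is fine for a small carved segment. For a large binary, computing the final index of each byte directly would be faster.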
MANUAL vs AUTOMATION
Of course, I’d be remiss if I didn’t mention that manual pattern recognition isn’t the end-all or the only method one should try. A wide array of deobfuscation tools are out there and should be used whenever possible. Many of these strive to deobfuscate XOR, ROL, ROT and/or SHIFT keys, while others (such as FLOSS) attempt to uncover obfuscated strings. ‘XORtool’ is another good one to try. Some tools, however, specialize in single-byte XOR keys, such as ‘ex_pe_XOR’, ‘iheartXOR’ and ‘XORBruteForcer’, while others like ‘XORSearch’ go beyond a single-byte XOR key. By default, XORSearch tries all XOR keys (0x00 to 0xFF), ROL keys (0x01 to 0x07), ROT keys (0x01 to 0x19) and SHIFT keys (0x01 to 0x07) when searching, but it also has a switch to search for a 4-byte (or 32-bit) key. This works quite well, but it’s not effective for longer key lengths. To illustrate this, calc.exe was XOR encoded with a 4-byte key, then with a 5-byte key as reflected in the following diagram.
The tool had no problem finding the 4-byte XOR key for the file ‘calc.exe_XOR.bin’, but it couldn’t find the 5-byte key used for ‘calc.exe_XOR.bin2’. Each result is reflected below.
As shown above, the 32-bit XOR key 16E97642 (little endian) was found. This correctly matches the visible XOR key (0x4276E916) in ‘calc.exe_XOR.bin’ at offset 0x20. Studying the byte combinations of ‘calc.exe_XOR.bin2’, however, the 5-byte key (0x4276E9162B) can first be seen at offset 0x19, repeating in its entirety several times.
There are other tools that try to ascertain longer keys, some through frequency analysis, but each tool is different so the results often vary. For this reason it’s advisable to experiment with a wide array of deobfuscation tools to get familiar with their nuances before getting hit with an operational time crunch. Also, keep in mind that many tools are designed for obfuscated PE files, meaning they’re looking for known PE file components and may not be effective for other types of non-PE obfuscated files. Knowing the ‘target audience’ of the tool is important.
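As a rough illustration of the frequency-analysis idea, the sketch below guesses a repeating XOR key by assuming that 0x00 is the most common plaintext byte, as it tends to be in a PE header. It is not how any of the tools named above work internally. The function name guess_xor_key is hypothetical, the key length must be supplied by the caller, the key is assumed to start at offset 0, and the test data is a stand-in rather than a real executable.

```python
from collections import Counter
from itertools import cycle


def guess_xor_key(data: bytes, key_len: int) -> bytes:
    """Guess a repeating XOR key, assuming 0x00 is the most common plaintext byte."""
    key = bytearray()
    for pos in range(key_len):
        column = data[pos::key_len]  # every byte encoded with key[pos]
        most_common_byte, _ = Counter(column).most_common(1)[0]
        key.append(most_common_byte)  # 0x00 XOR key[pos] == key[pos]
    return bytes(key)


# Stand-in for a null-heavy PE header (not a real executable).
plain = b"MZ" + bytes(62) + b"This program cannot be run in DOS mode." + bytes(64)
secret = bytes.fromhex("4276e9162b")  # the 5-byte key from the example above
encoded = bytes(b ^ k for b, k in zip(plain, cycle(secret)))

print(guess_xor_key(encoded, 5).hex())  # 4276e9162b
```

When the key length is unknown, one option is to try a small range of lengths and check each decoded result for a recognizable marker such as ‘MZ’ or the DOS stub text. This approach breaks down on data that is not dominated by null bytes, which is one reason results differ so much between tools and file types.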
CONCLUSION
Pattern recognition analytics can be extremely beneficial, but must be viewed as one part of the larger hunter’s or analyst’s arsenal. Pattern recognition is not for the inexperienced, as strong scripting chops are necessary to completely deobfuscate artifacts once the encoding scheme or pattern has been recognized. And remember, if confronted with an obscure one-off hunter-attained artifact (meaning all the moving pieces aren’t available to reverse engineer), and the tool of choice doesn’t provide the necessary output, a keen eye may just come in to save the day.