Search This Blog

Thursday, October 9, 2014

API hooking without DLL injection

Lately I have been working on some project to do API hooking with minimal disturbance of the target programs.

I've tried a bunch of things but the reality is that there is no two way around it, you have to inject a DLL and you have to either modify the import table or do preamble patching.

While both methods do work, it is trivial to detect IAT patching and preamble hooking can be defeated with hook hopping or just by checking if a DLL does or does not belong to your process; of course you can attempt to hide your DLL by removing the entries in the 3 linked lists in the PEB (4 if you include the Hash list) but your binary will still be visible via a QueryVirtualMemory() call. This is quite complicated work just to hide yourself. Not to mention that you still have to hook some DLLmain

Also, there are issues with injecting any sort of foreign module into an application. One is that you need to use your own memory manager other you may mess up the heap of the target. Some malware will unload the main program, resize the memory where it was loaded and reload it, unpacked and this will cause the program to crash if your DLL has already done some allocation which has been placed right after the loaded module (memory can't be resized to anything bigger than it was and the call will return NULL. The unpacker will not check for the the NULL pointer and crash).

There was some research done by some people at Purdue regarding instrumenting and debugging via Hardware Virtualization (ttps:// but  I had something a little different in mind.

What about the use of little known MSR?

I was made aware of a feature that exists on both AMD and Intel which could potentially solve my problem. On those CPU, there is a specific set of MSRs that allow to do some profiling. One MSR that sounded very interesting was the DebugCTL MSR. There are another that allow you to get a list of the last 16 branches that occurred on a CPU and there are a few more that give you information about the last branch record (giving you source and destination of the last branch).

The basic idea is to take advantage of the Debug Control MSR.
What it does is that once you've set up the proper flags (bit 0 and 1):

mov ecx, 0x1d9 ; DebugCTL_MSRrdmsr
or eax, 0x3
xor eax, eax
rdmsr          ; Just to make sure it worked.
Now if you run this code on VMWare, the last rdmsr with return 0 in EAX because that MSR is ignored. In order to get this to work, you would have to use some KVM based VM or possibly XEN but I haven't had the opportunity to try with XEN.

Now that the MSR is set up, for each thread that has the Trap Flag (TF = EFlags bit 9) a INT 1 will be generated for every branching instruction such as CALL, JMP, JZ, RET, SYSCALL, Etc...

I wrote two pieces of code to handle this.

Handling the DebugCTL_MSR inside the guest (or physical machine)

The first thing I did was to write a INT 1 handler. It's quite small (and incomplete) and goes something like this:

pushad         ; Save all registers
push fs
push ds
push es
mov bx, 30h
mov fs, bx     ; Set FS to 30h since we're in the kernel
mov bx, 23h
mov es, bx
mov ds, bx     ; ES and DS are now set the 23h

mov ecx, 1d9h  ; Branch to address MSR
push eax

mov ecx, 1dch  ; Branch from address

mov bl, byte ptr [eax] ; Retrieve the op-code at EAX
pop eax        ; restore EAX with Branch to address

cmp bl, 0e8h
je JmpOrCall
cmp bl, 0e9h
je JmpOrCall
cmp bl, 0xFF
je JmpOrCall
jmp short NoRange


push eax
call HandleSingleStep


mov ecx, 1d9h
xor edx, edx
mov eax, 3
wrmsr          ; Reset the MSR value 

Quite rudimentary but it works.

There is a small optimization here where it checks the op-code of the instruction where the branch was made. It checks that only a JMP, CALL or FAR CALL has been made. We don't really care about the conditional jumps or anything like this.

The driver that contains the INT 1 handler also does a few more things. It still injects a DLL (because that was the fastest thing for me to do) and retrieves the addresses of a few hooks. Then, when Kernel32.DLL gets loaded, it retrieves the addresses of the target API (for instance: CreateFileA).

The HandleSingleStep(DWORD callee) will then resolve the address in the sense that it will check if the branch to address matches any of the APIs that we want to hook.  Once a match is found, all we need to do is to get the matching hook's address and change the return address of the INT 1.

Above is the flow of operations (or was since the picture is gone AWOL...).

The reason for me to still inject a DLL was to demonstrate that I could print some log in the hook and then resume the operations. Of course, the beginning of hookedCreateFileA() requires resetting the TF so that we can call the original CreateFileA() and then set TF again. Otherwise, we encounter the risk of getting into some deadlock.

Everything should be done inside the kernel driver. Since the INT 1 runs in the context of the application, we can access memory rather easily and do all sorts of things from there.
There is no need to have an extra DLL which in fact defeats the purpose of the solution.

The biggest problem is performances. If you think about it, every time that a branch is made, an INT 1 is getting triggered. Which means context switch a-gogo and time wasting left and right.

Also, and equally important is the fact that this only work on Windows XP. The reason for it is that NtCreateThreadEx() isn't called by NtCreateProcess(), instead Windows makes use of some private function (PsInsertThread). Since it's private, it would be harder to hook.

We need to hook the thread creation function to force TF to be set to 1. The easier is to change the PCONTEXT at thread creation time. I tried with using some Kernel APC triggered when the thread was created but unfortunately, the KTRAP_FRAME structure in EPROCESS is not yet valid. It isn't available during Thread Resume operation either...

The other issue that we face is with Windows x64. Patch guard will prevent us from hooking INT 1 and NtCreateThreadEx() unless we get rid of it, which may or may not be an option.

How about doing this on the host (for a VM)

I did that too.
It works well and is much easier to implement than inside the guest. Basically, we can setup TF when CR3 is written into (VMExit) and intercept the INT 1 when it pops.

Performance here is a much bigger issue because for one, every time the INT 1 is intercepted, we have a VMExit which is very costly. Unless we can do the filtering there (ignoring conditional jumps and whatnot) this causes a lot of unnecessary noise.

Another big problem is that when the program we monitor makes a SYSENTER/SYSCALL, TF is reset and it needs to be set again at SYSEXIT. If we do a VMExit per SYSEXIT, the performances go down drastically (VMExit requires a lot of CPU cycles and then the user mode app still needs to be scheduled to run).

What then?

I still believe that the use for this can be very good but would have to be done in a sporadic manner. For instance, do regular hooking unless a program is twitchy enough that we ought to use something less invasive. 

No comments:

Post a Comment