
Wednesday, November 23, 2016

Using APCs to inject your DLL, reloaded

In a previous post, Using APCs to inject your DLL, I talked about injecting a DLL from the Windows kernel.

Since that post was lacking some details, I decided to add a couple of things and also talk about how to inject into a 32bit process from a 64bit kernel.

When to inject our DLL

For a 64bit process, I usually do it right after ntdll.dll has been loaded. You can do this easily when you get a module load notification. You can get the notification by making a call to PsSetLoadImageNotifyRoutine().

The reason to wait for ntdll.dll is that once it is loaded, we can get the address of LdrLoadDll():

NTSYSAPI 
NTSTATUS
NTAPI

LdrLoadDll(
  IN PWCHAR               PathToFile OPTIONAL,
  IN ULONG                Flags OPTIONAL,
  IN PUNICODE_STRING      ModuleFileName,
  OUT PHANDLE             ModuleHandle );

When using LdrLoadDll() you should end up with code like this:

void NTAPI ApcLoadDLL(LPLDR_CONTEXT ctx, PVOID SystemArgument1,
                      PVOID SystemArgument2) {
    UNREFERENCED_PARAMETER(SystemArgument1);
    UNREFERENCED_PARAMETER(SystemArgument2);
    HANDLE Module = NULL;

    ctx->LdrLoadDll(NULL, 0, &ctx->dllPath, &Module);
    return;
}

The context being defined as such:

typedef NTSTATUS(*LDR_LOAD_DLL_FN)(
    IN PWCHAR               PathToFile OPTIONAL,
    IN ULONG                Flags OPTIONAL,
    IN PUNICODE_STRING      ModuleFileName,
    OUT PHANDLE             ModuleHandle);

typedef struct ldrContext {
    PVOID ShellCode;
    UNICODE_STRING dllPath;
    HANDLE Process;
    LDR_LOAD_DLL_FN LdrLoadDll;
} LDR_CONTEXT, *LPLDR_CONTEXT;

The dllPath member of the context is just a UNICODE_STRING that contains the path of the DLL we want to inject.
That function works as 64bit shellcode, but for a 32bit process we need something different: 32bit assembly.

Injection of a 32bit DLL from a 64bit Kernel


The first thing that changes is when we inject the DLL. For 64bit we wait for ntdll.dll to be loaded (conveniently, it just happens to be the very first library loaded) but for 32bit APC we need to wait for a different library: wow64.dll

The reason for this is that to use an APC in a 32bit process, you can't just give the address of the routine that you want to execute. You need to give the address of a specific function inside wow64.dll. Basically it's a thunking mechanism.

The function that will do the work is: Wow64ApcRoutine

That function will be your APC routine, which in turn will call your actual shellcode.
The Wow64ApcRoutine routine is an APC normal routine. 

The parameters given to it are very specific though. When you create your APC you should have something like this:

LPLDR_CONTEXT32 context = (LPLDR_CONTEXT32)ctx;
// High 32 bits: address of the 32bit routine, low 32 bits: address of its context
PVOID ApcContext = (PVOID)(((ULONG_PTR)Apc32BitRoutine << 32) + (ULONG_PTR)Apc32BitContext);
KeInitializeApc(apc, tThread,
                OriginalApcEnvironment,
                (PKKERNEL_ROUTINE)&KernelApcRoutine,
                NULL,
                context->Wow64ApcRoutine,
                UserMode,
                ApcContext);

KeInsertQueueApc(apc, 0, NULL, 0);

The context and other structures being defined as such:

typedef union
{
    struct
    {
        ULONG Apc32BitContext;
        ULONG Apc32BitRoutine;
    };
    PVOID Apc64BitContext;
} wow64ApcContext;

typedef wow64ApcContext  WOW64_CONTEXT;
typedef wow64ApcContext* LPWOW64_CONTEXT;

typedef struct ldrContext32 {
    ULONG ShellCode;
    UNICODE_STRING32 dllPath;
    DWORD Process;
    DWORD LdrLoadDll;
    PKNORMAL_ROUTINE Wow64ApcRoutine;
    WOW64_CONTEXT wow64Context;
} LDR_CONTEXT32, *LPLDR_CONTEXT32;

The Apc32BitRoutine is the address of your shellcode (you will have used ZwAllocateVirtualMemory() to allocate it in the target process earlier).
The Apc32BitContext is the address of your structure (it basically needs the UNICODE_STRING that specifies the path of your DLL and the address of LdrLoadDll).

The APC will therefore call Wow64ApcRoutine (you got the address of that after wow64.dll got loaded), and in turn, it will call your shellcode with the given context as a parameter.
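The packing of the two 32bit addresses into the single 64bit context can be checked outside the kernel. Below is a minimal sketch in Python (used only for illustration, the driver itself is C); the two addresses are made-up values, not anything a real process would return.

```python
import struct


def pack_wow64_apc_context(routine32, context32):
    """Combine two 32-bit user-mode addresses into the 64-bit APC context."""
    if not (0 <= routine32 <= 0xFFFFFFFF and 0 <= context32 <= 0xFFFFFFFF):
        raise ValueError("both addresses must fit in 32 bits")
    # Routine goes in the high half, context in the low half.
    return (routine32 << 32) | context32


def unpack_wow64_apc_context(value):
    """Split the value the way the union does: (Apc32BitContext, Apc32BitRoutine)."""
    # Little-endian: the first ULONG of the union occupies the low 32 bits.
    context32, routine32 = struct.unpack("<II", struct.pack("<Q", value))
    return context32, routine32


# Hypothetical addresses, for illustration only.
packed = pack_wow64_apc_context(0x00A30000, 0x00A40000)
print(hex(packed))
print([hex(v) for v in unpack_wow64_apc_context(packed)])
```

Reading the value back through the union layout shows why Apc32BitContext is declared first: on a little-endian machine the first ULONG is the low half of the 64bit value.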

32Bit shell code


In an earlier iteration, I used the following code:

UCHAR x86shellCode[] = {
    //"\xcc"          // Break Point
    "\x55"            // push ebp
    "\x8b\xec"        // mov ebp, esp
    "\x8b\x45\x08"    // mov eax, dword ptr [ebp+8]   ; eax = context
    "\x83\xc0\x0c"    // add eax, 0Ch                 ; &context->Process (receives the module handle)
    "\x8b\xf4"        // mov esi, esp
    "\x50"            // push eax
    "\x8b\x4d\x08"    // mov ecx, dword ptr [ebp+8]
    "\x83\xc1\x04"    // add ecx, 4                   ; &context->dllPath
    "\x51"            // push ecx
    "\x6a\x00"        // push 0
    "\x6a\x00"        // push 0
    "\x8b\x55\x08"    // mov edx, dword ptr [ebp+8]
    "\x8b\x42\x10"    // mov eax, dword ptr [edx+10h] ; context->LdrLoadDll
    "\xff\xd0"        // call eax                     ; No need to clean the stack after the call
    "\x5d"            // pop ebp
    "\xc3"            // ret
    "\x90\x90\x90"    // NOP
};

This will get the address of LdrLoadDll from the context as well as the UNICODE_STRING with the path. Essentially it does the exact same thing as the 64bit code mentioned at the beginning.
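The hard coded offsets in the shellcode (4, 0Ch and 10h) depend on the layout of LDR_CONTEXT32. As a quick sanity check, here is a sketch in Python with ctypes that models only the first four fields of the structure (the ones the shellcode touches); the UNICODE_STRING32 layout used here is the usual two USHORTs followed by a 32bit buffer pointer.

```python
import ctypes


class UnicodeString32(ctypes.Structure):
    # 32-bit flavour of UNICODE_STRING: two USHORTs and a 32-bit pointer.
    _fields_ = [
        ("Length", ctypes.c_uint16),
        ("MaximumLength", ctypes.c_uint16),
        ("Buffer", ctypes.c_uint32),
    ]


class LdrContext32(ctypes.Structure):
    # Only the fields read by the 32-bit shellcode are modelled here.
    _fields_ = [
        ("ShellCode", ctypes.c_uint32),
        ("dllPath", UnicodeString32),
        ("Process", ctypes.c_uint32),
        ("LdrLoadDll", ctypes.c_uint32),
    ]


offsets = {
    name: getattr(LdrContext32, name).offset
    for name, _ in LdrContext32._fields_
}

for name, offset in offsets.items():
    print(f"{name:12s} +0x{offset:02x}")
```

If a field is added or reordered in the C structure, the offsets in the opcodes have to be updated to match.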


Friday, February 19, 2016

How to control your Android phone which has a broken screen

Who hasn't dropped their phone?

Well, alright, maybe some people never do but I'm not one of them.
I dropped mine a couple days ago and since then, the screen doesn't respond to touch.



I figured that that was it but while I'm waiting for my new phone to be delivered, I wanted to be able to still use the device.

Fortunately, since I am doing development on the phone, I have it set up for it. Only problem is that only my home desktop machine is authorized to access the phone via ADB. Bummer but it's still a good thing.

What do you need?


  • USB cable
  • Android SDK and more specifically adb
  • Python (to make things a little easier)

Connecting to the phone


I'm using Linux at home so the first thing is to do something like:

~ sudo ~/Android/Sdk/platform-tools/adb devices

This will start the adb daemon and you will be connected to the phone.
You can then use adb with your user account normally. We use sudo here to start the daemon and avoid authorization issues.

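ADB provides a shell and a handful of other commands that let you do a lot of things. For example, to copy files off the phone:

~ adb pull /sdata/DCIM/Camera

This will copy the files inside /sdata/DCIM/Camera onto the hard disk of the computer (in the current directory). Note that pull is run from the computer, not from inside adb shell; adjust the path to wherever your phone keeps its photos.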

~ adb shell input tap 100 100

This will instruct the phone to do a tap at the given coordinates.

Python to the rescue


I wrote a small python script to do a bunch of actions because typing adb shell input tap xx xx becomes rather annoying after a while.

It goes like this:

#!/usr/bin/env python3

import subprocess

alive = True

while (alive):
    isCommand = False

    text = input(">")
    if text == "send":
        isCommand = True
        subprocess.check_call("adb shell input tap 700 1150", shell=True)
        print("sent")

    if text == "sup":
        isCommand = True
        subprocess.check_call("adb shell input swipe 300 900 300 100", shell=True)

    if text.startswith("tap "):
        coords = text.split()
        # Only treat it as a tap when both coordinates are present
        if len(coords) == 3:
            isCommand = True
            subprocess.check_call("adb shell input tap " + coords[1] + " " + coords[2], shell=True)

    if text == "shownotif":
        isCommand = True
        subprocess.check_call("adb shell input swipe 400 10 400 1000", shell=True)

    if text == "unlock":
        isCommand = True
        subprocess.check_call("adb shell input swipe 400 1150 400 200", shell=True)
        subprocess.check_call("adb shell input text 0000", shell=True)
        subprocess.check_call("adb shell input keyevent 66", shell=True)

    if text == "home":
        isCommand = True
        subprocess.check_call("adb shell input keyevent 3", shell=True)

    if text == "end":
        isCommand = True
        alive = False

    if text == "enter":
        isCommand = True
        subprocess.check_call("adb shell input keyevent 66", shell=True)

    if text == "back":
        isCommand = True
        subprocess.check_call("adb shell input keyevent 4", shell=True)

    if isCommand == False:
        text = "\"" + text.replace(" ", "%s") + "\""
        subprocess.check_call(["adb", "shell", "input", "text", text])

Things to change


The script works well on my phone, which is a Nexus 4.

First thing to modify is the unlocking code: subprocess.check_call("adb shell input text 0000", shell=True)
For this line, replace the 0000 with your personal PIN

The 'send' command will also need to be changed to use the proper coordinates.
On my screen the WhatsApp or Viber send buttons are bottom right, which is around 700,1150.

Your phone will most likely be different depending on the screen size and DPI.
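As a starting point for another phone, the coordinates can be scaled from the Nexus 4 resolution (768x1280) to the target resolution. This is only a rough sketch: it assumes the layout scales proportionally, which is often not quite true, so expect to adjust the result by hand. The 1080x1920 target below is just an example.

```python
NEXUS4_SIZE = (768, 1280)  # width, height in pixels


def scale_point(x, y, target_size, source_size=NEXUS4_SIZE):
    """Scale a tap position from one screen resolution to another."""
    scale_x = target_size[0] / source_size[0]
    scale_y = target_size[1] / source_size[1]
    return round(x * scale_x), round(y * scale_y)


def tap_command(x, y):
    """Build the adb command line for a tap, without running it."""
    return ["adb", "shell", "input", "tap", str(x), str(y)]


# The 'send' button from the script, moved to a hypothetical 1080x1920 screen.
print(tap_command(*scale_point(700, 1150, (1080, 1920))))
```

The list returned by tap_command() can be handed to subprocess.check_call() as-is.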

Commands:


send: Send the message
back: Back button
home: Goes to the home screen
enter: Enter button (for example after entering the PIN, [enter] will unlock the phone)
sup: swipe up (from bottom to top, here again you may need to change the numbers)
tap: taps at the given coordinates (example: tap 400 400)
unlock: Unlocks the phone with the hard coded PIN
end: quits the script
shownotif: will show the notification list (swipe from the top to the bottom of the screen)
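Most of the commands above differ only by a key code or a set of coordinates, so they map naturally onto lookup tables. The sketch below is one possible restructuring, not the script I actually ran: it only builds the adb argument list (nothing is executed) and leaves out send, unlock and end for brevity.

```python
KEYCODES = {"home": 3, "back": 4, "enter": 66}

SWIPES = {
    "sup": (300, 900, 300, 100),
    "shownotif": (400, 10, 400, 1000),
}


def build_command(text):
    """Translate one line of user input into an adb argument list."""
    if text in KEYCODES:
        return ["adb", "shell", "input", "keyevent", str(KEYCODES[text])]
    if text in SWIPES:
        return ["adb", "shell", "input", "swipe"] + [str(v) for v in SWIPES[text]]
    words = text.split()
    if len(words) == 3 and words[0] == "tap" and words[1].isdigit() and words[2].isdigit():
        return ["adb", "shell", "input", "tap", words[1], words[2]]
    # Anything else is typed as text; 'input text' expects %s for spaces.
    return ["adb", "shell", "input", "text", '"' + text.replace(" ", "%s") + '"']


print(build_command("home"))
print(build_command("tap 400 400"))
print(build_command("hello world"))
```

Adding a new key or swipe then means adding one entry to a dictionary instead of another if block.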

What's next?


You can also run apps with adb: 

adb shell am start com.whatsapp/.Main

This will launch the WhatsApp app.

After that, by using swipe, tap and whatever else, you can still use your phone and download files from it while you wait for your new device to appear at the front door.
Basically just experiment (I archived my notes from Google Keep using adb) and have fun.

Thursday, February 18, 2016

Running Opensuse (Tumbleweed) on a Dell XPS 13

I don't have a XPS 13 but I do have a XPS 15.

Installing Opensuse on it was not too hard really but my friend does have a 13" and the biggest problem is that the Wireless card won't work.

The problem is that the drivers included with Opensuse don't work with the Broadcom 4352 that's in the machine.

There are some prerequisites to make this work:
  • You need another machine with Opensuse and the same kernel (or at least, close enough)
  • You need to get the source RPM for the broadcom 43xx cards
  • You need the kernel source and what not (gcc, kernel-devel, etc)

Getting the driver source

You can get that RPM Here: Packman mirror

The file should be around 2.9MB

Preparing Linux for compiling the source

~ sudo zypper install kernel-devel
~ sudo zypper install kernel-headers
~ sudo zypper install gcc

Make sure that the kernel sources match the kernel you have on the machine.

~ uname -a
Linux linux-c0wc 4.4.0-3-default #1 SMP PREEMPT Thu Jan 28 08:15:06 UTC 2016 (9f68b90) x86_64 x86_64 x86_64 GNU/Linux

~ ll /usr/src

You should have a directory here named: linux-4.4.0-3
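The check can be scripted. The sketch below (Python) derives the expected directory name from the kernel release string by dropping the trailing flavor ("default" in the output above). That naming rule is an assumption based on the single example shown here, so verify it against your own /usr/src.

```python
import os
import platform


def expected_source_dir(release):
    """Map a release such as '4.4.0-3-default' to '/usr/src/linux-4.4.0-3'."""
    version, separator, flavor = release.rpartition("-")
    if not separator or not flavor.isalpha():
        raise ValueError("unexpected kernel release format: " + release)
    return "/usr/src/linux-" + version


def sources_match_running_kernel():
    """True when the source directory for the running kernel exists."""
    return os.path.isdir(expected_source_dir(platform.release()))


print(expected_source_dir("4.4.0-3-default"))
```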

Extract the source RPM:

~ rpm -ivh broadcom-wl-6.30.223.248-6.31.src.rpm

Apply the patches

~ cd ~/rpmbuild/SOURCES/
~ mkdir hybrid_wl
~ cd hybrid_wl
~ tar -xzf ../hybrid-v35_64-nodebug-pcoem-6_30_223_248.tar.gz
~ patch -p1 < ../broadcom-wl-4.2.patch
~ patch -p1 < ../broadcom-wl-6_30_223_248-disable-timestamps.patch
~ patch -p1 < ../broadcom-wl-6_30_223_248-linux-4.x.patch

Build it

~ make

When the build succeeds, you should end up with a wl.ko file in this directory.


Copy the wl.ko onto your trusty USB stick and mount it onto the XPS 13.

Then, install the driver according to the Broadcom document (here)

I'm copying parts of it here to make it a little simpler.

~ sudo su

# lsmod | grep "brcmsmac\|b43\|ssb\|bcma\|wl"
If any of these are installed, remove them:
# rmmod b43
# rmmod brcmsmac
# rmmod ssb
# rmmod bcma
# rmmod wl
To blacklist these drivers and prevent them from loading in the future: (this step is important since after reboot, if you haven't done this, the original modules will be loaded and your WiFi still won't work)
# echo "blacklist ssb" >> /etc/modprobe.d/blacklist.conf
# echo "blacklist bcma" >> /etc/modprobe.d/blacklist.conf
# echo "blacklist b43" >> /etc/modprobe.d/blacklist.conf
# echo "blacklist brcmsmac" >> /etc/modprobe.d/blacklist.conf
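Running the echo commands twice appends duplicate lines to blacklist.conf. If you want to script this step, a small sketch like the following (Python, illustration only) works out which lines are still missing from the current content of the file; reading and writing the file itself is left out.

```python
CONFLICTING_MODULES = ["ssb", "bcma", "b43", "brcmsmac"]


def missing_blacklist_lines(existing_text, modules=CONFLICTING_MODULES):
    """Return the 'blacklist <module>' lines not yet present in the file text."""
    present = {line.strip() for line in existing_text.splitlines()}
    return ["blacklist " + m for m in modules if "blacklist " + m not in present]


current = "# local settings\nblacklist ssb\n"
for line in missing_blacklist_lines(current):
    print(line)
```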
Insmod the driver.

# insmod wl.ko

wl.ko is now operational. It may take several seconds for the Network
Manager to notice a new network driver has been installed and show the
surrounding wireless networks.

You can also reboot and once you launch the Network Manager, you should see the different networks.

Friday, August 7, 2015

MongoDB and AWS Lambda

I haven't posted a blog in quite a while since I've been busy at work (new job) and outside of work.
I've been working on an app with my buddy and we decided to use PhoneGap to write it (well, I'm writing the backend and the JS scripts while he takes care of the GUI).

Anyway, I've also decided to use AWS to do the whole backend. I still have to have a simple EC2 instance to run some job on regular basis but otherwise, everything is handled on AWS.

I am taking advantage of the Lambda service because I can write code there instead of inside the app, which means that whenever something needs to be changed, the app doesn't need to be updated (or at least not as often).

Initially I was using Lambda to query my DynamoDB instance.
Since DynamoDB is quite expensive, in case it gets too pricey, I will switch to MongoDB (could have picked Cassandra but I like Mongo's name better...).

I wasn't sure how to access Mongo from Lambda but I figured it out.

First, I created a Bitnami instance with MongoDB. That's pretty cool because it's available to AWS Free Tier and it's pre-installed so you don't have to do diddly squat or almost.

You need to configure Mongo so that it allows traffic other than 127.0.0.1.

Edit the config file:

sudo vi /opt/bitnami/mongodb/mongodb.conf

Comment out that line:

#bind_ip = 127.0.0.1

You also need to allow the port with the local firewall:

sudo ufw allow 27017

Restart Mongo:

sudo /opt/bitnami/ctlscript.sh restart mongodb

You're almost good to go. The last thing is to allow the port on the AWS console via the security group.
Find out which security group your EC2 instance is using and add port 27017 in the Inbound section.

Now you can query Mongo from anywhere. That's probably not the best security posture, but you can later restrict access to a specific IP.

Before doing anything else, you need to install the Mongo driver for NodeJS on your development machine:

npm install mongodb

You will have a new directory named "node_modules", which should look like this (minus the index.js and .zip)


You can write your NodeJS script and create a zip file. The script has to be named index.js.
Here's a small sample:

var mongodb = require('mongodb');

console.log('Loading function');

exports.handler = function(event, context) {
    console.log("Connecting to Mongo");

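    mongodb.MongoClient.connect('mongodb://user:password@ec2-blah.compute-X.amazonaws.com:27017/database', function(err, db) {
        if (err) {
            // Report a failed connection to Lambda instead of crashing on db.collection()
            return context.done(err);
        }
        console.log("Connected to mongo");
        var col = db.collection('Users');

        col.find({ userID: "415173559090" } ).toArray(function(err, docs) {
            if (err) {
                // Close the connection and let Lambda know the query failed
                db.close();
                return context.done(err);
            }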
            docs.forEach(function(doc) {
                console.log(doc);
            });
            
            db.close();
            console.log("Done");
            context.done(null, "finished");
        });    
    });
};
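One thing to watch in the connection string: if the user name or password contains characters such as @, : or /, they have to be percent-encoded or the URI will not parse. The string format is the same whatever language the driver is in, so here is a small sketch in Python that builds it; the host name and credentials are placeholders.

```python
from urllib.parse import quote


def build_mongo_uri(user, password, host, port, database):
    """Build a mongodb:// URI with percent-encoded credentials."""
    return "mongodb://{}:{}@{}:{}/{}".format(
        quote(user, safe=""),
        quote(password, safe=""),
        host,
        port,
        database,
    )


# Placeholder values, for illustration only.
print(build_mongo_uri("user", "p@ss:w/rd", "db.example.com", 27017, "database"))
```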

Upload to Lambda, test, happiness!

Your Lambda function should have at least 512MB of memory, and 1024MB is better. The more memory, the faster it will connect to your mongo instance (Lambda allocates CPU in proportion to the memory setting).

That little sample takes 1600ms with 1024MB and 3800ms with 512MB.

Thursday, October 9, 2014

API hooking without DLL injection

Lately I have been working on some project to do API hooking with minimal disturbance of the target programs.

I've tried a bunch of things but the reality is that there are no two ways about it: you have to inject a DLL and you have to either modify the import table or do preamble patching.

While both methods do work, it is trivial to detect IAT patching, and preamble hooking can be defeated with hook hopping or just by checking whether a DLL does or does not belong to your process. Of course you can attempt to hide your DLL by removing its entries from the 3 linked lists in the PEB (4 if you include the hash list) but your binary will still be visible via an NtQueryVirtualMemory() call. This is quite complicated work just to hide yourself. Not to mention that you still have to hook some DllMain.

Also, there are issues with injecting any sort of foreign module into an application. One is that you need to use your own memory manager, otherwise you may mess up the heap of the target. Some malware will unload the main program, resize the memory where it was loaded and reload it unpacked. This will cause the program to crash if your DLL has already made an allocation that was placed right after the loaded module (the memory can't be resized to anything bigger than it was, so the call returns NULL; the unpacker does not check for the NULL pointer and crashes).

There was some research done by some people at Purdue regarding instrumenting and debugging via Hardware Virtualization (https://www.cerias.purdue.edu/assets/pdf/bibtex_archive/2013-5.pdf) but I had something a little different in mind.

What about the use of a little known MSR?

I was made aware of a feature that exists on both AMD and Intel which could potentially solve my problem. On those CPUs, there is a specific set of MSRs that allow some profiling. One MSR that sounded very interesting was the DebugCTL MSR. Another set lets you get a list of the last 16 branches that occurred on a CPU, and a few more give you information about the last branch record (the source and destination of the last branch).

The basic idea is to take advantage of the Debug Control MSR.
First, set up the proper flags (bits 0 and 1):

mov ecx, 0x1d9 ; DebugCTL_MSR
rdmsr
or eax, 0x3
wrmsr
xor eax, eax
rdmsr          ; Just to make sure it worked.

Now if you run this code on VMWare, the last rdmsr will return 0 in EAX because that MSR is ignored. In order to get this to work, you would have to use some KVM based VM or possibly XEN but I haven't had the opportunity to try with XEN.

Now that the MSR is set up, for each thread that has the Trap Flag set (TF = EFlags bit 8), an INT 1 will be generated for every branching instruction such as CALL, JMP, JZ, RET, SYSCALL, etc.
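To keep the bit positions straight, here is a small sketch in Python that models the two registers as plain integers. The constant names are mine; the values are the architectural ones (DebugCtl is MSR 0x1D9, bit 0 enables last branch recording, bit 1 turns single-stepping into branch-stepping, and TF is bit 8 of EFLAGS).

```python
DEBUGCTL_MSR = 0x1D9

DEBUGCTL_LBR = 1 << 0  # record the last branches taken
DEBUGCTL_BTF = 1 << 1  # trap on branches instead of on every instruction
EFLAGS_TF = 1 << 8     # trap flag


def enable_branch_tracing(debugctl):
    """Value to write back to DebugCtl, matching 'or eax, 0x3'."""
    return debugctl | DEBUGCTL_LBR | DEBUGCTL_BTF


def set_trap_flag(eflags):
    """EFLAGS value with TF set for the thread we want to trace."""
    return eflags | EFLAGS_TF


def traps_on_branches(debugctl, eflags):
    """An INT 1 per branch needs both BTF in DebugCtl and TF in EFLAGS."""
    return bool(debugctl & DEBUGCTL_BTF) and bool(eflags & EFLAGS_TF)


print(hex(enable_branch_tracing(0)), hex(set_trap_flag(0x202)))
```

With BTF clear, TF alone gives the classic single-step behaviour (a trap after every instruction), which is even slower.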

I wrote two pieces of code to handle this.

Handling the DebugCTL_MSR inside the guest (or physical machine)

The first thing I did was to write an INT 1 handler. It's quite small (and incomplete) and goes something like this:


pushad         ; Save all registers
push fs
push ds
push es
mov bx, 30h
mov fs, bx     ; Set FS to 30h since we're in the kernel
mov bx, 23h
mov es, bx
mov ds, bx     ; ES and DS are now set the 23h

mov ecx, 1dch  ; Branch to address MSR (LastBranchToIP)
rdmsr
push eax

mov ecx, 1dbh  ; Branch from address MSR (LastBranchFromIP)
rdmsr 

mov bl, byte ptr [eax] ; Retrieve the op-code at the branch source
pop eax        ; restore EAX with Branch to address

cmp bl, 0e8h
je JmpOrCall
cmp bl, 0e9h
je JmpOrCall
cmp bl, 0xFF
je JmpOrCall
jmp short NoRange

JmpOrCall: 

push eax
call HandleSingleStep

NoRange:

mov ecx, 1d9h
xor edx, edx
mov eax, 3
wrmsr          ; Reset the MSR value 

...

Quite rudimentary but it works.

There is a small optimization here: the handler checks the op-code of the instruction the branch came from and only continues for 0xE8 (CALL), 0xE9 (JMP) or 0xFF (the indirect and far CALL/JMP group). We don't really care about conditional jumps or anything like that.
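Note that 0xFF on its own is a group op-code: it also covers INC, DEC and PUSH, and the actual operation is selected by the reg field of the following ModRM byte (/2 and /3 are CALL, /4 and /5 are JMP). A slightly stricter filter could look like the sketch below, written in Python for readability; it ignores instruction prefixes and short jumps, just like the handler above.

```python
def classify_branch(code):
    """Return 'call', 'jmp' or None for the instruction bytes at the branch source."""
    if not code:
        return None
    opcode = code[0]
    if opcode == 0xE8:
        return "call"  # CALL rel32
    if opcode == 0xE9:
        return "jmp"   # JMP rel32
    if opcode == 0xFF and len(code) > 1:
        reg = (code[1] >> 3) & 7  # ModRM.reg selects the operation
        if reg in (2, 3):
            return "call"  # indirect near / far CALL
        if reg in (4, 5):
            return "jmp"   # indirect near / far JMP
    return None


print(classify_branch(b"\xff\xd0"))  # call eax
print(classify_branch(b"\xff\xc0"))  # inc eax, not a branch
```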

The driver that contains the INT 1 handler also does a few more things. It still injects a DLL (because that was the fastest thing for me to do) and retrieves the addresses of a few hooks. Then, when Kernel32.DLL gets loaded, it retrieves the addresses of the target API (for instance: CreateFileA).

The HandleSingleStep(DWORD callee) will then resolve the address in the sense that it will check if the branch to address matches any of the APIs that we want to hook.  Once a match is found, all we need to do is to get the matching hook's address and change the return address of the INT 1.
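The lookup itself is nothing more than a table keyed by the API address. A minimal sketch of the idea in Python (the real thing lives in the driver; the addresses below are invented):

```python
def next_instruction_pointer(branch_target, hooks):
    """Pick where execution should resume after the INT 1.

    hooks maps the address of a target API to the address of its hook.
    When the branch target is not hooked, execution continues unchanged.
    """
    return hooks.get(branch_target, branch_target)


# Hypothetical addresses: CreateFileA and its hook in the injected DLL.
hooks = {0x7C801A28: 0x10001000}

print(hex(next_instruction_pointer(0x7C801A28, hooks)))  # redirected
print(hex(next_instruction_pointer(0x7C802000, hooks)))  # left alone
```

Since this runs on every CALL and JMP, the lookup has to be cheap; a hash table or a sorted array of the few hooked addresses is enough.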


Above is the flow of operations (or was since the picture is gone AWOL...).

The reason for me to still inject a DLL was to demonstrate that I could print some log in the hook and then resume the operations. Of course, the beginning of hookedCreateFileA() requires resetting the TF so that we can call the original CreateFileA() and then set TF again. Otherwise, we encounter the risk of getting into some deadlock.

Everything should be done inside the kernel driver. Since the INT 1 runs in the context of the application, we can access memory rather easily and do all sorts of things from there.
There is no need to have an extra DLL which in fact defeats the purpose of the solution.

The biggest problem is performance. If you think about it, every time a branch is taken, an INT 1 is triggered. That means context switches a-gogo and time wasted left and right.

Also, and equally important, is the fact that this only works on Windows XP. The reason is that NtCreateThreadEx() isn't called by NtCreateProcess(); instead, Windows makes use of a private function (PsInsertThread). Since it's private, it would be harder to hook.

We need to hook the thread creation function to force TF to be set to 1. The easiest way is to change the PCONTEXT at thread creation time. I tried using a kernel APC triggered when the thread was created but unfortunately, the thread's KTRAP_FRAME is not yet valid at that point. It isn't available during the Thread Resume operation either...

The other issue that we face is with Windows x64. PatchGuard will prevent us from hooking INT 1 and NtCreateThreadEx() unless we get rid of it, which may or may not be an option.

How about doing this on the host (for a VM)

I did that too.
It works well and is much easier to implement than inside the guest. Basically, we can setup TF when CR3 is written into (VMExit) and intercept the INT 1 when it pops.

Performance here is a much bigger issue because for one, every time the INT 1 is intercepted, we have a VMExit which is very costly. Unless we can do the filtering there (ignoring conditional jumps and whatnot) this causes a lot of unnecessary noise.

Another big problem is that when the program we monitor makes a SYSENTER/SYSCALL, TF is reset and it needs to be set again at SYSEXIT. If we do a VMExit per SYSEXIT, performance goes down drastically (a VMExit requires a lot of CPU cycles and then the user mode app still needs to be scheduled to run).

What then?

I still believe that the use for this can be very good but would have to be done in a sporadic manner. For instance, do regular hooking unless a program is twitchy enough that we ought to use something less invasive. 

Friday, September 16, 2011

Using APCs to inject your DLL

A few blogs ago I discussed remote threads on Windows 7. This topic was targeted towards the goal of injecting one's DLL into processes.

There are several ways of injecting your code inside any other processes:

  • Via SetWindowsHookEx
  • By using the AppInit_DLLs value in the registry
  • With CreateRemoteThread & NtCreateThreadEx
  • You could also do it from a driver by replacing the main thread's entry point with your shell code.

Those techniques are increasingly difficult to put in practice. They do work but sometimes, it's not enough or it may not work on some OS. For instance CreateRemoteThread only works half of the time on Windows 7 because of the different sessions used by applications and services.

Hooking NtCreateThread in a driver and injecting your DLL that way won't work on Windows 7 either since NtCreateThread is no longer used by NtCreateProcess. Not to mention that some of that stuff won't work on Windows 64. For instance, NtCreateThreadEx() doesn't use the same structure in 32bit and 64bit. It took me half a day to figure out the proper way of injecting a DLL on Win64.

That time could have been saved if I had used a proper/cleaner way of injecting my DLL.

Enter APCs.

In this article you can see how APCs are used in Windows (2000, XP, 7) so I won't go over it again.

The basic idea is that in order to inject our DLL, we will use an APC and queue it for the process. Quite obviously, this has to be done in a driver.

When the target program starts, our driver can be notified via a callback (See PsSetCreateProcessNotifyRoutine) and it can also be notified whenever a module loads (see PsSetLoadImageNotifyRoutine ).

As the module loading callback is called, we can wait for NTDLL.DLL to be loaded since it is the first DLL that will be automatically loaded for every process on the system.

Another reason to wait for NTDLL to be loaded is because we can parse the PE headers and find out the user mode address for LdrLoadDLL. You could do a GetProcAddress(NULL, "LoadLibraryA") and pass it down to your driver but with ASLR this could potentially cause problems.

So, in the callback, we wait for the NTDLL to load and then we obtain the address of LdrLoadDLL.
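The check on the module name is a simple string comparison on the last component of the path, done without regard to case. A sketch of the logic in Python (in the driver this would work on the UNICODE_STRING handed to the callback); the example paths are assumptions about what the full image name typically looks like.

```python
def classify_image(full_image_name):
    """Tell which of the interesting system DLLs, if any, was just loaded."""
    parts = full_image_name.replace("/", "\\").lower().split("\\")
    name = parts[-1]
    folder = parts[-2] if len(parts) > 1 else ""
    if name == "ntdll.dll":
        # A WOW64 process also loads a 32-bit copy from SysWOW64.
        return "ntdll32" if folder == "syswow64" else "ntdll"
    if name == "wow64.dll":
        return "wow64"
    return None


print(classify_image(r"\SystemRoot\System32\ntdll.dll"))
print(classify_image(r"\Device\HarddiskVolume2\Windows\System32\kernel32.dll"))
```

Comparing only the file name would also match a DLL called ntdll.dll loaded from some other directory, so checking the folder as well is a cheap extra safety.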

Here's the code to find out the address of a function in a given DLL.

/**
 * This function is like GetProcAddress()
 * ImageBase is the address of the mapped DLL
 * ImageSize is the size of the DLL
 * FunctionName is the API we are looking for
 *
 * Before looking through the PE headers, we need to map the DLL in memory because
 * it may not be fully mapped. By creating a MDL, we can take care of this.
 */

// Helper assumed by the code below (not shown in the original listing):
// turns a relative virtual address into a pointer.
#define PtrFromRva(base, rva) ((PVOID)((PUCHAR)(base) + (rva)))

PVOID GetProcAddress(PVOID ImageBase, DWORD ImageSize, const char* FunctionName)
{
    PVOID pFunc = NULL;
    PIMAGE_DOS_HEADER DosHeader = NULL;
    PIMAGE_NT_HEADERS NtHeader = NULL;
    PIMAGE_EXPORT_DIRECTORY pIed = NULL;
    PIMAGE_DATA_DIRECTORY ExportDataDir;
    PIMAGE_EXPORT_DIRECTORY ExportDirectory;
    PVOID LoadAddress = NULL;
    PULONG FunctionRvaArray;
    PUSHORT OrdinalsArray;
    PULONG NamesArray;
    ULONG Index;
    PMDL vMem = NULL;

    vMem = IoAllocateMdl(ImageBase, ImageSize, FALSE, FALSE, NULL);
    if (vMem == NULL)
    {
        DbgPrint("Unable to allocate the MDL");
        return NULL;
    }

    __try {
        // Lock the pages so the PE headers are resident while we parse them.
        MmProbeAndLockPages(vMem, UserMode, IoReadAccess);
    }
    __except(EXCEPTION_EXECUTE_HANDLER)
    {
        IoFreeMdl(vMem);

        DbgPrint("Unable to read memory");
        return NULL;
    }

    LoadAddress = MmGetMdlVirtualAddress(vMem);
    DosHeader = (PIMAGE_DOS_HEADER) LoadAddress;

    //
    // Peek into PE image to obtain exports.
    //
    NtHeader = ( PIMAGE_NT_HEADERS ) PtrFromRva( DosHeader, DosHeader->e_lfanew );
    if( IMAGE_NT_SIGNATURE != NtHeader->Signature )
    {
        //
        // Unrecognized image format. Release the MDL before bailing out.
        //
        if (vMem != NULL)
        {
            MmUnlockPages(vMem);
            IoFreeMdl(vMem);
        }
        return NULL;
    }

    ExportDataDir = &NtHeader->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT];
    ExportDirectory = ( PIMAGE_EXPORT_DIRECTORY ) PtrFromRva(LoadAddress, ExportDataDir->VirtualAddress);

    if ( ExportDirectory->AddressOfNames == 0 ||
         ExportDirectory->AddressOfFunctions == 0 ||
         ExportDirectory->AddressOfNameOrdinals == 0 )
    {
        //
        // This module does not have any exports. Release the MDL before bailing out.
        //
        if (vMem != NULL)
        {
            MmUnlockPages(vMem);
            IoFreeMdl(vMem);
        }
        return NULL;
    }

    FunctionRvaArray = ( PULONG ) PtrFromRva(LoadAddress, ExportDirectory->AddressOfFunctions);
    OrdinalsArray = ( PUSHORT ) PtrFromRva(LoadAddress, ExportDirectory->AddressOfNameOrdinals);
    NamesArray = ( PULONG) PtrFromRva(LoadAddress, ExportDirectory->AddressOfNames);

    for ( Index = 0; Index < ExportDirectory->NumberOfNames; Index++ )
    {
        //
        // Get corresponding export ordinal.
        //
        USHORT Ordinal = ( USHORT ) OrdinalsArray[ Index ] + ( USHORT ) ExportDirectory->Base;

        //
        // Get corresponding function RVA.
        //
        ULONG FuncRva = FunctionRvaArray[ Ordinal - ExportDirectory->Base ];

        if ( FuncRva >= ExportDataDir->VirtualAddress && 
             FuncRva < ExportDataDir->VirtualAddress + ExportDataDir->Size )
        {
            //
            // It is a forwarder.
            //
        }
        else
        {
            //
            // It is an export.
            //
            // Use pointer-sized arithmetic so the address is not truncated on 64bit.
            const char* pszName = (const char*) ((ULONG_PTR) LoadAddress + NamesArray[Index]);
            if (strcmp(pszName, FunctionName) == 0)
            {
                pFunc = (PVOID) ((ULONG_PTR) LoadAddress + FuncRva);
                break;
            }
        }
    }

    __try {
        if (vMem != NULL)
        {
            MmUnlockPages(vMem);                            
            IoFreeMdl(vMem);
        }
    }
    __except(EXCEPTION_EXECUTE_HANDLER)
    {
        DbgPrint("Unable to read memory");
    }

    return pFunc;
}

By now, your callback knows that NTDLL has been loaded and, with the help of the code above, we have obtained the address of LdrLoadDll.

Notice that the beginning of the function above consists of mapping the DLL into memory. The DLL is loaded but may or may not be fully mapped in memory. If you skip this step, you may not find the API you want because the PE headers are not in memory yet.
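The walk through the export directory is easier to follow on a tiny hand-made image. The sketch below (Python, for illustration) builds a fake mapped image containing only an IMAGE_EXPORT_DIRECTORY and its three arrays, then resolves a name the same way the C code does: names array, then ordinals array, then functions array. The RVAs and the two function addresses are invented.

```python
import struct

# IMAGE_EXPORT_DIRECTORY: 11 little-endian fields, 40 bytes in total.
EXPORT_DIRECTORY = struct.Struct("<IIHHIIIIIII")


def read_c_string(image, rva):
    """Read a NUL-terminated ASCII string at the given RVA."""
    end = image.index(b"\x00", rva)
    return image[rva:end].decode("ascii")


def find_export(image, export_rva, export_size, wanted):
    """Return the RVA of an exported function, or None (missing or forwarder)."""
    fields = EXPORT_DIRECTORY.unpack_from(image, export_rva)
    number_of_names = fields[7]
    functions_rva, names_rva, ordinals_rva = fields[8], fields[9], fields[10]
    for i in range(number_of_names):
        (name_rva,) = struct.unpack_from("<I", image, names_rva + 4 * i)
        if read_c_string(image, name_rva) != wanted:
            continue
        # The ordinals array holds an index into the functions array.
        (index,) = struct.unpack_from("<H", image, ordinals_rva + 2 * i)
        (function_rva,) = struct.unpack_from("<I", image, functions_rva + 4 * index)
        if export_rva <= function_rva < export_rva + export_size:
            return None  # points back into the export data: a forwarder
        return function_rva
    return None


def build_demo_image():
    """Fake mapped image where every RVA is simply an offset into the buffer."""
    image = bytearray(0x200)
    # Base=1, NumberOfFunctions=2, NumberOfNames=2, then the three array RVAs.
    EXPORT_DIRECTORY.pack_into(image, 0x40, 0, 0, 0, 0, 0, 1, 2, 2, 0x80, 0x90, 0xA0)
    struct.pack_into("<II", image, 0x80, 0x1000, 0x1234)  # function RVAs
    struct.pack_into("<II", image, 0x90, 0xB0, 0xD0)      # name RVAs
    struct.pack_into("<HH", image, 0xA0, 0, 1)            # name ordinals
    for rva, name in ((0xB0, b"LdrGetProcedureAddress"), (0xD0, b"LdrLoadDll")):
        image[rva:rva + len(name)] = name
    return bytes(image)


print(hex(find_export(build_demo_image(), 0x40, 0xC0, "LdrLoadDll")))
```

In the driver, the returned RVA is added to the load address of NTDLL to get the user mode address that goes into the context.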

Now, we need to prepare for the APC.
First, we need a dummy Kernel APC routine since it is required by the queuing call.

/**
  * This is dummy Kernel APC Routine
  *
  */
VOID KernelApcRoutine (
    IN PKAPC Apc,
    IN PKNORMAL_ROUTINE *NormalRoutine,
    IN PVOID *NormalContext,
    IN PVOID *SystemArgument1,
    IN PVOID *SystemArgument2)
{
    UNREFERENCED_PARAMETER( SystemArgument1 );
    UNREFERENCED_PARAMETER( SystemArgument2 );

    DbgPrint("User APC is being delivered - Apc: %p\n", Apc);
    if (PsIsThreadTerminating( PsGetCurrentThread() ))
    {
        *NormalRoutine = NULL;
    }

    ExFreePoolWithTag(Apc, DIRECT_KERNEL_ALLOC_TAG);
}


That code will be called when the APC is processed and it will delete the memory allocated for the APC object.

Next, we need a function to create the APC and queue it. First our new function needs to allocate some memory in the target process. Since the callback is called in the context of the target, we can simply use NtCurrentProcess() to specify what process the memory will be allocated into.

...
    ZwAllocateVirtualMemory( NtCurrentProcess(),
                             &context,
                             0,
                             &contextSize,
                             MEM_COMMIT, PAGE_READWRITE);

...

The {context} is a structure that you will define. You can store the address of LdrLoadDLL inside of it as well as the name of the DLL you want to inject.

Then you need to allocate memory for the APC object, which you can do using:

    ExAllocatePoolWithTag(NonPagedPool, sizeof(KAPC), DIRECT_KERNEL_ALLOC_TAG);

Now, you must initialize the APC, which you do with the following call:

...
     KeInitializeApc(apc, KeGetCurrentThread(),
                     OriginalApcEnvironment,
                     (PKKERNEL_ROUTINE) KernelApcRoutine,
                     NULL,
                     InjectionShellCode,
                     UserMode, context);
...

The context is the one that we just allocated and which will be passed as a parameter to the user mode routine.

For the actual routine, you can go two ways. You can create some shell code in assembly and insert the opcodes inside an array in memory, or you can create a function in your own code. If you write a function inside your source code, you will have to make sure that it does not call any Windows APIs.

The only thing your function should do is to create a UNICODE string manually (meaning, no call to RtlUnicodeStringInit() ).

Once the APC is initialized, you can queue it using:

...
     KeInsertQueueApc( apc, NULL, NULL, 0);
...

Here's how LdrLoadDLL is called:

...
     pfnLdrLoadDll(NULL,           // No name
                   0,
                   &pDLL,          // full path here as a unicode string
                   &handle);
...

Basically, what will happen is the following:

  1. Process is created
  2. Module Load Callback is called
  3. Callback checks if the module is NTDLL.DLL
  4. Callback retrieves the address of LdrLoadDll() from NTDLL.DLL
  5. Allocate the APC user mode routine context
  6. Allocate memory in the target process to hold the shellcode
  7. Allocate memory for the APC
  8. Initialize the APC (user mode routine points to the shell code)
  9. Queue the APC
  10. The rest of the modules get loaded and your APC gets processed
  11. The user mode routine loads the DLL using LdrLoadDll.
  12. Main thread is created
  13. Process starts

Hopefully, all this should get you going.

Happy coding!