March 4, 2024

Analysing Windows Malware on Apple Mac M1/M2 ( Windows 11 ARM ) - Part I

x86/x64 emulation internals on Windows 11 ARM #

Introduction #

Since the introduction of Intel processors for the MacBooks, malware analysis on Mac has become quite popular, and it has become the hardware of choice for malware analysts.

With the introduction of the Mac M1, the landscape changed significantly. The processor is no longer Intel-based but ARM-based. This shift has been painful for malware analysts, because effective malware analysis relies on native virtualization. Full emulation with tools such as QEMU is possible, but it often proves to be more trouble than it is worth.

Fortunately, there is a version of Windows that runs natively on ARM hardware: Windows on ARM. It offers backward compatibility through an emulation layer for running both 32-bit (x86) and 64-bit (x64) Windows applications. Combined with native virtualization on the Mac, this provides a viable solution for analysts.

While emulation works well with normal applications, this post explores the challenges malware analysts can encounter when analyzing malware on Windows ARM using WOW64ARM.

Malware analysis on Windows ARM presents unique obstacles due to the WOW64ARM emulation layer. We will delve into the specific issues analysts face in this context and shed light on the complexities and limitations inherent in the process.

But before we begin, we need to go deeper into some of the Windows internals concepts that allow the translation to take place.

WOW64 #

illustration : Stack Overflow
On a native x86 system, system calls are performed via an interrupt, i.e., int 2e, as no translation is required. However, when a 32-bit application runs on 64-bit Windows under WOW64, the transition to 64-bit code is done via a far jump commonly known as ‘Heaven’s Gate’.

On a WOW64 process, four extra DLLs are loaded:

ntdll.dll (64-bit)
wow64.dll
wow64win.dll
wow64cpu.dll

Eventually, the call is routed via wow64cpu.dll, where the far jump to 64-bit code takes place inside the Wow64Transition function and lands inside the 64-bit version of ntdll.dll. Before jumping, the processor state is saved; it is restored when the transition back to 32-bit happens via the BTCpuSimulate function. In other words, WOW64 on x64 does not emulate 32-bit code; it provides a bridge between 32-bit and 64-bit code, since the 64-bit processor can execute 32-bit code directly.

An ARM processor cannot execute any Intel processor instructions. So, in the case of WOW64 on Windows ARM, there is actually an emulation layer provided.

x86 on Windows ARM #

Emulation of x86 applications on ARM64 is done via binary translation. This is handled by xtajit.dll instead of wow64cpu.dll.

Binary JIT Translation happens in xtajit.dll. x86 instructions are translated into ARM on the fly and further saved in a cache for faster future retrieval.

This is quite different from WOW64 on 64-bit, which offers “emulation” at native speed as instructions are not emulated; rather, they are executed at native speeds.

If we look at the loaded modules of an x86 binary on Windows ARM, some peculiar paths stand out:

C:\Windows\SyChpe32
C:\Windows\XtaCache
C:\Windows\System32

DLL files loaded from SyChpe32 are compiled hybrid portable executable (CHPE) DLLs. They are essentially x86 versions of the system DLLs that also contain ARM code, hence "hybrid". For most exported APIs, a CHPE DLL provides a small x86 jump thunk followed by a jump into a separate region that consists of ARM code. This skips the JIT translation step, achieving almost native speed when execution is inside system DLLs.

0:000> u 75d46230
KERNEL32!EXP+#VirtualAlloc:
75d46230 8bff            mov     edi,edi
75d46232 55              push    ebp
75d46233 8bec            mov     ebp,esp
75d46235 5d              pop     ebp
75d46236 90              nop
75d46237 e9c4990000      jmp     KERNEL32!#VirtualAllocStub (75d4fc00)
75d4623c cc              int     3
75d4623d cc              int     3
0:000> u 75d4fc00
KERNEL32!#VirtualAllocStub:
75d4fc00 8807            mov     byte ptr [edi],al
75d4fc02 00b008b12111    add     byte ptr [eax+1121B108h],dh
75d4fc08 09fd            or      ebp,edi
75d4fc0a df8820011fd6    fisttp  word ptr [eax-29E0FEE0h]
75d4fc10 0000            add     byte ptr [eax],al
75d4fc12 0000            add     byte ptr [eax],al
75d4fc14 0000            add     byte ptr [eax],al
75d4fc16 0000            add     byte ptr [eax],al

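0x75d46230 is the x86 jump thunk, and VirtualAllocStub is actually ARM code for the same API call.

When we change the assembly type to ARM, the same bytes are decoded as ARM instructions. Note that the listing below is rendered in 32-bit Thumb mode; read as four 32-bit ARM64 words, the stub bytes form a short adrp / add / ldar / br x9 sequence.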

75d4fc00 0788     lsls        r0,r1,#0x1E
75d4fc02 b000     add         sp,sp,#0
75d4fc04 b108     cbz         r0,KERNEL32!#VirtualAllocStub+0xa (75d4fc0a)
75d4fc06 1121     asrs        r1,r4,#4
75d4fc08 fd0988df stc2        p8,c8,[r9,#-0x37C]
75d4fc0c 0120     lsls        r0,r4,#4
75d4fc0e d61f     bvs         KERNEL32!QuirkIsEnabled3Worker+0x30 (75d4fc50)
75d4fc10 0000     movs        r0,r0

This is the basic idea behind CHPE executables as they are hybrid, consisting of both Intel and ARM assembly code.

X64 on Windows ARM (CHPE version 2) #

x64 emulation on Windows on ARM first appeared in Windows 10 Insider preview builds (December 2020) and became generally available with Windows 11. It allows ARM-based Windows devices to run x64 (64-bit) applications through emulation, expanding the range of software that can be run on these devices.

It also included a version of CHPE V2, slightly different from the older CHPE version 1. X64 emulation on Windows ARM comes in two different flavors:

Arm64EC (Emulation Compatible)
ARM64X

Arm64EC (Emulation Compatible) #

In Arm64EC, you can mix both x64 and ARM code in a single binary. It provides a form of interoperability between the two architecture formats.

Developers can take advantage of this interoperability and significantly increase the speed of applications.

Arm64EC provides an ABI (application binary interface) to provide this interoperability between x64 and ARM code, which includes, but is not limited to:

Register mapping
Exit and entry thunks
Calling conventions

Let’s try to compile a sample Arm64EC executable and see how it looks after linking.

extern int hello();

int function1()
{
    int a = 10;

    a = a + 100;
    return a;
}
int main()
{
    int b = 0;

    b++;

    b = function1();

    if (b > 100)
    {
        return b;
    }

    hello();
    return b; // To bypass any optimization
}

To observe how x64 code gets translated into ARM, we will use x64 code and compile it using ml64.exe into an object file which we will link later.

.code
hello PROC
    xor rax, rax
    add rax, 10
    add rax, 10
    add rax, 10
    add rax, 10
    add rax, 10
    add rax, 10
    add rax, 10
    add rax, 10
    add rax, 10
    add rax, 10
    ret
hello ENDP
END

ml64 /c x64asm.asm

To build an ARM64EC (emulation compatible) binary, we can use the Visual Studio developer tools for ARM64 (vcvarsarm64.bat) with the following command line:

cl /arm64EC sample.c /c
link /MACHINE:ARM64EC sample.obj x64asm.obj /entry:main

We can verify the result using the dumpbin tool to check the machine type in the header:

dumpbin sample.exe /headers

Let’s see how this binary looks from the inside.

As we can see, there is just an entry thunk of 64-bit code; the rest of the code is converted into ARM64 after the JMP instruction. Important to note here is that x64 code is present in the .hexpthk section of the binary, and ARM code is present in the .text section of the binary. .hexpthk consists of thunks used for interoperability between two 64-bit PE binaries.

Also, these entry sequences are known as Fast-Forward Sequences, which help if the base application is trying to hook the API call. FFS sequences finally lead to a tail-call to the real Arm64EC function.

ARM64EC code is an ahead-of-time precompiled version of a particular x64 function done during compilation, not during execution. Only x64 code is JIT compiled and executed.

Hybrid Code Address Range Table

                Address Range
          ----------------------
        arm64ec  0000000140001000 - 00000001400027E3 (00001000 - 000027E3)
            x64  0000000140003000 - 000000014000400F (00003000 - 0000400F)

So essentially the flow is: 64-bit (x64) thunk > main() (ARM64).

Using the .effmach meta-command of WinDbg, we can change the effective machine type to CHPE to retrieve the ARM assembly of the main function.

.effmach chpe

u

00007ff7`445d1040 a9be7bfd stp         fp,lr,[sp,#-0x20]!
00007ff7`445d1044 910003fd mov         fp,sp
00007ff7`445d1048 52800008 mov         w8,#0
00007ff7`445d104c b90013e8 str         w8,[sp,#0x10]
00007ff7`445d1050 b94013e8 ldr         w8,[sp,#0x10]
00007ff7`445d1054 11000508 add         w8,w8,#1
00007ff7`445d105c 97ffffef bl          sample+0x1018 (00007ff7`445d1018)

This is the ARM64 assembly generated for the following lines of main():

int b = 0;
b++;

mov         w8,#0
str         w8,[sp,#0x10]
ldr         w8,[sp,#0x10]
add         w8,w8,#1

mov w8, #0: Moves the immediate value 0 into register w8.
str w8, [sp, #0x10]: Stores w8 on the stack at offset 0x10 from the stack pointer (sp). Together with the previous instruction, this implements int b = 0;
ldr w8, [sp, #0x10]: Loads the 32-bit word at offset 0x10 from the stack pointer back into w8, retrieving the value just stored.
add w8, w8, #1: Adds the immediate value 1 to w8. Together with the load, this implements b++;

Even the hardcoded x64 instructions mentioned above, which simply increment the rax register 10 times, are converted to ARM instructions, albeit in a vague form. This conversion is necessary due to the interoperability between the ARM and x64 ABIs.

mov x8,x0
00007ff7e11d1110 a8c17bfd ldp fp,lr,[sp],#0x10
00007ff7e11d1114 ad443fee ldp q14,q15,[sp,#0x80]
00007ff7e11d1118 ad4337ec ldp q12,q13,[sp,#0x60]
00007ff7e11d111c ad422fea ldp q10,q11,[sp,#0x40]
00007ff7e11d1120 ad4127e8 ldp q8,q9,[sp,#0x20]
00007ff7e11d1124 acc51fe6 ldp q6,q7,[sp],#0xA0
00007ff7e11d1128 d50323ff autibsp

Now, for this version, the interoperability is such that an x64 binary can interact/load an ARM64EC binary and vice versa.

ARM64X #

One of the issues with ARM64EC (emulation compatible) is the fact that only emulation-compatible binaries can load emulated x64 binaries. It is not possible to load an EC binary or x64 binary from a native ARM64 binary.

To solve this issue, Microsoft introduced a new binary format known as ARM64X. ARM64X differs from ARM64EC in that it can be loaded from both an x64 process and a native ARM64 process. This is made possible by transforming the binary during the loading phase using a new type of relocation stored in the DVRT, i.e., the Dynamic Value Relocation Table.

Let’s compile an ARM64X DLL and load it from both an x64 and a native ARM64 process to observe the changes in action.

DllSample.c

__declspec(dllexport) int hello(int a, int b , int c)
{

    return a + b + c; 

}

Loader.c

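#include <windows.h>

int main(int argc, char **argv)
{
    HMODULE hDLL = LoadLibraryA("load.dll");
    if (hDLL == NULL) return 1;   // DLL missing or not loadable in this process

    // hello() takes three ints, so the pointer type must match its signature
    int (__cdecl *hello)(int, int, int) =
        (int (__cdecl *)(int, int, int))GetProcAddress(hDLL, "hello");
    if (hello == NULL) return 1;

    hello(10, 20, 30);

    FreeLibrary(hDLL);
    return 0;
}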


Microsoft (R) C/C++ Optimizing Compiler Version 19.38.33135 for ARM64

link load.obj /DLL /MACHINE:ARM64X /NODEFAULTLIB /NOENTRY
Microsoft (R) Incremental Linker Version 14.38.33135.0
Copyright (C) Microsoft Corporation.  All rights reserved.

   Creating library load.lib and object load.exp

Once the DLL is compiled, it can be loaded from an x64 process, and it runs correctly. The reverse also works: this ARM64X binary, which contains both native ARM64 code and ARM64EC code, can be loaded from a native ARM64 process.

How exactly does that happen? The Dynamic Value Relocation Table (DVRT) makes it possible behind the scenes.

Dynamic Value Relocation Table ( DVRT) #

If we load an ARM64X DLL from an x64 process, this should normally not be allowed, because the _IMAGE_FILE_HEADER.Machine value of the DLL does not match that of the process; an x64 process only loads binaries with a matching Machine type. However, once loading is complete, _IMAGE_FILE_HEADER.Machine in memory is set to IMAGE_FILE_MACHINE_AMD64, while on disk it is still 0xAA64 (ARM64X).

Let’s observe this phenomenon firsthand.

#include <windows.h>
#include <stdio.h>

const char* GetMachineTypeName(WORD machine) {
    switch (machine) {
        case IMAGE_FILE_MACHINE_AMD64:
            return "x64";
        case IMAGE_FILE_MACHINE_ARM:
            return "ARM";
        case IMAGE_FILE_MACHINE_ARM64:
            return "ARM64";
        case IMAGE_FILE_MACHINE_IA64:
            return "IA-64";
        case IMAGE_FILE_MACHINE_I386:
            return "x86";
        default:
            return "Unknown";
    }
}

int GetMachineType(HMODULE hModule) {

    // Get the DOS header
    IMAGE_DOS_HEADER* dosHeader = (IMAGE_DOS_HEADER*)hModule;

    // Check if it's a valid PE file
    if (dosHeader->e_magic != IMAGE_DOS_SIGNATURE) {
        printf("Not a valid PE file.\n");
        FreeLibrary(hModule);
        return 0;
    }

    // Get the NT header
    IMAGE_NT_HEADERS* ntHeader = (IMAGE_NT_HEADERS*)((BYTE*)hModule + dosHeader->e_lfanew);

    // Check if it's a 64-bit PE file
    if (ntHeader->OptionalHeader.Magic != IMAGE_NT_OPTIONAL_HDR64_MAGIC) {
        printf("Not a 64-bit PE file.\n");
        FreeLibrary(hModule);
        return 0;
    }

    // Get the machine type
    WORD machine = ntHeader->FileHeader.Machine;

    return machine;
}

int main(int argc, char **argv)
{
    HMODULE hDLL = LoadLibraryA("load.dll");
    if (hDLL == NULL) {
        printf("Failed to load load.dll (error %lu)\n", GetLastError());
        return 1;
    }

    // Read the Machine field from the in-memory (already fixed-up) PE header
    printf("Machine type = %s\n", GetMachineTypeName((WORD)GetMachineType(hDLL)));
    return 0;
}

Compile the above code as both an x64 and a native ARM64 binary.

For the x64 binary we get the following result:

Machine type = x64

and for the native ARM64 binary:

Machine type = ARM64

The reason for this dynamic change is the presence of the Dynamic Value Relocation Table (DVRT) in the binary. The Dynamic Value Relocation Table is similar to a relocation table but in a metadata format. It can be obtained via the load config table of the PE file.

The load config table contains several fields related to the DVRT. Using the dumpbin tool, we can dump the contents of the DVRT if it is present in a binary.

DVRT can also be used to mitigate Spectre CPU vulnerabilities by inserting retpolines into the binary.

The DVRT has the following format in memory:

typedef struct _IMAGE_DYNAMIC_RELOCATION_TABLE {
    DWORD Version;
    DWORD Size;
//  IMAGE_DYNAMIC_RELOCATION DynamicRelocations[0];
} IMAGE_DYNAMIC_RELOCATION_TABLE, *PIMAGE_DYNAMIC_RELOCATION_TABLE;

With ARM64X, Microsoft introduced a new symbol type, namely type 6, which is for the IMAGE_DYNAMIC_RELOCATION_ARM64X. The rest is defined for retpoline mitigation types.

#define IMAGE_DYNAMIC_RELOCATION_GUARD_RF_PROLOGUE             1
#define IMAGE_DYNAMIC_RELOCATION_GUARD_RF_EPILOGUE             2
#define IMAGE_DYNAMIC_RELOCATION_GUARD_IMPORT_CONTROL_TRANSFER 3
#define IMAGE_DYNAMIC_RELOCATION_GUARD_INDIR_CONTROL_TRANSFER  4
#define IMAGE_DYNAMIC_RELOCATION_GUARD_SWITCHTABLE_BRANCH      5
#define IMAGE_DYNAMIC_RELOCATION_ARM64X                        6

Simply put, the loader replaces each fixup location with the data provided in the DVRT so that the loaded image is compatible with an x64 process. This includes, but is not limited to, changing the machine type, imports, function exports, and exception pointers.

Index	RVA	Bytes	Target Value
00000000	00000104	2	8664
00000001	00000128	4	5010
00000002	00000188	4	7DE0

Consider, for example, the very first entry. In page 0x00000000, at offset 0x104, two bytes need to be replaced with the value 0x8664, which is the Machine value IMAGE_FILE_MACHINE_AMD64. So, if the DLL is loaded by an x64 process, this transformation takes place, as observed in the example above.

Similarly, the second entry is at offset 0x128, which is AddressOfEntryPoint. When the DLL is loaded by an x64 process, it is replaced with 0x5010, i.e., DllMain or _DllMainCRTStartup in the case of a DLL. The native ARM64 view has its own _DllMainCRTStartup, while the ARM64EC entry point is different code: basically a Fast-Forward Sequence (FFS) thunk that points towards ARM64’s _DllMainCRTStartup.

[00000006] page 00000000 rva 000001D8, 4 bytes, target value 7200

This entry is for the Load Config directory inside the PE header. It also differs for x64, as the load config data contains a lot of architecture-specific information, such as the security cookie, exception table, debug info, etc.

This pretty much explains the technology behind the Windows ARM emulation for x64/x86 binaries. In the context of malware, mostly we will be dealing with x86 binaries, as most of the malware binaries are compiled in x86, and some happen to be in x64.

In the next part of this series, we will test some malware on Windows 11 ARM. We will also run manual tests based on tricks and techniques commonly used by modern Windows malware. So stay tuned!

Kudos