Analysing Windows Malware on Apple Mac M1/M2 (Windows 11 ARM) - Part I
x86/x64 emulation internals on Windows 11 ARM #
Introduction #
Ever since Apple moved the MacBook to Intel processors, malware analysis on the Mac has been popular, and the MacBook became the hardware of choice for many malware analysts.
With the introduction of the Mac M1, the landscape changed significantly. The processor is no longer Intel-based but ARM-based. This shift has been painful for malware analysts, because effective malware analysis relies on native virtualization of x86/x64 guests. Full emulation with a tool such as QEMU is possible, but it often proves to be more trouble than it is worth.
Fortunately, there is a version of Windows that runs natively on ARM hardware: Windows on ARM. It offers backward compatibility through an emulation layer that runs both 32-bit (x86) and 64-bit (x64) Windows applications. Combining native virtualization of Windows on ARM with this emulation layer gives analysts a viable setup.
Emulation works well for normal applications, but malware analysis on Windows on ARM presents unique obstacles because of the emulation layer (WOW64ARM). This post explores the issues analysts can run into in this setup, along with the complexities and limitations inherent in the process.
Before we begin, we need to go deeper into some of the Windows internals concepts that make the translation possible.
WOW64 #
Illustration: Stack Overflow
On a native 32-bit x86 system, system calls are performed directly via an interrupt, i.e., int 2e (or sysenter on newer processors), as no translation is required. However, when a 32-bit application runs under WOW64 on 64-bit Windows, the transition to 64-bit mode is done via a far jump into the 64-bit code segment, a technique known as 'Heaven's Gate'.
In a WOW64 process, four extra 64-bit DLLs are loaded:
- ntdll.dll (64-bit)
- wow64.dll
- wow64win.dll
- wow64cpu.dll
Eventually, the call is routed via wow64cpu.dll, where the far jump to 64-bit code takes place through Wow64Transition and lands inside the 64-bit version of ntdll.dll. Before the jump, the 32-bit processor state is saved; it is restored when execution switches back to 32-bit code via the BTCpuSimulate function. In other words, WOW64 on x64 does not emulate 32-bit code. It only provides a bridge between 32-bit and 64-bit code, since a 64-bit Intel processor can execute 32-bit code natively.
An ARM processor cannot execute any Intel processor instructions. So, in the case of WOW64 on Windows ARM, there is actually an emulation layer provided.
x86 on Windows ARM #
Emulation of x86 applications on ARM64 is done via binary translation, which is handled by xtajit.dll instead of wow64cpu.dll. xtajit.dll is a JIT binary translator: x86 instructions are translated into ARM64 code on the fly, and the translated code is saved in a cache for faster retrieval later.
This is quite different from WOW64 on x64, where nothing is emulated at all and the 32-bit instructions execute directly on the processor at native speed.
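The translate-once, reuse-later idea can be illustrated with a toy cache. This is only a sketch under loud assumptions: the names tc_cache, tc_lookup_or_translate and fake_translate are hypothetical, the "translation" is a stand-in integer rather than generated ARM64 code, and nothing here reflects how xtajit.dll or the files under XtaCache are actually organised.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TC_SLOTS 64  /* toy direct-mapped cache; a real JIT cache is far larger */

typedef struct {
    uint32_t guest_addr;  /* address of the x86 block (the lookup key) */
    uint32_t host_code;   /* stand-in for a pointer to translated ARM64 code */
    int      valid;
} tc_entry;

typedef struct {
    tc_entry slots[TC_SLOTS];
    unsigned translations;  /* how many times the slow path ran */
} tc_cache;

/* Hypothetical "translator": a real JIT would emit ARM64 code here. */
static uint32_t fake_translate(uint32_t guest_addr) {
    return guest_addr ^ 0xA64A64u;
}

/* Return the translated block for guest_addr, translating only on a miss. */
static uint32_t tc_lookup_or_translate(tc_cache *c, uint32_t guest_addr) {
    assert(c != NULL);
    tc_entry *e = &c->slots[(guest_addr >> 2) % TC_SLOTS];
    if (e->valid && e->guest_addr == guest_addr)
        return e->host_code;             /* hit: no translation cost */
    e->guest_addr = guest_addr;          /* miss: translate and remember */
    e->host_code  = fake_translate(guest_addr);
    e->valid      = 1;
    c->translations++;
    return e->host_code;
}
```

Calling tc_lookup_or_translate twice with the same guest address runs the slow path only once, which is the property that makes cached JIT output cheaper than re-translating hot code.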
If we look at the loaded modules of an x86 binary on Windows ARM, some peculiar paths stand out:
- C:\Windows\SyChpe32
- C:\Windows\XtaCache
- C:\Windows\System32
DLL files loaded from SyChpe32 are compiled hybrid portable executable (CHPE) DLLs. They are x86 versions of the system DLLs that also contain ARM code, hence hybrid. For most exported APIs, a CHPE DLL provides a short x86 jump thunk followed by a jump to a special section that consists of ARM code. This skips JIT translation for that code entirely, so execution inside system DLLs runs at almost native speed.
0:000> u 75d46230
KERNEL32!EXP+#VirtualAlloc:
75d46230 8bff mov edi,edi
75d46232 55 push ebp
75d46233 8bec mov ebp,esp
75d46235 5d pop ebp
75d46236 90 nop
75d46237 e9c4990000 jmp KERNEL32!#VirtualAllocStub (75d4fc00)
75d4623c cc int 3
75d4623d cc int 3
0:000> u 75d4fc00
KERNEL32!#VirtualAllocStub:
75d4fc00 8807 mov byte ptr [edi],al
75d4fc02 00b008b12111 add byte ptr [eax+1121B108h],dh
75d4fc08 09fd or ebp,edi
75d4fc0a df8820011fd6 fisttp word ptr [eax-29E0FEE0h]
75d4fc10 0000 add byte ptr [eax],al
75d4fc12 0000 add byte ptr [eax],al
75d4fc14 0000 add byte ptr [eax],al
75d4fc16 0000 add byte ptr [eax],al
0x75d46230 is the x86 jump thunk, while VirtualAllocStub holds ARM code for the same API call, which is why its x86 disassembly above makes no sense. When we switch the disassembler to ARM, the same bytes decode as ARM instructions:
75d4fc00 0788 lsls r0,r1,#0x1E
75d4fc02 b000 add sp,sp,#0
75d4fc04 b108 cbz r0,KERNEL32!#VirtualAllocStub+0xa (75d4fc0a)
75d4fc06 1121 asrs r1,r4,#4
75d4fc08 fd0988df stc2 p8,c8,[r9,#-0x37C]
75d4fc0c 0120 lsls r0,r4,#4
75d4fc0e d61f bvs KERNEL32!QuirkIsEnabled3Worker+0x30 (75d4fc50)
75d4fc10 0000 movs r0,r0
This is the basic idea behind CHPE executables as they are hybrid, consisting of both Intel and ARM assembly code.
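One observation worth checking by hand: the stub bytes can also be read as 32-bit little-endian AArch64 (A64) words rather than 16-bit Thumb halfwords. As a minimal sketch, assuming the A64 BR Xn encoding (fixed bits 0xD61F0000 with the register number in bits 9:5), the last four stub bytes at 75d4fc0c (20 01 1f d6 in the x86 dump above) decode as a branch through x9. The helper names le32 and a64_br_register are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assemble a little-endian 32-bit word from four raw bytes. */
static uint32_t le32(const uint8_t *p) {
    assert(p != NULL);
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* If insn is an A64 "BR Xn", return n (0..31); otherwise return -1. */
static int a64_br_register(uint32_t insn) {
    /* BR Xn: fixed bits 0xD61F0000, Rn in bits 9:5, bits 4:0 are zero. */
    if ((insn & 0xFFFFFC1Fu) != 0xD61F0000u)
        return -1;
    return (int)((insn >> 5) & 0x1Fu);
}
```

Read this way, the stub would end in br x9, which fits a thunk that loads a target pointer and jumps to it. Treat this as a reading of the raw bytes, not as debugger output.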
X64 on Windows ARM (CHPE version 2) #
x64 emulation on Windows on ARM first appeared in Windows 10 Insider preview builds in late 2020 and became generally available with Windows 11. It lets ARM-based Windows devices run x64 (64-bit) applications through emulation, expanding the range of software that can run on them.
It also introduced CHPE version 2, which is slightly different from the older CHPE version 1. x64 emulation on Windows ARM comes in two different flavors:
- Arm64EC (Emulation Compatible)
- ARM64X
Arm64EC (Emulation Compatible) #
In Arm64EC, you can mix both x64 and ARM code in a single binary. It provides interoperability between the two architectures.
Developers can take advantage of this interoperability to port an application to ARM piece by piece and significantly increase its speed.
Arm64EC defines an ABI (application binary interface) that makes this interoperability between x64 and ARM code possible, which includes, but is not limited to:
- Register mapping
- Exit and entry thunks
- Calling conventions
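To make "register mapping" concrete, here is a small lookup sketch. It covers only a subset that I recall from Microsoft's Arm64EC ABI documentation (the four x64 integer argument registers and rax); treat the table as an assumption to verify against that documentation, and the names reg_map and arm64ec_xreg_for as hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *x64_reg;   /* x64 register name */
    int         arm64_xreg; /* number n of the ARM64 register xn it maps to */
} reg_map;

/* Subset only, recalled from the Arm64EC ABI docs (assumption, verify):
 * the x64 integer argument registers line up with x0..x3, and rax with x8. */
static const reg_map k_arm64ec_int_map[] = {
    { "rcx", 0 }, { "rdx", 1 }, { "r8", 2 }, { "r9", 3 }, { "rax", 8 },
};

/* Return the ARM64 x-register number for an x64 register, or -1 if the
 * register is not in this (deliberately incomplete) table. */
static int arm64ec_xreg_for(const char *x64_reg) {
    assert(x64_reg != NULL);
    size_t n = sizeof k_arm64ec_int_map / sizeof k_arm64ec_int_map[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(k_arm64ec_int_map[i].x64_reg, x64_reg) == 0)
            return k_arm64ec_int_map[i].arm64_xreg;
    return -1;
}
```

Because the first four integer arguments already sit in matching registers on both sides, a thunk between x64 and Arm64EC code has little shuffling to do for simple calls.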
Let’s try to compile a sample Arm64EC executable and see how it looks after linking.
extern int hello(void);   // implemented in x64 assembly (x64asm.asm)

int function1(void)
{
    int a = 10;
    a = a + 100;
    return a;
}

int main(void)
{
    int b = 0;
    b++;
    b = function1();
    if (b > 100)
    {
        return b;
    }
    hello();
    return b; // To bypass any optimization
}
To observe how x64 code gets translated into ARM, we will write a small routine in x64 assembly and assemble it with ml64.exe into an object file, which we will link later.
.code
hello PROC
xor rax, rax
add rax, 10
add rax, 10
add rax, 10
add rax, 10
add rax, 10
add rax, 10
add rax, 10
add rax, 10
add rax, 10
add rax, 10
ret
hello ENDP
END
ml64 /c x64asm.asm
To build an ARM64EC (emulation compatible) binary, we can use the Visual Studio developer tools for ARM64 (vcvarsarm64.bat) with the following command line:
cl /arm64EC sample.c /c
link /MACHINE:ARM64EC sample.obj x64asm.obj /entry:main
We can verify the result with the dumpbin tool by checking the machine type in the headers:
dumpbin sample.exe /headers
Let's see how this binary looks on the inside.
As we can see, there is just an entry thunk of x64 code; after the JMP instruction, the rest of the code is ARM64. It is important to note that the x64 code lives in the .hexpthk section of the binary, while the ARM code lives in the .text section. .hexpthk consists of thunks used for interoperability between x64 and ARM64 code.
These entry sequences are known as Fast-Forward Sequences (FFS). They help when an x64 application tries to hook the API call, since a hook expects x64 code at the function entry. An FFS ends in a tail call to the real Arm64EC function.
ARM64EC code is compiled ahead of time, during compilation rather than during execution. Only x64 code is JIT-translated at run time.
Hybrid Code Address Range Table
Address Range
----------------------
arm64ec 0000000140001000 - 00000001400027E3 (00001000 - 000027E3)
x64 0000000140003000 - 000000014000400F (00003000 - 0000400F)
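A tool that walks such a binary needs to decide, per address, which disassembler to use. A minimal sketch of that lookup, using only the two RVA ranges printed in the table above (the type and function names are hypothetical, and a real parser would read the ranges from the image instead of hard-coding them):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { CODE_UNKNOWN = 0, CODE_ARM64EC, CODE_X64 } code_kind;

typedef struct {
    uint32_t  start_rva;
    uint32_t  end_rva;   /* inclusive, as printed in the table */
    code_kind kind;
} code_range;

/* The two ranges from the Hybrid Code Address Range Table above. */
static const code_range k_ranges[] = {
    { 0x1000u, 0x27E3u, CODE_ARM64EC },
    { 0x3000u, 0x400Fu, CODE_X64 },
};

/* Return the code type covering rva, or CODE_UNKNOWN if no range matches. */
static code_kind classify_rva(const code_range *ranges, size_t count, uint32_t rva) {
    assert(ranges != NULL);
    for (size_t i = 0; i < count; i++)
        if (rva >= ranges[i].start_rva && rva <= ranges[i].end_rva)
            return ranges[i].kind;
    return CODE_UNKNOWN;
}
```

For example, RVA 0x1040 falls in the arm64ec range, while 0x3000 is the start of the x64 range and 0x2800 belongs to neither.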
So the flow is essentially: x64 thunk -> main() (ARM64).
Using the .effmach meta-command of WinDbg, we can change the effective machine type to CHPE to retrieve the ARM assembly of the main function.
.effmach chpe
u
00007ff7`445d1040 a9be7bfd stp fp,lr,[sp,#-0x20]!
00007ff7`445d1044 910003fd mov fp,sp
00007ff7`445d1048 52800008 mov w8,#0
00007ff7`445d104c b90013e8 str w8,[sp,#0x10]
00007ff7`445d1050 b94013e8 ldr w8,[sp,#0x10]
00007ff7`445d1054 11000508 add w8,w8,#1
00007ff7`445d105c 97ffffef bl sample+0x1018 (00007ff7`445d1018)
This is the ARM64 code generated for the C source. The statements
int b = 0;
b++;
become
mov w8,#0
str w8,[sp,#0x10]
ldr w8,[sp,#0x10]
add w8,w8,#1
- mov w8, #0: moves the immediate value 0 into register w8, initializing it to 0.
- str w8, [sp, #0x10]: stores the value of w8 on the stack at an offset of 0x10 bytes from the stack pointer (sp). Together, these two instructions implement int b = 0;
- ldr w8, [sp, #0x10]: loads a word (32 bits) from the stack at offset 0x10 from sp back into w8, retrieving the value stored there earlier.
- add w8, w8, #1: adds the immediate value 1 to w8. Together with the load, this implements b++;
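The encoding column of the listing can be checked by hand as well. As a sketch, assuming the A64 layout of the 32-bit, unshifted ADD (immediate) instruction (imm12 in bits 21:10, Rn in bits 9:5, Rd in bits 4:0), the word 11000508 from the listing decodes to add w8,w8,#1. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    unsigned rd;     /* destination register number */
    unsigned rn;     /* source register number */
    unsigned imm12;  /* 12-bit unsigned immediate */
} a64_add_imm;

/* Decode a 32-bit, unshifted "ADD Wd, Wn, #imm12".
 * Returns 1 and fills *out on a match, 0 otherwise.
 * (Register number 31 means WSP in this instruction form.) */
static int a64_decode_add_w_imm(uint32_t insn, a64_add_imm *out) {
    assert(out != NULL);
    /* sf=0, op=0, S=0, opcode 100010, sh=0 -> top ten bits 0001000100 */
    if ((insn & 0xFFC00000u) != 0x11000000u)
        return 0;
    out->imm12 = (insn >> 10) & 0xFFFu;
    out->rn    = (insn >> 5) & 0x1Fu;
    out->rd    = insn & 0x1Fu;
    return 1;
}
```

Feeding it 0x11000508 yields rd = 8, rn = 8 and imm12 = 1, while the mov w8,#0 word 0x52800008 is rejected because it is a different instruction class.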
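Even the hand-written x64 routine shown earlier, which simply adds 10 to rax ten times, shows up on the ARM side only in a less recognizable form: the ARM64 code around it exists to bridge the ARM64 and x64 ABIs. In the fragment below, for example, mov x8,x0 copies the ARM64 return register x0 into x8, the register that Arm64EC maps to rax, before the saved registers are restored.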
mov x8,x0
00007ff7`e11d1110 a8c17bfd ldp fp,lr,[sp],#0x10
00007ff7`e11d1114 ad443fee ldp q14,q15,[sp,#0x80]
00007ff7`e11d1118 ad4337ec ldp q12,q13,[sp,#0x60]
00007ff7`e11d111c ad422fea ldp q10,q11,[sp,#0x40]
00007ff7`e11d1120 ad4127e8 ldp q8,q9,[sp,#0x20]
00007ff7`e11d1124 acc51fe6 ldp q6,q7,[sp],#0xA0
00007ff7`e11d1128 d50323ff autibsp
With ARM64EC, the interoperability is such that an x64 binary can interact with or load an ARM64EC binary and vice versa.
ARM64X #
One of the issues with ARM64EC (emulation compatible) is the fact that only emulation-compatible binaries can load emulated x64 binaries. It is not possible to load an EC binary or x64 binary from a native ARM64 binary.
To solve this issue, Microsoft introduced a new binary format known as ARM64X. ARM64X differs from ARM64EC in that the same file can be loaded into both an x64 process and a native ARM64 process. This is made possible by transforming the binary during the loading phase using a special type of relocation data, the DVRT (Dynamic Value Relocation Table).
Let's compile an ARM64X DLL and load it from both an x64 process and a native ARM64 process to observe the changes in action.
DllSample.c
__declspec(dllexport) int hello(int a, int b , int c)
{
return a + b + c;
}
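Loader.c
#include <windows.h>

// Must match the signature of hello() exported by DllSample.c
typedef int (*hello_fn)(int, int, int);

int main(int argc, char **argv)
{
    HMODULE hDLL = LoadLibraryA("load.dll");
    if (hDLL == NULL) return 1;   // DLL missing or not loadable in this process
    hello_fn hello = (hello_fn)GetProcAddress(hDLL, "hello");
    if (hello == NULL) return 1;  // export not found
    hello(10, 20, 30);
    return 0;
}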
Microsoft (R) C/C++ Optimizing Compiler Version 19.38.33135 for ARM64
link load.obj /DLL /MACHINE:ARM64X /NODEFAULTLIB /NOENTRY
Microsoft (R) Incremental Linker Version 14.38.33135.0
Copyright (C) Microsoft Corporation. All rights reserved.
Creating library load.lib and object load.exp
Once the DLL is built, it can be loaded from an x64 process and runs correctly. The reverse also works: this ARM64X binary, which contains both native ARM64 code and ARM64EC code, can be loaded from a native ARM64 process as well.
How exactly does that happen? The Dynamic Value Relocation Table (DVRT) makes it possible behind the scenes.
Dynamic Value Relocation Table ( DVRT) #
If we load an ARM64X DLL from an x64 process, we would normally expect the load to fail, because _IMAGE_FILE_HEADER.Machine in the DLL does not match x64, and an x64 process only loads binaries with a matching _IMAGE_FILE_HEADER.Machine. However, once loading is complete, _IMAGE_FILE_HEADER.Machine in memory is set to IMAGE_FILE_MACHINE_AMD64 (0x8664), while on disk it is still 0xAA64, the ARM64 machine type that ARM64X binaries carry.
Let’s observe this phenomenon firsthand.
#include <stdio.h>
#include <windows.h>

// Map an IMAGE_FILE_HEADER.Machine value to a printable name.
const char* GetMachineTypeName(WORD machine) {
    switch (machine) {
    case IMAGE_FILE_MACHINE_AMD64:
        return "x64";
    case IMAGE_FILE_MACHINE_ARM:
        return "ARM";
    case IMAGE_FILE_MACHINE_ARM64:
        return "ARM64";
    case IMAGE_FILE_MACHINE_IA64:
        return "IA-64";
    case IMAGE_FILE_MACHINE_I386:
        return "x86";
    default:
        return "Unknown";
    }
}

// Read FileHeader.Machine from the in-memory headers of a loaded module.
// Returns 0 (IMAGE_FILE_MACHINE_UNKNOWN) if the headers do not look valid.
WORD GetMachineType(HMODULE hModule) {
    // Get the DOS header
    IMAGE_DOS_HEADER* dosHeader = (IMAGE_DOS_HEADER*)hModule;
    // Check if it's a valid PE file
    if (dosHeader->e_magic != IMAGE_DOS_SIGNATURE) {
        printf("Not a valid PE file.\n");
        return 0;
    }
    // Get the NT header
    IMAGE_NT_HEADERS* ntHeader = (IMAGE_NT_HEADERS*)((BYTE*)hModule + dosHeader->e_lfanew);
    // Check if it's a 64-bit PE file
    if (ntHeader->OptionalHeader.Magic != IMAGE_NT_OPTIONAL_HDR64_MAGIC) {
        printf("Not a 64-bit PE file.\n");
        return 0;
    }
    // Get the machine type as patched by the loader
    return ntHeader->FileHeader.Machine;
}

int main(int argc, char **argv)
{
    HMODULE hDLL = LoadLibraryA("load.dll");
    if (hDLL == NULL) {
        printf("LoadLibrary failed: %lu\n", GetLastError());
        return 1;
    }
    printf("Machine type = %s\n", GetMachineTypeName(GetMachineType(hDLL)));
    FreeLibrary(hDLL);  // release the module only after we are done reading it
    return 0;
}
Compile the above code as both an x64 binary and a native ARM64 binary.
For the x64 binary we get the following result:
Machine type = x64
and for the native ARM64 binary:
Machine type = ARM64
The reason for this dynamic change is the presence of the Dynamic Value Relocation Table (DVRT) in the binary. The Dynamic Value Relocation Table is similar to a relocation table but in a metadata format. It can be obtained via the load config table of the PE file.
The load config directory records where the DVRT lives in the image. Using the dumpbin tool, we can dump the contents of the DVRT if it is present in a binary.
DVRT can also be used to mitigate Spectre CPU vulnerabilities by letting retpolines be inserted into the binary.
The DVRT has the following format in memory:
typedef struct _IMAGE_DYNAMIC_RELOCATION_TABLE {
    DWORD Version;
    DWORD Size;
    // IMAGE_DYNAMIC_RELOCATION DynamicRelocations[0];
} IMAGE_DYNAMIC_RELOCATION_TABLE, *PIMAGE_DYNAMIC_RELOCATION_TABLE;
With ARM64X, Microsoft introduced a new symbol type, type 6, named IMAGE_DYNAMIC_RELOCATION_ARM64X. The remaining types are defined for the Return Flow Guard and retpoline mitigations.
#define IMAGE_DYNAMIC_RELOCATION_GUARD_RF_PROLOGUE 1
#define IMAGE_DYNAMIC_RELOCATION_GUARD_RF_EPILOGUE 2
#define IMAGE_DYNAMIC_RELOCATION_GUARD_IMPORT_CONTROL_TRANSFER 3
#define IMAGE_DYNAMIC_RELOCATION_GUARD_INDIR_CONTROL_TRANSFER 4
#define IMAGE_DYNAMIC_RELOCATION_GUARD_SWITCHTABLE_BRANCH 5
#define IMAGE_DYNAMIC_RELOCATION_ARM64X 6
Simply put, the loader replaces each fixup location with the data provided in the DVRT so that the image becomes compatible with the x64 process loading it. This includes, but is not limited to, changing the machine type, imports, function exports, and exception pointers.
Index | Page | RVA | Bytes | Target Value |
---|---|---|---|---|
00000000 | 00000000 | 00000104 | 2 | 8664 |
00000001 | 00000000 | 00000128 | 4 | 5010 |
00000002 | 00000000 | 00000188 | 4 | 7DE0 |
Consider, for example, the very first entry. In page 0x00000000, at offset 0x104, two bytes need to be replaced with the value 0x8664, which is the Machine value for IMAGE_FILE_MACHINE_AMD64. So, if the DLL is loaded by an x64 process, this transformation takes place, as observed in the example above.
The second entry, at offset 0x128, is AddressOfEntryPoint. It is replaced with 0x5010 when the DLL is loaded by an x64 process, i.e., the address of DllMain or _DllMainCRTStartup in the case of a DLL. ARM64 has its own version of _DllMainCRTStartup, and for ARM64EC the code is different: essentially a Fast-Forward Sequence (FFS) thunk that points to ARM64's _DllMainCRTStartup.
[00000006] page 00000000 rva 000001D8, 4 bytes, target value 7200
This entry patches the load config directory entry in the PE header's data directory. The load config data also differs for x64, as it holds a lot of architecture-specific information, such as the security cookie, exception table, debug info, etc.
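The effect of these entries can be sketched as a simple patch loop. This is a simplification of mine, not the loader's code: the on-disk DVRT is encoded more compactly than a flat list, and the fixup struct and function names below are hypothetical. The four (RVA, size, value) triples are copied from the dump in this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t rva;    /* offset into the mapped image to patch */
    uint8_t  size;   /* number of bytes to overwrite */
    uint64_t value;  /* value to write, little-endian */
} fixup;

/* Entries copied from the dump in this section. */
static const fixup k_fixups[] = {
    { 0x104u, 2, 0x8664u },  /* FileHeader.Machine -> IMAGE_FILE_MACHINE_AMD64 */
    { 0x128u, 4, 0x5010u },  /* OptionalHeader.AddressOfEntryPoint */
    { 0x188u, 4, 0x7DE0u },  /* third entry of the table (not discussed in the text) */
    { 0x1D8u, 4, 0x7200u },  /* load config directory entry */
};

/* Patch image in place. Returns the number of fixups applied, or -1 as soon
 * as one would fall outside the image (earlier fixups stay applied). */
static int apply_fixups(uint8_t *image, size_t image_size,
                        const fixup *f, size_t count) {
    assert(image != NULL && f != NULL);
    for (size_t i = 0; i < count; i++) {
        if (f[i].size > 8 || f[i].rva > image_size ||
            f[i].size > image_size - f[i].rva)
            return -1;
        for (uint8_t b = 0; b < f[i].size; b++)
            image[f[i].rva + b] = (uint8_t)(f[i].value >> (8 * b));
    }
    return (int)count;
}

/* Read a little-endian 16-bit value, e.g. the Machine field. */
static uint16_t read_le16(const uint8_t *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}
```

Starting from a buffer whose bytes at 0x104 hold 0xAA64, applying the table leaves 0x8664 there, which mirrors the on-disk versus in-memory difference observed with GetMachineType earlier.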
This pretty much explains the technology behind the Windows ARM emulation of x64/x86 binaries. In the context of malware, we will mostly be dealing with x86 binaries, as most malware is compiled for x86 and only some for x64.
In the next part of the blog, we will test some malware on Windows 11 ARM and run manual tests based on tricks and techniques commonly used by modern Windows malware. So stay tuned!