Coroutines in SPARC assembly

Alex Muscar

April 29, 2023

A while back I picked up a book on the SPARC architecture from a second hand bookshop. To justify the fiver I spent, I figured I might as well read it.1 While reading the chapter on subroutines, I realised that SPARC offers an elegant mechanism for coroutine linking.

I won’t assume you’re familiar with SPARC. I’ll explain the relevant details as I go along. It will help, though, if you’re at least somewhat familiar with the assembly language of some other architecture.

You can find the code from this post here.

A brief aside on coroutines

Coroutines are not a new concept.2 SIMULA is one of the earliest languages to support.3 In the past decade, more and more languages have added support for coroutines of one sort or another.

The term is somewhat overloaded, so let’s clarify it. A coroutine is a generalization of a subroutine. A subroutine has one entry point and one exit point. It’s called once, does its job, and returns.

A coroutine has many entry and exit points. That is, its execution can be suspended and resumed. Execution resumes where it left off when the coroutine was most recently suspended.

If a coroutine can choose who it yields to, it’s called symmetric. If it can only yield to the most recent caller, then it’s an asymmetric coroutine.

Implementing coroutines on SPARC

The code targets Linux for 64-bit SPARC. There’s nothing 64-bit specific about the implementation except the system call numbers. I used the as and ld binaries that come with the SPARC cross-build tools and QEMU to run the resulting binary.

Miscellany

Let’s get the nitty gritty out of the way. We define some constants and a macro for system calls.

.equ SYS_exit, 1
.equ SYS_write, 4

.equ STDOUT, 1

.macro syscall no
    mov \no, %g1
    t 0x6d
.endm

System calls are implemented using software traps on SPARC. The t instruction triggers a software trap. This works similarly to interrupts on x86, i.e., the operating system sets up a trap table that is then used to call the appropriate trap handler.

Arguments for system calls go in the global registers, starting with %g1.

An aside on SPARC registers

SPARC registers are grouped in four categories:

In addition to the above, there are also a few special registers: the stack pointer, %sp; the frame pointer, %fp; and the link register, %lr.

On SPARC each function gets its own set of in, out and local registers. This is implemented using a mechanism known as register windows.

When a function gets called, the out registers of the caller become the in parameters of the callee. The local and out registers of the callee are fresh.

More miscellany

We’re going to use %g2, %g3 and %g6 to keep the “state” of our coroutines. We have to tell as we intend to use them as scratch registers.

.register %g2, #scratch
.register %g3, #scratch
.register %g6, #scratch

We’re not linking in libc so call the entry point _start.

    .section ".text"
    .align 4

    .global _start

_start:

Function prologues

The first thing we do is to set up the stack frame for the main coroutine. We’re not going to use any stack space, but on SPARC a routine has to allocate at least enough space for some registers to be spilled. That’s a consequence of SPARC’s register windows design.

Since there’s only so many transistors you can fit on a sliver of silicone, SPARC doesn’t have infinitely many physical registers to give dedicated local and out registers to every routine invocation. When the processor runs out of physical registers, it spills the registers used by one of the routines higher on the call stack.4

    save %sp, -64, %sp

save adds -64 bytes to %sp–like on x86, the stack grows downwards on SPARC. It also performs the register window shift that maps the out registers of the caller to the in registers of the callee, and allocates new local and out registers. This might lead to spilling, as mentioned above. That sure is a lot of work for a RISC instruction.

Finally coroutines

Our program has three coroutines, imaginatively named _start, a and b. _start calls a, who prints a1, yields to b who prints b1 then yields to a, who prints a2 then yields back to b; b then prints b2, returns to a, who returns to _start. Finally, _start just exits. Or in spiffy ASCII art (printed message in brackets):

_start -> a ("a1") -> b ("b1") -> a ("a2") -> b ("b2") -> a -> _start

Remember we reserved %g2 and %g3 for the coroutine states? Now we’re going to use them. We store the address of the first instruction of a in %g2 and the address of the first instruction of b to %g3. That’s why I put quotes around the word “state”–the coroutine state is just its instruction pointer.

    set a, %g2
    set b, %g3

Yielding

With the state set up, we can start the coroutine dance by calling a:

    jmpl %g2, %g6
    add %g6, 8, %g6

OK, let’s unpack this. jmpl jumps to the address in its first register argument, and stores the address of the program counter in the second register. In this case we’re jumping to the address in %g2–that is a’s first instruction. We store the return address in %g6, the main coroutine’s state.

What’s up with the add after jmpl? That instruction is called the branch instruction’s delay slot. I’m not going to go into details about delay slots, but the gist is that, by the time the CPU knows where to jump, it would have already started work on the next instruction–the one in the delay slot. No sense in wasting that work, so it will execute it anyway. The hope was that a “sufficiently smart” compiler wold manage to put a useful instruction in the delay slot, thus mitigating the cost of branches.

The instruction in the delay slot is executed before any of the instructions in the target of the jump. We use it to adjust the location where _start resumes its execution. Since jmpl stores the value of the program counter in its second register argument, we’d end up with %g6 pointing at the jmpl instruction itself. That would cause an infinite loop, so we adjust %g6 to resume execution after the jmpl-add pair of instructions.5

Raymond Chen covers MIPS delay slots in some detail.

The last thing _start does is to return to the operating system.

    syscall SYS_exit

The other coroutines

While _start.. er.. starts a by yielding to it, a and b are more interesting since they yield to each other. So let’s look at a.

We start with the prologue code, just like in _start:

a:
    save %sp, -64, %sp

and then print the first message:

    mov STDOUT, %o0
    set msg_a1, %o1
    mov (end_msg_a1-msg_a1), %o2
    syscall SYS_write

The printing code is not really interesting. We just set the out registers and call the write system call.

And it’s time to yield to b for the first time:

    jmpl %g3, %g2
    add %g2, 8, %g2

This should look familiar: we jump to %g3b’s instruction pointer–and store the return address in %g2a’s instruction pointer.

When we resume a, we’ll start executing the code to print the second message:

    mov STDOUT, %o0
    set msg_a2, %o1
    mov (end_msg_a2-msg_a2), %o2
    syscall SYS_write

Then we resume b again:

    jmpl %g3, %g2
    add %g2, 8, %g2

and finally return to _start:

    mov 10, %i0
    jmpl %g6, %g0
    restore

Function epilogues

The return value of the coroutine is 10. But why is it in an input register? Shouldn’t the return value be an output? Remember that save maps the caller’s outputs to the callee’s inputs? Since the return value should be in the %o0 register of the caller that means we have to put it in %i0 in the callee.

Here we use %g0 to ignore the instruction pointer saved by jmpl since we’re done with the coroutine.

Finally, restore undoes what save did: it restores the stack pointer and unmaps the registers. Since it’s in jmpl’s delay slot, it’ll get executed before we actually return from a.

The rest of the code

b is pretty much the same as a. The main difference is that the last thing it does is to return 20 to a. Note that %g2 and %g3 have swapped places in the jmpl instructions.

b:
    save %sp, -64, %sp

    mov STDOUT, %o0
    set msg_b1, %o1
    mov (end_msg_b1-msg_b1), %o2
    syscall SYS_write

    jmpl %g2, %g3
    add %g3, 8, %g3

    mov STDOUT, %o0
    set msg_b2, %o1
    mov (end_msg_b2-msg_b2), %o2
    syscall SYS_write

    mov 20, %i0
    jmpl %g2, %g0
    restore

The messages we print live in the .data section:

    .section ".data"
    .align 4

msg_a1: .ascii "a1\n"
end_msg_a1:

msg_a2: .ascii "a2\n"
end_msg_a2:

msg_b1: .ascii "b1\n"
end_msg_b1:

msg_b2: .ascii "b2\n"
end_msg_b2:

And with this we’re done.

Quick quiz

What is the exit code of the program? Why? Can you make it return the value returned from either of the coroutines?

Conclusion

I find the implementation of coroutines particularly neat on SPARC. It only takes two instructions, one of which you is almost free–the add in the delay slot. This is possible because SPARC’s flexibility when it comes to storing return addresses on jumps. On x86 we’d have to use a relatively ugly hack to get the resume address.

This simplicity is only possible because of some simplifying assumptions I made. I use global registers to store the return addresses for coroutines. Those are a scarce resource even on SPARC, and this approach won’t scale to more than 2 or 3 coroutines. We could work around this by storing the return address in memory.

We were careful to nest the coroutines nicely, i.e., _start had a longer lifetime than a who had a longer lifetime than b. That avoided awkward issues with the stack pointer. We could even have used stack variables if we were careful to save and restore the appropriate frame pointer. If we want to lift the restriction on coroutine lifetime nesting, we’d have to allocate separate memory for their stacks, and be careful about saving and restoring the frame pointer. That’s not hard, but it’s definitely more complicated than the simple approach in this post.


  1. What better way to polish your résumé than reading a 20+ year old book on a 30+ year old ISA that’s going to be sunset in 5 or so years :^)↩︎

  2. According to Knuth, the term has been around since 1958.↩︎

  3. Fun fact: early C++ had support for coroutines because Bjarne’s background in simulation software written in SIMULA.↩︎

  4. That’s a very hand wavy explanation of what actually goes on. The actual process is quite involved. The processor tracks the current register window index and triggers a trap on overflow. The trap code uses the frame pointer to decide what stack frame to use to spill the registers.↩︎

  5. The code on GitHub has macro definitions for creating, yeilding and returning from coroutines.↩︎