April 29, 2023
A while back I picked up a book on the SPARC architecture from a second hand bookshop. To justify the fiver I spent, I figured I might as well read it.1 While reading the chapter on subroutines, I realised that SPARC offers an elegant mechanism for coroutine linking.
I won’t assume you’re familiar with SPARC. I’ll explain the relevant details as I go along. It will help, though, if you’re at least somewhat familiar with the assembly language of some other architecture.
You can find the code from this post here.
Coroutines are not a new concept.2 SIMULA is one of the earliest languages to support.3 In the past decade, more and more languages have added support for coroutines of one sort or another.
The term is somewhat overloaded, so let’s clarify it. A coroutine is a generalization of a subroutine. A subroutine has one entry point and one exit point. It’s called once, does its job, and returns.
A coroutine has many entry and exit points. That is, its execution can be suspended and resumed. Execution resumes where it left off when the coroutine was suspended.
If a coroutine can choose who it yields to, it’s called symmetric. If it can only yield to the most recent caller, then it’s an asymmetric coroutine.
The code targets Linux for 64-bit SPARC. There’s nothing 64-bit
specific about the implementation except the system call numbers. I used
the as
and ld
binaries that come with the
SPARC cross-build tools and QEMU To run the resulting binary.
We begin by defining some useful constants and a macro for system calls.
.equ SYS_exit, 1
.equ SYS_write, 4
.equ STDOUT, 1
.macro syscall no
mov \no, %g1
t 0x6d
.endm
System calls are implemented using software traps on SPARC.
The t
instruction triggers a software trap. This works
similarly to interrupts on x86, i.e., the operating system sets up a
trap table that is then used to call the appropriate trap
handler.
Arguments for system calls go in the global registers,
starting with %g1
.
SPARC registers are grouped in four categories:
Global registers: %g0
-%g7
;
%g0
is always zero when read, and writes to it are
ignored;
In registers: %i0
-%i7
; used
for receiving parameters and returning results to callers;
Out registers: %o0
-%o7
; used
for passing arguments and receiving results from callees; and
Local registers: %l0
-%l7
; used
for temporary results and other bookkeeping for the current
routine.
In addition to the above, SPARC also has a few special registers: the
stack pointer, %sp
; the frame pointer, %fp
;
and the link register, %lr
.
On SPARC each function gets its own set of in, out and local registers. This is implemented using a mechanism known as register windows.
When a function gets called, the out registers of the caller become the in parameters of the callee. The local and out registers of the callee are fresh.
We’re going to use %g2
, %g3
and
%g6
to keep the “state” of our coroutines. We have to tell
as
we intend to use them as scratch registers.
.register %g2, #scratch
.register %g3, #scratch
.register %g6, #scratch
We’re not going to link to libc
so our entry point is
called _start
.
.section ".text"
.align 4
.global _start
_start:
The first thing we do is to set up the stack frame for the main coroutine. We’re not going to use any stack space, but on SPARC a routine has to allocate at least enough space for some registers to be spilled. That’s a consequence of SPARC’s register windows design.
Since there’s only so many transistors you can fit on a sliver of silicone, SPARC doesn’t have infinitely many physical registers to give dedicated local and out registers to every routine invocation. When the processor runs out of physical registers, it spills the registers used by one of the routines higher on the call stack.4
save %sp, -64, %sp
save
adds -64 bytes to %sp
–like on
x86, the stack grows downwards on SPARC. It also performs the register
window shift that maps the out registers of the caller to the in
registers of the callee, and allocates new local and out registers. This
might lead to spilling, as mentioned above. That sure is a lot of work
for a RISC instruction.
Our program has three coroutines, imaginatively named
_start
, a
and b
.
_start
calls a
, who prints a1
,
yields to b
who prints b1
then yields to
a
, who prints a2
then yields back to
b
; b
then prints b2
, returns to
a
, who returns to _start
. Finally,
_start
just exits. Or, in spiffy ASCII art (printed message
in brackets):
_start -> a ("a1") -> b ("b1") -> a ("a2") -> b ("b2") -> a -> _start
Remember we reserved %g2
and %g3
for the
coroutine states? Now we’re going to use them. We store the address of
the first instruction of a
in %g2
and the
address of the first instruction of b
to %g3
.
That’s why I put quotes around the word “state”–the coroutine state is
just its instruction pointer :^).
set a, %g2
set b, %g3
With the state set up, we can start the coroutine dance by calling
a
:
jmpl %g2, %g6
add %g6, 8, %g6
OK, let’s unpack this. jmpl
jumps to the address in its
first register argument, and stores the address of the program
counter in the second register. In this case we’re jumping to the
address in %g2
–that is a
’s first instruction.
We store the return address in %g6
, the main coroutine’s
state.
What’s up with the add
after jmpl
? That
instruction is called the branch instruction’s delay slot. I’m
not going to go into details about delay slots, but the gist is that, by
the time the CPU knows where to jump, it would have already started work
on the next instruction–the one in the delay slot. No sense in wasting
that work, so it will execute it anyway. The hope was that a
“sufficiently smart” compiler wold manage to put a useful instruction in
the delay slot, thus mitigating the cost of branches.
The instruction in the delay slot is executed before any of
the instructions in the target of the jump. We use it to adjust the
location where _start
resumes its execution. Since
jmpl
stores the value of the program counter in its second
register argument, we’d end up with %g6
pointing at the
jmpl
instruction itself. That would cause an infinite loop,
so we adjust %g6
to resume execution after the
jmpl
-add
pair of instructions.5
Raymond Chen covers MIPS delay slots in some detail.
The last thing _start
does is to return to the operating
system.
syscall SYS_exit
While _start
.. er.. starts a
by yielding to
it, a
and b
are more interesting since they
yield to each other. So let’s look at a
.
We start with the prologue code, just like in
_start
:
a:
save %sp, -64, %sp
and then print the first message:
mov STDOUT, %o0
set msg_a1, %o1
mov (end_msg_a1-msg_a1), %o2
syscall SYS_write
The printing code is not really interesting. We just set the out
registers and call the write
system call.
And it’s time to yield to b
for the first time:
jmpl %g3, %g2
add %g2, 8, %g2
This should look familiar: we jump to
%g3
–b
’s instruction pointer–and store the
return address in %g2
–a
’s instruction
pointer.
When we resume a
, we’ll start executing the code to
print the second message:
mov STDOUT, %o0
set msg_a2, %o1
mov (end_msg_a2-msg_a2), %o2
syscall SYS_write
Then we resume b
again:
jmpl %g3, %g2
add %g2, 8, %g2
and finally return to _start
:
mov 10, %i0
jmpl %g6, %g0
restore
The return value of the coroutine is 10
. But why is it
in an input register? Shouldn’t the return value be an output? Remember
that save
maps the caller’s outputs to the callee’s inputs?
Since the return value should be in the %o0
register of the
caller that means we have to put it in %i0
in the
callee.
Here we use %g0
to ignore the instruction pointer saved
by jmpl
since we’re done with the coroutine.
Finally, restore
undoes what save
did: it
restores the stack pointer and unmaps the registers. Since it’s in
jmpl
’s delay slot, it’ll get executed before we actually
return from a
.
b
is pretty much the same as a
. The main
difference is that the last thing it does is to return 20
to a
. Note that %g2
and %g3
have
swapped places in the jmpl
instructions.
b:
save %sp, -64, %sp
mov STDOUT, %o0
set msg_b1, %o1
mov (end_msg_b1-msg_b1), %o2
syscall SYS_write
jmpl %g2, %g3
add %g3, 8, %g3
mov STDOUT, %o0
set msg_b2, %o1
mov (end_msg_b2-msg_b2), %o2
syscall SYS_write
mov 20, %i0
jmpl %g2, %g0
restore
The messages we print live in the .data
section:
.section ".data"
.align 4
msg_a1: .ascii "a1\n"
end_msg_a1:
msg_a2: .ascii "a2\n"
end_msg_a2:
msg_b1: .ascii "b1\n"
end_msg_b1:
msg_b2: .ascii "b2\n"
end_msg_b2:
And with this we’re done.
What is the exit code of the program? Why? Can you make it return the value returned from either of the coroutines?
I find the implementation of coroutines particularly neat on SPARC.
It only takes two instructions, one of which you is almost free–the
add
in the delay slot. This is possible because SPARC’s
flexibility when it comes to storing return addresses on jumps. On x86
we’d have to use a relatively ugly hack to get the resume address.
This simplicity is only possible because of some simplifying assumptions I made. I use global registers to store the return addresses for coroutines. Those are a scarce resource even on SPARC, and this approach won’t scale to more than 2 or 3 coroutines. We could work around this by storing the return address in memory.
We were careful to nest the coroutines nicely, i.e.,
_start
had a longer lifetime than a
who had a
longer lifetime than b
. That avoided awkward issues with
the stack pointer. We could even have used stack variables if we were
careful to save and restore the appropriate frame pointer. If we want to
lift the restriction on coroutine lifetime nesting, we’d have to
allocate separate memory for their stacks, and be careful about saving
and restoring the frame pointer. That’s not hard, but it’s definitely
more complicated than the simple approach in this post.
That’s enough for one post. Maybe I’ll explore coroutines with stacks in a future installment.
What better way to polish your résumé than reading a 20+ year old book on a 30+ year old ISA that’s going to be sunset in 5 or so years :^)?↩︎
According to Knuth, the term has been around since 1958.↩︎
Fun fact: early C++ had support for coroutines because Bjarne’s background in simulation software written in SIMULA.↩︎
That’s a very hand wavy explanation of what actually goes on. The actual process is quite involved. The processor tracks the current register window index and triggers a trap on overflow. The trap code uses the frame pointer to decide what stack frame to use to spill the registers.↩︎
The code on GitHub has macro definitions for creating, yeilding and returning from coroutines.↩︎