A simple problem that isn’t

Alex Muscar

January 4, 2023

Some problems seem simple at first blush. Recently, I wanted to add an iterator to a bit set implemented in Go. The bit set implementation itself is what you’d expect. I lifted it from section 6.5 of “The Go Programming Language”.

You can find the code for this post here.

Have a look at the implementation of BitSet before you read on. The code in this post is written for bit sets represented using 64-bit words. That’s why you’ll see the magic constant 64 all over the shop.
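
For reference, here is a minimal sketch of the kind of bit set we’re talking about, along the lines of GOPL’s IntSet. The repo’s version may differ in detail, but Has, Add, and Cap are all we need for what follows:

type BitSet struct {
    words []uint64
}

// Has reports whether x is in the set.
func (s *BitSet) Has(x int) bool {
    word, bit := x/64, uint(x%64)
    return word < len(s.words) && s.words[word]&(1<<bit) != 0
}

// Add adds x to the set, growing the word slice as needed.
func (s *BitSet) Add(x int) {
    word, bit := x/64, uint(x%64)
    for word >= len(s.words) {
        s.words = append(s.words, 0)
    }
    s.words[word] |= 1 << bit
}

// Cap returns the number of bits the set can currently hold.
func (s *BitSet) Cap() int {
    return len(s.words) * 64
}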

A problem specification

Given the following specification:

type Iterator interface {
    // Next tries to advance the iterator to the next value in the set.
    // It returns `true` if there was a next value, `false` otherwise.
    //
    // Next must be called *before* the first call to Current.
    Next() bool

    // Current returns the element of the set that the iterator is pointing at.
    Current() int
}

how would you implement an iterator for the bit set?
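
Before we get to implementations, here’s the usage pattern the interface implies: a sketch, assuming a *BitSet whose Iter method returns a value satisfying Iterator.

func sum(s *BitSet) int {
    total := 0
    // Next must be called before the first call to Current.
    for it := s.Iter(); it.Next(); {
        total += it.Current()
    }
    return total
}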

The straightforward implementation

In languages with built-in support for generators, the solution is straightforward. Here’s a C# implementation:

public IEnumerable<int> NaiveIter()
{
    for (var i = 0; i < words.Length * 64; i++)
    {
        if (Has(i))
        {
            yield return i;
        }
    }
}

Since Go doesn’t have generators, we have to roll our own. That’s not too hard, though. We just need to remember where we left off on the previous iteration of the loop, and pick up from there. At this point, someone is bound to suggest using channels. But channels aren’t free, as we’ll see in the benchmarks section.

type NaiveIter struct {
    s   *BitSet
    off int
}

func (s *BitSet) NaiveIter() *NaiveIter {
    return &NaiveIter{s, -1}
}

func (it *NaiveIter) Next() bool {
    it.off++
    for it.off < it.s.Cap() && !it.s.Has(it.off) {
        it.off++
    }
    return it.off < it.s.Cap()
}

func (it *NaiveIter) Current() int {
    return it.off
}

There are two things to note about this iterator. First, we initialize off to -1, so the iterator starts just before the first element.[1] Second, we increment it.off before the loop, to make sure we make progress on every call to Next.[2][3]

Both the C# and the Go iterator have the same problem, though: they don’t handle sparse bit sets very efficiently.

An implementation that deals with sparse bit sets

Let’s try to clarify the problem we are solving: we have a bit set represented as an array of words; the set is sparse, so we have long runs of zeroes; we don’t want to look at every single zero bit.

The obvious approach is to skip consecutive zeroes. Given the set’s representation, it seems sensible to try to skip entire words if they are zero. But, as with many simple ideas, the devil is in the details. You’ve seen the straightforward implementation. How would you implement an optimised iterator?

A first try

In addition to the offset of the current bit in a word, the optimised iterator tracks the offset of the current word. Tracking it will help us skip zero words. The data structure now looks like this:

type Iter struct {
    s          *BitSet
    woff, boff int
}

func (s *BitSet) Iter() *Iter {
    return &Iter{s, 0, -1}
}

The iterator is initially positioned just before the first bit of the first word.

Getting the current element is easy: it’s just the inverse of the formula in Add, which stores element x in word x/64 at bit x%64. So word woff and bit boff correspond to element woff*64 + boff:

func (it *Iter) Current() int {
    return it.woff*64 + it.boff
}

Now for the interesting part, the implementation of Next:

func (it *Iter) Next() bool {

We start by tackling the termination condition. If there are no more words to check in the set representation, we are done:

    if it.woff >= len(it.s.words) {
        return false
    }

Next, we want to make progress. We can pretty much take the loop from the previous implementation, and adapt it slightly:

    // ensure we make progress
    it.boff++
    // find the least significant one bit in the current word
    for it.boff < 64 && it.s.words[it.woff]&(1<<it.boff) == 0 {
        it.boff++
    }

This loop looks for the least significant one bit in the current word. When it terminates, one of two conditions holds: either it.boff is 64, meaning all the remaining bits were zero and we need to move to the next non-zero word, or we’ve found a one bit, which is the next element.

    // we found the next element
    if it.boff < 64 {
        return true
    }
    // we need to find the next non-zero word
    it.boff = 0
    it.woff++
    for it.woff < len(it.s.words) && it.s.words[it.woff] == 0 {
        it.woff++
    }

When this second loop terminates, it’s either because we ran out of words, in which case we’ve found all the elements and return false, or because we found a non-zero word and we’re not done yet. In the latter case we’re positioned at the beginning of the word, but we still have to find its least significant one bit.

    // we found all the elements
    if it.woff >= len(it.s.words) {
        return false
    }
    // we found a non-zero word; find the least significant bit
    for it.s.words[it.woff]&(1<<it.boff) == 0 {
        it.boff++
    }
    return true
}

And with this, we’re done. But the code could be nicer.

Improving the code

Next’s epilogue is almost identical to its prologue. Let’s start there. We could replace the final if and for statements with a jump to the top of the function, since they do the same thing.[4] And in some cases gotos are not that harmful. But here the goto would just be an unguarded loop in disguise.

func (it *Iter) Next() bool {
    for {
        // we're done with the whole set
        if it.woff >= len(it.s.words) {
            return false
        }
        // ensure we make progress
        it.boff++
        // find the least significant one bit in the current word
        for it.boff < 64 && it.s.words[it.woff]&(1<<it.boff) == 0 {
            it.boff++
        }
        // we found a set bit
        if it.boff < 64 {
            return true
        }
        // we need to find the next non-zero word
        it.boff = -1
        it.woff++
        for it.woff < len(it.s.words) && it.s.words[it.woff] == 0 {
            it.woff++
        }
    }
}

Did you notice it? We’re resetting it.boff to -1, not 0 as in the previous version. That’s because, once we loop back to the top, we advance it.boff before the bit scanning loop.

We can further clean this up by merging the first if with the outer for:

func (it *Iter) Next() bool {
    for it.woff < len(it.s.words) {
        // ensure we make progress
        it.boff++
        // find the least significant one bit in the current word
        for it.boff < 64 && it.s.words[it.woff]&(1<<it.boff) == 0 {
            it.boff++
        }
        // we found the next element
        if it.boff < 64 {
            return true
        }
        // we need to find the next non-zero word
        it.boff = -1
        it.woff++
        for it.woff < len(it.s.words) && it.s.words[it.woff] == 0 {
            it.woff++
        }
    }
    return false
}

But do we need to skip all the zero words at once? No, we can let the main loop do it word by word; it’s already doing that with it.woff++. If we refactor the bit scanning loop just slightly, we can test for zero words and get rid of the final for.

The it.woff < len(it.s.words) check is covered by the main loop. We just need to roll the it.s.words[it.woff] == 0 test into the bit scanning loop’s condition. And it.boff < 64 and it.s.words[it.woff] != 0 check almost the same thing: whether the remaining part of the current word still has anything worth looking at.

// NOTE: wrong!
func (it *NotGreatIter1) Next() bool {
    for it.woff < len(it.s.words) {
        // ensure we make progress
        it.boff++
        // find the least significant bit in the current word
        for it.s.words[it.woff] != 0 && it.s.words[it.woff]&(1<<it.boff) == 0 {
            it.boff++
        }
        // we found a set bit
        if it.boff < 64 {
            return true
        }
        // we need to find the next non-zero word
        it.boff = -1
        it.woff++
    }
    return false
}

The comment gives it away: this version doesn’t work. It gets stuck in an infinite loop because it.s.words[it.woff] never changes. We can address this by keeping a copy of the current word that we shift right as we scan it. The copy eventually becomes zero, and as a nice bonus the next bit to check is always handy as the least significant bit of the word.

type Iter struct {
    s          *BitSet
    word       uint64
    woff, boff int
}

func (s *BitSet) Iter() *Iter {
    return &Iter{s, s.words[0], 0, -1}
}

func (it *Iter) Next() bool {
    for it.woff < len(it.s.words) {
        // ensure we make progress
        it.boff++
        // find the least significant bit in the current word
        for it.word != 0 && it.word&0x01 == 0 {
            it.boff++
            it.word >>= 1
        }
        // we found a set bit
        if it.word > 0 {
            it.word >>= 1
            return true
        }
        // we need to find the next non-zero word
        it.boff = -1
        it.woff++
        it.word = it.s.words[it.woff]
    }
    return false
}

Not only is the code cleaner, but it’s also more efficient. The inner loop test got cheaper: we compare the current word with zero and check only its least significant bit, instead of building a mask on the fly by shifting.[5]

Notice the it.word >>= 1 just before we return true? Because we exit the bit scanning loop before shifting the current word right, the next call to Next would otherwise look at the same bit again. This is another one of those “ensure we make progress” measures.

There’s one more problem to solve. On the very last go around the loop, we’ll do an out-of-bounds read from the it.s.words array. We could add an explicit check, but that’s a price we pay on every iteration of the loop for a case that only happens once. Not great.

Instead, we pad the words array. We allocate one extra word that gets read during the last iteration of the loop. And we update the main loop condition to take into account the padding.
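
How the padding gets set up depends on how the set is constructed. Here’s one way it could look: a sketch assuming a fixed-capacity constructor (the name NewBitSet is illustrative; in the repo the padding may well be maintained elsewhere, and if Add grows the slice it needs to preserve the same invariant):

func NewBitSet(capacity int) *BitSet {
    // one word per 64 bits, rounded up, plus one extra zero word
    // that the iterator is allowed to read on its final pass
    nwords := (capacity+63)/64 + 1
    return &BitSet{words: make([]uint64, nwords)}
}

With the padding in place, the loop condition becomes it.woff < len(it.s.words)-1: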

func (it *Iter) Next() bool {
    for it.woff < len(it.s.words)-1 {
        // ensure we make progress
        it.boff++
        // find the least significant bit in the current word
        for it.word != 0 && it.word&0x01 == 0 {
            it.boff++
            it.word >>= 1
        }
        // we found a set bit
        if it.word > 0 {
            it.word >>= 1
            return true
        }
        // we need to find the next non-zero word
        it.boff = -1
        it.woff++
        it.word = it.s.words[it.woff]
    }
    return false
}

We’re done. We have an iterator that should handle sparse sets efficiently. But how efficient is it compared to the naive iterator we started with?

Benchmarks

I ran several benchmarks on my ageing 2014 MacBook Pro. All the benchmarks use a set with room for 1,000,000 elements, filled to various load factors. The benchmarks just iterate over the elements and sum them up.
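
The benchmarks all have roughly this shape. This is a sketch, not the repo’s exact code; it lives in a _test.go file, and the dense and sparse variants differ only in how the set is filled:

func BenchmarkOptIterDense(b *testing.B) {
    s := &BitSet{}
    for i := 0; i < 1000000; i++ {
        s.Add(i) // dense: every element is set
    }
    b.ResetTimer()
    for n := 0; n < b.N; n++ {
        sum := 0
        for it := s.Iter(); it.Next(); {
            sum += it.Current()
        }
        _ = sum
    }
}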

I’ve also implemented the naive and optimised versions of the iterators with channels for the benchmarks. You can find the code for the implementations and the benchmarks in the repo. The benchmark names should be fairly obvious: the part after “Benchmark” is the iterator flavour; “Dense” means the set is full, while “SparseN” means the set has every Nth element set. So, for example, “BenchmarkOptIterSparse1000” benchmarks the optimised implementation of the iterator with every 1000th element set. That’s a 0.1% load factor.
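
For reference, the channel flavour looks roughly like this (a sketch; the method name and the details of the repo’s implementations may differ): a goroutine sends elements on an unbuffered channel and closes it when it’s done.

func (s *BitSet) ChanNaiveIter() <-chan int {
    ch := make(chan int)
    go func() {
        defer close(ch)
        for i := 0; i < s.Cap(); i++ {
            if s.Has(i) {
                ch <- i
            }
        }
    }()
    return ch
}

Every element costs at least one channel send and receive, plus the goroutine scheduling behind them, which goes a long way towards explaining the numbers below.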

Let’s look at the results:

cpu: Intel(R) Core(TM) i5-4278U CPU @ 2.60GHz
BenchmarkNaiveIterDense-4                    271       4004419 ns/op         483 B/op          0 allocs/op
BenchmarkOptIterDense-4                      364       3880946 ns/op         360 B/op          0 allocs/op
BenchmarkChanNaiveIterDense-4                  5     249094165 ns/op       26360 B/op          2 allocs/op
BenchmarkChanOptIterDense-4                    5     223480939 ns/op       26584 B/op          3 allocs/op
BenchmarkNaiveIterSparse100-4                382       3168522 ns/op         343 B/op          0 allocs/op
BenchmarkOptIterSparse100-4                 1482        880012 ns/op          88 B/op          0 allocs/op
BenchmarkNaiveIterSparse1000-4               391       3078779 ns/op         335 B/op          0 allocs/op
BenchmarkOptIterSparse1000-4                8874        122877 ns/op          14 B/op          0 allocs/op
BenchmarkNaiveIterSparse10000-4              394       3052418 ns/op         332 B/op          0 allocs/op
BenchmarkOptIterSparse10000-4              24816         48760 ns/op           5 B/op          0 allocs/op
BenchmarkChanNaiveIterSparse100-4            182       6479458 ns/op         864 B/op          2 allocs/op
BenchmarkChanOptIterSparse100-4              452       2969189 ns/op         430 B/op          2 allocs/op
BenchmarkChanNaiveIterSparse1000-4           312       3826812 ns/op         562 B/op          2 allocs/op
BenchmarkChanOptIterSparse1000-4            3910        290776 ns/op         172 B/op          2 allocs/op
BenchmarkChanNaiveIterSparse10000-4          354       4126216 ns/op         506 B/op          2 allocs/op
BenchmarkChanOptIterSparse10000-4          24386         49496 ns/op         141 B/op          2 allocs/op

For full sets, the naive and optimised iterators are in the same ballpark. The optimised iterator tended to come out faster in most runs, but the difference is most likely down to noise.

The channel implementations, on the other hand, were roughly 60 times slower. While I expected using channels to come with a performance penalty, this is more expensive than I would have guessed.

As the load factor decreases and the runs of zeroes get longer, the optimised iterator does consistently better. The naive implementation takes about the same time regardless of the load factor, which is not surprising.

The channel iterators also get significantly faster as the set gets sparser. The channel-based optimised implementation even beats the naive non-channel implementation when the load factor is 0.1%.

Conclusions

What did we learn? First, that seemingly simple problems can get quite tricky when you try to implement them. For me, the devil was in the details with this algorithm: getting all the edge cases right took some time and fiddling with the code.

Second, you can start with a less-than-ideal algorithm and iteratively improve it through a combination of insight and (almost) mechanical transformations.

Third, tools matter. It took me about 5 minutes to write the C# implementation of the optimised algorithm:

public IEnumerable<int> OptIter()
{
    for (var word = 0; word < words.Length; word++)
    {
        var segment = words[word];
        for (var bit = 0; segment > 0; bit++, segment >>= 1)
        {
            if ((segment & 0x01) != 0)
            {
                yield return word * 64 + bit;
            }
        }
    }
}

Granted, this was after I’d already written the Go implementation. It took somewhat longer to come up with the optimised version in Go.


  1. If we initialized off to 0, we would skip the very first element of the set. Starting at -1 is also why Next must be called before the first call to Current.

  2. If we leave out the it.off++ before the loop, the iterator gets stuck on the first element of the set.

  3. This duplication of the increment is somewhat unfortunate, but we have to do it like this because Go lacks C’s do/while or Pascal’s repeat/until.

  4. Well, not quite the same: the loop in the prologue does one extra test per iteration.

  5. We could tidy this up further by moving the it.boff++ statements into the inner for loop.