This post describes a particular technique that is useful for optimizing functional programs, which I call explicit laziness. This example came about because I needed a purely functional implementation of an infinite list of primes, to demonstrate the capabilities of exporting OpenTheory logic packages to Haskell. The Sieve of Eratosthenes is a classic algorithm to compute this, and in an imperative language can be implemented with an array where it is cheap to update elements. In translating the algorithm to a purely functional language the array data structure naturally becomes a list, where it is much cheaper to update leading elements than trailing elements. This led to the idea of using explicit laziness to minimize the number of updates to trailing elements, and the result is an improved algorithm that could be backported to speed up an imperative implementation. The price of the improvement is greater complexity, but this can be hidden inside an abstract type, and the increased verification challenge can be met by the HOL Light theorem prover which is used to generate the source logic package.
Example Sieve Program
To illustrate the explicit laziness technique, we'll use it to optimize a Haskell program for computing prime numbers based on a sieve method. This is the central Sieve data structure:
newtype Sieve =
  Sieve { unSieve :: (Natural,[(Natural,Natural)]) }
Values of type Natural are non-negative integers of unbounded size. Every value Sieve (n,ps) of type Sieve must satisfy the invariant that map fst ps is the list of all prime numbers up to the perimeter n in ascending order, and the counter k in every element (p,k) of the list ps satisfies k == n `mod` p. The following examples are valid values of type Sieve:
Sieve (9,[(2,1),(3,0),(5,4),(7,2)])
Sieve (10,[(2,0),(3,1),(5,0),(7,3)])
Sieve (11,[(2,1),(3,2),(5,1),(7,4),(11,0)])
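For instance, in the first example 9 `mod` 7 == 2, which matches the counter stored with the prime 7. The counter part of the invariant is easy to state in code; the following checker is a hypothetical sketch of mine, not part of the package:

checkCounters :: Sieve -> Bool
checkCounters s =
  let (n,ps) = unSieve s in
  -- every stored counter must equal the perimeter modulo its prime
  -- (the primality part of the invariant is not checked here)
  all (\(p,k) -> k == n `mod` p) ps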
The Sieve API supports two operations for creating new values of type Sieve:
initial :: Sieve
increment :: Sieve -> (Bool,Sieve)
The initial value of a Sieve is simply:
initial :: Sieve
initial = Sieve (1,[])
The increment operation takes a value Sieve (n,_) and returns a new value Sieve (n + 1, _), together with a boolean indicating whether the new perimeter n + 1 is a prime number:
increment :: Sieve -> (Bool,Sieve)
increment =
    \s ->
      let (n,ps) = unSieve s in
      let n' = n + 1 in
      let ps' = inc ps in
      let b = check ps' in
      -- if n' is a new prime, append it with counter 0
      let ps'' = if b then ps' ++ [(n',0)] else ps' in
      (b, Sieve (n', ps''))
  where
    -- advance every counter by 1, wrapping around at its prime
    inc = map (\(p,k) -> (p, (k + 1) `mod` p))
    -- the new perimeter is prime iff no counter wrapped to zero
    check = all (\(_,k) -> k /= 0)
The increment operation works by adding 1 to the perimeter (n <- n + 1) and to every counter (k <- (k + 1) `mod` p). It then checks whether any counter is zero, which happens when the new perimeter is divisible by a known prime. If no counter is zero then the new perimeter is prime, and so (n',0) is added to the end of the list. For example, applying increment to the first example value above yields (False, Sieve (10,[(2,0),(3,1),(5,0),(7,3)])), since 10 is divisible by the known primes 2 and 5.
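To see the API in action, here is a minimal sketch of a primes generator built on top of it; the generator itself is my illustration and not part of the original package, and it assumes Natural comes from Numeric.Natural:

import Numeric.Natural (Natural)

-- Produce the infinite list of primes by repeatedly incrementing the
-- sieve and keeping each new perimeter that is flagged as prime.
primes :: [Natural]
primes = go initial
  where
    go s =
      let (b,s') = increment s in
      let n = fst (unSieve s') in
      if b then n : go s' else go s'

For example, take 5 primes evaluates to [2,3,5,7,11].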
Motivating Explicit Laziness
How can we optimize this program? Well, there are certainly many ways that we can help the compiler produce efficient target code, but before we get into that we should consider algorithmic improvements. For example, are we performing any unnecessary computation? It seems that we are, because we increment every counter even though the boolean result only needs to know whether any counter is zero. We could stop at the first zero counter and just make a note (called a closure) that the other counters need to be incremented later. This is an instance of lazy evaluation, and since Haskell is a lazy language this optimization will be performed automatically.
Unfortunately, there is still a performance problem here. Consider a time in the future when we are incrementing counters and we come to a counter with an increment closure. At this point Haskell's lazy evaluation will execute the closure to increment the counter, and then increment it for a second time to see whether it is zero. This is functionally correct, but it would be much more efficient to add 2 to the counter rather than add 1 and then add 1 again. Even worse, it turns out that the increment closures stack up on counters during evaluation, to the point that each time we discover a new prime p we have to increment the counter k in the final list element (q,k) a total of p - q times before we can check that it is non-zero and record the new prime.
How can we avoid this inefficient behavior of adding m to a counter by adding 1 to it m times? This is a case where Haskell's automatic laziness is not flexible enough, because whenever evaluation encounters a value with a closure, the only possible course of action is to execute the closure and then continue evaluation. There is no way to examine the closure to find out what computation it will perform. However, we can manually simulate lazy evaluation by storing data structures together with explicit notes saying what computation needs to be done before they can be used.
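As a tiny standalone illustration of this idea (a hypothetical sketch, not from the sieve program), we can pair a number with an explicit note recording how much still has to be added to it; because the note is ordinary data, deferred additions can be combined before any of them is executed:

-- A number paired with an explicit note of pending work: the second
-- component records how much still has to be added to the first.
data Delayed = Delayed Natural Natural

-- Defer an addition by folding it into the note, in one operation.
addLater :: Natural -> Delayed -> Delayed
addLater m (Delayed v d) = Delayed v (d + m)

-- Execute all the pending work with a single addition.
force :: Delayed -> Natural
force (Delayed v d) = v + d

In contrast to a stack of automatic (+1) closures, adding m later costs a single note update here, and force performs one addition no matter how much work was deferred.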
Demonstrating Explicit Laziness
Here is how explicit laziness works in our example. First, we extend the Sieve data structure as follows:
newtype Sieve =
  Sieve { unSieve :: (Natural,[(Natural,(Natural,Natural))]) }
This is exactly the same as before, except that each list element (p,(k,j)) has an extra natural number component j, which is a note that j must be added to every counter following this one in the list.
The definition of the initial value remains the same:
initial :: Sieve
initial = Sieve (1,[])
However, the increment operation now takes the explicit j notes into account:
increment :: Sieve -> (Bool,Sieve)
increment =
    \s ->
      let (n,ps) = unSieve s in
      let n' = n + 1 in
      -- start with a pending increment of 1 for the new perimeter
      let (b,ps') = inc n' 1 ps in
      (b, Sieve (n',ps'))
  where
    -- no counter hit zero, so n is a new prime: append it with
    -- counter 0 and an empty note
    inc n _ [] = (True, (n,(0,0)) : [])
    inc n i ((p,(k,j)) : ps) =
      -- apply the pending increment i to this counter
      let k' = (k + i) `mod` p in
      -- fold i into this element's note
      let j' = j + i in
      -- if the counter hit zero, stop: leave j' pending for the rest
      if k' == 0 then (False, (p,(0,j')) : ps)
      -- otherwise absorb the note into i and clear it
      else let (b,ps') = inc n j' ps in (b, (p,(k',0)) : ps')
The inc and check functions in the previous implementation have been combined into a single inc function, which takes a parameter i recording how much to increment counters. Initially i is set to 1, but it absorbs the j values of the counters it visits. As soon as a zero counter is discovered, the i parameter is added to the current j value and the rest of the list is left untouched. This last point is crucial: explicit laziness eliminates the overhead of closures that are normally generated to update data structures not directly required for the result. For comparison with the version of the program without explicit laziness, here are the Sieve values with perimeters 9, 10 and 11:
Sieve (9,[(2,(1,0)),(3,(0,2)),(5,(2,0)),(7,(0,0))])
Sieve (10,[(2,(0,1)),(3,(0,2)),(5,(2,0)),(7,(0,0))])
Sieve (11,[(2,(1,0)),(3,(2,0)),(5,(1,0)),(7,(4,0)),(11,(0,0))])
Values of the new version of the Sieve type are much harder to read than values of the original version, because each j value has to be added to every counter following it in the list. This complicates the invariant for Sieve values, but it is also a measure of the efficiency gained. For example, the j value of 2 in the Sieve value with perimeter 9 replaces 4 counter increments in the original program: 2 for the 5 counter and another 2 for the 7 counter.
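To make the correspondence with the original representation concrete, here is a small helper (my own sketch, not part of the published package) that pushes the pending j notes through the list and recovers the plain counters:

-- Recover the original (prime,counter) pairs by accumulating the
-- pending j increments from left to right.
counters :: Sieve -> [(Natural,Natural)]
counters s = go 0 (snd (unSieve s))
  where
    go _ [] = []
    go acc ((p,(k,j)) : ps) =
      (p, (k + acc) `mod` p) : go (acc + j) ps

Applied to the Sieve value with perimeter 9 above, this yields [(2,1),(3,0),(5,4),(7,2)], exactly the original representation.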
Performance Comparison
Even though explicit laziness is an algorithmic optimization, and there remain many optimizations that can be applied to both versions of the example sieve program, it is nevertheless interesting to gauge the performance effects of explicit laziness using a standard compiler. The following table presents the time taken to compute the Nth prime using both versions of the example sieve program, for various values of N:
Nth prime                 |    100 |    200 |    400 |    800 |   1600 |   3200
Without explicit laziness | 0.007s | 0.019s | 0.068s | 0.284s | 1.169s | 5.008s
With explicit laziness    | 0.004s | 0.008s | 0.025s | 0.083s | 0.345s | 1.495s
It is gratifying to see that the version of the example sieve program with explicit laziness is about 3x faster on this small experiment.
Summary
In this post we have described explicit laziness, a useful technique for optimizing functional programs. Like all optimization techniques, its effectiveness depends on the context: automatic laziness is more flexible and guarantees that closures will be evaluated at most once, so it should be the default. However, the judicious introduction of explicit laziness can eliminate closures and allow future computations to be combined into more efficient variants before executing them (a limited form of computation on computations).
Appendix
The code for the optimized version of the example sieve program is available on Hackage: this verified Haskell package is automatically generated from an OpenTheory package. The performance results were gathered using the GHC 7.4.1 compiler on a MacBook Air with a 1.8GHz Intel Core i7 processor and 4GB of RAM, running OS X 10.7.5.
I presented this at a pdxfunc meeting on February 11, after which Bart Massey asked me if this was an implementation of The Genuine Sieve of Eratosthenes. Alas, I discovered that both versions are unfaithful implementations; I should implement the genuine sieve and compare its performance to the explicitly lazy version.