Optimizing Performance of Orthogonal Polynomials
In a previous post I talked about a few methods to compute the Zernike polynomials, and alluded to a way to compute them significantly faster.
This post walks through doing that with the Jacobi polynomials, which underpin the Zernikes, and shows the optimization of that routine by multiple orders of magnitude.
To recap, the Jacobi polynomials are orthogonal polynomials on the domain $[-1,1]$ with respect to some weighting parameters $\alpha, \beta$. They can be computed with a recurrence relation:
$$ a \cdot P_n^{(\alpha,\beta)} = b \cdot x \cdot P_{n-1}^{(\alpha,\beta)} - c \cdot P_{n-2}^{(\alpha,\beta)} $$
This recurrence relation removes all powers from the calculation, which yields numerical stability; powers of large or small numbers creates catastrophic rounding error. A benefit of the recurrence is that if we already have $P_{n-1}$ and $P_{n-2}$, computing $P_{n}$ becomes; three multiplies, a subtract, and a divide.
The simplest possible algorithm computes $a, b, c$ in one step, initializes the recurrence, and then loops from $N=2$ to $N=n$ to compute $P_{n}$. That looks like this:
function abc(n, α, β, x)
# the parens emphasize the terms, not PEMDAS
a = (2n) * (n + α + β) * (2n + α + β - 2)
b1 = (2n + α + β -1)
b2 = (2n + α + β)
b2 = b2 * (b2 - 2)
b = b1 * (b2 * x + α^2 - β^2)
c = 2 * (n + α - 1) * (n + β - 1) * (2n + α + β)
return a, b, c
end
function jacobi(n, α, β, x)
if n == 0
return 1
elseif n == 1
return (α + 1) + (α + β + 2) * ((x-1)/2)
end
Pnm2 = jacobi(0, α, β, x)
Pnm1 = jacobi(1, α, β, x)
a, b, c = abc(2, α, β, x)
Pn = ((b * Pnm1) - (c * Pnm2)) / a
if n == 2
return Pn
end
for i = 3:n
Pnm2 = Pnm1
Pnm1 = Pn
a, b, c = abc(n, α, β, x)
Pn = ((b * Pnm1) - (c * Pnm2)) / a
end
return Pn
end
In just 31 lines, we have a function which computes the Jacobi polynomials, which means we can compute Jacobi, Chevyshev, Legendre, and Zernike polynomials. This is powerful. Is it fast?
x=-1:0.00001:1;
n=10;
α=0.;
β=0;
@benchmark jacobi.($n, $α, $β, $x)
BenchmarkTools.Trial:
memory estimate: 1.53 MiB
allocs estimate: 4
--------------
minimum time: 15.138 ms (0.00% GC)
median time: 15.490 ms (0.00% GC)
mean time: 15.644 ms (0.60% GC)
maximum time: 18.700 ms (16.95% GC)
--------------
samples: 320
evals/sample: 1
x has 200001 elements, so this took 75.7 ns per element. My CPU runs at about 0.2 ns per clock, so this is about 380 cycles per element.
Ignoring the difference between the Zernike and Jacobi polynomials, and remembering that Zernike has extra steps and will be slower, this compares to 3.48 ms for 256x256, or 53.1 ns per element.
Didn’t I say this recurrence thing would be faster? If you look back at the abc
function, it is actually a fairly large computation. We can see already that a
and c
do not depend on x
, and we can refactor them out:
function calcac(n, α, β)
a = (2n) * (n + α + β) * (2n + α + β - 2)
c = 2 * (n + α - 1) * (n + β - 1) * (2n + α + β)
return a, c
end
function calcb(n, α, β, x)
# the parens emphasize the terms, not PEMDAS
b1 = (2n + α + β -1)
b2 = (2n + α + β)
b2 = b2 * (b2 - 2)
b = b1 * (b2 * x + α^2 - β^2)
return b
end
function jacobi2(n, α, β, x)
if n == 0
return ones(size(x))
elseif n == 1
return (α + 1) .+ (α + β + 2) .* ((x.-1)./2)
end
Pnm1 = (α + 1) .+ (α + β + 2) .* ((x.-1)./2)
a, c = calcac(2, α, β)
inva = 1 / a
Pn = ((calcb.(2, α, β, x) .* Pnm1) .- c) .* inva
if n == 2
return Pn
end
Pnm2 = similar(Pnm1)
for i = 3:n
Pnm2, Pnm1 = Pnm1, Pn
a, c = calcac(n, α, β)
inva = 1 / a
Pn = ((calcb.(n, α, β, x) .* Pnm1) .- (c .* Pnm2)) ./ a
end
return Pn
end
By refactoring a and c out of the elementwise calculation we have greatly reduced the work. A trade is that the function now operates on arrays instead of scalar values. The implications of that will become clear in a moment. Did it get any faster?
# note lack of dot between jacobi2 and the arguments
@benchmark jacobi2($n, $α, $β, $x)
BenchmarkTools.Trial:
memory estimate: 15.26 MiB
allocs estimate: 36
--------------
minimum time: 7.516 ms (0.00% GC)
median time: 8.002 ms (0.00% GC)
mean time: 9.001 ms (11.88% GC)
maximum time: 14.984 ms (0.00% GC)
--------------
samples: 556
evals/sample: 1
Roughly twice as fast, thanks to the removal of significant waste work. There is a subtle performance improvement made by manually inlining Pnm1 and Pn when things are initialized. Unfortunately, the change from scalar to array calling has lead to the function now making more than one allocation per order $n$ due to repeated creation of new Pn matricies.
We can remove most of the allocations by reusing an array, and manually inline the b term of the recurrence will also boost performance. The Julia compiler seems to make sub-optimal choices either with the @inline macro, or without.
function jacobi3(n, α, β, x)
if n == 0
return ones(size(x))
elseif n == 1
return (α + 1) .+ (α + β + 2) .* ((x.-1)./2)
end
Pnm1 = (α + 1) .+ (α + β + 2) .* ((x.-1)./2)
a, c, b1, b2, b3 = calcac_startb(2, α, β)
inva = 1 / a
# .- c .* Pnm2, but Pnm2 == 1, so drop it
Pn = similar(Pnm1)
Pn .= (b1 .* (b2 .* x .+ b3) .* Pnm1 .- c) .* inva
if n == 2
return Pn
end
Pnm2 = similar(Pn);
for i = 3:n
# assign Pnm2 to Pn to reuse a buffer
Pnm2, Pnm1, Pn = Pnm1, Pn, Pnm2
a, c, b1, b2, b3 = calcac_startb(i, α, β)
inva = 1 / a
Pn .= (b1 .* (b2 .* x .+ b3) .* Pnm1 .- c .* Pnm2) .* inva
end
return Pn
end
This function differs from jacobi2 in that part of the b term in the recurrence relation has been refactored to avoid repeat work, the b term’s application w.r.t x has been manually inlined, and Pn is assigned with .=
for in-place assignment (“overwrite”) instead of making a new array each time.
Before showing the performance of this version, I’ll note that I added the @muladd
macro in a fourth version, which was about 20% faster, and another version with explicit AVX vectorization:
function jacobi4(n, α, β, x)
# ..
for i = 3:n
# ..
@avx unroll=4 for i in eachindex(Pn)
Pn[i] = (b1 * (b2 * x[i] + b3) * Pnm1[i] - c * Pnm2[i]) * inva
end
end
return Pn
end
AVX is a special instruction set on your CPU that operates on multiple numbers at once. It is distinct to multi-threading.
The results:
@benchmark jacobi3($n, $α, $β, $x)
BenchmarkTools.Trial:
memory estimate: 4.58 MiB
allocs estimate: 6
--------------
minimum time: 2.328 ms (0.00% GC)
median time: 2.604 ms (0.00% GC)
mean time: 2.873 ms (7.55% GC)
maximum time: 6.276 ms (33.33% GC)
--------------
samples: 1739
evals/sample: 1
@benchmark jacobi4($n, $α, $β, $x)
BenchmarkTools.Trial:
memory estimate: 4.58 MiB
allocs estimate: 6
--------------
minimum time: 1.493 ms (0.00% GC)
median time: 2.236 ms (0.00% GC)
mean time: 2.636 ms (12.19% GC)
maximum time: 9.074 ms (37.75% GC)
--------------
samples: 1895
evals/sample: 1
The removal of the waste allocations made the code about 3x faster, and another 50% improvement by using vector instructions.
This works out to 7.5 ns per element, about 10x faster than we started.
This is not all there is, though. For comparison, we will write out the simplest possible jacobi sum function, using the slow waste work filled jacobi function we wrote first:
function jacobi_sum_simple(ns, weights, α, β, x)
out = zero(x)
for (n, w) in zip(ns, weights)
out .+= (w .* jacobi.(n, α, β, x))
end
return out
end
@benchmark jacobi_sum_simple(1:100, 100 : -1 : 1, 0., 0., $x)
BenchmarkTools.Trial:
memory estimate: 1.53 MiB
allocs estimate: 2
--------------
minimum time: 5.364 s (0.00% GC)
median time: 5.364 s (0.00% GC)
mean time: 5.364 s (0.00% GC)
maximum time: 5.364 s (0.00% GC)
--------------
samples: 1
evals/sample: 1
Higher orders are inherently more time consuming, so the total time is weighted towards the largest n involved. You might think this is as fast as it gets with a little intuition. However, we are not exploiting the recurrence relation. This function is more complicated, but avoids any intermediate allocations and uses the recurrence to avoid repeat work:
function jacobi_sum_onepass(ns, weights, α, β, x)
# start by allocating the output array, be a little clever
# around the initialization value
out = similar(x)
if ns[1] == 0
fill!(out, weights[1])
else
fill!(out, 0)
end
# Pnm1 = P_1, Pnm2= P_0, == 1
# this block of setup is the same as before
Pnm1 = (α + 1) .+ (α + β + 2) .* ((x.-1)./2)
a, c, b1, b2, b3 = calcac_startb(2, α, β)
inva = 1 / a
# .- c .* Pnm2, but Pnm2 == 1, so drop it
Pn = similar(Pnm1)
Pn .= @muladd (b1 .* (b2 .* x .+ b3) .* Pnm1 .- c) .* inva
# add P_2 if we can. Two iffs is faster than
# a searchsorted
if ns[1] == 1
out .+= weights[1] .* Pnm1
elseif ns[2] == 1
out .+= weights[2] .* Pnm1
end
if ns[1] == 2
out .+= weights[1] .* Pn
elseif ns[2] == 2
out .+= weights[2] .* Pn
end
Pnm2 = similar(Pn);
for i = 3:ns[end]
# assign Pnm2 to Pn to reuse a buffer
Pnm2, Pnm1, Pn = Pnm1, Pn, Pnm2
a, c, b1, b2, b3 = calcac_startb(i, α, β)
inva = 1 / a
@avx unroll=4 for i in eachindex(Pn)
Pn[i] = (b1 * (b2 * x[i] + b3) * Pnm1[i] - c * Pnm2[i]) * inva
end
# i is the order that was just computed
for j in eachindex(ns)
if ns[j] == i
# this order is relevant to the sum
out .+= weights[j] .* Pn
break
end
end
end
return out
end
This is longer, but is mostly a copy paste – the big addition is the if..elseif block in the middle, and the inner for loop. The performance reflects the fact that we compute each term only once:
@benchmark jacobi_sum_onepass(1:100, 100: -1 : 1, 0., 0., $x)
BenchmarkTools.Trial:
memory estimate: 6.10 MiB
allocs estimate: 8
--------------
minimum time: 20.800 ms (0.00% GC)
median time: 23.255 ms (0.00% GC)
mean time: 24.710 ms (1.98% GC)
maximum time: 39.047 ms (0.00% GC)
--------------
samples: 203
evals/sample: 1
So for a weighted sum up to order 100, this is 270x faster.
Recap
In summary, we replaced a calculation that took on the order of 50 nanoseconds per element with one that takes on the order of 5 nanoseconds per element, by replacing a factorial expansion with a recurrence relation, and then optimizing the implementation of that recurrence relation.
We then used that recurrence relation to remove redundant work in computing sums of orthogonal polynomials, which yielded an additional 270x speedup.
The results combine to create an orthogonal polynomial sum calculator that is ~2,500x faster. The performance gap is larger for larger sums of terms and lesser for smaller sums of terms.
Future work will expand this to the Zernike polynomials, which require writing the same function body a third time to inline the recurrence relation. The dual expansion in n and m requires special care.