## Power-of-two padding

In my previous post, I wrote about using FFTs to compute a full convolution. If the vector `x` has $K$ points and the vector `h` has $L$ points, then the convolution of `x` and `h` has $K+L-1$ points. FFTs can be used to compute this convolution, but **only** if `x` and `h` are zero-padded to have **at least** $K+L-1$ points in the FFT computation.
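
As a quick cross-check of that padding requirement, here is a NumPy sketch (an illustration I've added, not from the original post) showing that a transform of at least $K+L-1$ points recovers the full convolution, while a shorter transform produces a *circular* convolution in which the tail wraps around:

```python
import numpy as np

x = np.array([1.0, -1.0, 0.0, 4.0])  # K = 4 points
h = np.array([1.0, 0.0, -1.0])       # L = 3 points
N = len(x) + len(h) - 1              # K + L - 1 = 6

# Zero-padding both signals to N points recovers the full convolution.
y_full = np.fft.ifft(np.fft.fft(x, N) * np.fft.fft(h, N)).real
print(y_full)   # [ 1. -1. -1.  5.  0. -4.], same as np.convolve(x, h)

# Padding only to K = 4 points instead gives a circular convolution:
# the last two samples of the full result wrap around and are added
# onto the first two samples.
y_circ = np.fft.ifft(np.fft.fft(x, 4) * np.fft.fft(h, 4)).real
print(y_circ)   # [ 1. -5. -1.  5.]
```

Here `y_full` matches the MATLAB result below, while `y_circ` is corrupted in its second sample, where the wrapped-around value $-4$ has been added to $-1$.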

Early in my career, the commonly available FFT algorithms, which were usually written in Fortran, were limited to transform lengths that were powers of two. Because of this limitation, using FFTs to compute convolution required zero-padding to the smallest power of two that was greater than or equal to $K+L-1$. Here's how that might look in MATLAB code.

```
x = [1 -1 0 4];
h = [1 0 -1];
K = length(x);
L = length(h);
N = K + L - 1
```

```
N = 6
```

```
Np = 2^nextpow2(N)
```

```
Np = 8
```

```
X = fft(x,Np); % Computes Np-point zero-padded FFT
H = fft(h,Np); % Computes Np-point zero-padded FFT
Y = ifft(X .* H);
y = Y(1:N)
```

```
y = 1x6
1.0000 -1.0000 -1.0000 5.0000 0 -4.0000
```

## Small-primes padding

More than 20 years ago now, I integrated the FFTW library into MATLAB. FFTW, or "Fastest Fourier Transform in the West," won the 1999 J. H. Wilkinson Prize for Numerical Software. At the time, I spent many hours deep in the FFTW documentation, and this statement caught my attention: "The standard FFTW distribution works most efficiently for arrays whose size can be factored into small primes (2, 3, 5, and 7), and otherwise it uses a slower general-purpose routine."

This observation suggests an interesting possibility: instead of using only transform lengths that are powers of 2, use transform lengths whose prime factors are all 2, 3, 5, or 7.

Here is a function that computes a transform length based on this idea.

```
function np = transformLength(n)
    np = n;
    while true
        % Divide np evenly, as many times as possible, by the factors
        % 2, 3, 5, and 7.
        r = np;
        for p = [2 3 5 7]
            while (r > 1) && (mod(r, p) == 0)
                r = r / p;
            end
        end
        if r == 1
            % If the result after the above divisions is 1, then np has
            % no prime factors greater than 7, so we have found the
            % desired number. Break out of the loop.
            break;
        else
            % np has one or more prime factors greater than 7, so try
            % the next integer.
            np = np + 1;
        end
    end
end
```

Suppose `n` is 37, a prime number. The next power of 2 is 64. The next number whose largest prime factor is 7 or less, on the other hand, is 40.

```
n = 37;
np = transformLength(n)
```

```
np = 40
```

```
factor(np)
```

```
ans = 1x4
2 2 2 5
```
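
The same search is easy to express in other languages. Here is a plain-Python equivalent of `transformLength` (a direct translation I've added for readers who want to experiment outside MATLAB, not code from the original post):

```python
def transform_length(n):
    """Smallest integer >= n whose prime factors are all 7 or less."""
    candidate = n
    while True:
        r = candidate
        # Divide out the factors 2, 3, 5, and 7 as many times as possible.
        for p in (2, 3, 5, 7):
            while r > 1 and r % p == 0:
                r //= p
        if r == 1:
            return candidate  # no prime factor greater than 7 remains
        candidate += 1        # try the next integer

print(transform_length(37))  # 40, compared with 64 for power-of-2 padding
```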

## 2-D performance comparisons

To see how different padding strategies affect performance, I will set up some different FFT-based 2-D convolution functions and measure their execution time for different input sizes. Here is a function that zero-pads only by the minimum amount required, to $K+L-1$.

```
function C = conv2_fft_minpad(A,B)
    [K1,K2] = size(A);
    [L1,L2] = size(B);
    N1 = K1 + L1 - 1;
    N2 = K2 + L2 - 1;
    C = ifft2(fft2(A,N1,N2) .* fft2(B,N1,N2));
end
```

Let’s measure the time required to convolve a 1170x1170 input with a 101x101 input. The output will be 1270x1270, and 1270 has the prime factorization 2·5·127.

```
A = rand(1170,1170);
B = rand(101,101);
f_minpad = @() conv2_fft_minpad(A,B);
timeit(f_minpad)
```

```
ans = 0.0329
```

Here is a function that zero-pads to the next power of 2. (The next power of 2 above 1270 is 2048.)

```
function C = conv2_fft_pow2pad(A,B)
    [K1,K2] = size(A);
    [L1,L2] = size(B);
    N1 = K1 + L1 - 1;
    N2 = K2 + L2 - 1;
    N1p = 2^nextpow2(N1);
    N2p = 2^nextpow2(N2);
    C = ifft2(fft2(A,N1p,N2p) .* fft2(B,N1p,N2p));
    C = C(1:N1,1:N2);
end
```

Repeat the timing experiment using this second padding method:

```
f_pow2pad = @() conv2_fft_pow2pad(A,B);
timeit(f_pow2pad)
```

```
ans = 0.0553
```

So, power-of-2 padding is not helping in this case. The fact that power-of-2 padding is slower than using a transform length with relatively large prime factors represents a sea change from the way FFT computations in MATLAB performed prior to 2004, which is when MATLAB started using FFTW.

Can small-primes padding do better?

```
function C = conv2_fft_smallprimespad(A,B)
    [K1,K2] = size(A);
    [L1,L2] = size(B);
    N1 = K1 + L1 - 1;
    N2 = K2 + L2 - 1;
    N1p = transformLength(N1);
    N2p = transformLength(N2);
    C = ifft2(fft2(A,N1p,N2p) .* fft2(B,N1p,N2p));
    C = C(1:N1,1:N2);
end
```

Run the timing again with the small-primes padding method:

```
f_smallprimespad = @() conv2_fft_smallprimespad(A,B);
timeit(f_smallprimespad)
```

```
ans = 0.0230
```

Yes, the convolution function using the small-primes method runs about 45% faster than the minimum-padding version for this case.
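
The padded transform sizes themselves suggest why, at least as a back-of-the-envelope estimate (FFT cost also depends on the factorization, not just the point count). For the example above, the minimum transform length per dimension is 1170 + 101 - 1 = 1270; power-of-2 padding inflates that to 2048 per dimension, while small-primes padding needs only 1280. This Python sketch (my own illustration, reusing the small-primes search from earlier) compares the sizes:

```python
def transform_length(n):
    # Smallest integer >= n whose prime factors are all 7 or less.
    candidate = n
    while True:
        r = candidate
        for p in (2, 3, 5, 7):
            while r > 1 and r % p == 0:
                r //= p
        if r == 1:
            return candidate
        candidate += 1

n = 1170 + 101 - 1                 # minimum transform length: 1270
pow2 = 1 << (n - 1).bit_length()   # next power of 2: 2048
small = transform_length(n)        # next 7-smooth length: 1280
print(pow2, small, (pow2 / small) ** 2)
```

The 2-D power-of-2 transform processes $(2048/1280)^2 = 2.56$ times as many points as the small-primes one, which is consistent with the timing gap measured above.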

Now let’s compare the results for a range of sizes.

```
nn = 1100:1200;
t_minpad = zeros(size(nn));
t_pow2pad = zeros(size(nn));
t_smallprimespad = zeros(size(nn));
for k = 1:length(nn)
    n = nn(k);
    A = rand(n,n);
    t_minpad(k) = timeit(@() conv2_fft_minpad(A,B));
    t_pow2pad(k) = timeit(@() conv2_fft_pow2pad(A,B));
    t_smallprimespad(k) = timeit(@() conv2_fft_smallprimespad(A,B));
end
```

Plot the results.

```
plot(nn,t_minpad,nn,t_pow2pad,nn,t_smallprimespad)
ax = gca;
ax.YLim(1) = 0;
legend(["min padding","power-of-2","small-primes"],...
Location="southeast")
title("Execution time, nxn convolution with 101x101")
xlabel("n")
ylabel("time (s)")
grid on
```

While this is admittedly not anything like a comprehensive performance study, it does suggest that:

- There is no longer a reason to do power-of-2 padding for 2-D convolution problems.
- Small-primes padding is almost always faster than minimum-length padding, even when considering the extra steps needed (computing the transform length and cropping the output of `ifft2`). In a few cases, it is about the same or just slightly worse.

In my next post in this series, I plan to investigate a possible way to speed up the computation of the small-primes transform length.