## MATLAB: Repeated multiplication is faster than integer powers

Consider the following code:

```matlab
function testpow() %function to compare integer powers with repeated multiplication
rng default
a=rand(1,10000000)*10;
disp('speed of ^4 using pow')
tic;test4p = a.^4;toc
disp('speed of ^4 using multiply')
tic;test4m = a.*a.*a.*a;toc
disp('maximum difference in results');
max_diff = max(abs(test4p-test4m))
end
```

On running it on my late 2013 MacBook Air with a Haswell Intel i5 using MATLAB 2013b, I get the following results:

```
speed of ^4 using pow
Elapsed time is 0.527485 seconds.
speed of ^4 using multiply
Elapsed time is 0.025474 seconds.
maximum difference in results
max_diff = 1.8190e-12
```

In this case (derived from a real-world case), using repeated multiplication is around twenty times faster than using integer powers in MATLAB. This leads to some questions:

- Why the huge speed difference?
- Would a similar speed difference be seen in other systems: R, Python, Julia, etc.?
- Would we see the same speed difference on other operating systems/CPUs?
- Are there any numerical reasons why using repeated multiplication instead of integer powers is a bad idea?
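For anyone who wants to probe these questions on their own machine, here is a minimal NumPy translation of the benchmark; the array size matches the MATLAB version, but the timing helper and variable names are my own, and absolute timings will of course differ by system:

```python
# Minimal NumPy translation of the MATLAB benchmark (timings are machine-dependent).
import time
import numpy as np

a = np.random.rand(10_000_000) * 10

t0 = time.perf_counter()
test4p = a**4                    # integer power
t_pow = time.perf_counter() - t0

t0 = time.perf_counter()
test4m = a * a * a * a           # repeated multiplication
t_mul = time.perf_counter() - t0

print(f"pow: {t_pow:.4f}s  multiply: {t_mul:.4f}s")
print("maximum difference:", np.max(np.abs(test4p - test4m)))
```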

Using Octave 3.6.2 on Ubuntu 12.04 with an AMD Athlon(tm) II X4 640 processor, I get the opposite, but less impressive, difference in execution times:

```
speed of ^4 using pow
Elapsed time is 0.1246 seconds.
speed of ^4 using multiply
Elapsed time is 0.1968 seconds.
maximum difference in results
max_diff = 1.8190e-12
```

Hi. In Julia:

```julia
function testpow()
#function to compare integer powers with repeated multiplication
a=rand(1,10000000)*10;
println("speed of ^4 using pow")
tic(); test4p = a.^4; toc()
println("speed of ^4 using multiply")
tic(); test4m = a.*a.*a.*a; toc()
println("maximum difference in results");
max_diff = maximum(abs(test4p-test4m))
end

testpow()
```

```
speed of ^4 using pow
elapsed time: 1.074429726 seconds
speed of ^4 using multiply
elapsed time: 0.40391358 seconds
maximum difference in results
1.8189894035458565e-12
```

I guess it may be due to some for loop overhead in the pow version.

I believe some languages/compilers implement loop unrolling to optimize small exponent cases.
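To illustrate the kind of special-casing being described here, the sketch below dispatches small integer exponents to unrolled multiplies and falls back to the general routine otherwise. This is only an illustration of the idea; the `int_pow` name and the cutoff at 4 are my own choices, not any particular library's implementation:

```python
import math

def int_pow(x, n):
    """Hypothetical sketch: special-case small integer exponents with
    unrolled multiplies, fall back to the general pow otherwise."""
    if n == 2:
        return x * x
    if n == 3:
        return x * x * x
    if n == 4:
        xx = x * x               # two multiplies instead of a general pow
        return xx * xx
    return math.pow(x, n)        # general (slower) fallback

print(int_pow(3.0, 4))           # 81.0
```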

For Mathematica it is the opposite:

```
test = RandomReal[{0, 10}, 10^8];
AbsoluteTiming[test^4;]
AbsoluteTiming[test test test test;]

{1.457870, Null}
{1.832707, Null}
```

'stupid' exponentiation (repeated multiplication) is slower.

Oh, and it probably has to do with the fact that MATLAB sees '4' not as an integer, but rather as a float (double), and uses a very general algorithm that holds for any number. But 4 is special; it is an integer…

@Sander Huisman

Indeed (at least for Mathematica):

```
test = RandomReal[{0, 10}, 10^8];
AbsoluteTiming[test^4.0;]
AbsoluteTiming[test test test test;]
(* notice there is 4.0 not 4 *)

{2.890319, Null}
{1.811797, Null}
```

My guess is that exponentiation is slower than repeated multiplication because it has to be general enough to handle non-integer powers: (x.^y) is probably implemented as exp(y.*ln(x)). Surprisingly, both that explicit formulation and repeated squaring are faster than the built-in power operator, but still slower than repeated multiplies.

I think the speed difference in other languages would very much depend on whether they use LAPACK or some other library for matrix algebra. I believe that, in general, repeated multiplies are better numerically than general exponentiation.
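The exp(y.*ln(x)) hypothesis is easy to check numerically. A quick NumPy sketch (using strictly positive inputs so the logarithm is defined; the variable names are mine) shows that the identity holds up to rounding, which is also where the tiny discrepancies between the power operator and repeated multiplication come from:

```python
import numpy as np

a = np.random.uniform(0.1, 10.0, 1000)   # strictly positive, so log(a) is finite
via_pow = a**4
via_mul = a * a * a * a
via_exp = np.exp(4 * np.log(a))          # the hypothesised general implementation

print("pow vs mul:", np.max(np.abs(via_pow - via_mul)))
print("pow vs exp:", np.max(np.abs(via_pow - via_exp)))
```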

Here are my commands and output (run on a 2009 MacBook Pro with a 2.8 GHz Intel Core 2 Duo):

```matlab
disp('speed of ^4 using pow')
tic;test4p = a.^4;toc
disp('speed of ^4 using multiply')
tic;test4m = a.*a.*a.*a;toc
disp('speed of ^4 using squaring')
tic;temp = a.*a; test4s=temp.*temp; toc
disp('speed of ^4 using exp')
tic;test4e=exp(4.*log(a)); toc
```

```
speed of ^4 using pow
Elapsed time is 0.801814 seconds.
speed of ^4 using multiply
Elapsed time is 0.067373 seconds.
speed of ^4 using squaring
Elapsed time is 0.113646 seconds.
speed of ^4 using exp
Elapsed time is 0.337066 seconds.
maximum difference in results
```

In Euler Math Toolbox, there is the same effect. I also compared with a binary multiplication, and with C.

```
>a=random(10000000)*10;
>tic; p=a^4; toc;
 Used 0.607 seconds
>tic; p=a*a*a*a; toc;
 Used 0.289 seconds
>tic; p=a*a; p=p*p; toc;
 Used 0.203 seconds
>function tinyc test (x) ...
$ARG_DOUBLE_MATRIX(x,r,c);
$RES_DOUBLE_MATRIX(y,r,c);
$int i,j;
$for (i=0; i<r; i++)
$  for (j=0; j<c; j++) ...
>tic; test(a); toc;
 Used 0.112 seconds
```

It seems that the power function (using a logarithm, a multiplication and an exponential) is not the best idea. The advantage vanishes from a^8 onward on my system.
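The crossover point where repeated multiplication stops winning is easy to explore. The sketch below times both approaches in NumPy for a few exponents; the choice of exponents and array size is arbitrary, and the crossover will differ by machine and library:

```python
# Sketch: probe where repeated multiplication stops beating the power operator.
# Results are machine- and library-dependent.
import time
import numpy as np

a = np.random.rand(1_000_000) * 10

for n in (2, 4, 8, 16):
    t0 = time.perf_counter()
    p = a**n
    t_pow = time.perf_counter() - t0

    t0 = time.perf_counter()
    r = a.copy()
    for _ in range(n - 1):       # n-1 elementwise multiplies
        r = r * a
    t_mul = time.perf_counter() - t0

    print(f"n={n:2d}  pow {t_pow:.4f}s  multiply {t_mul:.4f}s")
```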

For comparison, I tried the same code in Octave, except for the initial “rng” command, which Octave doesn’t support:

```
speed of ^4 using pow
Elapsed time is 0.106022 seconds.
speed of ^4 using multiply
Elapsed time is 0.150319 seconds.
maximum difference in results
max_diff = 1.8190e-12
```

(Acer 5742 laptop from early 2011 with Intel Core i5-450M “Arrandale”, running Debian 64-bit “testing” and Octave 3.6.4)

On my computer (2011-era Dell XPS 15 laptop with i7-2670QM and Ubuntu 13.10) I get:

- MATLAB R2013a:

```
speed of ^4 using pow
Elapsed time is 0.372011 seconds.
speed of ^4 using multiply
Elapsed time is 0.019862 seconds.
maximum difference in results
max_diff =
   1.8190e-12
```

- Octave 3.6.4:

```
speed of ^4 using pow
Elapsed time is 0.074532 seconds.
speed of ^4 using multiply
Elapsed time is 0.10198 seconds.
maximum difference in results
max_diff = 1.8190e-12
```

In MATLAB multiply is faster but in Octave pow is faster.

In Python 2.7.5:

```
In [10]: a = np.random.randn(1,10000000)*10
In [11]: %timeit a**4
1 loops, best of 3: 948 ms per loop
In [12]: %timeit a*a*a*a
10 loops, best of 3: 113 ms per loop
```

I ran your code in Octave 3.8.0 on Mac OS X 10.8.5 on a Core i5 @ 3.4 GHz, and also a version in Python 2.7.6 from Anaconda with NumPy 1.8.0, and got the following:

OCTAVE:

```
speed of ^2 using pow
Elapsed time is 0.0249081 seconds.
speed of ^2 using multiply
Elapsed time is 0.045887 seconds.
speed of ^3 using pow
Elapsed time is 0.047009 seconds.
speed of ^3 using multiply
Elapsed time is 0.11313 seconds.
speed of ^4 using pow
Elapsed time is 0.077533 seconds.
speed of ^4 using multiply
Elapsed time is 0.159042 seconds.
```

PYTHON:

```
speed of ^2 using pow
Elapsed time : 0.0255 seconds
speed of ^2 using multiply
Elapsed time : 0.0428 seconds
pow time / mult time = 0.5967
speed of ^3 using pow
Elapsed time : 0.1714 seconds
speed of ^3 using multiply
Elapsed time : 0.1091 seconds
pow time / mult time = 1.5703
speed of ^4 using pow
Elapsed time : 0.1481 seconds
speed of ^4 using multiply
Elapsed time : 0.1508 seconds
pow time / mult time = 0.9818
```

Interestingly, Octave is consistently faster using powers than multiplication for n = 2, 3, and 4. Python's power operator is faster for n = 2, slower for n = 3, and essentially the same for n = 4. I would guess that the NumPy developers have worked to optimize the n = 2 case, since squaring numbers is such a common calculation.

On my old laptop, a Dell Latitude E6400 with a Core 2 Duo P9800 (2.8 GHz) and Win7 x64:

MATLAB 2013b:

```
speed of ^4 using pow
Elapsed time is 0.972871 seconds.
speed of ^4 using multiply
Elapsed time is 0.063420 seconds.
maximum difference in results
max_diff =
   1.8190e-12
```

As far as I can tell, x.^y is also slower than exp(y.*log(x)), which might not be too surprising considering that the latter will not work for negative or zero x. For integer powers that you calculate often, rather than using repeated multiplication it's convenient to implement exponentiation by squaring. Interestingly, none of this is true for Octave, in which .^ seems to always be faster than any other hack you can try. I wrote a blog post about this a while ago: http://www.robertomontagna.it/2012/12/more-about-exponents.html
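For reference, exponentiation by squaring (the technique mentioned above) needs only O(log n) multiplications instead of n-1. A generic sketch, with my own function name:

```python
def pow_by_squaring(x, n):
    """Compute x**n for non-negative integer n using O(log n) multiplications
    (exponentiation by squaring)."""
    result = 1.0
    base = x
    while n > 0:
        if n & 1:            # current low bit set: fold the base into the result
            result *= base
        base *= base         # square the base for the next bit of n
        n >>= 1
    return result

print(pow_by_squaring(2.0, 10))   # 1024.0
```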

This is my Maple 17 code on the same system (I'm not sure if this is the best way):

```
The ^4 using power: 0.375000
The ^4 using multiply: 0.733000
```

This is for 1e5 elements; for 1e6 it takes much longer to create the vector, however:

```
The ^4 using power: 5.3030
The ^4 using multiply: 9.3600
```

And now something to show the speed/power of _pure_ Julia code. Defining a new .^ operator for element-wise integer power of matrices,

```julia
.^{T<:Number}(A::Matrix{T}, x::Integer)
```

using _pure_ Julia for loops as presented in https://gist.github.com/cdsousa/9247979, the times go from

```
speed of ^4 using pow
elapsed time: 1.074429726 seconds
speed of ^4 using multiply
elapsed time: 0.40391358 seconds
maximum difference in results
1.8189894035458565e-12
```

to

```
speed of ^4 using pow
elapsed time: 0.100119984 seconds   <-- 10x faster with new .^
speed of ^4 using multiply
elapsed time: 0.40944625 seconds
maximum difference in results
0.0
```

Do this in pure Matlab/Python ;)

BTW, this was run on an Intel(R) Core(TM) i3 M380 @ 2.53GHz.
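The "do this in pure Matlab/Python" quip is about interpreter overhead: in Julia, plain for loops compile to fast machine code, while a per-element loop in pure Python pays interpreter cost on every iteration. A sketch of that gap (array size and names are my own; the exact ratio is machine-dependent):

```python
# Sketch: per-element Python loops vs a vectorised NumPy operation.
import time
import numpy as np

a = np.random.rand(1_000_000) * 10

t0 = time.perf_counter()
vec = a**4                        # one vectorised call into compiled code
t_vec = time.perf_counter() - t0

t0 = time.perf_counter()
loop = np.empty_like(a)
for i in range(a.size):           # interpreter overhead on every element
    x = a[i]
    loop[i] = x * x * x * x
t_loop = time.perf_counter() - t0

print(f"vectorised {t_vec:.4f}s, pure-Python loop {t_loop:.4f}s")
```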

That really is very nice, thank you! It would be great to write that code up in a tutorial fashion for Julia-newbies, explaining what’s going on and why it works.

@Mike Croucher

Thanks. I myself am a Julia newbie, and I’m pretty sure my code could be improved a lot (at least for the style).

I did a post from my comments, following yours.

Cheers.

In this newsreader post, replies 7 and 9 have some background on why MATLAB's power function is slower than multiplication:

http://www.mathworks.com/matlabcentral/newsreader/view_thread/83759

More discussion on this topic at http://scicomp.stackexchange.com/questions/10919/is-there-any-numerical-reason-for-not-using-repeated-multiplication-instead-of-i

In IPython (Anaconda 1.9.1):

```
In [1]: a = random.randn(1,10000000)*10
In [2]: %timeit a**4
1 loops, best of 3: 1.09 s per loop
In [3]: %timeit a*a*a*a
10 loops, best of 3: 155 ms per loop
In [4]: %timeit (a*a)**2
10 loops, best of 3: 93.3 ms per loop
```

In Julia version 0.2.0-prerelease+4027:

```
julia> a=rand(1,1,10000000)*10;
julia> timing = @time a.*a.*a.*a;
elapsed time: 0.404306974 seconds (240002880 bytes allocated)
julia> timing = @time a.^4;
elapsed time: 0.174195625 seconds (80000216 bytes allocated)
julia> timing = @time (a.*a).^2;
elapsed time: 0.233738999 seconds (160001160 bytes allocated)
```

In Julia version 0.2.0-prerelease+4027:

```
julia> a=rand(1,10000000)*10;
julia> tic(); test4p = a.^4; toc()
elapsed time: 0.241979896 seconds
julia> tic(); test4m = a.*a.*a.*a; toc()
elapsed time: 0.733428131 seconds
julia> tic(); test4m = (a.*a).^2; toc()
elapsed time: 0.225762048 seconds
```

As noted by Tony above, squaring is a special case that is often optimized. The exponent in the floating point representation just needs incrementing by one.

The fastest way to do a fourth power in pure MATLAB seems to be (a.^2).^2.

A MEX file could be written to bit-twiddle the floating point numbers, increasing the exponent by 2. I'm guessing that would be a smidge faster, but it would have to be a real bottleneck to bother trying.

Whoops: (a.^2).^2 is fast and correct, but the bit-twiddling I suggested, which is what pow2 does, was just plain wrong!
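The reason the exponent trick fails: adding k to a double's exponent field multiplies the value by 2^k, which is exactly what pow2 (and Python's math.ldexp, used below as an illustration) does; squaring a number instead doubles the exponent and also squares the mantissa. A quick check:

```python
import math

x = 3.5
m, e = math.frexp(x)         # x == m * 2**e, with 0.5 <= m < 1
print(m, e)                  # 0.875 2
print(math.ldexp(x, 2))      # exponent + 2 gives x * 2**2 = 14.0, not x**4
print(x**4)                  # the actual fourth power: 150.0625
```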

I just came across this early MATLAB CSSM newsgroup thread on this: http://mathworks.com/matlabcentral/newsreader/view_thread/287247