## MATLAB: Repeated multiplication is faster than integer powers

Consider the following code:

```matlab
function testpow() %function to compare integer powers with repeated multiplication
rng default
a=rand(1,10000000)*10;
disp('speed of ^4 using pow')
tic;test4p = a.^4;toc
disp('speed of ^4 using multiply')
tic;test4m = a.*a.*a.*a;toc
disp('maximum difference in results');
max_diff = max(abs(test4p-test4m))
end
```

On running it on my late 2013 MacBook Air with a Haswell Intel i5 using MATLAB 2013b, I get the following results:

```
speed of ^4 using pow
Elapsed time is 0.527485 seconds.
speed of ^4 using multiply
Elapsed time is 0.025474 seconds.
maximum difference in results
max_diff = 1.8190e-12
```

In this case (derived from a real-world case), using repeated multiplication is around twenty times faster than using integer powers in MATLAB. This leads to some questions:

- Why the huge speed difference?
- Would a similar speed difference be seen in other systems: R, Python, Julia, etc.?
- Would we see the same speed difference on other operating systems/CPUs?
- Are there any numerical reasons why using repeated multiplication instead of integer powers is a bad idea?
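For anyone who wants to probe these questions on their own machine, here is a minimal NumPy translation of the benchmark; the array size matches the MATLAB version, but the timing helper and variable names are my own, and absolute timings will of course differ by system:

```python
# Minimal NumPy translation of the MATLAB benchmark (timings are machine-dependent).
import time
import numpy as np

a = np.random.rand(10_000_000) * 10

t0 = time.perf_counter()
test4p = a**4                    # integer power
t_pow = time.perf_counter() - t0

t0 = time.perf_counter()
test4m = a * a * a * a           # repeated multiplication
t_mul = time.perf_counter() - t0

print(f"pow: {t_pow:.4f}s  multiply: {t_mul:.4f}s")
print("maximum difference:", np.max(np.abs(test4p - test4m)))
```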

Using Octave 3.6.2 on Ubuntu 12.04 with an AMD Athlon(tm) II X4 640 processor, I get the opposite, but less impressive, difference in execution times:

```
speed of ^4 using pow
Elapsed time is 0.1246 seconds.
speed of ^4 using multiply
Elapsed time is 0.1968 seconds.
maximum difference in results
max_diff = 1.8190e-12
```

Hi. In Julia:

```julia
function testpow()
#function to compare integer powers with repeated multiplication
a=rand(1,10000000)*10;
println("speed of ^4 using pow")
tic(); test4p = a.^4; toc()
println("speed of ^4 using multiply")
tic(); test4m = a.*a.*a.*a; toc()
println("maximum difference in results");
max_diff = maximum(abs(test4p-test4m))
end

testpow()
```

```
speed of ^4 using pow
elapsed time: 1.074429726 seconds
speed of ^4 using multiply
elapsed time: 0.40391358 seconds
maximum difference in results
1.8189894035458565e-12
```

I guess it may be due to some for loop overhead in the pow version.

I believe some languages/compilers implement loop unrolling to optimize small exponent cases.
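To illustrate the kind of special-casing being described here, the sketch below dispatches small integer exponents to unrolled multiplies and falls back to the general routine otherwise. This is only an illustration of the idea; the `int_pow` name and the cutoff at 4 are my own choices, not any particular library's implementation:

```python
import math

def int_pow(x, n):
    """Hypothetical sketch: special-case small integer exponents with
    unrolled multiplies, fall back to the general pow otherwise."""
    if n == 2:
        return x * x
    if n == 3:
        return x * x * x
    if n == 4:
        xx = x * x               # two multiplies instead of a general pow
        return xx * xx
    return math.pow(x, n)        # general (slower) fallback

print(int_pow(3.0, 4))           # 81.0
```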

For Mathematica it is the opposite:

```
test = RandomReal[{0, 10}, 10^8];
AbsoluteTiming[test^4;]
AbsoluteTiming[test test test test;]

{1.457870, Null}
{1.832707, Null}
```

'stupid' exponentiation (repeated multiplication) is slower.

Oh, and it probably has to do with the fact that MATLAB sees '4' not as an integer, but rather as a float (double), and uses a very general algorithm that holds for any number. But 4 is special; it is an integer…

@Sander Huisman

Indeed (at least for Mathematica):

```
test = RandomReal[{0, 10}, 10^8];
AbsoluteTiming[test^4.0;]
AbsoluteTiming[test test test test;]
(* notice there is 4.0 not 4 *)

{2.890319, Null}
{1.811797, Null}
```

My guess is that exponentiation is slower than repeated multiplication because it has to be general enough to handle non-integer powers: (x.^y) is probably implemented as exp(y.*ln(x)). Surprisingly, both that explicit formulation and repeated squaring are faster than the built-in power operator, but still slower than repeated multiplies.

I think the speed difference in other languages would very much depend on whether they use LAPACK or some other library for matrix algebra. I believe that, in general, repeated multiplies are better numerically than general exponentiation.
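The exp(y.*ln(x)) hypothesis is easy to check numerically. A quick NumPy sketch (using strictly positive inputs so the logarithm is defined; the variable names are mine) shows that the identity holds up to rounding, which is also where the tiny discrepancies between the power operator and repeated multiplication come from:

```python
import numpy as np

a = np.random.uniform(0.1, 10.0, 1000)   # strictly positive, so log(a) is finite
via_pow = a**4
via_mul = a * a * a * a
via_exp = np.exp(4 * np.log(a))          # the hypothesised general implementation

print("pow vs mul:", np.max(np.abs(via_pow - via_mul)))
print("pow vs exp:", np.max(np.abs(via_pow - via_exp)))
```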

Here are my commands and output (run on a 2009 MacBook Pro with a 2.8 GHz Intel Core 2 Duo):

```matlab
disp('speed of ^4 using pow')
tic;test4p = a.^4;toc
disp('speed of ^4 using multiply')
tic;test4m = a.*a.*a.*a;toc
disp('speed of ^4 using squaring')
tic;temp = a.*a; test4s=temp.*temp; toc
disp('speed of ^4 using exp')
tic;test4e=exp(4.*log(a)); toc
```

```
speed of ^4 using pow
Elapsed time is 0.801814 seconds.
speed of ^4 using multiply
Elapsed time is 0.067373 seconds.
speed of ^4 using squaring
Elapsed time is 0.113646 seconds.
speed of ^4 using exp
Elapsed time is 0.337066 seconds.
maximum difference in results
```

In Euler Math Toolbox, there is the same effect. I also compared with a binary multiplication, and with C.

```
>a=random(10000000)*10;
>tic; p=a^4; toc;
 Used 0.607 seconds
>tic; p=a*a*a*a; toc;
 Used 0.289 seconds
>tic; p=a*a; p=p*p; toc;
 Used 0.203 seconds
>function tinyc test (x) ...
$ARG_DOUBLE_MATRIX(x,r,c);
$RES_DOUBLE_MATRIX(y,r,c);
$int i,j;
$for (i=0; i<r; i++)
$  for (j=0; j<c; j++) ...
>tic; test(a); toc;
 Used 0.112 seconds
```

It seems that the power function (using a logarithm, a multiplication and an exponential) is not the best idea. The advantage vanishes from a^8 onward on my system.
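The crossover point where repeated multiplication stops winning is easy to explore. The sketch below times both approaches in NumPy for a few exponents; the choice of exponents and array size is arbitrary, and the crossover will differ by machine and library:

```python
# Sketch: probe where repeated multiplication stops beating the power operator.
# Results are machine- and library-dependent.
import time
import numpy as np

a = np.random.rand(1_000_000) * 10

for n in (2, 4, 8, 16):
    t0 = time.perf_counter()
    p = a**n
    t_pow = time.perf_counter() - t0

    t0 = time.perf_counter()
    r = a.copy()
    for _ in range(n - 1):       # n-1 elementwise multiplies
        r = r * a
    t_mul = time.perf_counter() - t0

    print(f"n={n:2d}  pow {t_pow:.4f}s  multiply {t_mul:.4f}s")
```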

For comparison, I tried the same code in Octave, except for the initial “rng” command, which Octave doesn’t support:

```
speed of ^4 using pow
Elapsed time is 0.106022 seconds.
speed of ^4 using multiply
Elapsed time is 0.150319 seconds.
maximum difference in results
max_diff = 1.8190e-12
```

(Acer 5742 laptop from early 2011 with Intel Core i5-450M “Arrandale”, running Debian 64-bit “testing” and Octave 3.6.4)

On my computer (2011-era Dell XPS 15 laptop with i7-2670QM and Ubuntu 13.10) I get:

- MATLAB R2013a:

```
speed of ^4 using pow
Elapsed time is 0.372011 seconds.
speed of ^4 using multiply
Elapsed time is 0.019862 seconds.
maximum difference in results
max_diff =
   1.8190e-12
```

- Octave 3.6.4:

```
speed of ^4 using pow
Elapsed time is 0.074532 seconds.
speed of ^4 using multiply
Elapsed time is 0.10198 seconds.
maximum difference in results
max_diff = 1.8190e-12
```

In MATLAB multiply is faster but in Octave pow is faster.

In Python 2.7.5:

```
In [10]: a = np.random.randn(1,10000000)*10
In [11]: %timeit a**4
1 loops, best of 3: 948 ms per loop
In [12]: %timeit a*a*a*a
10 loops, best of 3: 113 ms per loop
```

I ran your code in Octave 3.8.0 on Mac OS X 10.8.5 on a Core i5 @ 3.4 GHz, and also a version in Python 2.7.6 from Anaconda with NumPy 1.8.0, and got the following:

OCTAVE:

```
speed of ^2 using pow
Elapsed time is 0.0249081 seconds.
speed of ^2 using multiply
Elapsed time is 0.045887 seconds.
speed of ^3 using pow
Elapsed time is 0.047009 seconds.
speed of ^3 using multiply
Elapsed time is 0.11313 seconds.
speed of ^4 using pow
Elapsed time is 0.077533 seconds.
speed of ^4 using multiply
Elapsed time is 0.159042 seconds.
```

PYTHON:

```
speed of ^2 using pow
Elapsed time : 0.0255 seconds
speed of ^2 using multiply
Elapsed time : 0.0428 seconds
pow time / mult time = 0.5967
speed of ^3 using pow
Elapsed time : 0.1714 seconds
speed of ^3 using multiply
Elapsed time : 0.1091 seconds
pow time / mult time = 1.5703
speed of ^4 using pow
Elapsed time : 0.1481 seconds
speed of ^4 using multiply
Elapsed time : 0.1508 seconds
pow time / mult time = 0.9818
```

Interestingly, Octave is consistently faster using powers than multiplication for n = 2, 3, and 4. Python's power operator is faster for n = 2, slower for n = 3, and essentially the same for n = 4. I would guess that the NumPy developers have worked to optimize the n = 2 case, since squaring numbers is such a common calculation.

On my old laptop, a Dell Latitude E6400 with a Core 2 Duo P9800 (2.8 GHz) and Win7 x64:

MATLAB 2013b:

```
speed of ^4 using pow
Elapsed time is 0.972871 seconds.
speed of ^4 using multiply
Elapsed time is 0.063420 seconds.
maximum difference in results
max_diff =
   1.8190e-12
```

As far as I can tell, x.^y is also slower than exp(y.*log(x)), which might not be too surprising considering that the latter will not work for negative or zero x. For integer powers that you calculate often, rather than using repeated multiplication it's convenient to implement exponentiation by squaring. Interestingly, none of this is true for Octave, in which .^ seems to always be faster than any other hack you can try. I wrote a blog post about this a while ago: http://www.robertomontagna.it/2012/12/more-about-exponents.html
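For reference, exponentiation by squaring (the technique mentioned above) needs only O(log n) multiplications instead of n-1. A generic sketch, with my own function name:

```python
def pow_by_squaring(x, n):
    """Compute x**n for non-negative integer n using O(log n) multiplications
    (exponentiation by squaring)."""
    result = 1.0
    base = x
    while n > 0:
        if n & 1:            # current low bit set: fold the base into the result
            result *= base
        base *= base         # square the base for the next bit of n
        n >>= 1
    return result

print(pow_by_squaring(2.0, 10))   # 1024.0
```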

This is my Maple 17 code on the same system (I'm not sure if this is the best way):

```
The ^4 using power: 0.375000
The ^4 using multiply: 0.733000
```

This is for 1e5 elements; for 1e6 it takes much longer to create the vector, however:

```
The ^4 using power: 5.3030
The ^4 using multiply: 9.3600
```

And now something to show the speed/power of _pure_ Julia code. Defining a new .^ operator for element-wise integer power of matrices,

```julia
.^{T<:Number}(A::Matrix{T}, x::Integer)
```

using _pure_ Julia for loops as presented in https://gist.github.com/cdsousa/9247979, the times go from

```
speed of ^4 using pow
elapsed time: 1.074429726 seconds
speed of ^4 using multiply
elapsed time: 0.40391358 seconds
maximum difference in results
1.8189894035458565e-12
```

to

```
speed of ^4 using pow
elapsed time: 0.100119984 seconds   <-- 10x faster with new .^
speed of ^4 using multiply
elapsed time: 0.40944625 seconds
maximum difference in results
0.0
```

Do this in pure Matlab/Python ;)

BTW, this was run on an Intel(R) Core(TM) i3 M380 @ 2.53GHz.
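The "do this in pure Matlab/Python" quip is about interpreter overhead: in Julia, plain for loops compile to fast machine code, while a per-element loop in pure Python pays interpreter cost on every iteration. A sketch of that gap (array size and names are my own; the exact ratio is machine-dependent):

```python
# Sketch: per-element Python loops vs a vectorised NumPy operation.
import time
import numpy as np

a = np.random.rand(1_000_000) * 10

t0 = time.perf_counter()
vec = a**4                        # one vectorised call into compiled code
t_vec = time.perf_counter() - t0

t0 = time.perf_counter()
loop = np.empty_like(a)
for i in range(a.size):           # interpreter overhead on every element
    x = a[i]
    loop[i] = x * x * x * x
t_loop = time.perf_counter() - t0

print(f"vectorised {t_vec:.4f}s, pure-Python loop {t_loop:.4f}s")
```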

That really is very nice, thank you! It would be great to write that code up in a tutorial fashion for Julia-newbies, explaining what’s going on and why it works.

@Mike Croucher

Thanks. I myself am a Julia newbie, and I’m pretty sure my code could be improved a lot (at least for the style).

I did a post from my comments, following yours.

Cheers.

In this newsreader post, replies 7 and 9 have some background on why MATLAB's power function is slower than multiplication:

http://www.mathworks.com/matlabcentral/newsreader/view_thread/83759

More discussion on this topic at http://scicomp.stackexchange.com/questions/10919/is-there-any-numerical-reason-for-not-using-repeated-multiplication-instead-of-i

In IPython (Anaconda 1.9.1):

```
In [1]: a = random.randn(1,10000000)*10
In [2]: %timeit a**4
1 loops, best of 3: 1.09 s per loop
In [3]: %timeit a*a*a*a
10 loops, best of 3: 155 ms per loop
In [4]: %timeit (a*a)**2
10 loops, best of 3: 93.3 ms per loop
```

In Julia version 0.2.0-prerelease+4027:

```
julia> a=rand(1,1,10000000)*10;
julia> timing = @time a.*a.*a.*a;
elapsed time: 0.404306974 seconds (240002880 bytes allocated)
julia> timing = @time a.^4;
elapsed time: 0.174195625 seconds (80000216 bytes allocated)
julia> timing = @time (a.*a).^2;
elapsed time: 0.233738999 seconds (160001160 bytes allocated)
```

In Julia version 0.2.0-prerelease+4027:

```
julia> a=rand(1,10000000)*10;
julia> tic(); test4p = a.^4; toc()
elapsed time: 0.241979896 seconds
julia> tic(); test4m = a.*a.*a.*a; toc()
elapsed time: 0.733428131 seconds
julia> tic(); test4m = (a.*a).^2; toc()
elapsed time: 0.225762048 seconds
```

As noted by Tony above, squaring is a special case that is often optimized. The exponent in the floating point representation just needs incrementing by one.

The fastest way to do a fourth power in pure MATLAB seems to be (a.^2).^2.

A MEX file could be written to bit-twiddle the floating point numbers, increasing the exponent by 2. I'm guessing that would be a smidge faster, but it would have to be a real bottleneck to bother trying.

Whoops: (a.^2).^2 is fast and correct, but the bit-twiddling I suggested, which is what pow2 does, was just plain wrong!
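The reason the exponent trick fails: adding k to a double's exponent field multiplies the value by 2^k, which is exactly what pow2 (and Python's math.ldexp, used below as an illustration) does; squaring a number instead doubles the exponent and also squares the mantissa. A quick check:

```python
import math

x = 3.5
m, e = math.frexp(x)         # x == m * 2**e, with 0.5 <= m < 1
print(m, e)                  # 0.875 2
print(math.ldexp(x, 2))      # exponent + 2 gives x * 2**2 = 14.0, not x**4
print(x**4)                  # the actual fourth power: 150.0625
```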

I just came across this early MATLAB CSSM newsgroup thread on this: http://mathworks.com/matlabcentral/newsreader/view_thread/287247