MATLAB: Repeated multiplication is faster than integer powers

February 26th, 2014 | Categories: Making MATLAB faster, matlab, programming | Tags:

Consider the following code

function testpow() 
%function to compare integer powers with repeated multiplication

rng default
a=rand(1,10000000)*10;

disp('speed of ^4 using pow')
tic;test4p = a.^4;toc
disp('speed of ^4 using multiply')
tic;test4m = a.*a.*a.*a;toc

disp('maximum difference in results');
max_diff = max(abs(test4p-test4m))

end

On running it on my late-2013 MacBook Air (Haswell Intel Core i5) with MATLAB 2013b, I get the following results:

speed of ^4 using pow
Elapsed time is 0.527485 seconds.
speed of ^4 using multiply
Elapsed time is 0.025474 seconds.
maximum difference in results
max_diff =
   1.8190e-12

In this case (derived from a real-world case), using repeated multiplication is around twenty times faster than using integer powers in MATLAB. This leads to some questions:

  • Why the huge speed difference?
  • Would a similar speed difference be seen in other systems such as R, Python or Julia?
  • Would we see the same speed difference on other operating systems/CPUs?
  • Are there any numerical reasons why using repeated multiplication instead of integer powers is a bad idea?
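
As a starting point for the cross-language question, here is a minimal NumPy sketch of the same benchmark (the array size, seed and timing loop here are my choices, not taken from the MATLAB code above):

```python
import timeit
import numpy as np

# Mirror the MATLAB setup: uniform values in [0, 10)
rng = np.random.default_rng(0)
a = rng.random(1_000_000) * 10

# Time each route over several repetitions to smooth out noise
t_pow = timeit.timeit(lambda: a**4, number=10)
t_mul = timeit.timeit(lambda: a * a * a * a, number=10)

# Compare the two results, as the MATLAB script does
max_diff = np.max(np.abs(a**4 - a * a * a * a))
print(f"pow:      {t_pow:.4f} s")
print(f"multiply: {t_mul:.4f} s")
print(f"maximum difference: {max_diff:.3e}")
```

The absolute timings will of course vary by machine and NumPy version; the interesting number is the ratio between the two.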
  1. February 26th, 2014 at 20:50

    Using Octave 3.6.2 on Ubuntu 12.04 with an AMD Athlon II X4 640 processor, I get the opposite, though less impressive, difference in execution times:

    speed of ^4 using pow
    Elapsed time is 0.1246 seconds.
    speed of ^4 using multiply
    Elapsed time is 0.1968 seconds.
    maximum difference in results
    max_diff = 1.8190e-12

  2. February 26th, 2014 at 21:25

    Hi. In Julia:

    function testpow()
    #function to compare integer powers with repeated multiplication

    a=rand(1,10000000)*10;

    println("speed of ^4 using pow")
    tic(); test4p = a.^4; toc()
    println("speed of ^4 using multiply")
    tic(); test4m = a.*a.*a.*a; toc()

    println("maximum difference in results");
    max_diff = maximum(abs(test4p-test4m))
    end

    testpow()

    speed of ^4 using pow
    elapsed time: 1.074429726 seconds
    speed of ^4 using multiply
    elapsed time: 0.40391358 seconds
    maximum difference in results
    1.8189894035458565e-12

  3. February 26th, 2014 at 21:30

    I guess it may be due to some for loop overhead in the pow version.
    I believe some languages/compilers implement loop unrolling to optimize small exponent cases.

  4. February 26th, 2014 at 22:07

    For Mathematica it is the opposite:
    test = RandomReal[{0, 10}, 10^8];
    AbsoluteTiming[test^4;]
    AbsoluteTiming[test test test test;]

    {1.457870, Null}
    {1.832707, Null}

    ‘stupid’ exponentiation is slower.

  5. February 26th, 2014 at 22:09

    Oh, and it probably has to do with the fact that MATLAB sees ‘4’ not as an integer but as a float (double), and so uses a very general algorithm that holds for any number. But 4 is special; it is an integer…

  6. February 26th, 2014 at 22:11

    @Sander Huisman
    Indeed (at least for Mathematica):

    test = RandomReal[{0, 10}, 10^8];
    AbsoluteTiming[test^4.0;]
    AbsoluteTiming[test test test test;]
    (* notice there is 4.0 not 4 *)

    {2.890319, Null}
    {1.811797, Null}

  7. Shankar
    February 26th, 2014 at 22:20

    My guess is that exponentiation is slower than repeated multiplication because it has to be general enough to handle non-integer powers. (x.^y) is probably implemented as exp(y.*log(x)). Surprisingly, both that and repeated squaring are faster than the built-in exponentiation, but still slower than repeated multiplies.

    I think the speed difference for other languages would very much depend on whether they use LAPACK or some other library for matrix algebra. I believe that, in general, repeated multiplies are better numerically than general exponentiation.

    Here are my commands and output (run on a 2009 MacBook Pro with a 2.8 GHz Intel Core 2 Duo):

    >> disp('speed of ^4 using pow')
    tic;test4p = a.^4;toc
    disp('speed of ^4 using multiply')
    tic;test4m = a.*a.*a.*a;toc
    disp('speed of ^4 using squaring')
    tic;temp = a.*a; test4s=temp.*temp; toc
    disp('speed of ^4 using exp')
    tic;test4e=exp(4.*log(a)); toc

    speed of ^4 using pow
    Elapsed time is 0.801814 seconds.
    speed of ^4 using multiply
    Elapsed time is 0.067373 seconds.
    speed of ^4 using squaring
    Elapsed time is 0.113646 seconds.
    speed of ^4 using exp
    Elapsed time is 0.337066 seconds.
    maximum difference in results
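
    The exp(y.*log(x)) hypothesis is easy to probe numerically. The NumPy sketch below only shows that the two routes agree to rounding error for positive inputs; it says nothing about what MATLAB's pow actually does internally:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(100_000) * 10 + 1e-6  # strictly positive, so log(a) is defined

via_pow = a**4
via_exp_log = np.exp(4.0 * np.log(a))  # the "general pow" route: x**y = exp(y*log(x))

# The two routes agree only to rounding error, not bit-for-bit
rel_err = np.max(np.abs(via_pow - via_exp_log) / np.abs(via_pow))
print(f"max relative error: {rel_err:.3e}")
```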

  8. February 26th, 2014 at 23:44

    In Euler Math Toolbox, there is the same effect. I also compared with a binary multiplication, and with C.

    [code]
    >a=random(10000000)*10;
    >tic; p=a^4; toc;
    Used 0.607 seconds
    >tic; p=a*a*a*a; toc;
    Used 0.289 seconds
    >tic; p=a*a; p=p*p; toc;
    Used 0.203 seconds
    >function tinyc test (x) …
    $ARG_DOUBLE_MATRIX(x,r,c);
    $RES_DOUBLE_MATRIX(y,r,c);
    $int i,j;
    $for (i=0; i<r; i++)
    $  for (j=0; j<c; j++)
    $    y[i*c+j]=x[i*c+j]*x[i*c+j]*x[i*c+j]*x[i*c+j];
    $endfunction
    >tic; test(a); toc;
    Used 0.112 seconds
    [/code]

    It seems that the power function (using a logarithm, a multiplication and an exponential) is not the best idea. The advantage vanishes from a^8 onwards on my system.

  9. brian t
    February 27th, 2014 at 00:18

    For comparison, I tried the same code in Octave, except for the initial “rng” command, which Octave doesn’t support:

    speed of ^4 using pow
    Elapsed time is 0.106022 seconds.
    speed of ^4 using multiply
    Elapsed time is 0.150319 seconds.
    maximum difference in results
    max_diff = 1.8190e-12

    (Acer 5742 laptop from early 2011 with Intel Core i5-450M “Arrandale”, running Debian 64-bit “testing” and Octave 3.6.4)

  10. February 27th, 2014 at 00:59

    On my computer (2011-era Dell XPS 15 laptop with i7-2670QM and Ubuntu 13.10) I get:

    – MATLAB R2013a:

    speed of ^4 using pow
    Elapsed time is 0.372011 seconds.
    speed of ^4 using multiply
    Elapsed time is 0.019862 seconds.
    maximum difference in results

    max_diff =

    1.8190e-12

    – Octave 3.6.4:

    speed of ^4 using pow
    Elapsed time is 0.074532 seconds.
    speed of ^4 using multiply
    Elapsed time is 0.10198 seconds.
    maximum difference in results
    max_diff = 1.8190e-12

    In MATLAB multiply is faster but in Octave pow is faster.

    In Python 2.7.5:

    In [10]: a = np.random.randn(1,10000000)*10

    In [11]: %timeit a**4
    1 loops, best of 3: 948 ms per loop

    In [12]: %timeit a*a*a*a
    10 loops, best of 3: 113 ms per loop

  11. Tony McDaniel
    February 27th, 2014 at 01:08

    I ran your code in Octave 3.8.0 on MacOS X 10.8.5 on a Core i5@3.4 GHz and also a version in Python 2.7.6 from Anaconda with Numpy 1.8.0 and got the following:

    OCTAVE :

    speed of ^2 using pow
    Elapsed time is 0.0249081 seconds.
    speed of ^2 using multiply
    Elapsed time is 0.045887 seconds.

    speed of ^3 using pow
    Elapsed time is 0.047009 seconds.
    speed of ^3 using multiply
    Elapsed time is 0.11313 seconds.

    speed of ^4 using pow
    Elapsed time is 0.077533 seconds.
    speed of ^4 using multiply
    Elapsed time is 0.159042 seconds.

    PYTHON :

    speed of ^2 using pow
    Elapsed time : 0.0255 seconds
    speed of ^2 using multiply
    Elapsed time : 0.0428 seconds
    pow time / mult time = 0.5967

    speed of ^3 using pow
    Elapsed time : 0.1714 seconds
    speed of ^3 using multiply
    Elapsed time : 0.1091 seconds
    pow time / mult time = 1.5703

    speed of ^4 using pow
    Elapsed time : 0.1481 seconds
    speed of ^4 using multiply
    Elapsed time : 0.1508 seconds
    pow time / mult time = 0.9818

    Interestingly, Octave is consistently faster using powers than multiplication for n = 2,3 and 4. Python’s power operator is faster for n = 2, slower for n = 3 and essentially the same for n = 4. I would guess that the Numpy developers have worked to optimize the n = 2 case since squaring numbers is such a common calculation.

  12. Nima
    February 27th, 2014 at 09:39

    On my old laptop, a Dell Latitude E6400 with a Core 2 Duo P9800 (2.8 GHz) and Win7 x64, running MATLAB 2013b:

    speed of ^4 using pow
    Elapsed time is 0.972871 seconds.
    speed of ^4 using multiply
    Elapsed time is 0.063420 seconds.
    maximum difference in results

    max_diff =

    1.8190e-12

  13. February 27th, 2014 at 09:48

    As far as I can tell, x.^y is also slower than exp(y.*log(x)), which might not be too surprising considering that the latter will not work for negative or zero x. For integer powers that you calculate often, rather than using repeated multiplication it's convenient to implement exponentiation by squaring. Interestingly, none of this is true for Octave, in which .^ seems to always be faster than any other hack you can try. I wrote a blog post about this a while ago: http://www.robertomontagna.it/2012/12/more-about-exponents.html
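
    For readers unfamiliar with it, exponentiation by squaring computes x^n in O(log n) multiplications by scanning the bits of n. A minimal Python sketch for non-negative integer exponents:

```python
def power_by_squaring(x, n):
    """Compute x**n for integer n >= 0 using O(log n) multiplications."""
    result = 1.0
    base = x
    while n > 0:
        if n & 1:            # low bit of n set: fold in the current power of x
            result *= base
        base *= base         # square: base now holds x to the next power of two
        n >>= 1
    return result

print(power_by_squaring(3.0, 4))   # 81.0
print(power_by_squaring(2.0, 10))  # 1024.0
```

    For a fixed small exponent like 4, a vectorised version of the same idea is just two squarings of the whole array, which is what the repeated-squaring timings above measure.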

  14. Nima
    February 27th, 2014 at 10:38

    This is my Maple 17 code on the same system (I'm not sure if this is the best way):

    
    restart; 
    with(RandomTools);
    A := Vector(100000, Generate(float, makeproc = true)):
    
    Tic := time[real]():
    Pow := `~`[`^`](A, 4):
    PowTime := time[real]()-Tic: 
    printf("%s %f", "The ^4 using power: ", PowTime); 
    
    Tic := time[real]():
    Mul := `~`[`*`](`~`[`*`](`~`[`*`](A, A), A), A):
    MultTime := time[real]()-Tic:
    printf("\n %s %f", "The ^4 using multiply: ", MultTime);
    
    

    The ^4 using power: 0.375000
    The ^4 using multiply: 0.733000

    This is for 1e5; for 1e6 it takes a much longer time to create the vector, however:
    The ^4 using power: 5.3030
    The ^4 using multiply: 9.3600

  15. February 27th, 2014 at 12:00

    And now something to show the speed/power of _pure_ Julia code:

    By defining a new .^ operator for element-wise integer powers of matrices,
    .^{T<:Number}(A::Matrix{T}, x::Integer)
    using _pure_ Julia "for" loops, as presented in https://gist.github.com/cdsousa/9247979,
    the times go from

    speed of ^4 using pow
    elapsed time: 1.074429726 seconds
    speed of ^4 using multiply
    elapsed time: 0.40391358 seconds
    maximum difference in results
    1.8189894035458565e-12

    to

    speed of ^4 using pow
    elapsed time: 0.100119984 seconds <— 10x faster with new .^
    speed of ^4 using multiply
    elapsed time: 0.40944625 seconds
    maximum difference in results
    0.0

    Do this in pure Matlab/Python ;)

    BTW, this was run on an Intel(R) Core(TM) i3 M380 @ 2.53GHz.

  16. Mike Croucher
    February 27th, 2014 at 12:28

    That really is very nice, thank you! It would be great to write that code up in a tutorial fashion for Julia-newbies, explaining what’s going on and why it works.

  17. February 27th, 2014 at 13:54

    @Mike Croucher

    Thanks. I myself am a Julia newbie, and I’m pretty sure my code could be improved a lot (at least for the style).
    I did a post from my comments, following yours.
    Cheers.

  18. Aljen
    February 27th, 2014 at 15:00

    In this newsreader post, replies 7 and 9 have some background on why MATLAB's power function is slower than multiplication:

    http://www.mathworks.com/matlabcentral/newsreader/view_thread/83759

  19. Mike Croucher
    February 27th, 2014 at 15:06
  20. Luis Randez
    February 27th, 2014 at 23:05

    In ipython (anaconda 1.9.1)

    In [1]: a = random.randn(1,10000000)*10

    In [2]: %timeit a**4
    1 loops, best of 3: 1.09 s per loop

    In [3]: %timeit a*a*a*a
    10 loops, best of 3: 155 ms per loop

    In [4]: %timeit (a*a)**2
    10 loops, best of 3: 93.3 ms per loop

  21. Luis Randez
    February 27th, 2014 at 23:16

    In julia Version 0.2.0-prerelease+4027

    julia> a=rand(1,1,10000000)*10;

    julia> timing = @time a.*a.*a.*a;
    elapsed time: 0.404306974 seconds (240002880 bytes allocated)

    julia> timing = @time a.^4;
    elapsed time: 0.174195625 seconds (80000216 bytes allocated)

    julia> timing = @time (a.*a).^2;
    elapsed time: 0.233738999 seconds (160001160 bytes allocated)

  22. Luis Randez
    February 27th, 2014 at 23:23

    In julia Version 0.2.0-prerelease+4027

    julia> a=rand(1,10000000)*10;

    julia> tic(); test4p = a.^4; toc()
    elapsed time: 0.241979896 seconds

    julia> tic(); test4m = a.*a.*a.*a; toc()
    elapsed time: 0.733428131 seconds

    julia> tic(); test4m = (a.*a).^2; toc()
    elapsed time: 0.225762048 seconds

  23. February 28th, 2014 at 10:46

    As noted by Tony above, squaring is a special case that is often optimized. The exponent in the floating point representation just needs incrementing by one.

    The fastest way to do a fourth power in pure Matlab seems to be (a.^2).^2.

    A MEX file could be written to bit-twiddle the floating-point numbers, increasing the exponent by 2. I'm guessing that would be a smidge faster, but it would have to be a real bottleneck to be worth trying.
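
    A quick NumPy check of the two-squarings route against plain repeated multiplication (a sketch; whether the squared-twice form is also the fastest route will depend on the language and version):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(1_000_000) * 10

squared_twice = (a**2)**2   # two squarings of the whole array
repeated = a * a * a * a    # three multiplies per element

# Both routes are pure multiplications, so they differ by at most a few ulps
max_diff = np.max(np.abs(squared_twice - repeated))
print(f"maximum difference: {max_diff:.3e}")
```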

  24. February 28th, 2014 at 12:04

    Whoops: (a.^2).^2 is fast and correct, but the bit-twiddling I suggested (which is what pow2 does) was just plain wrong!

  25. March 24th, 2014 at 20:41

    I just came across this early MATLAB CSSM newsgroup thread on this: http://mathworks.com/matlabcentral/newsreader/view_thread/287247