Using the Portland PGI Compiler for MATLAB mex files in Windows #1

June 14th, 2012 | Categories: Making MATLAB faster, matlab, programming, tutorials

I recently got access to a shiny new (new to me at least) set of compilers: the Portland PGI compiler suite, which comes with a great set of technologies to play with, including AVX vector support, CUDA for x86 and GPU pragma-based acceleration.  So naturally, it wasn’t long before I wondered if I could use the PGI suite to compile MATLAB mex files.  The bad news is that The Mathworks don’t support the PGI compilers out of the box, but that leads to the good news… I get to dig down and figure out how to add support for unsupported compilers.

In what follows I made use of MATLAB 2012a on 64bit Windows 7 with Version 12.5 of the PGI Portland Compiler Suite.

In order to set up a C mex-compiler in MATLAB you execute the following

mex -setup

which causes MATLAB to execute a Perl script at C:\Program Files\MATLAB\R2012a\bin\mexsetup.pm.  This script scans the directory C:\Program Files\MATLAB\R2012a\bin\win64\mexopts looking for Perl scripts with the extension .stp and running whatever it finds. Each .stp file looks for a particular compiler.  After all .stp files have been executed, a list of compilers found gets returned to the user. When the user chooses a compiler, the corresponding .bat file gets copied to the directory returned by MATLAB’s prefdir function. This sets up the compiler for use.  All of this is nicely documented in the mexsetup.pm file itself.

So, I’ve had my first crack at this and the results are the following two files: pgi.dat and pgi.stp.

These are crude, and there’s probably lots missing/wrong but they seem to work.  Copy them to C:\Program Files\MATLAB\R2012a\bin\win64\mexopts. The location of the compiler is hard-coded in pgi.stp so you’ll need to change the following line if your compiler location differs from mine

my $default_location = "C:\\Program Files\\PGI\\win64\\12.5\\bin";

Now, when you do mex -setup, you should get an entry PGI Workstation 12.5 64bit 12.5 in C:\Program Files\PGI\win64\12.5\bin which you can select as normal.

An example compilation and some details.

Let’s use the PGI compiler to compile the following very simple mex file, mex_sin.c, which does little more than take the elementwise sine of the input matrix.

#include <math.h>
#include "mex.h"

void mexFunction( int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[] )
{
    double *in,*out;
    int rows,cols;   /* dimensions of the input matrix */
    int outsize;     /* total number of elements */
    int i;           /* loop counter */

    /*Get pointers to input matrix*/
    in = mxGetPr(prhs[0]);
    /*Get rows and columns of input*/
    rows = mxGetM(prhs[0]);
    cols = mxGetN(prhs[0]);

    /* Create output matrix */
    outsize = rows*cols;
    plhs[0] = mxCreateDoubleMatrix(rows, cols, mxREAL);
    /* Assign pointer to the output */
    out = mxGetPr(plhs[0]);

    for(i=0;i<outsize;i++){
        out[i] = sin(in[i]);
    }

}

Compile using the -v switch to get verbose information about the compilation

mex mex_sin.c -v

You’ll see that the compiled mex file is actually a renamed .dll file that was compiled and linked with the following flags

pgcc -c -Bdynamic  -Minfo -fast
pgcc --Mmakedll=export_all  -L"C:\Program Files\MATLAB\R2012a\extern\lib\win64\microsoft" libmx.lib libmex.lib libmat.lib

The switch -Mmakedll=export_all is actually not supported by PGI, which makes this whole setup doubly unsupported! However, I couldn’t find a way to export the required symbols without modifying the mex source code so I lived with it.  Maybe I’ll figure out a better way in the future.  Let’s try the new function out:

>> a=[1 2 3];
>> mex_sin(a)
Invalid MEX-file 'C:\Work\mex_sin.mexw64': The specified module could not be found.

The reason for the error message is that a required PGI .dll file, pgc.dll, is not on my system path so I need to do the following in MATLAB.

setenv('PATH', [getenv('PATH') ';C:\Program Files\PGI\win64\12.5\bin\']);

This fixes things

>> mex_sin(a)
ans =
    0.8415    0.9093    0.1411
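
On the symbol-export question raised above: the conventional Windows alternative to an export-everything switch is to decorate each exported function with __declspec(dllexport). The following is only a minimal sketch of that pattern, not something tested with PGI or with a real mexFunction. The macro name DLL_EXPORT and the function scale_by_two are hypothetical, chosen purely for illustration.

```c
#include <stddef.h>

/* DLL_EXPORT is a hypothetical helper macro (an assumption, not part of
   mex.h or PGI): on Windows it expands to the dllexport decoration,
   elsewhere it expands to nothing so the file still builds. */
#if defined(_WIN32)
#define DLL_EXPORT __declspec(dllexport)
#else
#define DLL_EXPORT
#endif

/* Hypothetical exported routine standing in for a DLL entry point. */
DLL_EXPORT void scale_by_two(const double *in, double *out, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        out[i] = 2.0 * in[i];
    }
}
```

Whether the same decoration can be applied cleanly to mexFunction, given the prototype that mex.h already declares, would need checking against the compiler in use.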

Performance

I took a quick look at the performance of this mex function using my quad-core Sandy Bridge laptop. I highly doubted that I was going to beat MATLAB’s built-in sin function (which is highly optimised and multithreaded) with so little work and I was right:

>> a=rand(1,100000000);
>> tic;mex_sin(a);toc
Elapsed time is 1.320855 seconds.
>> tic;sin(a);toc
Elapsed time is 0.486369 seconds.

That’s not really a fair comparison though since I am purposely leaving multithreading out of the PGI mex equation for now.  It’s much fairer to compare the exact same mex file built with different compilers, so let’s do that.  I created three compiled mex routines from the source code above using the three compilers installed on my laptop and performed a very crude time test as follows

>> a=rand(1,100000000);
>> tic;mex_sin_pgi(a);toc              %PGI 12.5 run 1
Elapsed time is 1.317122 seconds.
>> tic;mex_sin_pgi(a);toc              %PGI 12.5 run 2
Elapsed time is 1.338271 seconds.

>> tic;mex_sin_vs(a);toc               %Visual Studio 2008 run 1
Elapsed time is 1.459463 seconds.
>> tic;mex_sin_vs(a);toc               %Visual Studio 2008 run 2
Elapsed time is 1.446947 seconds.

>> tic;mex_sin_intel(a);toc             %Intel Compiler 12.0 run 1
Elapsed time is 0.907018 seconds.
>> tic;mex_sin_intel(a);toc             %Intel Compiler 12.0 run 2
Elapsed time is 0.860218 seconds.

PGI did a little better than Visual Studio 2008 but was beaten by Intel. I’m hoping that I’ll be able to get more performance out of the PGI compiler as I learn more about the compilation flags.

Getting PGI to make use of SSE extensions

Part of the output of the mex mex_sin.c -v compilation command is the following notice

mexFunction:
     23, Loop not vectorized: data dependency

This notice is a result of the -Minfo compilation switch and indicates that the PGI compiler can’t determine if the in and out arrays overlap or not.  If they don’t overlap then it would be safe to unroll the loop and make use of SSE or AVX instructions to make better use of my Sandy Bridge processor.  This should hopefully speed things up a little.

As the programmer, I am sure that the two arrays don’t overlap so I need to give the compiler a hand.  One way to do this would be to modify the pgi.dat file to include the compilation switch -Msafeptr, which tells the compiler that arrays never overlap anywhere.  This might not be a good idea since it may not always be true, so I decided to be more cautious and make use of the restrict keyword.  That is, I changed the mex source code so that

double *in,*out;

becomes

double * restrict in,* restrict out;

Now when I compile using the PGI compiler, the notice from -Minfo becomes

mexFunction:
     23, Generated 3 alternate versions of the loop
         Generated vector sse code for the loop
         Generated a prefetch instruction for the loop

which demonstrates that the compiler is much happier! So, what did this do for performance?

>> tic;mex_sin_pgi(a);toc
Elapsed time is 1.450002 seconds.
>> tic;mex_sin_pgi(a);toc
Elapsed time is 1.460536 seconds.

This is slower than when SSE instructions weren’t being used which isn’t what I was expecting at all! If anyone has any insight into what’s going on here, I’d love to hear from you.

Future Work

I’m happy that I’ve got this compiler working in MATLAB but there is a lot to do including:

  • Tidy up the pgi.dat and pgi.stp files so that they look and act more professionally.
  • Figure out the best set of compiler switches to use. It is almost certain that what I’m using now is sub-optimal since I am new to the PGI compiler.
  • Get OpenMP support working.  I tried using the -Mconcur compilation flag, which auto-parallelised the loop, but it crashed MATLAB when I ran it. This needs investigating.
  • Get PGI accelerator support working so I can offload work to the GPU.
  • Figure out why the SSE version of this function is slower than the non-SSE version.
  • Figure out how to determine whether or not the compiler is emitting AVX instructions.  The documentation suggests that if the compiler is called on a Sandy Bridge machine, and if vectorisation is possible, then it will produce AVX instructions, but AVX is not mentioned in the output of -Minfo.  Nothing changes if you explicitly set the target to Sandy Bridge with the compiler switch -tp sandybridge-64.

Look out for more articles on this in the future.

My setup

  • Laptop model: Dell XPS L702X
  • CPU: Intel Core i7-2630QM @2GHz, software overclockable to 2.9GHz. 4 physical cores but 8 virtual cores in total due to Hyperthreading.
  • GPU: GeForce GT 555M with 144 CUDA cores.  Graphics clock: 590MHz.  Processor clock: 1180MHz. 3072MB DDR3 memory.
  • RAM: 8 Gb
  • OS: Windows 7 Home Premium 64 bit.
  • MATLAB: 2012a
  • PGI Compiler: 12.5
  1. ludolph
    June 15th, 2012 at 08:06

    Citation: “PGI did a little better than Visual Studio 2008 but was beaten by Intel. I’m hoping that I’ll be able to get more performance out of the PGI compiler as I learn more about the compilation flags.”

    I am afraid that the INTEL compiler is far better than PGI; see, for example, the well-known Fortran polyhedron benchmarks: http://www.polyhedron.com/compare0html

    PGI has recently focused mainly on GPGPU (CUDA, openCL, openACC, etc.) but its standard CPU performance optimizations are not as advanced as those in the INTEL compiler. My experience with PGI vs INTEL is very similar to polyhedron’s benchmarks.

    So finally, do not waste your time on PGI tuning. The only advantage of PGI is GPGPU support, but this feature is not mature yet.

  2. June 15th, 2012 at 14:12

    There is a forum post over at http://www.pgroup.com/userforum/viewtopic.php?t=3261 about this blog post. For completeness, I have copied a reply from Mat (or ‘mkcolg’) here. Quotes from the blog post above are in bold, Mat’s replies are in normal type

    Hi Mike,

    Thanks for putting this together. We very much appreciate your efforts. Hopefully, I can answer some of your questions.

    The switch -Mmakedll=export_all is actually not supported by PGI which makes this whole setup doubly unsupported! However, I couldn’t find a way to export the required symbols without modifying the mex source code so I lived with it. Maybe I’ll figure out a better way in the future

    The “export_all” option will work for many cases but not in all, hence it’s not officially supported. If it works, great, otherwise users need to decorate symbols using “dllexport”.

    Figure out the best set of compiler switches to use– it is almost certain that what I’m using now is sub-optimal since I am new to the PGI compiler.

    We recommend starting with “-fast” since in general it gives the best performance. It’s really an aggregate of many flags which are adjusted for the particular target architecture. The command “pgfortran -help -fast” will list the flags being used. You can try enabling and disabling each of the flags to see how they affect performance, but given the simplicity of the code, you will probably not see much difference with some of them. The exception is auto-vectorization (-Mvect), which I would expect to help (more on this later).

    FYI, I was going to suggest using -Msafeptr or the restrict keyword but you figured that one out already!

    Figure out why the SSE version of this function is slower than the non-SSE version

    I’d like you to try adding “-Mvect=simd:128”. By default we use 256 AVX on Sandy-bridge. However, our AVX vector SIN is written in 128 vector mode but used twice. I’m wondering if the slight amount of overhead is getting magnified and causing the slow-down.

    If that’s not it, try “-Mvect=noaltcode” to remove the altcode generation which also adds a bit of overhead. Since the large data set will always fall into the vector code, no need to have the extra overhead.

    Get OpenMP support working. I tried using the -Mconcur compilation flag which auto-parallelised the loop but it crashed MATLAB when I ran it. This needs investigating

    I’m not sure about this one. It’s possible there are some conflicts between the underlying threading. Does using an explicit OpenMP pragma exhibit the same behavior?

    Get PGI accelerator support working so I can offload work to the GPU.

    This will be interesting. Compute regions should work fine. The difficult part (performance wise) will be if you have small problem sizes or need to share data from call to call. Though, the new OpenACC “present” directive might be able to help here. You’ll be blazing trails!

    Figure out how to determine whether or not the compiler is emitting AVX instructions. The documentation suggests that if the compiler is called on a Sandy Bridge machine, and if vectorisation is possible then it will produce AVX instructions but AVX is not mentioned in the output of -Minfo. Nothing changes if you explicitly set the target to Sandy Bridge with the compiler switch -tp sandybridge-64.

    Sigh, this is because our engineers emit the same message for all vectorization. For a Sandy-bridge we are using AVX even though the message says sse. I blame myself for this one since I should have noticed it long ago. I guess sometimes it takes a new set of eyes. I’ll ask that this be corrected (TPR#18773).

    Note that I’m heading out on vacation for a few weeks, but I have let the application engineer who is covering the PGI UF for me know of your post. He’ll respond if you have any further questions or issues.

  3. July 12th, 2012 at 01:09

    Hi Mat

    I followed up a couple of your SSE/AVX suggestions.

    Starting with the data set a=rand(1,100000000); we have
    %Compile flags as in the blog post
    >> tic;mex_sin_pgi(a);toc
    Elapsed time is 1.467118 seconds.

    %Compile flags as in the blog post but with -Mvect=simd:128 added
    >> tic;mex_sin_pgi(a);toc
    Elapsed time is 1.375590 seconds.

    %Compile flags as in the blog post but with -Mvect=simd:128 -Mvect=noaltcode
    >> tic;mex_sin_pgi(a);toc
    Elapsed time is 1.335042 seconds.

    Cheers,
    Mike

  4. July 26th, 2012 at 17:44

    I did a similar (though much hackier) thing to get a version of the intel c compiler up and running for mex in linux. If you modify the mexopts.sh which lives in ~/.matlab/RXXXx/ then you can change the compiler & compilation flags to whatever you want by editing the CC and CXX environment variables exported by the script. You need to restart matlab before it takes effect, and it will be overwritten the next time you run mex -setup. Still, compiling all my mex functions with -O3 and -xHost made things run faster than gcc could.

  5. July 27th, 2012 at 17:45

    Hi Adam

    I was about to type ‘But Intel is a supported mex compiler and so no hackage should be necessary’ and then point you to a link. However, said link doesn’t mention Intel at all!

    http://www.mathworks.co.uk/support/compilers/R2012a/glnxa64.html

    I’m spending too much time in Windows I think (Intel IS supported there).

    Like you, I’ve found the Intel compiler to be faster than gcc on average– especially on Sandy Bridge Intel chips.

    If you’re interested in very low-level hacking around with mex files on modern Intel kit, maybe take a look at Intel’s ispc compiler:
    http://www.walkingrandomly.com/?p=3988

    Cheers,
    Mike

  6. Dmitry Ivan
    July 30th, 2012 at 05:08

    Hi, can you please tell me whether the Portland PGI compiler can create a .dll shared library from m files? I am looking for another compiler to create a .dll shared library, which will be used to create C/C++ applications.
    Thanks!

  7. July 30th, 2012 at 07:20

    Hi

    You need the MATLAB compiler for that.
    http://www.mathworks.co.uk/products/compiler/

    Cheers,
    Mike

  8. Dmitry Ivan
    July 30th, 2012 at 16:09

    Thanks for your answer, but I want to ask you: have you tried to create a .dll shared library from m files using PGI Workstation?

  9. Dmitry Ivan
    July 30th, 2012 at 19:01

    Hi, Mike
    Have you ever tried to create a .dll shared library from m files using the PGI compiler?

  10. July 31st, 2012 at 11:37

    No

  11. Dmitry
    September 30th, 2012 at 03:03

    I tried compiling with PGI Fortran and using the result in Matlab

    r.f90:
    #include "fintrf.h"

    subroutine mexFunction(nlhs, plhs, nrhs, prhs)
    mwpointer plhs(*), prhs(*)
    integer nlhs, nrhs
    print *,'Hello'
    return
    end subroutine
    end

    compilation:
    set LIB=C:\PROGRA~1\MATLAB\R2008b\extern\lib\win64\microsoft
    set INC=C:\PROGRA~1\MATLAB\R2008b\extern\include
    pgfortran -o r.mexw64 -I%INC% r.F90 -L%LIB% -lmex -lmx

    out:
    r.mexw64

    try execute in Matlab
    >>r
    Mex file entry point is missing. Please check the (case-sensitive)
    spelling of mexFunction (for C MEX-files), or the (case-insensitive)
    spelling of MEXFUNCTION (for FORTRAN MEX-files).
    ??? Invalid MEX-file ‘D:\PGI_TEST\r.mexw64’: .

    Please help!!!!

  12. Hugo
    March 27th, 2014 at 10:07

    Very nice work.
    I used your pgi.stp and pgi.bat files to make similar files for the Fortran NAG compiler.
    Your files gave the right clues on how to set this up for any compiler.

    Thanks.