[Proj] Experiment to speed up proj.4 by 2 or more

Tue Jun 23 10:29:49 EST 2015

Hi,

I've experimented with Intel SIMD intrinsics
(https://en.wikipedia.org/wiki/SIMD), and I think they could benefit proj
when it is called to transform several coordinates at a time.

I've used the SSE2 instruction set (128-bit registers, so 2 doubles at a
time), and I managed to speed up the inverse Transverse Mercator ellipsoidal
transformation (i.e. from projected to geodetic) by a factor of ~2 (excluding
potential datum transformations).

One key to performance was finding an efficient way to compute the usual
transcendental functions (i.e. sin, cos, tan and their inverses, exp, ln,
etc.) on SIMD registers, since they are not included in the instruction
set. Otherwise you have to extract each component of the SIMD register,
evaluate it with the x87 coprocessor, and reassemble the SIMD register from
the computed components, which kills all the other performance gains. The
SLEEF library (http://freecode.com/projects/sleef) has such routines, is in
the public domain and works rather well with gcc/clang. It has some rough
edges with MSVC, but nothing that cannot be overcome.
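
To make the cost of that fallback concrete, here is a minimal sketch of the
"extract, evaluate, reassemble" path for a sine. It is not taken from the
proof of concept: the helper name sin_per_lane is hypothetical, and the code
assumes an x86/x86_64 compiler that provides <emmintrin.h>. A vectorized
routine such as the ones SLEEF provides would replace the whole function body
with a single call that stays in SIMD registers.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (x86 / x86_64 only)
#include <cmath>

// Hypothetical helper (illustration only): sine of both lanes of an SSE2
// register, computed by leaving the SIMD domain for each lane.
inline __m128d sin_per_lane(__m128d v)
{
    double lanes[2];
    _mm_storeu_pd(lanes, v);        // 1. spill the two doubles to memory
    lanes[0] = std::sin(lanes[0]);  // 2. one scalar libm call per lane
    lanes[1] = std::sin(lanes[1]);
    return _mm_loadu_pd(lanes);     // 3. rebuild the SIMD register
}
```

The store, the two scalar calls and the reload all sit on the critical path,
which is why this pattern tends to cancel out the gains made elsewhere.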

I've encapsulated the use of SSE2 intrinsics in a C++ class that overloads
the arithmetic operators, so the resulting code looks very similar to
the original C code, which is great for readability (although the original C
code isn't always very readable ;-)) and for confidence that it doesn't
introduce errors. Conditional branches are not so great for SIMD performance,
but there are tricks to rewrite some of them with a ternary-like operator.
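
The wrapper idea can be sketched roughly as follows. This is an illustration
under assumptions, not the class from the proof of concept: the names Vec2d,
select and piecewise are hypothetical, and only a handful of operators are
shown. The comparison returns a per-lane bit mask, and select() plays the
role of the ternary-like operator.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (x86 / x86_64 only)

// Hypothetical wrapper around one SSE2 register holding 2 doubles.
struct Vec2d {
    __m128d v;
    Vec2d(__m128d x) : v(x) {}
    Vec2d(double x) : v(_mm_set1_pd(x)) {}  // broadcast a scalar constant
    static Vec2d load(const double* p) { return Vec2d(_mm_loadu_pd(p)); }
    void store(double* p) const { _mm_storeu_pd(p, v); }
};

inline Vec2d operator+(Vec2d a, Vec2d b) { return _mm_add_pd(a.v, b.v); }
inline Vec2d operator-(Vec2d a, Vec2d b) { return _mm_sub_pd(a.v, b.v); }
inline Vec2d operator*(Vec2d a, Vec2d b) { return _mm_mul_pd(a.v, b.v); }
inline Vec2d operator/(Vec2d a, Vec2d b) { return _mm_div_pd(a.v, b.v); }

// Per-lane comparison: each lane becomes all ones (true) or all zeros.
inline Vec2d operator<(Vec2d a, Vec2d b) { return _mm_cmplt_pd(a.v, b.v); }

// Ternary-like operator: per lane, mask ? a : b, without branching.
inline Vec2d select(Vec2d mask, Vec2d a, Vec2d b)
{
    return _mm_or_pd(_mm_and_pd(mask.v, a.v), _mm_andnot_pd(mask.v, b.v));
}

// Scalar original:  y = (x < 1.0) ? x * x : 2.0 * x - 1.0;
inline Vec2d piecewise(Vec2d x)
{
    return select(x < 1.0, x * x, 2.0 * x - 1.0);
}
```

Note that both arms of select() are always evaluated for both lanes, so this
rewrite only pays off when the arms are cheap and free of side effects.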

SLEEF also supports the AVX and AVX2+FMA instruction sets (256-bit registers),
which could lead to a further ~2x gain over SSE2.

So I was wondering whether there is:

1) interest from the project in pursuing this approach (which involves
introducing C++ in the code base as an implementation detail, the interface
being unchanged). We could imagine having the same source files compiled
several times with different register sizes, with runtime selection of the
appropriate variant (note: SSE2 is guaranteed to be available on all x86_64
compatible processors; AVX/AVX2 requires more recent CPUs).

2) ... and sponsors interested in making that happen.
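
The runtime selection mentioned in point 1 could look roughly like the sketch
below. Everything here is an assumption made for illustration: the names
KernelFn, kernel_sse2, kernel_avx2, pick_kernel and transform_batch are
hypothetical, the two kernels are trivial placeholders standing in for the
same source compiled at two register widths, and CPU detection relies on the
gcc/clang builtin __builtin_cpu_supports (MSVC would need a different
mechanism).

```cpp
#include <cstddef>

// Hypothetical signature for a batched kernel working on n doubles.
using KernelFn = void (*)(const double* in, double* out, std::size_t n);

// Placeholder bodies: in a real build these would be the same source file
// compiled twice, once for 128-bit and once for 256-bit registers.
void kernel_sse2(const double* in, double* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) out[i] = in[i] * 2.0;
}

void kernel_avx2(const double* in, double* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) out[i] = in[i] * 2.0;
}

// Choose the widest variant the running CPU supports.
KernelFn pick_kernel()
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  // gcc/clang builtin, initializes CPU detection
    if (__builtin_cpu_supports("avx2")) return kernel_avx2;
#endif
    return kernel_sse2;    // SSE2 baseline, always present on x86_64
}

// Public entry point: the variant is resolved once, on first call.
void transform_batch(const double* in, double* out, std::size_t n)
{
    static const KernelFn fn = pick_kernel();
    fn(in, out, n);
}
```

Callers only ever see transform_batch(), so the interface stays unchanged
whichever variant ends up being selected.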

Finally, the proof of concept: 
* regular code (runs in ~30s on a Core i5 750 @ 2.67GHz):
   https://gist.github.com/rouault/946104d0b98e8e8cc564
* SSE2 code (~14s):
   https://gist.github.com/rouault/3bbc31c9f12391d79920

Best regards,

Even

-- 
Spatialys - Geospatial professional services
http://www.spatialys.com