[Proj] Experiment to speed up proj.4 by 2 or more

Thu Jul 2 09:06:58 EST 2015

Hello Even,

we have already tested every optimization (SSE2 etc.) that we could
switch our compiler to produce. The bad news is that even if the
performance got somewhat better in some respects, the overall system
performance got worse: for example, screen updating and interrupt
servicing got worse! So basically there is not much to be gained by
taking that road. We switched back to not optimizing some important
sections, since overloading the CPU made the OS work less efficiently
and fluently (at least on some processors).

The problem with very aggressive optimizations is that the results can
be very strange across different processors and operating systems. And
since most of them have only been tested with rather "lame" programs by
the manufacturers of the processors and operating systems, it is usually
best not to try too much! Or at least all combinations should be tested,
which might be a very large task!

If you would like to do some special C++ work on the proj.4 package
anyway, please add a syntax scanner in front of it, so that it makes
sure the user entered a projection definition that the library really
understood. That would reduce the number of errors users make with the
rather complex definitions the library needs. (Right now Proj.4 accepts
almost any definition and says nothing about the fact that part of it
may have been totally discarded and made no sense.)
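
To make the scanner idea concrete, here is a minimal sketch of such a
pre-check. Everything in it is an assumption for illustration: the
function name scanDefinition, the small whitelist of parameter names,
and the messages are hypothetical and not part of Proj.4. A real scanner
would take the accepted parameters from the library's own projection
tables rather than from a hard-coded list.

```cpp
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical whitelist (illustrative subset only); a real scanner
// would derive the accepted parameters from the library itself.
static const std::set<std::string> kKnownKeys = {
    "proj", "ellps", "datum", "lat_0", "lon_0", "k", "k_0",
    "x_0", "y_0", "zone", "south", "units", "towgs84", "no_defs"};

// Returns one message per problem found in a "+key=value ..." string.
// An empty result means every token was recognised.
std::vector<std::string> scanDefinition(const std::string& def) {
    std::vector<std::string> problems;
    std::istringstream in(def);
    std::string token;
    bool sawProj = false;
    while (in >> token) {
        if (token.size() < 2 || token[0] != '+') {
            problems.push_back("token does not start with '+': " + token);
            continue;
        }
        const std::string::size_type eq = token.find('=');
        // Key is the text between the leading '+' and the '=' (if any).
        const std::string key =
            (eq == std::string::npos) ? token.substr(1)
                                      : token.substr(1, eq - 1);
        if (kKnownKeys.count(key) == 0) {
            problems.push_back("unknown parameter: +" + key);
            continue;
        }
        if (eq != std::string::npos && eq + 1 == token.size()) {
            problems.push_back("missing value for +" + key);
        }
        if (key == "proj") {
            sawProj = true;
        }
    }
    if (!sawProj) {
        problems.push_back("no +proj= parameter given");
    }
    return problems;
}
```

With this sketch, a misspelled "+zonee=35" is reported to the user
instead of being dropped without a word.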

And name it something other than "Proj.4", since there is already
"libProj4". Maybe "Cpp.Proj.4", for example, so you are free to do
whatever you like! :D

I have not had time to check what the github people have done with the
package. Most likely nothing but taken all the glory and destroyed some
important sections? :) -- which is the usual approach.. haha :)

http://libproj4.maptools.org/

regards: Janne.

----------------------------------------------

Even Rouault wrote on 23.06.2015 18:29:
> Hi,
> 
> I've done an experiment using Intel SIMD intrinsics
> (https://en.wikipedia.org/wiki/SIMD), and I think they could be
> beneficial for proj when it is called to transform several coordinates
> at a time.
> 
> I've used the SSE2 instruction set (128-bit registers, so 2 doubles at
> a time), and I managed to speed up the inverse Transverse Mercator
> ellipsoidal transformation (i.e. from projected to geodetic) by a
> factor of ~2 (excluding potential datum transformations).
> 
> One key for performance was to find an efficient way of computing the
> usual transcendental functions (i.e. sin, cos, tan and their inverses,
> exp, ln, etc.) with SIMD registers, since they are not included in the
> instruction set. Otherwise you have to extract each component of the
> SIMD register, evaluate it with the x87 coprocessor, and reassemble the
> SIMD register from the computed components, which kills all the other
> performance gains. The SLEEF library
> (http://freecode.com/projects/sleef) has such routines, is in the
> public domain and works rather well (with gcc/clang; it has some rough
> edges with MSVC, but nothing that cannot be overcome).
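
For readers unfamiliar with the round trip described above, this is
roughly what the slow fallback looks like. It is only a sketch: the
helper name sinLaneByLane is hypothetical, and it calls the C++ standard
library's std::sin for the scalar step rather than any particular
coprocessor path. It is not code from the proof of concept.

```cpp
#include <cmath>
#include <emmintrin.h>  // SSE2 intrinsics

// Slow fallback: spill both lanes to memory, evaluate each one with the
// scalar std::sin, then reload the two results into a SIMD register.
inline __m128d sinLaneByLane(__m128d v) {
    double lanes[2];
    _mm_storeu_pd(lanes, v);         // register -> memory
    lanes[0] = std::sin(lanes[0]);   // scalar call, lane 0
    lanes[1] = std::sin(lanes[1]);   // scalar call, lane 1
    return _mm_loadu_pd(lanes);      // memory -> register
}
```

The two scalar calls plus the store and reload are the overhead that a
vectorised math routine avoids by working on both lanes at once.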
> 
> I've encapsulated the use of SSE2 intrinsics in a C++ class that
> overloads the arithmetic operators, so the resulting code looks much
> like the original C code. That is great for readability (although the
> original C code isn't always very readable ;-)) and for confidence that
> it doesn't introduce errors. Conditional branches are not so great for
> SIMD performance, but there are tricks to rewrite some of them with a
> ternary-like operator.
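
A minimal sketch of what such a wrapper might look like, assuming SSE2
and two doubles per register. The names Double2, ifThenElse, scaledAbs
and store are hypothetical and are not taken from the gists linked
below; the intrinsics themselves are the standard SSE2 ones from
<emmintrin.h>.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical wrapper around one SSE2 register holding two doubles.
struct Double2 {
    __m128d v;
    Double2(__m128d x) : v(x) {}
    explicit Double2(double x) : v(_mm_set1_pd(x)) {}         // broadcast
    Double2(double lo, double hi) : v(_mm_set_pd(hi, lo)) {}  // lane 0 = lo
};

inline Double2 operator+(Double2 a, Double2 b) { return _mm_add_pd(a.v, b.v); }
inline Double2 operator-(Double2 a, Double2 b) { return _mm_sub_pd(a.v, b.v); }
inline Double2 operator*(Double2 a, Double2 b) { return _mm_mul_pd(a.v, b.v); }
inline Double2 operator/(Double2 a, Double2 b) { return _mm_div_pd(a.v, b.v); }

// Comparison yields a per-lane mask: all bits set where a < b, else zero.
inline Double2 operator<(Double2 a, Double2 b) { return _mm_cmplt_pd(a.v, b.v); }

// Ternary-like select: per lane, mask ? a : b, with no branch.
inline Double2 ifThenElse(Double2 mask, Double2 a, Double2 b) {
    return _mm_or_pd(_mm_and_pd(mask.v, a.v), _mm_andnot_pd(mask.v, b.v));
}

// Scalar original would read: y = (x < 0.0) ? -x * k : x * k;
inline Double2 scaledAbs(Double2 x, Double2 k) {
    const Double2 zero(0.0);
    return ifThenElse(x < zero, (zero - x) * k, x * k);
}

// Copies both lanes to out[0] and out[1].
inline void store(double* out, Double2 a) { _mm_storeu_pd(out, a.v); }
```

Note that both sides of the select are always evaluated, so this trick
only suits branches whose two arms are cheap and free of side effects.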
> 
> SLEEF also supports the AVX & AVX2+FMA instruction sets (256-bit registers),
> which could also lead to a further ~ x2 gain over SSE2.
> 
> So I was wondering if there was:
> 
> 1) interest from the project in pursuing that approach (which involves
> introducing C++ in the code base as an implementation detail, the
> interface being unchanged). We could imagine having the same source
> files compiled several times with different register sizes, with
> runtime selection of the appropriate variant (note: SSE2 is guaranteed
> to be available on all x86_64 compatible processors; AVX/AVX2 is for
> more recent CPUs).
> 
> 2) ... and sponsors interested in making that happen.
> 
> Finally, the proof of concept:
> * regular code (runs in ~30s on Core i5 750 @ 2.67GHz):
>    https://gist.github.com/rouault/946104d0b98e8e8cc564
> * SSE2 code (~14s):
>    https://gist.github.com/rouault/3bbc31c9f12391d79920
> 
> Best regards,
> 
> Even