Question about the use of C99/gcc built-in math intrinsics within GEGL on gfloats
This discussion is connected to the gegl-developer-list.gnome.org mailing list which is provided by the GIMP developers and not related to gimpusers.com.
--------
CONTEXT:
--------
I have completely changed the gegl/buffer/gegl-sampler-yafr.c code.
Before I put together the patch to the new yafr, I would like to see if I could make the code even faster by using C99/gcc built-in math intrinsics. I have not tried this yet.
The method used by the updated code differs from the first-generation yafr: it is at once softer and more pervasive. Yes, this is vague. What I mean is that the nonlinear correction is "on" throughout more of the image, but that its effect is never as extreme.
The code also runs even faster than before: on my current vintage laptop, yafr scales up about 10% slower than gegl-sampler-linear, and about 10% faster than gegl-sampler-cubic.
Regarding further speed-up (using arithmetic branching): With fabs, fmin and copysign I could make my code branch-free (assuming of course that these operations are translated to assembly built-ins by the compiler on the machine on which the code is compiled). That is: the yafr code, which already contains no "if", "for", "do" or "while", would then contain no "?:" either. I suspect that using arithmetic branching could make my code run noticeably faster.
---------
QUESTION:
---------
I noticed that fabs, fmin and copysign, or similar C99/gcc built-ins, are not found anywhere in the gegl source.
Is there a preferred/tolerated way of using such math functions in gegl?
Can I assume that gfloats are floats?
Can I assume that gdoubles are doubles?
Must I program with the possibility that gfloats be doubles?
Must I program with the possibility that gdoubles be floats?
Could gfloats or gdoubles be anything else than floats or doubles?
Some ideas:
Idea 0:
It may be that I can use the type-generic fabs, fmin and copysign on gfloats without a speed hit. Hopefully, gcc can use the correct one based on the fact that it acts on gfloats. If not, it may be that using the double versions on gfloats is still faster than the alternatives.
Idea 1:
If I KNEW for a fact that gfloat = float, I could simply use fabsf, fminf, and copysignf.
Idea 2:
I could do the necessary parts of the computation with doubles (or gdoubles) and then use the double versions. Hopefully, this will not slow down gegl when run on hardware which is faster on floats than doubles (like some GPUs).
Is there any objection to me using straight doubles inside the code (as long as they are not used to communicate with the rest of gegl)?
Idea 3:
Is there a smarter way, which picks the right one?
Idea 4:
Change compilation flags to include C99 built-ins?
Idea 5:
You have another idea?
Idea 6:
Or should I just stick to C90 gcc built-ins?
Nicolas Robidoux Laurentian University/Universite Laurentienne
Question about the use of C99/gcc built-in math intrinsics within GEGL on gfloats
Hi,
On Fri, 2008-09-12 at 12:50 -0400, Nicolas Robidoux wrote:
> --------
> CONTEXT:
> --------
> I have completely changed the gegl/buffer/gegl-sampler-yafr.c code.
> Before I put together the patch to the new yafr, I would like to see if I could make the code even faster by using C99/gcc built-in math intrinsics. I have not tried this yet.
It would be interesting to know if this makes a difference and how much of a difference it makes. In general it seems like a dependency on GCC is a high price to pay and I am not sure if we are willing to do that. Depending on C99 is something that could certainly be considered.
> I noticed that fabs, fmin and copysign, or similar C99/gcc built-ins, are not found anywhere in the gegl source.
> Is there a preferred/tolerated way of using such math functions in gegl?
As far as I can see there is no decision on this yet.
> Can I assume that gfloats are floats?
Yes.
> Can I assume that gdoubles are doubles?
Yes.
> Must I program with the possibility that gfloats be doubles?
No, gfloat is just a typedef to float. Always.
> Must I program with the possibility that gdoubles be floats?
No, gdouble is just a typedef to double. Always.
Sven
Question about the use of C99/gcc built-in math intrinsics within GEGL on gfloats
Hello Sven:
Thanks for your answer. Simplifies my life a lot.
------
There is another C99/gcc built-in with the potential to speed up code a lot: the restrict keyword.
See:
http://www.cellperformance.com/mike_acton/2006/05/demystifying_the_restrict_keyw.html
I'll build two versions of the gegl-sampler-yafr code (one of which I'll masquerade as gegl-sampler-cubic so I can run both without recompiling) and run careful benchmarks this weekend. One version will stay away from restrict and C99 math intrinsics, the other will not (on the first pass, I may not go as far as making explicit calls to fma, although my code is structured in the hope that the compiler recognizes fused multiply-adds when appropriate).
I don't quite understand the issues of writing C++ code using C99 features (this is why knowing that they are gcc built-ins is useful, provided one knows that gcc will be the compiler).
Maybe I'll take inspiration from
http://www.ddj.com/cpp/184401653
Nicolas Robidoux Laurentian University/Universite Laurentienne
Question about the use of C99/gcc built-in math intrinsics within GEGL on gfloats
I just completed a quick and dirty benchmark comparing arithmetic branching using C99/gcc intrinsics within the yafr sampler code against the standard C if-then-else.
These tests were performed on a Thinkpad t60p with Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz with 2025MiB memory running 2.6.24-19-generic #1 SMP by way of a pretty standard Ubuntu 8.04.
Warning: There seems to be something wrong with math.h with the current version of gcc, as suggested by some recent bug postings. For example, according to the gcc documentation, I should not have to prefix fminf with __builtin_. Consequently, it could be that the benchmark results will soon be made irrelevant.
Second warning: If my memory is good, Intel chips have a good and fast implementation of the "? :" branching construct (having to do with selecting which register to copy into another), as well as good branch prediction. My code without intrinsics is structured to take advantage of this.
Third warning: I have not optimized looking at the assembler output of gcc, and have done no optimization of the "arithmetic branching" version of the code. In particular, I have not used fmaf, even though my code is peppered with opportunity to use it (this may not be a big deal: apparently, gcc attempts to spot opportunities to use fused multiply-add).
------------------------------
quick description of the test:
------------------------------
I ran a bunch of consecutive scalings (times 20) of a digital photograph with initial dimensions 200x133, driving the gegl scale through an xml file analogous to the ones in gegl/docs/gallery, alternating between the "with branching" and "arithmetic branching with intrinsics" versions, and throwing in four scalings with the gegl stock linear.
-------------------------------------------------
Differences between the two versions of the code:
-------------------------------------------------
16 code segments resembling the following (note the ?: operators; this is the version with branching):
const gfloat prem_squared = prem * prem;
const gfloat deux_squared = deux * deux;
const gfloat troi_squared = troi * troi;
const gfloat prem_times_deux = prem * deux;
const gfloat deux_times_troi = deux * troi;
const gfloat deux_squared_minus_prem_squared = deux_squared - prem_squared;
const gfloat troi_squared_minus_deux_squared = troi_squared - deux_squared;
/* Pick whichever operand is smaller in absolute value. */
const gfloat prem_vs_deux =
  deux_squared_minus_prem_squared > (gfloat) 0. ? prem : deux;
const gfloat deux_vs_troi =
  troi_squared_minus_deux_squared > (gfloat) 0. ? deux : troi;
/* Keep it only when both operands have the same sign. */
const gfloat my__up =
  prem_times_deux > (gfloat) 0. ? prem_vs_deux : (gfloat) 0.;
const gfloat my_dow =
  deux_times_troi > (gfloat) 0. ? deux_vs_troi : (gfloat) 0.;
were replaced by (this is the version with arithmetic branching):
const gfloat abs_prem = fabsf( prem );
const gfloat abs_deux = fabsf( deux );
const gfloat abs_troi = fabsf( troi );
const gfloat prem_vs_deux = __builtin_fminf( abs_prem, abs_deux );
const gfloat deux_vs_troi = __builtin_fminf( abs_deux, abs_troi );
/* copysignf( x, y ) returns the magnitude of x with the sign of y, */
/* so the sign of prem (+1 or -1) is copysignf( 1, prem ).          */
const gfloat sign_prem = copysignf( (gfloat) 1., prem );
const gfloat sign_deux = copysignf( (gfloat) 1., deux );
const gfloat sign_troi = copysignf( (gfloat) 1., troi );
const gfloat my__up =
  ( sign_prem * sign_deux + (gfloat) 1. ) * prem_vs_deux;
const gfloat my_dow =
  ( sign_deux * sign_troi + (gfloat) 1. ) * deux_vs_troi;
Basically, what the code snippets do is this:
If prem and deux have the same sign, put the smaller one (in absolute value) in my__up. Otherwise, set my__up to zero. Do likewise with deux, troi and my_dow. The above two code snippets represent the best ways of performing this that I could figure out.
===================
Overall conclusion:
===================
Arithmetic branching (without other improvements) does not appear to be worth the trouble.
================
Average timings:
================
stock gegl linear scale:
47.50 = ( 47.474 + 47.581 + 47.345 + 47.595 ) / 4
gegl yafr with ? branching and no use of intrinsics:
52.58 = ( 52.422 + 52.479 + 52.748 + 52.501 + 52.680 + 52.623 + 52.537 + 52.518 + 52.576 + 52.487 + 52.542 + 52.485 + 52.645 + 52.810 + 52.667 + 52.554 ) / 16
gegl yafr performing arithmetic branching with fabsf, copysignf and fminf:
52.70 = ( 52.568 + 52.447 + 52.763 + 52.524 + 52.772 + 52.652 + 52.524 + 52.765 + 52.596 + 52.850 + 52.733 + 52.799 + 52.627 + 52.897 + 52.871 + 52.866 ) / 16
As you can see, the "?" version is slightly faster overall. Probably not in a significant way, but this certainly does not suggest that this is worth the hassle.
Nicolas Robidoux Laurentian University/Universite Laurentienne
Question about the use of C99/gcc built-in math intrinsics within GEGL on gfloats
Hi,
On Fri, 2008-09-12 at 19:45 -0400, Nicolas Robidoux wrote:
> There is another C99/gcc built-in with the potential to speed up code a lot: the restrict keyword.
> See:
> http://www.cellperformance.com/mike_acton/2006/05/demystifying_the_restrict_keyw.html
It looks like the restrict keyword could be easily wrapped into a macro that evaluates to "restrict" on compilers that support it and to "" on compilers where support for it is missing. So if we should decide that it is too early for using C99 features, we could still use "restrict". We just need to add a configure check for it. We could even suggest that it is added to GLib as G_GNUC_RESTRICT.
Sven
Question about the use of C99/gcc built-in math intrinsics within GEGL on gfloats
Hi,
regarding the use of C99 features, this is a pointer to the last time this question came up among the GLib developers:
http://mail.gnome.org/archives/gtk-devel-list/2008-June/msg00020.html
The thread linked from this mail might have some interesting arguments that we should consider.
Sven
Question about the use of C99/gcc built-in math intrinsics within GEGL on gfloats
Hi,
On Sat, 2008-09-13 at 09:37 +0200, Sven Neumann wrote:
> It looks like the restrict keyword could be easily wrapped into a macro
The following code seems to do the trick. It introduces G_GNUC_RESTRICT, which is actually in the GLib namespace. But I hope that we can convince the GLib developers that it makes sense to add it. At some point we could then remove our definition:
#ifndef G_GNUC_RESTRICT
#if defined (__GNUC__) && (__GNUC__ >= 4)
#define G_GNUC_RESTRICT __restrict__
#else
#define G_GNUC_RESTRICT
#endif
#endif
I haven't yet tested whether this works and how much of a difference it makes. If I find time later I might check whether it helps to optimize some common code paths in GIMP.
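As an illustration of how such a macro might be applied, here is a minimal sketch. It repeats the fallback definition from this mail so that it compiles on its own; accumulate_weighted is a hypothetical helper, not a function from GEGL or GIMP.

```c
#include <stddef.h>

/* Fallback definition as proposed above; G_GNUC_RESTRICT is the name
 * suggested in this thread, not something assumed to exist in GLib. */
#ifndef G_GNUC_RESTRICT
#if defined (__GNUC__) && (__GNUC__ >= 4)
#define G_GNUC_RESTRICT __restrict__
#else
#define G_GNUC_RESTRICT
#endif
#endif

/* Hypothetical helper: dst[i] += weight * src[i] for i in [0, n).
 * The qualifiers promise the compiler that dst and src never overlap,
 * so it may keep values in registers and vectorize the loop. */
static void
accumulate_weighted (float       *G_GNUC_RESTRICT dst,
                     const float *G_GNUC_RESTRICT src,
                     float        weight,
                     size_t       n)
{
  size_t i;

  for (i = 0; i < n; i++)
    dst[i] += weight * src[i];
}
```

The promise is the caller's responsibility: passing overlapping buffers to a function declared this way is undefined behaviour, and the compiler will not diagnose it.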
Sven
Question about the use of C99/gcc built-in math intrinsics within GEGL on gfloats
Hi,
I've filed an enhancement request for G_GNUC_RESTRICT:
http://bugzilla.gnome.org/show_bug.cgi?id=552098
We should however not wait for this to be included in GLib. As GLib 2.18 has just been released, it will take a while before 2.20 hits the road.
Sven
Question about the use of C99/gcc built-in math intrinsics within GEGL on gfloats
Short postscript about the benchmark:
I got worried that with a scaling factor of 20, since we recompute the exact same pixel coefficients about 400 times, the chip's branch prediction may be performing better than it typically would with more reasonable enlargement ratios.
So, I redid the tests with a scaling factor of 1.17 instead of 20.
This time, arithmetic branching using C99/gcc math intrinsics performed just a little better than using if-then-else, reversing the results of the previous test (the difference is not statistically significant).
In any case the overall conclusion seems to be: program with what you like best, at least with Intel chips (with the cell processor, say, things may be different).
================
Average timings:
================
stock gegl-sampler-linear scale:
.541 = ( .525 + .607 + .517 + .516 ) / 4
gegl-sampler-yafr with if-then-else branching and no use of intrinsics:
.558 = ( .548 + .544 + .567 + .548 + .549 + .546 + .570 + .545 + .614 + .544 + .559 + .549 + .567 + .545 + .565 + .570 ) / 16
gegl-sampler-yafr performing arithmetic branching with fabsf, copysignf and fminf:
.551 = ( .565 + .550 + .550 + .546 + .550 + .549 + .546 + .567 + .566 + .545 + .548 + .550 + .549 + .548 + .546 + .550 ) / 16
------------------------------------------------------------------------
Also, there were some small typos in the code snippets I emailed (which were hand edited version of the real code, hence the typos). Here are cleaned up versions. (The following is for reference, really.)
Note that there is a slightly different scaling in the two versions (.5 vs .25), scaling which is taken care of at no cost elsewhere in the real code.
Within each version of yafr, 16 code segments resembling the following (note the ?: operators; this is the version with branching):
const gfloat prem_squared = prem * prem;
const gfloat deux_squared = deux * deux;
const gfloat troi_squared = troi * troi;
const gfloat prem_times_deux = prem * deux;
const gfloat deux_times_troi = deux * troi;
const gfloat deux_squared_minus_prem_squared = deux_squared - prem_squared;
const gfloat troi_squared_minus_deux_squared = troi_squared - deux_squared;
/* Pick whichever operand is smaller in absolute value. */
const gfloat prem_vs_deux =
  deux_squared_minus_prem_squared > (gfloat) 0. ? prem : deux;
const gfloat deux_vs_troi =
  troi_squared_minus_deux_squared > (gfloat) 0. ? deux : troi;
/* Keep it only when both operands have the same sign. */
const gfloat my__up =
  prem_times_deux > (gfloat) 0. ? prem_vs_deux : (gfloat) 0.;
const gfloat my_dow =
  deux_times_troi > (gfloat) 0. ? deux_vs_troi : (gfloat) 0.;
were replaced by (this is the version with arithmetic branching):
const gfloat abs_prem = fabsf( prem );
const gfloat abs_deux = fabsf( deux );
const gfloat abs_troi = fabsf( troi );
const gfloat prem_vs_deux = __builtin_fminf( abs_prem, abs_deux );
const gfloat deux_vs_troi = __builtin_fminf( abs_deux, abs_troi );
/* copysignf( x, y ) returns the magnitude of x with the sign of y, */
/* so the sign of prem (+1 or -1) is copysignf( 1, prem ).          */
const gfloat sign_prem = copysignf( (gfloat) 1., prem );
const gfloat sign_deux = copysignf( (gfloat) 1., deux );
const gfloat sign_troi = copysignf( (gfloat) 1., troi );
const gfloat my__up =
  ( sign_prem * sign_deux + (gfloat) 1. ) * prem_vs_deux;
const gfloat my_dow =
  ( sign_deux * sign_troi + (gfloat) 1. ) * deux_vs_troi;
Basically, what the code snippets do is this:
If prem and deux have the same sign, put the smaller one (in absolute value) in my__up. Otherwise, set my__up to zero. Do likewise with deux, troi and my_dow. The above two code snippets represent the best ways of performing this that I could figure out.
Nicolas Robidoux Laurentian University/Universite Laurentienne
Question about the use of C99/gcc built-in math intrinsics within GEGL on gfloats
Hi,
On Sun, 2008-09-14 at 11:30 +0200, Geert Jordaens wrote:
> Introducing the restrict qualifier will leave some more checks to be done by the programmer, and enabling the -fstrict-aliasing flag and the -Wstrict-aliasing warning would be advisable.
I don't think enabling the strict-aliasing rules is advisable in general. It is too easy to break the rules and this might lead to bugs.
Sven