
GIMP and multiple processors

This discussion is connected to the gimp-developer-list.gnome.org mailing list which is provided by the GIMP developers and not related to gimpusers.com.


57 of 57 messages available


GIMP and multiple processors Sven Neumann 12 Feb 14:09
  GIMP and multiple processors Sven Neumann 13 Feb 22:10
   GIMP and multiple processors Daniel Egger 13 Feb 22:18
    GIMP and multiple processors Sven Neumann 13 Feb 22:43
     GIMP and multiple processors Daniel Egger 13 Feb 23:06
   GIMP and multiple processors Nathan Summers 15 Feb 22:29
    GIMP and multiple processors Nathan Summers 15 Feb 22:32
    GIMP and multiple processors Sven Neumann 16 Feb 01:57
     GIMP and multiple processors Sven Neumann 16 Feb 22:42
      GIMP and multiple processors Daniel Egger 18 Feb 11:51
       GIMP and multiple processors Sven Neumann 20 Feb 14:09
        GIMP and multiple processors Daniel Egger 20 Feb 17:17
        GIMP and multiple processors Kevin Cozens 20 Feb 17:51
        GIMP and multiple processors Daniel Egger 20 Feb 21:43
         GIMP and multiple processors Sven Neumann 20 Feb 21:55
          GIMP and multiple processors Daniel Egger 20 Feb 22:27
           GIMP and multiple processors Sven Neumann 20 Feb 22:55
            GIMP and multiple processors Marc) (A.) (Lehmann 20 Feb 23:47
             GIMP and multiple processors Daniel Egger 21 Feb 00:12
              GIMP and multiple processors Adam D. Moss 21 Feb 00:52
               GIMP and multiple processors Tino Schwarze 21 Feb 09:14
                GIMP and multiple processors David Bonnell 21 Feb 12:28
                GIMP and multiple processors Marc) (A.) (Lehmann 21 Feb 13:45
                 GIMP and multiple processors David Bonnell 22 Feb 00:18
               GIMP and multiple processors Jay Cox 21 Feb 19:07
                GIMP and multiple processors Daniel Egger 21 Feb 21:34
                 GIMP and multiple processors Jay Cox 22 Feb 07:52
              GIMP and multiple processors David Bonnell 21 Feb 01:10
               GIMP and multiple processors Sven Neumann 21 Feb 02:00
                GIMP and multiple processors David Bonnell 21 Feb 03:47
                 GIMP and multiple processors Sven Neumann 22 Feb 21:27
                GIMP and multiple processors John Cupitt 21 Feb 14:02
                 GIMP and multiple processors Sven Neumann 22 Feb 21:29
              GIMP and multiple processors Marc) (A.) (Lehmann 21 Feb 03:14
               GIMP and multiple processors Daniel Egger 21 Feb 09:43
            GIMP and multiple processors Daniel Egger 20 Feb 23:50
      GIMP and multiple processors Jay Cox 26 Feb 02:44
       GIMP and multiple processors Sven Neumann 26 Feb 12:10
       GIMP and multiple processors Daniel Egger 26 Feb 19:07
       GIMP and multiple processors Sven Neumann 27 Feb 01:04
        GIMP and multiple processors Daniel Egger 27 Feb 14:17
         GIMP and multiple processors Sven Neumann 27 Feb 14:31
          GIMP and multiple processors Daniel Egger 27 Feb 14:59
           GIMP and multiple processors Sven Neumann 27 Feb 15:19
            GIMP and multiple processors Daniel Egger 27 Feb 15:41
             GIMP and multiple processors Sven Neumann 27 Feb 17:24
              GIMP and multiple processors Daniel Egger 27 Feb 17:32
            GIMP and multiple processors Daniel Egger 27 Feb 15:53
        GIMP and multiple processors Jay Cox 27 Feb 23:28
         GIMP and multiple processors Sven Neumann 28 Feb 00:25
          GIMP and multiple processors Daniel Egger 28 Feb 02:11
           GIMP and multiple processors GSR - FR 28 Feb 16:01
           GIMP and multiple processors Sven Neumann 28 Feb 21:15
            GIMP and multiple processors Daniel Egger 28 Feb 21:33
             GIMP and multiple processors Adam D. Moss 01 Mar 01:44
          GIMP and multiple processors Jay Cox 28 Feb 09:34
         GIMP and multiple processors Sven Neumann 01 Mar 00:35
Sven Neumann
2005-02-12 14:09:44 UTC (over 19 years ago)

GIMP and multiple processors

Hi,

since hyperthreading and multi-core CPUs are becoming more and more common, I think we should put a little more focus on making use of these features. For that reason, I have made --enable-mp the default in CVS HEAD and also changed the gimprc preference so that GIMP uses 2 processors. This default doesn't necessarily have to make it into the next stable release. The main concern now should be to make sure that it works at all.

I had a look at the current implementation and also cleaned up the code a little. We are talking about app/base/pixel-processor.[ch]. There seems to be room for improvement. I'd like to hack on this myself but my time is limited, so I thought I'd ask if someone else would like to take this job. So here are some ideas of what could be changed:

- Port the thread code to g_thread functions to make it more portable.

That would probably be necessary since we don't want to maintain our own thread abstraction layer. It would mean that we would have to call g_thread_init(). I am not sure if that would introduce any noticeable overhead on systems that only use a single CPU. Might be worth investigating.

- Use GThreadPool instead of continuously starting new threads.

Creating a thread has a considerable overhead. The current implementation continuously creates new threads that finish after they have done their job. GLib provides a GThreadPool that would probably fit nicely here.
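
For reference, here is a minimal sketch of the thread-pool pattern being suggested, using only public GLib 2.x calls; the function and variable names are illustrative, not GIMP's actual code:

  #include <glib.h>

  /* Hypothetical per-tile worker: 'data' is one queued tile,
   * 'user_data' is shared, read-only job state. */
  static void
  process_tile (gpointer data, gpointer user_data)
  {
    /* ... do the pixel work for this tile ... */
  }

  static void
  run_parallel (GSList *tiles, gpointer job, gint n_threads)
  {
    GThreadPool *pool;
    GSList      *iter;

    /* Reuse a pool of worker threads instead of spawning a new
     * thread per job and letting it exit. */
    pool = g_thread_pool_new (process_tile, job, n_threads,
                              TRUE /* exclusive */, NULL);

    for (iter = tiles; iter; iter = iter->next)
      g_thread_pool_push (pool, iter->data, NULL);

    /* immediate = FALSE, wait = TRUE: drain the queue, wait for
     * the running tasks, then tear the pool down. */
    g_thread_pool_free (pool, FALSE, TRUE);
  }

In the real pixel-processor the pool would presumably be created once and kept for the lifetime of the process rather than freed after every job, which is what the rest of this thread converges on.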

As you can see, this is a rather small task coding-wise. It would however be good to also do some benchmarking to find out how much overhead is introduced by using the threaded pixel-processor. We don't want to make things worse for single-CPU systems.

Any volunteers?

Sven

Sven Neumann
2005-02-13 22:10:21 UTC (over 19 years ago)

GIMP and multiple processors

Hi,

a quick followup to myself...

I couldn't resist and spent some time on the threaded pixel-processor code. The first part of the TODO I posted yesterday is done: the code has been ported to gthread. This makes the thread functionality available on all platforms supported by gthread-2.0.

I have also cleaned up the code further while porting it to gthread. It seems to be working rather well now, but I haven't done any benchmarking so far. There's now a #define to control the ratio between created threads and tiles that are to be processed. This parameter definitely needs tuning.
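
To illustrate what such a ratio could mean, a sketch; TILES_PER_THREAD is the #define mentioned above, but the value and the helper are assumptions, not the actual code:

  #include <glib.h>

  #define TILES_PER_THREAD 8   /* tuning parameter; the value is a guess */

  /* Don't start more threads than there are batches of tiles to keep
   * them busy, and never more than the configured maximum. */
  static gint
  threads_for_job (gint n_tiles, gint max_threads)
  {
    return CLAMP (n_tiles / TILES_PER_THREAD, 1, max_threads);
  }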

Also tried to port the code to GThreadPool but it turned out to be not as trivial as I expected. The current code blocks until all threads are returned and it is not trivial to implement this behaviour with GThreadPool. So this part, the more interesting part of the TODO, is still open.

I don't want to put more time into this, but I think it would definitely make sense to introduce a thread pool to cut down the number of thread creations. Any volunteers?

Sven

Daniel Egger
2005-02-13 22:18:09 UTC (over 19 years ago)

GIMP and multiple processors

On 13.02.2005, at 22:10, Sven Neumann wrote:

I couldn't resist and spent some time on the threaded pixel-processor code. The first part of the TODO I posted yesterday is done: the code has been ported to gthread. This makes the thread functionality available on all platforms supported by gthread-2.0.

If you need someone to run benchmark tests, let me know. I've a dual-Opteron under my desk, a dual-G4 and a dual-G5 at my disposal, an HT P4 machine nearby, and can easily fetch a dual-SPARC workstation or some bigger iron if needed.

Servus,
Daniel

Sven Neumann
2005-02-13 22:43:27 UTC (over 19 years ago)

GIMP and multiple processors

Hi,

Daniel Egger writes:

On 13.02.2005, at 22:10, Sven Neumann wrote:

I couldn't resist and spent some time on the threaded pixel-processor code. The first part of the TODO I posted yesterday is done: the code has been ported to gthread. This makes the thread functionality available on all platforms supported by gthread-2.0.

If you need someone to run benchmark tests, let me know. I've a dual-Opteron under my desk, a dual-G4 and a dual-G5 at my disposal, an HT P4 machine nearby, and can easily fetch a dual-SPARC workstation or some bigger iron if needed.

Doing benchmarks is exactly what would have to be done at this point. The main problem is probably what to use as a benchmark. One could write a script-fu and run that using batch-mode.

It would be interesting to have numbers that show the effect of the "num-processors" gimprc option and it would be interesting to see how the value of the TILES_PER_THREAD #define influences the performance. So far I can only guess that the cost of thread creation is perhaps a problem.

Sven

Daniel Egger
2005-02-13 23:06:20 UTC (over 19 years ago)

GIMP and multiple processors

On 13.02.2005, at 22:43, Sven Neumann wrote:

Doing benchmarks is exactly what would have to be done at this point. The main problem is probably what to use as a benchmark. One could write a script-fu and run that using batch-mode.

I've been using the coffee stain effect fu in the past but this is *way* too random to deliver sensible results for timing tests; it is somewhat useful for profiling, though, which is not what we want to do in this case.

It would be interesting to have numbers that show the effect of the "num-processors" gimprc option and it would be interesting to see how the value of the TILES_PER_THREAD #define influences the performance. So far I can only guess that the cost of thread creation is perhaps a problem.

If anyone can deliver some good benchmark I'd be happy to run it on a few machines with the CVS version....

Servus, Daniel

Nathan Summers
2005-02-15 22:29:00 UTC (over 19 years ago)

GIMP and multiple processors

On Sun, 13 Feb 2005 22:10:21 +0100, Sven Neumann wrote:

Also tried to port the code to GThreadPool but it turned out to be not as trivial as I expected. The current code blocks until all threads are returned and it is not trivial to implement this behaviour with GThreadPool. So this part, the more interesting part of the TODO, is still open.

I don't want to put more time into this, but I think it would definitely make sense to introduce a thread pool to cut down the number of thread creations. Any volunteers?

Here's a solution that would work: (note, untested code, error checking and uninteresting stuff omitted for clarity, etc)

Calling code:

struct SynchStuff {
  GThreadPool *pool;
  GCond       *cond;
  GMutex      *mutex;
} synch;

typedef struct SynchStuff SynchStuff;

synch.cond  = g_cond_new();
synch.mutex = g_mutex_new();

/* create an exclusive threadpool */
synch.pool = g_thread_pool_new(gimp_thread_func, &synch, ...);

/* make sure that all the tasks are in the queue before checking if the
 * queue is empty. */
g_mutex_lock(synch.mutex);

for (/* each task */ ...)
  {
    g_thread_pool_push(synch.pool, task, ...);
  }

/* all threads are now launched. allow threads to test the queue for
 * emptiness and wait for the signal that all tasks are complete */
g_cond_wait(synch.cond, synch.mutex);

/* all tasks are now complete */
g_mutex_unlock(synch.mutex);

Thread function:

void
gimp_thread_func(GimpTaskData *task, SynchStuff *synch)
{
  /* process task */
  ...

  /* wait until all tasks have been queued */
  g_mutex_lock(synch->mutex);

  if (g_thread_pool_get_num_threads(synch->pool) == 1)
    g_cond_signal(synch->cond);

  g_mutex_unlock(synch->mutex);
}

We could also try to talk the glib developers into including g_thread_pool_join, which would block until the task pool is empty. It would be nice if the GThreadPool API were modelled on OpenMP to the extent possible, which I admit is not really stellar because of the compiler modifications OpenMP requires. But perhaps some of what OpenMP does with compiler modifications we can do with macros.

Rockwalrus

Nathan Summers
2005-02-15 22:32:05 UTC (over 19 years ago)

GIMP and multiple processors

On Tue, 15 Feb 2005 16:29:00 -0500, Nathan Summers wrote:

if (g_thread_pool_get_num_threads(synch->pool) == 1)

correction: I meant if (g_thread_pool_get_num_threads(synch->pool) == 1 && g_thread_pool_unprocessed(synch->pool) == 0)

although the first test is sufficient if n(tasks) >= n(threads)

Rockwalrus

Sven Neumann
2005-02-16 01:57:10 UTC (over 19 years ago)

GIMP and multiple processors

Hi,

Nathan Summers writes:

Here's a solution that would work: (note, untested code, error checking and unintersting stuff omitted for clarity, etc)

Yes, that would most probably work.

struct SynchStuff {
GThreadPool *pool;
GCond *cond;
GMutex *mutex;
} synch;

typedef struct SynchStuff SynchStuff;

I don't think we need a SynchStuff struct though. These can very well be global static variables. The thread pool will be kept around for the whole lifetime of gimp anyway.

Sven

Sven Neumann
2005-02-16 22:42:34 UTC (over 19 years ago)

GIMP and multiple processors

Hi,

I couldn't resist and changed the PixelProcessor to use a thread pool. Main motivation was to make progress callback work for the threaded case. So there's now a variant of pixel_regions_process_parallel() that takes progress function and data. This allows us to parallelize some of the slower core functions (like for example gradient blend).

It would be interesting to know if this actually gives a noticeable speedup on SMP systems. Would be nice if one of you could give this some testing. Please try to do gradient blends (w/o adaptive supersampling!) on large images. Changing the number of processors in the preferences dialog allows you to switch between using the thread pool and the single-threaded code.

Sven

Daniel Egger
2005-02-18 11:51:14 UTC (over 19 years ago)

GIMP and multiple processors

On 16.02.2005, at 22:42, Sven Neumann wrote:

It would be interesting to know if this actually gives a noticeable speedup on SMP systems. Would be nice if one of you could give this some testing. Please try to do gradient blends (w/o adaptive supersampling!) on large images. Changing the number of processors in the preferences dialog allows you to switch between using the thread pool and the single-threaded code.

Hm, there's still no idea floating around how to benchmark. I'd rather have some microbenchmark that delivers reproducible results than only being able to say "hey, it feels faster"...

I think writing some GIMPbench would be a nice volunteer job and hey, if you do it, you'll get to choose the benchmark methods and get all the fame... ;)

Anyone?

Servus, Daniel

Sven Neumann
2005-02-20 14:09:10 UTC (about 19 years ago)

GIMP and multiple processors

Hi,

Daniel Egger writes:

Hm, there's still no idea floating around how to benchmark.

There are very clear ideas on how to do it. Someone just needs to sit down and write the script (or dig out script-fu-bench.scm which is what we used to use a long time ago).

I'd rather have some microbenchmark that delivers reproducible results than only being able to say "hey, it feels faster"...

You'd still do me a favor if you would try current CVS and told me whether it feels faster or not.

Sven

Daniel Egger
2005-02-20 17:17:58 UTC (about 19 years ago)

GIMP and multiple processors

On 20.02.2005, at 14:09, Sven Neumann wrote:

There are very clear ideas on how to do it.

Hm, must have missed that...

Someone just needs to sit down and write the script (or dig out script-fu-bench.scm which is what we used to use a long time ago).

You'd still do me a favor if you would try current CVS and told me whether it feels faster or not.

Okee, I'm warming up the machine right now. This'll take a while though because I need to recompile quite some stuff. I'll do that in 32bit mode if you don't mind...

Servus, Daniel

Kevin Cozens
2005-02-20 17:51:59 UTC (about 19 years ago)

GIMP and multiple processors

Sven Neumann wrote:

Daniel Egger writes:

Hm, there's still no idea floating around how to benchmark.

There are very clear ideas on how to do it. Someone just needs to sit down and write the script (or dig out script-fu-bench.scm which is what we used to use a long time ago).

It would be nice to have a copy of that old script-fu-bench.scm script. I used to have a copy with one of my versions of GIMP. It isn't in CVS and I can't find it on the net. If someone has the script I'd like to get a copy of it.

Even if we had the old script-fu-bench script around, I don't think it's the best choice for the benchmarking tests which are currently needed. It is something I (or someone knowledgeable in Script-Fu or another GIMP scripting language) could write. It would be helpful to have a list of GIMP features/functions to use which would call the code that needs to be benchmarked.

Daniel Egger
2005-02-20 21:43:37 UTC (about 19 years ago)

GIMP and multiple processors

On 20.02.2005, at 14:09, Sven Neumann wrote:

You'd still do me a favor if you would try current CVS and told me whether it feels faster or not.

It's slower, measurably and reproducibly slower.

As a benchmark I used a gradient fill in a 3000x3000px (68.8M) image. I consistently get times of 8s for 1 thread and between 9.2s and 9.6s for 2 threads. With a running application, after a restart -- doesn't matter.

What is strange, though, is that it only seems to use one CPU for both threads; maybe a stupid gthread implementation?

Servus, Daniel

Sven Neumann
2005-02-20 21:55:09 UTC (about 19 years ago)

GIMP and multiple processors

Hi,

Daniel Egger writes:

It's slower, measurably and reproducibly slower.

As a benchmark I used a gradient fill in a 3000x3000px (68.8M) image. I consistently get times of 8s for 1 thread and between 9.2s and 9.6s for 2 threads. With a running application, after a restart -- doesn't matter.

What is strange, though, is that it only seems to use one CPU for both threads; maybe a stupid gthread implementation?

What version did you test? What's the last ChangeLog entry?

Sven

Daniel Egger
2005-02-20 22:27:25 UTC (about 19 years ago)

GIMP and multiple processors

On 20.02.2005, at 21:55, Sven Neumann wrote:

As a benchmark I used a gradient fill in a 3000x3000px (68.8M) image. I consistently get times of 8s for 1 thread and between 9.2s and 9.6s for 2 threads. With a running application, after a restart -- doesn't matter.

What is strange, though, is that it only seems to use one CPU for both threads; maybe a stupid gthread implementation?

What version did you test?

GIMP (r/w) CVS from a few hours ago (about the time when I wrote that I'd do it but that it would take a while...)

What's the last ChangeLog entry?

2005-02-20 Sven Neumann

        * app/actions/context-actions.c
        * app/actions/context-commands.c[ch]: added actions to control
        the average radius of color picker tools (bug #167765).

        * app/actions/tool-options-actions.c: fixed a typo in a comment.

Servus, Daniel

Sven Neumann
2005-02-20 22:55:18 UTC (about 19 years ago)

GIMP and multiple processors

Hi,

Daniel Egger writes:

What is strange, though, is that it only seems to use one CPU for both threads; maybe a stupid gthread implementation?

Since gthread is just a very thin wrapper around pthreads, that would mean that it's a stupid pthread implementation. To me this looks like the kernel believes that it would be better to keep the threads local than to move one to the other CPU. I wonder if perhaps the kernel is right and using two CPUs would actually cause more overhead than it's worth?

Sven

Marc A. Lehmann
2005-02-20 23:47:49 UTC (about 19 years ago)

GIMP and multiple processors

On Sun, Feb 20, 2005 at 10:55:18PM +0100, Sven Neumann wrote:

mean that it's a stupid pthread implementation. To me this looks like the kernel believes that it would be better to keep the threads local than to move one to the other CPU.

Linux will not keep two threads running on a single cpu if both are ready and nothing else is running, regardless of locality etc., as the kernel lacks the tools to effectively decide whether threads should stay on a cpu or not.

(I mean, it's of course bad to interleave operations on a per-pixel basis instead of e.g. a per-tile basis, but the kernel will run the threads concurrently whether or not it gets slower).

right and using two CPUs would actually cause more overhead than it's worth?

That's quite possible, but IFF the kernel indeed keeps the two threads on a single cpu then it means that both aren't ready at the same time, e.g. due to lock contention or other things.

Daniel Egger
2005-02-20 23:50:30 UTC (about 19 years ago)

GIMP and multiple processors

On 20.02.2005, at 22:55, Sven Neumann wrote:

Since gthread is just a very thin wrapper around pthreads, that would mean that it's a stupid pthread implementation. To me this looks like the kernel believes that it would be better to keep the threads local than to move one to the other CPU. I wonder if perhaps the kernel is right and using two CPUs would actually cause more overhead than it's worth?

Hm, it is using NPTL, e.g. if I set it to 8 threads I'll have:
egger 31344 2.1 14.5 219412 149008 pts/0 Sl+ 23:10 0:05 /opt/gimp-mp-test/bin/gimp-2.3
egger 31344 10.8 14.5 219412 149012 pts/0 Rl+ 23:10 0:26 /opt/gimp-mp-test/bin/gimp-2.3
egger 31344 11.0 14.5 219412 149012 pts/0 Rl+ 23:10 0:26 /opt/gimp-mp-test/bin/gimp-2.3
egger 31344 6.5 14.5 219412 149012 pts/0 Rl+ 23:11 0:08 /opt/gimp-mp-test/bin/gimp-2.3
egger 31344 5.7 14.5 219412 149012 pts/0 Rl+ 23:11 0:07 /opt/gimp-mp-test/bin/gimp-2.3
egger 31344 6.8 14.5 219412 149016 pts/0 Rl+ 23:11 0:08 /opt/gimp-mp-test/bin/gimp-2.3
egger 31344 6.1 14.5 219412 149016 pts/0 Rl+ 23:11 0:08 /opt/gimp-mp-test/bin/gimp-2.3
egger 31344 6.8 14.5 219412 149016 pts/0 Rl+ 23:11 0:08 /opt/gimp-mp-test/bin/gimp-2.3
egger 31344 6.2 14.5 219412 149016 pts/0 Rl+ 23:11 0:08 /opt/gimp-mp-test/bin/gimp-2.3

If I force it back to classical pthreads then it'll be even worse:
egger 31456 70.8 15.0 164232 154852 pts/0 R+ 23:32 1:53 /opt/gimp-mp-test/bin/gimp-2.3
egger 31457 0.0 15.0 164232 154852 pts/0 S+ 23:32 0:00 /opt/gimp-mp-test/bin/gimp-2.3
egger 31458 0.0 15.0 164232 154852 pts/0 S+ 23:32 0:00 /opt/gimp-mp-test/bin/gimp-2.3
egger 31459 0.0 15.0 164232 154852 pts/0 S+ 23:32 0:00 /opt/gimp-mp-test/bin/gimp-2.3

One interesting thing here certainly is that one process (== thread) is doing all the work and the rest is doing nothing in a shape fill, while in a linear fill I'll get:
egger 31474 3.4 10.8 120696 111508 pts/0 S+ 23:36 0:04 /opt/gimp-mp-test/bin/gimp-2.3
egger 31475 0.0 10.8 120696 111508 pts/0 S+ 23:36 0:00 /opt/gimp-mp-test/bin/gimp-2.3
egger 31476 35.1 10.8 120696 111508 pts/0 S+ 23:36 0:48 /opt/gimp-mp-test/bin/gimp-2.3
egger 31477 35.1 10.8 120696 111508 pts/0 S+ 23:36 0:48 /opt/gimp-mp-test/bin/gimp-2.3

However those are again running on the same CPU, so they are rather disturbing each other than crunching away.

To be continued...

Servus, Daniel

Daniel Egger
2005-02-21 00:12:51 UTC (about 19 years ago)

GIMP and multiple processors

On 20.02.2005, at 23:47, Marc A. Lehmann wrote:

Linux will not keep two threads running on a single cpu if both are ready and nothing else is running, regardless of locality etc., as the kernel lacks the tools to effectively decide whether threads should stay on a cpu or not.

Yes and no. I just figured out that the tools I was looking for are called schedutils and can be used to change the affinity settings of a process, i.e. pin it to some CPU or allow it to migrate as the kernel decides between a set of CPUs.

Forcing the NPTL implementation to degrade to legacy pthreads means that one thread equals one process and thus can be controlled with taskset.

Oh yes, and I just noticed that now this isn't even necessary anymore because for some reason the kernel now migrates one of the pthread processes to the other CPU automatically after a short while of processing.

(I mean, it's of course bad to interleave operations on a per-pixel basis instead of e.g. a per-tile basis, but the kernel will run the threads concurrently whether or not it gets slower).

Certainly. Opterons are bandwidth monsters but this doesn't mean that they'll be forgiving to stupid algorithms.

That's quite possible, but IFF the kernel indeed keeps the two threads on
a single cpu then it means that both aren't ready at the same time, e.g.
due to lock contention or other things.

I can force it to use both CPUs now, but even with 200% utilization it is 2s slower to run this stupid ubenchmark than on 1 CPU without threads.

Servus, Daniel

Adam D. Moss
2005-02-21 00:52:16 UTC (about 19 years ago)

GIMP and multiple processors

Daniel Egger wrote:

I can force it to use both CPUs now, but even with 200% utilization it is 2s slower to run this stupid ubenchmark than on 1 CPU without threads.

Just a vague guess, but the multiprocessor GIMP pixel work scheduler might* farm alternating tiles to alternating CPUs. These are reasonably likely to have been allocated together and thus sit close together in memory, causing memory contention between CPUs.

--Adam
* I haven't looked at it.

David Bonnell
2005-02-21 01:10:28 UTC (about 19 years ago)

GIMP and multiple processors

It sounds like the granularity of parallelism is too fine. That is, each "task" is too short and the overhead of task dispatching (your task queue processing, the kernel's thread context switching, any IPC required, etc.) is longer than the duration of a single task.

I hit the same problem a decade and a half ago when I worked on distributed parallel ray tracing systems for my post graduate thesis. If each task is a pixel then you may want to consider increasing this to a (configurable size) bundle of pixels. Depending on the algorithm being parallelized the bundle may contain contiguous pixels (if processing of each pixel requires approximately uniform processor time) or a random set of pixels (if there is, or can potentially be, significant variance in per-pixel processing time).

-Dave


Sven Neumann
2005-02-21 02:00:39 UTC (about 19 years ago)

GIMP and multiple processors

Hi,

"David Bonnell" writes:

It sounds like the granularity of parallelism is too fine. That is, each "task" is too short and the overhead of task dispatching (your task queue processing, the kernel's thread context switching, any IPC required, etc.) is longer than the duration of a single task.

I hit the same problem a decade and a half ago when I worked on distributed parallel ray tracing systems for my post graduate thesis. If each task is a pixel then you may want to consider increasing this to a (configurable size) bundle of pixels. Depending on the algorithm being parallelized the bundle may contain contiguous pixels (if processing of each pixel requires approximately uniform processor time) or a random set of pixels (if there is, or can potentially be, significant variance in per-pixel processing time).

The task is not a single pixel but a single tile (that is usually a region of 64x64 pixels). GIMP processes pixel regions by iterating over the tiles. The multi-threaded pixel processor uses a configurable number of threads. Each thread obtains a lock on the pixel-region, takes a pointer to the next tile from the queue, releases the lock, processes the tile and starts over. This goes on until all tiles are processed. The main thread blocks until the queue is empty and all threads have finished their jobs. If a progress callback has been specified, the main thread wakes up regularly and updates the progress bar.
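
A rough sketch of the per-thread loop just described (simplified; the real code lives in app/base/pixel-processor.c, and the types and helper below are made up for illustration):

  #include <glib.h>

  typedef struct
  {
    GMutex *mutex;   /* protects the tile queue             */
    GQueue *tiles;   /* tiles still waiting to be processed */
  } Job;

  /* Hypothetical stand-in for the real per-tile pixel work. */
  static void process_tile (Job *job, gpointer tile);

  static gpointer
  tile_worker (gpointer data)
  {
    Job *job = data;

    for (;;)
      {
        gpointer tile;

        /* Keep the critical section tiny: only the queue access
         * happens under the lock. */
        g_mutex_lock (job->mutex);
        tile = g_queue_pop_head (job->tiles);
        g_mutex_unlock (job->mutex);

        if (tile == NULL)
          break;                   /* queue is empty, thread is done */

        process_tile (job, tile);  /* pixel work runs unlocked       */
      }

    return NULL;
  }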

Sven

Marc A. Lehmann
2005-02-21 03:14:19 UTC (about 19 years ago)

GIMP and multiple processors

On Mon, Feb 21, 2005 at 12:12:51AM +0100, Daniel Egger wrote:

Linux will not keep two threads running on a single cpu if both are ready and nothing else is running, regardless of locality etc., as the kernel lacks the tools to effectively decide whether threads should stay on a cpu or not.

Yes and no. I just figured out that the tools I was looking for are called schedutils and can be used to change the affinity settings of a process, i.e. pin it to some CPU or allow it to migrate as the kernel decides between a set of CPUs.

I should have said "unless specially instructed to do so", i.e., not by default.

Forcing the NPTL implementation to degrade to legacy

BTW, is this 2.4 or 2.6?

I can force it to use both CPUs now, but even with 200% utilization it is 2s slower to run this stupid ubenchmark than on 1 CPU without threads.

Now, that's definitely a result then :)

David Bonnell
2005-02-21 03:47:36 UTC (about 19 years ago)

GIMP and multiple processors

Thanks for the clarification.

If each thread obtains an (exclusive) lock on the pixel region then the tasks will effectively be serialized and overall execution time will increase compared to a non-threaded implementation due to the threading overheads. (Queue manipulation, thread creation and context switching, etc).

The main thread should lock the pixel region before pushing the tasks onto the work queue. Each task may then operate under the assumption that it has exclusive access to its tile. When the last task has completed the lock can be released.

If this is already the case then please ignore, as I have not had the opportunity yet to review the code.

-Dave


Tino Schwarze
2005-02-21 09:14:13 UTC (about 19 years ago)

GIMP and multiple processors

On Sun, Feb 20, 2005 at 11:52:16PM +0000, Adam D. Moss wrote:

I can force it to use both CPUs now, but even with 200% utilization it is 2s slower to run this stupid ubenchmark than on 1 CPU without threads.

Just a vague guess, but the multiprocessor GIMP pixel work scheduler might* farm alternating tiles to alternating CPUs. These are reasonably likely to have been allocated together and thus sit close together in memory, causing memory contention between CPUs.

So it might make sense not to interleave tiles but let each thread start on another region of the image, like this (for 4 threads)

1xxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx
2xxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx
3xxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx
4xxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx

So each of the threads works on "its own" memory at first, and only when it has finished its region will it get tiles from the remaining work. Some binary-search-like algorithm could be used, like this:

- figure out the largest remaining part of the image (at start, it's the whole image)
- let x be the number of idle threads
- if starting up:
    divide into x regions
    assign each thread the start of its region
  else:
    divide into x+1 regions
    assign the idle threads the x+1st, x-th, x-1st etc. region (this leaves the first to the thread currently processing it)
- start processing the regions
- if any thread runs out of work, restart the algorithm

This should improve memory locality while distributing work across all threads and making sure that non-uniform workloads are also distributed.
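
A rough sketch of that re-balancing step, under the assumption that a region is just a half-open range of tiles; this is purely illustrative, nothing like it exists in GIMP:

  #include <glib.h>

  typedef struct { gint start; gint end; } Region;  /* half-open tile range */

  /* Split the largest remaining region so that each of the n_idle idle
   * threads gets a tail chunk, leaving the first chunk to the thread
   * that is already working on it. */
  static void
  rebalance (GArray *regions, gint n_idle)
  {
    guint  i, largest = 0;
    Region big;
    gint   part, k;

    if (regions->len == 0 || n_idle == 0)
      return;

    for (i = 1; i < regions->len; i++)
      if (g_array_index (regions, Region, i).end -
          g_array_index (regions, Region, i).start >
          g_array_index (regions, Region, largest).end -
          g_array_index (regions, Region, largest).start)
        largest = i;

    big  = g_array_index (regions, Region, largest);
    part = (big.end - big.start) / (n_idle + 1);
    if (part == 0)
      return;                      /* nothing left worth splitting */

    /* shrink the original region to its first chunk ...           */
    g_array_index (regions, Region, largest).end = big.start + part;

    /* ... and hand the remaining chunks to the idle threads       */
    for (k = 0; k < n_idle; k++)
      {
        Region r;

        r.start = big.start + part * (k + 1);
        r.end   = (k == n_idle - 1) ? big.end : big.start + part * (k + 2);
        g_array_append_val (regions, r);
      }
  }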

Consider the above example where thread 4 has finished its tiles:

1111xxxxxxxxxxxx
xxxxxxxxxxxxxxxx
22222222xxxxxxxx
xxxxxxxxxxxxxxxx
333333333333333x
xxxxxxxxxxxxxxxx
4444444444444444
4444444444444444

It looks like 1 will need the longest time to finish its region, so we jump in there and assign part of its work to 4 (marked y):

1111xxxxxxxxxxxx
xx4yyyyyyyyyyyyy
22222222xxxxxxxx
xxxxxxxxxxxxxxxx
333333333333333x
xxxxxxxxxxxxxxxx
4444444444444444
4444444444444444

Later, 3 finishes:

11111111xxxxxxxx
xx44444yyyyyyyyy
2222222222222222
xxxxxxxxxxxxxxxx
3333333333333333
3333333333333333
4444444444444444
4444444444444444

Now, 2 has the largest region left, so half of it gets assigned to 3 (marked z):

11111111xxxxxxxx
xx44444yyyyyyyyy
2222222222222222
xxxxxxxx3zzzzzzz
3333333333333333
3333333333333333
4444444444444444
4444444444444444

The state of affairs can be managed easily since there are never more regions than threads.

HTH! Tino.

PS: I don't know whether this algorithm works well in practice; I actually got the idea while replying to Adam's response.

Daniel Egger
2005-02-21 09:43:51 UTC (about 19 years ago)

GIMP and multiple processors

On 21.02.2005, at 03:14, Marc A. Lehmann wrote:

Forcing the NPTL implementation to degrade to legacy

BTW, is this 2.4 or 2.6?

Linux ulli 2.6.9-k8 #31 SMP Wed Nov 3 10:58:29 CET 2004 x86_64 GNU/Linux

32bit userland from Debian unstable.

Servus, Daniel

David Bonnell
2005-02-21 12:28:42 UTC (about 19 years ago)

GIMP and multiple processors

Allocating batches of tiles to each thread and dynamically re-allocating them should a thread finish early will complicate the code and will only provide a very marginal speed improvement. (Each thread will have to obtain a lock on a work queue to get its next tile either way, but the proposed scheme has a slightly lower chance of lock contention since there are effectively multiple queues.)

As far as task sequencing goes that all depends on how tile memory is allocated. To maximize locality of reference and minimize paging the goal should be to have all threads working on contiguous areas of memory at the same time where possible.

-Dave


Marc A. Lehmann
2005-02-21 13:45:22 UTC (about 19 years ago)

GIMP and multiple processors

On Mon, Feb 21, 2005 at 09:14:13AM +0100, Tino Schwarze wrote:

I can force it to use both CPUs now, but even with 200% utilization it is 2s slower to run this stupid ubenchmark than on 1 CPU without threads.

Just a vague guess, but the multiprocessor GIMP pixel work scheduler might* farm alternating tiles to alternating CPUs. These are reasonably likely to have been allocated together and thus sit close together in memory

This is unlikely, as the gimp has no say in the physical layout of the memory it gets from the kernel.

Also, interleaving should not have much of an effect, as the blocks are large, so there will be no thrashing between cpus (if hyperthreading is in use, then interleaving would actually be slightly better due to limited cache associativity).

It's more likely that gimp is doing something wrong currently, with respect to locking or sth. similar.

John Cupitt
2005-02-21 14:02:42 UTC (about 19 years ago)

GIMP and multiple processors

On Mon, 21 Feb 2005 02:00:39 +0100, Sven Neumann wrote:

It sounds like the granularity of parallelism is too fine. That is, each "task" is too short and the overhead of task dispatching (your task queue processing, the kernel's thread context switching, any IPC required, etc.) is longer than the duration of a single task.

The task is not a single pixel but a single tile (that is usually a region of 64x64 pixels). GIMP processes pixel regions by iterating over the tiles. The multi-threaded pixel processor uses a configurable number of threads. Each thread obtains a lock on the pixel-region, takes a pointer to the next tile from the queue, releases the lock, processes the tile and starts over.

I maintain a threaded image processing library called VIPS.

http://www.vips.ecs.soton.ac.uk/

We looked at granularity a while ago and the 'sweet spot' at which the thread start/stop time became insignificant seemed to be around 50x50 pixels. I've pasted some numbers to the end of the mail in case anyone is interested. I realise gimp is using a very different evaluation strategy, but the point (maybe) is that thread manipulation is rather quick and you're probably not seeing it with 64x64 pixel tiles.

FWIW, vips works by having a thread pool (rather than a tile queue) and a simple for(;;) loop over tiles. At each tile, the for() loop waits for a thread to become free, then assigns it a tile to work on.

The benchmark is a 45 degree rotate of a 4000 by 4000 pixel image; each run is "./try huysum.hr.v fred.v <tile width> <tile height> 1 20". Look for the point at which real time stops falling. The more recent numbers are really too small to be accurate :-( but the benchmark is 10 years old and took minutes back then: oh well. I'm supposed to be getting a quad opteron soon, which will be interesting. Kernel 2.6 would help too no doubt.

cima: 1 cpu ultrasparc

  tile     real    user
  10x10    30.1    24.5
  20x20    19.2    16.9
  30x30    17.8    15.4
  40x40    17.1    15.1
  50x50    16.9    15.1
  60x60    16.6    15.0
  70x70    17.2    15.2
  80x80    17.3    15.1
  90x90    17.4    15.3

perugino: 2 cpu supersparc

  tile     real        user
  10x10    0m51.123s   1m7.623s
  20x20    0m24.601s   0m41.133s
  30x30    0m21.931s   0m38.393s
  40x40    0m20.208s   0m35.653s
  50x50    0m20.109s   0m35.283s
  60x60    0m19.501s   0m34.513s
  70x70    0m20.435s   0m34.813s
  80x80    0m20.558s   0m35.293s
  90x90    0m20.785s   0m35.313s

Run on furini, 2 CPU 450MHz PII Xeon, kernel 2.4.4, vips-7.7.19, gcc 2.95.3

  tile     real       user       sys
  10x10    0m4.542s   0m4.350s   0m3.800s
  20x20    0m2.206s   0m2.750s   0m1.250s
  30x30    0m1.678s   0m2.610s   0m0.580s
  40x40    0m1.483s   0m2.460s   0m0.410s
  50x50    0m1.443s   0m2.330s   0m0.350s
  60x60    0m1.385s   0m2.390s   0m0.220s
  70x70    0m1.394s   0m2.460s   0m0.150s
  80x80    0m1.365s   0m2.360s   0m0.200s
  90x90    0m1.393s   0m2.450s   0m0.180s

Run on manet, 2 CPU 2.5GHz P4 Xeon, kernel 2.4.18, vips-7.8.5, gcc 2.95.3

  tile     real       user       sys
  10x10    0m1.582s   0m1.640s   0m1.470s
  20x20    0m0.691s   0m0.970s   0m0.410s
  30x30    0m0.548s   0m0.790s   0m0.230s
  40x40    0m0.489s   0m0.790s   0m0.160s
  50x50    0m0.465s   0m0.610s   0m0.180s
  60x60    0m0.454s   0m0.740s   0m0.030s
  70x70    0m0.505s   0m0.820s   0m0.120s
  80x80    0m0.479s   0m0.840s   0m0.090s
  90x90    0m0.436s   0m0.650s   0m0.040s

Run on constable, 2 CPU 2.5GHz P4 Xeon, kernel 2.4.21, vips-7.10.8, gcc 3.3.1

  tile     real       user       sys
  10x10    0m1.544s   0m1.420s   0m1.422s
  20x20    0m0.690s   0m0.834s   0m0.441s
  30x30    0m0.494s   0m0.658s   0m0.244s
  40x40    0m0.450s   0m0.657s   0m0.174s
  50x50    0m0.397s   0m0.579s   0m0.144s
  60x60    0m0.507s   0m0.813s   0m0.123s
  70x70    0m0.381s   0m0.573s   0m0.115s
  80x80    0m0.357s   0m0.530s   0m0.101s
  90x90    0m0.528s   0m0.877s   0m0.103s

Jay Cox
2005-02-21 19:07:59 UTC (about 19 years ago)

GIMP and multiple processors

On Sun, 2005-02-20 at 23:52 +0000, Adam D. Moss wrote:

Daniel Egger wrote:

I can force it to use both CPUs now, but even with 200% utilization it is 2s slower to run this stupid ubenchmark than on 1 CPU without threads.

Just a vague guess, but the multiprocessor GIMP pixel work scheduler might* farm alternating tiles to alternating CPUs. These are reasonably likely to have been allocated together and thus sit close together in memory, causing memory contention between CPUs.

I'm not sure what system the benchmark is being run on, but the cache line size on a P4 is 128 bytes (most other systems have smaller cache line sizes). A simple test to see if this is the problem would be to change the tile allocation code to allocate an extra 128 bytes of memory per tile. See app/base/tile.c line 221
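
What that experiment would amount to, roughly; CACHE_LINE and the helper name below are assumptions, not the actual tile.c code:

  #include <glib.h>

  #define CACHE_LINE 128   /* the worst case mentioned here (P4) */

  /* Over-allocate each tile's pixel data by one cache line so that two
   * tiles written by different CPUs can never share a line -- a quick
   * way to rule out false sharing as the culprit. */
  static guchar *
  tile_data_alloc (gsize size)
  {
    return g_malloc (size + CACHE_LINE);
  }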

I think it is more likely that the problem is with sharing some other data between processors (for example the gradient struct?)

I think it would be a good idea to get some timings from some other operations also. Perhaps painting with a large brush, or flattening a complicated image.

Jay Cox
jaycox@gimp.org

Daniel Egger
2005-02-21 21:34:41 UTC (about 19 years ago)

GIMP and multiple processors

On 21.02.2005, at 19:07, Jay Cox wrote:

I'm not sure what system the benchmark is being run on, but the cache line size on a P4 is 128 bytes (most other systems have smaller cache line sizes). A simple test to see if this is the problem would be to change the tile allocation code to allocate an extra 128 bytes of memory per tile. See app/base/tile.c line 221

Dual Opteron.

I think it would be a good idea to get some timings from some other operations also. Perhaps painting with a large brush, or flattening a complicated image.

Sven, what procedures are currently "parallelized"?

Servus, Daniel

David Bonnell
2005-02-22 00:18:59 UTC (about 19 years ago)

GIMP and multiple processors

That all depends on how it allocates tile memory. If it mallocs tile by tile then they could be anywhere, including separate pages. If it mallocs a large block (system page size) and then divides it into tiles itself then those tiles are *likely* to reside on the same page - which is what matters most for performance if memory is stretched.

-Dave


...

This is unlikely, as the gimp has no say in the physical layout of the memory
it gets from the kernel.

Jay Cox
2005-02-22 07:52:53 UTC (about 19 years ago)

GIMP and multiple processors

On Mon, 2005-02-21 at 21:34 +0100, Daniel Egger wrote:

On 21.02.2005, at 19:07, Jay Cox wrote:

I'm not sure what system the benchmark is being run on, but the cache line size on a P4 is 128 bytes (most other systems have smaller cache line sizes). A simple test to see if this is the problem would be to change the tile allocation code to allocate an extra 128 bytes of memory per tile. See app/base/tile.c line 221

Dual Opteron.

Should only need 64 bytes then.

I think it would be a good idea to get some timings from some other operations also. Perhaps painting with a large brush, or flattening a complicated image.

Sven, what procedures are currently "parallelized"?

My recollection from many years ago is this:

  All of the tools->color_tools
  Histogram calculation
  All compositing(?) including:
    Layer stack compositing
    Brush->drawable compositing

More recently Sven has parallelized:

  gimp_image_contiguous_region_by_color
  gimp_drawable_desaturate
  gradient_fill_region

Jay Cox
jaycox@gimp.org

Sven Neumann
2005-02-22 21:27:14 UTC (about 19 years ago)

GIMP and multiple processors

Hi,

"David Bonnell" writes:

If each thread obtains an (exclusive) lock on the pixel region then the tasks will effectively be serialized and overall execution time will increase compared to a non-threaded implementation due to the threading overheads. (Queue manipulation, thread creation and context switching, etc).

Please read my description again; I don't think it was that bad. Of course there's no exclusive lock on the pixel region while it is being processed.

Sven

Sven Neumann
2005-02-22 21:29:49 UTC (about 19 years ago)

GIMP and multiple processors

Hi,

John Cupitt writes:

FWIW, vips works by having a thread pool (rather than a tile queue) and a simple for(;;) loop over tiles. At each tile, the for() loop waits for a thread to become free, then assigns it a tile to work on.

It would be trivial to change the GIMP code to work this way. We would use the existing GThreadPool and just push a task per tile into the pool. If someone wants to give this a try, perhaps it makes a difference.

Sven

Jay Cox
2005-02-26 02:44:22 UTC (about 19 years ago)

GIMP and multiple processors

On Wed, 2005-02-16 at 22:42 +0100, Sven Neumann wrote:

Hi,

I couldn't resist and changed the PixelProcessor to use a thread pool. Main motivation was to make progress callback work for the threaded case. So there's now a variant of pixel_regions_process_parallel() that takes progress function and data. This allows us to parallelize some of the slower core functions (like for example gradient blend).

It would be interesting to know if this actually gives a noticeable speedup on SMP systems. Would be nice if one of you could give this some testing. Please try to do gradient blends (w/o adaptive supersampling!) on large images. Changing the number of processors in the preferences dialog allows you to switch between using the thread pool and the single-threaded code.

Here are some results for you:

Dual 2.5GHz G5 Mac, Mac OS X 10.3.8
CVS gimp, ChangeLog revision 1.10539

Linear Gradient blend on a 3000x3000 pixel image (Dithering on)
  1 processor:  7.98 seconds  1x
  2 processors: 5.20 seconds  1.5x
  3 processors: 5.23 seconds  1.5x

Linear Gradient blend on a 3000x3000 pixel image (Dithering OFF)
  1 processor:  3.89 seconds  1x
  2 processors: 2.37 seconds  1.7x
  3 processors: 2.40 seconds  1.7x

Clearly the gradient code could use some tuning. A linear blend shouldn't take much more than 1/2 a second even with dithering.

The reason why the dithering case gets less of a speedup is that the threads are fighting over the GRand state. Each thread needs to have its own GRand state.

It looks like the threads are also fighting over gradient->last_visited. My guess is that fixing this will get us much closer to the ideal 2x speed up.

I don't see myself hacking on the gradient code in the near future so anyone else should feel free...

Jay Cox jaycox@gimp.org

Sven Neumann
2005-02-26 12:10:26 UTC (about 19 years ago)

GIMP and multiple processors

Hi,

Jay Cox writes:

Here are some results for you:

Dual 2.5GHz G5 Mac, Mac OS X 10.3.8
CVS gimp, ChangeLog revision 1.10539

Linear Gradient blend on a 3000x3000 pixel image (Dithering on)
  1 processor:  7.98 seconds  1x
  2 processors: 5.20 seconds  1.5x
  3 processors: 5.23 seconds  1.5x

Linear Gradient blend on a 3000x3000 pixel image (Dithering OFF)
  1 processor:  3.89 seconds  1x
  2 processors: 2.37 seconds  1.7x
  3 processors: 2.40 seconds  1.7x

Clearly the gradient code could use some tuning. A linear blend shouldn't take much more than 1/2 a second even with dithering.

The reason why the dithering case gets less of a speedup is that the threads are fighting over the GRand state. Each thread needs to have its own GRand state.

It looks like the threads are also fighting over gradient->last_visited. My guess is that fixing this will get us much closer to the ideal 2x speed up.

Eeek, I didn't think about gradient->last_visited. While the use of GRand is MT-safe, accessing the gradient obviously isn't. So this is not only slowing things down, it is a race condition that can lead to broken results. We definitely need to fix this or revert back to a non-parallel version of gimp_gradient_blend().

BTW, thanks for fixing my threads code and also thanks for these benchmarks!

Sven

Daniel Egger
2005-02-26 19:07:34 UTC (about 19 years ago)

GIMP and multiple processors

On 26.02.2005, at 02:44, Jay Cox wrote :

Dual 2.5GHz G5 Mac, Mac OS X 10.3.8
CVS gimp, ChangeLog revision 1.10539

Linear Gradient blend on a 3000x3000 pixel image (Dithering on)
  1 processor:  7.98 seconds  1x
  2 processors: 5.20 seconds  1.5x
  3 processors: 5.23 seconds  1.5x

Linear Gradient blend on a 3000x3000 pixel image (Dithering OFF)
  1 processor:  3.89 seconds  1x
  2 processors: 2.37 seconds  1.7x
  3 processors: 2.40 seconds  1.7x

Oh, I wouldn't have thought that disabling dithering makes such a difference, but indeed it does.

Same game on Dual Opteron again with latest sources:

Linear Gradient blend on a 3000x3000 pixel image (Dithering on)
  1 processor:  7.5 s
  2 processors: 9.6 s
  4 processors: 9.6 s

Linear Gradient blend on a 3000x3000 pixel image (Dithering OFF)
  1 processor:  3.0 s
  2 processors: 2.5 s
  4 processors: 2.5 s

Each time determined over three runs. Deviation is too small to measure.

So with dithering there's a loss, and without it a slight win, on this box with several threads.

Servus, Daniel

Sven Neumann
2005-02-27 01:04:30 UTC (about 19 years ago)

GIMP and multiple processors

Hi again,

Jay Cox writes:

Clearly the gradient code could use some tuning. A linear blend shouldn't take much more than 1/2 a second even with dithering.

Could we improve this by preparing a reasonably fine array of samples and picking from these samples instead of calling gimp_gradient_get_color_at() over and over again?
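
As a sketch of that idea (gradient_sample() below is a hypothetical stand-in for whatever actually evaluates the gradient; none of this is GIMP code):

  #include <glib.h>

  typedef struct { gdouble r, g, b, a; } Color;

  #define N_SAMPLES 1024

  /* Hypothetical stand-in for evaluating the gradient at pos in [0,1];
   * here just a grey ramp as a placeholder. */
  static void
  gradient_sample (gdouble pos, Color *color)
  {
    color->r = color->g = color->b = pos;
    color->a = 1.0;
  }

  /* Sample the gradient once into a table ... */
  static Color *
  gradient_make_table (void)
  {
    Color *table = g_new (Color, N_SAMPLES);
    gint   i;

    for (i = 0; i < N_SAMPLES; i++)
      gradient_sample ((gdouble) i / (N_SAMPLES - 1), &table[i]);

    return table;
  }

  /* ... then every pixel becomes a cheap table lookup. */
  static void
  gradient_lookup (const Color *table, gdouble pos, Color *color)
  {
    gint i = CLAMP ((gint) (pos * (N_SAMPLES - 1) + 0.5), 0, N_SAMPLES - 1);

    *color = table[i];
  }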

The reason why the dithering case gets less of a speedup is because the threads are fighting over the GRand state. Each thread needs to have it's own GRand state.

It looks like the threads are also fighting over gradient->last_visited. My guess is that fixing this will get us much closer to the ideal 2x speed up.

I have eliminated last_visited from the gradient struct. Instead the caller of gimp_gradient_get_color_at() may now do the same optimization without any caching in the gradient itself. I very much doubt that this makes any difference though. Perhaps if you would benchmark a gradient blend with a lot of segments. But in the general case it's just a few of them, very often even only a single one.

Now that this race condition is eliminated I might look into adding hooks to the pixel-processor to allow initialisation of per-thread data, like for example a GRand.

Sven

Daniel Egger
2005-02-27 14:17:09 UTC (about 19 years ago)

GIMP and multiple processors

On 27.02.2005, at 01:04, Sven Neumann wrote:

I have eliminated last_visited from the gradient struct. Instead the caller of gimp_gradient_get_color_at() may now do the same optimization without any caching in the gradient itself. I very much doubt that this makes any difference though. Perhaps if you would benchmark a gradient blend with a lot of segments. But in the general case it's just a few of them, very often even only a single one.

Now that this race condition is eliminated I might look into adding hooks to the pixel-processor to allow initialisation of per-thread data, like for example a GRand.

Your change dramatically altered the picture.

Instead of:

Linear Gradient blend on a 3000x3000 pixel image (Dithering on)
1 processor: 7.5 s
2 processors: 9.6 s
4 processors: 9.6 s

Linear Gradient blend on a 3000x3000 pixel image (Dithering OFF)
1 processor: 3.0 s
2 processors: 2.5 s
4 processors: 2.5 s

I now get:

Linear Gradient blend on a 3000x3000 pixel image (Dithering on)
1 processor: 7.2 s
2 processors: 8.9 s

Linear Gradient blend on a 3000x3000 pixel image (Dithering OFF)
1 processor: 2.5 s
2 processors: 1.5 s

However, with dithering on we're still losing quite a bit on this SMP machine...

Servus,
Daniel

Sven Neumann
2005-02-27 14:31:08 UTC (about 19 years ago)

GIMP and multiple processors

Hi,

Daniel Egger writes:

Your change dramatically changed the picture.

That surprises me, but there is obviously a lot I have to learn about threads. Thanks for testing my changes.

However, with dithering on we're still losing quite a bit on this SMP machine...

Dithering makes heavy use of GRand and as long as the random number generator is shared between the threads... I wonder if it actually makes sense to add the overhead of per-thread data initialization or if we should rather replace the use of a random number generator with a pre-calculated dither pattern. That would improve the single-threaded case as well.
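
A minimal sketch of the pre-calculated variant, assuming a small table filled once at startup; the table size and the mixing of the pixel coordinates are arbitrary choices for illustration:

#include <glib.h>

#define DITHER_SIZE 1024   /* power of two, arbitrary */

static guint8 dither_table[DITHER_SIZE];

/* Fill the table once at startup; after that the per-pixel path
 * needs no RNG calls and no shared GRand state. */
static void
dither_table_init (void)
{
  GRand *rand = g_rand_new_with_seed (0x2357);   /* arbitrary fixed seed */
  gint   i;

  for (i = 0; i < DITHER_SIZE; i++)
    dither_table[i] = g_rand_int_range (rand, 0, 256);

  g_rand_free (rand);
}

/* Index the table by pixel position so neighbouring pixels get
 * different offsets, without any locking between threads. */
static inline guint8
dither_offset (gint x, gint y)
{
  return dither_table[(x * 7 + y * 131) & (DITHER_SIZE - 1)];
}

The same table would serve the single-threaded case as well, which is where the extra improvement would come from.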

Sven

Daniel Egger
2005-02-27 14:59:19 UTC (about 19 years ago)

GIMP and multiple processors

On 27.02.2005, at 14:31, Sven Neumann wrote:

Dithering makes heavy use of GRand and as long as the random number generator is shared between the threads... I wonder if it actually makes sense to add the overhead of per-thread data initialization or if we should rather replace the use of a random number generator with a pre-calculated dither pattern. That would improve the single-threaded case as well.

Since the randomness doesn't play a big role here (unlike in, say, a security context), I wonder if it wouldn't be easiest to employ a per-thread pseudo-RNG, each seeded with a different "random" number. One global RNG would be handy for generating those seeds...

BTW: How often is gradient_fill_single_region_rgb being called? Maybe we can simply make the RNG a local property of this and the ..._gray function and be done with it; that of course wouldn't make much sense if it ends up being called a million times per fill...

Servus,
Daniel

Sven Neumann
2005-02-27 15:19:21 UTC (about 19 years ago)

GIMP and multiple processors

Hi,

Daniel Egger writes:

Since the randomness doesn't play a big role here (unlike in, say, a security context), I wonder if it wouldn't be easiest to employ a per-thread pseudo-RNG, each seeded with a different "random" number. One global RNG would be handy for generating those seeds...

GRand is such a pseudo-RNG.

BTW: How often is gradient_fill_single_region_rgb being called? Maybe we can simply make the RNG a local property of this and the ..._gray function and be done with it; that of course wouldn't make much sense if it ends up being called a million times per fill...

It is called once per tile. Your approach probably makes sense as long as we don't use g_rand_new() but g_rand_new_with_seed(). g_rand_new() initializes the random number generator from /dev/urandom, which is probably too expensive to be done once per tile.
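
A minimal sketch of what that per-tile arrangement could look like; the function below is only a stand-in for gradient_fill_single_region_rgb() (whose real signature and body differ), and the seed mixing is arbitrary:

#include <glib.h>

/* One GRand per call, i.e. per tile, seeded cheaply instead of from
 * /dev/urandom.  'region_id' stands in for whatever uniquely
 * identifies the tile (e.g. its coordinates).  The body only
 * demonstrates the RNG lifetime; the real function computes the
 * blend and uses the random value for dithering. */
static void
fill_region_dithered (guint8 *dest,
                      gint    width,
                      gint    height,
                      guint32 region_id)
{
  GRand *rand = g_rand_new_with_seed (region_id * 2654435761u + 1);
  gint   x, y;

  for (y = 0; y < height; y++)
    for (x = 0; x < width; x++)
      dest[y * width + x] = (guint8) g_rand_int_range (rand, 0, 256);

  g_rand_free (rand);
}

Each invocation owns its GRand, so the threads never touch shared RNG state; the seed could also come from a global RNG as Daniel suggested, as long as that only happens once per tile.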

Sven

Daniel Egger
2005-02-27 15:41:35 UTC (about 19 years ago)

GIMP and multiple processors

On 27.02.2005, at 15:19, Sven Neumann wrote:

It is called once per tile. Your approach probably makes sense as long as we don't use g_rand_new() but g_rand_new_with_seed(). g_rand_new() initializes the random number generator from /dev/urandom, which is probably too expensive to be done once per tile.

Incidentally this is exactly what I'm testing right now. ;=)

Servus, Daniel

Daniel Egger
2005-02-27 15:53:57 UTC (about 19 years ago)

GIMP and multiple processors

On 27.02.2005, at 15:19, Sven Neumann wrote:

It is called once per tile. Your approach probably makes sense as long as we don't use g_rand_new() but g_rand_new_with_seed(). g_rand_new() initializes the random number generator from /dev/urandom, which is probably too expensive to be done once per tile.

Not too bad; here are the results (patch in a separate mail):

Before:

Linear Gradient blend on a 3000x3000 pixel image (Dithering on)
1 processor: 7.2 s
2 processors: 8.9 s

Linear Gradient blend on a 3000x3000 pixel image (Dithering OFF)
1 processor: 2.5 s
2 processors: 1.5 s

After:

Linear Gradient blend on a 3000x3000 pixel image (Dithering on)
1 processor: 7.0 s
2 processors: 3.6 s

Linear Gradient blend on a 3000x3000 pixel image (Dithering OFF)
1 processor: 2.6 s
2 processors: 1.4 s

The degradation in the 1-processor setting without dithering seems to have been a small measurement fluke in my last round, since I consistently get better times in all other cases. Apologies for that.

I need to think about a larger test image since it's starting to get too difficult to time this microbench...

Servus, Daniel

Sven Neumann
2005-02-27 17:24:41 UTC (about 19 years ago)

GIMP and multiple processors

Hi,

Daniel Egger writes:

It is called once per tile. Your approach probably makes sense as long as we don't use g_rand_new() but g_rand_new_with_seed(). g_rand_new() initializes the random number generator from /dev/urandom, which is probably too expensive to be done once per tile.

Incidentally this is exactly what I'm testing right now. ;=)

Incidentally that is what the code in CVS is doing. Looks like we were working on the same code. We should perhaps start using mutexes on the source code ;)

Sven

Daniel Egger
2005-02-27 17:32:32 UTC (about 19 years ago)

GIMP and multiple processors

On 27.02.2005, at 17:24, Sven Neumann wrote:

Incidentally this is exactly what I'm testing right now. ;=)

Incidentally that is what the code in CVS is doing. Looks like we were working on the same code. We should perhaps start using mutexes on the source code ;)

Heh, you not only did the same thing, you've chosen the exact same approach and function names...

At least I have the performance figures you don't ;-p

Servus, Daniel

Jay Cox
2005-02-27 23:28:47 UTC (about 19 years ago)

GIMP and multiple processors

On Sun, 2005-02-27 at 01:04 +0100, Sven Neumann wrote:

Hi again,

Jay Cox writes:

Clearly the gradient code could use some tuning. A linear blend shouldn't take much more than 1/2 a second even with dithering.

Could we improve this by preparing a reasonably fine array of samples and picking from these samples instead of calling gimp_gradient_get_color_at() over and over again?

That is one obvious optimization.

Other obvious optimizations:

The dither code is way too complex. It looks like it should boil down to: color.{r,g,b,a} += g_rand_int()/RAND_MAX.

We shouldn't need 32 bits of random data per component. 8 bits should do, so we only need one call to g_rand_int per pixel.

For linear blends we should calculate delta values for the factor variable. This will allow us to calculate the factor for each pixel with only one add (plus edge condition checks).
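
A rough sketch of the second point above, one g_rand_int() call per pixel with the 32 bits split into four 8-bit sub-LSB offsets (scaling by 256 plays the role of the RAND_MAX division; variable names are illustrative):

#include <glib.h>

/* Quantize a floating-point RGBA colour (components in 0..1) to
 * 8 bits per channel, dithering with a single 32-bit random value
 * split into four 8-bit chunks -- one g_rand_int() call per pixel
 * instead of one per component. */
static void
quantize_dithered (GRand *rand, const gdouble rgba[4], guint8 out[4])
{
  guint32 bits = g_rand_int (rand);
  gint    i;

  for (i = 0; i < 4; i++)
    {
      gdouble dither = (gdouble) (bits & 0xff) / 256.0;   /* in [0, 1) */
      gint    value  = (gint) (rgba[i] * 255.0 + dither);

      out[i] = CLAMP (value, 0, 255);
      bits >>= 8;
    }
}

The delta idea for the factor variable is independent of this: precompute the per-pixel increment once per row and add it for each pixel instead of recomputing the projection.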

The reason the dithering case gets less of a speedup is that the threads are fighting over the GRand state. Each thread needs its own GRand state.

It looks like the threads are also fighting over gradient->last_visited. My guess is that fixing this will get us much closer to the ideal 2x speed up.

I have eliminated last_visited from the gradient struct. Instead the caller of gimp_gradient_get_color_at() may now do the same optimization without any caching in the gradient itself. I very much doubt that this makes any difference though. Perhaps it would if you benchmarked a gradient blend with a lot of segments. But in the general case there are just a few of them, very often even only a single one.

Now that this race condition is eliminated I might look into adding hooks to the pixel-processor to allow initialisation of per-thread data, like for example a GRand.

I think that is the correct way to do it. It should be done generally enough so that the histogram code can be moved over to use the pixel_region_process_parallel functions.

Jay Cox jaycox@gimp.org

Sven Neumann
2005-02-28 00:25:25 UTC (about 19 years ago)

GIMP and multiple processors

Hi,

Jay Cox writes:

Now that this race condition is eliminated I might look into adding hooks to the pixel-processor to allow initialisation of per-thread data, like for example a GRand.

I think that is the correct way to do it. It should be done generally enough so that the histogram code can be moved over to use the pixel_region_process_parallel functions.

The histogram code does already use the threaded pixel-processor. Do you think we could simplify the code? IMO the current solution isn't all that bad. But I haven't benchmarked it, so I don't really know...

I tried to introduce the per-thread initialization code today but figured that it adds quite some complexity. It could certainly be done but I don't see much need for it any longer.

Sven

Daniel Egger
2005-02-28 02:11:21 UTC (about 19 years ago)

GIMP and multiple processors

On 28.02.2005, at 00:25, Sven Neumann wrote:

The histogram code does already use the threaded pixel-processor. Do you think we could simplify the code? IMO the current solution isn't all that bad. But I haven't benchmarked it, so I don't really know...

I tried to introduce the per-thread initialization code today but figured that it adds quite some complexity. It could certainly be done but I don't see much need for it any longer.

The first goal should be to reduce complexity. Today I looked into parallelizing some *really* slow code: supersampling. Activating it slows down the processing *much* more than just the factor of the supersampling "depth". And hey, the code is ugly; at least a top candidate for the ugliest code left in The GIMP.

NB: Not that I had any idea what supersampling might be good for in the case of gradients, but what do I know...

Servus, Daniel

Jay Cox
2005-02-28 09:34:45 UTC (about 19 years ago)

GIMP and multiple processors

On Mon, 2005-02-28 at 00:25 +0100, Sven Neumann wrote:

Hi,

Jay Cox writes:

Now that this race condition is eliminated I might look into adding hooks to the pixel-processor to allow initialisation of per-thread data, like for example a GRand.

I think that is the correct way to do it. It should be done generaly enough so that the histogram code can be moved over to use the pixel_region_process_parallel functions.

The histogram code does already use the threaded pixel-processor. Do you think we could simplify the code? IMO the current solution isn't all that bad. But I haven't benchmarked it, so I don't really know...

Of course you are correct. I guess it has been a while since I looked at that code... It reads like a clever hack, not an elegant design.

I tried to introduce the per-thread initialization code today but figured that it adds quite some complexity. It could certainly be done but I don't see much need for it any longer.

I think per-thread initialization and finalization would make the histogram code much cleaner (and slightly faster and more scalable). It should also let us more easily parallelize a wider selection of algorithms (not that I can think of any off the top of my head...).
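
One possible shape for such hooks, sketched with hypothetical types and names; none of this is the existing pixel_region_process_parallel() API:

#include <glib.h>

/* Hypothetical per-thread hooks for a parallel region processor:
 * 'init' runs once in each worker before it gets any tiles, 'finish'
 * runs once after its last tile, e.g. to merge a private histogram
 * or free a private GRand. */
typedef gpointer (* ThreadInitFunc)   (gpointer user_data);
typedef void     (* ThreadTileFunc)   (gpointer thread_data, gpointer tile);
typedef void     (* ThreadFinishFunc) (gpointer thread_data, gpointer user_data);

typedef struct
{
  ThreadInitFunc   init;
  ThreadTileFunc   process;
  ThreadFinishFunc finish;
  gpointer         user_data;
} ParallelJob;

/* Worker body: each thread builds its own state up front, so the
 * per-tile path needs no locking at all. */
static void
worker (ParallelJob *job, GAsyncQueue *tiles)
{
  gpointer thread_data = job->init ? job->init (job->user_data) : NULL;
  gpointer tile;

  while ((tile = g_async_queue_try_pop (tiles)) != NULL)
    job->process (thread_data, tile);

  if (job->finish)
    job->finish (thread_data, job->user_data);
}

For the histogram case, init would allocate a private histogram per thread and finish would merge it into the shared result, so the per-tile path stays lock-free.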

Any thoughts on moving some of the _process_parallel stuff to libgimp? I think it would be cool if we could integrate it with the preview code.

Jay Cox jaycox@gimp.org

PS: Shouldn't we be ignoring all this stuff and working on gegl? ;)

GSR - FR
2005-02-28 16:01:15 UTC (about 19 years ago)

GIMP and multiple processors

Hi,
de@axiros.com (2005-02-28 at 0211.21 +0100):

NB: Not that I had any idea what supersampling might be good for in the case of gradients, but what do I know...

High-frequency gradients or abrupt changes; compare the yellow-to-black transition on the left side with the one on the right side of http://www.infernal-iceberg.com/gimp/tmp/gradient-supersampling-crop.png

Of course, it depends on the display and the eyes/brain; some people see the problem more than others. It's the same issue as with dithering: it depends on the case.

GSR

Sven Neumann
2005-02-28 21:15:57 UTC (about 19 years ago)

GIMP and multiple processors

Hi,

Daniel Egger writes:

The first goal should be to reduce complexity. Today I looked into parallelizing some *really* slow code: supersampling. Activating it slows down the processing *much* more than just the factor of the supersampling "depth". And hey, the code is ugly; at least a top candidate for the ugliest code left in The GIMP.

Since supersampling is in libgimpcolor, it will probably be difficult to improve without breaking backwards compatibility. But of course no one forces us to actually use the functionality in libgimpcolor...

Sven

Daniel Egger
2005-02-28 21:33:25 UTC (about 19 years ago)

GIMP and multiple processors

On 28.02.2005, at 21:15, Sven Neumann wrote:

Since supersampling is in libgimpcolor, it will probably be difficult to improve without breaking backwards compatibility. But of course no one forces us to actually use the functionality in libgimpcolor...

But, and this is one of many upsides of improving library code, all users of supersampling will benefit at the same time.

Maybe it would be best if someone could come up with a reasonable example of what supersampling might be useful for in the gradient blend code, other than slowing everything down...

Servus, Daniel

Sven Neumann
2005-03-01 00:35:10 UTC (about 19 years ago)

GIMP and multiple processors

Hi,

Jay Cox writes:

The dither code is way too complex. It looks like it should boil down to: color.{r,g,b,a} += g_rand_int()/RAND_MAX.

We shouldn't need 32 bits of random data per component. 8 bits should do, so we only need one call to g_rand_int per pixel.

Good point. This made quite a difference. The dithered case is now only about 10% slower than the non-dithered one :)

Sven

Adam D. Moss
2005-03-01 01:44:44 UTC (about 19 years ago)

GIMP and multiple processors

Daniel Egger wrote:

Maybe it would be best if someone could come up with a reasonable example of what supersampling might be useful for in the gradient blend code, other than slowing everything down...

For an A->B linear blend it probably doesn't noticeably help much (unless, possibly, it's a long colour range in a small area). But GIMP gradients *can* be fairly arbitrary user-defined 1D designs including near-discontinuities, which benefit from supersampling in the same way that a stroke, line or polygon benefits from antialiasing.

--Adam