Basic benchmark script for a very simple kernel loop #1

erikvansebille wants to merge 3 commits into `main` from
Conversation
OK, so a first morning of benchmarking with this very simple script resulted in the graph below: Parcels v3 is extremely fast, and the current v4-dev is very slow.

I first tried changing back to using time as float (Parcels-code/Parcels#2090). Some further digging led to the realisation that the main bottleneck is actually the updating of the value of a particle attribute, in `__setattr__`:

```python
def __setattr__(self, name, value):
    if name in ["_data", "_index"]:
        object.__setattr__(self, name, value)
    else:
        self._data[name][self._index] = value
```

That last line is very slow; simply changing it (Parcels-code/Parcels#2092) to

```diff
- self._data[name][self._index] = value
+ self._data[name].data[self._index] = value
```

improved the performance by 85%. Note that the time-as-float change does not seem to further improve performance on top of this setattr change.

Now, to further explore what kind of performance boost we can expect, I also wrote a PR that does not set the value of particles at all (Parcels-code/Parcels#2093). Here, the setattr function is simply

```python
def __setattr__(self, name, value):
    if name in ["_data", "_index"]:
        object.__setattr__(self, name, value)
    else:
        pass
```

This leads to a speed increase of 95%.

So in summary: a major bottleneck seems to be the setattr in the new xarray particle class. In v3 we use a dictionary of arrays, so I'll spend some time now to see what the performance gain is if we change the v4 code from xarray to a dictionary.
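To illustrate the dictionary-of-arrays direction mentioned above, here is a minimal, hypothetical sketch (not the actual Parcels class; plain Python lists stand in for the numpy arrays Parcels would use) of a particle view whose `__setattr__` writes straight into a shared backing store:

```python
class ParticleView:
    """Hypothetical sketch of a particle backed by a dict of arrays.

    Plain lists stand in for numpy arrays; the point is that the write in
    __setattr__ is a cheap plain-array assignment rather than a call into
    xarray indexing machinery.
    """

    def __init__(self, data, index):
        # bypass our own __setattr__ for the two bookkeeping attributes
        object.__setattr__(self, "_data", data)
        object.__setattr__(self, "_index", index)

    def __setattr__(self, name, value):
        if name in ["_data", "_index"]:
            object.__setattr__(self, name, value)
        else:
            # direct write into the shared backing store
            self._data[name][self._index] = value

    def __getattr__(self, name):
        # read back from the same backing store
        return self._data[name][self._index]


data = {"lon": [0.0] * 4, "lat": [0.0] * 4}
p = ParticleView(data, 2)
p.lon = 3.5
print(data["lon"])  # → [0.0, 0.0, 3.5, 0.0]
```

Because the view holds only a reference to the store, every write is immediately visible to anything else holding the same dictionary.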
Another update: I changed the data structure for the ParticleSet (the dictionary-of-arrays approach mentioned above). In summary, the timings are now down considerably.
I added a few smaller fixes in Parcels-code/Parcels#2096, which increase the kernel loop performance a bit more (for reference, in combination with Parcels-code/Parcels#2094).
Is this with the latest versions of the PRs, where Parcels-code/Parcels#2096 has been merged in?
No, I'll re-run.
Another update on the timings, now with the changes from the kernel loop optimisation (Parcels-code/Parcels#2096) included.

So we are now slightly faster than v3-Scipy for Parcels-code/Parcels#2094! And the fact that Parcels-code/Parcels#2093 takes effectively 0 seconds means that we have no serious overhead except the ParticleSet setattr.
Ran some (updated) profiling for the branches:

- time as float (Parcels-code/Parcels#2090) (NOT updated with Parcels-code/Parcels#2096)
- xarray `.data` setattr (Parcels-code/Parcels#2092)
- no setattr (Parcels-code/Parcels#2093)
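For reference, profiles like the ones above can be produced with the standard-library profiler. A minimal sketch, where `hot_loop` is just a hypothetical stand-in workload for `pset.execute()`:

```python
import cProfile
import io
import pstats


def hot_loop(n):
    # stand-in workload for pset.execute()
    total = 0.0
    for i in range(n):
        total += i * 0.5
    return total


profiler = cProfile.Profile()
profiler.enable()
hot_loop(100_000)
profiler.disable()

stream = io.StringIO()
# sort by cumulative time and show only the top entries, similar to
# trimming everything below a % threshold in the call graphs
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```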
OK, and do you also have the profile of Parcels-code/Parcels#2094?
Some of the graphs got really messy, so I tuned the % threshold to show (in some I trimmed off anything below 1%).
Thanks for these profiling graphs! Would you agree that these graphs clearly show that calling a setattr on the xarray-backed particle class is the bottleneck?
Yes, I don't think that there is really a way to get around this in xarray-world (for completeness, over the coming week or two I might put together a post to send to Pangeo asking about this use case, along with a minimal example).
One second, actually: just looking at the profile, this means that on every iteration a new data array is constructed (plus a bunch of other xarray machinery is invoked). Isn't there a way that we can get the best of both worlds? Or is that needlessly complex? (All good if it's the latter.)
> Isn't there a way that we can get the best of both worlds?

It's a really cool and original idea! I just implemented it in Parcels-code/Parcels#2097; is that what you had in mind? Long live copy-by-reference ;-)
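The copy-by-reference idea can be sketched roughly as follows (hypothetical names; plain lists stand in for the numpy backing arrays): fetch a reference to each backing array once before the kernel loop, so the per-particle writes inside the hot loop are plain array assignments that stay visible through the original container.

```python
def execute_kernel(pset_data, npart):
    # hoist the attribute lookups out of the hot loop: grab a reference
    # to each backing array once, instead of going through the container
    # (and its setattr machinery) on every iteration
    lon = pset_data["lon"]
    lat = pset_data["lat"]
    for i in range(npart):
        lon[i] += 0.1  # cheap in-place write through the reference
        lat[i] += 0.2


# plain lists stand in for the numpy arrays Parcels uses
pset_data = {"lon": [0.0] * 3, "lat": [0.0] * 3}
execute_kernel(pset_data, 3)
print(pset_data["lon"])  # → [0.1, 0.1, 0.1]
```

Since `lon` and `lat` are references (not copies), the container sees every update without any write-back step at the end of the loop.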




A first, very simple script that can be used for benchmarking the Kernel loop itself. Since the Kernel does nothing, there is no field access or call to interpolation methods.
With the current commit of v4-dev (488e3fb), the scaling is pretty poor.
The timing on my machine for `pset.execute()` for `npart` = 1, 10, 100, 1000 and 2500 particles is 0:01, 0:10, 1:36, 15:48 and 37:57 minutes, respectively. That's nicely linear scaling, but of course not at all efficient.

For `main` (i.e. Parcels v3), `pset.execute()` takes 0 seconds for all values of `npart`, irrespective of the number of particles.

I'll start digging in to this poor scaling of the Kernel/`pset.execute()` loop with particle number, and hopefully find an improvement soon.
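The benchmark described above can be sketched roughly like this (hypothetical stand-ins: `DoNothing` mimics the empty Kernel, and the plain loop over particles mimics what `pset.execute()` does per timestep):

```python
import time


def DoNothing(particle):
    # empty kernel: no field access, no interpolation
    pass


def benchmark(npart_values):
    """Time the bare kernel loop for a range of particle counts."""
    timings = {}
    for npart in npart_values:
        particles = list(range(npart))
        t0 = time.perf_counter()
        for p in particles:
            DoNothing(p)
        timings[npart] = time.perf_counter() - t0
    return timings


for npart, seconds in benchmark([1, 10, 100]).items():
    print(f"npart={npart}: {seconds:.6f} s")
```

With a do-nothing kernel, any time measured here is pure loop overhead, which is exactly what the graphs above are comparing across branches.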