Skip to content

Conversation

@dkazanc
Copy link
Collaborator

@dkazanc dkazanc commented Jan 19, 2026

This moves block.to_cpu() from writing doing the transfer in dataset_store, which doesn't seem the right place. I've added some logic around how to recognise that the block needs to be moved. Also to be able to clock with the monitor the device to host transfer time.
The resulting csv file reflects the intended behavior. Few tests to fix.

Checklist

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have made corresponding changes to the documentation

@dkazanc dkazanc marked this pull request as ready for review January 20, 2026 10:25
@dkazanc
Copy link
Collaborator Author

dkazanc commented Jan 20, 2026

@yousefmoazzam I think it is ready but I couldn't get my head around one last failing test: test_execute_section_for_block. Can you have a look please? cheers

@dkazanc dkazanc requested a review from yousefmoazzam January 20, 2026 10:27
Comment on lines 286 to 299
self, method: MethodWrapper, block: DataSetBlock, convert_gpu_block_to_cpu: bool
) -> DataSetBlock:
start = time.perf_counter_ns()
block = method.execute(block)
end = time.perf_counter_ns()

if block.is_last_in_chunk:
self.append_side_outputs(method.get_side_output())

if convert_gpu_block_to_cpu:
with catchtime() as t:
block.to_cpu()
method.gpu_time.device2host += t.elapsed

end = time.perf_counter_ns()
Copy link
Collaborator

@yousefmoazzam yousefmoazzam Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, having a parameter about block transferring in a method called _execute_method() feels a bit odd. I'd naively expect a method called _execute_method() to do stuff about method execution only, and not do data transfer things but rather have the transfer happen somewhere else.

The idea behind this change I believe is to transfer a GPU block at the end of a section to CPU, before the block is given to the dataset store, is that correct? If so, it seems like the transfer might make more sense to go into a method handling section execution, the _execute_section_block() method maybe.

If the data transfer is done there instead of here, then there's no need to pass this convert_gpu_block_to_cpu parameter, and the tests actually don't need any changes to keep passing I think.

This approach also has the advantage that the data transfer from GPU to CPU after the method completes doesn't accidentally get put into the clocking for the method execution (the clocking of the method execution is in _execute_method(), whereas the data transfer from GPU to CPU would happen in _execute_section_block()), which is probably preferable. Currently, it looks like the data transfer form GPU to CPU is included in the method execution time, which we may not want.

Therefore, I'd tentatively suggest moving the following block of code:

        if convert_gpu_block_to_cpu:
            with catchtime() as t:
                block.to_cpu()
            method.gpu_time.device2host += t.elapsed

to the _execute_section_block() method after line 254, and see how that feels.

Copy link
Collaborator Author

@dkazanc dkazanc Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the insight @yousefmoazzam . I originally implemented exactly as you suggest, so that block to cpu transfer happens in _execute_section_block(). The problem was, however, is to make report_method_block to work correctly. Outside of _execute_method I had no access to the clocked time that is related to block transfers. So it was then I decided to pass this flag to _execute_method and make it aware of the last block processing.

So with that: method.gpu_time.device2host += t.elapsed I'm adding the expected D2H time while being withing the monitor reach. Even though outside of the transfer function within the method.execute, it is still recorded. Also note that it is wrapped within start-end time for it to be a part of the whole business about I/O and method kernel execution for the block. That is how we normally record the transfers, and the numbers do seem to add up while I'm interpreting the CSV file.

I guess with your proposition the whole monitor business should be moved outside of _execute_method to _execute_section_block ? Would it something we should try then?

Copy link
Collaborator

@yousefmoazzam yousefmoazzam Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, yeah I understand what you mean: the call to report_method_block() requires the GPU to CPU transfer time, and that call happens inside _execute_method(), so if the GPU to CPU transfer happens after _execute_method(), then that info can't be passed to report_method_block().

Yeah, you're right that my proposition would probably require moving report_method_block(). Maybe it's worth doing, but perhaps not right now - it might be better to do that reorganising of where things are clocked when we have a fuller picture of things.

Let me know what you'd like to do, but feel free to ignore my initial suggestion and defer it till later, I could understand not wanting to tackle this properly right now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved the code, is it something like this we'd prefer? Haven't fixed the tests yet. I guess it will be unfixing the fixed ones )

Comment on lines +255 to +263
start = time.perf_counter_ns()
block = self._execute_method(method, block)

if convert_gpu_block_to_cpu:
with catchtime() as t:
block.to_cpu()
method.gpu_time.device2host += t.elapsed

end = time.perf_counter_ns()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this latest change in general looks like a reasonable way of moving the call to report_method_block().

If we'd like method execution time to not include the transfer time (I'm not sure if we wanted to make this change at all, or in the future but not yet), one small detail could be to move the end of the method clocking to be before the transfer from GPU to CPU.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants