-
Notifications
You must be signed in to change notification settings - Fork 4
Block to cpu move to _execute_section_block #675
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@yousefmoazzam I think it is ready but I couldn't get my head around one last failing test: |
httomo/runner/task_runner.py
Outdated
| self, method: MethodWrapper, block: DataSetBlock, convert_gpu_block_to_cpu: bool | ||
| ) -> DataSetBlock: | ||
| start = time.perf_counter_ns() | ||
| block = method.execute(block) | ||
| end = time.perf_counter_ns() | ||
|
|
||
| if block.is_last_in_chunk: | ||
| self.append_side_outputs(method.get_side_output()) | ||
|
|
||
| if convert_gpu_block_to_cpu: | ||
| with catchtime() as t: | ||
| block.to_cpu() | ||
| method.gpu_time.device2host += t.elapsed | ||
|
|
||
| end = time.perf_counter_ns() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, having a parameter about block transferring in a method called _execute_method() feels a bit odd. I'd naively expect a method called _execute_method() to do stuff about method execution only, and not do data transfer things but rather have the transfer happen somewhere else.
The idea behind this change I believe is to transfer a GPU block at the end of a section to CPU, before the block is given to the dataset store, is that correct? If so, it seems like the transfer might make more sense to go into a method handling section execution, the _execute_section_block() method maybe.
If the data transfer is done there instead of here, then there's no need to pass this convert_gpu_block_to_cpu parameter, and the tests actually don't need any changes to keep passing I think.
This approach also has the advantage that the data transfer from GPU to CPU after the method completes doesn't accidentally get put into the clocking for the method execution (the clocking of the method execution is in _execute_method(), whereas the data transfer from GPU to CPU would happen in _execute_section_block()), which is probably preferable. Currently, it looks like the data transfer form GPU to CPU is included in the method execution time, which we may not want.
Therefore, I'd tentatively suggest moving the following block of code:
if convert_gpu_block_to_cpu:
with catchtime() as t:
block.to_cpu()
method.gpu_time.device2host += t.elapsedto the _execute_section_block() method after line 254, and see how that feels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the insight @yousefmoazzam . I originally implemented exactly as you suggest, so that block to cpu transfer happens in _execute_section_block(). The problem was, however, is to make report_method_block to work correctly. Outside of _execute_method I had no access to the clocked time that is related to block transfers. So it was then I decided to pass this flag to _execute_method and make it aware of the last block processing.
So with that: method.gpu_time.device2host += t.elapsed I'm adding the expected D2H time while being withing the monitor reach. Even though outside of the transfer function within the method.execute, it is still recorded. Also note that it is wrapped within start-end time for it to be a part of the whole business about I/O and method kernel execution for the block. That is how we normally record the transfers, and the numbers do seem to add up while I'm interpreting the CSV file.
I guess with your proposition the whole monitor business should be moved outside of _execute_method to _execute_section_block ? Would it something we should try then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see, yeah I understand what you mean: the call to report_method_block() requires the GPU to CPU transfer time, and that call happens inside _execute_method(), so if the GPU to CPU transfer happens after _execute_method(), then that info can't be passed to report_method_block().
Yeah, you're right that my proposition would probably require moving report_method_block(). Maybe it's worth doing, but perhaps not right now - it might be better to do that reorganising of where things are clocked when we have a fuller picture of things.
Let me know what you'd like to do, but feel free to ignore my initial suggestion and defer it till later, I could understand not wanting to tackle this properly right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've moved the code, is it something like this we'd prefer? Haven't fixed the tests yet. I guess it will be unfixing the fixed ones )
| start = time.perf_counter_ns() | ||
| block = self._execute_method(method, block) | ||
|
|
||
| if convert_gpu_block_to_cpu: | ||
| with catchtime() as t: | ||
| block.to_cpu() | ||
| method.gpu_time.device2host += t.elapsed | ||
|
|
||
| end = time.perf_counter_ns() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this latest change in general looks like a reasonable way of moving the call to report_method_block().
If we'd like method execution time to not include the transfer time (I'm not sure if we wanted to make this change at all, or in the future but not yet), one small detail could be to move the end of the method clocking to be before the transfer from GPU to CPU.
This moves
block.to_cpu()from writing doing the transfer in dataset_store, which doesn't seem the right place. I've added some logic around how to recognise that the block needs to be moved. Also to be able to clock with the monitor the device to host transfer time.The resulting csv file reflects the intended behavior. Few tests to fix.
Checklist