Skip to content

Return errors more often#160

Closed
ConradIrwin wants to merge 1 commit into
kvark:mainfrom
zed-industries:no-panic-in-submit
Closed

Return errors more often#160
ConradIrwin wants to merge 1 commit into
kvark:mainfrom
zed-industries:no-panic-in-submit

Conversation

@ConradIrwin

Copy link
Copy Markdown
Contributor

At Zed, by far the most common panics we experience across the userbase
are from Linux users who crash during a draw call in the Blade
renderer.

We'd like to be able to detect that something has gone wrong and show a
notification via dbus to help people understand the problem and debug.

This PR is a work in progress, as before I completely change the API I wanted
to check that this is the right direction.

Even with these changes I'd stil like to try and reduce the frequency that
users see these problems, but I'm not sure how to do that.

At Zed, by far the most common panics we experience across the userbase
are from Linux users who crash during a `draw` call in the Blade
renderer.

We'd like to be able to detect that something has gone wrong and show a
notification via dbus to help people understand the problem and debug.
@kvark

kvark commented Aug 23, 2024

Copy link
Copy Markdown
Owner

I'd strongly lean on the way of ergonomics. Blade isn't fully safe by design (that's what wgpu is for). So it can't comprehensively return errors from submit() anyway. Instead, the user should be responsible for not doing silly things that crash GPU or cause UB.
What kind of errors are you trying to surface up to the user? I believe any error in submit() should be treated like "internal" and addressed as a bug (preferably, quickly).

@ConradIrwin

Copy link
Copy Markdown
Contributor Author

@kvark In the last 2 hours:

  • 9 of:
GPU has crashed, and no debug information is available.
core::panicking::panic_fmt
<blade_graphics::hal::Context as blade_graphics::traits::CommandDevice>::submit
gpui::platform::blade::blade_renderer::BladeRenderer::draw
<gpui::platform::linux::x11::window::X11Window as gpui::platform::PlatformWindow>::draw
  • 7 of
Aquire image error ERROR_SURFACE_LOST_KHR
core::panicking::panic_fmt
blade_graphics::hal::init::<impl blade_graphics::hal::Context>::acquire_frame
gpui::platform::blade::blade_renderer::BladeRenderer::draw
<gpui::platform::linux::x11::window::X11Window as 
  • 3 of:
called `Result::unwrap()` on an `Err` value: ERROR_INITIALIZATION_FAILED
core::panicking::panic_fmt
core::result::unwrap_failed
blade_graphics::hal::init::<impl blade_graphics::hal::Context>::init_impl
<gpui::platform::linux::x11::client::X11Client as gpui::platform::linux::platform::LinuxClient>::open_window
  • 1 of:
called `Result::unwrap()` on an `Err` value: ERROR_OUT_OF_HOST_MEMORY
core::panicking::panic_fmt
core::result::unwrap_failed
blade_graphics::hal::init::<impl blade_graphics::hal::Context>::resize

Unfortunately the way our panic reporting works, it's tricky for me to get more information about these. I also know that some of these are caused by the initialization failures on intel we havent' fixed yet: #144.

@kvark

kvark commented Aug 25, 2024

Copy link
Copy Markdown
Owner

Thanks for the info! That's quite high of a number. I wish we had one of the users to work with...
Would you have a way to collect hardware infos from those users? Something like vulkaninfo plus the log coming from Blade (e.g. at "Info" level) would help a lot.

@ConradIrwin

Copy link
Copy Markdown
Contributor Author

In theory we could run vulkaninfo during the panic handler (or maybe during the upload process), but it makes me a bit squeamish to upload everything.

Are there other specific things you think we should check for that would make the output size smaller / could be queried in-process? or is it more of an unknown unknown at this point?

@kvark

kvark commented Aug 29, 2024

Copy link
Copy Markdown
Owner

Are you concerned that VulkanInfo contains too much private information about the machine? I imagine this would be behind some kind of uesr concern.
At the very minimum, we'd want to know the instance version/extensions, and the list of physical devices and the basic info about them: vendor, driver version, supported extensions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants