Skip to content

[WIP] Zero Copy String Deserialization#88

Open
ParkMyCar wants to merge 2 commits into
mainfrom
string-zero-copy
Open

[WIP] Zero Copy String Deserialization#88
ParkMyCar wants to merge 2 commits into
mainfrom
string-zero-copy

Conversation

@ParkMyCar

Copy link
Copy Markdown
Contributor

I thought this was a pretty interesting task, adding zero copy de-serialization for Strings! Still a work in progress, but basically I created a type StrBytes which is a wrapper around a Bytes struct, and on creation we assert it's valid UTF-8, which it should be because based on the protobuf spec, strings are encoded in valid UTF-8.

Benchmarks (2014 MacBook Pro with an i7)

test bench::benches::bench_deserialize_string                               ... bench:     106,025 ns/iter (+/- 12,029)
test bench::benches::bench_deserialize_vec_bytes                            ... bench:     259,522 ns/iter (+/- 20,860)
test bench::benches::bench_deserialize_zero_copy_bytes                      ... bench:          98 ns/iter (+/- 14)
test bench::benches::bench_deserialize_zero_copy_string                     ... bench:      42,821 ns/iter (+/- 15,625)
test bench::prost::bench_deserialize_prost_bytes                            ... bench:     266,399 ns/iter (+/- 66,497)
test bench::prost::bench_deserialize_prost_string                           ... bench:     107,524 ns/iter (+/- 8,994)
test bench::rust_protobuf::bench_deserialize_rust_protobuf_zero_copy_bytes  ... bench:          49 ns/iter (+/- 10)
test bench::rust_protobuf::bench_deserialize_rust_protobuf_zero_copy_string ... bench:      44,083 ns/iter (+/- 12,499)

Note: The reason zero copy strings are not as fast as zero copy bytes is because we do the extra validation step

@ParkMyCar

Copy link
Copy Markdown
Contributor Author

Because Strings encoded in a proto message should be UTF8, I added a feature flag zero_copy_string_no_utf8_check, to skip utf8 validation, when using zero-copy strings. Using this flag, we get performance similar to zero copy bytes

test bench::benches::bench_deserialize_zero_copy_bytes                      ... bench:          98 ns/iter (+/- 4)
test bench::benches::bench_deserialize_zero_copy_string                     ... bench:         101 ns/iter (+/- 2)

@nipunn1313 nipunn1313 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks super good!
Saw that tests weren't passing - so poke at that a bit more.

Comment thread examples/Cargo.toml
[dependencies]
bytes = "0.5.6"
pb-jelly = "0.0.5"
pb-jelly = { path = "../pb-jelly" }

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html#multiple-locations - you can actually specify both path and version

Actually, in this case, since examples aren't being posted to crates.io, we probably should keep it as only a path dependency.

versions = {
u"lazy_static": u' version = "1.4.0" ',
u"pb-jelly": u' version = "0.0.5" ',
u"pb-jelly": u' path = "../../../../../pb-jelly" ',

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm, we gotta find a better way to do the right thing here =\

Bazel and Spec.toml totally sidestep this issue - it's actually the entire point of Spec.toml - to template the location/version of the dependency.

This seems like a place where we'd accidentally have the wrong version provided - and it seems like our test suite will not actually be using the right pb-jelly dependency here, unless you make this change.

Maybe one idea - is that the version of pb-jelly-gen could get passed in as an argument to codegen.py - where our testsuite could inject ../../../../../pb-jelly. Then we're not hard coding a version number here.

Perhaps an idea for a separate issue?

Comment thread pb-jelly/Cargo.toml
# of bytes is UTF-8 on deserialization shouldn't be needed. As a precaution though, we do this
# validation step. But validation can take a relatively large amount of time, so for users
# who wish to skip this validation, we provide the `zero_copy_string_no_utf8_check` feature flag.
zero_copy_string_no_utf8_check = []

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for cargo bench to test with and w/o this feature flag?
Seems like it might not be possible w/ just how cargo bench works.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants