
feat: Docs, benchmarking and optimisation #6

Draft
daconjurer wants to merge 7 commits into main from chore/document-and-benchmark

Conversation

@daconjurer
Owner

@daconjurer daconjurer commented Mar 1, 2026

Changes in this PR add doc comments, a benchmark function (added to the Python CLI) and a few optimisations:

  • changing the type of the input member of Tokenizer here
  • using slices of that input instead of String instances when consuming numbers and strings in tokenize()
  • moving the format! macro call into the err_on_missing_expected_comma() function here, so that the String is not eagerly allocated through Debug
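The last optimisation above can be sketched as follows (a minimal, self-contained sketch; the error enum and exact signatures are illustrative, not the crate's actual API). The point is that format! only runs when an error value is actually constructed, so the success path never allocates a String:

```rust
// Illustrative error type; the real crate's error enum likely differs.
#[derive(Debug, PartialEq)]
enum JsonError {
    MissingExpectedComma(String),
}

fn err_on_missing_expected_comma(found: char) -> JsonError {
    // format! executes only inside the error path, never on the happy path
    JsonError::MissingExpectedComma(format!("expected ',', found '{found}'"))
}

fn expect_comma(c: char) -> Result<(), JsonError> {
    if c == ',' {
        Ok(())
    } else {
        Err(err_on_missing_expected_comma(c))
    }
}

fn main() {
    assert!(expect_comma(',').is_ok());
    assert!(matches!(
        expect_comma('}'),
        Err(JsonError::MissingExpectedComma(_))
    ));
    println!("ok");
}
```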

The parser (through the Python bindings) is still slower than the Python implementations, so I added the pure-Rust time to the benchmark output, just for fun and as a reference for how much overhead the Python bindings add.

@daconjurer daconjurer self-assigned this Mar 1, 2026
@daconjurer daconjurer requested a review from jhodapp March 2, 2026 22:35
@daconjurer
Owner Author

The output of the benchmark flag looks a bit like this:

[screenshot of the benchmark output]

@jhodapp jhodapp left a comment
Collaborator

Looking really, really good Victor! The documentation in particular is thorough, helpful, and professional.

You can definitely be proud of this complete Rust JSON parser with a really slick Python CLI wrapper on top.

I'm going to intuit that you are feeling pretty well prepared now to mix Rust with Python for future endeavors, and possibly even spearhead some Rust-only projects. You seem ready from what I can tell, and I look forward to hearing your thoughts on this.

Look at this performance on my MacBook Air M1:

  ┌─────────────┬──────────┬─────────────┬─────────────────┬─────────────────┬────────────┐
  │    File     │   Size   │ Rust (pure) │ Rust + bindings │ Python json (C) │ simplejson │
  ├─────────────┼──────────┼─────────────┼─────────────────┼─────────────────┼────────────┤
  │ small.json  │ 85 B     │ 1.2 µs      │ 1.6 µs          │ 1.2 µs          │ 9.2 µs     │
  ├─────────────┼──────────┼─────────────┼─────────────────┼─────────────────┼────────────┤
  │ medium.json │ 16.6 KB  │ 126 µs      │ 161 µs          │ 90 µs           │ 1.1 ms     │
  ├─────────────┼──────────┼─────────────┼─────────────────┼─────────────────┼────────────┤
  │ large.json  │ 178.4 KB │ 735 µs      │ 937 µs          │ 782 µs          │ 7.8 ms     │
  └─────────────┴──────────┴─────────────┴─────────────────┴─────────────────┴────────────┘

What's Working Well

  1. Extensive documentation beyond the minimum: You documented every public item, not just the required ones, with working examples and Python-style docstrings on all PyO3 functions.

  2. Sophisticated benchmarking design: You independently added a 4th "pure Rust" benchmark column to isolate binding overhead, and built in warmup rounds, auto-scaling, and with_capacity for the timing vectors.

  3. Thoughtful performance optimizations in the tokenizer: The refactor to &'input str with byte-level scanning, the fast-path/slow-path split in consume_string(), and slice-based consumption in consume_number()/consume_keyword() are well-executed.

  4. Polished CLI experience: Using argparse with --rounds, --warmup, directory-based benchmarking, auto-discovery, and human-readable file sizes goes well beyond the curriculum's sys.argv suggestion.

Areas for Improvement

  1. Clippy warning: redundant field name in Tokenizer
  2. Use of unwrap() in the median() function

Comment on lines +160 to +161
let result = parse_file(path)?;
result.into_pyobject(py)
Collaborator

You could improve this even further into a one-liner that I think doesn't make it any less understandable: parse_file(path)?.into_pyobject(py)
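A standalone sketch of the pattern (plain Result types standing in for PyResult; the names parse_len and wrapped are illustrative): `?` propagates the error first, and the success value is converted in the same expression.

```rust
// Hypothetical stand-in for parse_file: returns a Result just like PyResult would.
fn parse_len(path: &str) -> Result<usize, String> {
    if path.is_empty() {
        Err("empty path".into())
    } else {
        Ok(path.len())
    }
}

fn wrapped(path: &str) -> Result<String, String> {
    // One-liner equivalent of:
    //   let result = parse_len(path)?;
    //   Ok(result.to_string())
    Ok(parse_len(path)?.to_string())
}

fn main() {
    assert_eq!(wrapped("abc"), Ok("3".to_string()));
    assert!(wrapped("").is_err());
    println!("ok");
}
```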

) -> PyResult<Bound<'py, PyDict>> {
let n = rounds as usize;

// --- Rust (with no bindings) ---
Collaborator

Not major, but this method benchmark_performance() could read even better if you split the 4-5 sections of it up into separate smaller methods. You wouldn't even need the comments then if you name your methods effectively enough.
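The suggested structure might look something like this (an illustrative sketch only; the method names are invented and the timing bodies are placeholders, not the PR's benchmark code):

```rust
struct Bench {
    rounds: usize,
}

impl Bench {
    // Each former comment-delimited section becomes a small named method.
    fn run_pure_rust(&self) -> Vec<f64> {
        // Placeholder timings; the real method would time the pure-Rust parser
        (0..self.rounds).map(|i| i as f64).collect()
    }

    fn run_with_bindings(&self) -> Vec<f64> {
        // Placeholder timings; the real method would go through the bindings
        (0..self.rounds).map(|i| i as f64 * 1.3).collect()
    }

    fn benchmark_performance(&self) -> (Vec<f64>, Vec<f64>) {
        // No section comments needed: the method names carry the meaning
        (self.run_pure_rust(), self.run_with_bindings())
    }
}

fn main() {
    let b = Bench { rounds: 3 };
    let (pure, bound) = b.benchmark_performance();
    assert_eq!(pure.len(), 3);
    assert_eq!(bound.len(), 3);
    println!("ok");
}
```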

Some(b)
}

fn _input_slice_to_string(&self, start: usize, end: usize) -> String {
Collaborator

You don't need to prefix this method name with a '_'. In Rust, a leading underscore conventionally marks an intentionally unused variable or function (it silences unused warnings), so a helper that is actually called should use a plain name.
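A quick illustration of the convention (the helper here is a simplified stand-in for the PR's method, taking the input explicitly rather than via &self):

```rust
// A called helper gets a plain name; no leading underscore.
fn input_slice_to_string(input: &str, start: usize, end: usize) -> String {
    input[start..end].to_string()
}

fn main() {
    // The leading underscore is for values you deliberately don't use:
    // it suppresses the unused-variable warning, nothing more.
    let _unused = 42;

    assert_eq!(input_slice_to_string("hello", 1, 4), "ell");
    println!("ok");
}
```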

'"' => {
self.advance(); // consume closing quote
return Ok(consumed_string);
fn consume_string_slow(&mut self, s: &mut String) -> JsonResult<String> {
Collaborator

Is this named consume_string_slow because it's a pretty unoptimized method?
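For context, a common shape for this kind of split (a sketch under assumptions, not the PR's actual implementation): the fast path scans bytes and copies the span in one shot when no escape appears, and only the slow path decodes escapes character by character into an owned buffer.

```rust
// Sketch of a fast-path/slow-path string scanner. `input` is the text
// after the opening quote; names and escape coverage are illustrative.
fn consume_string(input: &str) -> Option<String> {
    // Fast path: look for a plain closing quote with no escapes in between.
    for (i, &b) in input.as_bytes().iter().enumerate() {
        match b {
            b'"' => return Some(input[..i].to_string()),
            b'\\' => return consume_string_slow(input),
            _ => {}
        }
    }
    None // unterminated string
}

// Slow path: decode escape sequences into an owned buffer.
fn consume_string_slow(input: &str) -> Option<String> {
    let mut out = String::new();
    let mut chars = input.chars();
    while let Some(c) = chars.next() {
        match c {
            '"' => return Some(out),
            '\\' => match chars.next()? {
                'n' => out.push('\n'),
                't' => out.push('\t'),
                '"' => out.push('"'),
                '\\' => out.push('\\'),
                other => out.push(other),
            },
            other => out.push(other),
        }
    }
    None
}

fn main() {
    assert_eq!(consume_string("abc\"rest"), Some("abc".to_string()));
    assert_eq!(consume_string("a\\nb\"rest"), Some("a\nb".to_string()));
    assert_eq!(consume_string("unterminated"), None);
    println!("ok");
}
```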

Comment on lines +201 to +208
times.select_nth_unstable_by(mid, |a, b| a.partial_cmp(b).unwrap());
if times.len() % 2 == 1 {
times[mid]
} else {
let left = *times[..mid]
.iter()
.max_by(|a, b| a.partial_cmp(b).unwrap())
.unwrap();
Collaborator

Can you figure out a way of getting rid of these unwrap() uses? Hint for you: are there error cases to handle or surface, or would a default value suffice, or mapping to something else?
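One way to drop the unwrap() calls (a sketch, not the only answer to the hint): f64::total_cmp gives a total ordering over floats, including NaN, so no partial_cmp/unwrap is needed, and an empty input is surfaced as None instead of a panic.

```rust
// Median of an f64 slice without unwrap(): total_cmp replaces
// partial_cmp, and the empty case returns None.
fn median(times: &mut [f64]) -> Option<f64> {
    if times.is_empty() {
        return None;
    }
    let mid = times.len() / 2;
    times.select_nth_unstable_by(mid, |a, b| a.total_cmp(b));
    if times.len() % 2 == 1 {
        Some(times[mid])
    } else {
        // Even length: average the two middle elements.
        let left = times[..mid].iter().copied().max_by(|a, b| a.total_cmp(b))?;
        Some((left + times[mid]) / 2.0)
    }
}

fn main() {
    assert_eq!(median(&mut [3.0, 1.0, 2.0]), Some(2.0));
    assert_eq!(median(&mut [4.0, 1.0, 2.0, 3.0]), Some(2.5));
    let mut empty: [f64; 0] = [];
    assert_eq!(median(&mut empty), None);
    println!("ok");
}
```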

