Conversation
… instead of converting to [char]
… of a char; chore: use slices of input in string and number consuming functions
…comma() function; chore(parser): remove clone() from advance()
jhodapp
left a comment
Looking really, really good Victor! The documentation in particular is thorough, helpful, and professional.
You can definitely be proud of this complete Rust JSON parser with a really slick Python CLI wrapper on top.
I'm going to intuit that you are feeling pretty well prepared now to mix Rust with Python for future endeavors, and possibly even spearhead some Rust-only projects. You seem ready from what I can tell, and I look forward to hearing your thoughts on this.
Look at this performance on my MacBook Air M1:
┌─────────────┬──────────┬─────────────┬─────────────────┬─────────────────┬────────────┐
│ File │ Size │ Rust (pure) │ Rust + bindings │ Python json (C) │ simplejson │
├─────────────┼──────────┼─────────────┼─────────────────┼─────────────────┼────────────┤
│ small.json │ 85 B │ 1.2 us │ 1.6 us │ 1.2 us │ 9.2 us │
├─────────────┼──────────┼─────────────┼─────────────────┼─────────────────┼────────────┤
│ medium.json │ 16.6 KB │ 126 us │ 161 us │ 90 us │ 1.1 ms │
├─────────────┼──────────┼─────────────┼─────────────────┼─────────────────┼────────────┤
│ large.json │ 178.4 KB │ 735 us │ 937 us │ 782 us │ 7.8 ms │
└─────────────┴──────────┴─────────────┴─────────────────┴─────────────────┴────────────┘
What's Working Well
- **Extensive documentation beyond the minimum:** You documented every public item, not just the required ones, with working examples and Python-style docstrings on all PyO3 functions.
- **Sophisticated benchmarking design:** You independently added a 4th "pure Rust" benchmark column to isolate binding overhead, and built in warmup rounds, auto-scaling, and `with_capacity` for the timing vectors.
- **Thoughtful performance optimizations in the tokenizer:** The refactor to `&'input str` with byte-level scanning, the fast-path/slow-path split in `consume_string()`, and slice-based consumption in `consume_number()`/`consume_keyword()` are well-executed.
- **Polished CLI experience:** Using `argparse` with `--rounds`, `--warmup`, directory-based benchmarking, auto-discovery, and human-readable file sizes goes well beyond the curriculum's `sys.argv` suggestion.
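The fast-path/slow-path split praised above can be sketched roughly like this (a hypothetical, minimal free-function version; the PR's actual code uses `Tokenizer` methods, so all names and signatures here are assumptions): scan raw bytes while no escape appears, and only fall back to char-by-char decoding once a backslash is seen.

```rust
// Hypothetical sketch of a fast-path/slow-path string consumer.
// `input` starts just after the opening quote of a JSON string.
fn consume_string(input: &str) -> Result<String, String> {
    for (i, &b) in input.as_bytes().iter().enumerate() {
        match b {
            // Fast path: closing quote reached with no escapes,
            // so the result is just a slice copy.
            b'"' => return Ok(input[..i].to_string()),
            // Slow path: an escape is present, decode char by char.
            b'\\' => return consume_string_slow(input),
            _ => {}
        }
    }
    Err("unterminated string".to_string())
}

fn consume_string_slow(input: &str) -> Result<String, String> {
    let mut out = String::new();
    let mut chars = input.chars();
    while let Some(c) = chars.next() {
        match c {
            '"' => return Ok(out),
            '\\' => match chars.next() {
                Some('n') => out.push('\n'),
                Some('t') => out.push('\t'),
                Some('"') => out.push('"'),
                Some('\\') => out.push('\\'),
                Some(other) => return Err(format!("bad escape: \\{other}")),
                None => return Err("unterminated escape".into()),
            },
            _ => out.push(c),
        }
    }
    Err("unterminated string".to_string())
}
```

The win is that typical JSON strings contain no escapes, so most inputs never allocate-and-push per character.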
Areas for Improvement
- Clippy warning: redundant field name in `Tokenizer`
- Use of `unwrap()` in the `median()` function
```rust
let result = parse_file(path)?;
result.into_pyobject(py)
```
You could improve this even further into a one-liner that I think doesn't make it any less understandable: `parse_file(path)?.into_pyobject(py)`
```rust
) -> PyResult<Bound<'py, PyDict>> {
    let n = rounds as usize;

    // --- Rust (with no bindings) ---
```
Not major, but this `benchmark_performance()` method could read even better if you split its 4-5 sections into separate, smaller methods. You wouldn't even need the comments then, provided you name the methods effectively enough.
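As an illustration of the suggestion above, each `// --- ... ---` section could become a named helper so the top-level function reads like a table of contents. Everything below is a hypothetical sketch (the real function builds a `PyDict` via PyO3; these names, and the stand-in workload, are assumptions):

```rust
use std::time::Instant;

// Shared timing loop: runs `work` `rounds` times and records seconds per round.
fn time_rounds<F: FnMut()>(rounds: usize, mut work: F) -> Vec<f64> {
    let mut times = Vec::with_capacity(rounds); // pre-sized, as in the PR
    for _ in 0..rounds {
        let start = Instant::now();
        work();
        times.push(start.elapsed().as_secs_f64());
    }
    times
}

// One former comment-delimited section, now a self-describing method.
fn bench_pure_rust(rounds: usize, input: &str) -> Vec<f64> {
    // stand-in workload for the real parse() call
    time_rounds(rounds, || {
        let _ = input.len();
    })
}

// The top-level function becomes a list of named calls, no comments needed.
fn benchmark_performance(rounds: usize, input: &str) -> Vec<f64> {
    bench_pure_rust(rounds, input)
    // ...followed by bench_with_bindings(), bench_python_json(), etc.
}
```

A side benefit: the shared `time_rounds` helper removes the duplicated warmup/timing boilerplate between sections.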
```rust
    Some(b)
}

fn _input_slice_to_string(&self, start: usize, end: usize) -> String {
```
You don't need to prefix this method name with a `_`; in Rust, a leading underscore is conventionally reserved for marking a variable or function as intentionally unused (it suppresses the unused warning).
```rust
'"' => {
    self.advance(); // consume closing quote
    return Ok(consumed_string);
```

```rust
fn consume_string_slow(&mut self, s: &mut String) -> JsonResult<String> {
```
Is this named `consume_string_slow` because it's a pretty unoptimized method?
```rust
times.select_nth_unstable_by(mid, |a, b| a.partial_cmp(b).unwrap());
if times.len() % 2 == 1 {
    times[mid]
} else {
    let left = *times[..mid]
        .iter()
        .max_by(|a, b| a.partial_cmp(b).unwrap())
        .unwrap();
```
Can you figure out a way of getting rid of these `unwrap()` uses? Hint: are there error cases to handle or surface, would a default value suffice, or could you map to something else?
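One possible answer to the hint, sketched here rather than prescribed (this is not the PR's code): `f64::total_cmp` gives a total ordering over floats, so the comparator needs no `partial_cmp(..).unwrap()`, and returning `Option<f64>` surfaces the empty-input case to the caller instead of panicking.

```rust
// unwrap()-free median of a slice of timing samples.
fn median(times: &mut [f64]) -> Option<f64> {
    if times.is_empty() {
        return None; // surfaced error case instead of a panic
    }
    let mid = times.len() / 2;
    // total_cmp is a total order over f64, so no Option to unwrap.
    times.select_nth_unstable_by(mid, |a, b| a.total_cmp(b));
    if times.len() % 2 == 1 {
        Some(times[mid])
    } else {
        // Largest element of the left partition, without max_by + unwrap:
        // fold from f64::MIN (safe here since the partition is non-empty).
        let left = times[..mid].iter().copied().fold(f64::MIN, f64::max);
        Some((left + times[mid]) / 2.0)
    }
}
```

`total_cmp` also handles NaN deterministically (it sorts to one end) rather than panicking mid-sort.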

Changes in this MR add doc comments, a benchmark function (added to the Python CLI), and a few optimisations:
- the `input` member of `Tokenizer` here
- `input` instead of instances of `String` when consuming numbers and strings in `tokenize()`
- `format!` macro into the `err_on_missing_expected_comma()` function here, so that the `String` is not eagerly allocated through `Debug`

The parser (considering the Python bindings) is still slower than the Python implementations, so I added the pure Rust time to the benchmark output just for fun and as a reference for how much overhead the Python bindings bring.
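The "don't eagerly allocate the error `String`" point can be illustrated with a minimal sketch (hypothetical types and names, assuming a simplified version of the parser's error helper): the `format!` call lives inside the error-constructing function, so it only runs on the error path, never on the hot path.

```rust
// Hypothetical simplified error type for illustration.
#[derive(Debug, PartialEq)]
struct JsonError(String);

// format! allocates only when this function is actually called,
// i.e. only on the error path.
fn err_on_missing_expected_comma(found: char) -> JsonError {
    JsonError(format!("expected ',' but found {found:?}"))
}

fn expect_comma(c: char) -> Result<(), JsonError> {
    if c == ',' {
        Ok(()) // hot path: no String allocation at all
    } else {
        Err(err_on_missing_expected_comma(c))
    }
}
```

Before the change, a call site doing `format!(...)` up front would pay for the allocation (and the `Debug` formatting) even when the comma was present.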