[Lexer] Add Unicode identifier and whitespace recognition#23
Conversation
@labath @cmtice

Did some benchmarking. Using
I'm going to be very unsympathetic to any arguments about performance until I see some data showing that the lexer takes up an appreciable portion of the time it takes to evaluate a DIL expression. Since most DIL expressions are going to be less than ~20 characters long, I find it very hard to imagine an implementation that would be too slow.

That said, the reason I suggested this function is that I thought you wouldn't be doing any Unicode conversions in the lexer. (When I said I wanted to treat all Unicode chars as identifiers, I really meant all of them, Ogham Space Marks (U+1680) included.) If you're already doing Unicode conversions (*), then counting those is good enough for me, as the main thing I'm optimising for here is the complexity of the implementation.

(*) There are far fewer space characters than potential identifier characters, and I think they're a lot less ambiguous, so if you really think they are needed (I don't), I'd be fine with that. Still, given that there are so few of them, and in the aforementioned interest of reducing the amount of code written, I think it would be easier to skip them via something like: (i.e., let the compiler convert these into byte sequences and then use StringRef operations for the rest).
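The "byte sequences plus StringRef operations" idea can be sketched roughly as below. This is a hypothetical illustration, not the patch's actual code: `std::string_view` stands in for `llvm::StringRef`, and the helper names (`ConsumeOneSpace`, `SkipSpaces`) and the particular set of space code points are made up for the example.

```cpp
#include <cassert>
#include <string_view>

// Sketch of the suggestion: write the handful of Unicode whitespace code
// points as string literals, let the compiler encode them into UTF-8 byte
// sequences, and strip them with plain string operations. No code-point
// decoding is needed in the lexer itself.
static bool ConsumeOneSpace(std::string_view &s) {
  // ASCII whitespace first.
  if (!s.empty() && (s.front() == ' ' || s.front() == '\t' ||
                     s.front() == '\n' || s.front() == '\r')) {
    s.remove_prefix(1);
    return true;
  }
  // A few illustrative Unicode spaces (NBSP, Ogham Space Mark, line and
  // paragraph separators); the compiler turns each into its UTF-8 bytes.
  for (std::string_view space : {"\u00A0", "\u1680", "\u2028", "\u2029"}) {
    if (s.substr(0, space.size()) == space) {
      s.remove_prefix(space.size());
      return true;
    }
  }
  return false;
}

// Skip any run of ASCII or Unicode whitespace at the front of the input.
static void SkipSpaces(std::string_view &s) {
  while (ConsumeOneSpace(s)) {
  }
}
```

With `llvm::StringRef`, the same shape works via `starts_with` and `drop_front`; the point of the suggestion is that the whitespace table stays a short, readable list of literals.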
Removing the non-standard whitespace makes sense; I guess even Unicode identifiers are usually separated by regular whitespace anyway.
Added skipping of both Unicode and ASCII whitespace at the beginning.
Replaced IsWordcode with Unicode identifier recognition; the output is then checked for being a keyword, as before. Code points for Unicode whitespace and identifier characters are taken from the Swift lexer. The length of an identifier is counted in Unicode characters and added to the position tracker.
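Counting identifier length "in Unicode characters" rather than bytes matters for the position tracker, because non-ASCII identifiers occupy more bytes than characters. Assuming UTF-8 input, a minimal sketch of that counting (the function name `CodePointLength` is hypothetical, not from the patch) exploits the fact that every code point starts at exactly one non-continuation byte:

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Count Unicode code points in a UTF-8 byte sequence. In UTF-8, every
// code point begins with a byte that is NOT a continuation byte
// (continuation bytes have the bit pattern 10xxxxxx), so counting the
// non-continuation bytes gives the code point count without decoding.
static std::size_t CodePointLength(std::string_view bytes) {
  std::size_t count = 0;
  for (unsigned char c : bytes)
    if ((c & 0xC0) != 0x80) // leading byte of a code point
      ++count;
  return count;
}
```

A column tracker would then advance by `CodePointLength(identifier)` instead of `identifier.size()`, so that, e.g., a two-character identifier like "éé" (four bytes in UTF-8) moves the column by two, not four.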