Edit: Aha! I knew there was an anti-pattern in what I was doing.
The simple solution is to modify the lexer rules, adding a new rule with a symbol ID of #ERROR (-1) that matches EVERYTHING. Since it is added to the state machine last, every other match overrides it. This lets the state machine building code do the heavy lifting of crafting the logic to gather error tokens.
I've actually done this before. I can't believe it didn't occur to me. I can't believe my code didn't already do it. It's one of those simple rules of lexing that I seemed to have learned at one point and then forgotten. Never again. This took me hours.
Edit 2: Aaand that didn't work because it needs to lazy match for that to be effective.
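To make the failure concrete, here is a minimal sketch of what I take the problem to be, assuming a longest-match lexer where earlier rules win ties. It uses Python's `re` module rather than a generated DFA, and the rule patterns and the `longest_match` helper are hypothetical, made up purely for illustration. A greedy catch-all added last still produces the longest match, so it beats every real rule:

```python
import re

ERROR = -1

# Hypothetical rules as (symbol_id, pattern); the patterns are assumptions,
# not the actual Reggie rule set.
RULES = [
    (3, re.compile(r"[0-9]+")),        # integer
    (6, re.compile(r"[A-Za-z_]+")),    # identifier
    (ERROR, re.compile(r"(?s).+")),    # greedy catch-all, added last
]

def longest_match(text, pos):
    """Return (symbol_id, length) of the longest match at pos.

    Earlier rules win ties because only a strictly longer match replaces
    the current best.
    """
    best_id, best_len = None, 0
    for symbol_id, pattern in RULES:
        m = pattern.match(text, pos)
        if m and m.end() - pos > best_len:
            best_id, best_len = symbol_id, m.end() - pos
    return best_id, best_len

# The identifier rule matches "foo" (3 chars), but the greedy catch-all
# matches all 7 chars, so the whole input becomes one error token.
greedy_result = longest_match("foo 123", 0)
```

Being added last only helps the other rules on ties. On length the catch-all always wins, which is why it would have to match lazily to be useful.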
You can do lazy matching with DFAs using a little-known technique. RE/flex does it (see the article "Constructing Fast Lexical Analyzers with RE/flex - Why Another Scanner Generator?"), and the author promised to release a paper on how it works, but I haven't seen anything further on it. The source code is ... creative: everything is done in constructors, just for starters, so I haven't been able to make heads or tails of it.
Edit 3: I think I finally solved it without hacking anything too badly.
Edit 4: Solved it and have one of the template generators for the new code (targeting C# so far) implemented:
This is what victory looks like.
Tokenizing: ... /* ...a*/ baz ... 12343 foo 123.22 bar....
AbsolutePosition: 0, AbsoluteLength: 3, Position: 0, Length: 3, SymbolId: -1, Value: ..., Line: 1, Column: 1
AbsolutePosition: 5, AbsoluteLength: 9, Position: 5, Length: 9, SymbolId: 40, Value: /* ...a*/, Line: 1, Column: 9
AbsolutePosition: 15, AbsoluteLength: 3, Position: 15, Length: 3, SymbolId: 6, Value: baz, Line: 1, Column: 19
AbsolutePosition: 20, AbsoluteLength: 3, Position: 20, Length: 3, SymbolId: -1, Value: ..., Line: 1, Column: 24
AbsolutePosition: 24, AbsoluteLength: 5, Position: 24, Length: 5, SymbolId: 3, Value: 12343, Line: 1, Column: 28
AbsolutePosition: 30, AbsoluteLength: 3, Position: 30, Length: 3, SymbolId: 6, Value: foo, Line: 1, Column: 34
AbsolutePosition: 34, AbsoluteLength: 6, Position: 34, Length: 6, SymbolId: 4, Value: 123.22, Line: 1, Column: 41
AbsolutePosition: 41, AbsoluteLength: 3, Position: 41, Length: 3, SymbolId: 6, Value: bar, Line: 1, Column: 48
AbsolutePosition: 44, AbsoluteLength: 4, Position: 44, Length: 4, SymbolId: -1, Value: ...., Line: 1, Column: 51
___ snip ___
I'm trying to match tokens such that a run of error characters gets reported as one error, rather than one error for each rejected character.
This is shockingly difficult. I've given up on it in the past, like with my Rolex project, but Reggie is to replace Rolex, among other things, and I'm not willing to ship the latest without that sorted out.
The reason it's such a big deal is that reporting multiple errors for what is really one error can mess with "panic mode" error recovery in parsers built on top of a lexer like this. The parser gets confused about how many bad tokens there actually are in the text, which makes an already bad situation worse when parsing a document with errors in it.
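For illustration, the behaviour I'm after can be sketched like this. It is a rough sketch only: it coalesces the errors in the driver loop of a regex-based tokenizer, not inside generated DFA tables, so it is not the state machine solution itself. The rule patterns, the `Token` type and the `tokenize` function are all hypothetical. The symbol IDs are borrowed from the sample output in this post, but what they mean there is my guess.

```python
import re
from typing import Iterator, NamedTuple

ERROR = -1

class Token(NamedTuple):
    symbol_id: int
    position: int
    value: str

# Hypothetical rules as (symbol_id, pattern). A symbol_id of None means
# "match but do not report" (whitespace).
RULES = [
    (None, re.compile(r"\s+")),
    (40, re.compile(r"/\*.*?\*/", re.S)),        # block comment, lazy body
    (4, re.compile(r"[0-9]+\.[0-9]+")),          # float
    (3, re.compile(r"[0-9]+")),                  # integer
    (6, re.compile(r"[A-Za-z_][A-Za-z_0-9]*")),  # identifier
]

def tokenize(text: str) -> Iterator[Token]:
    pos = 0
    error_start = None  # start of the error run in progress, if any
    while pos < len(text):
        # Longest match wins; earlier rules win ties.
        best_id, best_len = None, 0
        for symbol_id, pattern in RULES:
            m = pattern.match(text, pos)
            if m and m.end() - pos > best_len:
                best_id, best_len = symbol_id, m.end() - pos
        if best_len == 0:
            # Nothing matched here: extend (or start) the error run.
            if error_start is None:
                error_start = pos
            pos += 1
            continue
        if error_start is not None:
            # A real match ends the run: report it as ONE error token.
            yield Token(ERROR, error_start, text[error_start:pos])
            error_start = None
        if best_id is not None:
            yield Token(best_id, pos, text[pos:pos + best_len])
        pos += best_len
    if error_start is not None:
        # Input ended in the middle of an error run.
        yield Token(ERROR, error_start, text[error_start:])

SAMPLE = "... /* ...a*/ baz ... 12343 foo 123.22 bar...."
tokens = list(tokenize(SAMPLE))
```

With this, each run of dots in `SAMPLE` comes back as a single token with symbol ID -1 instead of one per character. That gives a parser doing panic mode recovery an honest count of bad tokens.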
The thing is it seems so bloody simple, but every similarly simple approach I've taken with it has fallen flat on its face.
This is getting in the way of me releasing code.
Real programmers use butterflies
modified 31-Oct-21 18:08pm.