What language do you want to scan? PHP?
If you want to write your own analyzer, you need to identify the following lexical tokens of the language you want to scan:
- Comments (//..., #..., /*...*/)
- Strings ("...", '...', handle escaping within the string literals)
- Numbers (0, 1, ..., 3.141592653589793238462643, ...)
- Words (including keywords)
- Operators and punctuation (=>, <<, >>, ++, --, ... +, -, ... $, ... {, }, ...)
- Spaces (space, tab, nl, cr, ...)
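Write a regex for each of these token types and combine them into one regex, with each sub-regex as a capturing group in an alternation: ((Comment)|(String)|(Number)|(Word)|(Op)|(Space)|(Error)). The order matters, since the first alternative that matches wins. The final Error group should match any single character, so that input no other group recognizes is still consumed.
Scan the text repeatedly with this regex, starting each match where the previous one ended, until no further match is found. For each match, check which sub-regex group matched; that tells you the token type.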
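Here is a rough sketch of that approach in Python, just to illustrate the idea. The sub-patterns, the group names and the `tokenize` function are my own assumptions for a generic PHP-like syntax; they are not a complete or accurate PHP lexer and you will need to adapt them to your language.

```python
import re

# Hypothetical sub-patterns, one per token class. Order matters: the first
# alternative that matches at the current position wins, so comments come
# before operators (otherwise "//" would be read as two "/" operators).
TOKEN_SPEC = [
    ("COMMENT", r"//[^\n]*|#[^\n]*|/\*.*?\*/"),
    ("STRING",  r'"(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\''),  # handles \" and \'
    ("NUMBER",  r"\d+(?:\.\d+)?"),
    ("WORD",    r"[A-Za-z_]\w*"),                            # includes keywords
    ("OP",      r"=>|<<|>>|\+\+|--|[-+*/=<>$;(){}\[\],.]"),
    ("SPACE",   r"\s+"),
    ("ERROR",   r"."),                                       # any other character
]

# One big alternation with a named group per token class.
# DOTALL lets /* ... */ comments span several lines.
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,
)

def tokenize(text):
    """Return a list of (token_type, lexeme) pairs covering the whole text."""
    tokens = []
    for match in MASTER.finditer(text):
        # lastgroup is the name of the sub-regex group that matched.
        tokens.append((match.lastgroup, match.group()))
    return tokens

if __name__ == "__main__":
    for kind, lexeme in tokenize('$x = 3.14; // pi'):
        if kind != "SPACE":
            print(kind, repr(lexeme))
```

Because the ERROR group matches any single character, every character of the input ends up in some token, and the scan simply stops at the end of the text. Keywords come out as WORD tokens here; you can tell them apart afterwards by looking the lexeme up in a keyword set.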
Cheers
Andi