3.2 Lexical Translations
A raw Unicode character stream is translated into a sequence of tokens, using the following three lexical translation steps, which are applied in turn:
- A translation of Unicode escapes (§3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \u xxxx , where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx . This translation step allows any program to be expressed using only ASCII characters.
- A translation of the Unicode stream resulting from step 1 into a stream of input characters and line terminators (§3.4).
- A translation of the stream of input characters and line terminators resulting from step 2 into a sequence of input elements (§3.5) which, after white space (§3.6) and comments (§3.7) are discarded, comprise the tokens (§3.5) that are the terminal symbols of the syntactic grammar (§2.3).
The longest possible translation is used at each step, even if the result does not ultimately make a correct program while another lexical translation would. Thus the input characters a--b are tokenized (§3.5) as a, --, b, which is not part of any grammatically correct program, even though the tokenization a, -, -, b could be part of a grammatically correct program.