3.5 Input Elements and Tokens
The input characters and line terminators that result from escape processing (§3.3) and then input line recognition (§3.4) are reduced to a sequence of input elements. Those input elements that are not white space (§3.6) or comments (§3.7) are tokens. The tokens are the terminal symbols of the syntactic grammar (§2.3).
This process is specified by the following productions:
Input:
    InputElements_opt Sub_opt

InputElements:
    InputElement
    InputElements InputElement

InputElement:
    WhiteSpace
    Comment
    Token

Token:
    Identifier
    Keyword
    Literal
    Separator
    Operator

Sub:
    the ASCII SUB character, also known as "control-Z"
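The productions above can be illustrated with a minimal sketch of a scanner that splits input into input elements and keeps only the tokens. This is not how any particular compiler is written. The class name InputElementSketch and the method tokens are hypothetical, and the sketch makes simplifying assumptions: it approximates white space with Character.isWhitespace, recognizes only identifiers, keywords, decimal integer literals, and single-character separators and operators, and ignores Unicode escapes and string literals.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: reduces input to input elements and keeps only tokens.
class InputElementSketch {

    static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int n = input.length();
        while (i < n) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                // WhiteSpace: an input element, but not a token (approximation).
                i++;
            } else if (input.startsWith("//", i)) {
                // End-of-line comment: skip up to the next line terminator.
                while (i < n && input.charAt(i) != '\n' && input.charAt(i) != '\r') {
                    i++;
                }
            } else if (input.startsWith("/*", i)) {
                // Traditional comment: skip through the closing delimiter.
                // Assumption: an unterminated comment simply consumes the rest.
                int end = input.indexOf("*/", i + 2);
                i = (end < 0) ? n : end + 2;
            } else if (Character.isJavaIdentifierStart(c)) {
                // Identifier or keyword (the sketch does not tell them apart).
                int start = i;
                while (i < n && Character.isJavaIdentifierPart(input.charAt(i))) {
                    i++;
                }
                out.add(input.substring(start, i));
            } else if (Character.isDigit(c)) {
                // Decimal integer literal only.
                int start = i;
                while (i < n && Character.isDigit(input.charAt(i))) {
                    i++;
                }
                out.add(input.substring(start, i));
            } else {
                // Anything else is treated as a one-character separator or operator.
                out.add(String.valueOf(c));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [class, Empty, {, }]
        System.out.println(tokens("class Empty { /* nothing */ } // done"));
    }
}
```

The comments and white space are consumed as input elements but never reach the returned list, which mirrors the statement that only the remaining input elements are tokens.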
White space (§3.6) and comments (§3.7) can serve to separate tokens that, if adjacent, might be tokenized in another manner. For example, the ASCII characters - and = in the input can form the operator token -= (§3.12) only if there is no intervening white space or comment.
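The effect of an intervening white space character or comment can be sketched with a small operator scanner. The class OperatorAdjacencySketch and its tiny operator table are hypothetical and cover only a few operators; the sketch assumes that the longest matching operator is taken at each position and that everything other than operators and traditional comments can be skipped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: shows that - and = form -= only when adjacent.
class OperatorAdjacencySketch {

    // Assumption: a tiny illustrative operator set, longer operators first.
    private static final String[] OPERATORS = {"-=", "--", "-", "="};

    static List<String> operators(String input) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            if (input.startsWith("/*", i)) {
                // A comment separates the characters on either side of it.
                int end = input.indexOf("*/", i + 2);
                i = (end < 0) ? input.length() : end + 2;
                continue;
            }
            String matched = null;
            for (String op : OPERATORS) {
                if (input.startsWith(op, i)) {
                    matched = op; // first hit is the longest, given the table order
                    break;
                }
            }
            if (matched != null) {
                out.add(matched);
                i += matched.length();
            } else {
                i++; // white space, identifiers, and literals are skipped here
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(operators("x -= 1"));     // prints [-=]
        System.out.println(operators("x - = 1"));    // prints [-, =]
        System.out.println(operators("x -/**/= 1")); // prints [-, =]
    }
}
```

In the second and third inputs the two characters become separate tokens, so a parser would see - followed by = rather than the compound assignment operator.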
As a special concession for compatibility with certain operating systems, the ASCII SUB character (\u001a, or control-Z) is ignored if it is the last character in the escaped input stream.
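As a rough illustration, a tool that reads source text might apply this rule before tokenizing. The helper below is a hypothetical sketch, not part of any standard API; it removes a SUB character only when it is the final character and leaves any other occurrence untouched.

```java
// Hypothetical sketch: drops a trailing ASCII SUB (control-Z) from the input.
class TrailingSubSketch {

    // ASCII SUB, code point 0x1A.
    static final char SUB = (char) 0x1A;

    static String stripTrailingSub(String input) {
        if (!input.isEmpty() && input.charAt(input.length() - 1) == SUB) {
            return input.substring(0, input.length() - 1);
        }
        return input; // SUB anywhere else is left for later stages to handle
    }

    public static void main(String[] args) {
        String source = "class Empty { }" + SUB;
        // Prints 16 then 15: only the final SUB is removed.
        System.out.println(source.length());
        System.out.println(stripTrailingSub(source).length());
    }
}
```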
Consider two tokens x and y in the resulting input stream. If x precedes y, then we say that x is to the left of y and that y is to the right of x.
For example, in this simple piece of code:
    class Empty {
    }
we say that the } token is to the right of the { token, even though it appears, in this two-dimensional representation on paper, downward and to the left of the { token. This convention about the use of the words left and right allows us to speak, for example, of the right-hand operand of a binary operator or of the left-hand side of an assignment.