Lexer¶
File: src/LLLangCompiler/Lexer.fs (~185 lines).
Token definitions: src/LLLangCompiler/Token.fs.
The lexer is a single-pass character scanner that produces a
Tok list plus synthetic layout tokens. No external state between calls.
Token shape¶
type Token =
    | KwFn | KwLet | KwIn | KwType | KwTag | KwUnit
    | KwTrait | KwImpl | KwImport | KwExport | KwModule
    | KwIf | KwThen | KwElse | KwTrue | KwFalse
    | KwMatch | KwWith
    | Ident of string
    | TypeId of string
    | IntLit of int64 | FloatLit of float | StrLit of string
    | CharLit of char
    | Arrow | Backslash | Dot | Comma | Colon | ColonColon
    | Eq | Bar | LBrack | RBrack | LParen | RParen
    | Plus | Minus | Star | Slash | Caret
    | Lt | Gt | Le | Ge | EqEq | Neq
    | Underscore
    | Indent | Dedent | Newline
    | Eof

type Tok = { Token: Token; Line: int; Col: int }
Line/Col track 1-based source position; Col is computed from
pos - lineStart + 1.
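As a sketch of that position bookkeeping (the mutable names below are illustrative assumptions, not necessarily the actual Lexer.fs bindings):

```fsharp
// Illustrative 1-based position tracking; `line` and `lineStart`
// here are assumed names, not necessarily the real Lexer.fs ones.
let mutable line = 1      // 1-based line number
let mutable lineStart = 0 // absolute offset of the current line's first char

/// 1-based column for an absolute source offset.
let colAt (pos: int) = pos - lineStart + 1

/// Call after consuming a '\n' at absolute offset p.
let onNewline (p: int) =
    line <- line + 1
    lineStart <- p + 1
```

Resetting `lineStart` to the offset just past each `\n` is what keeps `colAt` 1-based on every line.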
Keyword table¶
Keywords are recognized after an identifier has been scanned, via a lookup
in a small Map<string, Token>:
let private keywords =
    Map.ofList [
        "fn", KwFn; "let", KwLet; "in", KwIn; "type", KwType
        "tag", KwTag; "unit", KwUnit; "trait", KwTrait; "impl", KwImpl
        "import", KwImport; "export", KwExport; "module", KwModule
        "if", KwIf; "then", KwThen; "else", KwElse
        "true", KwTrue; "false", KwFalse
        "match", KwMatch; "with", KwWith
    ]
Case-based identifier split¶
After reading a run of [A-Za-z0-9_] starting with a letter or _, the
lexer checks Char.IsUpper source[start] to distinguish uppercase-leading
identifiers (types, constructors, module segments → TypeId) from
lowercase (Ident). This keeps the parser free of case checks.
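The lookup-then-split order can be sketched as follows; `MiniToken` and `classify` are illustrative stand-ins, not the real Token type or helper names from Lexer.fs:

```fsharp
// Sketch of word classification: keyword map first, then the
// uppercase/lowercase split. Names here are illustrative only.
type MiniToken =
    | KwFn
    | KwLet
    | Ident of string
    | TypeId of string

let keywords = Map.ofList [ "fn", KwFn; "let", KwLet ] // abridged

/// Keyword lookup wins; otherwise the first character decides.
let classify (word: string) : MiniToken =
    match Map.tryFind word keywords with
    | Some kw -> kw
    | None when System.Char.IsUpper word[0] -> TypeId word
    | None -> Ident word
```

Note that a leading underscore is not uppercase, so `_tmp` lands in the Ident branch.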
Two-character operators¶
Two-character operators (->, <=, >=, ==, !=, ::) are checked
with peek() before single-character operators. Order matters: the <=
case must be tried before the bare < case, otherwise < would always
match first and <= could never be produced.
| '-' when peek() = '>' -> let t = mk Arrow in advance(); advance(); result.Add(t); scan()
| '<' when peek() = '=' -> let t = mk Le in advance(); advance(); result.Add(t); scan()
| ':' when peek() = ':' -> let t = mk ColonColon in advance(); advance(); result.Add(t); scan()
...
| '-' -> add Minus; advance(); scan()
| '<' -> add Lt; advance(); scan()
Note: Arrow is used for both the type arrow (Int -> Bool) and the pipe
operator (e -> f). Disambiguation happens in the parser based on
context. ColonColon is the list-cons operator used both in expressions
(x :: xs) and patterns.
String literals¶
Standard escape handling: \n, \t, \", \\ are unescaped inline.
Unrecognized escapes (\x) pass the escaped character through as-is.
Unterminated string throws inside scan(), which the top-level try
converts into Error ex.Message.
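The escape rules can be sketched as a standalone pass over a raw literal body (quotes already stripped); this is an illustration of the rules above, not the actual scan() code:

```fsharp
// Sketch of string-escape handling: \n, \t, \", \\ are unescaped;
// any other backslash sequence passes the escaped char through as-is.
let unescape (raw: string) : string =
    let sb = System.Text.StringBuilder()
    let mutable i = 0
    while i < raw.Length do
        if raw[i] = '\\' && i + 1 < raw.Length then
            let c =
                match raw[i + 1] with
                | 'n' -> '\n'
                | 't' -> '\t'
                | '"' -> '"'
                | '\\' -> '\\'
                | other -> other // unrecognized escape passes through
            sb.Append c |> ignore
            i <- i + 2
        else
            sb.Append raw[i] |> ignore
            i <- i + 1
    sb.ToString()
```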
Char literals¶
Single-quoted char literals ('a', '\n') produce CharLit of char.
The escape set is: \n, \t, \r, \\, \', \", \0. Any other
backslash sequence raises an Invalid char escape error. An
unterminated or missing-closing-quote char literal also errors out.
| '\'' ->
    advance() // consume opening '
    let ch =
        if source[pos] = '\\' then
            advance()
            match source[pos] with
            | 'n' -> '\n' | 't' -> '\t' | 'r' -> '\r'
            | '\\' -> '\\' | '\'' -> '\'' | '"' -> '"'
            | '0' -> '\000'
            | other -> failwith $"Invalid char escape '\\{other}'"
        else source[pos]
...
Line comments¶
-- starts a comment that runs to the end of the line. The scanner
consumes characters until the next \n without emitting a token:
| '-' when peek() = '-' ->
    while pos < source.Length && source[pos] <> '\n' do advance()
    scan()
Comment-only lines are also treated as non-significant by the indent handler so they do not fire spurious DEDENTs inside a block.
Layout tokens (INDENT / DEDENT)¶
This is the only non-obvious part of the lexer. ll-lang is
indentation-sensitive — but rather than leaking whitespace into the
parser, the lexer synthesizes Indent and Dedent tokens at significant
indentation changes.
Algorithm¶
An indentStack : Stack<int> tracks the currently-open indent levels.
Starts as [0].
On each newline:
- The Newline token is emitted.
- measureAndHandleIndent() counts leading spaces on the new line
  (treating \t as 4 spaces).
- Determine whether the line is "significant": non-empty, non-comment-only.
- If the count is greater than the stack top: push it and emit Indent.
- If the count is less than the stack top: pop levels while the top
  is greater than the new count, emitting Dedent for each pop.
- If the count equals the top: do nothing.
At end-of-input, any remaining levels above 0 are drained as Dedent
tokens. Then Eof is appended.
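The stack discipline above can be sketched in isolation; `handleIndent` is an illustrative name, and emission is modeled as a returned list rather than the real result.Add calls:

```fsharp
// Sketch of the layout algorithm: given the stack of open indent
// levels and a significant line's indent count, return the layout
// tokens to emit. The stack starts holding a single 0.
open System.Collections.Generic

type Layout =
    | Indent
    | Dedent

let handleIndent (stack: Stack<int>) (count: int) : Layout list =
    if count > stack.Peek() then
        stack.Push count
        [ Indent ]
    else
        // Pop every level deeper than the new count, one Dedent each;
        // an equal count pops nothing and emits nothing.
        [ while stack.Peek() > count do
              stack.Pop() |> ignore
              yield Dedent ]
```

Draining at end-of-input is then just `handleIndent stack 0`, which pops every remaining level above the initial 0.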
Why only significant lines matter¶
Blank lines and comment-only lines must not trigger DEDENT — otherwise a blank line in the middle of an indented block would prematurely close it. The check:
let significant =
    pos < source.Length &&
    source[pos] <> '\n' && source[pos] <> '\r' &&
    not (source[pos] = '-' && pos + 1 < source.Length && source[pos + 1] = '-')
Tab policy¶
Tabs are silently treated as 4 spaces. Mixing tabs and spaces in the same file is not recommended but is not a hard error.
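A minimal sketch of indent measurement under that policy (`measureIndent` is an illustrative name, not the actual helper):

```fsharp
// Sketch of leading-whitespace measurement with tabs counted as
// 4 spaces; stops at the first non-whitespace character.
let measureIndent (line: string) : int =
    let mutable width = 0
    let mutable i = 0
    while i < line.Length && (line[i] = ' ' || line[i] = '\t') do
        width <- width + (if line[i] = '\t' then 4 else 1)
        i <- i + 1
    width
```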
Error model¶
tokenize returns Result<Tok list, string>. Errors are raised with
failwith inside scan() — an unterminated string literal, or an invalid
or unterminated char literal — and the top-level try/with converts the
exception message into Error.
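The try/with shape reduces to a small wrapper; `scan` here stands in for the real scanner closure and is an assumption:

```fsharp
// Sketch of the Result-producing entry point: any failwith raised
// while scanning becomes an Error carrying the exception message.
let tokenizeWith (scan: unit -> 'a) : Result<'a, string> =
    try
        Ok (scan ())
    with ex ->
        Error ex.Message
```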
Test coverage¶
See tests/LLLangTests/LexerTests.fs. Key cases:
- Basic token recognition for each operator and keyword.
- INDENT/DEDENT synthesis across multiple nesting levels.
- Blank-line tolerance inside blocks.
- String escape handling.
- -- vs -> vs bare - disambiguation (comment start, arrow, and
  unary/binary minus).
Run:
dotnet test --filter LexerTests