diff --git a/CHANGELOG.md b/CHANGELOG.md
index fbc7b6f..bd6c512 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -83,8 +83,14 @@
 - `ClassDeclaration`: support `extends` keyword (in addition to the paren form)
 - Docs: remove JSON-based dummy output from the Vite/Webpack plugin
 
+### Tokenizer
+- State-machine tokenizer: char-code dispatch (`packages/parser/src/tokenizer.ts`), keyword trie with 3 boundary kinds (`WORD` for ASCII `\b`, `IDENT` for Vietnamese keywords ending in a non-ASCII character, `NONE` for keywords that need no boundary check), operator longest-match trie, bounded backtracking for multi-word identifiers vs. multi-word keywords
+- Bench infrastructure: 5 fixtures (tiny / medium / keywordHeavy / stringHeavy / large), `pnpm bench` + `pnpm bench:baseline` (dumps JSON)
+- Snapshot drift tests + smoke parity tests for the fixtures
+- Doc: `docs/architecture/tokenizer.md`
+
 ### Stats
-- Tests: 70 → 298 (100% pass)
+- Tests: 70 → 357 (100% pass)
 - Coverage: 79.85% → 92.44% statements / 71.23% → 85.88% branches / 84.89% → 96.25% functions
 - Compatibility matrix: 40.8% → 98.5% complete (135/137 features)
 
diff --git a/README.md b/README.md
index a7208bf..42dfc02 100644
--- a/README.md
+++ b/README.md
@@ -73,9 +73,12 @@ VietScript keeps 100% of JavaScript semantics, only replacing English keywords
 | ✅ | Full escape sequences (`\n`, `\x41`, `\u{1F600}`, etc.) |
 | ✅ | Vietnamese error messages with file:line:col + snippet |
 | ✅ | Source maps (debug stack traces point back to the `.vjs` file) |
+| ✅ | State-machine tokenizer (char-code dispatch + keyword trie + bounded backtracking) |
 
 **Detailed check:** [docs/compatibility.md](docs/compatibility.md): 70.9% ✅ complete, 24.6% 🟡 partial, 4.5% ❌ missing.
 
+**Tokenizer architecture:** [docs/architecture/tokenizer.md](docs/architecture/tokenizer.md).
+
 **Roadmap:** [docs/roadmap.md](docs/roadmap.md).
 
 ## Project structure
@@ -94,8 +97,9 @@ packages/
 
 ```bash
 pnpm install
-pnpm test # 249 tests
+pnpm test # 402 tests
 pnpm test:coverage # coverage report (≥88% statements)
+pnpm bench # benchmark tokenizer (regex vs FSM)
 pnpm lint
 pnpm build # build all packages
 ```
diff --git a/docs/.vitepress/config.ts b/docs/.vitepress/config.ts
index a958d69..258d25d 100644
--- a/docs/.vitepress/config.ts
+++ b/docs/.vitepress/config.ts
@@ -31,6 +31,12 @@ export default defineConfig({
           { text: 'Câu lệnh duyệt', link: '/basics/switch-case' },
         ],
       },
+      {
+        text: 'Kiến trúc',
+        items: [
+          { text: 'Tokenizer (state machine)', link: '/architecture/tokenizer' },
+        ],
+      },
     ],
 
     socialLinks: [{ icon: 'github', link: 'https://github.com/vuejs/vitepress' }],
diff --git a/docs/architecture/tokenizer.md b/docs/architecture/tokenizer.md
new file mode 100644
index 0000000..e9bbbaa
--- /dev/null
+++ b/docs/architecture/tokenizer.md
@@ -0,0 +1,125 @@
+# Tokenizer Architecture
+
+The VietScript tokenizer is a hand-written **state machine** that runs over the `.vjs` source string, combined with a **character trie** for keywords and operators and **bounded backtracking** for multi-word identifiers. This page describes the design, the main components, and how to add a new keyword.
+
+- File: [`packages/parser/src/tokenizer.ts`](https://github.com/imrim12/vietscript/blob/main/packages/parser/src/tokenizer.ts)
+- API: `getNextToken()`, `isEOF()`, `rollback(step)`, `getCursor()`.
+
+## Main loop
+
+`getNextToken()` reads `source.charCodeAt(cursor)` and branches directly on the char code:
+
+```
+DEFAULT
+ ├── whitespace          → skip
+ ├── '/' '/'             → line comment, skip to '\n'
+ ├── '/' '*'             → block comment, skip to '*/'
+ ├── '`'                 → scanTemplateLiteral
+ ├── '/' (regex ctx)     → scanRegexLiteral
+ ├── '"' | "'"           → scanString
+ ├── digit | '.digit'    → scanNumber
+ ├── ident-start         → scanIdentifierOrKeyword
+ └── otherwise           → scanOperator (longest-match trie)
+```
+
+No temporary strings are allocated and the regex engine is never invoked on the hot path; the tokenizer only indexes into the source via `charCodeAt`.
+
+## Keyword trie
+
+All keywords (English + Vietnamese aliases) are built into **a single trie** keyed by char code, including the space character for multi-word keywords such as `khai báo`, `phá vòng lặp`, `kiểu của`, `khởi tạo cha`, `không xác định`.
+
+Each entry carries a boundary rule that decides whether a keyword really stands alone or is directly followed by another identifier character:
+
+| Boundary | Meaning | Example keywords |
+|---|---|---|
+| `WORD` | The next char must not be `[A-Za-z0-9_]` (emulates JS `\b`) | `var`, `let`, `for`, `khai báo`, `nếu` |
+| `IDENT` | The next char must not be `[A-Za-zÀ-ỹ]` (for Vietnamese keywords ending in a non-ASCII character) | `riêng tư`, `bảo vệ`, `chờ`, `khi mà` |
+| `NONE` | No boundary check (for keywords that may sit directly next to another identifier) | `else`, `return`, `try`, `as`, `from`, `const`, `async` |
+
+Lookup is a single loop walking down the trie:
+
+```
+1. Start at the root, position p = cursor.
+2. Read charCodeAt(p), step down to the matching child.
+3. If the node has .type and the boundary check passes → record it as the longest match.
+4. Repeat until no child matches.
+5. Return the longest match (if any).
+```
+
+Complexity: O(L), where L is the length of the longest keyword (a constant). It does not depend on the number of keywords.
+
+## Bounded backtracking for multi-word identifiers
+
+VietScript allows multi-word identifiers separated by spaces, e.g. `con mèo đẹp = 1`. The tokenizer has to distinguish:
+
+- `khai báo con mèo đẹp = 1` → `[VAR, IDENT('con mèo đẹp'), '=', ...]`
+- `khai báo một lớp gì đó = 1` → the identifier must stop before `lớp` (because `lớp` is a keyword), which deliberately produces a syntax error.
+
+The logic lives in `scanIdentifier`:
+
+```
+1. Read the first word (ident-start, ident-cont*).
+2. Loop:
+   - If the next char is a space and the char after the space is ident-start:
+     - Peek: from the position after the space, call matchKeyword().
+     - If a keyword is found → STOP, do not consume the space.
+     - Otherwise → consume space + word, continue.
+   - Else → STOP.
+```
+
+This is **bounded** backtracking: the peek is at most as long as the longest keyword (a constant); it is not a full backtracking parser.
+
+## Operator longest-match trie
+
+Same idea as the keyword trie, but for operator symbols. It guarantees that `>>>=` beats `>>>`, `>>>` and `>>=` beat `>>`, `>>` beats `>`, and so on, regardless of declaration order in the `OPERATORS` array.
+
+## Telling regex literals from division
+
+The tokenizer tracks `lastTokenType`. If the previous token is in the `REGEX_PRECEDING_TOKENS` set (after `=`, `(`, `,`, `return`, ...), a leading `/` starts a regex literal; otherwise it is the division operator.
+
+## Template literal & regex literal
+
+Both are scanned by hand with `charCodeAt` lookups because they are inherently nested:
+
+- **Template literal** supports nested `${ ... }`, which may itself contain another template literal (recursion via a depth stack).
+- **Regex literal** supports character classes `[...]` (a slash inside does not end the regex), the escape `\/`, and trailing flags `[a-z]*`.
+
+## Bench
+
+```bash
+pnpm bench # run the bench, print a comparison table across fixtures
+pnpm bench:baseline # run the bench, dump JSON to packages/parser/bench/baseline.json
+```
+
+Bench files:
+- [`packages/parser/src/__bench__/tokenizer.bench.ts`](https://github.com/imrim12/vietscript/blob/main/packages/parser/src/__bench__/tokenizer.bench.ts): tokenizer bench on 5 fixtures (tiny / medium / keywordHeavy / stringHeavy / large).
+- [`packages/parser/src/__bench__/parser.bench.ts`](https://github.com/imrim12/vietscript/blob/main/packages/parser/src/__bench__/parser.bench.ts): end-to-end parse bench.
+- [`packages/parser/src/__bench__/fixtures/`](https://github.com/imrim12/vietscript/blob/main/packages/parser/src/__bench__/fixtures): representative `.vjs` sources.
+
+When changing the tokenizer, run the bench before and after and commit the results with the PR to make review easier.
+
+## Verification
+
+Three layers of tests protect the behavior:
+
+1. **Tokenizer behavior**: `tokenizer-edge.test.ts`, `vietnamese-keywords.test.ts`, `identifier-match-keyword.test.ts` cover edge cases directly.
+2. **Snapshot drift**: `bench-fixture-snapshot.test.ts` pins the token output (type|value|start-end) for each fixture. If a tokenizer change produces different output, the test fails and the change has to be reviewed.
+3. **Smoke**: `bench-fixture-parity.test.ts` ensures every fixture tokenizes to EOF without errors.
+
+In addition, 70+ parser tests cover the tokenizer indirectly through each AST node type.
+
+## Adding a new keyword
+
+1. Add it to the enum: [`packages/shared/parser/keyword.enum.ts`](https://github.com/imrim12/vietscript/blob/main/packages/shared/parser/keyword.enum.ts).
+2. Add an entry to the `KEYWORDS` array at the top of `tokenizer.ts`. Pick the right `boundary`:
+   - `WORD` for keywords ending in an ASCII character (emulates `\b`).
+   - `IDENT` for Vietnamese keywords ending in a non-ASCII character (because `\b` does not work on Unicode).
+   - `NONE` only when the keyword is deliberately allowed to sit next to another identifier.
+3. Add a test in `vietnamese-keywords.test.ts` (1 case using EN, 1 case using VI).
+4. Run `pnpm test` (full suite) + `pnpm bench` (check for regressions).
+
+## Related documents
+
+- [Roadmap](../roadmap.md): the Phase 0 section records the EN + VI keyword policy.
+- [Compatibility matrix](../compatibility.md): syntax status.
+- Source: `packages/parser/src/tokenizer.ts`, `packages/parser/src/parser.ts`.
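Reviewer's sketch (not part of the patch): the keyword-trie lookup and the bounded peek described in `docs/architecture/tokenizer.md` could look roughly like the TypeScript below. This is a minimal illustration under stated assumptions, not the code in `packages/parser/src/tokenizer.ts`. The type names, the helper predicates, the seven-entry keyword list and the token-type strings are all hypothetical; only the function names `matchKeyword` and `scanIdentifier`, the three boundary kinds and the `[A-Za-zÀ-ỹ]` range come from the document.

```typescript
// Illustrative sketch only: the data shapes, helpers and keyword list are assumptions,
// not the contents of packages/parser/src/tokenizer.ts.
type Boundary = 'WORD' | 'IDENT' | 'NONE'
type KeywordEntry = [string, string, Boundary] // [text, token type, boundary rule]

interface TrieNode {
  children: Map<number, TrieNode>
  type?: string
  boundary?: Boundary
}

interface KeywordMatch {
  type: string
  length: number
}

// Tiny hypothetical sample; the real KEYWORDS array is much longer.
const KEYWORDS: KeywordEntry[] = [
  ['var', 'VAR', 'WORD'],
  ['khai báo', 'VAR', 'WORD'],
  ['lớp', 'CLASS', 'WORD'],
  ['khởi tạo', 'CONSTRUCTOR', 'WORD'],
  ['khởi tạo cha', 'SUPER', 'WORD'],
  ['riêng tư', 'PRIVATE', 'IDENT'],
  ['return', 'RETURN', 'NONE'],
]

const isDigit = (c: number): boolean => c >= 48 && c <= 57
const isAsciiLetter = (c: number): boolean => (c >= 65 && c <= 90) || (c >= 97 && c <= 122)
// The range the doc writes as [A-Za-zÀ-ỹ], i.e. ASCII letters plus U+00C0..U+1EF9.
const isIdentStart = (c: number): boolean => isAsciiLetter(c) || (c >= 0xC0 && c <= 0x1EF9)
const isIdentCont = (c: number): boolean => isIdentStart(c) || isDigit(c)

// Build one trie keyed by char code; spaces inside multi-word keywords are ordinary edges.
function buildTrie(entries: KeywordEntry[]): TrieNode {
  const root: TrieNode = { children: new Map() }
  for (const [text, type, boundary] of entries) {
    let node = root
    for (let i = 0; i < text.length; i++) {
      const code = text.charCodeAt(i)
      let next = node.children.get(code)
      if (next === undefined) {
        next = { children: new Map() }
        node.children.set(code, next)
      }
      node = next
    }
    node.type = type
    node.boundary = boundary
  }
  return root
}

// Check the char right after a candidate keyword against its boundary rule.
function boundaryOk(source: string, end: number, boundary: Boundary): boolean {
  if (boundary === 'NONE' || end >= source.length) return true
  const c = source.charCodeAt(end)
  if (boundary === 'WORD') return !(isAsciiLetter(c) || isDigit(c) || c === 95) // like \b
  return !isIdentStart(c) // IDENT
}

// Walk down the trie from `cursor`, remembering the longest entry whose boundary passes.
function matchKeyword(root: TrieNode, source: string, cursor: number): KeywordMatch | null {
  let node = root
  let best: KeywordMatch | null = null
  for (let p = cursor; p < source.length; p++) {
    const next = node.children.get(source.charCodeAt(p))
    if (next === undefined) break
    node = next
    if (node.type !== undefined && boundaryOk(source, p + 1, node.boundary ?? 'NONE')) {
      best = { type: node.type, length: p + 1 - cursor }
    }
  }
  return best
}

// Scan a multi-word identifier; returns the end index (exclusive).
// Before swallowing "space + word", peek for a keyword after the space and stop if one matches.
function scanIdentifier(root: TrieNode, source: string, cursor: number): number {
  let p = cursor
  while (p < source.length && isIdentCont(source.charCodeAt(p))) p++
  while (p + 1 < source.length && source.charCodeAt(p) === 32 && isIdentStart(source.charCodeAt(p + 1))) {
    if (matchKeyword(root, source, p + 1) !== null) break // keyword ahead: leave the space alone
    p++ // consume the space
    while (p < source.length && isIdentCont(source.charCodeAt(p))) p++
  }
  return p
}

const root = buildTrie(KEYWORDS)
```

With this sample list, `matchKeyword(root, 'khởi tạo cha()', 0)` prefers the longer `SUPER` entry over `CONSTRUCTOR`, and `scanIdentifier(root, 'một lớp gì đó = 1', 0)` stops after `một` because the peek finds `lớp`. The peek never reads further than the longest keyword, which is what keeps the backtracking bounded.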
diff --git a/docs/getting-started.md b/docs/getting-started.md
index 3f2eda7..1d0bbfb 100644
--- a/docs/getting-started.md
+++ b/docs/getting-started.md
@@ -116,7 +116,11 @@ The project is developed with TDD. Each new syntax has 3 kinds of tests: parser
 ```bash
 pnpm test # Run the full test suite
 pnpm test:coverage # Tests + coverage report
+pnpm bench # Run the tokenizer benchmark on 5 fixtures
+pnpm bench:baseline # Bench → JSON baseline (packages/parser/bench/baseline.json)
 pnpm build # Build all packages
 ```
 
+Tokenizer architecture (state machine + keyword trie + bounded backtracking): see [Tokenizer Architecture](./architecture/tokenizer).
+
 Guide to adding a parser node: [CONTRIBUTING.md](../CONTRIBUTING.md).
diff --git a/docs/roadmap.md b/docs/roadmap.md
index 8703d8d..2e26d62 100644
--- a/docs/roadmap.md
+++ b/docs/roadmap.md
@@ -4,6 +4,8 @@ Full plan for taking the project from its current state (parser
 
 Status reference: [compatibility.md](./compatibility.md).
 
+The tokenizer architecture is described in [architecture/tokenizer.md](./architecture/tokenizer.md).
+
 ---
 
 ## 0.
Nguyên tắc & quyết định cố định diff --git a/packages/parser/bench/baseline.json b/packages/parser/bench/baseline.json index bccbd19..554141e 100644 --- a/packages/parser/bench/baseline.json +++ b/packages/parser/bench/baseline.json @@ -21,132 +21,132 @@ "filepath": "/home/user/vietscript/packages/parser/src/__bench__/tokenizer.bench.ts", "groups": [ { - "fullName": "packages/parser/src/__bench__/tokenizer.bench.ts > tokenizer (regex baseline)", + "fullName": "packages/parser/src/__bench__/tokenizer.bench.ts > tokenizer", "benchmarks": [ { "id": "-1068277940_0_0", "name": "tokenize tiny (129 chars)", "rank": 1, - "rme": 1.1426219639578934, + "rme": 1.141015862752919, "samples": [], - "totalTime": 500.33317099999954, - "min": 0.4448810000001231, - "max": 1.2078570000001037, - "hz": 1964.6908439736467, - "period": 0.5089859318413017, - "mean": 0.5089859318413017, - "variance": 0.008654820571585425, - "sd": 0.0930312881324634, - "sem": 0.002967237270752279, - "df": 982, + "totalTime": 500.00185199994417, + "min": 0.0020930000000589644, + "max": 0.7843789999999444, + "hz": 299578.8903598236, + "period": 0.003338018906468684, + "mean": 0.003338018906468684, + "variance": 0.000056562894653669586, + "sd": 0.007520830715663635, + "sem": 0.0000194323087880098, + "df": 149789, "critical": 1.96, - "moe": 0.005815785050674467, - "p75": 0.5024739999998928, - "p99": 0.8974330000000919, - "p995": 0.955120000000079, - "p999": 1.2078570000001037, - "sampleCount": 983, - "median": 0.4800920000000133 + "moe": 0.000038087325224499204, + "p75": 0.003083000000060565, + "p99": 0.013623999999936132, + "p995": 0.020438999999896623, + "p999": 0.03829700000005687, + "sampleCount": 149790, + "median": 0.0024290000001201406 }, { "id": "-1068277940_0_1", "name": "tokenize medium (1777 chars)", "rank": 3, - "rme": 0.6302605202939511, + "rme": 1.2727299895304194, "samples": [], - "totalTime": 516.6511870000006, - "min": 19.24686200000042, - "max": 20.44344299999989, - "hz": 50.324088387316465, - 
"period": 19.871199500000024, - "mean": 19.871199500000024, - "variance": 0.09610086164718201, - "sd": 0.3100013897504042, - "sem": 0.06079627444531513, - "df": 25, - "critical": 2.06, - "moe": 0.12524032535734916, - "p75": 20.03240300000016, - "p99": 20.44344299999989, - "p995": 20.44344299999989, - "p999": 20.44344299999989, - "sampleCount": 26, - "median": 19.872763499999905 + "totalTime": 500.0366130000066, + "min": 0.03256499999997686, + "max": 1.0443820000000414, + "hz": 21186.448601114455, + "period": 0.04719998234849977, + "mean": 0.04719998234849977, + "variance": 0.0009951855316763752, + "sd": 0.03154656132887347, + "sem": 0.0003064940461236842, + "df": 10593, + "critical": 1.96, + "moe": 0.0006007283304024209, + "p75": 0.05080900000029942, + "p99": 0.12844799999993484, + "p995": 0.15704199999981938, + "p999": 0.545332000000144, + "sampleCount": 10594, + "median": 0.037829999999985375 }, { "id": "-1068277940_0_2", "name": "tokenize keywordHeavy (2511 chars)", "rank": 4, - "rme": 0.5495901106494003, + "rme": 1.2775559544495476, "samples": [], - "totalTime": 506.61291200000005, - "min": 27.58257700000013, - "max": 28.69541799999979, - "hz": 35.53008534452829, - "period": 28.14516177777778, - "mean": 28.14516177777778, - "variance": 0.09673706615570357, - "sd": 0.3110258287597729, - "sem": 0.07330949088006712, - "df": 17, - "critical": 2.11, - "moe": 0.15468302575694162, - "p75": 28.271655999999894, - "p99": 28.69541799999979, - "p995": 28.69541799999979, - "p999": 28.69541799999979, - "sampleCount": 18, - "median": 28.091197499999907 + "totalTime": 500.01672599998346, + "min": 0.040463000000272586, + "max": 1.1820819999998093, + "hz": 18065.395676384433, + "period": 0.055354447691794914, + "mean": 0.055354447691794914, + "variance": 0.0011759389410946788, + "sd": 0.03429196613049008, + "sem": 0.00036080818496897254, + "df": 9032, + "critical": 1.96, + "moe": 0.0007071840425391862, + "p75": 0.05822900000021036, + "p99": 0.13829299999997602, + "p995": 
0.17417199999999866, + "p999": 0.578490000000329, + "sampleCount": 9033, + "median": 0.04572200000029625 }, { "id": "-1068277940_0_3", "name": "tokenize stringHeavy (1410 chars)", "rank": 2, - "rme": 0.6673262731198819, + "rme": 1.1398010880120715, "samples": [], - "totalTime": 500.565410000002, - "min": 6.132623999999851, - "max": 7.134853000000021, - "hz": 149.83056859641923, - "period": 6.674205466666693, - "mean": 6.674205466666693, - "variance": 0.037456074319165325, - "sd": 0.193535718458287, - "sem": 0.02234757982993992, - "df": 74, - "critical": 1.993, - "moe": 0.044538726601070264, - "p75": 6.781645000000026, - "p99": 7.134853000000021, - "p995": 7.134853000000021, - "p999": 7.134853000000021, - "sampleCount": 75, - "median": 6.676206999999977 + "totalTime": 500.08148599995, + "min": 0.010930999999800406, + "max": 0.9104910000000928, + "hz": 67796.95099530918, + "period": 0.014749925849455817, + "mean": 0.014749925849455817, + "variance": 0.0002494460350087862, + "sd": 0.015793860674603477, + "sem": 0.00008577541597605673, + "df": 33903, + "critical": 1.96, + "moe": 0.00016811981531307117, + "p75": 0.013374000000112574, + "p99": 0.04493300000012823, + "p995": 0.05427600000029997, + "p999": 0.08798699999988457, + "sampleCount": 33904, + "median": 0.01170300000012503 }, { "id": "-1068277940_0_4", "name": "tokenize large (14262 chars)", "rank": 5, - "rme": 1.1911127448688184, + "rme": 1.6847795453088072, "samples": [], - "totalTime": 8981.281721999996, - "min": 888.0166659999995, - "max": 939.8688099999999, - "hz": 1.1134268258732618, - "period": 898.1281721999997, - "mean": 898.1281721999997, - "variance": 223.66456306729435, - "sd": 14.955419187281056, - "sem": 4.729318799439242, - "df": 9, - "critical": 2.262, - "moe": 10.697719124331565, - "p75": 896.0057759999981, - "p99": 939.8688099999999, - "p995": 939.8688099999999, - "p999": 939.8688099999999, - "sampleCount": 10, - "median": 893.9150835 + "totalTime": 500.3381490000138, + "min": 0.2647550000001502, 
+ "max": 1.21205800000007, + "hz": 2808.1008869862553, + "period": 0.35611256156584614, + "mean": 0.35611256156584614, + "variance": 0.013165123286550166, + "sd": 0.11473937112669812, + "sem": 0.0030610773446615356, + "df": 1404, + "critical": 1.96, + "moe": 0.005999711595536609, + "p75": 0.3726120000001174, + "p99": 0.8343439999998736, + "p995": 0.9964389999995547, + "p999": 1.1970249999994849, + "sampleCount": 1405, + "median": 0.3244280000008075 } ] } diff --git a/packages/parser/bench/comparison.json b/packages/parser/bench/comparison.json deleted file mode 100644 index 31011fc..0000000 --- a/packages/parser/bench/comparison.json +++ /dev/null @@ -1,301 +0,0 @@ -{ - "files": [ - { - "filepath": "/home/user/vietscript/packages/parser/src/__bench__/parser.bench.ts", - "groups": [ - { - "fullName": "packages/parser/src/__bench__/parser.bench.ts > parser end-to-end (regex baseline)", - "benchmarks": [ - { - "id": "1166627732_0_0", - "name": "parse tiny (129 chars)", - "rank": 1, - "rme": 0, - "samples": [] - } - ] - } - ] - }, - { - "filepath": "/home/user/vietscript/packages/parser/src/__bench__/tokenizer.bench.ts", - "groups": [ - { - "fullName": "packages/parser/src/__bench__/tokenizer.bench.ts > tokenize tiny (129 chars)", - "benchmarks": [ - { - "id": "-1068277940_0_0", - "name": "regex", - "rank": 2, - "rme": 1.5350840371170378, - "samples": [], - "totalTime": 500.0193509999999, - "min": 0.4276030000000901, - "max": 1.3723669999999402, - "hz": 1769.931500111083, - "period": 0.5649936169491524, - "mean": 0.5649936169491524, - "variance": 0.017329359925297418, - "sd": 0.13164102675570946, - "sem": 0.0044250647063860315, - "df": 884, - "critical": 1.96, - "moe": 0.008673126824516621, - "p75": 0.5542259999999715, - "p99": 1.0561989999998787, - "p995": 1.1007520000000568, - "p999": 1.3723669999999402, - "sampleCount": 885, - "median": 0.5261639999998806 - }, - { - "id": "-1068277940_0_1", - "name": "fsm", - "rank": 1, - "rme": 0.6133533190779474, - "samples": [], 
- "totalTime": 500.00123599994686, - "min": 0.002124999999978172, - "max": 0.4388850000000275, - "hz": 364125.0998827918, - "period": 0.0027463088930751818, - "mean": 0.0027463088930751818, - "variance": 0.000013447134783211087, - "sd": 0.0036670335126926623, - "sem": 0.000008594171810106872, - "df": 182062, - "critical": 1.96, - "moe": 0.000016844576747809467, - "p75": 0.00261000000000422, - "p99": 0.005317999999988388, - "p995": 0.0074489999997240375, - "p999": 0.023729999999886786, - "sampleCount": 182063, - "median": 0.0025460000001658045 - } - ] - }, - { - "fullName": "packages/parser/src/__bench__/tokenizer.bench.ts > tokenize medium (1777 chars)", - "benchmarks": [ - { - "id": "-1068277940_1_0", - "name": "regex", - "rank": 2, - "rme": 3.524667752133275, - "samples": [], - "totalTime": 518.7961389999991, - "min": 19.4718170000001, - "max": 28.74305000000004, - "hz": 48.188485072746545, - "period": 20.751845559999964, - "mean": 20.751845559999964, - "variance": 3.139571992830767, - "sd": 1.7718837413416173, - "sem": 0.35437674826832344, - "df": 24, - "critical": 2.064, - "moe": 0.7314336084258196, - "p75": 20.659483999999793, - "p99": 28.74305000000004, - "p995": 28.74305000000004, - "p999": 28.74305000000004, - "sampleCount": 25, - "median": 20.37862300000006 - }, - { - "id": "-1068277940_1_1", - "name": "fsm", - "rank": 1, - "rme": 0.6624735904903833, - "samples": [], - "totalTime": 500.0035550000025, - "min": 0.03033499999992273, - "max": 0.5823090000003504, - "hz": 26333.812766591098, - "period": 0.03797399217741342, - "mean": 0.03797399217741342, - "variance": 0.0002169123830559907, - "sd": 0.014727945649546331, - "sem": 0.00012835085175012654, - "df": 13166, - "critical": 1.96, - "moe": 0.000251567669430248, - "p75": 0.036670999999842024, - "p99": 0.06638299999985975, - "p995": 0.0816330000002381, - "p999": 0.2839810000000398, - "sampleCount": 13167, - "median": 0.035983999999643856 - } - ] - }, - { - "fullName": 
"packages/parser/src/__bench__/tokenizer.bench.ts > tokenize keywordHeavy (2511 chars)", - "benchmarks": [ - { - "id": "-1068277940_2_0", - "name": "regex", - "rank": 2, - "rme": 1.7877956419477135, - "samples": [], - "totalTime": 515.2327339999997, - "min": 29.07310099999995, - "max": 33.27790499999992, - "hz": 32.99479803625988, - "period": 30.307807882352925, - "mean": 30.307807882352925, - "variance": 1.1105087871830166, - "sd": 1.053806807333781, - "sem": 0.2555856926842411, - "df": 16, - "critical": 2.12, - "moe": 0.5418416684905911, - "p75": 30.90304600000036, - "p99": 33.27790499999992, - "p995": 33.27790499999992, - "p999": 33.27790499999992, - "sampleCount": 17, - "median": 30.187858000000233 - }, - { - "id": "-1068277940_2_1", - "name": "fsm", - "rank": 1, - "rme": 0.6175740445255259, - "samples": [], - "totalTime": 500.0284290000027, - "min": 0.04476100000010774, - "max": 0.5986629999997604, - "hz": 20106.856764337987, - "period": 0.04973427779988091, - "mean": 0.04973427779988091, - "variance": 0.0002468973565791904, - "sd": 0.015712967783941722, - "sem": 0.00015670713822667616, - "df": 10053, - "critical": 1.96, - "moe": 0.00030714599092428527, - "p75": 0.047118000000409666, - "p99": 0.0876379999999699, - "p995": 0.10085900000012771, - "p999": 0.3146649999998772, - "sampleCount": 10054, - "median": 0.04607899999973597 - } - ] - }, - { - "fullName": "packages/parser/src/__bench__/tokenizer.bench.ts > tokenize stringHeavy (1410 chars)", - "benchmarks": [ - { - "id": "-1068277940_3_0", - "name": "regex", - "rank": 2, - "rme": 0.7766105799152846, - "samples": [], - "totalTime": 503.6576939999986, - "min": 6.501181999999972, - "max": 7.724500999999691, - "hz": 142.95423430978144, - "period": 6.995245749999981, - "mean": 6.995245749999981, - "variance": 0.05341669000115708, - "sd": 0.2311205096938761, - "sem": 0.027237813279305165, - "df": 71, - "critical": 1.9945, - "moe": 0.05432581858557415, - "p75": 7.049482999999782, - "p99": 7.724500999999691, - 
"p995": 7.724500999999691, - "p999": 7.724500999999691, - "sampleCount": 72, - "median": 6.967653500000324 - }, - { - "id": "-1068277940_3_1", - "name": "fsm", - "rank": 1, - "rme": 0.5493915324505317, - "samples": [], - "totalTime": 500.00108399988767, - "min": 0.01133399999980611, - "max": 0.5432410000003074, - "hz": 69159.8500614606, - "period": 0.014459256333137296, - "mean": 0.014459256333137296, - "variance": 0.00005680266124359761, - "sd": 0.007536754025679597, - "sem": 0.0000405295560967212, - "df": 34579, - "critical": 1.96, - "moe": 0.00007943792994957356, - "p75": 0.01378500000009808, - "p99": 0.03121000000010099, - "p995": 0.03526799999963259, - "p999": 0.052106000000094355, - "sampleCount": 34580, - "median": 0.013637999999446038 - } - ] - }, - { - "fullName": "packages/parser/src/__bench__/tokenizer.bench.ts > tokenize large (14262 chars)", - "benchmarks": [ - { - "id": "-1068277940_4_0", - "name": "regex", - "rank": 2, - "rme": 0.9425965082975333, - "samples": [], - "totalTime": 9050.40537, - "min": 884.817156000001, - "max": 926.1038919999992, - "hz": 1.104922883691783, - "period": 905.0405370000001, - "mean": 905.0405370000001, - "variance": 142.2337026237795, - "sd": 11.92617720075379, - "sem": 3.7713883733153164, - "df": 9, - "critical": 2.262, - "moe": 8.530880500439245, - "p75": 914.2840939999987, - "p99": 926.1038919999992, - "p995": 926.1038919999992, - "p999": 926.1038919999992, - "sampleCount": 10, - "median": 903.8145745000011 - }, - { - "id": "-1068277940_4_1", - "name": "fsm", - "rank": 1, - "rme": 0.7733163929814905, - "samples": [], - "totalTime": 500.2985060000319, - "min": 0.24359799999729148, - "max": 0.892772999999579, - "hz": 3280.041775699197, - "period": 0.30487416575260934, - "mean": 0.30487416575260934, - "variance": 0.002374390365389728, - "sd": 0.04872771660348685, - "sem": 0.0012028785212910658, - "df": 1640, - "critical": 1.96, - "moe": 0.002357641901730489, - "p75": 0.31062000000019907, - "p99": 0.5434040000000095, - 
"p995": 0.5940250000021479, - "p999": 0.6961689999989176, - "sampleCount": 1641, - "median": 0.2921910000004573 - } - ] - } - ] - } - ] -} \ No newline at end of file diff --git a/packages/parser/src/__bench__/parser.bench.ts b/packages/parser/src/__bench__/parser.bench.ts index eb2766e..466f67f 100644 --- a/packages/parser/src/__bench__/parser.bench.ts +++ b/packages/parser/src/__bench__/parser.bench.ts @@ -7,7 +7,7 @@ import { fixtures } from './fixtures' // The other fixtures still exercise the tokenizer (see tokenizer.bench.ts). const PARSE_OK = ['tiny'] as const -describe('parser end-to-end (regex baseline)', () => { +describe('parser end-to-end', () => { for (const name of PARSE_OK) { const source = fixtures[name] bench(`parse ${name} (${source.length} chars)`, () => { diff --git a/packages/parser/src/__bench__/tokenizer.bench.ts b/packages/parser/src/__bench__/tokenizer.bench.ts index f55bbf4..dff3f19 100644 --- a/packages/parser/src/__bench__/tokenizer.bench.ts +++ b/packages/parser/src/__bench__/tokenizer.bench.ts @@ -1,12 +1,11 @@ import type { Token } from '@vietscript/shared' import { Parser } from '@parser/parser' import { Tokenizer } from '@parser/tokenizer' -import { TokenizerFSM } from '@parser/tokenizer-fsm' import { bench, describe } from 'vitest' import { fixtures } from './fixtures' -function tokenizeRegex(source: string): number { +function tokenizeAll(source: string): number { const parser = new Parser() parser.syntax = source parser.tokenizer = new Tokenizer(parser) @@ -19,29 +18,13 @@ function tokenizeRegex(source: string): number { return count } -function tokenizeFSM(source: string): number { - const parser = new Parser() - parser.syntax = source - parser.tokenizer = new TokenizerFSM(parser) - let count = 0 - let tok: Token | null = parser.tokenizer.getNextToken() - while (tok !== null) { - count++ - tok = parser.tokenizer.getNextToken() - } - return count -} - const NAMES = ['tiny', 'medium', 'keywordHeavy', 'stringHeavy', 'large'] as 
const -for (const name of NAMES) { - const source = fixtures[name] - describe(`tokenize ${name} (${source.length} chars)`, () => { - bench('regex', () => { - tokenizeRegex(source) +describe('tokenizer', () => { + for (const name of NAMES) { + const source = fixtures[name] + bench(`tokenize ${name} (${source.length} chars)`, () => { + tokenizeAll(source) }) - bench('fsm', () => { - tokenizeFSM(source) - }) - }) -} + } +}) diff --git a/packages/parser/src/__test__/tokenizer-fsm.test.ts b/packages/parser/src/__test__/tokenizer-fsm.test.ts deleted file mode 100644 index 57b373f..0000000 --- a/packages/parser/src/__test__/tokenizer-fsm.test.ts +++ /dev/null @@ -1,232 +0,0 @@ -import type { Token } from '@vietscript/shared' -import { Parser } from '@parser/parser' - -import { Tokenizer } from '@parser/tokenizer' -import { TokenizerFSM } from '@parser/tokenizer-fsm' -import { Keyword } from '@vietscript/shared' - -import { fixtures } from '../__bench__/fixtures' - -function tokenizeRegex(source: string): Token[] { - const parser = new Parser() - parser.syntax = source - parser.tokenizer = new Tokenizer(parser) - const out: Token[] = [] - let t = parser.tokenizer.getNextToken() - while (t !== null) { - out.push(t) - t = parser.tokenizer.getNextToken() - } - return out -} - -function tokenizeFSM(source: string): Token[] { - const parser = new Parser() - parser.syntax = source - parser.tokenizer = new TokenizerFSM(parser) - const out: Token[] = [] - let t = parser.tokenizer.getNextToken() - while (t !== null) { - out.push(t) - t = parser.tokenizer.getNextToken() - } - return out -} - -describe('tokenizer-fsm: parity with regex tokenizer', () => { - it('empty input', () => { - expect(tokenizeFSM('')).toEqual(tokenizeRegex('')) - }) - - it('whitespace only', () => { - expect(tokenizeFSM(' \n\t')).toEqual(tokenizeRegex(' \n\t')) - }) - - it('line comment', () => { - expect(tokenizeFSM('// hello\n42')).toEqual(tokenizeRegex('// hello\n42')) - }) - - it('block comment', () => { - 
expect(tokenizeFSM('/* a\nb */ x')).toEqual(tokenizeRegex('/* a\nb */ x')) - }) - - it('integer literal', () => { - expect(tokenizeFSM('42')).toEqual(tokenizeRegex('42')) - }) - - it('hex literal', () => { - expect(tokenizeFSM('0xFF')).toEqual(tokenizeRegex('0xFF')) - }) - - it('octal literal', () => { - expect(tokenizeFSM('0o17')).toEqual(tokenizeRegex('0o17')) - }) - - it('binary literal', () => { - expect(tokenizeFSM('0b1010')).toEqual(tokenizeRegex('0b1010')) - }) - - it('decimal with fraction', () => { - expect(tokenizeFSM('3.14')).toEqual(tokenizeRegex('3.14')) - }) - - it('exponent literal', () => { - expect(tokenizeFSM('1.5e10')).toEqual(tokenizeRegex('1.5e10')) - }) - - it('bigint literal', () => { - expect(tokenizeFSM('100n')).toEqual(tokenizeRegex('100n')) - }) - - it('numeric separators', () => { - expect(tokenizeFSM('1_000_000')).toEqual(tokenizeRegex('1_000_000')) - }) - - it('leading dot decimal', () => { - expect(tokenizeFSM('.5')).toEqual(tokenizeRegex('.5')) - }) - - it('double-quoted string', () => { - expect(tokenizeFSM('"hello"')).toEqual(tokenizeRegex('"hello"')) - }) - - it('single-quoted string', () => { - expect(tokenizeFSM('\'hello\'')).toEqual(tokenizeRegex('\'hello\'')) - }) - - it('string with escape', () => { - expect(tokenizeFSM('"a\\"b"')).toEqual(tokenizeRegex('"a\\"b"')) - }) - - it('string with unicode escape', () => { - expect(tokenizeFSM('"\\u00E1"')).toEqual(tokenizeRegex('"\\u00E1"')) - }) - - it('template literal simple', () => { - expect(tokenizeFSM('`abc`')).toEqual(tokenizeRegex('`abc`')) - }) - - it('template literal with interpolation', () => { - expect(tokenizeFSM('`a${b}c`')).toEqual(tokenizeRegex('`a${b}c`')) - }) - - it('nested template literal', () => { - expect(tokenizeFSM('`a${`b${c}d`}e`')).toEqual(tokenizeRegex('`a${`b${c}d`}e`')) - }) - - it('regex literal', () => { - expect(tokenizeFSM('x = /abc/g')).toEqual(tokenizeRegex('x = /abc/g')) - }) - - it('division vs regex (after identifier)', () => { - 
expect(tokenizeFSM('x / y')).toEqual(tokenizeRegex('x / y')) - }) - - it('regex with character class', () => { - expect(tokenizeFSM('var r = /[a-z]+/i')).toEqual(tokenizeRegex('var r = /[a-z]+/i')) - }) - - it('all single-char operators', () => { - expect(tokenizeFSM('+-*/%~!&|^?:.,;')).toEqual(tokenizeRegex('+-*/%~!&|^?:.,;')) - }) - - it('compound operator >>>=', () => { - expect(tokenizeFSM('a >>>= b')).toEqual(tokenizeRegex('a >>>= b')) - }) - - it('arrow function tokens', () => { - expect(tokenizeFSM('(a, b) => a + b')).toEqual(tokenizeRegex('(a, b) => a + b')) - }) - - it('all bracket types', () => { - expect(tokenizeFSM('([{}])')).toEqual(tokenizeRegex('([{}])')) - }) - - it('english keyword: var', () => { - expect(tokenizeFSM('var x = 1')).toEqual(tokenizeRegex('var x = 1')) - }) - - it('vietnamese keyword: khai báo', () => { - expect(tokenizeFSM('khai báo x = 1')).toEqual(tokenizeRegex('khai báo x = 1')) - }) - - it('vietnamese keyword: phá vòng lặp', () => { - expect(tokenizeFSM('phá vòng lặp')).toEqual(tokenizeRegex('phá vòng lặp')) - }) - - it('vietnamese keyword: kiểu của', () => { - expect(tokenizeFSM('kiểu của x')).toEqual(tokenizeRegex('kiểu của x')) - }) - - it('multi-word identifier (no embedded keyword)', () => { - expect(tokenizeFSM('khai báo con mèo đẹp = 1')).toEqual(tokenizeRegex('khai báo con mèo đẹp = 1')) - }) - - it('embedded keyword inside identifier should split', () => { - const fsm = tokenizeFSM('khai báo một lớp gì đó = 1') - const rx = tokenizeRegex('khai báo một lớp gì đó = 1') - expect(fsm).toEqual(rx) - }) - - it('boolean keywords vi/en', () => { - expect(tokenizeFSM('đúng sai true false')).toEqual(tokenizeRegex('đúng sai true false')) - }) - - it('null/Infinity/NaN/undefined', () => { - expect(tokenizeFSM('null Infinity NaN undefined rỗng vô cực không xác định')).toEqual(tokenizeRegex('null Infinity NaN undefined rỗng vô cực không xác định')) - }) - - it('throw on unknown char', () => { - expect(() => 
tokenizeFSM('@@@')).toThrow() - }) - - it('riêng tư followed by identifier should not consume trailing letters', () => { - const t = tokenizeFSM('riêng tư x') - expect(t).toHaveLength(2) - expect(t[0].type).toBe(Keyword.PRIVATE) - expect(t[1].value).toBe('x') - }) - - it('bảo vệ keyword followed by identifier', () => { - const t = tokenizeFSM('bảo vệ phương thức') - expect(t).toHaveLength(2) - expect(t[0].type).toBe(Keyword.PROTECTED) - expect(t[1].type).toBe(Keyword.IDENTIFIER) - }) -}) - -describe('tokenizer-fsm: produces identical output to regex tokenizer on bench fixtures', () => { - for (const [name, source] of Object.entries(fixtures)) { - it(`fixture parity: ${name}`, () => { - const rx = tokenizeRegex(source) - const fsm = tokenizeFSM(source) - expect(fsm.length).toBe(rx.length) - for (let i = 0; i < rx.length; i++) { - expect({ idx: i, ...fsm[i] }).toEqual({ idx: i, ...rx[i] }) - } - }, 60_000) - } -}) - -describe('tokenizer-fsm: rollback works', () => { - it('rollback decrements cursor and lookahead end', () => { - const parser = new Parser() - parser.syntax = 'abc; def' - parser.tokenizer = new TokenizerFSM(parser) - const first = parser.tokenizer.getNextToken() - expect(first?.value).toBe('abc') - const before = (parser.tokenizer as TokenizerFSM).getCursor() - ;(parser.tokenizer as TokenizerFSM).rollback(2) - expect((parser.tokenizer as TokenizerFSM).getCursor()).toBe(before - 2) - }) -}) - -describe('tokenizer-fsm: isEOF', () => { - it('is true at end', () => { - const parser = new Parser() - parser.syntax = 'a' - parser.tokenizer = new TokenizerFSM(parser) - parser.tokenizer.getNextToken() - expect(parser.tokenizer.isEOF()).toBe(true) - }) -}) diff --git a/packages/parser/src/constants/specs.ts b/packages/parser/src/constants/specs.ts deleted file mode 100644 index 5f64172..0000000 --- a/packages/parser/src/constants/specs.ts +++ /dev/null @@ -1,159 +0,0 @@ -import type { Spec } from '@vietscript/shared' -import { Keyword } from '@vietscript/shared' 
- -export const SpecIdentifier = [/^[A-Za-z\u00C0-\u1EF9][A-Za-z0-9\u00C0-\u1EF9]*(\s[A-Za-z\u00C0-\u1EF9][A-Za-z0-9\u00C0-\u1EF9]*)*/, Keyword.IDENTIFIER] as Spec - -export const Specs: Array<Spec> = [ - // -------------------------------------- - // Whitespace: - [/^\s+/, null], - - // -------------------------------------- - // Comments: - [/^\/\/.*/, null], - [/^\/\*[\s\S]*?\*\//, null], - - // -------------------------------------- - // Symbols and delimiters (ordered longest-first to avoid prefix conflicts): - [/^\[/, '['], - [/^\]/, ']'], - [/^\(/, '('], - [/^\)/, ')'], - [/^\{/, '{'], - [/^\}/, '}'], - [/^;/, ';'], - [/^,/, ','], - [/^:/, ':'], - [/^\.{3}/, '...'], - [/^\.[\d_]+([eE][+-]?\d[\d_]*)?n?/, Keyword.NUMBER], - [/^\./, '.'], - [/^#/, '#'], - - [/^>>>=/, '>>>='], - [/^>>>/, '>>>'], - [/^>>=/, '>>='], - [/^>>/, '>>'], - [/^<<=/, '<<='], - [/^<</, '<<'], - [/^<=/, '<='], - [/^>=/, '>='], - [/^</, '<'], - [/^>/, '>'], - - [/^===/, '==='], - [/^!==/, '!=='], - [/^==/, '=='], - [/^!=/, '!='], - - [/^=>/, '=>'], - [/^\*\*=/, '**='], - [/^\*\*/, '**'], - [/^\*=/, '*='], - [/^\*/, '*'], - [/^\+\+/, '++'], - [/^\+=/, '+='], - [/^\+/, '+'], - [/^--/, '--'], - [/^-=/, '-='], - [/^-/, '-'], - [/^\/=/, '/='], - [/^\//, '/'], - [/^%=/, '%='], - [/^%/, '%'], - - [/^&&=/, '&&='], - [/^&&/, '&&'], - [/^&=/, '&='], - [/^&/, '&'], - [/^\|\|=/, '||='], - [/^\|\|/, '||'], - [/^\|=/, '|='], - [/^\|/, '|'], - [/^\^=/, '^='], - [/^\^/, '^'], - [/^~/, '~'], - [/^!/, '!'], - - [/^\?\?=/, '??='], - [/^\?\?/, '??'], - [/^\?\./, '?.'], - [/^\?/, '?'], - - [/^=/, '='], - - // -------------------------------------- - // Keywords - [/^(var|khai b\u00E1o)\b/, Keyword.VAR], - [/^(break|ph\u00E1 v\u00F2ng l\u1EB7p)\b/, Keyword.BREAK], - [/^(do|th\u1EF1c hi\u1EC7n)\b/, Keyword.DO], - [/^(instanceof|l\u00E0 ki\u1EC3u)\b/, Keyword.INSTANCEOF], - [/^(typeof|ki\u1EC3u c\u1EE7a)\b/, Keyword.TYPEOF], - [/^(switch|duy\u1EC7t)\b/, Keyword.SWITCH], - [/^(case|tr\u01B0\u1EDDng h\u1EE3p)\b/, Keyword.CASE], - [/^(if|n\u1EBFu)\b/, Keyword.IF], - 
[/^(else|kh\u00F4ng th\u00EC)/, Keyword.ELSE], - [/^new\b/, Keyword.NEW], - [/^(catch|b\u1EAFt l\u1ED7i)\b/, Keyword.CATCH], - [/^(finally|cu\u1ED1i c\u00F9ng)\b/, Keyword.FINALLY], - [/^(return|tr\u1EA3 v\u1EC1)/, Keyword.RETURN], - [/^void\b/, Keyword.VOID], - [/^(continue|ti\u1EBFp t\u1EE5c)\b/, Keyword.CONTINUE], - [/^(for|l\u1EB7p)\b/, Keyword.FOR], - [/^(while\b|khi m\u00E0(?![A-Za-z\u00C0-\u1EF9]))/, Keyword.WHILE], - [/^debugger\b/, Keyword.DEBUGGER], - [/^(function|h\u00E0m)\b/, Keyword.FUNCTION], - [/^(this\b|\u0111\u00E2y\b)/, Keyword.THIS], - [/^with\b/, Keyword.WITH], - [/^(default|m\u1EB7c \u0111\u1ECBnh)\b/, Keyword.DEFAULT], - [/^(throw|b\u00E1o l\u1ED7i)\b/, Keyword.THROW], - [/^(delete\b|xo\u00E1(?![A-Za-z\u00C0-\u1EF9]))/, Keyword.DELETE], - [/^(in|trong)\b/, Keyword.IN], - [/^(of|c\u1EE7a)\b/, Keyword.OF], - [/^(try|th\u1EED)/, Keyword.TRY], - [/^(as|nh\u01B0 l\u00E0)/, Keyword.AS], - [/^(from|t\u1EEB)/, Keyword.FROM], - - // -------------------------------------- - // Future Reserved Words - [/^const|h\u1EB1ng s\u1ED1/, Keyword.CONST], - [/^(class|l\u1EDBp)\b/, Keyword.CLASS], - [/^(super|kh\u1EDFi t\u1EA1o cha)\b/, Keyword.SUPER], - [/^(constructor|kh\u1EDFi t\u1EA1o)\b/, Keyword.CONSTRUCTOR], - [/^(extends|k\u1EBF th\u1EEBa)\b/, Keyword.EXTENDS], - [/^(export|cho ph\u00E9p)\b/, Keyword.EXPORT], - [/^(import|s\u1EED d\u1EE5ng)\b/, Keyword.IMPORT], - [/^(async|b\u1EA5t \u0111\u1ED3ng b\u1ED9)/, Keyword.ASYNC], - [/^(await\b|ch\u1EDD(?![A-Za-z\u00C0-\u1EF9]))/, Keyword.AWAIT], - [/^(yield|nh\u01B0\u1EDDng)\b/, Keyword.YIELD], - [/^(let|bi\u1EBFn)\b/, Keyword.LET], - [/^(private\b|ri\u00EAng t\u01B0(?![A-Za-z\u00C0-\u1EF9]))/, Keyword.PRIVATE], - [/^(public|c\u00F4ng khai)\b/, Keyword.PUBLIC], - [/^(protected\b|b\u1EA3o v\u1EC7(?![A-Za-z\u00C0-\u1EF9]))/, Keyword.PROTECTED], - [/^(static|t\u0129nh)\b/, Keyword.STATIC], - [/^(get|l\u1EA5y)\b/, Keyword.GET], - [/^(set|g\u00E1n)\b/, Keyword.SET], - - // -------------------------------------- - // 
Numbers (order matters: hex/oct/bin before decimal): - [/^0[xX][0-9a-fA-F][0-9a-fA-F_]*n?/, Keyword.NUMBER], - [/^0[oO][0-7][0-7_]*n?/, Keyword.NUMBER], - [/^0[bB][01][01_]*n?/, Keyword.NUMBER], - [/^(\d[\d_]*(\.[\d_]*)?|\.[\d_]+)([eE][+-]?\d[\d_]*)?n?/, Keyword.NUMBER], - - // -------------------------------------- - // Strings (with escape support): - [/^"(?:\\[\s\S]|[^"\\])*"/, Keyword.STRING], - [/^'(?:\\[\s\S]|[^'\\])*'/, Keyword.STRING], - - // -------------------------------------- - // Literal with Keyword: - [/^(null|r\u1ED7ng)\b/, Keyword.NULL], - [/^NaN\b/, Keyword.NAN], - [/^(Infinity|v\u00F4 c\u1EF1c)\b/, Keyword.INFINITY], - [/^(undefined|kh\u00F4ng x\u00E1c \u0111\u1ECBnh)\b/, Keyword.UNDEFINED], - [/(true|false|\u0111\u00FAng|sai)\b/, Keyword.BOOLEAN], - - // -------------------------------------- - // Identifier - SpecIdentifier, -] diff --git a/packages/parser/src/index.ts b/packages/parser/src/index.ts index aa68b96..e855540 100644 --- a/packages/parser/src/index.ts +++ b/packages/parser/src/index.ts @@ -6,9 +6,7 @@ export default parser export { VietScriptError } from './errors' export { Parser } from './parser' -export type { ITokenizer, ParserOptions, TokenizerKind } from './parser' export { Tokenizer } from './tokenizer' -export { TokenizerFSM } from './tokenizer-fsm' if (typeof window !== 'undefined') { (window as unknown as { VietScript: { parser: Parser } }).VietScript = { parser } diff --git a/packages/parser/src/nodes/literals/TemplateLiteral.ts b/packages/parser/src/nodes/literals/TemplateLiteral.ts index 4df5a45..720e8b5 100644 --- a/packages/parser/src/nodes/literals/TemplateLiteral.ts +++ b/packages/parser/src/nodes/literals/TemplateLiteral.ts @@ -1,4 +1,5 @@ -import { createTokenizer, Parser } from '@parser/parser' +import { Parser } from '@parser/parser' +import { Tokenizer } from '@parser/tokenizer' import { Expression } from '../expressions/Expression' @@ -122,9 +123,9 @@ export class TemplateLiteral { } for (const exprSource of 
expressions) { - const subParser = new Parser({ tokenizer: parser.tokenizerKind }) + const subParser = new Parser() subParser.syntax = exprSource - subParser.tokenizer = createTokenizer(subParser) + subParser.tokenizer = new Tokenizer(subParser) subParser.lookahead = subParser.tokenizer.getNextToken() this.expressions.push(new Expression(subParser)) } diff --git a/packages/parser/src/parser.ts b/packages/parser/src/parser.ts index a695ee1..16bfa6c 100644 --- a/packages/parser/src/parser.ts +++ b/packages/parser/src/parser.ts @@ -4,32 +4,11 @@ import { Keyword } from '@vietscript/shared' import { VietScriptError } from './errors' import { Program } from './nodes/Program' import { Tokenizer } from './tokenizer' -import { TokenizerFSM } from './tokenizer-fsm' - -export type TokenizerKind = 'regex' | 'fsm' - -export interface ITokenizer { - getNextToken: () => Token | null - isEOF: () => boolean - rollback: (step: number) => number -} - -export interface ParserOptions { - tokenizer?: TokenizerKind -} - -export function createTokenizer(parser: Parser): ITokenizer { - return parser.tokenizerKind === 'fsm' - ? new TokenizerFSM(parser) - : new Tokenizer(parser) -} export class Parser { public syntax: string - public tokenizer: ITokenizer - - public tokenizerKind: TokenizerKind + public tokenizer: Tokenizer public lookahead: Token | null @@ -37,16 +16,15 @@ export class Parser { public ternaryDepth = 0 - constructor(options: ParserOptions = {}) { + constructor() { this.syntax = '' - this.tokenizerKind = options.tokenizer ?? 
'fsm' - this.tokenizer = createTokenizer(this) + this.tokenizer = new Tokenizer(this) this.lookahead = null } public parse(syntax: string, InitAtsNodeClass?: new (parser: Parser) => unknown): any { this.syntax = syntax - this.tokenizer = createTokenizer(this) + this.tokenizer = new Tokenizer(this) this.lookahead = this.tokenizer.getNextToken() if (InitAtsNodeClass) diff --git a/packages/parser/src/tokenizer-fsm.ts b/packages/parser/src/tokenizer-fsm.ts deleted file mode 100644 index b67709b..0000000 --- a/packages/parser/src/tokenizer-fsm.ts +++ /dev/null @@ -1,917 +0,0 @@ -import type { Token } from '@vietscript/shared' -import type { Parser } from './parser' - -import { Keyword } from '@vietscript/shared' - -// Boundary kinds (post-keyword check): -// WORD: next char must not be in [A-Za-z0-9_] (mimics JS \b after ASCII word). -// IDENT: next char must not be in [A-Za-zÀ-ỹ] (mimics negative -// lookahead used for VI keywords ending in non-ASCII). -// NONE: no boundary check (mimics keywords without \b such as `else`, -// `return`, `try`, `as`, `from`, `const`, `async`). -const Boundary = { - WORD: 0, - IDENT: 1, - NONE: 2, -} as const - -type BoundaryKind = typeof Boundary[keyof typeof Boundary] - -interface KeywordEntry { - text: string - type: Keyword - boundary: BoundaryKind -} - -// Mirrors the regex spec table in constants/specs.ts. The FSM walks a trie -// built from these entries to detect keywords; ordering does not matter here -// because the trie picks the longest valid match with a passing boundary. 
-const KEYWORDS: ReadonlyArray<KeywordEntry> = [ - { text: 'var', type: Keyword.VAR, boundary: Boundary.WORD }, - { text: 'khai báo', type: Keyword.VAR, boundary: Boundary.WORD }, - { text: 'break', type: Keyword.BREAK, boundary: Boundary.WORD }, - { text: 'phá vòng lặp', type: Keyword.BREAK, boundary: Boundary.WORD }, - { text: 'do', type: Keyword.DO, boundary: Boundary.WORD }, - { text: 'thực hiện', type: Keyword.DO, boundary: Boundary.WORD }, - { text: 'instanceof', type: Keyword.INSTANCEOF, boundary: Boundary.WORD }, - { text: 'là kiểu', type: Keyword.INSTANCEOF, boundary: Boundary.WORD }, - { text: 'typeof', type: Keyword.TYPEOF, boundary: Boundary.WORD }, - { text: 'kiểu của', type: Keyword.TYPEOF, boundary: Boundary.WORD }, - { text: 'switch', type: Keyword.SWITCH, boundary: Boundary.WORD }, - { text: 'duyệt', type: Keyword.SWITCH, boundary: Boundary.WORD }, - { text: 'case', type: Keyword.CASE, boundary: Boundary.WORD }, - { text: 'trường hợp', type: Keyword.CASE, boundary: Boundary.WORD }, - { text: 'if', type: Keyword.IF, boundary: Boundary.WORD }, - { text: 'nếu', type: Keyword.IF, boundary: Boundary.WORD }, - { text: 'else', type: Keyword.ELSE, boundary: Boundary.NONE }, - { text: 'không thì', type: Keyword.ELSE, boundary: Boundary.NONE }, - { text: 'new', type: Keyword.NEW, boundary: Boundary.WORD }, - { text: 'catch', type: Keyword.CATCH, boundary: Boundary.WORD }, - { text: 'bắt lỗi', type: Keyword.CATCH, boundary: Boundary.WORD }, - { text: 'finally', type: Keyword.FINALLY, boundary: Boundary.WORD }, - { text: 'cuối cùng', type: Keyword.FINALLY, boundary: Boundary.WORD }, - { text: 'return', type: Keyword.RETURN, boundary: Boundary.NONE }, - { text: 'trả về', type: Keyword.RETURN, boundary: Boundary.NONE }, - { text: 'void', type: Keyword.VOID, boundary: Boundary.WORD }, - { text: 'continue', type: Keyword.CONTINUE, boundary: Boundary.WORD }, - { text: 'tiếp tục', type: Keyword.CONTINUE, boundary: Boundary.WORD }, - { text: 'for', type: Keyword.FOR, boundary: 
Boundary.WORD }, - { text: 'lặp', type: Keyword.FOR, boundary: Boundary.WORD }, - { text: 'while', type: Keyword.WHILE, boundary: Boundary.WORD }, - { text: 'khi mà', type: Keyword.WHILE, boundary: Boundary.IDENT }, - { text: 'debugger', type: Keyword.DEBUGGER, boundary: Boundary.WORD }, - { text: 'function', type: Keyword.FUNCTION, boundary: Boundary.WORD }, - { text: 'hàm', type: Keyword.FUNCTION, boundary: Boundary.WORD }, - { text: 'this', type: Keyword.THIS, boundary: Boundary.WORD }, - { text: 'đây', type: Keyword.THIS, boundary: Boundary.WORD }, - { text: 'with', type: Keyword.WITH, boundary: Boundary.WORD }, - { text: 'default', type: Keyword.DEFAULT, boundary: Boundary.WORD }, - { text: 'mặc định', type: Keyword.DEFAULT, boundary: Boundary.WORD }, - { text: 'throw', type: Keyword.THROW, boundary: Boundary.WORD }, - { text: 'báo lỗi', type: Keyword.THROW, boundary: Boundary.WORD }, - { text: 'delete', type: Keyword.DELETE, boundary: Boundary.WORD }, - { text: 'xoá', type: Keyword.DELETE, boundary: Boundary.IDENT }, - { text: 'in', type: Keyword.IN, boundary: Boundary.WORD }, - { text: 'trong', type: Keyword.IN, boundary: Boundary.WORD }, - { text: 'of', type: Keyword.OF, boundary: Boundary.WORD }, - { text: 'của', type: Keyword.OF, boundary: Boundary.WORD }, - { text: 'try', type: Keyword.TRY, boundary: Boundary.NONE }, - { text: 'thử', type: Keyword.TRY, boundary: Boundary.NONE }, - { text: 'as', type: Keyword.AS, boundary: Boundary.NONE }, - { text: 'như là', type: Keyword.AS, boundary: Boundary.NONE }, - { text: 'from', type: Keyword.FROM, boundary: Boundary.NONE }, - { text: 'từ', type: Keyword.FROM, boundary: Boundary.NONE }, - { text: 'const', type: Keyword.CONST, boundary: Boundary.NONE }, - { text: 'hằng số', type: Keyword.CONST, boundary: Boundary.NONE }, - { text: 'class', type: Keyword.CLASS, boundary: Boundary.WORD }, - { text: 'lớp', type: Keyword.CLASS, boundary: Boundary.WORD }, - { text: 'super', type: Keyword.SUPER, boundary: Boundary.WORD 
}, - { text: 'khởi tạo cha', type: Keyword.SUPER, boundary: Boundary.WORD }, - { text: 'constructor', type: Keyword.CONSTRUCTOR, boundary: Boundary.WORD }, - { text: 'khởi tạo', type: Keyword.CONSTRUCTOR, boundary: Boundary.WORD }, - { text: 'extends', type: Keyword.EXTENDS, boundary: Boundary.WORD }, - { text: 'kế thừa', type: Keyword.EXTENDS, boundary: Boundary.WORD }, - { text: 'export', type: Keyword.EXPORT, boundary: Boundary.WORD }, - { text: 'cho phép', type: Keyword.EXPORT, boundary: Boundary.WORD }, - { text: 'import', type: Keyword.IMPORT, boundary: Boundary.WORD }, - { text: 'sử dụng', type: Keyword.IMPORT, boundary: Boundary.WORD }, - { text: 'async', type: Keyword.ASYNC, boundary: Boundary.NONE }, - { text: 'bất đồng bộ', type: Keyword.ASYNC, boundary: Boundary.NONE }, - { text: 'await', type: Keyword.AWAIT, boundary: Boundary.WORD }, - { text: 'chờ', type: Keyword.AWAIT, boundary: Boundary.IDENT }, - { text: 'yield', type: Keyword.YIELD, boundary: Boundary.WORD }, - { text: 'nhường', type: Keyword.YIELD, boundary: Boundary.WORD }, - { text: 'let', type: Keyword.LET, boundary: Boundary.WORD }, - { text: 'biến', type: Keyword.LET, boundary: Boundary.WORD }, - { text: 'private', type: Keyword.PRIVATE, boundary: Boundary.WORD }, - { text: 'riêng tư', type: Keyword.PRIVATE, boundary: Boundary.IDENT }, - { text: 'public', type: Keyword.PUBLIC, boundary: Boundary.WORD }, - { text: 'công khai', type: Keyword.PUBLIC, boundary: Boundary.WORD }, - { text: 'protected', type: Keyword.PROTECTED, boundary: Boundary.WORD }, - { text: 'bảo vệ', type: Keyword.PROTECTED, boundary: Boundary.IDENT }, - { text: 'static', type: Keyword.STATIC, boundary: Boundary.WORD }, - { text: 'tĩnh', type: Keyword.STATIC, boundary: Boundary.WORD }, - { text: 'get', type: Keyword.GET, boundary: Boundary.WORD }, - { text: 'lấy', type: Keyword.GET, boundary: Boundary.WORD }, - { text: 'set', type: Keyword.SET, boundary: Boundary.WORD }, - { text: 'gán', type: Keyword.SET, boundary: 
Boundary.WORD }, - { text: 'null', type: Keyword.NULL, boundary: Boundary.WORD }, - { text: 'rỗng', type: Keyword.NULL, boundary: Boundary.WORD }, - { text: 'NaN', type: Keyword.NAN, boundary: Boundary.WORD }, - { text: 'Infinity', type: Keyword.INFINITY, boundary: Boundary.WORD }, - { text: 'vô cực', type: Keyword.INFINITY, boundary: Boundary.WORD }, - { text: 'undefined', type: Keyword.UNDEFINED, boundary: Boundary.WORD }, - { text: 'không xác định', type: Keyword.UNDEFINED, boundary: Boundary.WORD }, - { text: 'true', type: Keyword.BOOLEAN, boundary: Boundary.WORD }, - { text: 'false', type: Keyword.BOOLEAN, boundary: Boundary.WORD }, - { text: 'đúng', type: Keyword.BOOLEAN, boundary: Boundary.WORD }, - { text: 'sai', type: Keyword.BOOLEAN, boundary: Boundary.WORD }, -] - -class TrieNode { - children = new Map<number, TrieNode>() - type: Keyword | null = null - boundary: BoundaryKind = Boundary.WORD -} - -const KEYWORD_TRIE: TrieNode = (() => { - const root = new TrieNode() - for (const entry of KEYWORDS) { - let node = root - for (let i = 0; i < entry.text.length; i++) { - const code = entry.text.charCodeAt(i) - let child = node.children.get(code) - if (!child) { - child = new TrieNode() - node.children.set(code, child) - } - node = child - } - if (node.type === null) { - node.type = entry.type - node.boundary = entry.boundary - } - } - return root -})() - -// Operator longest-match tree. Built from a flat list of operator strings. -// Order independent — at lookup time we walk character-by-character and keep -// track of the longest valid operator end seen. 
-const OPERATORS: readonly string[] = [ - '...', - '.', - '>>>=', - '>>>', - '>>=', - '>>', - '<<=', - '<<', - '<=', - '>=', - '<', - '>', - '===', - '!==', - '==', - '!=', - '=>', - '**=', - '**', - '*=', - '*', - '++', - '+=', - '+', - '--', - '-=', - '-', - '/=', - '/', - '%=', - '%', - '&&=', - '&&', - '&=', - '&', - '||=', - '||', - '|=', - '|', - '^=', - '^', - '~', - '!', - '??=', - '??', - '?.', - '?', - '=', - '[', - ']', - '(', - ')', - '{', - '}', - ';', - ',', - ':', - '#', -] - -class OperatorNode { - children = new Map<number, OperatorNode>() - value: string | null = null -} - -const OPERATOR_TRIE: OperatorNode = (() => { - const root = new OperatorNode() - for (const op of OPERATORS) { - let node = root - for (let i = 0; i < op.length; i++) { - const code = op.charCodeAt(i) - let child = node.children.get(code) - if (!child) { - child = new OperatorNode() - node.children.set(code, child) - } - node = child - } - node.value = op - } - return root -})() - -const REGEX_PRECEDING_TOKENS = new Set([ - '(', - '[', - '{', - ',', - ';', - ':', - '=', - '!', - '?', - '+', - '-', - '*', - '/', - '%', - '&&', - '||', - '??', - '=>', - '==', - '===', - '!=', - '!==', - '<', - '>', - '<=', - '>=', - '&', - '|', - '^', - '~', - '<<', - '>>', - '>>>', - '+=', - '-=', - '*=', - '/=', - '%=', - '**=', - '&=', - '|=', - '^=', - '<<=', - '>>=', - '>>>=', - '&&=', - '||=', - '??=', - '...', - Keyword.RETURN, - Keyword.YIELD, - Keyword.AWAIT, - Keyword.TYPEOF, - Keyword.VOID, - Keyword.DELETE, - Keyword.NEW, - Keyword.THROW, - Keyword.IN, - Keyword.OF, - Keyword.INSTANCEOF, - Keyword.CASE, - Keyword.DEFAULT, -]) - -function isWhitespace(code: number): boolean { - // Mirrors JS \s minus what we don't expect in source. We rely on String.prototype - // to be lenient — anything not handled below falls through to the operator - // dispatcher and throws cleanly. 
- return code === 0x20 /* space */ - || code === 0x09 /* tab */ - || code === 0x0A /* LF */ - || code === 0x0D /* CR */ - || code === 0x0B /* VT */ - || code === 0x0C /* FF */ - || code === 0xA0 /* NBSP */ -} - -function isAsciiLetter(code: number): boolean { - return (code >= 0x41 && code <= 0x5A) || (code >= 0x61 && code <= 0x7A) -} - -function isVietnameseLetter(code: number): boolean { - return code >= 0x00C0 && code <= 0x1EF9 -} - -function isIdentStart(code: number): boolean { - return isAsciiLetter(code) || isVietnameseLetter(code) -} - -function isDigit(code: number): boolean { - return code >= 0x30 && code <= 0x39 -} - -function isIdentCont(code: number): boolean { - return isIdentStart(code) || isDigit(code) -} - -function isHexDigit(code: number): boolean { - return isDigit(code) - || (code >= 0x41 && code <= 0x46) - || (code >= 0x61 && code <= 0x66) -} - -function isOctalDigit(code: number): boolean { - return code >= 0x30 && code <= 0x37 -} - -function isBinaryDigit(code: number): boolean { - return code === 0x30 || code === 0x31 -} - -function isAsciiWordChar(code: number): boolean { - return isAsciiLetter(code) || isDigit(code) || code === 0x5F /* _ */ -} - -export class TokenizerFSM { - private parser: Parser - - private cursor: number - - private lastTokenType: string | null = null - - constructor(parser: Parser) { - this.parser = parser - this.cursor = 0 - } - - public getCursor(): number { - return this.cursor - } - - public rollback(step: number): number { - if (this.parser.lookahead) - this.parser.lookahead.end -= step - this.cursor -= step - return this.cursor - } - - public isEOF(): boolean { - return this.cursor === this.parser.syntax.length - } - - protected hasMoreTokens(): boolean { - return this.cursor < this.parser.syntax.length - } - - public getNextToken(): Token | null { - const source = this.parser.syntax - const length = source.length - - while (this.cursor < length) { - const start = this.cursor - const code = 
source.charCodeAt(this.cursor) - - if (isWhitespace(code)) { - this.cursor++ - continue - } - - // Line comment: //... - if (code === 0x2F && source.charCodeAt(this.cursor + 1) === 0x2F) { - this.cursor += 2 - while (this.cursor < length && source.charCodeAt(this.cursor) !== 0x0A) { - this.cursor++ - } - continue - } - - // Block comment: /* ... */ - if (code === 0x2F && source.charCodeAt(this.cursor + 1) === 0x2A) { - this.cursor += 2 - while (this.cursor < length) { - if (source.charCodeAt(this.cursor) === 0x2A - && source.charCodeAt(this.cursor + 1) === 0x2F) { - this.cursor += 2 - break - } - this.cursor++ - } - continue - } - - // Template literal - if (code === 0x60) { - return this.scanTemplateLiteral(start) - } - - // Regex literal (context-sensitive) - if (code === 0x2F && this.isRegexExpected()) { - const tok = this.scanRegexLiteral(start) - if (tok !== null) { - this.lastTokenType = tok.type as string - return tok - } - } - - // String literals - if (code === 0x22 || code === 0x27) { - return this.scanString(start, code) - } - - // Numeric literals: digits, or `.` followed by digit - if (isDigit(code)) { - return this.scanNumber(start) - } - if (code === 0x2E /* . 
*/ && isDigit(source.charCodeAt(this.cursor + 1))) { - return this.scanNumber(start) - } - - // Identifier / keyword - if (isIdentStart(code)) { - return this.scanIdentifierOrKeyword(start) - } - - // Operator (longest match via trie) - const opTok = this.scanOperator(start) - if (opTok !== null) { - return opTok - } - - throw new SyntaxError(`Unexpected token: "${source[this.cursor]}"`) - } - - return null - } - - private isRegexExpected(): boolean { - if (this.lastTokenType === null) - return true - return REGEX_PRECEDING_TOKENS.has(this.lastTokenType) - } - - private scanString(start: number, quote: number): Token { - const source = this.parser.syntax - const length = source.length - let i = start + 1 - while (i < length) { - const ch = source.charCodeAt(i) - if (ch === 0x5C /* \ */) { - i += 2 - continue - } - if (ch === quote) { - i++ - const value = source.slice(start, i) - this.cursor = i - this.lastTokenType = Keyword.STRING - return { - type: Keyword.STRING, - value, - start, - end: i, - } - } - i++ - } - throw new SyntaxError(`Unterminated string literal at ${start}`) - } - - private scanNumber(start: number): Token { - const source = this.parser.syntax - const length = source.length - let i = start - - // Leading dot decimal: .5, .5e2, .5n (n probably nonsense, but match regex) - if (source.charCodeAt(i) === 0x2E /* . 
*/) { - i++ - while (i < length) { - const c = source.charCodeAt(i) - if (isDigit(c) || c === 0x5F) { - i++ - continue - } - break - } - i = this.consumeExponent(i) - i = this.consumeBigIntSuffix(i) - return this.emitNumber(start, i) - } - - // 0x / 0o / 0b - if (source.charCodeAt(i) === 0x30) { - const next = source.charCodeAt(i + 1) - if (next === 0x78 || next === 0x58) { - i += 2 - while (i < length) { - const c = source.charCodeAt(i) - if (isHexDigit(c) || c === 0x5F) { - i++ - continue - } - break - } - i = this.consumeBigIntSuffix(i) - return this.emitNumber(start, i) - } - if (next === 0x6F || next === 0x4F) { - i += 2 - while (i < length) { - const c = source.charCodeAt(i) - if (isOctalDigit(c) || c === 0x5F) { - i++ - continue - } - break - } - i = this.consumeBigIntSuffix(i) - return this.emitNumber(start, i) - } - if (next === 0x62 || next === 0x42) { - i += 2 - while (i < length) { - const c = source.charCodeAt(i) - if (isBinaryDigit(c) || c === 0x5F) { - i++ - continue - } - break - } - i = this.consumeBigIntSuffix(i) - return this.emitNumber(start, i) - } - } - - // Decimal: digits[_digits]*[.digits[_digits]*]? - while (i < length) { - const c = source.charCodeAt(i) - if (isDigit(c) || c === 0x5F) { - i++ - continue - } - break - } - if (source.charCodeAt(i) === 0x2E /* . 
*/) { - i++ - while (i < length) { - const c = source.charCodeAt(i) - if (isDigit(c) || c === 0x5F) { - i++ - continue - } - break - } - } - i = this.consumeExponent(i) - i = this.consumeBigIntSuffix(i) - return this.emitNumber(start, i) - } - - private consumeExponent(i: number): number { - const source = this.parser.syntax - const length = source.length - const c = source.charCodeAt(i) - if (c !== 0x65 && c !== 0x45 /* e/E */) { - return i - } - let j = i + 1 - const sign = source.charCodeAt(j) - if (sign === 0x2B || sign === 0x2D /* + or - */) { - j++ - } - if (!isDigit(source.charCodeAt(j))) { - return i - } - j++ - while (j < length) { - const cc = source.charCodeAt(j) - if (isDigit(cc) || cc === 0x5F) { - j++ - continue - } - break - } - return j - } - - private consumeBigIntSuffix(i: number): number { - if (this.parser.syntax.charCodeAt(i) === 0x6E /* n */) { - return i + 1 - } - return i - } - - private emitNumber(start: number, end: number): Token { - const value = this.parser.syntax.slice(start, end) - this.cursor = end - this.lastTokenType = Keyword.NUMBER - return { - type: Keyword.NUMBER, - value, - start, - end, - } - } - - private scanIdentifierOrKeyword(start: number): Token { - // Try keyword trie first; on a hit with passing boundary we emit the keyword. - // Otherwise we fall back to identifier scanning (with multi-word support - // and embedded-keyword truncation). 
- const kw = this.matchKeyword(start) - if (kw !== null) { - this.cursor = kw.end - this.lastTokenType = kw.type as string - return { - type: kw.type, - value: this.parser.syntax.slice(start, kw.end), - start, - end: kw.end, - } - } - return this.scanIdentifier(start) - } - - private matchKeyword(start: number): { type: Keyword, end: number } | null { - const source = this.parser.syntax - const length = source.length - let node: TrieNode | undefined = KEYWORD_TRIE - let i = start - let bestType: Keyword | null = null - let bestEnd = -1 - - while (i < length && node !== undefined) { - const code = source.charCodeAt(i) - const next = node.children.get(code) - if (next === undefined) - break - i++ - node = next - if (node.type !== null && this.boundaryOk(node.boundary, i)) { - bestType = node.type - bestEnd = i - } - } - - if (bestType !== null && bestEnd !== -1) { - return { type: bestType, end: bestEnd } - } - return null - } - - private boundaryOk(kind: BoundaryKind, end: number): boolean { - if (kind === Boundary.NONE) - return true - if (end >= this.parser.syntax.length) - return true - const code = this.parser.syntax.charCodeAt(end) - if (kind === Boundary.WORD) { - // Mimic JS \b after ASCII word char: next must not be [A-Za-z0-9_] - return !isAsciiWordChar(code) - } - // IDENT: next must not be [A-Za-zÀ-ỹ] - return !isIdentStart(code) - } - - private scanIdentifier(start: number): Token { - const source = this.parser.syntax - const length = source.length - let i = start - if (!isIdentStart(source.charCodeAt(i))) { - throw new SyntaxError(`Unexpected token: "${source[i]}"`) - } - i++ - while (i < length && isIdentCont(source.charCodeAt(i))) { - i++ - } - - // Multi-word identifier: consume ` ` repeatedly, but stop before - // a word that would itself start a keyword at this position. 
- while (i < length) { - if (source.charCodeAt(i) !== 0x20 /* space */) - break - const wordStart = i + 1 - if (wordStart >= length) - break - const wordCode = source.charCodeAt(wordStart) - if (!isIdentStart(wordCode)) - break - // Embedded-keyword check: starting from wordStart, would a keyword - // match? If yes, do NOT consume the space, terminate identifier here. - if (this.matchKeyword(wordStart) !== null) { - break - } - i = wordStart + 1 - while (i < length && isIdentCont(source.charCodeAt(i))) { - i++ - } - } - - const value = source.slice(start, i) - this.cursor = i - this.lastTokenType = Keyword.IDENTIFIER - return { - type: Keyword.IDENTIFIER, - value, - start, - end: i, - } - } - - private scanOperator(start: number): Token | null { - const source = this.parser.syntax - const length = source.length - let node: OperatorNode | undefined = OPERATOR_TRIE - let i = start - let bestEnd = -1 - let bestValue: string | null = null - - while (i < length && node !== undefined) { - const code = source.charCodeAt(i) - const next = node.children.get(code) - if (next === undefined) - break - i++ - node = next - if (node.value !== null) { - bestEnd = i - bestValue = node.value - } - } - - if (bestValue === null || bestEnd === -1) { - return null - } - this.cursor = bestEnd - this.lastTokenType = bestValue - return { - type: bestValue, - value: bestValue, - start, - end: bestEnd, - } - } - - private scanRegexLiteral(start: number): Token | null { - const source = this.parser.syntax - const length = source.length - let i = start + 1 - let inCharClass = false - - while (i < length) { - const ch = source.charCodeAt(i) - if (ch === 0x5C /* \ */) { - i += 2 - continue - } - if (ch === 0x5B /* [ */) { - inCharClass = true - i++ - continue - } - if (ch === 0x5D /* ] */) { - inCharClass = false - i++ - continue - } - if (ch === 0x2F /* / */ && !inCharClass) { - i++ - while (i < length) { - const fc = source.charCodeAt(i) - if (fc >= 0x61 && fc <= 0x7A) { - i++ - continue - } - 
break - } - const value = source.slice(start, i) - this.cursor = i - return { - type: 'RegExpLiteral', - value, - start, - end: i, - } - } - if (ch === 0x0A /* \n */) { - return null - } - i++ - } - return null - } - - private scanTemplateLiteral(start: number): Token { - const source = this.parser.syntax - const length = source.length - let i = start + 1 - - while (i < length) { - const ch = source.charCodeAt(i) - - if (ch === 0x5C /* \ */) { - i += 2 - continue - } - - if (ch === 0x60 /* ` */) { - i++ - const value = source.slice(start, i) - this.cursor = i - this.lastTokenType = 'TemplateLiteral' - return { - type: 'TemplateLiteral', - value, - start, - end: i, - } - } - - if (ch === 0x24 /* $ */ && source.charCodeAt(i + 1) === 0x7B /* { */) { - i += 2 - let depth = 1 - while (i < length && depth > 0) { - const inner = source.charCodeAt(i) - if (inner === 0x5C) { - i += 2 - continue - } - if (inner === 0x22 || inner === 0x27) { - const quote = inner - i++ - while (i < length && source.charCodeAt(i) !== quote) { - if (source.charCodeAt(i) === 0x5C) - i++ - i++ - } - i++ - continue - } - if (inner === 0x60 /* ` */) { - i++ - while (i < length) { - if (source.charCodeAt(i) === 0x5C) { - i += 2 - continue - } - if (source.charCodeAt(i) === 0x24 - && source.charCodeAt(i + 1) === 0x7B) { - i += 2 - let innerDepth = 1 - while (i < length && innerDepth > 0) { - const ic = source.charCodeAt(i) - if (ic === 0x7B) - innerDepth++ - else if (ic === 0x7D) - innerDepth-- - i++ - } - continue - } - if (source.charCodeAt(i) === 0x60) { - i++ - break - } - i++ - } - continue - } - if (inner === 0x7B) - depth++ - else if (inner === 0x7D) - depth-- - i++ - } - continue - } - - i++ - } - - throw new SyntaxError(`Template literal không đóng, bắt đầu tại vị trí ${start}`) - } -} diff --git a/packages/parser/src/tokenizer.ts b/packages/parser/src/tokenizer.ts index fe1c5a8..de3ff4c 100644 --- a/packages/parser/src/tokenizer.ts +++ b/packages/parser/src/tokenizer.ts @@ -2,8 +2,247 @@ 
import type { Token } from '@vietscript/shared' import type { Parser } from './parser' import { Keyword } from '@vietscript/shared' -import { Specs } from './constants/specs' +// Boundary kinds (post-keyword check): +// WORD: next char must not be in [A-Za-z0-9_] (mimics JS \b after ASCII word). +// IDENT: next char must not be in [A-Za-zÀ-ỹ] (used for VI keywords ending +// in non-ASCII, where \b doesn't apply). +// NONE: no boundary check (for keywords like `else`, `return`, `try`, `as`, +// `from`, `const`, `async` that may abut other identifiers). +const Boundary = { + WORD: 0, + IDENT: 1, + NONE: 2, +} as const + +type BoundaryKind = typeof Boundary[keyof typeof Boundary] + +interface KeywordEntry { + text: string + type: Keyword + boundary: BoundaryKind +} + +// Source of truth for every keyword (English + Vietnamese aliases). The trie +// below is built from this list; ordering does not matter — `matchKeyword` +// returns the longest valid match with a passing boundary. +const KEYWORDS: ReadonlyArray<KeywordEntry> = [ + { text: 'var', type: Keyword.VAR, boundary: Boundary.WORD }, + { text: 'khai báo', type: Keyword.VAR, boundary: Boundary.WORD }, + { text: 'break', type: Keyword.BREAK, boundary: Boundary.WORD }, + { text: 'phá vòng lặp', type: Keyword.BREAK, boundary: Boundary.WORD }, + { text: 'do', type: Keyword.DO, boundary: Boundary.WORD }, + { text: 'thực hiện', type: Keyword.DO, boundary: Boundary.WORD }, + { text: 'instanceof', type: Keyword.INSTANCEOF, boundary: Boundary.WORD }, + { text: 'là kiểu', type: Keyword.INSTANCEOF, boundary: Boundary.WORD }, + { text: 'typeof', type: Keyword.TYPEOF, boundary: Boundary.WORD }, + { text: 'kiểu của', type: Keyword.TYPEOF, boundary: Boundary.WORD }, + { text: 'switch', type: Keyword.SWITCH, boundary: Boundary.WORD }, + { text: 'duyệt', type: Keyword.SWITCH, boundary: Boundary.WORD }, + { text: 'case', type: Keyword.CASE, boundary: Boundary.WORD }, + { text: 'trường hợp', type: Keyword.CASE, boundary: Boundary.WORD }, + { 
text: 'if', type: Keyword.IF, boundary: Boundary.WORD }, + { text: 'nếu', type: Keyword.IF, boundary: Boundary.WORD }, + { text: 'else', type: Keyword.ELSE, boundary: Boundary.NONE }, + { text: 'không thì', type: Keyword.ELSE, boundary: Boundary.NONE }, + { text: 'new', type: Keyword.NEW, boundary: Boundary.WORD }, + { text: 'catch', type: Keyword.CATCH, boundary: Boundary.WORD }, + { text: 'bắt lỗi', type: Keyword.CATCH, boundary: Boundary.WORD }, + { text: 'finally', type: Keyword.FINALLY, boundary: Boundary.WORD }, + { text: 'cuối cùng', type: Keyword.FINALLY, boundary: Boundary.WORD }, + { text: 'return', type: Keyword.RETURN, boundary: Boundary.NONE }, + { text: 'trả về', type: Keyword.RETURN, boundary: Boundary.NONE }, + { text: 'void', type: Keyword.VOID, boundary: Boundary.WORD }, + { text: 'continue', type: Keyword.CONTINUE, boundary: Boundary.WORD }, + { text: 'tiếp tục', type: Keyword.CONTINUE, boundary: Boundary.WORD }, + { text: 'for', type: Keyword.FOR, boundary: Boundary.WORD }, + { text: 'lặp', type: Keyword.FOR, boundary: Boundary.WORD }, + { text: 'while', type: Keyword.WHILE, boundary: Boundary.WORD }, + { text: 'khi mà', type: Keyword.WHILE, boundary: Boundary.IDENT }, + { text: 'debugger', type: Keyword.DEBUGGER, boundary: Boundary.WORD }, + { text: 'function', type: Keyword.FUNCTION, boundary: Boundary.WORD }, + { text: 'hàm', type: Keyword.FUNCTION, boundary: Boundary.WORD }, + { text: 'this', type: Keyword.THIS, boundary: Boundary.WORD }, + { text: 'đây', type: Keyword.THIS, boundary: Boundary.WORD }, + { text: 'with', type: Keyword.WITH, boundary: Boundary.WORD }, + { text: 'default', type: Keyword.DEFAULT, boundary: Boundary.WORD }, + { text: 'mặc định', type: Keyword.DEFAULT, boundary: Boundary.WORD }, + { text: 'throw', type: Keyword.THROW, boundary: Boundary.WORD }, + { text: 'báo lỗi', type: Keyword.THROW, boundary: Boundary.WORD }, + { text: 'delete', type: Keyword.DELETE, boundary: Boundary.WORD }, + { text: 'xoá', type: 
Keyword.DELETE, boundary: Boundary.IDENT }, + { text: 'in', type: Keyword.IN, boundary: Boundary.WORD }, + { text: 'trong', type: Keyword.IN, boundary: Boundary.WORD }, + { text: 'of', type: Keyword.OF, boundary: Boundary.WORD }, + { text: 'của', type: Keyword.OF, boundary: Boundary.WORD }, + { text: 'try', type: Keyword.TRY, boundary: Boundary.NONE }, + { text: 'thử', type: Keyword.TRY, boundary: Boundary.NONE }, + { text: 'as', type: Keyword.AS, boundary: Boundary.NONE }, + { text: 'như là', type: Keyword.AS, boundary: Boundary.NONE }, + { text: 'from', type: Keyword.FROM, boundary: Boundary.NONE }, + { text: 'từ', type: Keyword.FROM, boundary: Boundary.NONE }, + { text: 'const', type: Keyword.CONST, boundary: Boundary.NONE }, + { text: 'hằng số', type: Keyword.CONST, boundary: Boundary.NONE }, + { text: 'class', type: Keyword.CLASS, boundary: Boundary.WORD }, + { text: 'lớp', type: Keyword.CLASS, boundary: Boundary.WORD }, + { text: 'super', type: Keyword.SUPER, boundary: Boundary.WORD }, + { text: 'khởi tạo cha', type: Keyword.SUPER, boundary: Boundary.WORD }, + { text: 'constructor', type: Keyword.CONSTRUCTOR, boundary: Boundary.WORD }, + { text: 'khởi tạo', type: Keyword.CONSTRUCTOR, boundary: Boundary.WORD }, + { text: 'extends', type: Keyword.EXTENDS, boundary: Boundary.WORD }, + { text: 'kế thừa', type: Keyword.EXTENDS, boundary: Boundary.WORD }, + { text: 'export', type: Keyword.EXPORT, boundary: Boundary.WORD }, + { text: 'cho phép', type: Keyword.EXPORT, boundary: Boundary.WORD }, + { text: 'import', type: Keyword.IMPORT, boundary: Boundary.WORD }, + { text: 'sử dụng', type: Keyword.IMPORT, boundary: Boundary.WORD }, + { text: 'async', type: Keyword.ASYNC, boundary: Boundary.NONE }, + { text: 'bất đồng bộ', type: Keyword.ASYNC, boundary: Boundary.NONE }, + { text: 'await', type: Keyword.AWAIT, boundary: Boundary.WORD }, + { text: 'chờ', type: Keyword.AWAIT, boundary: Boundary.IDENT }, + { text: 'yield', type: Keyword.YIELD, boundary: Boundary.WORD }, + 
{ text: 'nhường', type: Keyword.YIELD, boundary: Boundary.WORD }, + { text: 'let', type: Keyword.LET, boundary: Boundary.WORD }, + { text: 'biến', type: Keyword.LET, boundary: Boundary.WORD }, + { text: 'private', type: Keyword.PRIVATE, boundary: Boundary.WORD }, + { text: 'riêng tư', type: Keyword.PRIVATE, boundary: Boundary.IDENT }, + { text: 'public', type: Keyword.PUBLIC, boundary: Boundary.WORD }, + { text: 'công khai', type: Keyword.PUBLIC, boundary: Boundary.WORD }, + { text: 'protected', type: Keyword.PROTECTED, boundary: Boundary.WORD }, + { text: 'bảo vệ', type: Keyword.PROTECTED, boundary: Boundary.IDENT }, + { text: 'static', type: Keyword.STATIC, boundary: Boundary.WORD }, + { text: 'tĩnh', type: Keyword.STATIC, boundary: Boundary.WORD }, + { text: 'get', type: Keyword.GET, boundary: Boundary.WORD }, + { text: 'lấy', type: Keyword.GET, boundary: Boundary.WORD }, + { text: 'set', type: Keyword.SET, boundary: Boundary.WORD }, + { text: 'gán', type: Keyword.SET, boundary: Boundary.WORD }, + { text: 'null', type: Keyword.NULL, boundary: Boundary.WORD }, + { text: 'rỗng', type: Keyword.NULL, boundary: Boundary.WORD }, + { text: 'NaN', type: Keyword.NAN, boundary: Boundary.WORD }, + { text: 'Infinity', type: Keyword.INFINITY, boundary: Boundary.WORD }, + { text: 'vô cực', type: Keyword.INFINITY, boundary: Boundary.WORD }, + { text: 'undefined', type: Keyword.UNDEFINED, boundary: Boundary.WORD }, + { text: 'không xác định', type: Keyword.UNDEFINED, boundary: Boundary.WORD }, + { text: 'true', type: Keyword.BOOLEAN, boundary: Boundary.WORD }, + { text: 'false', type: Keyword.BOOLEAN, boundary: Boundary.WORD }, + { text: 'đúng', type: Keyword.BOOLEAN, boundary: Boundary.WORD }, + { text: 'sai', type: Keyword.BOOLEAN, boundary: Boundary.WORD }, +] + +class TrieNode { + children = new Map() + type: Keyword | null = null + boundary: BoundaryKind = Boundary.WORD +} + +const KEYWORD_TRIE: TrieNode = (() => { + const root = new TrieNode() + for (const entry of 
KEYWORDS) { + let node = root + for (let i = 0; i < entry.text.length; i++) { + const code = entry.text.charCodeAt(i) + let child = node.children.get(code) + if (!child) { + child = new TrieNode() + node.children.set(code, child) + } + node = child + } + if (node.type === null) { + node.type = entry.type + node.boundary = entry.boundary + } + } + return root +})() + +// Operator longest-match trie. Walking char-by-char, we keep the deepest +// node that is itself a valid operator end. +const OPERATORS: readonly string[] = [ + '...', + '.', + '>>>=', + '>>>', + '>>=', + '>>', + '<<=', + '<<', + '<=', + '>=', + '<', + '>', + '===', + '!==', + '==', + '!=', + '=>', + '**=', + '**', + '*=', + '*', + '++', + '+=', + '+', + '--', + '-=', + '-', + '/=', + '/', + '%=', + '%', + '&&=', + '&&', + '&=', + '&', + '||=', + '||', + '|=', + '|', + '^=', + '^', + '~', + '!', + '??=', + '??', + '?.', + '?', + '=', + '[', + ']', + '(', + ')', + '{', + '}', + ';', + ',', + ':', + '#', +] + +class OperatorNode { + children = new Map() + value: string | null = null +} + +const OPERATOR_TRIE: OperatorNode = (() => { + const root = new OperatorNode() + for (const op of OPERATORS) { + let node = root + for (let i = 0; i < op.length; i++) { + const code = op.charCodeAt(i) + let child = node.children.get(code) + if (!child) { + child = new OperatorNode() + node.children.set(code, child) + } + node = child + } + node.value = op + } + return root +})() + +// Tokens after which a `/` should be parsed as a regex literal start +// rather than a division operator. 
const REGEX_PRECEDING_TOKENS = new Set([ '(', '[', @@ -69,6 +308,54 @@ const REGEX_PRECEDING_TOKENS = new Set([ Keyword.DEFAULT, ]) +function isWhitespace(code: number): boolean { + return code === 0x20 /* space */ + || code === 0x09 /* tab */ + || code === 0x0A /* LF */ + || code === 0x0D /* CR */ + || code === 0x0B /* VT */ + || code === 0x0C /* FF */ + || code === 0xA0 /* NBSP */ +} + +function isAsciiLetter(code: number): boolean { + return (code >= 0x41 && code <= 0x5A) || (code >= 0x61 && code <= 0x7A) +} + +function isVietnameseLetter(code: number): boolean { + return code >= 0x00C0 && code <= 0x1EF9 +} + +function isIdentStart(code: number): boolean { + return isAsciiLetter(code) || isVietnameseLetter(code) +} + +function isDigit(code: number): boolean { + return code >= 0x30 && code <= 0x39 +} + +function isIdentCont(code: number): boolean { + return isIdentStart(code) || isDigit(code) +} + +function isHexDigit(code: number): boolean { + return isDigit(code) + || (code >= 0x41 && code <= 0x46) + || (code >= 0x61 && code <= 0x66) +} + +function isOctalDigit(code: number): boolean { + return code >= 0x30 && code <= 0x37 +} + +function isBinaryDigit(code: number): boolean { + return code === 0x30 || code === 0x31 +} + +function isAsciiWordChar(code: number): boolean { + return isAsciiLetter(code) || isDigit(code) || code === 0x5F /* _ */ +} + export class Tokenizer { private parser: Parser @@ -81,12 +368,14 @@ export class Tokenizer { this.cursor = 0 } + public getCursor(): number { + return this.cursor + } + public rollback(step: number): number { if (this.parser.lookahead) this.parser.lookahead.end -= step - this.cursor -= step - return this.cursor } @@ -99,67 +388,83 @@ export class Tokenizer { } public getNextToken(): Token | null { - if (!this.hasMoreTokens()) { - return null - } - - const whitespaceMatch = /^\s+/.exec(this.parser.syntax.slice(this.cursor)) - if (whitespaceMatch) { - this.cursor += whitespaceMatch[0].length - return this.getNextToken() - 
} - - const string = this.parser.syntax.slice(this.cursor) + const source = this.parser.syntax + const length = source.length - if (string[0] === '`') { - const tok = this.scanTemplateLiteral() - this.lastTokenType = tok.type as string - return tok - } + while (this.cursor < length) { + const start = this.cursor + const code = source.charCodeAt(this.cursor) - if (string[0] === '/' && string[1] !== '/' && string[1] !== '*' && this.isRegexExpected()) { - const tok = this.scanRegexLiteral() - if (tok) { - this.lastTokenType = tok.type as string - return tok + if (isWhitespace(code)) { + this.cursor++ + continue } - } - for (const [regexp, tokenType] of Specs) { - const tokenValue = this.match(regexp, string) + // Line comment: //... + if (code === 0x2F && source.charCodeAt(this.cursor + 1) === 0x2F) { + this.cursor += 2 + while (this.cursor < length && source.charCodeAt(this.cursor) !== 0x0A) { + this.cursor++ + } + continue + } - if (tokenValue === null) { + // Block comment: /* ... */ + if (code === 0x2F && source.charCodeAt(this.cursor + 1) === 0x2A) { + this.cursor += 2 + while (this.cursor < length) { + if (source.charCodeAt(this.cursor) === 0x2A + && source.charCodeAt(this.cursor + 1) === 0x2F) { + this.cursor += 2 + break + } + this.cursor++ + } continue } - if (tokenType === null) { - return this.getNextToken() + // Template literal + if (code === 0x60) { + return this.scanTemplateLiteral(start) } - if (tokenType === Keyword.IDENTIFIER && tokenValue.includes(' ')) { - const truncated = this.truncateBeforeEmbeddedKeyword(tokenValue) - if (truncated.length !== tokenValue.length) { - this.cursor -= tokenValue.length - truncated.length - this.lastTokenType = tokenType as string - return { - type: tokenType, - value: truncated, - start: this.cursor - truncated.length, - end: this.cursor, - } + // Regex literal (context-sensitive) + if (code === 0x2F && this.isRegexExpected()) { + const tok = this.scanRegexLiteral(start) + if (tok !== null) { + this.lastTokenType = 
tok.type as string + return tok } } - this.lastTokenType = tokenType as string - return { - type: tokenType, - value: tokenValue, - start: this.cursor - String(tokenValue).length, - end: this.cursor, + // String literals + if (code === 0x22 || code === 0x27) { + return this.scanString(start, code) + } + + // Numeric literals: digits, or `.` followed by digit + if (isDigit(code)) { + return this.scanNumber(start) + } + if (code === 0x2E /* . */ && isDigit(source.charCodeAt(this.cursor + 1))) { + return this.scanNumber(start) + } + + // Identifier / keyword + if (isIdentStart(code)) { + return this.scanIdentifierOrKeyword(start) } + + // Operator (longest match via trie) + const opTok = this.scanOperator(start) + if (opTok !== null) { + return opTok + } + + throw new SyntaxError(`Unexpected token: "${source[this.cursor]}"`) } - throw new SyntaxError(`Unexpected token: "${string[0]}"`) + return null } private isRegexExpected(): boolean { @@ -168,50 +473,334 @@ export class Tokenizer { return REGEX_PRECEDING_TOKENS.has(this.lastTokenType) } - private truncateBeforeEmbeddedKeyword(value: string): string { - const words = value.split(/\s+/) - let truncated = words[0] - for (let i = 1; i < words.length; i++) { - const rest = words.slice(i).join(' ') - for (const [regexp, tokenType] of Specs) { - if (tokenType === null || tokenType === Keyword.IDENTIFIER) + private scanString(start: number, quote: number): Token { + const source = this.parser.syntax + const length = source.length + let i = start + 1 + while (i < length) { + const ch = source.charCodeAt(i) + if (ch === 0x5C /* \ */) { + i += 2 + continue + } + if (ch === quote) { + i++ + const value = source.slice(start, i) + this.cursor = i + this.lastTokenType = Keyword.STRING + return { + type: Keyword.STRING, + value, + start, + end: i, + } + } + i++ + } + throw new SyntaxError(`Unterminated string literal at ${start}`) + } + + private scanNumber(start: number): Token { + const source = this.parser.syntax + const length 
= source.length + let i = start + + // Leading dot decimal: .5, .5e2 + if (source.charCodeAt(i) === 0x2E /* . */) { + i++ + while (i < length) { + const c = source.charCodeAt(i) + if (isDigit(c) || c === 0x5F) { + i++ continue - const m = regexp.exec(`${rest};`) - if (m && m.index === 0 && /^[A-Za-z\u00C0-\u1EF9]/.test(m[0])) { - return truncated } + break + } + i = this.consumeExponent(i) + i = this.consumeBigIntSuffix(i) + return this.emitNumber(start, i) + } + + // 0x / 0o / 0b + if (source.charCodeAt(i) === 0x30) { + const next = source.charCodeAt(i + 1) + if (next === 0x78 || next === 0x58) { + i += 2 + while (i < length) { + const c = source.charCodeAt(i) + if (isHexDigit(c) || c === 0x5F) { + i++ + continue + } + break + } + i = this.consumeBigIntSuffix(i) + return this.emitNumber(start, i) + } + if (next === 0x6F || next === 0x4F) { + i += 2 + while (i < length) { + const c = source.charCodeAt(i) + if (isOctalDigit(c) || c === 0x5F) { + i++ + continue + } + break + } + i = this.consumeBigIntSuffix(i) + return this.emitNumber(start, i) + } + if (next === 0x62 || next === 0x42) { + i += 2 + while (i < length) { + const c = source.charCodeAt(i) + if (isBinaryDigit(c) || c === 0x5F) { + i++ + continue + } + break + } + i = this.consumeBigIntSuffix(i) + return this.emitNumber(start, i) + } + } + + // Decimal: digits[_digits]*[.digits[_digits]*]? + while (i < length) { + const c = source.charCodeAt(i) + if (isDigit(c) || c === 0x5F) { + i++ + continue + } + break + } + if (source.charCodeAt(i) === 0x2E /* . 
*/) { + i++ + while (i < length) { + const c = source.charCodeAt(i) + if (isDigit(c) || c === 0x5F) { + i++ + continue + } + break + } + } + i = this.consumeExponent(i) + i = this.consumeBigIntSuffix(i) + return this.emitNumber(start, i) + } + + private consumeExponent(i: number): number { + const source = this.parser.syntax + const length = source.length + const c = source.charCodeAt(i) + if (c !== 0x65 && c !== 0x45 /* e/E */) { + return i + } + let j = i + 1 + const sign = source.charCodeAt(j) + if (sign === 0x2B || sign === 0x2D /* + or - */) { + j++ + } + if (!isDigit(source.charCodeAt(j))) { + return i + } + j++ + while (j < length) { + const cc = source.charCodeAt(j) + if (isDigit(cc) || cc === 0x5F) { + j++ + continue } - truncated += ` ${words[i]}` + break } - return value + return j } - private scanRegexLiteral(): Token | null { + private consumeBigIntSuffix(i: number): number { + if (this.parser.syntax.charCodeAt(i) === 0x6E /* n */) { + return i + 1 + } + return i + } + + private emitNumber(start: number, end: number): Token { + const value = this.parser.syntax.slice(start, end) + this.cursor = end + this.lastTokenType = Keyword.NUMBER + return { + type: Keyword.NUMBER, + value, + start, + end, + } + } + + private scanIdentifierOrKeyword(start: number): Token { + // Try keyword trie first; on a hit with passing boundary we emit the keyword. + // Otherwise fall back to identifier scanning (multi-word with embedded + // keyword detection). 
+ const kw = this.matchKeyword(start) + if (kw !== null) { + this.cursor = kw.end + this.lastTokenType = kw.type as string + return { + type: kw.type, + value: this.parser.syntax.slice(start, kw.end), + start, + end: kw.end, + } + } + return this.scanIdentifier(start) + } + + private matchKeyword(start: number): { type: Keyword, end: number } | null { + const source = this.parser.syntax + const length = source.length + let node: TrieNode | undefined = KEYWORD_TRIE + let i = start + let bestType: Keyword | null = null + let bestEnd = -1 + + while (i < length && node !== undefined) { + const code = source.charCodeAt(i) + const next = node.children.get(code) + if (next === undefined) + break + i++ + node = next + if (node.type !== null && this.boundaryOk(node.boundary, i)) { + bestType = node.type + bestEnd = i + } + } + + if (bestType !== null && bestEnd !== -1) { + return { type: bestType, end: bestEnd } + } + return null + } + + private boundaryOk(kind: BoundaryKind, end: number): boolean { + if (kind === Boundary.NONE) + return true + if (end >= this.parser.syntax.length) + return true + const code = this.parser.syntax.charCodeAt(end) + if (kind === Boundary.WORD) { + return !isAsciiWordChar(code) + } + return !isIdentStart(code) + } + + private scanIdentifier(start: number): Token { const source = this.parser.syntax - const start = this.cursor + const length = source.length + let i = start + if (!isIdentStart(source.charCodeAt(i))) { + throw new SyntaxError(`Unexpected token: "${source[i]}"`) + } + i++ + while (i < length && isIdentCont(source.charCodeAt(i))) { + i++ + } + + // Multi-word identifier: consume ` <word>` (one space, then a word) repeatedly, but stop before + // a word that itself begins a keyword (bounded backtrack via peek).
+ while (i < length) { + if (source.charCodeAt(i) !== 0x20 /* space */) + break + const wordStart = i + 1 + if (wordStart >= length) + break + const wordCode = source.charCodeAt(wordStart) + if (!isIdentStart(wordCode)) + break + if (this.matchKeyword(wordStart) !== null) { + break + } + i = wordStart + 1 + while (i < length && isIdentCont(source.charCodeAt(i))) { + i++ + } + } + + const value = source.slice(start, i) + this.cursor = i + this.lastTokenType = Keyword.IDENTIFIER + return { + type: Keyword.IDENTIFIER, + value, + start, + end: i, + } + } + + private scanOperator(start: number): Token | null { + const source = this.parser.syntax + const length = source.length + let node: OperatorNode | undefined = OPERATOR_TRIE + let i = start + let bestEnd = -1 + let bestValue: string | null = null + + while (i < length && node !== undefined) { + const code = source.charCodeAt(i) + const next = node.children.get(code) + if (next === undefined) + break + i++ + node = next + if (node.value !== null) { + bestEnd = i + bestValue = node.value + } + } + + if (bestValue === null || bestEnd === -1) { + return null + } + this.cursor = bestEnd + this.lastTokenType = bestValue + return { + type: bestValue, + value: bestValue, + start, + end: bestEnd, + } + } + + private scanRegexLiteral(start: number): Token | null { + const source = this.parser.syntax + const length = source.length let i = start + 1 let inCharClass = false - while (i < source.length) { - const ch = source[i] - if (ch === '\\') { + while (i < length) { + const ch = source.charCodeAt(i) + if (ch === 0x5C /* \ */) { i += 2 continue } - if (ch === '[') { + if (ch === 0x5B /* [ */) { inCharClass = true i++ continue } - if (ch === ']') { + if (ch === 0x5D /* ] */) { inCharClass = false i++ continue } - if (ch === '/' && !inCharClass) { + if (ch === 0x2F /* / */ && !inCharClass) { i++ - while (i < source.length && /[a-z]/.test(source[i])) { - i++ + while (i < length) { + const fc = source.charCodeAt(i) + if (fc >= 0x61 
&& fc <= 0x7A) { + i++ + continue + } + break } const value = source.slice(start, i) this.cursor = i @@ -222,32 +811,32 @@ export class Tokenizer { end: i, } } - if (ch === '\n') { + if (ch === 0x0A /* \n */) { return null } i++ } - return null } - private scanTemplateLiteral(): Token { + private scanTemplateLiteral(start: number): Token { const source = this.parser.syntax - const start = this.cursor + const length = source.length let i = start + 1 - while (i < source.length) { - const ch = source[i] + while (i < length) { + const ch = source.charCodeAt(i) - if (ch === '\\') { + if (ch === 0x5C /* \ */) { i += 2 continue } - if (ch === '`') { + if (ch === 0x60 /* ` */) { i++ const value = source.slice(start, i) this.cursor = i + this.lastTokenType = 'TemplateLiteral' return { type: 'TemplateLiteral', value, @@ -256,46 +845,48 @@ export class Tokenizer { } } - if (ch === '$' && source[i + 1] === '{') { + if (ch === 0x24 /* $ */ && source.charCodeAt(i + 1) === 0x7B /* { */) { i += 2 let depth = 1 - while (i < source.length && depth > 0) { - const inner = source[i] - if (inner === '\\') { + while (i < length && depth > 0) { + const inner = source.charCodeAt(i) + if (inner === 0x5C) { i += 2 continue } - if (inner === '"' || inner === '\'') { + if (inner === 0x22 || inner === 0x27) { const quote = inner i++ - while (i < source.length && source[i] !== quote) { - if (source[i] === '\\') + while (i < length && source.charCodeAt(i) !== quote) { + if (source.charCodeAt(i) === 0x5C) i++ i++ } i++ continue } - if (inner === '`') { + if (inner === 0x60 /* ` */) { i++ - while (i < source.length) { - if (source[i] === '\\') { + while (i < length) { + if (source.charCodeAt(i) === 0x5C) { i += 2 continue } - if (source[i] === '$' && source[i + 1] === '{') { + if (source.charCodeAt(i) === 0x24 + && source.charCodeAt(i + 1) === 0x7B) { i += 2 let innerDepth = 1 - while (i < source.length && innerDepth > 0) { - if (source[i] === '{') + while (i < length && innerDepth > 0) { + const 
ic = source.charCodeAt(i) + if (ic === 0x7B) innerDepth++ - else if (source[i] === '}') + else if (ic === 0x7D) innerDepth-- i++ } continue } - if (source[i] === '`') { + if (source.charCodeAt(i) === 0x60) { i++ break } @@ -303,9 +894,9 @@ export class Tokenizer { } continue } - if (inner === '{') + if (inner === 0x7B) depth++ - else if (inner === '}') + else if (inner === 0x7D) depth-- i++ } @@ -317,16 +908,4 @@ export class Tokenizer { throw new SyntaxError(`Template literal không đóng, bắt đầu tại vị trí ${start}`) } - - private match(regexp: RegExp, syntax: string): string | null { - const formattedSyntax = syntax.split(';') - const matched = regexp.exec(formattedSyntax[0].concat(';')) - - if (matched && matched.index === 0) { - this.cursor += matched[0].length - return matched[0] - } - - return null - } } diff --git a/packages/shared/index.ts b/packages/shared/index.ts index 4f4a72f..396b263 100644 --- a/packages/shared/index.ts +++ b/packages/shared/index.ts @@ -1,5 +1,4 @@ export * from './parser/keyword.enum' export * from './parser/node.interface' export * from './parser/operator.type' -export * from './parser/spec.type' export * from './parser/token.type' diff --git a/packages/shared/parser/spec.type.ts b/packages/shared/parser/spec.type.ts deleted file mode 100644 index 88daf54..0000000 --- a/packages/shared/parser/spec.type.ts +++ /dev/null @@ -1,4 +0,0 @@ -import type { Keyword } from './keyword.enum' -import type { Operator } from './operator.type' - -export type Spec = [RegExp, Keyword | Operator | null]
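Reviewer note: the keyword trie and its three boundary kinds are the core idea of this diff, so a reduced, standalone sketch may help when reading `matchKeyword` and `boundaryOk`. This is not the PR's code. The names `SketchEntry`, `SketchNode`, `buildTrie`, the string-valued boundary kinds, and the four sample entries are assumptions made for illustration; only the character ranges and the longest-match-with-boundary rule are taken from the diff above.

```typescript
// Hypothetical, simplified re-creation of the keyword trie idea.
// All identifiers below are illustrative and do not exist in the PR.
type BoundaryKind = 'WORD' | 'IDENT' | 'NONE'

interface SketchEntry {
  text: string
  type: string
  boundary: BoundaryKind
}

class SketchNode {
  children = new Map<number, SketchNode>()
  type: string | null = null
  boundary: BoundaryKind = 'WORD'
}

// Insert every keyword code unit by code unit; the last node carries the type.
function buildTrie(entries: readonly SketchEntry[]): SketchNode {
  const root = new SketchNode()
  for (const entry of entries) {
    let node = root
    for (let i = 0; i < entry.text.length; i++) {
      const code = entry.text.charCodeAt(i)
      let child = node.children.get(code)
      if (child === undefined) {
        child = new SketchNode()
        node.children.set(code, child)
      }
      node = child
    }
    // The first entry wins if the same text is listed twice.
    if (node.type === null) {
      node.type = entry.type
      node.boundary = entry.boundary
    }
  }
  return root
}

function isAsciiWordChar(code: number): boolean {
  return (code >= 0x30 && code <= 0x39)
    || (code >= 0x41 && code <= 0x5A)
    || (code >= 0x61 && code <= 0x7A)
    || code === 0x5F
}

function isIdentStart(code: number): boolean {
  return (code >= 0x41 && code <= 0x5A)
    || (code >= 0x61 && code <= 0x7A)
    || (code >= 0x00C0 && code <= 0x1EF9)
}

// WORD rejects a following ASCII word char, IDENT rejects any identifier
// start (ASCII or Vietnamese letter), NONE accepts anything.
function boundaryOk(source: string, kind: BoundaryKind, end: number): boolean {
  if (kind === 'NONE' || end >= source.length)
    return true
  const code = source.charCodeAt(end)
  return kind === 'WORD' ? !isAsciiWordChar(code) : !isIdentStart(code)
}

// Walk the trie and remember the deepest node that is a keyword AND whose
// boundary check passes; that is the longest valid match.
function matchKeyword(root: SketchNode, source: string, start: number): { type: string, end: number } | null {
  let node = root
  let best: { type: string, end: number } | null = null
  for (let i = start; i < source.length; i++) {
    const next = node.children.get(source.charCodeAt(i))
    if (next === undefined)
      break
    node = next
    if (node.type !== null && boundaryOk(source, node.boundary, i + 1))
      best = { type: node.type, end: i + 1 }
  }
  return best
}

const CONSTRUCTOR_TEXT = 'khởi tạo'
const SUPER_TEXT = 'khởi tạo cha'
const AWAIT_TEXT = 'chờ'

const trie = buildTrie([
  { text: 'if', type: 'IF', boundary: 'WORD' },
  { text: CONSTRUCTOR_TEXT, type: 'CONSTRUCTOR', boundary: 'WORD' },
  { text: SUPER_TEXT, type: 'SUPER', boundary: 'WORD' },
  { text: AWAIT_TEXT, type: 'AWAIT', boundary: 'IDENT' },
])

console.log(matchKeyword(trie, `${SUPER_TEXT}()`, 0))
console.log(matchKeyword(trie, 'ifx', 0))
```

In this sketch, `khởi tạo cha` wins over its prefix `khởi tạo` because the walk keeps going after the first hit, and `ifx` is rejected because the `WORD` boundary fails on `x`. A `WORD` boundary does not look at non-ASCII letters at all, which appears to be why the diff gives `IDENT` to aliases such as `chờ` that end in a non-ASCII letter.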