diff --git a/README.md b/README.md
index 0cd763a..62410ef 100644
--- a/README.md
+++ b/README.md
@@ -4,102 +4,28 @@
 ![Coverage](https://codecov.io/gh/JinBa1/java-query-engine/branch/main/graph/badge.svg)
 ![Dependencies](https://img.shields.io/badge/dependencies-up%20to%20date-brightgreen)
 
-An in-memory relational query engine built on the Volcano/iterator model. Parses SQL via JSqlParser, builds an operator tree, and executes queries tuple-by-tuple against CSV data.
+**A self-hosted gateway that gives AI agents safe, read-only, budgeted SQL access to your CSV files — no database required.**
 
-## Architecture
+Everyone has CSVs — exports, dumps, logs — and AI agents increasingly need to query them. Embedding a database in every agent environment hands over raw file access; what you actually want is a *guarded window* onto the data: an endpoint that is read-only by construction, resource-budgeted, and auditable. cuckooDB is that gateway, built on a from-scratch query engine and exposed over both a **REST API** and the **Model Context Protocol (MCP)**, so an agent can discover tables, preview data, check a query's cost, and run SQL — without writing SQL blind or bypassing the guardrails.
 
-```
-SQL → JSqlParser → QueryPlanner → QueryPlanOptimizer → Operator Tree → Results
-```
-
-**Core components:**
-
-| Component | Role |
-|-----------|------|
-| `QueryPlanner` | Parses SQL and builds the operator pipeline |
-| `QueryPlanOptimizer` | Selection pushdown, trivial operator removal |
-| `DBCatalog` | Schema and table metadata (singleton) |
-| `Value` | Typed tuple values (sealed interface: `IntValue`, `StringValue`) |
-| `ExpressionEvaluator` | Evaluates WHERE/HAVING conditions per tuple |
-| `ExpressionPreprocessor` | Resolves column references to indices |
-| `ConditionSplitter` | Separates join predicates from selection predicates |
-
-**Operator hierarchy** (all extend `Operator`):
-
-`ScanOperator` → `SelectOperator` → `ProjectOperator` → `JoinOperator` / `HashJoinOperator` → `SortOperator` → `AggregateOperator` → `DuplicateEliminationOperator` → `LimitOperator`
-
-## Feature Matrix
-
-| Feature | Status |
-|---------|--------|
-| `SELECT *` / projection | ✅ Supported |
-| `WHERE` predicates | ✅ Supported |
-| Inner joins (nested-loop) | ✅ Supported |
-| Hash join (auto-selected for equi-joins) | ✅ Supported |
-| `ORDER BY` | ✅ Supported |
-| `GROUP BY` + `SUM`, `COUNT`, `AVG`, `MIN`, `MAX` | ✅ Supported |
-| `LIMIT n` | ✅ Supported |
-| `DISTINCT` | ✅ Supported |
-| Nested arithmetic/comparison expressions | ✅ Supported |
-| Query optimisation (selection pushdown) | ✅ Supported |
-| Typed columns (int, string) | ✅ Supported |
-| CSV header support | ✅ Supported |
-| Query budgets (`--max-tuples`, `--timeout-ms`) | ✅ Supported |
-| `EXPLAIN` plan inspection | ✅ Supported |
-| Indexes | ❌ Not supported |
-| Transactions | ❌ Not supported |
-| INSERT / UPDATE / DELETE | ❌ Not supported |
-| Concurrency | ❌ Not supported |
-| Persistence | ❌ Not supported |
-| Full SQL dialect | ❌ Not supported |
-
-## Scope
-
-This engine supports **SQL-over-CSV query execution**: read-only queries against tables stored as CSV files. It does not support transactions, indexes, data modification (INSERT/UPDATE/DELETE), concurrency, persistence, or a full SQL dialect. Values are typed int or string, inferred per column from the data. Tables are discovered from CSV files with header rows; no separate schema file.
-
-Supported SQL features include `SELECT`/`FROM`/`WHERE`, `GROUP BY` with `SUM`, `COUNT`, `AVG`, `MIN`, and `MAX` aggregates, `ORDER BY`, `DISTINCT`, inner joins, and `LIMIT n`.
-
-The focus is on demonstrating query planning, optimisation, and the Volcano iterator execution model.
-
-### Aggregate and LIMIT semantics
-
-| Case | Behavior |
-|---|---|
-| `AVG` of ints | truncated integer division (toward zero) |
-| Aggregate over empty input, no `GROUP BY` | zero rows (header only) — deviates from SQL's NULL row |
-| `COUNT(col)` | equals `COUNT(*)` — the engine has no NULLs |
-| `SUM`/`AVG` on a string column | error |
-| `MIN`/`MAX` on strings | lexicographic |
-| `SUM` past int range | error |
-| `LIMIT 0` | header-only output |
-| `OFFSET`, `LIMIT ALL` | not supported (error) |
-
-## Quick Start
-
-**Prerequisites:** Java 17, Maven (or use the included Maven Wrapper).
-
-```bash
-# Clone
-git clone https://github.com/JinBa1/java-query-engine.git
-cd java-query-engine
+## Features
 
-# Build fat JAR (engine module)
-./mvnw -pl engine -DskipTests clean package
-```
-
-**Run a query:**
-
-```bash
-java -cp engine/target/cuckoodb-engine-1.0.0-jar-with-dependencies.jar \
-  com.github.jinba1.cuckoodb.CuckooDB \
-  database_dir input_file output_file [--max-tuples=N] [--timeout-ms=N]
-```
-
-Both `--max-tuples` and `--timeout-ms` are optional and independent. Omit either to impose no limit on that dimension.
+| Capability | |
+|---|:--:|
+| Read-only SQL over CSV — `SELECT` / `WHERE` / `JOIN` / `GROUP BY` / `ORDER BY` / `LIMIT` / `DISTINCT` | ✅ |
+| Aggregates — `COUNT` / `SUM` / `AVG` / `MIN` / `MAX` | ✅ |
+| Hash + nested-loop joins (planner auto-selects) | ✅ |
+| Typed columns (int / string), CSV headers | ✅ |
+| `EXPLAIN` plan inspection | ✅ |
+| Tuple + time budgets, fail-closed | ✅ |
+| **REST API** + OpenAPI / Swagger | ✅ |
+| **MCP server** — five agent tools, Streamable-HTTP | ✅ |
+| Runs as a container (published to GHCR) | ✅ |
+| Writes / transactions / indexes / persistence | ❌ read-only by design |
 
-### Run the server as a container
+## Quick start
 
-The Spring Boot gateway (REST + MCP) ships as a container image, so you can run it next to your data with no Java toolchain. Put your CSV files in a folder and mount it as the catalog's data directory:
+Run the gateway next to your data — no Java toolchain needed. Put your CSVs in a folder and mount it:
 
 ```bash
 docker run --rm -p 8080:8080 \
@@ -107,198 +33,94 @@ docker run --rm -p 8080:8080 \
   ghcr.io/jinba1/cuckoodb:latest
 ```
 
-- **REST:** `POST http://localhost:8080/queries`, `GET /tables`, `GET /tables/{name}` (OpenAPI at `/swagger-ui.html`).
-- **MCP:** Streamable-HTTP endpoint at `http://localhost:8080/mcp` — point an MCP client at it to query your CSVs with `list_tables` / `describe_table` / `sample_rows` / `explain_query` / `query`.
-
-The image is published to GHCR on each merge to `main`. To build it locally instead: `docker build -t cuckoodb .`
-
-### Query budgets
-
-The engine enforces **total-work semantics**: every tuple emitted by any operator in the tree counts against the budget, including intermediate tuples that are later filtered or joined. A cross-product explosion that never produces output rows will still hit the tuple limit. The timeout clock starts lazily at the first tuple emission.
-
-When a budget is exceeded:
-- The partial output file is deleted.
-- `Error: ` is written to stderr.
-- The process exits with code 1.
-
-Both flags are optional and independent — you can use one, both, or neither.
-
-### EXPLAIN
-
-Prefix any query with `EXPLAIN` to inspect the query plan without executing it:
-
-```sql
-EXPLAIN SELECT Student.B, SUM(Student.C) FROM Student, Enrolled
-WHERE Student.D > 30 AND Student.A = Enrolled.A
-GROUP BY Student.B;
-```
-
-The output file receives a two-section plan:
+Query over REST:
 
-```
-=== Plan (as written) ===
-Aggregate[group by: Student.B; calls: SUM(Student.c)]
-  Project[Enrolled.A, Student.A, Student.B, Student.C, Student.D]
-    Select[Student.D > 30]
-      Join[Student.A = Enrolled.A]
-        Scan[Student]
-        Scan[Enrolled]
+```bash
+curl -s localhost:8080/tables
+# ["People"]
 
-=== Plan (optimized) ===
-Aggregate[group by: Student.B; calls: SUM(Student.c)]
-  Project[Enrolled.A, Student.A, Student.B, Student.C, Student.D]
-    Join[Student.A = Enrolled.A]
-      Select[Student.D > 30]
-        Scan[Student]
-      Project[Enrolled.A]
-        Scan[Enrolled]
+curl -s localhost:8080/queries -H 'Content-Type: application/json' \
+  -d '{"sql":"SELECT * FROM People LIMIT 5"}'
+# {"columns":[{"name":"id","type":"INT"},...],"rows":[[1,"alice"],...],"rowCount":5,"truncated":true,"hint":"..."}
 ```
 
-No query execution occurs for EXPLAIN queries.
-
-## Join algorithms
-
-The engine supports two join algorithms; the planner selects between them automatically.
+…or connect an AI agent over MCP (below). To use the engine directly from the command line instead, see the **[engine README](engine/README.md)**.
 
-### Nested-loop join
+## For agents: MCP
 
-`JoinOperator` implements a classic nested-loop join: for every outer tuple the inner child is rewound and scanned in full. It handles any join condition (equality, inequality, arbitrary expression, or cross product with no condition). `EXPLAIN` shows it as `Join[]`.
+The server exposes a Model Context Protocol endpoint at `http://localhost:8080/mcp` (Streamable-HTTP). Point an MCP client (e.g. Claude Desktop) at it and the agent gets five tools:
 
-### Hash join
-
-`HashJoinOperator` extends `JoinOperator` with an in-memory hash join. The inner (build) side is drained once into a `HashMap` keyed by the equality conjuncts; the outer (probe) side then streams through once. After a hash-table lookup, the full original condition is re-evaluated on every candidate, so residual non-equality conjuncts (e.g. `A.x = B.x AND A.y > 3`) work correctly. Output order — outer-major, inner order preserved within each key bucket — is identical to the nested-loop join. `EXPLAIN` shows it as `HashJoin[]`.
-
-**Auto-selection rule:** the planner chooses hash join when `Constants.useHashJoin` is `true` (the default) **and** the join condition contains at least one column-to-column equality conjunct (e.g. `Student.A = Enrolled.A`). Cross products (no condition) and pure non-equi joins (e.g. `A.x > B.y` only) always use nested-loop join.
-
-**Toggle:** set `Constants.useHashJoin = false` at program start (or in tests) to force nested-loop for all joins.
-
-### Benchmarks
-
-Performance was measured with a JMH 1.37 benchmark suite in the `bench/` package (`engine/src/test/java/com/github/jinba1/cuckoodb/bench/`). The suite is compiled in CI but never run there; run it locally with:
-
-```bash
-./mvnw -pl engine -q test-compile exec:exec -Dexec.executable=java -Dexec.classpathScope=test \
-  "-Dexec.args=-cp %classpath org.openjdk.jmh.Main .*Benchmark"
-```
-
-**Results** (OpenJDK 21.0.5, Intel Core i9-13900HX, 32 logical cores, Linux under WSL2):
-
-| Benchmark | matchesPerKey | rowsPerSide | useHashJoin | Mode | Cnt | Score | Error | Units |
-|-----------|--------------|-------------|-------------|------|-----|-------|-------|-------|
-| EndToEndJoinBenchmark.planAndDrain | N/A | N/A | true | avgt | 3 | 1.028 | ± 0.288 | ms/op |
-| EndToEndJoinBenchmark.planAndDrain | N/A | N/A | false | avgt | 3 | 315.523 | ± 36.021 | ms/op |
-| JoinAlgorithmBenchmark.hashJoin | 1 | 1000 | N/A | avgt | 5 | 0.270 | ± 0.011 | ms/op |
-| JoinAlgorithmBenchmark.hashJoin | 1 | 5000 | N/A | avgt | 5 | 1.382 | ± 0.154 | ms/op |
-| JoinAlgorithmBenchmark.hashJoin | 10 | 1000 | N/A | avgt | 5 | 2.160 | ± 0.109 | ms/op |
-| JoinAlgorithmBenchmark.hashJoin | 10 | 5000 | N/A | avgt | 5 | 10.661 | ± 0.840 | ms/op |
-| JoinAlgorithmBenchmark.nestedLoopJoin | 1 | 1000 | N/A | avgt | 5 | 202.621 | ± 25.313 | ms/op |
-| JoinAlgorithmBenchmark.nestedLoopJoin | 1 | 5000 | N/A | avgt | 5 | 5027.912 | ± 370.564 | ms/op |
-| JoinAlgorithmBenchmark.nestedLoopJoin | 10 | 1000 | N/A | avgt | 5 | 194.786 | ± 4.916 | ms/op |
-| JoinAlgorithmBenchmark.nestedLoopJoin | 10 | 5000 | N/A | avgt | 5 | 4785.620 | ± 212.683 | ms/op |
+| Tool | What it does |
+|---|---|
+| `list_tables` | list the available tables |
+| `describe_table` | a table's column names and types |
+| `sample_rows` | preview rows without writing SQL |
+| `explain_query` | preview a query's plan and cost before running it |
+| `query` | run a read-only `SELECT`, budget-bounded |
 
-`EndToEndJoinBenchmark` joins two 1 000-row CSV tables through the full planner pipeline; nested-loop re-parses the inner CSV once per outer row, so the gap (≈ 307×) reflects both the algorithmic difference and I/O cost. `JoinAlgorithmBenchmark` uses in-memory `CachedOperator` inputs to isolate the join algorithm itself; at 5 000 rows/side the operator-level gap is ≈ 3 600×.
+Every tool routes through the same guarded execution path as the REST API, so agent traffic inherits the read-only guarantee, the tuple/time budget, and concurrency limits (with audit hooks in place) — there is no way to bypass them.
 
-Benchmarks are compiled in CI but never executed there.
+## REST API
 
-## Demo
+| Endpoint | |
+|---|---|
+| `POST /queries` | plan + execute one read-only query → JSON columns/rows, or an `EXPLAIN` plan |
+| `GET /tables` | list table names |
+| `GET /tables/{name}` | a table's typed schema |
+| `/swagger-ui.html` | interactive OpenAPI docs |
 
-**Input table** (`engine/samples/db/data/Student.csv`):
+Queries are **budget-bounded and fail-closed**: the server always attaches a budget, so an unbounded query is unreachable. A result that would exceed the tuple budget returns `429` (retry with a tighter `LIMIT`); one that exceeds the time budget returns `504`.
 
-```
-A, B, C, D
-1, 200, 50, 33
-2, 200, 200, 44
-3, 100, 105, 44
-4, 100, 50, 11
-5, 100, 500, 22
-6, 300, 400, 11
-```
+### EXPLAIN
 
-**Query** (`engine/samples/input/query4.sql`):
+Any query can be planned without executing it — prefix `EXPLAIN` over REST, or call the `explain_query` tool. The plan is shown as written and after optimisation:
 
-```sql
-SELECT * FROM Student WHERE Student.A < 3;
 ```
+=== Plan (as written) ===
+Project[Student.B, Student.C]
+  Select[Student.D > 30]
+    HashJoin[Student.A = Enrolled.A]
+      Scan[Student]
+      Scan[Enrolled]
 
-**Command:**
-
-```bash
-java -cp engine/target/cuckoodb-engine-1.0.0-jar-with-dependencies.jar \
-  com.github.jinba1.cuckoodb.CuckooDB \
-  engine/samples/db engine/samples/input/query4.sql output.csv
+=== Plan (optimized) ===
+Project[Student.B, Student.C]
+  HashJoin[Student.A = Enrolled.A]
+    Select[Student.D > 30]
+      Scan[Student]
+    Project[Enrolled.A]
+      Scan[Enrolled]
 ```
 
-To limit resource usage, add optional budget flags:
+The optimiser pushes the `Select` below the join (selection pushdown) and projects the inner scan down to just the key it needs; the planner picked a hash join for the equi-condition. See the [engine README](engine/README.md#explain) for the full treatment.
 
-```bash
-java -cp engine/target/cuckoodb-engine-1.0.0-jar-with-dependencies.jar \
-  com.github.jinba1.cuckoodb.CuckooDB \
-  engine/samples/db engine/samples/input/query4.sql output.csv --max-tuples=10000 --timeout-ms=5000
-```
+## How it works
 
-**Output** (`output.csv`):
-
-```
-a,b,c,d
-1,200,50,33
-2,200,200,44
 ```
-
-## Running Examples
-
-The `engine/samples/` directory ships with 20 queries and a small dataset (Student, Course, Enrolled, Staff tables). Expected output lives in `engine/samples/expected_output/`.
-
-Run all 20 through the bundled runner, which diffs each result against the expected output and reports pass/fail. It is launched via `exec:exec` (not `exec:java`) so it runs with the engine module as the working directory — `exec:java` would keep the working directory at the reactor root and fail to find `samples/`:
-
-```bash
-./mvnw -pl engine -q test-compile exec:exec -Dexec.executable=java -Dexec.classpathScope=test \
-  "-Dexec.args=-cp %classpath com.github.jinba1.cuckoodb.SampleQueryRunner"
+SQL → JSqlParser → QueryPlanner → optimizer → operator tree → results
 ```
 
-Or run each query through the CLI and diff manually:
-
-```bash
-# Run all sample queries and diff against expected output
-for i in $(seq 1 20); do
-  java -cp engine/target/cuckoodb-engine-1.0.0-jar-with-dependencies.jar \
-    com.github.jinba1.cuckoodb.CuckooDB \
-    engine/samples/db "engine/samples/input/query${i}.sql" "/tmp/out${i}.csv"
-  diff "engine/samples/expected_output/query${i}.csv" "/tmp/out${i}.csv" && echo "query${i}: OK"
-done
-```
+The engine is a from-scratch Volcano/iterator executor — typed values, hash and nested-loop joins, selection pushdown, tuple/time budgets. The server wraps it behind a single `QueryService` choke point that applies the budget, a concurrency permit, and audit; **both** the REST controllers and the MCP tools go through it, so the guardrails can't be bypassed and apply uniformly. Engine internals — architecture, join algorithms, benchmarks, CLI — are in the **[engine README](engine/README.md)**.
 
-## Testing
+## Build and test
 
 ```bash
-./mvnw test
+./mvnw clean verify # builds + tests both modules: engine (419 tests) + server (90 tests)
 ```
 
-The test suite covers individual operators, the query planner, the optimiser, expression evaluation, query budgets, EXPLAIN, hash join, and end-to-end integration scenarios (339 tests).
+The 20 sample queries are a golden-output regression gate (see the engine README to run them). CI builds, tests, and publishes the container image to GHCR on every merge to `main`.
 
-## Project Structure
+## Project structure
 
 ```
-├── pom.xml  # Parent POM (aggregator: engine + server; Java 17, dep/plugin management)
-├── engine/  # Pure query engine — zero Spring dependencies
-│   ├── pom.xml  # cuckoodb-engine (JSqlParser 4.7, commons-csv 1.14.1, JMH 1.37 test-scope)
-│   ├── src/main/java/com/github/jinba1/cuckoodb/  # Core engine (35 files)
-│   │   └── operator/  # Volcano operators (11 files, incl. HashJoinOperator)
-│   ├── src/test/java/com/github/jinba1/cuckoodb/  # JUnit 5 tests (339 tests across 33 files)
-│   └── samples/
-│       ├── db/data/  # CSV data files (header row + data rows)
-│       ├── input/query[1-20].sql  # Sample queries
-│       └── expected_output/query[1-20].csv  # Expected results
-├── server/  # cuckoodb-server — Spring Boot REST + MCP gateway over the engine
-│   ├── pom.xml  # Spring Boot 4 (web MVC), springdoc/OpenAPI, Spring AI MCP server
-│   └── src/main/java/com/github/jinba1/cuckoodb/server/  # web/ controllers, query/ service, catalog/ facade, mcp/ agent tools, config
-├── mvnw / mvnw.cmd  # Maven Wrapper
-└── LICENSE
+├── engine/  # pure query engine — Java 17, zero Spring (see engine/README.md)
+└── server/  # Spring Boot 4 gateway — REST + MCP over the engine
 ```
 
 ## Background
 
-Originally built as a university project for the Advanced Database Systems course at the University of Edinburgh, subsequently extended with additional query optimisation and expanded test coverage.
+Originally built as a university project for the Advanced Database Systems course at the University of Edinburgh, then extended into a guarded, agent-facing gateway — REST and MCP interfaces, query budgets, and additional optimisation and test coverage.
 
 ## License
 
-This project is released under the MIT License. See [LICENSE](LICENSE) for details.
+Released under the MIT License. See [LICENSE](LICENSE).
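Note on the REST contract in the README.md change above: a caller of `POST /queries` has to tell a tuple-budget breach (`429`) from a time-budget breach (`504`). The following is a minimal Java sketch of such a caller, not code from this repository. The class and method names are hypothetical, the `200` success status is an assumption, and only the `/queries` path, the `{"sql": ...}` body shape, and the `429`/`504` meanings are taken from the README text.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical client sketch; not part of the cuckooDB code base.
final class QueryClientSketch {

    /** Minimal JSON string escaping for the "sql" field. */
    static String jsonString(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"' -> sb.append("\\\"");
                case '\\' -> sb.append("\\\\");
                case '\n' -> sb.append("\\n");
                case '\r' -> sb.append("\\r");
                case '\t' -> sb.append("\\t");
                default -> {
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
                }
            }
        }
        return sb.append('"').toString();
    }

    /** Builds (but does not send) the POST /queries request described in the README. */
    static HttpRequest buildQueryRequest(String baseUrl, String sql) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/queries"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"sql\":" + jsonString(sql) + "}"))
                .build();
    }

    /** Maps a status code to the budget outcomes the README names; 200 is an assumed success code. */
    static String interpret(int status) {
        return switch (status) {
            case 200 -> "ok";
            case 429 -> "tuple budget exceeded: retry with a tighter LIMIT";
            case 504 -> "time budget exceeded";
            default -> "unexpected status " + status;
        };
    }

    public static void main(String[] args) throws InterruptedException {
        String baseUrl = args.length > 0 ? args[0] : "http://localhost:8080";
        HttpRequest request = buildQueryRequest(baseUrl, "SELECT * FROM People LIMIT 5");
        try {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(interpret(response.statusCode()));
            System.out.println(response.body());
        } catch (IOException e) {
            // No gateway listening at baseUrl; the sketch only reports it.
            System.out.println("could not reach gateway: " + e.getMessage());
        }
    }
}
```

Keeping request construction and status interpretation as pure functions lets both be checked without a running container; only `main` touches the network.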
diff --git a/engine/README.md b/engine/README.md
new file mode 100644
index 0000000..9e93b01
--- /dev/null
+++ b/engine/README.md
@@ -0,0 +1,166 @@
+# cuckooDB — query engine
+
+The query engine under the [cuckooDB gateway](../README.md): an in-memory relational query engine on the Volcano/iterator model. It parses SQL via JSqlParser, builds an operator tree, optimises it, and executes tuple-by-tuple against CSV files. Pure Java 17, **zero Spring dependencies**.
+
+## Architecture
+
+```
+SQL → JSqlParser → QueryPlanner → QueryPlanOptimizer → Operator Tree → Results
+```
+
+| Component | Role |
+|-----------|------|
+| `QueryPlanner` | Parses SQL and builds the operator pipeline |
+| `QueryPlanOptimizer` | Selection pushdown, trivial operator removal |
+| `DBCatalog` | Schema and table metadata (singleton) |
+| `Value` | Typed tuple values (sealed interface: `IntValue`, `StringValue`) |
+| `ExpressionEvaluator` | Evaluates WHERE/HAVING conditions per tuple |
+| `ExpressionPreprocessor` | Resolves column references to indices |
+| `ConditionSplitter` | Separates join predicates from selection predicates |
+
+**Operator hierarchy** (all extend `Operator`):
+
+`ScanOperator` → `SelectOperator` → `ProjectOperator` → `JoinOperator` / `HashJoinOperator` → `SortOperator` → `AggregateOperator` → `DuplicateEliminationOperator` → `LimitOperator`
+
+## Scope
+
+Read-only SQL-over-CSV: `SELECT`/`FROM`/`WHERE`, inner joins, `GROUP BY` with `SUM`/`COUNT`/`AVG`/`MIN`/`MAX`, `ORDER BY`, `DISTINCT`, `LIMIT n`, and nested arithmetic/comparison expressions. Values are typed int or string, inferred per column from the data. Tables are discovered from CSV files with header rows; no separate schema file. No transactions, indexes, data modification, persistence, or full SQL dialect — the focus is query planning, optimisation, and the Volcano execution model.
+
+## Build and run (CLI)
+
+Run from the repository root (uses the Maven Wrapper):
+
+```bash
+./mvnw -pl engine -DskipTests clean package
+
+java -cp engine/target/cuckoodb-engine-1.0.0-jar-with-dependencies.jar \
+  com.github.jinba1.cuckoodb.CuckooDB \
+  [--max-tuples=N] [--timeout-ms=N]
+```
+
+`` is a directory containing a `data/` subdir of `.csv` tables. `--max-tuples` and `--timeout-ms` are optional and independent — use one, both, or neither.
+
+### Demo
+
+**Input** (`engine/samples/db/data/Student.csv`):
+
+```
+A, B, C, D
+1, 200, 50, 33
+2, 200, 200, 44
+3, 100, 105, 44
+4, 100, 50, 11
+5, 100, 500, 22
+6, 300, 400, 11
+```
+
+**Command** (`engine/samples/input/query4.sql` is `SELECT * FROM Student WHERE Student.A < 3;`):
+
+```bash
+java -cp engine/target/cuckoodb-engine-1.0.0-jar-with-dependencies.jar \
+  com.github.jinba1.cuckoodb.CuckooDB \
+  engine/samples/db engine/samples/input/query4.sql output.csv
+```
+
+**Output** (`output.csv`):
+
+```
+a,b,c,d
+1,200,50,33
+2,200,200,44
+```
+
+## Query budgets
+
+The engine enforces **total-work semantics**: every tuple emitted by any operator counts against the budget, including intermediate tuples later filtered or joined. A cross-product explosion that never produces output rows still hits the tuple limit. The timeout clock starts lazily at the first tuple emission. On a breach the partial output file is deleted, `Error: ` is written to stderr, and the process exits 1.
+
+## EXPLAIN
+
+Prefix any query with `EXPLAIN` to inspect the plan without executing it. The output has two sections — as written, then after optimisation:
+
+```
+=== Plan (as written) ===
+Aggregate[group by: Student.B; calls: SUM(Student.c)]
+  Project[Enrolled.A, Student.A, Student.B, Student.C, Student.D]
+    Select[Student.D > 30]
+      HashJoin[Student.A = Enrolled.A]
+        Scan[Student]
+        Scan[Enrolled]
+
+=== Plan (optimized) ===
+Aggregate[group by: Student.B; calls: SUM(Student.c)]
+  Project[Enrolled.A, Student.A, Student.B, Student.C, Student.D]
+    HashJoin[Student.A = Enrolled.A]
+      Select[Student.D > 30]
+        Scan[Student]
+      Project[Enrolled.A]
+        Scan[Enrolled]
+```
+
+The optimiser pushes the `Select` below the join (selection pushdown) and inserts a projection on the inner scan; the planner picked a hash join for the equi-condition. No execution occurs for `EXPLAIN`.
+
+## Join algorithms
+
+The planner selects between two join algorithms automatically.
+
+**Nested-loop** (`JoinOperator`): for every outer tuple the inner child is rewound and scanned in full. Handles any condition — equality, inequality, arbitrary expression, or cross product. Shown in `EXPLAIN` as `Join[]`.
+
+**Hash** (`HashJoinOperator extends JoinOperator`): the inner (build) side is drained once into a `HashMap` keyed by the equality conjuncts; the outer (probe) side streams through once. After a lookup, the full original condition is re-evaluated on every candidate, so residual non-equality conjuncts (e.g. `A.x = B.x AND A.y > 3`) work. Output order is identical to nested-loop (outer-major, inner order preserved per key bucket). Shown as `HashJoin[]`.
+
+**Auto-selection:** hash join is used when `Constants.useHashJoin` is `true` (default) **and** the condition has at least one column-to-column equality conjunct. Cross products and pure non-equi joins always use nested-loop. Set `Constants.useHashJoin = false` to force nested-loop everywhere.
+
+### Benchmarks
+
+A JMH 1.37 suite lives in `engine/src/test/java/com/github/jinba1/cuckoodb/bench/`. It is compiled in CI but never run there; run it locally from the repository root:
+
+```bash
+./mvnw -pl engine -q test-compile exec:exec -Dexec.executable=java -Dexec.classpathScope=test \
+  "-Dexec.args=-cp %classpath org.openjdk.jmh.Main .*Benchmark"
+```
+
+**Results** (OpenJDK 21.0.5, Intel Core i9-13900HX, 32 logical cores, Linux under WSL2):
+
+| Benchmark | matchesPerKey | rowsPerSide | useHashJoin | Mode | Cnt | Score | Error | Units |
+|-----------|--------------|-------------|-------------|------|-----|-------|-------|-------|
+| EndToEndJoinBenchmark.planAndDrain | N/A | N/A | true | avgt | 3 | 1.028 | ± 0.288 | ms/op |
+| EndToEndJoinBenchmark.planAndDrain | N/A | N/A | false | avgt | 3 | 315.523 | ± 36.021 | ms/op |
+| JoinAlgorithmBenchmark.hashJoin | 1 | 1000 | N/A | avgt | 5 | 0.270 | ± 0.011 | ms/op |
+| JoinAlgorithmBenchmark.hashJoin | 1 | 5000 | N/A | avgt | 5 | 1.382 | ± 0.154 | ms/op |
+| JoinAlgorithmBenchmark.hashJoin | 10 | 1000 | N/A | avgt | 5 | 2.160 | ± 0.109 | ms/op |
+| JoinAlgorithmBenchmark.hashJoin | 10 | 5000 | N/A | avgt | 5 | 10.661 | ± 0.840 | ms/op |
+| JoinAlgorithmBenchmark.nestedLoopJoin | 1 | 1000 | N/A | avgt | 5 | 202.621 | ± 25.313 | ms/op |
+| JoinAlgorithmBenchmark.nestedLoopJoin | 1 | 5000 | N/A | avgt | 5 | 5027.912 | ± 370.564 | ms/op |
+| JoinAlgorithmBenchmark.nestedLoopJoin | 10 | 1000 | N/A | avgt | 5 | 194.786 | ± 4.916 | ms/op |
+| JoinAlgorithmBenchmark.nestedLoopJoin | 10 | 5000 | N/A | avgt | 5 | 4785.620 | ± 212.683 | ms/op |
+
+`EndToEndJoinBenchmark` joins two 1 000-row CSV tables through the full pipeline; nested-loop re-parses the inner CSV once per outer row, so the ≈ 307× gap reflects both algorithm and I/O. `JoinAlgorithmBenchmark` uses in-memory inputs to isolate the algorithm; at 5 000 rows/side the operator-level gap is ≈ 3 600×.
+
+## Sample queries
+
+`engine/samples/` ships 20 queries and a small dataset (Student, Course, Enrolled, Staff). The bundled runner diffs each result against `engine/samples/expected_output/` — the golden-output regression gate. Launch via `exec:exec` (not `exec:java`) so it runs with the engine module as the working directory:
+
+```bash
+./mvnw -pl engine -q test-compile exec:exec -Dexec.executable=java -Dexec.classpathScope=test \
+  "-Dexec.args=-cp %classpath com.github.jinba1.cuckoodb.SampleQueryRunner"
+```
+
+## Testing
+
+```bash
+./mvnw -pl engine test
+```
+
+419 tests across operators, the planner, the optimiser, expression evaluation, query budgets, EXPLAIN, hash join, and end-to-end integration scenarios.
+
+## Layout
+
+```
+engine/
+├── src/main/java/com/github/jinba1/cuckoodb/  # core engine (45 files)
+│   └── operator/  # Volcano operators (11 files, incl. HashJoinOperator)
+├── src/test/java/com/github/jinba1/cuckoodb/  # JUnit 5 tests (419 across 41 files)
+└── samples/
+    ├── db/data/  # CSV tables (header row + data rows)
+    ├── input/query[1-20].sql
+    └── expected_output/query[1-20].csv
+```
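Note on the join algorithms described in the engine README change above: the claim that the hash join produces the same rows in the same order as the nested-loop join can be illustrated with a small standalone Java sketch. This is not the engine's `JoinOperator` or `HashJoinOperator`. Rows are plain `int[]` arrays, the class and method names are hypothetical, and the `Enrolled` rows are made-up sample data. What it takes from the README is the build/probe split, the re-evaluation of the full condition on every candidate, and the outer-major output order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

// Hypothetical illustration; the real operators work on typed tuples, not int[].
final class JoinSketch {

    /** Nested-loop: rescan the whole inner side for every outer row. */
    static List<int[]> nestedLoopJoin(List<int[]> outer, List<int[]> inner,
                                      BiPredicate<int[], int[]> condition) {
        List<int[]> out = new ArrayList<>();
        for (int[] o : outer) {
            for (int[] i : inner) {
                if (condition.test(o, i)) out.add(concat(o, i));
            }
        }
        return out;
    }

    /** Hash join: build a map over the inner side once, then probe with each outer row. */
    static List<int[]> hashJoin(List<int[]> outer, List<int[]> inner,
                                int outerKey, int innerKey,
                                BiPredicate<int[], int[]> fullCondition) {
        // Build phase: each bucket keeps inner rows in their original order.
        Map<Integer, List<int[]>> buckets = new HashMap<>();
        for (int[] i : inner) {
            buckets.computeIfAbsent(i[innerKey], k -> new ArrayList<>()).add(i);
        }
        // Probe phase: outer-major; the full condition is re-checked on every candidate,
        // so residual non-equality conjuncts still filter the result.
        List<int[]> out = new ArrayList<>();
        for (int[] o : outer) {
            for (int[] i : buckets.getOrDefault(o[outerKey], List.of())) {
                if (fullCondition.test(o, i)) out.add(concat(o, i));
            }
        }
        return out;
    }

    static int[] concat(int[] a, int[] b) {
        int[] r = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, r, a.length, b.length);
        return r;
    }

    public static void main(String[] args) {
        // Student rows (A, B, C, D) from the sample table; Enrolled rows are invented.
        List<int[]> student = List.of(
                new int[]{1, 200, 50, 33}, new int[]{2, 200, 200, 44}, new int[]{4, 100, 50, 11});
        List<int[]> enrolled = List.of(
                new int[]{1, 101}, new int[]{2, 101}, new int[]{1, 102}, new int[]{4, 103});
        // Student.A = Enrolled.A AND Student.D > 30
        BiPredicate<int[], int[]> cond = (s, e) -> s[0] == e[0] && s[3] > 30;

        List<int[]> nl = nestedLoopJoin(student, enrolled, cond);
        List<int[]> hj = hashJoin(student, enrolled, 0, 0, cond);
        hj.forEach(row -> System.out.println(Arrays.toString(row)));
        System.out.println("same rows, same order: "
                + Arrays.deepEquals(nl.toArray(new int[0][]), hj.toArray(new int[0][])));
    }
}
```

The nested-loop version does work proportional to the product of the two input sizes, while the hash version touches each inner row once during the build and then only the matching bucket per outer row, which is the asymmetry the benchmark table measures.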