Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
318 changes: 70 additions & 248 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,301 +4,123 @@
![Coverage](https://codecov.io/gh/JinBa1/java-query-engine/branch/main/graph/badge.svg)
![Dependencies](https://img.shields.io/badge/dependencies-up%20to%20date-brightgreen)

An in-memory relational query engine built on the Volcano/iterator model. Parses SQL via JSqlParser, builds an operator tree, and executes queries tuple-by-tuple against CSV data.
**A self-hosted gateway that gives AI agents safe, read-only, budgeted SQL access to your CSV files — no database required.**

## Architecture
Everyone has CSVs — exports, dumps, logs — and AI agents increasingly need to query them. Embedding a database in every agent environment hands over raw file access; what you actually want is a *guarded window* onto the data: an endpoint that is read-only by construction, resource-budgeted, and auditable. cuckooDB is that gateway, built on a from-scratch query engine and exposed over both a **REST API** and the **Model Context Protocol (MCP)**, so an agent can discover tables, preview data, check a query's cost, and run SQL — without writing SQL blind or bypassing the guardrails.

```
SQL → JSqlParser → QueryPlanner → QueryPlanOptimizer → Operator Tree → Results
```

**Core components:**

| Component | Role |
|-----------|------|
| `QueryPlanner` | Parses SQL and builds the operator pipeline |
| `QueryPlanOptimizer` | Selection pushdown, trivial operator removal |
| `DBCatalog` | Schema and table metadata (singleton) |
| `Value` | Typed tuple values (sealed interface: `IntValue`, `StringValue`) |
| `ExpressionEvaluator` | Evaluates WHERE/HAVING conditions per tuple |
| `ExpressionPreprocessor` | Resolves column references to indices |
| `ConditionSplitter` | Separates join predicates from selection predicates |

**Operator hierarchy** (all extend `Operator`):

`ScanOperator` → `SelectOperator` → `ProjectOperator` → `JoinOperator` / `HashJoinOperator` → `SortOperator` → `AggregateOperator` → `DuplicateEliminationOperator` → `LimitOperator`

## Feature Matrix

| Feature | Status |
|---------|--------|
| `SELECT *` / projection | ✅ Supported |
| `WHERE` predicates | ✅ Supported |
| Inner joins (nested-loop) | ✅ Supported |
| Hash join (auto-selected for equi-joins) | ✅ Supported |
| `ORDER BY` | ✅ Supported |
| `GROUP BY` + `SUM`, `COUNT`, `AVG`, `MIN`, `MAX` | ✅ Supported |
| `LIMIT n` | ✅ Supported |
| `DISTINCT` | ✅ Supported |
| Nested arithmetic/comparison expressions | ✅ Supported |
| Query optimisation (selection pushdown) | ✅ Supported |
| Typed columns (int, string) | ✅ Supported |
| CSV header support | ✅ Supported |
| Query budgets (`--max-tuples`, `--timeout-ms`) | ✅ Supported |
| `EXPLAIN` plan inspection | ✅ Supported |
| Indexes | ❌ Not supported |
| Transactions | ❌ Not supported |
| INSERT / UPDATE / DELETE | ❌ Not supported |
| Concurrency | ❌ Not supported |
| Persistence | ❌ Not supported |
| Full SQL dialect | ❌ Not supported |

## Scope

This engine supports **SQL-over-CSV query execution**: read-only queries against tables stored as CSV files. It does not support transactions, indexes, data modification (INSERT/UPDATE/DELETE), concurrency, persistence, or a full SQL dialect. Values are typed int or string, inferred per column from the data. Tables are discovered from CSV files with header rows; no separate schema file.

Supported SQL features include `SELECT`/`FROM`/`WHERE`, `GROUP BY` with `SUM`, `COUNT`, `AVG`, `MIN`, and `MAX` aggregates, `ORDER BY`, `DISTINCT`, inner joins, and `LIMIT n`.

The focus is on demonstrating query planning, optimisation, and the Volcano iterator execution model.

### Aggregate and LIMIT semantics

| Case | Behavior |
|---|---|
| `AVG` of ints | truncated integer division (toward zero) |
| Aggregate over empty input, no `GROUP BY` | zero rows (header only) — deviates from SQL's NULL row |
| `COUNT(col)` | equals `COUNT(*)` — the engine has no NULLs |
| `SUM`/`AVG` on a string column | error |
| `MIN`/`MAX` on strings | lexicographic |
| `SUM` past int range | error |
| `LIMIT 0` | header-only output |
| `OFFSET`, `LIMIT ALL` | not supported (error) |

## Quick Start

**Prerequisites:** Java 17, Maven (or use the included Maven Wrapper).

```bash
# Clone
git clone https://github.com/JinBa1/java-query-engine.git
cd java-query-engine
## Features

# Build fat JAR (engine module)
./mvnw -pl engine -DskipTests clean package
```

**Run a query:**

```bash
java -cp engine/target/cuckoodb-engine-1.0.0-jar-with-dependencies.jar \
com.github.jinba1.cuckoodb.CuckooDB \
database_dir input_file output_file [--max-tuples=N] [--timeout-ms=N]
```

Both `--max-tuples` and `--timeout-ms` are optional and independent. Omit either to impose no limit on that dimension.
| Capability | |
|---|:--:|
| Read-only SQL over CSV — `SELECT` / `WHERE` / `JOIN` / `GROUP BY` / `ORDER BY` / `LIMIT` / `DISTINCT` | ✅ |
| Aggregates — `COUNT` / `SUM` / `AVG` / `MIN` / `MAX` | ✅ |
| Hash + nested-loop joins (planner auto-selects) | ✅ |
| Typed columns (int / string), CSV headers | ✅ |
| `EXPLAIN` plan inspection | ✅ |
| Tuple + time budgets, fail-closed | ✅ |
| **REST API** + OpenAPI / Swagger | ✅ |
| **MCP server** — five agent tools, Streamable-HTTP | ✅ |
| Runs as a container (published to GHCR) | ✅ |
| Writes / transactions / indexes / persistence | ❌ read-only by design |

### Run the server as a container
## Quick start

The Spring Boot gateway (REST + MCP) ships as a container image, so you can run it next to your data with no Java toolchain. Put your CSV files in a folder and mount it as the catalog's data directory:
Run the gateway next to your data no Java toolchain needed. Put your CSVs in a folder and mount it:

```bash
docker run --rm -p 8080:8080 \
-v /path/to/your/csvs:/cuckoodb/data \
ghcr.io/jinba1/cuckoodb:latest
```

- **REST:** `POST http://localhost:8080/queries`, `GET /tables`, `GET /tables/{name}` (OpenAPI at `/swagger-ui.html`).
- **MCP:** Streamable-HTTP endpoint at `http://localhost:8080/mcp` — point an MCP client at it to query your CSVs with `list_tables` / `describe_table` / `sample_rows` / `explain_query` / `query`.

The image is published to GHCR on each merge to `main`. To build it locally instead: `docker build -t cuckoodb .`

### Query budgets

The engine enforces **total-work semantics**: every tuple emitted by any operator in the tree counts against the budget, including intermediate tuples that are later filtered or joined. A cross-product explosion that never produces output rows will still hit the tuple limit. The timeout clock starts lazily at the first tuple emission.

When a budget is exceeded:
- The partial output file is deleted.
- `Error: <message>` is written to stderr.
- The process exits with code 1.

Both flags are optional and independent — you can use one, both, or neither.

### EXPLAIN

Prefix any query with `EXPLAIN` to inspect the query plan without executing it:

```sql
EXPLAIN SELECT Student.B, SUM(Student.C) FROM Student, Enrolled
WHERE Student.D > 30 AND Student.A = Enrolled.A
GROUP BY Student.B;
```

The output file receives a two-section plan:
Query over REST:

```
=== Plan (as written) ===
Aggregate[group by: Student.B; calls: SUM(Student.c)]
Project[Enrolled.A, Student.A, Student.B, Student.C, Student.D]
Select[Student.D > 30]
Join[Student.A = Enrolled.A]
Scan[Student]
Scan[Enrolled]
```bash
curl -s localhost:8080/tables
# ["People"]

=== Plan (optimized) ===
Aggregate[group by: Student.B; calls: SUM(Student.c)]
Project[Enrolled.A, Student.A, Student.B, Student.C, Student.D]
Join[Student.A = Enrolled.A]
Select[Student.D > 30]
Scan[Student]
Project[Enrolled.A]
Scan[Enrolled]
curl -s localhost:8080/queries -H 'Content-Type: application/json' \
-d '{"sql":"SELECT * FROM People LIMIT 5"}'
# {"columns":[{"name":"id","type":"INT"},...],"rows":[[1,"alice"],...],"rowCount":5,"truncated":true,"hint":"..."}
```

No query execution occurs for EXPLAIN queries.

## Join algorithms

The engine supports two join algorithms; the planner selects between them automatically.
…or connect an AI agent over MCP (below). To use the engine directly from the command line instead, see the **[engine README](engine/README.md)**.

### Nested-loop join
## For agents: MCP

`JoinOperator` implements a classic nested-loop join: for every outer tuple the inner child is rewound and scanned in full. It handles any join condition (equality, inequality, arbitrary expression, or cross product with no condition). `EXPLAIN` shows it as `Join[<condition>]`.
The server exposes a Model Context Protocol endpoint at `http://localhost:8080/mcp` (Streamable-HTTP). Point an MCP client (e.g. Claude Desktop) at it and the agent gets five tools:

### Hash join

`HashJoinOperator` extends `JoinOperator` with an in-memory hash join. The inner (build) side is drained once into a `HashMap` keyed by the equality conjuncts; the outer (probe) side then streams through once. After a hash-table lookup, the full original condition is re-evaluated on every candidate, so residual non-equality conjuncts (e.g. `A.x = B.x AND A.y > 3`) work correctly. Output order — outer-major, inner order preserved within each key bucket — is identical to the nested-loop join. `EXPLAIN` shows it as `HashJoin[<condition>]`.

**Auto-selection rule:** the planner chooses hash join when `Constants.useHashJoin` is `true` (the default) **and** the join condition contains at least one column-to-column equality conjunct (e.g. `Student.A = Enrolled.A`). Cross products (no condition) and pure non-equi joins (e.g. `A.x > B.y` only) always use nested-loop join.

**Toggle:** set `Constants.useHashJoin = false` at program start (or in tests) to force nested-loop for all joins.

### Benchmarks

Performance was measured with a JMH 1.37 benchmark suite in the `bench/` package (`engine/src/test/java/com/github/jinba1/cuckoodb/bench/`). The suite is compiled in CI but never run there; run it locally with:

```bash
./mvnw -pl engine -q test-compile exec:exec -Dexec.executable=java -Dexec.classpathScope=test \
"-Dexec.args=-cp %classpath org.openjdk.jmh.Main .*Benchmark"
```

**Results** (OpenJDK 21.0.5, Intel Core i9-13900HX, 32 logical cores, Linux under WSL2):

| Benchmark | matchesPerKey | rowsPerSide | useHashJoin | Mode | Cnt | Score | Error | Units |
|-----------|--------------|-------------|-------------|------|-----|-------|-------|-------|
| EndToEndJoinBenchmark.planAndDrain | N/A | N/A | true | avgt | 3 | 1.028 | ± 0.288 | ms/op |
| EndToEndJoinBenchmark.planAndDrain | N/A | N/A | false | avgt | 3 | 315.523 | ± 36.021 | ms/op |
| JoinAlgorithmBenchmark.hashJoin | 1 | 1000 | N/A | avgt | 5 | 0.270 | ± 0.011 | ms/op |
| JoinAlgorithmBenchmark.hashJoin | 1 | 5000 | N/A | avgt | 5 | 1.382 | ± 0.154 | ms/op |
| JoinAlgorithmBenchmark.hashJoin | 10 | 1000 | N/A | avgt | 5 | 2.160 | ± 0.109 | ms/op |
| JoinAlgorithmBenchmark.hashJoin | 10 | 5000 | N/A | avgt | 5 | 10.661 | ± 0.840 | ms/op |
| JoinAlgorithmBenchmark.nestedLoopJoin | 1 | 1000 | N/A | avgt | 5 | 202.621 | ± 25.313 | ms/op |
| JoinAlgorithmBenchmark.nestedLoopJoin | 1 | 5000 | N/A | avgt | 5 | 5027.912 | ± 370.564 | ms/op |
| JoinAlgorithmBenchmark.nestedLoopJoin | 10 | 1000 | N/A | avgt | 5 | 194.786 | ± 4.916 | ms/op |
| JoinAlgorithmBenchmark.nestedLoopJoin | 10 | 5000 | N/A | avgt | 5 | 4785.620 | ± 212.683 | ms/op |
| Tool | What it does |
|---|---|
| `list_tables` | list the available tables |
| `describe_table` | a table's column names and types |
| `sample_rows` | preview rows without writing SQL |
| `explain_query` | preview a query's plan and cost before running it |
| `query` | run a read-only `SELECT`, budget-bounded |

`EndToEndJoinBenchmark` joins two 1 000-row CSV tables through the full planner pipeline; nested-loop re-parses the inner CSV once per outer row, so the gap (≈ 307×) reflects both the algorithmic difference and I/O cost. `JoinAlgorithmBenchmark` uses in-memory `CachedOperator` inputs to isolate the join algorithm itself; at 5 000 rows/side the operator-level gap is ≈ 3 600×.
Every tool routes through the same guarded execution path as the REST API, so agent traffic inherits the read-only guarantee, the tuple/time budget, and concurrency limits (with audit hooks in place) — there is no way to bypass them.

Benchmarks are compiled in CI but never executed there.
## REST API

## Demo
| Endpoint | |
|---|---|
| `POST /queries` | plan + execute one read-only query → JSON columns/rows, or an `EXPLAIN` plan |
| `GET /tables` | list table names |
| `GET /tables/{name}` | a table's typed schema |
| `/swagger-ui.html` | interactive OpenAPI docs |

**Input table** (`engine/samples/db/data/Student.csv`):
Queries are **budget-bounded and fail-closed**: the server always attaches a budget, so an unbounded query is unreachable. A result that would exceed the tuple budget returns `429` (retry with a tighter `LIMIT`); one that exceeds the time budget returns `504`.

```
A, B, C, D
1, 200, 50, 33
2, 200, 200, 44
3, 100, 105, 44
4, 100, 50, 11
5, 100, 500, 22
6, 300, 400, 11
```
### EXPLAIN

**Query** (`engine/samples/input/query4.sql`):
Any query can be planned without executing it — prefix `EXPLAIN` over REST, or call the `explain_query` tool. The plan is shown as written and after optimisation:

```sql
SELECT * FROM Student WHERE Student.A < 3;
```
=== Plan (as written) ===
Project[Student.B, Student.C]
Select[Student.D > 30]
HashJoin[Student.A = Enrolled.A]
Scan[Student]
Scan[Enrolled]

**Command:**

```bash
java -cp engine/target/cuckoodb-engine-1.0.0-jar-with-dependencies.jar \
com.github.jinba1.cuckoodb.CuckooDB \
engine/samples/db engine/samples/input/query4.sql output.csv
=== Plan (optimized) ===
Project[Student.B, Student.C]
HashJoin[Student.A = Enrolled.A]
Select[Student.D > 30]
Scan[Student]
Project[Enrolled.A]
Scan[Enrolled]
```

To limit resource usage, add optional budget flags:
The optimiser pushes the `Select` below the join (selection pushdown) and projects the inner scan down to just the key it needs; the planner picked a hash join for the equi-condition. See the [engine README](engine/README.md#explain) for the full treatment.

```bash
java -cp engine/target/cuckoodb-engine-1.0.0-jar-with-dependencies.jar \
com.github.jinba1.cuckoodb.CuckooDB \
engine/samples/db engine/samples/input/query4.sql output.csv --max-tuples=10000 --timeout-ms=5000
```
## How it works

**Output** (`output.csv`):

```
a,b,c,d
1,200,50,33
2,200,200,44
```

## Running Examples

The `engine/samples/` directory ships with 20 queries and a small dataset (Student, Course, Enrolled, Staff tables). Expected output lives in `engine/samples/expected_output/`.

Run all 20 through the bundled runner, which diffs each result against the expected output and reports pass/fail. It is launched via `exec:exec` (not `exec:java`) so it runs with the engine module as the working directory — `exec:java` would keep the working directory at the reactor root and fail to find `samples/`:

```bash
./mvnw -pl engine -q test-compile exec:exec -Dexec.executable=java -Dexec.classpathScope=test \
"-Dexec.args=-cp %classpath com.github.jinba1.cuckoodb.SampleQueryRunner"
SQL → JSqlParser → QueryPlanner → optimizer → operator tree → results
```

Or run each query through the CLI and diff manually:

```bash
# Run all sample queries and diff against expected output
for i in $(seq 1 20); do
java -cp engine/target/cuckoodb-engine-1.0.0-jar-with-dependencies.jar \
com.github.jinba1.cuckoodb.CuckooDB \
engine/samples/db "engine/samples/input/query${i}.sql" "/tmp/out${i}.csv"
diff "engine/samples/expected_output/query${i}.csv" "/tmp/out${i}.csv" && echo "query${i}: OK"
done
```
The engine is a from-scratch Volcano/iterator executor — typed values, hash and nested-loop joins, selection pushdown, tuple/time budgets. The server wraps it behind a single `QueryService` choke point that applies the budget, a concurrency permit, and audit; **both** the REST controllers and the MCP tools go through it, so the guardrails can't be bypassed and apply uniformly. Engine internals — architecture, join algorithms, benchmarks, CLI — are in the **[engine README](engine/README.md)**.

## Testing
## Build and test

```bash
./mvnw test
./mvnw clean verify # builds + tests both modules: engine (419 tests) + server (90 tests)
```

The test suite covers individual operators, the query planner, the optimiser, expression evaluation, query budgets, EXPLAIN, hash join, and end-to-end integration scenarios (339 tests).
The 20 sample queries are a golden-output regression gate (see the engine README to run them). CI builds, tests, and publishes the container image to GHCR on every merge to `main`.

## Project Structure
## Project structure

```
├── pom.xml # Parent POM (aggregator: engine + server; Java 17, dep/plugin management)
├── engine/ # Pure query engine — zero Spring dependencies
│ ├── pom.xml # cuckoodb-engine (JSqlParser 4.7, commons-csv 1.14.1, JMH 1.37 test-scope)
│ ├── src/main/java/com/github/jinba1/cuckoodb/ # Core engine (35 files)
│ │ └── operator/ # Volcano operators (11 files, incl. HashJoinOperator)
│ ├── src/test/java/com/github/jinba1/cuckoodb/ # JUnit 5 tests (339 tests across 33 files)
│ └── samples/
│ ├── db/data/ # CSV data files (header row + data rows)
│ ├── input/query[1-20].sql # Sample queries
│ └── expected_output/query[1-20].csv # Expected results
├── server/ # cuckoodb-server — Spring Boot REST + MCP gateway over the engine
│ ├── pom.xml # Spring Boot 4 (web MVC), springdoc/OpenAPI, Spring AI MCP server
│ └── src/main/java/com/github/jinba1/cuckoodb/server/ # web/ controllers, query/ service, catalog/ facade, mcp/ agent tools, config
├── mvnw / mvnw.cmd # Maven Wrapper
└── LICENSE
├── engine/ # pure query engine — Java 17, zero Spring (see engine/README.md)
└── server/ # Spring Boot 4 gateway — REST + MCP over the engine
```

## Background

Originally built as a university project for the Advanced Database Systems course at the University of Edinburgh, subsequently extended with additional query optimisation and expanded test coverage.
Originally built as a university project for the Advanced Database Systems course at the University of Edinburgh, then extended into a guarded, agent-facing gateway — REST and MCP interfaces, query budgets, and additional optimisation and test coverage.

## License

This project is released under the MIT License. See [LICENSE](LICENSE) for details.
Released under the MIT License. See [LICENSE](LICENSE).
Loading
Loading