[RFC] PPL `rest` command #5597

## Problem Statement

PPL can read documents from indices, but it has no way to bring a cluster's
operational and topology state into a query pipeline. Information such as
cluster health, node resource usage, shard placement, cluster settings,
installed plugins, index resolution, and the caller's identity lives behind
dedicated management and `_cat` REST endpoints. To inspect that state today, a
user must leave PPL, call the endpoint directly, and post-process the JSON by
hand. Management-endpoint data cannot be filtered, sorted, aggregated, or
projected with the rest of the language (`where`, `stats`, `sort`, `fields`).

## Current State

- On the Calcite path, PPL row sources are limited to index scans
  (`visitRelation`), index subqueries, and literal values.
- `describe <index>` and `show datasources` already expose metadata as tabular
  rows through a reserved-name system source that resolves via `visitRelation`.
  They cover only an index's field metadata and the datasource connection
  catalog; neither reaches cluster or `_cat` operational endpoints.
- The Calcite table-function seam (`visitTableFunction`) is unsupported on the
  primary path: it throws `CalciteUnsupportedException`. A `source=fn(...)`
  style therefore cannot introduce a new row source today.
- Net result: operational endpoints are reachable only outside PPL, and their
  responses cannot be composed with downstream pipeline operators or inspected
  with `EXPLAIN`.

## Long-Term Goals

**Ideal outcome.** A first-class, leading PPL command that turns a curated set
of read-only management endpoints into fixed-schema rows that compose naturally
with the rest of the language.

**Primary objectives.**
1. Read-only, safe access to operational and topology endpoints from within a
   PPL pipeline.
2. A fixed, plan-time-known schema per endpoint, so downstream `where`, `stats`,
   `sort`, and `fields` work and `EXPLAIN` is meaningful.
3. A default-deny allow-list, so only vetted read-only endpoints are reachable.
4. Caller-context authorization and secret-field redaction.

**Sustainability and scalability.** Endpoints are expressed as data in a
registry, so adding one is a reviewed registry entry rather than new operators
or grammar. The command rides the existing system-row-source seam, so it
inherits the optimizer and execution machinery already used by `describe` and
the system-index family.

**Does it address the root problem?** Yes. It makes operational data queryable
from PPL without standing up a new engine or execution model.

**Confidence.** High. It reuses a shipped, proven seam (the reserved-name
system source behind `describe`/`show`) and a fixed-schema scan; both the
mechanism and the execution path are already established in the codebase.

## Proposal

Add a leading command:

```
| rest <endpoint-path> [count=<int>] [<arg>=<value> ...]
```

It resolves an allow-listed, read-only management endpoint into a fixed-schema
table on the Calcite path, modeled as a system row source. It supports
row-count capping, per-endpoint server-side filter arguments with explicit
value validation, and deterministic plan-time validation that produces
client-side (400-class) errors. Endpoint responses are normalized into flat,
typed rows.

Example:

```
| rest '/_cat/indices' | where health = 'yellow' | sort index | fields index, health, pri
```
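
The plan-time checks described above can be sketched as follows. This is a minimal Python model, not the plugin's Java code: `REGISTRY`, `EndpointSpec`, `BadRequest`, and `validate` are hypothetical names, and the column lists and value domains are illustrative assumptions rather than the proposed schemas.

```python
from dataclasses import dataclass, field


class BadRequest(ValueError):
    """Stands in for a client-side (400-class) error raised at plan time."""


@dataclass(frozen=True)
class EndpointSpec:
    # Fixed output columns, known before execution (names only in this sketch).
    columns: tuple
    # Accepted argument name -> allowed values; anything else is rejected.
    arg_domains: dict = field(default_factory=dict)


# Hypothetical registry holding two of the RFC's endpoints.
# Columns and allowed values are assumptions made for illustration.
REGISTRY = {
    "/_cat/indices": EndpointSpec(
        columns=("index", "health", "pri"),
        arg_domains={"health": {"green", "yellow", "red"}},
    ),
    "/_cluster/health": EndpointSpec(
        columns=("cluster_name", "status"),
        arg_domains={"local": {"true", "false"}},
    ),
}


def validate(path, count=None, **args):
    """Return the endpoint spec, or raise BadRequest before anything is dispatched."""
    if not path:
        raise BadRequest("endpoint path must not be empty")
    spec = REGISTRY.get(path)
    if spec is None:  # default-deny: unregistered endpoints are unreachable
        raise BadRequest(f"endpoint not allow-listed: {path}")
    if count is not None and count < 0:
        raise BadRequest("count must not be negative")
    for name, value in args.items():
        if name not in spec.arg_domains:
            raise BadRequest(f"argument not allowed for {path}: {name}")
        if value not in spec.arg_domains[name]:
            raise BadRequest(f"value out of domain for {name}: {value}")
    return spec
```

In this model, `validate("/_cat/indices", health="purple")` raises `BadRequest` without contacting the cluster, which is the behavior the RFC's negative test cases are meant to assert.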

## Approach

- **Grammar and AST.** A new `REST` lexer token and `restCommand` parser rule;
  a `RestRelation` AST node. The AST builder validates the endpoint spec and
  encodes it into a single reserved table name.
- **Resolution.** The storage engine decodes the reserved name into a
  `RestSourceTable`, exactly as `describe` resolves to a system index. This
  bridges through `visitRelation` and never reaches the unsupported
  table-function seam.
- **Execution.** `RestSourceTable` -> `CalciteLogicalRestScan` ->
  `CalciteEnumerableRestScan`. A central registry maps each endpoint to its
  read-only transport action, its fixed output schema, the query arguments it
  accepts (with allowed value domains), and a secret-field filter. Dispatch runs
  under the caller's security context through the node client (with a standalone
  REST-client path for the datasource mode).
- **Initial endpoint set.** `/_cluster/health`, `/_cluster/state`,
  `/_cluster/settings`, `/_cat/indices`, `/_cat/nodes`, `/_cat/cluster_manager`,
  `/_cat/plugins`, `/_cat/shards`, `/_resolve/index`,
  `/_plugins/_security/authinfo`.
- **Arguments.** `count` caps emitted rows. Server-side filter arguments are
  applied per endpoint with explicit value validation: `local` on
  `/_cluster/health`, `health` on `/_cat/indices`, `expand_wildcards` on
  `/_resolve/index`. A `timeout` token is reserved in the grammar but rejected
  with a 400 in the initial release, since a single timeout cannot map uniformly
  across the endpoints.
- **Output shaping.** Each endpoint normalizes its response into the fixed
  schema: numeric type normalization, identifier-to-name resolution (for example
  the cluster-manager node id rendered as its node name), role-name expansion,
  structural flattening of nested responses into uniform rows, secret-field
  filtering, and graceful null-valued rows when an optional plugin is absent.
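
The encode and decode steps in the first two bullets can be pictured as a round trip through a single string. The RFC does not specify the reserved name's format, so the prefix and the base64-over-JSON encoding below are assumptions made for illustration, as are the function names.

```python
import base64
import json

# Hypothetical prefix; the RFC does not specify the reserved name's format.
RESERVED_PREFIX = "__REST__."


def encode_table_name(path, count=None, args=None):
    """Pack an already-validated endpoint spec into one reserved table name."""
    payload = {"path": path, "count": count, "args": args or {}}
    raw = json.dumps(payload, sort_keys=True).encode("utf-8")
    # URL-safe base64 keeps slashes and quotes out of the table name.
    return RESERVED_PREFIX + base64.urlsafe_b64encode(raw).decode("ascii")


def decode_table_name(name):
    """Return the endpoint spec, or None when the name is an ordinary index."""
    if not name.startswith(RESERVED_PREFIX):
        return None
    raw = base64.urlsafe_b64decode(name[len(RESERVED_PREFIX):])
    return json.loads(raw)
```

The point of the sketch is the shape of the seam: the planner only ever sees a table name, so resolution can stay on the `visitRelation` path, and an ordinary index name falls through untouched because `decode_table_name` returns `None` for it.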

## Alternatives

- **Table-function source `source=rest('<endpoint>')`.** Cleaner if the endpoint
  set later becomes open-ended and value-parameterized, and it would unify with
  other parameterized sources under one seam. It requires first building a
  generic Calcite table-function-source capability, because `visitTableFunction`
  is unsupported on the primary path. Deferred; the leading-command form delivers
  the same result today without that prerequisite.
- **Dynamic, response-driven schema (`_MAP` / schema-on-read).** Needed only for
  endpoints whose response shape is keyed by data values. Deferred to a future
  release that depends on the schema-on-read work; the initial endpoint set is
  fully representable with fixed schemas.
- **Raw JSON pass-through.** Returning the endpoint's JSON unchanged is rejected:
  a non-fixed shape cannot publish a plan-time row type, which breaks downstream
  operators and `EXPLAIN`.

## Implementation Discussion

- **Capability gate: table-function source.** `visitTableFunction` throws on the
  Calcite path, so introducing a row source through a function call is not
  available today. The leading-command plus reserved-name approach sidesteps this
  by reusing the system-index scan seam, which is already supported.
- **Capability gate: plan-time schema.** A Calcite scan must publish its row type
  before execution. This is why each endpoint declares a fixed schema in the
  registry, and why dynamic, response-driven endpoints are out of the initial
  scope.
- **Security.** Default-deny allow-list; only read-only endpoints are
  registered; dispatch runs under the caller's security context; secret-bearing
  fields are filtered during row shaping; and user-supplied argument values are
  validated against per-argument domains rather than passed unchecked into the
  underlying request.
- **No pushdown.** Rows originate from a management transport action, not a
  Lucene index, so `count` caps rows after the call and downstream `where`,
  `sort`, and `stats` run in Calcite. `EXPLAIN` shows the scan with downstream
  operators composed above it.
- **Testing.** Unit tests cover the registry (declared schema, allow-list,
  argument key and value validation, type coercion). Integration tests cover each
  endpoint plus negative cases that assert a 400 for a non-allow-listed endpoint,
  an empty path, a disallowed argument, an out-of-domain argument value, and a
  negative count.
- **Deferred items.** Dynamic, response-driven endpoints (`_MAP`); the
  `include_defaults` argument on `/_cluster/settings` and the `level` argument on
  `/_cluster/health`, which need additional plumbing or a different output shape;
  and a generic table-function source. Each is a follow-on, not a blocker for the
  initial command.
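
The security and no-pushdown points above can be illustrated with a small model of the scan. It is a Python sketch under assumptions: the schema, the `api_key` secret field, and the function names are hypothetical, and the downstream `where` and `sort` are emulated with plain Python rather than Calcite operators.

```python
# Hypothetical fixed schema: (column, converter) pairs known before execution.
SCHEMA = (("index", str), ("health", str), ("pri", int))
# Hypothetical secret-bearing field that must never reach a row.
SECRET_FIELDS = {"api_key"}


def shape(record):
    """Turn one raw response record into a row with exactly the schema's columns."""
    # Drop secrets first so they cannot leak even if a schema entry names them.
    visible = {k: v for k, v in record.items() if k not in SECRET_FIELDS}
    row = {}
    for column, convert in SCHEMA:
        value = visible.get(column)
        # Missing values become nulls; present values are normalized to the column type.
        row[column] = None if value is None else convert(value)
    return row


def rest_scan(raw_records, count=None):
    """Shape the full response, then cap it; nothing is pushed into the request."""
    rows = [shape(record) for record in raw_records]
    return rows if count is None else rows[:count]


# Stand-in for an endpoint response; numbers arrive as strings here.
raw = [
    {"index": "logs-1", "health": "green", "pri": "1", "api_key": "s3cr3t"},
    {"index": "logs-3", "health": "yellow", "pri": "1"},
    {"index": "logs-2", "health": "yellow", "pri": "3"},
]

rows = rest_scan(raw, count=2)
# Downstream `where health = 'yellow' | sort index` sees only the capped rows.
yellow = sorted((r for r in rows if r["health"] == "yellow"), key=lambda r: r["index"])
print(yellow)  # [{'index': 'logs-3', 'health': 'yellow', 'pri': 1}]
```

In this model `logs-2` is missing from the result because `count=2` truncated the scan before the filter ran. That ordering assumes `count` is applied at the scan, as the RFC's wording suggests, so a capped query may return fewer matching rows than the cluster actually holds.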

