feat(core): add Substrait dialect support#861
Conversation
Add a typed model in io.substrait.dialect for creating and consuming Substrait dialect YAML files, faithful to substrait/text/dialect_schema.yaml (introduced in substrait v0.76.0). The model mirrors the SimpleExtension pattern: a @Value.Enclosing Dialect holder with nested Immutables types and Jackson (de)serialization. The three polymorphic unions (supported_types/relations/expressions) use an enum-tag class per category with typed config fields, and custom (de)serializers that collapse config-free entries to bare enum strings and expand configured ones to objects. Config sub-enums are dialect-local to keep the dialect vocabulary decoupled from the algebra model. Schema validation is test-scope only (networknt json-schema-validator), so the published core jar gains no new runtime dependency. Tests cover a schema-validated round-trip, bare-string collapse, parsing the published spark_dialect.yaml, and the per-section dialect fixtures from the spec repo.
bestbeforetoday
left a comment
There was a problem hiding this comment.
The programmatic building of a Dialect looks really nice. A whole load of inline comments; mostly very minor suggestions or queries.
| private static ObjectMapper objectMapper() { | ||
| return new ObjectMapper(new YAMLFactory()) | ||
| .registerModule(new Jdk8Module()) | ||
| .enable(DeserializationFeature.ACCEPT_SINGLE_VALUE_AS_ARRAY) | ||
| // Omit absent Optionals and empty collections so that unset sections are not emitted. | ||
| // The custom (de)serializers for the polymorphic unions write their fields explicitly and | ||
| // are unaffected by this inclusion setting. | ||
| .setDefaultPropertyInclusion(JsonInclude.Include.NON_EMPTY); | ||
| } |
There was a problem hiding this comment.
ObjectMapper is threadsafe and designed for reuse. Perhaps a single shared instance should be used instead of creating a new one each time.
| * #toYaml(DialectDocument)}; parse one with {@link #load(String)} and friends. | ||
| */ | ||
| @Value.Enclosing | ||
| public class Dialect { |
There was a problem hiding this comment.
Dialect seems to be only used to provide an enclosing namespace for the nested classes. The io.substrait.dialect package is already providing a namespace for all the dialect handling. The enclosing Dialect class results in naming like io.substrait.dialect.Dialect.DialectDocument, which has a lot of repetition. Would it be practical to promote the nested classes to top-level classes within the dialect package? Perhaps DialectDocument could then be renamed to simply Dialect (or Document). The handful of static factory methods on Dialect that currently return Dialect.DialectDocument instances would then become factory methods on Dialect and return Dialect. This seems like a more typical static factory method pattern.
| public static DialectDocument load(InputStream stream) { | ||
| try (Scanner scanner = new Scanner(stream, StandardCharsets.UTF_8.name())) { | ||
| scanner.useDelimiter(READ_WHOLE_FILE); | ||
| String content = scanner.hasNext() ? scanner.next() : ""; | ||
| return load(content); | ||
| } | ||
| } |
There was a problem hiding this comment.
ObjectMapper can read values directly from an InputStream. We do not need to read the entire InputStream content into a String first.
|
|
||
| /** Parse a dialect from a YAML stream. */ | ||
| public static DialectDocument load(InputStream stream) { | ||
| try (Scanner scanner = new Scanner(stream, StandardCharsets.UTF_8.name())) { |
There was a problem hiding this comment.
I think the close of the Scanner at the end of the try-with-resources will also close the underlying InputStream. The caller might not expect this.
| if (metadata == null || metadata.isNull()) { | ||
| return null; | ||
| } | ||
| return ((ObjectMapper) p.getCodec()).convertValue(metadata, MAP_TYPE); |
There was a problem hiding this comment.
The unchecked cast has the potential to cause unexpected runtime failures if other parts of the code are changed and the codec ends up not being an ObjectMapper. If all the dialect code used a single shared default (package) scoped ObjectMapper instance, perhaps that could be accessed directly by this code instead of calling getCodec on the parser.
There was a problem hiding this comment.
Similar pattern in SupportedTypeDeserializer.
| * Serializes a {@code supported_expressions} entry as a bare enum string when it carries no | ||
| * configuration, or as a configuration object otherwise. | ||
| */ | ||
| public class SupportedExpressionSerializer extends JsonSerializer<SupportedExpression> { |
There was a problem hiding this comment.
Perhaps could be default (package) scope instead of exposing in the public API.
| * Deserializes a {@code supported_relations} entry, which is either a bare enum string (e.g. {@code | ||
| * FILTER}) or a configuration object (e.g. {@code {relation: JOIN, join_types: [INNER]}}). | ||
| */ | ||
| public class SupportedRelationDeserializer extends JsonDeserializer<SupportedRelation> { |
There was a problem hiding this comment.
Perhaps could be default (package) scope instead of exposing in the public API.
| * Serializes a {@code supported_relations} entry as a bare enum string when it carries no | ||
| * configuration, or as a configuration object otherwise. | ||
| */ | ||
| public class SupportedRelationSerializer extends JsonSerializer<SupportedRelation> { |
There was a problem hiding this comment.
Perhaps could be default (package) scope instead of exposing in the public API.
| * Deserializes a {@code supported_types} entry, which is either a bare enum string (e.g. {@code | ||
| * BOOL}) or a configuration object (e.g. {@code {type: PRECISION_TIMESTAMP, max_precision: 9}}). | ||
| */ | ||
| public class SupportedTypeDeserializer extends JsonDeserializer<SupportedType> { |
There was a problem hiding this comment.
Perhaps could be default (package) scope instead of exposing in the public API.
| * Serializes a {@code supported_types} entry as a bare enum string when it carries no | ||
| * configuration, or as a configuration object otherwise. | ||
| */ | ||
| public class SupportedTypeSerializer extends JsonSerializer<SupportedType> { |
There was a problem hiding this comment.
Perhaps could be default (package) scope instead of exposing in the public API.
What
Adds a typed model in the
coreJava SDK for creating and consuming Substrait dialect YAML files, faithful tosubstrait/text/dialect_schema.yaml(introduced in substrait v0.76.0). Until now the only producer of a dialect in this repo was the ScalaDialectGeneratorin thesparkmodule, whose ad-hoc model isn't reusable.Usage
Build a dialect, serialize it to YAML, and parse it back:
The configuration-free entries serialize as bare enum strings and the configured ones as mappings:
A fuller example covering every union and configuration option lives in
DialectRoundTripTest.Design
io.substrait.dialectpackage with a@Value.EnclosingDialectholder and nested Immutables types, mirroring the existingSimpleExtensionpattern (Jackson +@Value.Immutable, staticload(...)/toYaml(...)helpers).supported_types/supported_relations/supported_expressions) areoneOf [bare-enum-string | config-object]in the schema. Each is modeled as one enum-tag class per category (SupportedType/SupportedRelation/SupportedExpression) carrying a dialect-local kind enum plus typed config fields. Custom Jackson (de)serializers collapse config-free entries to a bare enum string and expand configured ones to objects.JoinType,SetOperation, ...) are dialect-local with exactly the schema's constants, keeping the dialect vocabulary decoupled from the relational-algebra model (whoseJoin.JoinType/Set.SetOpcarry extraUNKNOWN/deprecated values).Type/Rel/Expressionhierarchies model full algebra instances — the wrong abstraction level for capability tags — so they are intentionally not reused for the kind enums.Validation & tests
networknt json-schema-validator); the publishedcorejar gains no new runtime dependency.processTestResourcescopies the dialect schema, the publishedspark_dialect.yaml, and the spec's per-section dialect fixtures onto the test classpath.types,relations,expressions,functions,execution_behavior).The
sparkmodule is left unchanged; migrating itsDialectGeneratoronto this model is a natural follow-up.🤖 Generated with AI