From a1d34c919c3b27a06134262f85bba88d48ae0bd4 Mon Sep 17 00:00:00 2001
From: Jason Grey
Date: Fri, 10 Apr 2026 20:03:33 -0500
Subject: [PATCH 1/3] fix: repair PHP syntax error in admin/class-content-signing-admin-author-profiles.php

The render_page() method was missing its closing sequence (a ?>, , render_authors_list($authors); } -/** + ?> + + - -

Date: Fri, 10 Apr 2026 22:15:35 -0500
Subject: [PATCH 2/3] docs(html-protocol): align with amended spec

Updates the HTML signature protocol documentation to match the amended
paper specification:

- Canonicalization section: add explicit text-only scoping, enumerate
  the semantic attacks this leaves open (element rewrapping, link swap,
  surrounding media manipulation), and document the layered response
  (domain binding alerts readers at unexpected origins; research and
  reputation path traces signed content back to canonical origin and
  flags imposter copies).
- Signature data format: replace old {contentHash}:{domain}:{authorId}
  binding with new {content-hash}:{claims-hash}:{domain}:{signed-at}.
  Drop authorId (redundant with keyid resolution), add claims-hash for
  tamper-evident metadata, add signed-at timestamp.
- Hash encoding: switch to unpadded Base64 and note the open feedback
  invitation on encoding alternatives (hex, Base32).
- Verification flow: restructure into two layers -- cryptographic
  verification (local, deterministic) and trust decision (client
  policy). Add detail on keyid resolution methods (DID, direct URL,
  trust directory reference). Add note that optional directory queries
  enrich trust decision but are never required for verification.

Tracks the 2026-04-10 design decisions committed in the paper at
bb3dc5a, 1187b2d, 6d0511c, 271a455, ca8cc3b.
---
 docs/html-protocol.md | 86 +++++++++++++++++++++++++++++++++++--------
 1 file changed, 71 insertions(+), 15 deletions(-)

diff --git a/docs/html-protocol.md b/docs/html-protocol.md
index 0de34eb..2b93022 100644
--- a/docs/html-protocol.md
+++ b/docs/html-protocol.md
@@ -96,39 +96,95 @@ Both forms are valid. Verifying clients should handle either case.
 
 Before hashing, content MUST be canonicalized:
 
-1. Strip all HTML tags (extract text content only)
-2. Collapse all whitespace sequences to a single space
-3. Trim leading and trailing whitespace
-4. Encode as UTF-8
+1. Parse the HTML and extract text nodes in document order
+2. Strip all HTML markup (tags and attributes); only the text content contributes to the hash
+3. Collapse all whitespace sequences to a single space
+4. Trim leading and trailing whitespace
+5. Apply the text normalization defined by the `@htmltrust/canonicalization` library (NFKC, quote normalization, dash normalization, invisible character stripping)
+6. Encode as UTF-8
 
-The resulting string is hashed with SHA-256 and prefixed: `sha256:`.
+The resulting string is hashed with SHA-256 and expressed as `sha256:`, where `` is the unpadded Base64 encoding of the 32-byte digest.
+
+### Text-only scope
+
+The canonicalization hashes **text content only**, not the HTML markup or attributes that surround it. This means an adversary with possession of signed text MAY:
+
+- Rewrap the text in misleading block elements (e.g., change an `` to a `` strikethrough)
+- Alter link destinations (`href` values) on `` elements surrounding the signed text
+- Introduce, remove, or swap images and other media around the signed text
+
+These are **semantic integrity concerns**, not cryptographic ones. HTMLTrust addresses them through a layered design:
+
+1. **Domain binding** (see Signature Data Format below): signatures bind the content to a specific publication origin. A reader or crawler encountering signed content at an unexpected origin is alerted by signature check failure.
+2. **Research and reputation path**: crawlers and researchers can trace signed content back to its canonical publication origin through the trust directory, flag imposter copies, and mark manipulated surrounding context. Over time the directory's reputation and reports surface altered copies to any consumer whose trust policy considers them.
+
+The layered design keeps cryptographic verification simple and portable across language implementations, while delegating semantic-integrity detection to the research ecosystem where it can evolve without breaking existing signatures.
+
+**Open design question**: a future revision MAY extend the hash to cover particularly meaningful attributes, especially `href` on `` elements (since link-swap within the original publication origin is a phishing vector that domain-binding and research cannot address alone). Feedback on which attributes to cover is explicitly welcome.
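
The canonicalization steps and the text-only scope described in this hunk can be illustrated with a minimal Python sketch. It is not the `@htmltrust/canonicalization` library: the names `canonicalize` and `content_hash` are hypothetical, the standard-library `html.parser` stands in for whatever parser a real implementation uses, and only the NFKC part of step 5 is applied (quote normalization, dash normalization, and invisible character stripping are defined by the library and are omitted here). Excluded elements and block-boundary handling are likewise not modeled.

```python
import base64
import hashlib
import re
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes in document order; tags and attributes are ignored."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def canonicalize(html: str) -> str:
    # Steps 1-2: parse and keep only the text content.
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = "".join(extractor.parts)
    # Steps 3-4: collapse whitespace runs to one space, then trim.
    text = re.sub(r"\s+", " ", text).strip()
    # Step 5 (partial, assumption): NFKC only; the library's quote, dash,
    # and invisible-character rules are not reproduced here.
    return unicodedata.normalize("NFKC", text)


def content_hash(html: str) -> str:
    # Step 6: encode as UTF-8, hash with SHA-256, emit unpadded Base64.
    digest = hashlib.sha256(canonicalize(html).encode("utf-8")).digest()
    return "sha256:" + base64.b64encode(digest).decode("ascii").rstrip("=")
```

Because attributes never reach the hash, `content_hash('<a href="https://a.example">x</a>')` and `content_hash('<a href="https://evil.example">x</a>')` return the same value in this sketch, which is exactly the link-swap exposure the text-only scope section describes.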
 
 ## Signature Data Format
 
-The signature binds three values, concatenated with `:` separators:
+The signature binds four values, concatenated with `:` separators:
 
 ```
-{contentHash}:{domain}:{authorId}
+{content-hash}:{claims-hash}:{domain}:{signed-at}
 ```
 
+- `content-hash` — hash of the canonicalized text content (see above)
+- `claims-hash` — SHA-256 hash of the canonical serialization of all inner `` claim elements, ordered lexically by name (ensures tamper-evident claim metadata)
+- `domain` — the origin where the content is authoritatively published (anti-theft binding)
+- `signed-at` — the ISO-8601 timestamp from the `` element
+
 For example:
 
 ```
-sha256:a591a6d40bf420404a...146e:example.com:123e4567-e89b-12d3-a456-426614174000
+sha256:RAyBCvKT...:sha256:eFgHiJkL...:example.com:2025-05-01T10:30:00Z
 ```
 
-This string is signed with the author's private key using the specified algorithm.
+The author's identity is **not** included in the binding because it is implicit in the keyid resolution step: any attempt to claim a signature under a different identity would resolve to a different public key and fail verification. This string is signed with the author's private key using the algorithm declared in the `algorithm` attribute.
+
+**Hash encoding (open feedback)**: hashes are encoded as unpadded Base64, which is shorter than hexadecimal by roughly one-third. Community feedback on alternative encodings (hex, Base32) for ecosystem alignment is welcome.
 
 ## Verification Flow
 
-A verifying client (browser extension, crawler, etc.) performs these steps:
+HTMLTrust separates verification into two distinct layers, per the specification:
+
+### Layer 1: Cryptographic verification (local, deterministic)
+
+A verifying client (browser extension, crawler, library) performs these steps **locally**, with no network calls beyond the key resolution step:
 
 1. **Discover** `` elements in the page DOM
 2. **Read** the `signature`, `keyid`, `algorithm`, and `content-hash` attributes
-3. **Fetch** the author's public key from the URL in `keyid`
-4. **Canonicalize** the adjacent or wrapped content and compute its SHA-256 hash
-5. **Compare** the computed hash with `content-hash` (integrity check)
-6. **Verify** the cryptographic signature against the public key (authenticity check)
-7. **Optionally** query a trust directory for the author's reputation and endorsements
+3. **Resolve** the `keyid` to a public key. The `keyid` may be a DID (e.g., `did:web:author.example`), a direct URL to a public key JSON document, or a trust directory reference. Implementations MUST accept multiple resolution methods.
+4. **Canonicalize** the inner text content per the rules above and compute its hash
+5. **Compare** the computed hash with the `content-hash` attribute (content integrity check)
+6. **Compute** the `claims-hash` from the canonical serialization of inner `` claim elements
+7. **Construct** the binding string `{content-hash}:{claims-hash}:{domain}:{signed-at}`
+8. **Verify** the cryptographic signature over the binding string using the resolved public key and the declared `algorithm`
+
+This layer produces a deterministic yes/no result: either the signature is cryptographically valid or it is not. No server or directory is required for this step beyond whatever key resolution demands.
+
+### Layer 2: Trust decision (client policy)
+
+Given a cryptographically valid signature, the client then applies the **user's trust policy** to decide how to present the content. This layer is entirely client-side and may draw on:
+
+- A personal list of trusted keyids (option A)
+- Trusted origin domains (option B)
+- Endorsements from designated third parties (fetched from trust directories and independently verified)
+- Reputation scores from one or more user-selected trust directories
+- Local or cached revocation state
+- Any combination of the above, weighted as the user configures
+
+The output is a trust score or ranking, **not** a binary verdict. User interfaces SHOULD present the outcome as a graduated signal (for example a red/yellow/green score) with hover or detail views exposing which inputs contributed to the final score.
+
+### Optional directory queries
+
+In addition to the two layers above, a client MAY query one or more trust directories for:
+
+- Author reputation (signer-level trust, ongoing curatorial opinion)
+- Content endorsements (point-in-time attestations from third parties)
+- Key revocation and reports
+
+These queries enrich the trust decision but are never required for signature verification itself.
 
 ## Multiple Signatures

From 047a15b574291cd1e307eef457185906b461e7ef Mon Sep 17 00:00:00 2001
From: Jason Grey
Date: Tue, 28 Apr 2026 23:12:59 -0500
Subject: [PATCH 3/3] =?UTF-8?q?docs(html-protocol):=20align=20with=20spec?=
 =?UTF-8?q?=20=C2=A72.1,=20=C2=A72.2,=20and=20add=20resolver=20section?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Brings docs/html-protocol.md fully into line with the amended spec
(htmltrust.tex §2.1, §2.2, §2.5):

- Add an "Identity and Key Resolution" section mirroring spec §2.2: the
  three pluggable resolvers (DID, direct URL, trust directory), none of
  which is privileged by the protocol.
- Add a "Canonical Content Extraction" section mirroring spec §2.1:
  explicit two-stage pipeline (HTML extraction → text normalization),
  enumerating excluded elements, block-element boundary handling, and
  the six text-normalization phases.
- Restate text-only scoping with explicit "what is NOT covered by the
  hash" callout tying back to the existing Text-only scope discussion.
- Tighten the Required Attributes table: spec §2.1 ordering (keyid,
  signature, content-hash, algorithm) with clearer language pointing at
  the new resolver and binding sections.
- Add an explicit "Optional Attributes" subsection: there are none on
  the wrapper itself in this revision; claim metadata goes in inner .
  style/class belong in user-agent CSS, not inline on the wrapper
  (resolves a P1 from TODO-Cleanup.md).
- Document algorithm default: ed25519 when omitted, though producers
  SHOULD always emit it explicitly.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/html-protocol.md | 68 ++++++++++++++++++++++++++++++++++---------
 1 file changed, 54 insertions(+), 14 deletions(-)

diff --git a/docs/html-protocol.md b/docs/html-protocol.md
index 2b93022..79649ef 100644
--- a/docs/html-protocol.md
+++ b/docs/html-protocol.md
@@ -10,18 +10,26 @@ Signed content uses the `` custom HTML element, as defined in th
 
 ### Required Attributes
 
+Per spec §2.1, the wrapper element carries exactly four required attributes:
+
 | Attribute | Description | Example |
 |---|---|---|
-| `signature` | Base64-encoded cryptographic signature of the content hash + domain + author ID | `signature="aBcDeF123..."` |
-| `keyid` | URL where the author's public key can be fetched, or a DID | `keyid="https://api.example.com/authors/123/public-key"` |
-| `algorithm` | Cryptographic algorithm used for the signature | `algorithm="ed25519"` |
-| `content-hash` | Hash of the canonicalized content, prefixed with the algorithm | `content-hash="sha256:abc123def456..."` |
+| `keyid` | Identifies the signer; resolved per the rules in **Identity and Key Resolution** below. May be a DID, a direct URL to a public key document, or a trust-directory reference. | `keyid="did:web:author.example"` |
+| `signature` | Base64-encoded (unpadded) cryptographic signature over the canonical binding string defined in **Signature Data Format** | `signature="aBcDeF123..."` |
+| `content-hash` | Hash of the canonicalized text content, prefixed with the hash algorithm | `content-hash="sha256:abc123def456..."` |
+| `algorithm` | Signature algorithm. Required by the spec; implementations MAY default to `ed25519` when the attribute is omitted, but producers SHOULD always emit it explicitly. | `algorithm="ed25519"` |
+
+### Optional Attributes
+
+There are **no** optional attributes on the `` wrapper itself in this revision. All claim and contextual metadata (author name, signed-at timestamp, license, content type, AI assistance, etc.) belongs in inner `` elements as documented under **Inner Metadata** below. This keeps the wrapper's attribute surface narrow and easy to validate.
+
+Presentational attributes such as `style` and `class` SHOULD NOT be set inline on ``. Styling is the user agent's responsibility (see the **CSS** section at the bottom of this document); inline presentational attributes mix concerns and are unnecessary for protocol conformance.
 
 ### Supported Algorithms
 
 | Value | Description |
 |---|---|
-| `ed25519` | Ed25519 (recommended) |
+| `ed25519` | Ed25519 (recommended; the default if the `algorithm` attribute is omitted) |
 | `rsa` | RSA with SHA-256 |
 | `ecdsa` | ECDSA with secp256k1 |
 
@@ -92,18 +100,50 @@ Or appear as a **standalone marker** alongside content (e.g., when added by a CM
 
 Both forms are valid. Verifying clients should handle either case.
 
-## Content Canonicalization
+## Identity and Key Resolution
+
+The `keyid` attribute identifies the signer but the resolution mechanism is deliberately **pluggable** (spec §2.2). Implementations MUST accept multiple resolution methods and SHOULD treat none as canonical or privileged. Three forms are defined:
+
+| Form | Example | How it resolves |
+|---|---|---|
+| **Decentralized Identifier (DID)** | `did:web:author.example` | The user agent fetches the DID document at the author's origin (`https://author.example/.well-known/did.json` for `did:web`) and extracts the public key. Places no dependency on any third party. |
+| **Direct URL to a public key** | `https://author.example/key.json` | The user agent fetches the URL and parses the response as either a JSON `{ publicKey, algorithm }` object or raw PEM. Simple to host as a static file with no extra tooling. |
+| **Trust directory reference** | `https://directory.example/keys/abc123` | The user agent fetches the URL from a federated trust directory acting as a convenience key registry. Useful for less-technical authors who prefer a sign-up workflow over self-hosted identity publication. |
+
+**No resolver is privileged by the protocol.** Authors freely choose a resolution mechanism, and verifiers freely choose which methods they accept. Verifiers typically compose the three resolvers as a fallback chain in whatever order suits their threat model. The `keyid` is opaque to the signature protocol itself; only the resolved public key matters for cryptographic verification, and that verification is a local operation in the user agent that never requires contacting a directory.
+
+User agents MAY cache resolved keys (with appropriate freshness and revocation handling) so that signature verification scales to pages with many signed sections without repeated network calls.
+
+## Canonical Content Extraction
+
+The hash that the signature covers is taken from the **text content** of the signed region, after the extraction and normalization process described below (spec §2.1). This is performed in two stages: HTML extraction, then text normalization.
+
+### Stage 1: HTML extraction
+
+Given the inner contents of a `` element:
+
+1. **Strip excluded elements** entirely, including their text content: `