RDoc-3786 Add strict canonical verification #2404

Open
poissoncorp wants to merge 11 commits into ravendb:main from poissoncorp:RDoc-3786-2

@poissoncorp
Contributor

Issue link

RDoc-3786 Rewrite canonical URLs between versions if there's an entry in redirects

Additional description

(didn't use ai to write it, so don't scan - read 😂)

This PR guarantees that no <link rel="canonical"> on any page points to a 404.
It introduces the DOCUSAURUS_STRICT_CANONICALS env flag: when it is enabled, the build fails on any invalid redirect or canonical mismatch.
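
A minimal sketch of how such a strict gate could behave. The `CanonicalIssue` shape and the `isStrict` / `reportIssues` names are hypothetical and not taken from the plugin; the only detail from this PR is that the flag is enabled with the value `true`.

```typescript
// Hypothetical issue shape; the real plugin's types may differ.
interface CanonicalIssue {
  page: string;
  message: string;
}

// Assumption: only the literal string "true" enables strict mode.
function isStrict(flagValue: string | undefined): boolean {
  return flagValue === "true";
}

// In strict mode any issue aborts the build by throwing;
// otherwise the issues are only passed to the warn callback.
function reportIssues(
  issues: CanonicalIssue[],
  strict: boolean,
  warn: (msg: string) => void = console.warn,
): void {
  if (issues.length === 0) return;
  const summary = issues.map((i) => `${i.page}: ${i.message}`).join("\n");
  if (strict) {
    throw new Error(`Canonical verification failed:\n${summary}`);
  }
  warn(summary);
}
```

In a Node build script this would be called as `reportIssues(issues, isStrict(process.env.DOCUSAURUS_STRICT_CANONICALS))`.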

New features:

  • Canonical redirects plugin - rewrites canonicals in every emitted HTML file using the redirect map from scripts/redirects.json, then verifies each rewritten canonical against the actual docs structure
  • npm run validate-redirects CLI - validates the redirects.json file: checks that the schema is correct and validates the data. Runnable without a full build, and added as a CI step
  • CloudFront function (scripts/handle_redirects.js) now handles redirect hops, keeping it in step with the plugin's validation. Added parity tests and unit tests that check the script works as expected and handles the corner cases before pushing to prod
  • Legacy versions - unified the logic for skipping legacy (EOL) docs versions (exclude from sitemap, block robots, etc.) within version-policy.js, which is the single source of truth. Whenever a RavenDB version becomes legacy, adding it to this list cascades all necessary SEO ops.
  • templates-noindex-plugin which hides /templates/ from search engines
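
As an illustration of hiding pages from search engines, here is a hedged sketch of an idempotent robots meta injection. The function names and the regex-based approach are assumptions for illustration; the actual plugins may work differently.

```typescript
// Inserts a robots meta tag right after the <head ...> opener,
// unless a <meta name="robots"> tag is already present (idempotent).
// Limitation of this sketch: a tag written with content= before name=
// is not detected by the check below.
function injectRobotsMeta(html: string, content: string): string {
  if (/<meta\s+name=["']robots["']/i.test(html)) return html;
  const tag = `<meta name="robots" content="${content}">`;
  // Matches <head> or <head attr="..."> but not <header>.
  return html.replace(/<head(\s[^>]*)?>/i, (open) => `${open}${tag}`);
}

// Assumption: a page is a template page when its route is under /templates.
function isTemplateRoute(route: string): boolean {
  return route === "/templates" || route.startsWith("/templates/");
}
```

Running `injectRobotsMeta` twice on the same page leaves the output of the first run unchanged, which is the property that makes it safe to apply on every build.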

Redirects JSON fixes:

  • The verify flow found 26 missing redirects for moved docs that caused 404 canonicals - fixed by backfilling the redirects map
  • Fixed one malformed pre-existing entry (encrypted-backup)

So, new infra consists of:

  • 2 plugins - for rewriting canonicals and hiding templates, as mentioned above
  • handle_redirects as it was before, now just tested before pushing to AWS :)
  • validate-redirects CI tool
  • version-policy.js as mentioned to control the legacy docs ops
  • generate-robots.js renders robots.txt from template on every build, keeping legacy disallows in sync
  • split-sitemap ported to TypeScript to make it testable

Sprinkled it with docs for agents, added SVGs with UML diagrams of how the build flow works and how redirect chains are resolved (the awkward situation when we moved a document more than once between versions).
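
To make the chain case concrete, here is a rough sketch of resolving a document that moved more than once. The entry shape beyond `targetUrl` and `minimumVersion`, and the exact meaning of the `minimumVersion` gate (an entry applies only from that version onwards), are assumptions for illustration rather than the plugin's confirmed behaviour.

```typescript
// Hypothetical entry shape: only targetUrl and minimumVersion are named in this PR.
interface RedirectEntry {
  targetUrl: string;
  minimumVersion?: string;
}
type RedirectMap = Record<string, RedirectEntry>;

// Compares dotted versions numerically, so "7.10" sorts after "7.2".
function compareVersions(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff < 0 ? -1 : 1;
  }
  return 0;
}

// Follows redirects until no entry applies. Assumes the map was already
// validated as cycle-free, so no visited set is kept here.
function resolveChain(path: string, version: string, redirects: RedirectMap): string {
  let current = path;
  for (;;) {
    const entry = redirects[current];
    if (entry === undefined) return current;
    // Assumed gate: skip entries whose minimumVersion is newer than the requested version.
    if (
      entry.minimumVersion !== undefined &&
      compareVersions(version, entry.minimumVersion) < 0
    ) {
      return current;
    }
    current = entry.targetUrl;
  }
}
```

With a map where `/a` moved to `/b` in 6.0 and `/b` moved to `/c` in 7.2, a 7.2 request for `/a` resolves straight to `/c`, while a 7.0 request stops at `/b`.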

Once merged, let's remember to set strict canonicals env on CD.

Type of change

  • Content - docs
  • Content - cloud
  • Content - guides
  • Content - start pages/other
  • New docs feature (consider updating /templates or readme)
  • Bug fix
  • Optimization
  • Other

Changes in docs URLs

  • No changes in docs URLs
  • Articles are restructured, URLs will change, mapping is required (update /scripts/redirects.json file, set Documents Moved PR label)

Changes in UX/UI

  • No changes in UX/UI
  • Changes in UX/UI (include screenshots and description)

Comment thread src/plugins/templates-noindex-plugin/index.ts Outdated
Comment thread scripts/lib/version-policy.js
poissoncorp added 11 commits May 1, 2026 01:50
Introduce the canonical-redirects Docusaurus plugin. loadContent reads
scripts/redirects.json, validates schema + cycles, and builds the redirect
map. postBuild walks the emitted HTML, rewrites every <link rel="canonical">
to the current-version equivalent (legacy versions get a self-canonical), and
verifies each rewritten canonical against the Docusaurus route universe.

- CLI: `npm run validate-redirects` (scripts/validate-redirects.ts) runs the
  same schema + cycle checks standalone.
- CI: DOCUSAURUS_STRICT_CANONICALS=true gates strict builds in build-on-pr.yml.
- Handles both pretty-printed and minified HTML output from Docusaurus.
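
A hedged sketch of a canonical rewrite that tolerates both minified and pretty-printed markup. The function name, the regex-based approach, and the `example.test` host in the usage note are illustrative assumptions; the plugin may parse HTML differently.

```typescript
// Matches every <link ...> tag; [^>]* also spans newlines, so a tag that is
// pretty-printed across several lines is matched the same way as a minified one.
const LINK_TAG = /<link\b[^>]*>/gi;

// Rewrites the href of each <link rel="canonical"> through mapUrl and
// reports whether any canonical link was found at all.
function rewriteCanonical(
  html: string,
  mapUrl: (href: string) => string,
): { html: string; found: boolean } {
  let found = false;
  const out = html.replace(LINK_TAG, (tag) => {
    if (!/\brel\s*=\s*["']canonical["']/i.test(tag)) return tag;
    found = true;
    // Attribute order does not matter because href is located inside the matched tag.
    return tag.replace(
      /(\bhref\s*=\s*["'])([^"']*)(["'])/i,
      (_m, pre: string, href: string, post: string) => `${pre}${mapUrl(href)}${post}`,
    );
  });
  return { html: out, found };
}
```

For example, `rewriteCanonical(html, (href) => href.replace("/7.0/", "/7.2/"))` would point a `https://example.test/7.0/a` canonical at `/7.2/a`. A `found: false` result corresponds to the missing-canonical case that a verifier would report.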
…s; hide templates

Move version handling into a single source of truth and generate SEO assets
from it.

- scripts/lib/version-policy.js exports CURRENT_VERSION + LEGACY_VERSIONS.
  docusaurus.config.ts, the canonical rewriter, the edge handler, and the
  generators all import from here.
- scripts/generate-robots.js renders scripts/robots-templates/*.template.txt
  with the current legacy list, writing build/robots.txt on every build.
- scripts/split-sitemap.ts replaces split-sitemap.js; core logic lives in
  src/lib/split-sitemap with unit tests.
- src/plugins/templates-noindex-plugin injects noindex,nofollow into
  /templates/* so doc-authoring scaffolding doesn't surface in search.
- 6.0 and 7.0 are marked legacy in this commit's version-policy update.
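
The single-source-of-truth idea can be sketched as follows: the legacy list drives the generated robots.txt. The `{{LEGACY_DISALLOWS}}` placeholder, the `/<version>/` path layout, and the `renderRobots` name are hypothetical; only the 6.0 and 7.0 legacy entries come from this commit, and the real list may be longer.

```typescript
// Illustrative subset: this commit marks 6.0 and 7.0 as legacy.
const LEGACY_VERSIONS: readonly string[] = ["6.0", "7.0"];

// Hypothetical placeholder token; the real template syntax is not shown in this PR.
const PLACEHOLDER = "{{LEGACY_DISALLOWS}}";

// Renders one Disallow line per legacy version into the template.
// Assumption: versioned docs live under /<version>/.
function renderRobots(template: string, legacyVersions: readonly string[]): string {
  const disallows = legacyVersions.map((v) => `Disallow: /${v}/`).join("\n");
  // A replacer function avoids special handling of "$" sequences in the replacement.
  return template.replace(PLACEHOLDER, () => disallows);
}
```

Because the output is regenerated from the list on every build, marking another version as legacy only requires touching the list.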
…uild time

Make the edge function and build-time resolver equivalent, and move cycle
detection upstream.

- scripts/handle_redirects.js gains a bounded chain-collapse loop so N-hop
  chains become exactly one 301 at the edge.
- CloudFront Functions runtime is single-file and can't resolve
  project-local imports, so two values are inlined in handle_redirects.js
  and guarded by parity tests against drift: compareVersions (mirror of the
  plugin's TS copy at lib/compare-versions.ts) and the CURRENT_VERSION
  literal (mirror of scripts/lib/version-policy.js). The parity test at
  __tests__/compare-versions-parity.test.ts reads handle_redirects.js as
  text, extracts each mirrored value, and asserts behavioural / literal
  equality with the authoritative source.
- validateNoCycles runs as part of npm run validate-redirects and the
  plugin's loadContent, making cycles unreachable at runtime.
- With cycles impossible upstream, both resolveChain and the edge chain loop
  drop their runtime visited set.
- __tests__/handle-redirects.test.ts covers static-asset pass-through, the
  /templates + /guides + /cloud branches, versioned / versionless URIs,
  minimumVersion gating, and chain collapse.
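
Moving cycle detection upstream could look roughly like the sketch below. `findCycle` is a hypothetical name and this is not the actual `validateNoCycles` implementation; it ignores `minimumVersion` gating, which makes it stricter than necessary but safe.

```typescript
type RedirectMap = Record<string, { targetUrl: string }>;

// Returns the first cycle found as a list of keys (start repeated at the end),
// or null when every chain terminates.
function findCycle(redirects: RedirectMap): string[] | null {
  // Each key has exactly one outgoing edge, so a simple walk per key is enough.
  const next = new Map<string, string>();
  for (const key of Object.keys(redirects)) next.set(key, redirects[key].targetUrl);

  const done = new Set<string>(); // nodes already proven to terminate
  for (const start of Object.keys(redirects)) {
    const path: string[] = [];
    const onPath = new Set<string>();
    let current: string | undefined = start;
    while (current !== undefined && !done.has(current)) {
      if (onPath.has(current)) {
        // Revisited a node on the current walk: report the loop portion.
        return [...path.slice(path.indexOf(current)), current];
      }
      onPath.add(current);
      path.push(current);
      current = next.get(current);
    }
    for (const node of path) done.add(node);
  }
  return null;
}
```

Once a check like this passes at validation time, a resolver that simply follows `targetUrl` is guaranteed to terminate, which is why the runtime visited set can be dropped.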
…ckup target

Backfill 26 redirects the strict verifier surfaced the moment it was turned
on:

- 19 from the initial strict-mode pass (pages moved or renamed in 7.2 without
  accompanying redirects.json updates).
- 7 more from the compare-exchange and Studio revisions restructures.
- One pre-existing malformed target (/encrypted-backup missing the leading
  slash).

CI is flipped to DOCUSAURUS_STRICT_CANONICALS=true in the same commit — the
gate can't come on before the data is complete.
…ble verifier errors

Tighten the schema and polish the failure-mode surface.

- validateRedirects now requires minimumVersion on versioned (docs-area)
  keys; /guides and /cloud are versionless content areas so they're exempt.
  Stray minimumVersion on versionless entries isn't flagged — PR review
  catches it.
- validateTargetsExist checks each targetUrl resolves to a real .md / .mdx
  file (or an index.* under a directory). Bare directories backed only by
  _category_.json are rejected — redirects should always land on a concrete
  page.
- Legacy-version pages get <meta name="robots" content="noindex,follow">
  injected idempotently in addition to the self-canonical, so search engines
  don't keep old pages in the index even when a crawler follows an inbound
  link directly.
- Verifier errors ship a fix: block with a ready-to-paste redirects.json
  entry and the npm run validate-redirects command, collapsing the
  diagnose → fix loop.
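
The tightened schema rules can be sketched as a per-entry check. This is a simplified illustration, not the real `validateRedirects`: the entry shape, the assumption that keys start with their content-area prefix, and the error wording are all hypothetical.

```typescript
// Hypothetical entry shape: only targetUrl and minimumVersion are named in this PR.
interface RedirectEntry {
  targetUrl: string;
  minimumVersion?: string;
}

// /guides and /cloud are versionless content areas per this commit.
const VERSIONLESS_PREFIXES = ["/guides", "/cloud"];

// Assumption: a key belongs to an area when it equals the prefix or sits under it.
function isVersionless(key: string): boolean {
  return VERSIONLESS_PREFIXES.some((p) => key === p || key.startsWith(`${p}/`));
}

// Returns human-readable errors for one entry; an empty array means the entry passes.
function validateEntry(key: string, entry: RedirectEntry): string[] {
  const errors: string[] = [];
  if (!entry.targetUrl.startsWith("/")) {
    errors.push(`${key}: targetUrl "${entry.targetUrl}" must start with "/"`);
  }
  // Versioned (docs-area) keys must declare minimumVersion.
  // A stray minimumVersion on a versionless key is deliberately not flagged.
  if (!isVersionless(key) && entry.minimumVersion === undefined) {
    errors.push(`${key}: minimumVersion is required for versioned keys`);
  }
  return errors;
}
```

A target written without its leading slash, like the malformed encrypted-backup entry fixed earlier, would be caught by the first check.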
…iagrams

Add the plugin's public README with two pre-rendered SVG diagrams:

- data-flow.svg — the loadContent + postBuild sequence from routesPaths to
  canonical rewrite to verification.
- resolve-chain.svg — the resolveChain flowchart (gate check → terminal
  return, no runtime cycle guard).

Source .mmd files live alongside the SVGs for regeneration via
@mermaid-js/mermaid-cli. Prettier formatting is applied across the plugin
source tree so everything ships consistent.

Missing <link rel="canonical"> on a versioned page previously only warned
— it wasn't added to the verifier's issues[] and so strict mode wouldn't
fail. Merge it into the unified issues pipeline so every canonical problem
surfaces through the same gate. README failure-mode bullet updated to
match.

Auto-fix 40 curly-brace and unused-eslint-disable warnings across the
touched plugin + split-sitemap sources. For the 4 no-console warnings in
build-time plugin loggers, swap console.log/warn for @docusaurus/logger —
the idiomatic Docusaurus plugin logger writes through process.stdout /
stderr with colored level prefixes and doesn't trip no-console.