Refactor HTML-to-Markdown converter to eliminate navigation chrome and noise artifacts, improving content quality for AI consumption.
✅ Deploy Preview for bitcoin-design-site ready!
Pull request overview
This PR refactors the Netlify Edge HTML-to-Markdown conversion to reduce non-content “chrome” in the generated markdown and improve the content signal for AI/agent consumption.
Changes:
- Introduces a shared selector list to prune common non-content elements (nav/header/footer, sidebars, anchors, etc.) before conversion.
- Updates conversion to render only a selected “primary content” root (e.g., article/main) instead of always converting the full <body>.
- Improves the fallback (regex-based) converter by stripping more noise and refining link/image rendering.
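A minimal sketch of the prune-then-select idea summarized in the list above, not the PR's actual code. The selector lists, the `ElementLike`/`DocumentLike` types, and the function name `selectContentRoot` are all hypothetical, chosen so the example does not depend on any particular DOM library.

```typescript
// Minimal structural types (assumptions) so the sketch works with any DOM-like parser.
interface ElementLike {
  remove(): void;
}
interface DocumentLike<E extends ElementLike> {
  querySelector(selector: string): E | null;
  querySelectorAll(selector: string): ArrayLike<E>;
}

// Hypothetical lists; the selectors used in the PR may differ.
const NOISE_SELECTORS = ["nav", "header", "footer", "aside", "script", "style"];
const ROOT_CANDIDATES = ["article", "main", "[role='main']", "body"];

function selectContentRoot<E extends ElementLike>(doc: DocumentLike<E>): E | null {
  // 1. Prune non-content elements before choosing a root.
  for (const el of Array.from(doc.querySelectorAll(NOISE_SELECTORS.join(",")))) {
    el.remove();
  }
  // 2. Return the first candidate that exists, most specific first.
  for (const selector of ROOT_CANDIDATES) {
    const root = doc.querySelector(selector);
    if (root) return root;
  }
  return null;
}
```

Keeping `body` as the last candidate means pages without an article or main element still convert, only with less narrowing.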
this is about over my head. i'm not confident i can give an accurate / valuable approval / review.
@swedishfrenchpress all good. Interestingly, the best entities to give feedback on this PR are AI agents, since it's about improving how they can understand the site content.
It's probably not over your agent's head |
tl;dr - this is a huge improvement, though there are edge cases you could defend against
My first reaction was "LGTM".
I asked BOLTy, and he sees it as a clear improvement over the one before.
However, there are some edge cases where the stripping might be too aggressive. For example: HTML code samples, inline SVG, and pages that don't follow the structured article format.
I don't actually think the guide contains any of those three things, so I think you're good to go. However, if you want to be cautious in case such things appear in the guide in the future, you could feed the feedback below into your coding agent to have it fixed.
BOLTy's Feedback
This is a clear improvement for agent-readability. I compared the deploy preview against live, including https://bitcoin.design/guide/daily-spending-wallet/, and the preview is much better at surfacing actual page content instead of nav/sidebar noise.
A few things worth feeding back to the agent before merge:
Fallback path may break HTML code samples
In the regex fallback flow, noise stripping happens before <pre> blocks are protected. If a guide page includes HTML examples with tags like <nav>, <header>, or <footer>, those examples could get mangled.
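One way to address this, sketched under assumptions: mask `<pre>` blocks with placeholders before the noise regexes run, then restore them afterwards. The function name `stripNoisePreservingPre`, the tag list, and the placeholder format are hypothetical and not taken from the PR.

```typescript
// Hypothetical tag list; the PR's fallback may strip a different set.
const FALLBACK_NOISE_TAGS = ["nav", "header", "footer"];

function stripNoisePreservingPre(html: string): string {
  const savedBlocks: string[] = [];

  // 1. Swap every <pre>...</pre> block for a placeholder so later regexes cannot touch it.
  //    Assumes the NUL-delimited placeholder never occurs in real page content.
  const masked = html.replace(/<pre\b[^>]*>[\s\S]*?<\/pre>/gi, (block) => {
    savedBlocks.push(block);
    return `\u0000PRE${savedBlocks.length - 1}\u0000`;
  });

  // 2. Strip noise elements from the masked markup.
  let stripped = masked;
  for (const tag of FALLBACK_NOISE_TAGS) {
    const pattern = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?<\\/${tag}>`, "gi");
    stripped = stripped.replace(pattern, "");
  }

  // 3. Put the original <pre> blocks back, untouched.
  return stripped.replace(/\u0000PRE(\d+)\u0000/g, (_match, index) => savedBlocks[Number(index)] ?? "");
}
```

A `<pre>` block that sits inside a stripped element disappears along with its placeholder, which matches the intent of removing that chrome.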
SVG stripping may be overly aggressive
Removing inline <svg> is probably fine for chrome/icons, but it could drop meaningful content if any pages use inline SVG diagrams or illustrations. Worth checking a few representative pages.
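A possible middle ground, offered as a rough sketch rather than what the PR does: drop SVGs that look decorative, and replace labelled ones with a short text placeholder so the diagram's meaning is not lost entirely. The function name `replaceInlineSvg` and the `[Figure: ...]` placeholder format are assumptions.

```typescript
function replaceInlineSvg(html: string): string {
  // Non-greedy match; nested <svg> elements are not handled by this sketch.
  return html.replace(/<svg\b([^>]*)>([\s\S]*?)<\/svg>/gi, (_match: string, attrs: string, inner: string) => {
    // Icons are commonly marked aria-hidden="true"; treat those as chrome.
    if (/aria-hidden\s*=\s*["']true["']/i.test(attrs)) return "";

    // Prefer an aria-label, then fall back to a <title> child.
    const ariaLabel = /aria-label\s*=\s*"([^"]*)"/i.exec(attrs)?.[1];
    const title = /<title\b[^>]*>([\s\S]*?)<\/title>/i.exec(inner)?.[1];
    const label = (ariaLabel ?? title ?? "").trim();

    // Unlabelled SVGs are dropped, as in the stripping behavior described above.
    return label ? `<p>[Figure: ${label}]</p>` : "";
  });
}
```

This only helps if meaningful diagrams carry a label or title, which is itself worth confirming on representative pages.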
Content-root selection should be spot-checked on different page types
The new root-selection logic is directionally good, but it’s opinionated. I’d sanity-check homepage, guide article pages, and section/index pages to make sure it’s not narrowing too aggressively or skipping useful content.
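The spot check could be partly automated with a crude heuristic like the one below, which flags converted pages that come out suspiciously short or without any heading. The `PageSample` shape, the function name `flagSuspiciousPages`, and the 200-character threshold are all assumptions chosen for illustration.

```typescript
interface PageSample {
  path: string;     // e.g. "/" or a guide article path
  markdown: string; // converter output for that page
}

function flagSuspiciousPages(samples: PageSample[], minChars = 200): string[] {
  return samples
    .filter((sample) => {
      const text = sample.markdown.trim();
      // A Markdown ATX heading anywhere in the output.
      const hasHeading = /^#{1,6}\s+\S/m.test(text);
      // Very short output or no heading suggests the root was narrowed too far.
      return text.length < minChars || !hasHeading;
    })
    .map((sample) => sample.path);
}
```

Feeding it one homepage, one guide article, and one section/index page would cover the page types listed above; flagged paths still need a manual look.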
Overall: strong improvement, just worth validating those edge cases.
Refactor HTML-to-Markdown converter to eliminate navigation chrome and noise artifacts, improving content quality for AI consumption.
Based on BOLTy's feedback in this comment.