fix(doc-collector): two SPA-robustness fixes (recoverable nav error, page-state repopulation) #34
Open
gololdf1sh wants to merge 2 commits into
…overable
Playwright throws "page.content: Unable to retrieve content because the page is navigating and changing the content" on heavy SPAs whose client-side router rewrites the DOM mid-action (Ember, React Router, etc.). The explorer was catching only net::ERR_ABORTED / screenshot-timeout / waiting-for-fonts as recoverable; this new phrase fell through to FATAL_BROWSER_ERRORS and killed the whole crawl on the first navigation race.
Add the phrase to RECOVERABLE_NAVIGATION_ERRORS so the explorer re-queues the action instead of aborting.
Repro: collect docs against a Testomat.io page hosted in beta (Ember-based SPA). Without the fix, ~30% of pages fail with the fatal error on the first action. With the fix, those pages complete normally.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
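A minimal sketch of the classification this commit describes, for reviewers who want to see the shape of the change. The actual contents of RECOVERABLE_NAVIGATION_ERRORS are not shown in this PR text, so the pattern list below is an assumption that includes only the two phrases quoted verbatim above; isRecoverableNavigationError and withNavigationRetry are hypothetical helper names, and the explorer's real re-queue mechanism may differ from a simple retry loop.

```typescript
// Assumed shape of the recoverable-error list; the screenshot-timeout and
// font-wait patterns are omitted because their exact text is not in this PR.
const RECOVERABLE_NAVIGATION_ERRORS: RegExp[] = [
  /net::ERR_ABORTED/i,
  /navigating and changing the content/i,
];

// Hypothetical helper: true when the error message matches a recoverable pattern.
function isRecoverableNavigationError(error: unknown): boolean {
  const message = error instanceof Error ? error.message : String(error);
  return RECOVERABLE_NAVIGATION_ERRORS.some((pattern) => pattern.test(message));
}

// Hypothetical helper: re-run an action while it keeps failing with a
// recoverable error; any other error is rethrown immediately.
async function withNavigationRetry<T>(
  action: () => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await action();
    } catch (error) {
      if (!isRecoverableNavigationError(error)) throw error;
      lastError = error;
    }
  }
  throw lastError;
}
```

With a check like this, an action that races a client-side route change is retried, while errors that match no recoverable pattern still propagate and abort as before.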
…d it
After a navigation completes, ExplorBot's framenavigated handler
overwrites the full ActionResult (with html/links/aria) with a
stripped-down WebPageState that has only { url, title, statusCode }.
The doc-collector then reads getCurrentState() and gets a state with
state.html === undefined and state.links === [].
Consequences:
- Documentarian receives empty html -> page documentation degrades
to a near-empty stub.
- extractNextPaths() sees an empty links array -> the subtree crawl
stops at the entry page even when many followable links exist.
Two targeted fixes:
1. In the main collect loop, if state.html is falsy, force a
capturePageState (with screenshots if configured). This is cheap
compared to the AI documentation step that follows.
2. In extractNextPaths, if state.links is empty but state.html is
present, fall back to extractLinks(state.html) so subtree
traversal still finds child paths.
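The two fallbacks can be sketched roughly as follows. This is an illustration under assumptions, not the code in the commit: the PageState interface is a simplified stand-in for WebPageState, ensureFullState and linksForTraversal are hypothetical names, and extractLinks here is a naive regex-based stand-in for the project's own extractLinks, whose implementation is not shown.

```typescript
// Simplified stand-in for WebPageState (assumed fields only).
interface PageState {
  url: string;
  title?: string;
  statusCode?: number;
  html?: string;
  links?: string[];
}

// Naive stand-in for the project's extractLinks: collects href values
// from anchor tags with a regex. A real implementation would likely use a DOM parser.
function extractLinks(html: string): string[] {
  const hrefs: string[] = [];
  const pattern = /<a\b[^>]*\bhref\s*=\s*["']([^"']+)["']/gi;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    hrefs.push(match[1]);
  }
  return hrefs;
}

// Fix 1 (sketch): if the state was stripped of html, capture it again
// before handing it to the documentation step.
async function ensureFullState(
  state: PageState,
  capturePageState: () => Promise<PageState>,
): Promise<PageState> {
  return state.html ? state : capturePageState();
}

// Fix 2 (sketch): prefer state.links, but fall back to parsing state.html
// when the links array is missing or empty.
function linksForTraversal(state: PageState): string[] {
  if (state.links && state.links.length > 0) return state.links;
  return state.html ? extractLinks(state.html) : [];
}
```

Both helpers are no-ops on a fully populated state, so they only add work in the stripped-state case described above.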
Repro: collect against a Testomat.io project page. Before:
"Pages documented: 1". After: full subtree (3-7 pages depending
on the entry).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Two independent fixes that came out of running explorbot docs collect against a real-world SPA (Testomat.io beta). Each is in its own commit.
1. fix(explorer): treat "navigating and changing the content" as recoverable
Playwright throws "page.content: Unable to retrieve content because the page is navigating and changing the content" on heavy SPAs whose client router rewrites the DOM mid-action (Ember, React Router, etc.). The current regex covered net::ERR_ABORTED, screenshot timeout, and font-wait; this new phrase fell through to FATAL_BROWSER_ERRORS and killed the whole crawl on the first race. Added to RECOVERABLE_NAVIGATION_ERRORS so the explorer retries instead.
2. fix(doc-collector): repopulate page state when framenavigated stripped it
After navigation, the framenavigated handler overwrites the rich ActionResult (html / links / aria) with a stripped WebPageState carrying only { url, title, statusCode }. doc-collector then reads getCurrentState() and gets state.html === undefined, state.links === []. Two consequences:
- Documentarian receives empty html → page docs degrade to a near-empty stub.
- extractNextPaths sees an empty links array → subtree crawl stops at the entry page even when many followable links exist.
Targeted fixes:
- If state.html is falsy, force capturePageState before passing to the AI documenter.
- extractNextPaths: if state.links is empty but state.html is present, fall back to extractLinks(state.html).
Repro (combined effect)
Running explorbot docs collect /projects/{slug}/runs/{id} on Testomat.io beta: