refactor(crawler): tighten CrawlerThread hot path #168

Open
marevol wants to merge 1 commit into master from refactor/crawler-thread-hot-path

marevol (Contributor) commented May 16, 2026

Summary

Reduce stream/lambda allocations on per-crawl hot paths and in small init code, following the same direction as fess#3134. No behavior changes.

Changes

  • CrawlerThread.storeChildUrls: Convert the stream().filter(...).map(...).collect(Collectors.toList()) pipeline into a single for loop. Pre-size both the dedup HashSet and the resulting ArrayList from childUrlList.size(). Read the URL once into a local variable instead of calling d.getUrl() three times. This method runs once per crawled page that yields children.
  • CrawlerClientCreator.register(...): Replace clientMap.entrySet().stream().forEach(...) and clientFactoryList.forEach(...) with enhanced for loops. Both register(...) methods are synchronized and run during client registration; the change only removes lambda/stream object allocations.
  • UrlFilterImpl.init: Replace cachedXxxSet.stream().collect(Collectors.toList()) with new ArrayList<>(set). Drops the java.util.stream.Collectors import.
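
To illustrate the storeChildUrls change, here is a minimal sketch of the same kind of conversion. It is not the code from this PR: the Child class, the dedup key (URL plus metadata), the "queued:" mapping, the Predicate-based URL filter, and the HashSet capacity formula are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class StoreChildUrlsSketch {

    /** Hypothetical stand-in for a child request: a URL plus metadata. */
    static final class Child {
        final String url;
        final String meta;

        Child(final String url, final String meta) {
            this.url = url;
            this.meta = meta;
        }
    }

    /** Stream form: drop blank, duplicate, and rejected URLs, then map. */
    static List<String> viaStream(final List<Child> children, final Predicate<String> urlFilter) {
        final Set<String> seen = new HashSet<>();
        return children.stream()
                .filter(c -> c.url != null && !c.url.trim().isEmpty()
                        && seen.add(c.url + "\n" + c.meta) // assumed dedup key
                        && urlFilter.test(c.url))
                .map(c -> "queued:" + c.url)
                .collect(Collectors.toList());
    }

    /** Loop form: same checks in the same order, with pre-sized collections. */
    static List<String> viaLoop(final List<Child> children, final Predicate<String> urlFilter) {
        final int size = children.size();
        // Capacity chosen so that `size` entries fit under the default 0.75
        // load factor without a rehash (an assumed sizing formula).
        final Set<String> seen = new HashSet<>(size * 4 / 3 + 1);
        final List<String> result = new ArrayList<>(size);
        for (final Child c : children) {
            final String url = c.url; // read once, reused in every check below
            if (url == null || url.trim().isEmpty()) {
                continue;
            }
            if (!seen.add(url + "\n" + c.meta)) {
                continue; // duplicate of an earlier child
            }
            if (!urlFilter.test(url)) {
                continue; // rejected by the URL filter
            }
            result.add("queued:" + url);
        }
        return result;
    }

    public static void main(final String[] args) {
        final List<Child> children = Arrays.asList(
                new Child("http://a/", "m"),
                new Child("", "m"),
                new Child("http://a/", "m"),
                new Child("http://b/x", "m"));
        final Predicate<String> onlyHostA = u -> u.startsWith("http://a");
        System.out.println(viaStream(children, onlyHostA));
        System.out.println(viaLoop(children, onlyHostA));
    }
}
```

Both methods apply the checks in the same order, so they produce the same list for the same input; the loop form avoids the stream pipeline and lambda objects and allocates its collections at their final size up front.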

What was intentionally NOT changed

  • The set/list types (LinkedHashSet, HashSet, ArrayList) are kept the same; only the way they are populated changes.
  • No public/protected method signatures change.
  • The redirect/HEAD logic in CrawlerThread is untouched.
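
As a small illustration of why changing only the population step keeps behavior the same, the sketch below compares the two ways of copying a set into a list. The class and method names are hypothetical, and the LinkedHashSet input is an assumption for the example. Both forms fill the list in the set's iteration order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class SetToListSketch {

    /** Previous style: copy the set through a stream pipeline. */
    static List<String> viaStream(final Set<String> set) {
        return set.stream().collect(Collectors.toList());
    }

    /** New style: the ArrayList copy constructor, in iterator order. */
    static List<String> viaCopyConstructor(final Set<String> set) {
        return new ArrayList<>(set);
    }

    public static void main(final String[] args) {
        // Insertion order is preserved by LinkedHashSet.
        final Set<String> patterns = new LinkedHashSet<>(Arrays.asList("c", "a", "b"));
        System.out.println(viaStream(patterns).equals(viaCopyConstructor(patterns)));
    }
}
```

The copy constructor also makes the resulting list type explicit, whereas Collectors.toList() makes no guarantee about the type or mutability of the list it returns.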

Test plan

  • mvn compile — succeeds
  • mvn test -Dtest=UrlFilterImplTest,CrawlerThreadTest — 34/34 pass (no CrawlerClientCreatorTest exists in this repo)
  • mvn formatter:format && mvn license:format — applied, clean
  • Manual crawl smoke test against a small site

🤖 Generated with Claude Code

Replace stream/lambda boilerplate with explicit iteration in places
called per crawl, reducing allocations and improving readability.

- CrawlerThread.storeChildUrls: convert filter/map/collect pipeline to
  a single for-loop; pre-size HashSet and ArrayList.
- CrawlerClientCreator.register: replace stream().forEach and
  Collection.forEach lambdas with enhanced for-loops.
- UrlFilterImpl.init: drop stream().collect(Collectors.toList()) in
  favor of new ArrayList<>(set).

No behavior changes; existing UrlFilterImplTest and CrawlerThreadTest
suites pass unchanged.