Skip to content

Replace shell-based process runner with open3_safe#9

Open
vandric wants to merge 1 commit into
masterfrom
vandric-use-open3safe
Open

Replace shell-based process runner with open3_safe#9
vandric wants to merge 1 commit into
masterfrom
vandric-use-open3safe

Conversation

@vandric
Copy link
Copy Markdown

@vandric vandric commented Apr 20, 2026

Summary

  • Replaces ExternalProcess#run's bash-pipe + system timeout/gtimeout binary approach with Open3Safe.capture3_safe, which handles stdout/stderr capture, timeout, and RSS-based memory limiting without spawning an intermediate shell
  • Adds max_rss: keyword argument throughout the public API (extract_text_with_timeouts, extract_images_with_timeouts) and propagates it down through TextExtractor, ImageExtractor, and into each run call
  • Converts env parameter from a raw string (e.g. "MAGICK_TMPDIR=... OMP_NUM_THREADS=2") to a proper Hash as required by Open3
  • Removes 2>&1 redirects from all command strings — stderr is now captured separately and merged with stdout after the fact, with the same blank-line and consecutive-duplicate filtering as before (guards against memory bloat from corrupt PDF warning floods — see silverfin/issues/1998)
  • Adds open3_safe gem dependency (Gemfile, Gemfile.lock, docsplit.gemspec)

Test plan

  • Run existing text extraction against a known-good PDF and verify output is unchanged
  • Run image extraction and verify pages are rasterized correctly
  • Trigger a timeout scenario (e.g. with a very low timeout value) and confirm TimeoutError is raised
  • Trigger an OOM scenario (low max_rss: value) and confirm TimeoutError is raised
  • Feed a corrupt PDF that generates repetitive warnings and confirm no memory bloat
  • Verify pdffonts path in contains_text? works correctly end-to-end

🤖 Generated with Claude Code

@vandric vandric force-pushed the vandric-use-open3safe branch 4 times, most recently from c7e7eac to 0a81aca Compare April 27, 2026 07:07
@vandric vandric force-pushed the vandric-use-open3safe branch 2 times, most recently from e5b63ee to 59c73e6 Compare May 6, 2026 09:10
Comment thread lib/docsplit/external_process.rb
Replaces the bash pipe + system timeout binary approach in ExternalProcess#run
with Open3Safe.capture3_safe, which provides proper stdout/stderr capture,
timeout handling, and RSS-based memory limiting without spawning a shell.

- Add open3_safe gem dependency
- Propagate max_rss: keyword arg through public API and all extractors
- Convert env strings to Hashes for Open3 compatibility
- Remove 2>&1 redirects (stderr now captured separately)
- Duplicate/blank-line filtering preserved in the new implementation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@vandric vandric force-pushed the vandric-use-open3safe branch from 59c73e6 to 7878767 Compare May 12, 2026 13:32
@peter-toth-silverfin
Copy link
Copy Markdown

Is the "Test plan" in the desc for me to test or it's more for AI?

@vandric
Copy link
Copy Markdown
Author

vandric commented May 12, 2026

Is the "Test plan" in the desc for me to test or it's more for AI?

From AI for AI, we will test it in Silverfin.

@peter-toth-silverfin
Copy link
Copy Markdown

PR is Approved. (didn't see anything on the UI for approval)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants