code cleanup and reorg#116
Merged
Merged
Conversation
Phase 1 – Split CometDataInternal.h monolith (1554 lines) into three focused headers: core/Constants.h – all #define constants and DbType enum core/Params.h – parameter structs (StaticParams, Options, etc.) core/Types.h – runtime types (Query, Results, index structs, externs) CometDataInternal.h becomes a pure forwarding shim; no .cpp files changed. Phase 2 – Extract SearchMemoryPool class (threading/SearchMemoryPool.h/.cpp) Owns the duplicate-fragment scratch arrays previously managed via raw bool* statics in CometSearch. Uses its own mutex+CV, keeping the FASTA_DB SearchThreadProc availability array (_pbSearchMemoryPool) as a separate allocation. FI/PI paths call acquireSlot()/releaseSlot() directly. Phase 3 – IResultWriter interface + 5 adapter wrappers (output/) Replaces ~316-line if (bOutputXxx) dispatch chain in DoSearch() with a factory that builds vector<unique_ptr<IResultWriter>> and calls open(), write(), close() polymorphically. Each writer owns its file handle(s); close(bSucceeded, bEmpty) handles format footers, fclose, and remove-on- empty-search. SQT writer is inserted last to preserve the existing invariant that WriteSqt runs after all other formats. All 17 unit tests pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ble batch globals Introduces SearchSession struct (search/SearchSession.h) that owns all mutable per-run state for the batch path: queries, ms1Queries, and queriesMutex. Removes g_pvQuery and g_pvQueryMS1 global vectors from CometSearchManager.cpp and all extern declarations from core/Types.h. Key changes: - CometSearchManager: creates SearchSession per input-file iteration; wires queries/ms1Queries through all pipeline calls (LoadAndPreprocess, FusedLoadAndSearch, RunSearch, RunSpecLibSearch, PostAnalysis); sets wwctx.pQueries and woctx.bIdxNoFasta on writer contexts - CometPreprocess: threads SearchSession& through LoadAndPreprocessSpectra, FusedLoadAndSearchSpectra, FusedSearchSpectrum, PreprocessSpectrum, and PreprocessThreadData; uses session.queriesMutex (std::mutex) instead of g_pvQueryMutex for query-vector pushes - CometSearch: sets _pQueries member in SearchThreadProc; updates RunSpecLibSearch, RunMS1Search, SearchMS1Library, SearchPeptideIndex, BinarySearchMass, and all member functions to use _pQueries or explicit queries params instead of g_pvQuery/g_pvQueryMS1 - All writer classes (Txt, Sqt, PepXML, MzIdentML, Percolator) and CometMassSpecUtils/CometPostAnalysis accept const vector<Query*>& instead of reading g_pvQuery - CometFragmentIndex and CometPeptideIndex pass emptyQueries to index- creation RunSearch calls (bCreateFragmentIndex path skips query access) - All 17 unit tests pass Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…replace DoSearch per-file loop Extract FiStrategy, FastaStrategy, and PiStrategy as concrete ISearchStrategy implementations; Pipeline drives the per-file loop and writer lifecycle. DoSearch() reduces from ~700 lines to ~50: index-build early returns, strategy selection, writer factory, and pipeline.run(). SearchUtils.h adds inline utilities (GetInputType, UpdateInputFile, SetMSLevelFilter, AllocateResultsMem, comparators) shared across strategies without circular includes. Verified: 17/17 unit tests pass; HeLa FI_DB batch parity zero PSM diff at 1% and 5% FDR vs pre-Phase5 baseline (49,747 spectra, phospho + oxidation). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ctory sources CometSearch.vcxproj was missing all subdirectory sources added in Phases 2-5: - ClCompile: threading/SearchMemoryPool.cpp, search/Fasta/Fi/PiStrategy.cpp, search/Pipeline.cpp - ClInclude: core/, output/, threading/, search/ headers - AdditionalIncludeDirectories: add $(ProjectDir) so subdirectory .cpp files can find root-level headers (Common.h, CometDataInternal.h, etc.) MSToolkit/include/zconf.h was generated by ./configure on Linux, which hardcodes "#if 1 define Z_HAVE_UNISTD_H" unconditionally, triggering inclusion of <unistd.h> on Windows (C1083). Add !defined(_WIN32) guard so MSVC skips the POSIX include. Removed from .gitignore so this fix is retained. Verified: full Comet.sln Release x64 build succeeds (Comet.exe + CometWrapper.dll + RealtimeSearch.exe); HeLa FI_DB search produces 16,578 xcorr PSMs @ 1% FDR and 18,458 evalue PSMs @ 1% FDR (evalue count identical to Linux build; xcorr 19-PSM delta is floating-point margin noise at the cutoff boundary). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
zconf.h is generated by ./configure and should not be tracked. The project requires a clean between Linux and Windows builds when switching platforms. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add -Wno-unknown-pragmas, -Wno-char-subscripts, and -Wno-unused-result to both Makefiles. The last flag is needed because GCC 13 does not honor (void) casts for warn_unused_result functions such as fread/fgets. Source fixes across CometSearch and Comet.cpp: - CometInterfaces.h: [[maybe_unused]] on static _tp member - Unused ThreadPool* and other parameters anonymized with /*name*/ in Threading, CometSearch (x8), CometPreprocess (x2), CometFragmentIndex, CometSpecLib (x2), FastaStrategy - CometSearch.cpp SearchPeptideIndex: removed size_t tTmp declaration, changed all 7 fread assignments to (void)fread; removed dead pSearchThreadPool variable - CometSearchManager.cpp: removed AllocateResultsMemMS1() whose body was entirely commented out - CometFragmentIndex.cpp: removed 8 auto tClear timing variables whose printf lines were already commented out; cast 3 bulk fread calls - CometPeptideIndex.cpp: removed tTmpRead, changed 5 assignments to (void)fread - CometSpecLib.cpp: removed bDoneProcessingAllSpectra and its 5 assignments (each immediately followed by break); anonymized two unused string parameters - core/Types.h: added = 0 initializer to DBIndex::siVarModProteinFilter to fix -Wmaybe-uninitialized on local DBIndex sDBTmp Build now produces zero warnings from Comet-owned source files. Third-party warnings in MSToolkit/expat/zlib are unaffected. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Major restructuring of the search engine into a Strategy/Pipeline pattern: - Extract core types into CometSearch/core/ (Types.h, Params.h, Constants.h) to break the monolithic CometDataInternal.h header - Add ISearchStrategy interface with FastaStrategy, FiStrategy, and PiStrategy concrete implementations, each owning file I/O and lifecycle for its db type - Add Pipeline orchestrator: drives the strategy init -> per-file loop -> writer loop -> finalize sequence, with correct cleanup on init or open failure - Add SearchSession to carry per-run mutable state (query list, status ref, search-mode flags) through the call chain without globals - Add IResultWriter base class with BuildNames() helper; add concrete writers (MzIdentMlWriter, PepXmlWriter, SqtWriter, TxtWriter, PercolatorWriter) that own their own file handles and lifecycle - Add SearchMemoryPool to own the per-thread duplicate-fragment scratch arrays; route all paths (FASTA batch and RTS) through a single s_pool authority, removing the legacy _pbSearchMemoryPool / g_searchMemoryPoolMutex / g_searchPoolCV split-tracking system - Extract RunSearchAndPostAnalysis() helper in SearchUtils.h; refactor all three strategy executeBatch() methods to use it, eliminating ~110 lines of duplication - Move BuildNames() extTargetCrux parameter to last position with nullptr default so non-CRUX callers do not need to supply a CRUX-specific argument Bug fixes included in this branch: - StorePeptideI: honour iDecoySearch==2 by storing scored decoy peptides in _pDecoys[] with correct threshold tracking (was silently discarding them) - SearchMS1Library: use pMS1Query->accessMutex instead of the global g_pvQueryMutex, restoring per-query thread safety in the RTS MS1 path - MzIdentMlWriter::OpenTmp: close() the fd returned by mkstemp() before calling fopen(), preventing one fd leak per mzIdentML batch on Linux - MzIdentMlWriter::FinalizeOne: report error when the temp file cannot be reopened for merge (was silently dropping the error) - AllocateResultsMem: move iXcorrHistogram memset before the per-result-slot loop so it executes once per query rather than iNumStored times - StorePeptideI: use short loop variable for decoy index recomputation to match StorePeptide() and avoid implicit int-to-short narrowing - WithinMassTolerancePeff: use dCalcPepMass + dMassAddition as the seek-back reference mass (was using dCalcPepMass alone, missing the PEFF delta) - Pipeline::run: call finalize() on strategy-init failure; call close() on all writers when a writer open() fails mid-list - SearchSession: remove dead fields bPlainPeptideIndexRead and bSpecLibRead (globals are authoritative; these were never wired up) - FastaStrategy::initialize: remove empty if-block left over from earlier draft - CometWritePercolator: pass protein name vectors by const reference to avoid copying on every search-hit output call Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ish operator= and SearchUtils cleanup Fix pass for the architecture_update Strategy/Pipeline refactor: replace drift-prone hand-written operator= bodies in Params.h with = default, extract SearchUtils.h's non-trivial functions into SearchUtils.cpp plus a shared executeBatchLegacy() used by all three strategies, and correct the FusedLoadAndSearchSpectra batch-size counter to reflect processed rather than queued spectra. Also closes a gap found while re-reviewing that fix pass: the original SlotGuard added only to SearchThreadProc has been replaced with a shared SearchMemoryPoolSlotGuard (threading/SearchMemoryPool.h) applied at all five AcquirePoolSlot()/releaseSlot() sites in CometSearch.cpp, so an exception thrown out of any FI/PI search body -- not just the FASTA batch path -- can no longer leak a pool slot and stall the next acquireSlot() caller for up to 240s. The batch-FI per-query thread-pool lambda now also surfaces an AcquirePoolSlot() failure via bSucceeded instead of silently dropping the query. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Move AScorePro init/teardown from CometSearchManager::DoSearch() into Pipeline::run(), after strategy initialize() and around finalize(), so SetAScoreOptions() reads variableModParameters after FiStrategy has already overwritten it from the .idx file's VariableMod: header instead of stale/default values. - Fix CometSearch::SearchPeptideIndex(ThreadPool*, vector<Query*>&) never assigning _pQueries, left over from the Strategy/Pipeline refactor; this left BinarySearchMass() dereferencing a null pointer and silently crashing every PI_DB (-j) batch search on the first scored candidate. - Add t19 (FI_DB + AScorePro ordering) and t20 (PI_DB regression) tests, each empirically verified to fail pre-fix and pass post-fix. - Tidy up: writer close()-after-failed-open contract documented, isIndexBased() doc tightened, stale SearchSession.h migration comment resolved, redundant Params.h operator= declarations removed. - Update DataStructures.md/GlobalVariables.md/RealTimeSearch.md to match current SearchSession fields, SearchMemoryPool (replacing g_searchMemoryPoolMutex), and the _pQueries discipline; record both fixes in docs/20260617_codereview3.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…code - FiStrategy::finalize() now resets g_iFragmentIndex/g_iFragmentIndexOffset/ g_bIndexPrecursors to nullptr and g_bPlainPeptideIndexRead to false after freeing, so a second DoSearch() in the same process (or RTS-then-batch in one process) rebuilds the index instead of using freed pointers. - WriterOpenCtx::pStatus is now set via a mandatory constructor parameter instead of a defaulted-nullable field, turning "forgot to set status" into a compile error instead of a potential null-deref in any writer's open() failure path. - Delete the dead CometSearch::RunSearch(ThreadPool*, vector<Query*>&) overload and its misleading "called by DoSingleSpectrumSearchMultiResults" comment (it had zero callers). - Fix a stale IResultWriter.h comment referencing the removed g_pvQuery global. - Replace throwaway heap allocations (new/delete CometFragmentIndex, CometSearch) with stack locals in RunSearch(int,int,ThreadPool*,vector<Query*>&). - Replace SearchMemoryPool's O(n) linear scan with an O(1) free-list stack for acquireSlot()/releaseSlot(); benchmarked before/after (see docs/20260618_mutexserialization.md for methodology and results). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- 20260612_producerConsumerQueue.md: producer/consumer queue design for the fused batch FI_DB path. - 20260616_codereview1.md, 20260616_codereview2.md: architecture_update branch code reviews. - 20260617_codereview.md -> 20260617_codereview1.md: renamed for consistent numbering alongside the other dated review docs. - 20260618_mutexserialization.md: problem description and phased plan for SearchMemoryPool's mutex serialization, including benchmark methodology and results. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…pport test Extend run_regression.py with decoy_search=1/2 params variants (fasta/pi only, fi unsupported) and decoy-file comparison. Add test_raw_vs_mzxml.py to confirm the Windows binary's direct .raw reading matches .mzXML search results across all 5 output formats. Document both in tests/tests.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR refactors Comet’s batch-search architecture to make query ownership explicit (via SearchSession), improve thread-safety/RAII around shared search scratch memory, modernize some I/O call sites, and expand test+documentation coverage (including new regression test tooling for internal-decoy variants and Windows .raw support).
Changes:
- Introduces
SearchSession+ strategy-drivenPipelinefor batch searches and extracts shared batch helpers intosearch/SearchUtils.{h,cpp}. - Adds
SearchMemoryPool(and RAII guard) to centralize pool locking/lifetime and reduce exception-related slot-leak risk. - Updates writers/utilities to accept an explicit
queriesvector and adds/extends test fixtures + regression runners/docs.
Reviewed changes
Copilot reviewed 75 out of 88 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/data/t19_ascore_fidb.ms2 | Adds MS2 fixture for AScore+FI_DB regression coverage. |
| tests/unit/data/t19_ascore_fidb.fasta | Adds FASTA fixture paired with the T19 MS2 data. |
| tests/tests.md | New documentation summarizing test suites and individual test intent. |
| tests/regression/test_raw_vs_mzxml.py | New Windows .raw vs .mzXML parity test across output formats. |
| tests/regression/run_regression.py | Extends regression runner to test multiple internal-decoy variants and compare separate decoy outputs. |
| MSToolkit/.gitignore | Ignores generated headers/artifacts within MSToolkit. |
| Makefile | Tweaks build flags (incl. warning suppression) for top-level build. |
| docs/RealTimeSearch.md | Updates RTS documentation to reflect new architecture/state ownership and CSR index layout. |
| docs/GlobalVariables.md | Updates global-state documentation (SearchSession, CSR fragment index, protein name cache, pool notes). |
| docs/20260617_codereview2.md | Adds historical code review notes/documentation. |
| docs/20260616_codereview2.md | Adds historical code review notes/documentation. |
| docs/20260616_codereview1.md | Adds historical code review notes/documentation. |
| docs/20260615_multiple_rts_instances.md | Adds design notes for multiple concurrent RTS instances. |
| docs/20260612_producerConsumerQueue.md | Adds design notes for a bounded producer/consumer queue in fused FI batch path. |
| data/comet_small.params | Removes this params file from the repo. |
| data/comet_phospho_internaldecoy2.params | Updates/annotates params for internal decoy separate variant. |
| data/comet_phospho_internaldecoy1.params | Updates/annotates params for internal decoy concatenated variant. |
| CometSearch/threading/SearchMemoryPool.h | Adds SearchMemoryPool interface + slot guard RAII type. |
| CometSearch/threading/SearchMemoryPool.cpp | Implements pool allocation, acquire/release, and error handling. |
| CometSearch/Threading.cpp | Minor signature cleanup to silence unused warnings. |
| CometSearch/search/SearchUtils.h | Adds shared batch helpers and query comparators for pipeline/strategies. |
| CometSearch/search/SearchUtils.cpp | Implements shared batch helpers (input type detection, batch execution, result allocation, etc.). |
| CometSearch/search/SearchSession.h | Introduces per-batch mutable state container for batch searches. |
| CometSearch/search/PiStrategy.h | Adds PI_DB batch strategy interface. |
| CometSearch/search/PiStrategy.cpp | Implements PI_DB batch strategy using shared legacy batch helper. |
| CometSearch/search/Pipeline.h | Adds pipeline orchestrator interface for batch search execution. |
| CometSearch/search/Pipeline.cpp | Implements pipeline orchestration, writer lifecycle, per-batch cleanup, and AScore init placement. |
| CometSearch/search/ISearchStrategy.h | Adds strategy abstraction for database/search modes. |
| CometSearch/search/FiStrategy.h | Adds FI_DB batch strategy interface. |
| CometSearch/search/FiStrategy.cpp | Implements FI_DB strategy (fused path when possible, fallback legacy path). |
| CometSearch/search/FastaStrategy.h | Adds FASTA_DB batch strategy interface. |
| CometSearch/search/FastaStrategy.cpp | Implements FASTA_DB strategy using shared legacy batch helper. |
| CometSearch/output/TxtWriter.h | New writer class using shared naming and explicit query list output. |
| CometSearch/output/SqtWriter.h | New writer class using shared naming and explicit query list output. |
| CometSearch/output/PercolatorWriter.h | New writer class using shared naming and explicit query list output. |
| CometSearch/output/PepXmlWriter.h | New writer class using shared naming and explicit query list output. |
| CometSearch/output/MzIdentMlWriter.h | New writer class with temp-file merge flow and explicit query list output. |
| CometSearch/output/IResultWriter.h | Adds writer interface and shared filename builder. |
| CometSearch/Makefile | Adds new objects for extracted threading/search components and adjusts warning flags. |
| CometSearch/CometSearch.vcxproj | Updates include dirs and adds new source/header entries for the new modules. |
| CometSearch/CometSearch.h | Updates search API signatures to pass queries explicitly and adjusts thread data structure. |
| CometSearch/CometPreprocess.h | Threads SearchSession through batch preprocess APIs and fused FI helpers. |
| CometSearch/CometPreprocess.cpp | Replaces global query vectors with SearchSession, adopts lock_guard, and updates fused FI path to use session. |
| CometSearch/CometPostAnalysis.h | Threads an explicit queries vector through post-analysis thread data/API. |
| CometSearch/CometPostAnalysis.cpp | Uses explicit queries vector rather than global query state. |
| CometSearch/CometPeptideIndex.cpp | I/O cleanup: replaces unused fread results with (void) casts; updates RunSearch call signature. |
| CometSearch/CometMassSpecUtils.h | Updates GetProteinNameString to accept explicit queries. |
| CometSearch/CometMassSpecUtils.cpp | Implements GetProteinNameString using explicit queries rather than global query state. |
| CometSearch/CometInterfaces.h | Marks an unused static as [[maybe_unused]]. |
| CometSearch/CometFragmentIndex.cpp | I/O cleanup via (void) casts; updates RunSearch call signature; removes some dead locals/comments. |
| Comet.cpp | I/O cleanup: replaces unused fgets return value with (void) cast. |
| CLAUDE.md | Adds a “Code Review Protocol (Copilot Mode)” section and template. |
| .gitignore | Adjusts ignore entries (moves MSToolkit-generated ignores into MSToolkit/.gitignore). |
| CometSearch/CometWriteTxt.h | Updates writer API to accept explicit queries. |
| CometSearch/CometWriteTxt.cpp | Uses explicit queries instead of global query list during TXT output. |
| CometSearch/CometWriteSqt.h | Updates writer API to accept explicit queries. |
| CometSearch/CometWriteSqt.cpp | Uses explicit queries instead of global query list during SQT output. |
| CometSearch/CometWritePercolator.h | Updates writer API to accept explicit queries and avoids protein vector copies. |
| CometSearch/CometWritePercolator.cpp | Uses explicit queries; switches protein vectors to const-ref; minor iterator constness. |
| CometSearch/CometWritePepXML.h | Updates writer API to accept explicit queries. |
| CometSearch/CometWritePepXML.cpp | Uses explicit queries instead of global query list during pepXML output. |
| CometSearch/CometWriteMzIdentML.h | Updates writer API to accept explicit queries and passes idx-no-fasta through merge. |
| CometSearch/CometWriteMzIdentML.cpp | Uses explicit queries; threads idx-no-fasta flag through tmp parse/merge. |
| CometSearch/CometSpecLib.cpp | Minor warning cleanups (unused params/vars) and signature cleanup. |
| CometSearch/core/Constants.h | Adds/relocates constants and DbType enum into new core header. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Guard suffix pointer arithmetic in GetInputType()/UpdateInputFile() so short filenames can't read before the buffer, drop the dead RunSpecLibSearch(ThreadPool*) stub overload, stream pepXML/mzIdentML/pin record counters in test_raw_vs_mzxml.py instead of loading whole files, and fix the macOS Makefile rule to mkdir the target directory before compiling. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This pull request introduces several improvements to code quality, thread safety, and maintainability across the codebase, particularly in the handling of query data and file I/O. The most significant changes include making query access explicit and thread-safe, modernizing file reading practices, and cleaning up unused parameters and code. Additionally, documentation has been improved with a new code review protocol.
Thread safety and query handling:
Refactored functions in
CometPreprocess,CometPostAnalysis, and related classes to passqueriesexplicitly rather than relying on global variables, and replaced manual mutex locking withstd::lock_guardfor safer, clearer thread synchronization. (CometSearch/CometPreprocess.cpp[1] [2] [3] [4] [5];CometSearch/CometPostAnalysis.cpp[6] [7];CometSearch/CometPostAnalysis.h[8] [9]Updated
CometMassSpecUtils::GetProteinNameStringand its usages to accept thequeriesvector as a parameter, removing reliance on the globalg_pvQuery. (CometSearch/CometMassSpecUtils.cpp[1] [2] [3];CometSearch/CometMassSpecUtils.h[4]File I/O modernization:
freadandfgetswith explicit(void)casts to suppress warnings and clarify intent, improving code clarity in multiple index and parameter reading functions. (CometSearch/CometPeptideIndex.cpp[1] [2];CometSearch/CometFragmentIndex.cpp[3] [4] [5];Comet.cpp[6]Code cleanup:
tpin thread procedures, and marked static variables as[[maybe_unused]]to silence compiler warnings. (CometSearch/CometFragmentIndex.cpp[1];CometSearch/CometFragmentIndex.cpp[2];CometSearch/CometFragmentIndex.cpp[3];CometSearch/CometInterfaces.h[4]Documentation:
CLAUDE.md, standardizing the review process and feedback format. (CLAUDE.mdCLAUDE.mdR220-R238)Minor improvements:
CometSearch::RunSearchto require an explicitemptyQueriesvector, reflecting the new interface. (CometSearch/CometFragmentIndex.cpp[1];CometSearch/CometPeptideIndex.cpp[2]These changes collectively improve code safety, maintainability, and clarity, especially in multi-threaded contexts.