Large-File / Streaming Processing

Enterprise FundsXML feeds can run to hundreds of MB. Loading one into a DOM exhausts memory, so these examples process the file as a stream at constant memory. Verified: ~16 MiB RSS (Python) / ~2 MiB heap (Java, -Xmx64m) for a 20 000-position file, independent of file size.

| Tool | What |
| --- | --- |
| python/make_large_sample.py | Generate a big XSD-valid file by streaming writes |
| python/stream_aggregate.py | lxml.iterparse aggregation (clears parsed siblings) |
| java/StreamAggregate.java | StAX pull parser, native Java (no JAXB/DOM) |
| python/split.py | Split into independently XSD-valid chunks |
| python/delta_diff.py | INITIAL-vs-DELTA position diff (added/removed/changed) |
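
The "streaming writes" idea behind the sample generator can be sketched as follows. This is a minimal illustration, not the contents of make_large_sample.py: the function name `write_positions` and the reduced element layout (`UniqueID`, `TotalValue`) are assumptions made for the example, and the output is not XSD-valid.

```python
import io
from xml.sax.saxutils import escape


def write_positions(out, n):
    """Write n synthetic positions to a text stream, one element at a time.

    Hypothetical, simplified layout: a real FundsXML document needs many more
    mandatory elements to validate against the XSD.
    """
    out.write('<?xml version="1.0" encoding="UTF-8"?>\n<FundsXML4><Positions>\n')
    for i in range(1, n + 1):
        # Each position is formatted and written immediately, so memory use
        # does not grow with n.
        unique_id = escape(f"ID{i:08d}")
        out.write(
            f"<Position><UniqueID>{unique_id}</UniqueID>"
            f"<TotalValue>{i * 100}.00</TotalValue></Position>\n"
        )
    out.write("</Positions></FundsXML4>\n")


# Usage: an in-memory buffer here; an open file object works the same way.
buffer = io.StringIO()
write_positions(buffer, 3)
```

Because the writer never holds more than one position in memory, the same loop should scale from 3 to 50 000 positions without a change in footprint.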

FundsXML 4.x has no XML namespace — matchers use the bare Position tag. Parsers disable DTD/external entities (XXE-safe).
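
A rough sketch of the pattern described above (match the bare `Position` tag, then drop each element once it has been consumed) might look like this. It uses the standard library's `xml.etree.ElementTree.iterparse` as a stand-in for `lxml.iterparse`, so it is not the code in stream_aggregate.py. The child element `TotalValue` and the function name are assumptions for illustration. With lxml, the hardening would typically be requested through `iterparse` keyword arguments such as `resolve_entities=False`, `load_dtd=False` and `no_network=True`.

```python
import io
import xml.etree.ElementTree as ET
from decimal import Decimal


def aggregate_positions(source, tag="Position", value_tag="TotalValue"):
    """Count positions and sum a numeric child while streaming.

    value_tag is a hypothetical, simplified child element; adapt it to the
    real FundsXML structure.
    """
    count = 0
    total = Decimal("0")
    stack = []  # currently open ancestors of the element being parsed
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem)
            continue
        stack.pop()  # elem is closed; its parent is now on top of the stack
        if elem.tag == tag:  # no namespace, so the bare tag name matches
            count += 1
            text = elem.findtext(value_tag)
            if text:
                total += Decimal(text.strip())
            if stack:
                # Detach the finished position so the tree does not grow.
                stack[-1].remove(elem)
    return count, total


SAMPLE = b"""<FundsXML4><Funds><Fund><Positions>
<Position><UniqueID>A</UniqueID><TotalValue>10.50</TotalValue></Position>
<Position><UniqueID>B</UniqueID><TotalValue>4.25</TotalValue></Position>
</Positions></Fund></Funds></FundsXML4>"""

count, total = aggregate_positions(io.BytesIO(SAMPLE))
print(count, total)  # prints: 2 14.75
```

`Decimal` is used instead of `float` so that totals from different implementations can be compared exactly.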

Run

# 1. synthetic 50k-position file (~constant-memory writer)
python3 Large_File_Processing/python/make_large_sample.py big.xml 50000

# 2. aggregate — Python and Java give identical totals at flat memory
python3 Large_File_Processing/python/stream_aggregate.py big.xml
# Java via the committed Maven Wrapper (mvnw.cmd on Windows); a small heap
# proves the constant-memory claim.
MAVEN_OPTS=-Xmx64m ./mvnw -q -pl Large_File_Processing/java compile exec:java \
  -Dexec.args="big.xml"

# 3. split into XSD-valid chunks of 10k positions
python3 Large_File_Processing/python/split.py big.xml chunks/ 10000
XSD_Validation/cli/validate.sh \
  https://github.com/fundsxml/schema/releases/download/4.2.9/FundsXML.xsd \
  chunks/chunk-0001.xml   # or a local FundsXML.xsd path

# 4. day-over-day position delta (exit 1 if anything changed)
python3 Large_File_Processing/python/delta_diff.py yesterday.xml today.xml

ETL note

The streaming readers are the natural building block for ETL pipelines (Apache Camel / NiFi / Spring Batch): a split step feeds a parallel stream_aggregate/load stage, and delta_diff drives incremental upserts when a feed alternates DataOperation INITIAL/DELTA. Example Camel route shape:

from("file:in?include=.*\\.xml")
  .to("exec:python3?args=Large_File_Processing/python/split.py ${file} work/")
  .split(...).parallelProcessing()
  .to("exec:...stream_aggregate.py ...")   // or a JDBC load (see Database_Integration/)
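
For readers without Camel at hand, the same split-then-aggregate shape can be approximated in plain Python. This is a sketch under assumptions, not part of the repository: the function names and the `TotalValue` child are hypothetical, and a thread pool is used only to keep the example simple (for CPU-bound parsing a process pool would probably be the more realistic choice).

```python
import io
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor
from decimal import Decimal


def aggregate_chunk(source, tag="Position", value_tag="TotalValue"):
    """Return (count, total) for one chunk given a path or a binary file object."""
    count, total, stack = 0, Decimal("0"), []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem)
            continue
        stack.pop()
        if elem.tag == tag:
            count += 1
            total += Decimal((elem.findtext(value_tag) or "0").strip())
            if stack:
                stack[-1].remove(elem)  # drop the consumed position
    return count, total


def aggregate_chunks(sources, workers=4):
    """Aggregate every chunk in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(aggregate_chunk, sources))
    count = sum(c for c, _ in partials)
    total = sum((t for _, t in partials), Decimal("0"))
    return count, total


# Usage with two in-memory "chunks"; paths such as chunks/chunk-0001.xml
# could be passed instead.
chunk_a = io.BytesIO(
    b"<FundsXML4><Positions>"
    b"<Position><TotalValue>1.00</TotalValue></Position>"
    b"<Position><TotalValue>2.00</TotalValue></Position>"
    b"</Positions></FundsXML4>"
)
chunk_b = io.BytesIO(
    b"<FundsXML4><Positions>"
    b"<Position><TotalValue>3.00</TotalValue></Position>"
    b"</Positions></FundsXML4>"
)
print(aggregate_chunks([chunk_a, chunk_b]))
```

Because counts and sums are associative, the merged result should equal what a single pass over the unsplit file would produce.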

CI runs steps 2–4 on a generated file (totals asserted, a chunk XSD-validated).