implement unified source-agnostic ETL pipeline #5
Open
MohamedAliBadawy wants to merge 2 commits into
Group Members
Summary
This PR implements a complete ETL (Extract → Transform → Validate → Load) pipeline that transforms heterogeneous bibliographic data from multiple sources into a unified WoS-style schema. The pipeline replaces the legacy procedural formatting logic with a declarative, extensible architecture.
Architecture
Declarative Mapping Strategy
Instead of if/else branches per source, column mappings are defined as dictionaries in SOURCE_MAPPINGS. Adding a new source requires only a new dictionary entry, with no code changes to the pipeline.
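The declarative approach can be illustrated with a minimal sketch. The source names and column names here are hypothetical and are not taken from the PR's actual SOURCE_MAPPINGS; plain dictionaries stand in for DataFrame rows to keep the example dependency-free.

```python
# Hypothetical mapping table: source names and column names are illustrative
# and are not the PR's actual SOURCE_MAPPINGS entries.
SOURCE_MAPPINGS = {
    "scopus": {"Authors": "AU", "Title": "TI", "Year": "PY", "Source title": "SO"},
    "pubmed": {"AU": "AU", "TI": "TI", "DP": "PY", "JT": "SO"},
}


def rename_record(record: dict, source: str) -> dict:
    """Rename one record's keys using the mapping registered for `source`."""
    mapping = SOURCE_MAPPINGS[source.lower()]
    # Keys without a mapping entry are dropped, so only schema fields survive.
    return {mapping[key]: value for key, value in record.items() if key in mapping}


# Supporting a new source is a data change, not a code change.
SOURCE_MAPPINGS["dimensions"] = {"Authors": "AU", "Title": "TI", "PubYear": "PY"}

row = {"Authors": "Doe J.", "Title": "A study", "PubYear": 2021, "DOI": "10.1/x"}
print(rename_record(row, "Dimensions"))
```

The renaming function never branches on the source name, so the only place that knows about a given source is its dictionary entry.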
The ETL Dispatcher
To bridge the dashboard's user interface with our unified schema, we implemented a format-aware Dispatcher in get_data.py:
Files are routed through etl_pipeline() in etl.py.
Legacy and complex formats (e.g., BibTeX .bib, compressed ZIPs) are first processed by the legacy formatters and then dispatched to _apply_etl_standardisation() so that their output conforms to the downstream contract.
Pipeline Flow
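The four stages named in the summary (Extract → Transform → Validate → Load) can be sketched as a chain of small functions. The function names, the CSV input, and the rule that validation drops records lacking required fields are all assumptions made for illustration; the real etl_pipeline() may be organised differently.

```python
import csv
import io

# Hypothetical stage functions; names and behaviour are illustrative only.


def extract(raw_text: str) -> list:
    """Extract: parse CSV text into one dict per record."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(records: list, mapping: dict) -> list:
    """Transform: rename source columns to unified field tags."""
    return [{mapping[k]: v for k, v in rec.items() if k in mapping} for rec in records]


def validate(records: list, required: list) -> list:
    """Validate: keep records with a non-empty value for every required field."""
    return [rec for rec in records if all(rec.get(field) for field in required)]


def load(records: list, store: list) -> int:
    """Load: append validated records to the target collection."""
    store.extend(records)
    return len(records)


def run_pipeline(raw_text: str, mapping: dict, required: list, store: list) -> int:
    """Run the four stages in order and return the number of loaded records."""
    return load(validate(transform(extract(raw_text), mapping), required), store)


raw = "Title,Year\nA study,2021\n,2019\n"
table = []
loaded = run_pipeline(raw, {"Title": "TI", "Year": "PY"}, ["TI", "PY"], table)
print(loaded, table)
```

Keeping each stage as a separate function means a stage can be tested or swapped without touching the others.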
Type Contracts
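The type/default pairs listed in this section (list[str] with [], int64 with 0, str with "") can be read as a rule for coercing each field and filling missing values. The sketch below follows that assumed reading; the field names AU, PY, TC, and TI are illustrative assignments, and Python's built-in int stands in for int64.

```python
import math

# Type name -> factory for the value used when a field is missing.
DEFAULTS = {"list[str]": list, "int64": lambda: 0, "str": str}

# Hypothetical field-to-type assignment, for illustration only.
FIELD_TYPES = {"AU": "list[str]", "PY": "int64", "TC": "int64", "TI": "str"}


def is_missing(value) -> bool:
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def coerce(value, type_name: str):
    """Return `value` coerced to the contract type, or the type's default."""
    if is_missing(value):
        return DEFAULTS[type_name]()
    if type_name == "list[str]":
        # A bare string becomes a one-item list instead of being split into characters.
        items = [value] if isinstance(value, str) else list(value)
        return [str(item) for item in items]
    if type_name == "int64":
        try:
            return int(float(value))
        except (TypeError, ValueError, OverflowError):
            return 0  # non-numeric input such as "S10" falls back to the default
    return str(value)


def apply_contract(record: dict) -> dict:
    """Build a record that has every contract field with the right type."""
    return {field: coerce(record.get(field), t) for field, t in FIELD_TYPES.items()}


print(apply_contract({"AU": "Doe J", "PY": "2021", "TC": float("nan")}))
```

Using factories rather than shared default objects means each record gets its own empty list.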
list[str]: []
int64: 0
int64: 0
str: ""
Files Changed
New Files
www/services/etl.py
www/services/api_retriever.py
functions/get_data.py
validation.ipynb
Modified Files (49 Tracked Files Modified)
To make the codebase database source-agnostic and resolve the known crashes, changes were made across 49 tracked files. The changes are summarized below, grouped by technical layer:
1. Core Framework & Dashboard Integration
app.py: Completely redesigned the PubMed & OpenAlex sidebars (ui.layout_sidebar); added a target Search Field selector and a year-range input; removed legacy key warning banners.
functions/get_data.py: Centralized the standardizer entry point, routing files through etl_pipeline and API collections through _apply_etl_standardisation().
www/services/__init__.py: Registered the new etl and api_retriever modules for clean workspace imports.
2. Network Calculations & Graph Abstractions
www/services/couplingmap.py: Standardized coupling node keys to strings, resolving the 'float' object is not iterable crashes in Louvain & CNM community detection.
www/services/histnetwork.py: Standardized case-insensitive database matching checks (db.lower()) and patched empty CR sequences.
www/services/biblionetwork.py: Patched index arrays and node boundary checks to prevent overflow errors in network calculations.
www/services/cocmatrix.py: Wrapped sparse mappings with NaN checks for null-safe co-occurrence calculations.
www/services/networkplot.py: Standardized canvas scaling constraints to correctly center small nodes in Plotly.
3. Bibliographic Parsers & Extraction Utilities
www/services/parsers.py: Patched the regular expressions used in raw-text parsing to support Cochrane and XML tag attributes.
www/services/metatagextraction.py: Added robust defaults and standardized list conversion during AU_UN and CO normalization.
www/services/termextraction.py: Added empty-list fallbacks to term-split arrays to avoid parsing crashes.
www/services/format_functions.py: Fixed the critical NameError crash at line 1626, where the variable columns was undefined.
www/services/tabletag.py: Standardized tabular alignments and string representations.
4. Patched Downstream Tab Modules (UI Calculation Files)
We audited and patched all 33 downstream analytical modules in the functions/ directory to resolve float-iteration crashes, division by zero, empty list expansions, and case-sensitive schema lookups, so that every analytical tab runs without crashing regardless of the uploaded database source:
A. Production & Productivity Modules:
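For the production modules in this subsection, the NaN-aware year handling and the guarded growth-rate calculation can be sketched as follows. The helper names are hypothetical, and the growth formula is the usual compound annual growth rate applied to first-year and last-year document counts; whether get_maininformations.py computes it exactly this way is an assumption.

```python
import math


def clean_years(values) -> list:
    """Return the publication years that can be read as integers."""
    years = []
    for value in values:
        try:
            number = float(value)
        except (TypeError, ValueError):
            continue  # e.g. None or a non-numeric string
        if math.isfinite(number):  # skips NaN and infinities
            years.append(int(number))
    return years


def annual_growth_rate(values) -> float:
    """Compound annual growth rate (%) of yearly output, or 0.0 if undefined."""
    years = clean_years(values)
    if not years:
        return 0.0  # nothing usable: avoid min()/max() on an empty list
    first, last = min(years), max(years)
    span = last - first
    if span == 0:
        return 0.0  # a single-year collection has no growth rate
    start, end = years.count(first), years.count(last)
    return ((end / start) ** (1 / span) - 1) * 100


sample = [2019, "2019", None, float("nan")] + [2021] * 8
print(clean_years(sample)[:3], annual_growth_rate(sample))
```

Filtering unusable values before any min, max, or cast is what avoids the "cannot convert float NaN to integer" error on sparse datasets.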
get_maininformations.py: Replaced the legacy min/max year operations with safe, NaN-aware integer casts. Added explicit checks to the CAGR, single-authored document count, and co-authorship percentage calculations to prevent ValueError: cannot convert float NaN to integer on sparse datasets.
get_annualproduction.py: Cast publication year values (PY) strictly to 64-bit integers (int64), preventing arithmetic type errors (e.g. str - str) during chronological expansion.
get_authorproductionovertime.py: Replaced case-sensitive database name lookups with case-insensitive db.lower() logic. Added safeguards against empty lists in Plotly timeline components.
get_affiliationproductionovertime.py: Fixed a critical shape mismatch error during matrix division and added automatic extraction of missing institution metadata (AU_UN normalization) when it is not already loaded.
get_sourcesproduction.py: Standardized journal lookups by converting all journal names (SO) to lowercase strings, ensuring correct aggregation.
get_countriesproduction.py & get_countriesproductionovertime.py: Standardized list conversions inside country address fields (C1). When the raw database provides address fields as strings rather than lists, the parser first converts them to single-item lists, preventing character-by-character loops (e.g. iterating USA as "U", "S", "A").
B. Citation & References Modules:
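Two patterns recur in the citation modules of this subsection: guarded division and an empty-state result when Cited References (CR) are absent. The sketch below shows both under stated assumptions: the function names and the shape of the returned dictionary are hypothetical, and records are plain dictionaries rather than DataFrame rows.

```python
from collections import Counter


def safe_ratio(numerator: float, denominator: float, default: float = 0.0) -> float:
    """Divide, returning `default` instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else default


def local_citation_counts(records: list) -> dict:
    """Count cited references, or return an empty-state message if there are none."""
    counts = Counter()
    for rec in records:
        refs = rec.get("CR")
        if not isinstance(refs, list):
            continue  # missing key, None, or NaN: nothing to count
        counts.update(refs)
    if not counts:
        return {"status": "empty", "message": "No cited references data available"}
    return {"status": "ok", "counts": dict(counts)}


pubmed_like = [{"TI": "A"}, {"TI": "B", "CR": float("nan")}]
wos_like = [{"CR": ["ref1", "ref2"]}, {"CR": ["ref1"]}]
print(local_citation_counts(pubmed_like))
print(local_citation_counts(wos_like))
print(safe_ratio(10, 0))
```

Returning a status alongside the data lets the calling tab decide whether to draw a chart or show the placeholder message.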
get_averagecitations.py: Replaced manual procedural loops with type-guarded, vectorized pandas averages, preventing type conversion errors on empty/null citations.
get_citeddocuments.py & get_citedcountries.py: Added explicit bounds checks to division functions, preventing division-by-zero crashes on datasets with zero citations.
get_referencesspectroscopy.py: Safeguarded chronological boundaries. Outlying references (before 1800 or after the current year) are filtered out to keep the spectrum boundaries stable.
get_localcitedauthors.py, get_localciteddocuments.py, get_localcitedreferences.py, & get_localcitedsources.py: Added explicit empty-state guards. Because PubMed and Cochrane exports omit Cited References (CR), these modules intercept the blank lists and display user-friendly placeholder messages instead of crashing with a KeyError or trying to parse NaN values.
C. Keywords & N-Grams Modules:
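The word cloud guard described in this subsection concerns the case where every term has the same frequency, so a min-max rescaling would divide by zero. A minimal sketch of such a guard follows; the function name, the min-max scaling itself, and the size range are assumptions, not the code in get_wordcloud.py.

```python
def scale_weights(freqs: dict, low: float = 10.0, high: float = 60.0) -> dict:
    """Map term frequencies to display sizes in [low, high] by min-max scaling."""
    if not freqs:
        return {}  # empty keyword array: nothing to render
    smallest, largest = min(freqs.values()), max(freqs.values())
    if largest == smallest:
        # Uniform frequencies: the scaling denominator would be zero,
        # so every term gets the midpoint size instead.
        midpoint = (low + high) / 2
        return {term: midpoint for term in freqs}
    spread = largest - smallest
    return {
        term: low + (count - smallest) / spread * (high - low)
        for term, count in freqs.items()
    }


print(scale_weights({"etl": 3, "pipeline": 3}))
print(scale_weights({"etl": 1, "pipeline": 3}))
```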
get_trendtopics.py: Corrected the year comparison logic by standardizing the time_window parameters to integers. Added empty-state fallbacks for datasets with missing keyword columns.
get_frequentwords.py & get_wordfrequency.py: Replaced case-sensitive filters with unified lowercase keyword matching. Added checks to skip empty keyword arrays.
get_wordcloud.py: Guarded weight boundaries inside the word cloud rendering matrix, preventing division-by-zero exceptions when term frequencies are completely uniform.
get_treemap.py: Set clear nesting level limits for multi-tier hierarchy rendering to ensure Plotly tree maps do not overflow or raise index exceptions.
D. Clustering & Advanced Network Modules:
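The safe index lookups mentioned for the historiograph in this subsection can be pictured as edge generation that skips any reference not present in the collection. The record layout below (an "id" key plus a CR list of ids) and the function name are hypothetical simplifications, not the structure used by get_historiograph.py.

```python
def build_citation_edges(records: list) -> list:
    """Return (citing_id, cited_id) pairs for references found in the collection."""
    # Index only the records that carry an identifier.
    index = {rec["id"]: rec for rec in records if "id" in rec}
    edges = []
    for rec in index.values():
        refs = rec.get("CR")
        if not isinstance(refs, list):
            continue  # missing or NaN reference field
        for ref in refs:
            # Membership test instead of direct indexing: unknown ids are skipped.
            if ref in index and ref != rec["id"]:
                edges.append((rec["id"], ref))
    return edges


sparse = [
    {"id": "A", "CR": ["B", "Z"]},  # "Z" is not in the collection
    {"id": "B", "CR": None},
    {"id": "C"},
    {"CR": ["A"]},  # record without an id is ignored
]
print(build_citation_edges(sparse))
```

An empty edge list is a valid result here, which is what lets the calling module fall back to an empty-state graphic.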
get_clusteringcoupling.py: Wrapped community division routines to isolate small/disconnected nodes, resolving modularity crashes in Louvain communities.
get_historiograph.py: Patched the network edge generation routine. It now uses safe index lookup guards to bypass missing records, so historical maps compile without crashing on sparse citation collections.
get_bradfordlaw.py: Enforced case-insensitive string parsing on journal names to accurately classify Bradford journal core zones.
get_lotkalaw.py: Added default model limits and singular-value constraints to prevent SVD convergence failures when analyzing datasets with small author counts.
get_cocitation.py, get_correspondingauthorcountries.py, get_factorialanalysis.py, get_relevantaffiliations.py, get_relevantsources.py, get_thematicevolution.py, get_thematicmap.py, & get_threefieldplot.py: Standardized all data matrix lookups inside co-occurrence and thematic mapping calculations. If the underlying collections are empty (e.g. missing cited references or keywords), these modules intercept the blank tables and render clean empty-state dashboard graphics instead of crashing.
Bug Fixes
_flatten_wos_records() collapses single-value lists
int("S10") ValueError
IS (ISSN) collides with IP → IS rename
str - str in downstream functions; int64
host_venue no longer exists; primary_location.source
columns undefined; format_functions.py line 1626; columns = df.columns
histNetwork DB check; "Scopus" vs "SCOPUS"; db.lower() comparison
affiliation_production crash; AU_UN column missing + shape mismatch
trend_topics crash; len(int_timespan) TypeError
'float' object is not iterable in communities
Test Results
To demonstrate the robust design of our pipeline, we evaluate the test coverage against two distinct metrics:
CR or Author Keywords DE).
Manual platform exports (such as the free tier of Dimensions, PubMed TXT, and Cochrane reviews) do not include Cited References (CR) in their download templates. For these sources, citation-based analysis functions (such as Co-citation, Historiograph, or Local Citations) run without crashing and return an empty-state message (e.g., "No cited references data available").
Validation Audit Table
CR & keywords DE present)
CR & keywords DE present)
CR (citation data not exported in free tiers)
CR (citation data not exported by PubMed)
CR (citation data not exported by Cochrane)
CR matches)
CR (PubMed API does not serve raw bibliography)
How to Test
# Run the Jupyter validation notebook containing the automated tests for all 26 analytical functions and 7 core services
jupyter notebook validation.ipynb