implement unified source-agnostic ETL pipeline#5

Open
MohamedAliBadawy wants to merge 2 commits into PRAISELab-PicusLab:main from MohamedAliBadawy:main
Group Members

  • Ammar Gharaf — ID: D03000248
  • Mohamed Badawy — ID: D03000253
  • Amr Khalil — ID: D03000225

Summary

This PR implements a complete ETL (Extract → Transform → Validate → Load) pipeline that transforms heterogeneous bibliographic data from multiple sources into a unified WoS-style schema. The pipeline replaces the legacy procedural formatting logic with a declarative, extensible architecture.

Architecture

Declarative Mapping Strategy

Instead of if/else branches per source, column mappings are defined as dictionaries in SOURCE_MAPPINGS:

SOURCE_MAPPINGS = {
    "SCOPUS": {
        "EID": "UT",         # Scopus record ID → WoS unique tag
        "DOI": "DI",
        "Title": "TI",
        "Cited by": "TC",
        # ... 18 more mappings
    },
    "DIMENSIONS": { ... },
    "PUBMED": { ... },
    "OPENALEX": { ... },
}

Adding a new source requires only adding a new dictionary entry — no code changes to the pipeline.
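
A minimal sketch of how a mapping of this kind could be applied, assuming records are plain dictionaries. The helper `apply_mapping` and the sample record are hypothetical and are not taken from etl.py; only the four Scopus mappings shown above are reproduced.

```python
# Hypothetical illustration: only the four mappings listed above are included.
SOURCE_MAPPINGS = {
    "SCOPUS": {"EID": "UT", "DOI": "DI", "Title": "TI", "Cited by": "TC"},
}


def apply_mapping(record: dict, source: str) -> dict:
    """Rename the keys of one record to WoS tags; unmapped keys are kept as-is."""
    mapping = SOURCE_MAPPINGS[source.upper()]
    return {mapping.get(key, key): value for key, value in record.items()}


scopus_row = {"EID": "2-s2.0-1", "Title": "A study", "Cited by": 12, "Link": "n/a"}
wos_row = apply_mapping(scopus_row, "scopus")
print(wos_row)
```

Because the renaming logic only reads the dictionary, supporting another source in this sketch means adding one more key to `SOURCE_MAPPINGS`.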

The ETL Dispatcher

To bridge the dashboard's user interface with our unified schema, we implemented a format-aware Dispatcher in get_data.py:

  • Heuristic File Routing: It inspects file headers, sizes, and extensions (such as CSV, XLSX, plain-text TXT, CIW) to automatically route uploads.
  • Unified Pipeline Coupling: Direct tabular uploads are dispatched straight to etl_pipeline() in etl.py. Legacy and complex formats (e.g., BibTeX .bib, compressed ZIPs) are first processed by the legacy formatters and then passed to _apply_etl_standardisation(), so their output conforms to the same downstream schema.
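
The routing decision can be sketched as follows. This is a simplified illustration that looks only at the file extension; the dispatcher described above also inspects headers and sizes. The function `route_upload` and the route labels it returns are hypothetical names, not identifiers from get_data.py.

```python
from pathlib import Path

# Assumed grouping of extensions, based on the formats named in the description.
DIRECT = {".csv", ".xlsx", ".txt", ".ciw"}   # sent straight to the ETL pipeline
LEGACY = {".bib", ".zip"}                    # legacy formatter first, then standardisation


def route_upload(filename: str) -> str:
    """Return a label for the processing route of an uploaded file."""
    suffix = Path(filename).suffix.lower()
    if suffix in DIRECT:
        return "etl_pipeline"
    if suffix in LEGACY:
        return "legacy_then_standardise"
    raise ValueError(f"Unsupported upload format: {filename!r}")


print(route_upload("savedrecs.CIW"))
print(route_upload("library.bib"))
```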

Pipeline Flow

┌──────────────┐    ┌───────────────┐    ┌──────────────┐    ┌────────────┐    ┌────────────┐
│  extract()   │───▶│  transform()  │───▶│  validate()  │───▶│  add_sr()  │───▶│   load()   │
│              │    │               │    │              │    │            │    │            │
│ • WoS TXT    │    │ • Rename cols │    │ • Schema ✓   │    │ • Short    │    │ • CSV      │
│ • Scopus CSV │    │ • Type cast   │    │ • NaN ✓      │    │   Ref.     │    │   export   │
│ • Dims XLSX  │    │ • List fields │    │ • Types ✓    │    │   key      │    │            │
│ • PubMed TXT │    │ • NaN cleanup │    │ • Lists ✓    │    │            │    │            │
│ • API data   │    │               │    │              │    │            │    │            │
└──────────────┘    └───────────────┘    └──────────────┘    └────────────┘    └────────────┘

Type Contracts

| Field Type | In-Memory Type | CSV Serialisation | NaN Default |
|---|---|---|---|
| Multi-value (AU, AF, C1, CR, DE, ID) | list[str] | Semicolon-delimited | [] |
| Citation count (TC) | int64 | Integer | 0 |
| Publication year (PY) | int64 | Integer | 0 |
| Scalar text fields | str | String | "" |
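
The contracts above can be illustrated with a small coercion helper. This is a sketch under stated assumptions: `coerce`, `is_missing`, and `to_csv_cell` are hypothetical names, the bare `;` join is an assumed serialisation detail, and the real pipeline presumably applies the same rules column-wise on a DataFrame.

```python
import math

MULTI_VALUE = {"AU", "AF", "C1", "CR", "DE", "ID"}
INTEGER = {"TC", "PY"}


def is_missing(value) -> bool:
    """True for None and float NaN, the two missing markers handled here."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def coerce(field: str, value):
    """Coerce one cell to the in-memory type that the contract assigns to its field."""
    if field in MULTI_VALUE:
        if is_missing(value):
            return []
        if isinstance(value, str):
            return [part.strip() for part in value.split(";") if part.strip()]
        return [str(item) for item in value]
    if field in INTEGER:
        if is_missing(value):
            return 0
        try:
            return int(float(value))
        except (TypeError, ValueError, OverflowError):
            return 0  # unparseable values fall back to the NaN default
    return "" if is_missing(value) else str(value)


def to_csv_cell(value) -> str:
    """Serialise a coerced value: lists become semicolon-delimited strings."""
    return ";".join(value) if isinstance(value, list) else str(value)


print(coerce("AU", "Smith J; Doe A"), coerce("TC", float("nan")), repr(coerce("TI", None)))
```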

Files Changed

New Files

| File | Purpose |
|---|---|
| www/services/etl.py | Core ETL pipeline: extract, transform, validate, load |
| www/services/api_retriever.py | PubMed & OpenAlex API retrievers with pagination, rate-limiting, retries, year range and search field filters |
| functions/get_data.py | Dashboard integration: routes uploads through ETL |
| validation.ipynb | Standard Jupyter validation notebook testing all 26 core analytical functions and 7 core service algorithms against standardized databases (Base + Advanced levels) |
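
The pagination, rate-limiting, and retry behaviour listed for api_retriever.py can be sketched generically. The code below is not the module's implementation: `fetch_all` is a hypothetical helper, the page-fetching callable is injected so the example runs without network access, and the choice of `ConnectionError` and exponential backoff is an assumption.

```python
import time
from typing import Callable, Iterator


def fetch_all(fetch_page: Callable[[int], list], max_retries: int = 3,
              delay: float = 0.1,
              sleep: Callable[[float], None] = time.sleep) -> Iterator[dict]:
    """Yield records page by page, retrying transient failures with backoff.

    Assumes max_retries >= 1 and that an empty page marks the end of the results.
    """
    page = 1
    while True:
        for attempt in range(max_retries):
            try:
                records = fetch_page(page)
                break
            except ConnectionError:
                if attempt == max_retries - 1:
                    raise  # give up after the last allowed attempt
                sleep(delay * 2 ** attempt)  # exponential backoff before retrying
        if not records:
            return
        yield from records
        sleep(delay)  # fixed pause between pages as a simple rate limit
        page += 1


# Usage with a stand-in for a real API call.
pages = {1: [{"id": 1}], 2: [{"id": 2}]}
print(list(fetch_all(lambda p: pages.get(p, []), sleep=lambda s: None)))
```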

Modified Files (49 Tracked Files Modified)

To make the codebase source-agnostic and fix the crashes found along the way, changes were made across 49 tracked files. They are summarized below, grouped by technical layer:

1. Core Framework & Dashboard Integration

  • app.py: Completely redesigned PubMed & OpenAlex sidebars (ui.layout_sidebar); added target Search Field selection, Year limits input range, and removed legacy key warning banners.
  • functions/get_data.py: Centralized standardizer entry-point routing to direct files through etl_pipeline and API collections through _apply_etl_standardisation().
  • www/services/__init__.py: Registered new etl and api_retriever modules for clean workspace imports.

2. Network Calculations & Graph Abstractions

  • www/services/couplingmap.py: Standardized coupling node keys to strings, resolving the 'float' object is not iterable crashes in Louvain & CNM community detection.
  • www/services/histnetwork.py: Standardized case-insensitive database matching checks (db.lower()) and patched empty CR sequences.
  • www/services/biblionetwork.py: Patched index arrays and node boundary checks to prevent overflow errors in network calculations.
  • www/services/cocmatrix.py: Wrapped sparse mappings with NaN check-points for null-safe co-occurrence calculations.
  • www/services/networkplot.py: Standardized canvas scaling constraints to correctly center small nodes in Plotly.

3. Bibliographic Parsers & Extraction Utilities

  • www/services/parsers.py: Patched the regular expressions in the raw-text parsers to support Cochrane exports and XML tag attributes.
  • www/services/metatagextraction.py: Added robust defaults and standardized list conversion during AU_UN and CO normalization.
  • www/services/termextraction.py: Added empty-list fallbacks around term splitting to avoid parsing crashes.
  • www/services/format_functions.py: Fixed the critical NameError crash at line 1626, where the variable columns was undefined.
  • www/services/tabletag.py: Standardized tabular alignments and string representations.

4. Patched Downstream Tab Modules (UI Calculation Files)

We audited and patched all 33 downstream analytical modules in the functions/ directory to resolve float-iteration crashes, division by zero, empty list expansions, and case-sensitive database-name checks. The goal is that every analytical tab runs without crashing regardless of the uploaded database source:

  • A. Production & Productivity Modules:

    • get_maininformations.py: Replaced the legacy min/max year operations with safe, nan-aware integer casts. Added explicit checks to CAGRs, single-authored docs counts, and co-authorship percentages to prevent ValueError: cannot convert float NaN to integer on sparse datasets.
    • get_annualproduction.py: Cast publication year values (PY) strictly to 64-bit integers (int64), preventing arithmetic runtime conflicts (e.g. str - str errors) during chronological expansion.
    • get_authorproductionovertime.py: Replaced case-sensitive database-name lookups with case-insensitive db.lower() comparisons. Added safeguards against empty lists in Plotly timeline components.
    • get_affiliationproductionovertime.py: Fixed a critical shape mismatch error during matrix division and integrated automatic extraction of missing institution metadata (AU_UN normalization) if it is not already loaded.
    • get_sourcesproduction.py: Standardized journal lookups by dynamically converting all journal names (SO) strictly to lowercase strings, ensuring correct aggregation.
    • get_countriesproduction.py & get_countriesproductionovertime.py: Standardized list conversions inside country address fields (C1). When the raw database provides address fields as strings rather than lists, the parser converts them to single-item lists first, preventing character-by-character loops (e.g. iterating USA as "U", "S", "A").
  • B. Citation & References Modules:

    • get_averagecitations.py: Replaced manual procedural loops with type-guarded, vectorized pandas sums and averages, preventing type conversion errors on empty/null citations.
    • get_citeddocuments.py & get_citedcountries.py: Added explicit bounds checks to division functions, preventing division-by-zero crashes on datasets with zero citations.
    • get_referencesspectroscopy.py: Safeguarded chronological boundaries. Outlying references (pre-1800 or post current year) are filtered out to keep spectra boundaries stable.
    • get_localcitedauthors.py, get_localciteddocuments.py, get_localcitedreferences.py, & get_localcitedsources.py: Built explicit empty-state guards. Since PubMed and Cochrane exports naturally omit Cited References (CR), these modules intercept the blank lists and display user-friendly placeholder messages instead of crashing with a KeyError or trying to parse NaN values.
  • C. Keywords & N-Grams Modules:

    • get_trendtopics.py: Corrected the year comparison logic by standardizing the time_window parameters to integers. Added empty-state fallbacks for datasets with missing keyword columns.
    • get_frequentwords.py & get_wordfrequency.py: Replaced case-sensitive filters with unified lowercase keyword matching. Added checks to skip empty keyword arrays.
    • get_wordcloud.py: Guarded weight boundaries inside the word cloud rendering matrix, preventing division-by-zero exceptions when term frequencies are completely uniform.
    • get_treemap.py: Set clear nesting level limits for multi-tier hierarchy rendering to ensure Plotly tree maps do not overflow or raise index exceptions.
  • D. Clustering & Advanced Network Modules:

    • get_clusteringcoupling.py: Wrapped community division routines to isolate small/disconnected nodes, resolving modularity crashes in Louvain communities.
    • get_historiograph.py: Patched the network edge generation routine. Uses safe index lookup guards to bypass missing records, enabling historical maps to compile crash-free on sparse citation collections.
    • get_bradfordlaw.py: Enforced case-insensitive string parsing on journal names to accurately classify Bradford journal core zones.
    • get_lotkalaw.py: Injected default model standard limits and singular-value constraints to prevent SVD convergence failures when analyzing datasets with small author counts.
    • get_cocitation.py, get_correspondingauthorcountries.py, get_factorialanalysis.py, get_relevantaffiliations.py, get_relevantsources.py, get_thematicevolution.py, get_thematicmap.py, & get_threefieldplot.py: Standardized all data matrix lookups inside co-occurrence and thematic mapping calculations. If the underlying collections are empty (e.g. missing cited references or keywords), these modules intercept the blank tables and render clean empty-state dashboard graphics instead of crashing.
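
Two guard patterns recur across the modules above: normalising a cell to a list before iterating it, and protecting divisions against a zero denominator. A minimal sketch of both follows; `ensure_list` and `safe_ratio` are hypothetical helper names and do not appear in the PR.

```python
import math


def ensure_list(value) -> list:
    """Return a list for any cell so that iteration never walks characters or NaN."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return []
    if isinstance(value, str):
        return [value] if value else []  # "USA" becomes ["USA"], not "U", "S", "A"
    return list(value)


def safe_ratio(numerator: float, denominator: float, default: float = 0.0) -> float:
    """Divide, returning a default instead of raising on a zero denominator."""
    return numerator / denominator if denominator else default


for country in ensure_list("USA"):
    print(country)
print(safe_ratio(10, 0))
```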

Bug Fixes

| # | Bug | Root Cause | Fix |
|---|---|---|---|
| 1 | WoS data crash after ETL | Parser returns lists for scalar fields | _flatten_wos_records() collapses single-value lists |
| 2 | Scopus page number crash | int("S10") ValueError | Safe string conversion |
| 3 | PubMed duplicate columns | IS (ISSN) collides with IP→IS rename | Drop collision columns before rename |
| 4 | PY arithmetic errors | str - str in downstream functions | Changed PY to int64 |
| 5 | TC type loss in cleanup | NaN cleanup converts TC to string | Skip int columns in string conversion |
| 6 | OpenAlex deprecated field | host_venue no longer exists | Updated to primary_location.source |
| 7 | columns undefined | format_functions.py line 1626 | Set columns = df.columns |
| 8 | histNetwork DB check | Case-sensitive "Scopus" vs "SCOPUS" | db.lower() comparison |
| 9 | affiliation_production crash | AU_UN column missing + shape mismatch | Auto-extract + index alignment |
| 10 | trend_topics crash | len(int_timespan) TypeError | Type-aware timespan handling |
| 11 | Empty CR/DE plotting crashes | Plotly scatter with NaN sizes | Empty data guards with user-friendly messages |
| 12 | Cochrane missing DOI / pages | Incomplete tag maps and compound page strings | Integrated Cochrane database map with hyphen regex parser |
| 13 | Lens cited reference crashes | CR formats use mixed OpenAlex/Lens IDs | Standardized bibliography labels with direct resolution |
| 14 | Coupling network float error | 'float' object is not iterable in communities | Safe string cast for key networks + Louvain membership mapping |
| 15 | API collection size controls | Query filters lacked year and field isolation | Added Year limits and Title/Author field selectors |
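
As an illustration of fix 6, the sketch below reads the source name of an OpenAlex work from primary_location.source rather than the removed host_venue field. The helper `source_name` is hypothetical, and the use of `display_name` as the value of interest is an assumption about what the pipeline extracts.

```python
def source_name(work: dict) -> str:
    """Return the journal/source name of an OpenAlex work, or "" if unavailable.

    Both primary_location and its nested source can be null in API responses,
    so each level falls back to an empty dict before the next lookup.
    """
    location = work.get("primary_location") or {}
    source = location.get("source") or {}
    return source.get("display_name") or ""


work = {"primary_location": {"source": {"display_name": "Scientometrics"}}}
print(source_name(work))
print(repr(source_name({"primary_location": None})))
```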

Test Results

We report test results against two distinct metrics:

  1. Crash-Free Execution (Exception-Free): The function runs to completion without raising a Python exception. This checks that the ETL type and validation contracts hold for that source.
  2. Active Output Generation (Populated Charts): The function renders a fully populated chart with statistical results. This depends on whether the underlying raw data file contains the optional information (specifically, Cited References CR or Author Keywords DE).

Manual platform exports (like the free tier of Dimensions, PubMed TXT, and Cochrane reviews) do not export Cited References CR in their download templates. For these sources, citation-based analysis functions (like Co-citation, Historiograph, or Local Citations) will successfully execute without crashing, returning a clean, graceful empty-state message (e.g., "No cited references data available").

Validation Audit Table

| Source Collection | Records | Schema Validated | Crash-Free Execution | Populated Outputs | Key Reason for Empty Outputs |
|---|---|---|---|---|---|
| Scopus (CSV) | 1,000 | | 26 / 26 | 26 / 26 | None (Full bibliography CR & keywords DE present) |
| Lens (CSV) | 100 | | 26 / 26 | 26 / 26 | None (Full bibliography CR & keywords DE present) |
| Dimensions (XLSX) | 500 | | 26 / 26 | 22 / 26 | Missing CR (citation data not exported in free tiers) |
| PubMed (TXT) | 10,000 | | 26 / 26 | 22 / 26 | Missing CR (citation data not exported by PubMed) |
| Cochrane (TXT) | 151 | | 26 / 26 | 22 / 26 | Missing CR (citation data not exported by Cochrane) |
| OpenAlex API | 5 | | 26 / 26 | 26 / 26 | None (Full OpenAlex JSON includes CR matches) |
| PubMed API | 5 | | 26 / 26 | 22 / 26 | Missing CR (PubMed API does not serve raw bibliography) |
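
The two metrics can be tallied with a small harness along these lines. This is a sketch, not the notebook's code: `audit` is a hypothetical function, and treating a falsy return value (None, empty list) as an empty-state result is an assumption made for the example.

```python
def audit(functions: dict, dataset) -> dict:
    """Count how many functions run without raising and how many return output."""
    crash_free = populated = 0
    for func in functions.values():
        try:
            result = func(dataset)
        except Exception:
            continue  # counted as a crash: neither metric is incremented
        crash_free += 1
        if result:  # assumption: falsy results represent an empty-state message
            populated += 1
    return {"total": len(functions), "crash_free": crash_free, "populated": populated}


# Stand-in analytical functions operating on a list of records.
checks = {
    "annual_production": lambda rows: [r["PY"] for r in rows],
    "co_citation": lambda rows: [ref for r in rows for ref in r["CR"]],
    "broken": lambda rows: rows[0]["missing"],
}
print(audit(checks, [{"PY": 2020, "CR": []}]))
```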

How to Test

# Run the Jupyter validation notebook containing the automated tests for all 26 analytical functions and 7 core services
jupyter notebook validation.ipynb