implement unified source-agnostic ETL pipeline by MohamedAliBadawy · Pull Request #5 · PRAISELab-PicusLab/bibliometrix-python

MohamedAliBadawy · 2026-05-29T09:52:56Z

Group Members

Ammar Gharaf — ID: D03000248
Mohamed Badawy — ID: D03000253
Amr Khalil — ID: D03000225

Summary

This PR implements a complete ETL (Extract → Transform → Validate → Load) pipeline that transforms heterogeneous bibliographic data from multiple sources into a unified WoS-style schema. The pipeline replaces the legacy procedural formatting logic with a declarative, extensible architecture.

Architecture

Declarative Mapping Strategy

Instead of if/else branches per source, column mappings are defined as dictionaries in SOURCE_MAPPINGS:

SOURCE_MAPPINGS = {
    "SCOPUS": {
        "EID": "UT",         # Scopus record ID → WoS unique tag
        "DOI": "DI",
        "Title": "TI",
        "Cited by": "TC",
        # ... 18 more mappings
    },
    "DIMENSIONS": { ... },
    "PUBMED": { ... },
    "OPENALEX": { ... },
}

Adding a new source requires only adding a new dictionary entry — no code changes to the pipeline.
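As a rough illustration of how such a mapping can drive the rename step, the sketch below applies a small subset of the Scopus mapping to one record. The `rename_record` helper and the use of plain dictionaries in place of DataFrame rows are assumptions made for this example, not the code in etl.py.

```python
# Illustrative subset only; the real SOURCE_MAPPINGS has more entries per source.
SOURCE_MAPPINGS = {
    "SCOPUS": {"EID": "UT", "DOI": "DI", "Title": "TI", "Cited by": "TC"},
}


def rename_record(record: dict, source: str) -> dict:
    """Rename the keys of one record using the mapping registered for `source`.

    Hypothetical helper: keys without a mapping are kept unchanged.
    """
    mapping = SOURCE_MAPPINGS[source.upper()]
    return {mapping.get(key, key): value for key, value in record.items()}


scopus_row = {"EID": "2-s2.0-1", "Title": "A study", "Cited by": 12, "Year": 2020}
wos_row = rename_record(scopus_row, "scopus")
print(wos_row)
```

Because the function only looks up a dictionary, supporting another source in this sketch means adding one more entry to `SOURCE_MAPPINGS`.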

The ETL Dispatcher

To bridge the dashboard's user interface with our unified schema, we implemented a format-aware Dispatcher in get_data.py:

Heuristic File Routing: It inspects file headers, sizes, and extensions (such as CSV, XLSX, plain-text TXT, CIW) to route uploads automatically.
Unified Pipeline Coupling: Direct tabular uploads are dispatched straight to etl_pipeline() in etl.py. Legacy and complex formats (e.g., BibTeX .bib, compressed ZIPs) are first processed by the legacy formatters and then passed to _apply_etl_standardisation(), so that downstream modules always receive the same schema.
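The routing decision can be pictured with the sketch below. It is a simplification under stated assumptions: it looks at the file extension only (the dispatcher described above also inspects headers and sizes), the split of extensions into two groups is inferred from this description, and `choose_route` and its return labels are hypothetical names.

```python
from pathlib import Path

# Assumed grouping of extensions, inferred from the description above.
DIRECT_EXTENSIONS = {".csv", ".xlsx", ".txt", ".ciw"}
LEGACY_EXTENSIONS = {".bib", ".zip"}


def choose_route(filename: str) -> str:
    """Return a label naming the processing path for an uploaded file."""
    suffix = Path(filename).suffix.lower()
    if suffix in DIRECT_EXTENSIONS:
        # Tabular or tagged-text exports go straight to the ETL pipeline.
        return "etl_pipeline"
    if suffix in LEGACY_EXTENSIONS:
        # Legacy formats are parsed first, then standardised afterwards.
        return "legacy_formatter_then_standardise"
    raise ValueError(f"Unsupported upload format: {filename!r}")


print(choose_route("scopus_export.CSV"))
print(choose_route("references.bib"))
```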

Pipeline Flow

┌──────────────┐    ┌───────────────┐    ┌────────────┐    ┌──────────┐    ┌──────────┐
│  extract()   │───▶│  transform()  │───▶│ validate() │───▶│ add_sr() │───▶│  load()  │
│              │    │               │    │            │    │          │    │          │
│ • WoS TXT    │    │ • Rename cols │    │ • Schema ✓ │    │ • Short  │    │ • CSV    │
│ • Scopus CSV │    │ • Type cast   │    │ • NaN ✓    │    │   Ref.   │    │   export │
│ • Dims XLSX  │    │ • List fields │    │ • Types ✓  │    │   key    │    │          │
│ • PubMed TXT │    │ • NaN cleanup │    │ • Lists ✓  │    │          │    │          │
│ • API data   │    │               │    │            │    │          │    │          │
└──────────────┘    └───────────────┘    └────────────┘    └──────────┘    └──────────┘
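The add_sr() stage attaches a short reference key to each record. The PR does not spell out the key format, so the sketch below is only an assumption: it builds the key as first author, year, source, and disambiguates repeats with a numeric suffix. The function body and the suffix scheme are hypothetical and may differ from the implementation in etl.py.

```python
from collections import Counter


def add_sr(records: list[dict]) -> list[dict]:
    """Return copies of the records with an 'SR' short-reference key added.

    Assumed format: 'FIRST AUTHOR, YEAR, SOURCE', upper-cased. Repeated keys
    receive a numeric suffix so that every SR value stays unique.
    """
    seen = Counter()
    output = []
    for record in records:
        authors = record.get("AU") or ["ANONYMOUS"]
        base = f"{authors[0]}, {record.get('PY', 0)}, {record.get('SO', '')}".upper()
        seen[base] += 1
        suffix = "" if seen[base] == 1 else f"-{seen[base]}"
        output.append({**record, "SR": base + suffix})
    return output


rows = [
    {"AU": ["Smith J", "Doe A"], "PY": 2020, "SO": "Nature"},
    {"AU": ["Smith J"], "PY": 2020, "SO": "Nature"},
    {"AU": [], "PY": 2019, "SO": "Science"},
]
keyed = add_sr(rows)
print([row["SR"] for row in keyed])
```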

Type Contracts

| Field Type | In-Memory Type | CSV Serialisation | NaN Default |
| --- | --- | --- | --- |
| Multi-value (AU, AF, C1, CR, DE, ID) | `list[str]` | Semicolon-delimited | `[]` |
| Citation count (TC) | `int64` | Integer | `0` |
| Publication year (PY) | `int64` | Integer | `0` |
| Scalar text fields | `str` | String | `""` |
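The contracts in this table can be read as a per-field coercion rule. The following sketch shows one way to express them in plain Python; the helper names are hypothetical, a built-in `int` stands in for `int64`, and the fallback to `0` for unparseable numbers is an assumption of this example rather than documented behaviour.

```python
import math

LIST_FIELDS = {"AU", "AF", "C1", "CR", "DE", "ID"}
INT_FIELDS = {"TC", "PY"}


def is_missing(value) -> bool:
    """True for None and for float NaN, the two 'missing' markers handled here."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def coerce_field(name: str, value):
    """Coerce one value to the in-memory type that the contract assigns to `name`."""
    if name in LIST_FIELDS:
        if is_missing(value):
            return []
        if isinstance(value, str):
            # Semicolon-delimited text becomes a list of trimmed, non-empty items.
            return [part.strip() for part in value.split(";") if part.strip()]
        return [str(item) for item in value]
    if name in INT_FIELDS:
        if is_missing(value):
            return 0
        try:
            return int(float(value))
        except (TypeError, ValueError, OverflowError):
            return 0  # assumed fallback for values that are not numeric
    return "" if is_missing(value) else str(value)


def to_csv_cell(name: str, value) -> str:
    """Serialise a coerced value for CSV export."""
    return ";".join(value) if name in LIST_FIELDS else str(value)


print(coerce_field("AU", "SMITH J; DOE A"))
print(coerce_field("TC", float("nan")))
print(to_csv_cell("DE", ["ETL", "BIBLIOMETRICS"]))
```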

Files Changed

New Files

| File | Purpose |
| --- | --- |
| `www/services/etl.py` | Core ETL pipeline: extract, transform, validate, load |
| `www/services/api_retriever.py` | PubMed & OpenAlex API retrievers with pagination, rate-limiting, retries, year range and search field filters |
| `functions/get_data.py` | Dashboard integration: routes uploads through ETL |
| `validation.ipynb` | Standard Jupyter validation notebook testing all 26 core analytical functions and 7 core service algorithms against standardized databases (Base + Advanced levels) |

Modified Files (49 Tracked Files Modified)

To make the codebase source-agnostic and to fix the crashes listed under Bug Fixes, changes were made across 49 tracked files. They are summarized below, grouped by technical layer:

1. Core Framework & Dashboard Integration

app.py: Completely redesigned PubMed & OpenAlex sidebars (ui.layout_sidebar); added target Search Field selection, Year limits input range, and removed legacy key warning banners.
functions/get_data.py: Centralized standardizer entry-point routing to direct files through etl_pipeline and API collections through _apply_etl_standardisation().
www/services/__init__.py: Registered new etl and api_retriever modules for clean workspace imports.

2. Network Calculations & Graph Abstractions

www/services/couplingmap.py: Standardized coupling node keys to strings, resolving the 'float' object is not iterable crashes in Louvain & CNM community detection.
www/services/histnetwork.py: Standardized case-insensitive database matching checks (db.lower()) and patched empty CR sequences.
www/services/biblionetwork.py: Patched index arrays and node boundary checks to prevent overflow errors in network calculations.
www/services/cocmatrix.py: Wrapped sparse mappings with NaN check-points for null-safe co-occurrence calculations.
www/services/networkplot.py: Standardized canvas scaling constraints to correctly center small nodes in Plotly.

3. Bibliographic Parsers & Extraction Utilities

www/services/parsers.py: Patched regular expressions in the raw-text parsers to support Cochrane and XML tag attributes.
www/services/metatagextraction.py: Added robust defaults and standardized lists conversion during AU_UN and CO normalization.
www/services/termextraction.py: Wrapped terms split arrays with empty list fallbacks to bypass parsing crashes.
www/services/format_functions.py: Fixed the critical NameError crash at line 1626 where variable columns was undefined.
www/services/tabletag.py: Standardized tabular alignments and string representations.

4. Patched Downstream Tab Modules (UI Calculation Files)

We audited and patched all 33 downstream analytical modules in the functions/ directory to resolve float-iteration crashes, division by zero, empty list expansions, and case-sensitive schema lookups. The goal is that every analytical tab runs without raising an exception regardless of the uploaded database source:

A. Production & Productivity Modules:
- get_maininformations.py: Replaced the legacy min/max year operations with safe, NaN-aware integer casts. Added explicit checks to the CAGR, single-authored document count, and co-authorship percentage calculations to prevent ValueError: cannot convert float NaN to integer on sparse datasets.
- get_annualproduction.py: Cast publication year values (PY) strictly to 64-bit integers (int64), preventing arithmetic runtime conflicts (e.g. str - str errors) during chronological expansion.
- get_authorproductionovertime.py: Replaced case-sensitive database name lookups with case-insensitive db.lower() comparisons. Added safeguards against empty lists in Plotly timeline components.
- get_affiliationproductionovertime.py: Fixed a critical shape mismatch error during matrix division and integrated automatic extraction of missing institution metadata (AU_UN normalization) if it is not already loaded.
- get_sourcesproduction.py: Standardized journal lookups by dynamically converting all journal names (SO) strictly to lowercase strings, ensuring correct aggregation.
- get_countriesproduction.py & get_countriesproductionovertime.py: Standardized list conversions inside country address fields (C1). When the raw database provides address fields as strings rather than lists, the parser converts them to single-item lists first, preventing character-by-character loops (e.g. iterating USA as "U", "S", "A").
B. Citation & References Modules:
- get_averagecitations.py: Replaced manual procedural loops with type-guarded, vectorized pandas sums and averages, preventing type conversion errors on empty/null citations.
- get_citeddocuments.py & get_citedcountries.py: Added explicit bounds checks to division functions, preventing division-by-zero crashes on datasets with zero citations.
- get_referencesspectroscopy.py: Safeguarded chronological boundaries. Outlying references (before 1800 or after the current year) are filtered out to keep spectra boundaries stable.
- get_localcitedauthors.py, get_localciteddocuments.py, get_localcitedreferences.py, & get_localcitedsources.py: Built explicit empty-state guards. Since PubMed and Cochrane exports naturally omit Cited References (CR), these modules intercept the blank lists and display user-friendly placeholder messages instead of crashing with a KeyError or trying to parse NaN values.
C. Keywords & N-Grams Modules:
- get_trendtopics.py: Corrected the year comparison logic by standardizing the time_window parameters to integers. Added empty-state fallbacks for datasets with missing keyword columns.
- get_frequentwords.py & get_wordfrequency.py: Replaced case-sensitive filters with unified lowercase keyword matching. Added checks to skip empty keyword arrays.
- get_wordcloud.py: Guarded weight boundaries inside the word cloud rendering matrix, preventing division-by-zero exceptions when term frequencies are completely uniform.
- get_treemap.py: Set clear nesting level limits for multi-tier hierarchy rendering to ensure Plotly tree maps do not overflow or raise index exceptions.
D. Clustering & Advanced Network Modules:
- get_clusteringcoupling.py: Wrapped community division routines to isolate small/disconnected nodes, resolving modularity crashes in Louvain communities.
- get_historiograph.py: Patched the network edge generation routine. Uses safe index lookup guards to bypass missing records, enabling historical maps to compile crash-free on sparse citation collections.
- get_bradfordlaw.py: Enforced case-insensitive string parsing on journal names to accurately classify Bradford journal core zones.
- get_lotkalaw.py: Injected default model standard limits and singular-value constraints to prevent SVD convergence failures when analyzing datasets with small author counts.
- get_cocitation.py, get_correspondingauthorcountries.py, get_factorialanalysis.py, get_relevantaffiliations.py, get_relevantsources.py, get_thematicevolution.py, get_thematicmap.py, & get_threefieldplot.py: Standardized all data matrix lookups inside co-occurrence and thematic mapping calculations. If the underlying collections are empty (e.g. missing cited references or keywords), these modules intercept the blank tables and render empty-state dashboard graphics instead of crashing.
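Two guards recur throughout the modules listed above: wrapping a bare string in a list before iterating over it, and protecting divisions against a zero denominator. The sketch below shows a minimal form of each. The helper names `as_list` and `safe_ratio` are hypothetical and are not taken from the patched modules.

```python
import math


def as_list(value) -> list:
    """Normalise a multi-value field so that iterating it yields whole items.

    Without this guard, iterating the string "USA" yields "U", "S", "A".
    """
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return []
    if isinstance(value, str):
        return [value] if value.strip() else []
    return list(value)


def safe_ratio(numerator: float, denominator: float, default: float = 0.0) -> float:
    """Divide, returning `default` when the denominator is zero."""
    return numerator / denominator if denominator else default


print(as_list("USA"))
print(as_list(float("nan")))
print(safe_ratio(5, 0))
```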

Bug Fixes

| # | Bug | Root Cause | Fix |
| --- | --- | --- | --- |
| 1 | WoS data crash after ETL | Parser returns lists for scalar fields | `_flatten_wos_records()` collapses single-value lists |
| 2 | Scopus page number crash | `int("S10")` ValueError | Safe string conversion |
| 3 | PubMed duplicate columns | `IS` (ISSN) collides with `IP→IS` rename | Drop collision columns before rename |
| 4 | PY arithmetic errors | `str - str` in downstream functions | Changed PY to `int64` |
| 5 | TC type loss in cleanup | NaN cleanup converts TC to string | Skip int columns in string conversion |
| 6 | OpenAlex deprecated field | `host_venue` no longer exists | Updated to `primary_location.source` |
| 7 | `columns` undefined | `format_functions.py` line 1626 | Set `columns = df.columns` |
| 8 | `histNetwork` DB check | Case-sensitive `"Scopus"` vs `"SCOPUS"` | `db.lower()` comparison |
| 9 | `affiliation_production` crash | `AU_UN` column missing + shape mismatch | Auto-extract + index alignment |
| 10 | `trend_topics` crash | `len(int_timespan)` TypeError | Type-aware timespan handling |
| 11 | Empty CR/DE plotting crashes | Plotly scatter with NaN sizes | Empty data guards with user-friendly messages |
| 12 | Cochrane missing DOI / pages | Incomplete tag maps and compound page strings | Integrated Cochrane database map with hyphen regex parser |
| 13 | Lens cited reference crashes | CR formats use mixed OpenAlex/Lens IDs | Standardized bibliography labels with direct resolution |
| 14 | Coupling network float error | `'float' object is not iterable` in communities | Safe string cast for key networks + Louvain membership mapping |
| 15 | API collection size controls | Query filters lacked year and field isolation | Added Year limits and Title/Author field selectors |
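For bug 6, the nested lookup that replaces `host_venue` can be sketched as follows. In OpenAlex work objects both `primary_location` and its `source` may be null, so each level is guarded. The `source_name` helper and the choice of `display_name` as the extracted value are assumptions of this example, not necessarily what api_retriever.py does.

```python
def source_name(work: dict) -> str:
    """Return the journal/source name of an OpenAlex work, or "" if absent."""
    location = work.get("primary_location") or {}
    source = location.get("source") or {}
    return source.get("display_name") or ""


complete = {"primary_location": {"source": {"display_name": "Scientometrics"}}}
no_source = {"primary_location": {"source": None}}
no_location = {"primary_location": None}

print(source_name(complete))
print(repr(source_name(no_source)))
print(repr(source_name(no_location)))
```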

Test Results

Each analytical function is evaluated against two distinct metrics:

Crash-Free Execution (Exception-Free): The function runs to completion without raising a Python exception. This confirms that the ETL type and validation contracts hold for the tested datasets.
Active Output Generation (Populated Charts): The function renders a fully populated chart with statistical results. This depends on whether the underlying raw data file contains the optional information (specifically, Cited References CR or Author Keywords DE).

Manual platform exports (such as the free tier of Dimensions, PubMed TXT, and Cochrane reviews) do not include Cited References (CR) in their download templates. For these sources, citation-based analysis functions (such as Co-citation, Historiograph, or Local Citations) run without crashing and return an empty-state message (e.g., "No cited references data available").
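The two metrics can be tallied with a small harness along the lines of the sketch below. It is not the code in validation.ipynb: the `audit` function is hypothetical, and it assumes that an analytical function returns an empty or falsy value when it falls back to its empty-state message.

```python
def audit(functions: dict, data) -> dict:
    """Run each function on `data` and record both metrics per function."""
    results = {}
    for name, func in functions.items():
        try:
            output = func(data)
        except Exception as exc:  # any exception counts as a crash
            results[name] = {"crash_free": False, "populated": False, "error": repr(exc)}
            continue
        # Assumption: an empty/falsy output means the empty-state path was taken.
        results[name] = {"crash_free": True, "populated": bool(output), "error": None}
    return results


# Toy stand-ins for analytical functions, used only to exercise the harness.
toy_functions = {
    "annual_production": lambda records: [len(records)],
    "co_citation": lambda records: [],       # no CR data: empty state
    "broken": lambda records: 1 / 0,         # simulated crash
}
report = audit(toy_functions, [{"PY": 2020}])
crash_free = sum(entry["crash_free"] for entry in report.values())
populated = sum(entry["populated"] for entry in report.values())
print(f"{crash_free} / {len(report)} crash-free, {populated} / {len(report)} populated")
```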

Validation Audit Table

| Source Collection | Records | Schema Validated | Crash-Free Execution | Populated Outputs | Key Reason for Empty Outputs |
| --- | --- | --- | --- | --- | --- |
| Scopus (CSV) | 1,000 | ✅ | 26 / 26 | 26 / 26 | None (Full bibliography `CR` & keywords `DE` present) |
| Lens (CSV) | 100 | ✅ | 26 / 26 | 26 / 26 | None (Full bibliography `CR` & keywords `DE` present) |
| Dimensions (XLSX) | 500 | ✅ | 26 / 26 | 22 / 26 | Missing `CR` (citation data not exported in free tiers) |
| PubMed (TXT) | 10,000 | ✅ | 26 / 26 | 22 / 26 | Missing `CR` (citation data not exported by PubMed) |
| Cochrane (TXT) | 151 | ✅ | 26 / 26 | 22 / 26 | Missing `CR` (citation data not exported by Cochrane) |
| OpenAlex API | 5 | ✅ | 26 / 26 | 26 / 26 | None (Full OpenAlex JSON includes `CR` matches) |
| PubMed API | 5 | ✅ | 26 / 26 | 22 / 26 | Missing `CR` (PubMed API does not serve raw bibliography) |

How to Test

# Run the Jupyter validation notebook containing the automated tests for all 26 analytical functions and 7 core services
jupyter notebook validation.ipynb

MohamedAliBadawy added 2 commits May 29, 2026 11:39

implement unified source-agnostic ETL pipeline & analytical patches

8ceb4a3

fix: resolve three-field plot duplicate fields selected crashes

72c3e0b

MohamedAliBadawy wants to merge 2 commits into PRAISELab-PicusLab:main from MohamedAliBadawy:main
