Implement ETL pipeline for OpenAlex data standardization by hadimobini00-ship-it · Pull Request #7 · PRAISELab-PicusLab/bibliometrix-python

hadimobini00-ship-it · 2026-05-30T15:04:00Z

Project Report: ETL Pipeline for Bibliometrix-Python

Student: [Your Name]
Course: Hardware & Software (2nd Semester)
Date: May 2026

1. Architectural Approach

I implemented a modular ETL pipeline (Extract → Transform → Load) with clear separation of concerns:

api_retriever.py – extracts raw data from the OpenAlex REST API.
dispatcher.py – orchestrates extraction of nested fields (authors, affiliations, keywords, concepts, cited references, source, abstract) using a mapping dictionary.
mapping.py – defines the mapping from OpenAlex source fields to standard Web of Science (WoS) tags.
validator.py – enforces data types, converts multi‑value fields to Python lists, replaces NaNs, and computes the Short Reference (SR).
main.py – single entry point to run the pipeline.
test_functions.py – validates that the standardized CSV works with the bibliometrix‑python analytical functions.

This design keeps the system maintainable (each module can be updated without breaking the others) and extensible (new data sources can be added by extending the mapping dictionary).
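A minimal sketch of how the three stages could be wired together, to illustrate the separation of concerns described above. The names `run_pipeline`, `fake_extract`, and `fake_transform` are hypothetical and do not come from the submitted modules; the real stages would call the OpenAlex API and write a CSV.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


def run_pipeline(
    extract: Callable[[], Iterable[Record]],
    transform: Callable[[Record], Record],
    load: Callable[[List[Record]], None],
) -> int:
    """Run extract -> transform -> load and return the number of rows loaded."""
    rows = [transform(raw) for raw in extract()]
    load(rows)
    return len(rows)


# Hypothetical stand-in stages, used only to show the wiring.
def fake_extract():
    yield {"display_name": "A paper", "publication_year": 2020}


def fake_transform(raw):
    # Rename OpenAlex-style keys to WoS-style tags.
    return {"TI": raw["display_name"], "PY": raw["publication_year"]}


sink = []  # stands in for the CSV writer
count = run_pipeline(fake_extract, fake_transform, sink.extend)
```

Because each stage is passed in as a callable, any one of them can be swapped or tested in isolation without changing the other two.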

2. Mapping Strategy

The mapping.py dictionary maps OpenAlex fields to WoS tags. For nested fields (e.g., primary_location.source.display_name → SO), the dispatcher contains custom extraction functions that safely traverse the JSON‑like structures.
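The safe traversal of nested fields can be sketched as follows. This is an illustration under assumptions, not the contents of mapping.py or dispatcher.py: the dictionary `FIELD_MAP`, its tag-to-path direction, and the helpers `get_path` and `apply_mapping` are hypothetical, and only a few fields are shown.

```python
# Hypothetical excerpt: WoS tag -> dotted path inside an OpenAlex work record.
FIELD_MAP = {
    "TI": "display_name",
    "PY": "publication_year",
    "TC": "cited_by_count",
    "SO": "primary_location.source.display_name",
}


def get_path(record, dotted_path, default=None):
    """Follow a dotted path through nested dicts; return default on any gap."""
    current = record
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or current.get(key) is None:
            return default
        current = current[key]
    return current


def apply_mapping(record, field_map=FIELD_MAP):
    """Build one flat row keyed by WoS tag."""
    return {tag: get_path(record, path) for tag, path in field_map.items()}


work = {
    "display_name": "Example title",
    "publication_year": 2021,
    "cited_by_count": 3,
    "primary_location": {"source": None},  # nested value is missing
}
row = apply_mapping(work)
```

Here the missing source yields `None` for SO instead of raising a `TypeError` or `KeyError`, which is the point of traversing defensively.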

3. Type Enforcement & Validation

validator.py ensures:

TC (times cited) and PY (publication year) are integers (NaNs become 0).
Multi‑value columns (AU, C1, CR, DE, ID) are stored as Python lists.
SR is calculated following the logic of the original format_sr_column function from the library.
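The first two rules above can be sketched on a single record as shown below. This is a dependency-free illustration, not the code in validator.py: the helpers `to_int`, `to_list`, and `validate_row` are hypothetical, and the `;` separator for multi-value strings is an assumption. The SR computation is left out because the report does not spell out its logic.

```python
import math

INTEGER_TAGS = ("TC", "PY")
MULTI_VALUE_TAGS = ("AU", "C1", "CR", "DE", "ID")


def to_int(value):
    """Coerce to int; None, NaN, infinities, and unparseable values become 0."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return 0
    return int(number) if math.isfinite(number) else 0


def to_list(value, sep=";"):
    """Return a list of stripped, non-empty strings (empty list for missing)."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return []
    items = value if isinstance(value, (list, tuple)) else str(value).split(sep)
    return [str(item).strip() for item in items if str(item).strip()]


def validate_row(row):
    """Return a copy of row with integer and multi-value tags normalized."""
    clean = dict(row)
    for tag in INTEGER_TAGS:
        clean[tag] = to_int(row.get(tag))
    for tag in MULTI_VALUE_TAGS:
        clean[tag] = to_list(row.get(tag))
    return clean


row = validate_row(
    {"TC": float("nan"), "PY": "2021", "AU": "DOE J; ROE R", "DE": None}
)
```

Normalizing every multi-value tag to a list, including an empty list for missing values, means downstream code can iterate over the column without checking for NaN first.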

4. Patching Example

The original get_relevantauthors.py failed due to a NaN in max_x. I added a guard clause:

# Guard: fall back to 0 when max_x is NaN.
if pd.isna(max_x):
    max_x = 0
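The report does not say where the NaN came from or how it caused the failure. As a hedged illustration of why such a guard helps, the sketch below shows one common way a NaN breaks downstream code: converting it to an integer raises `ValueError`. The function `axis_limit` is hypothetical, and `math.isnan` is used in place of `pd.isna` only to keep the example free of dependencies.

```python
import math


def axis_limit(max_x):
    """Return an integer upper bound, treating NaN as 0 (same idea as the guard)."""
    if math.isnan(max_x):
        max_x = 0
    return int(max_x)


# Without a guard, converting NaN to int raises ValueError.
try:
    int(float("nan"))
    unguarded_failed = False
except ValueError:
    unguarded_failed = True

guarded = axis_limit(float("nan"))
```

Note that falling back to 0 hides the missing value rather than reporting it, so this kind of guard is best suited to display or scaling logic where 0 is a harmless default.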

5. Evidence

All assigned analytical functions run without errors on the standardized CSV.

Test output:

✅ get_annual_production
✅ get_average_citations
✅ get_relevant_sources
✅ get_relevant_authors

The Jupyter notebook (Bibliometrix_ETL_Documentation.ipynb) contains the full code, output, and explanations.
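A check list like the test output above could be produced by a small harness along these lines. This is a sketch under assumptions, not the contents of test_functions.py: `run_checks` is hypothetical, and the two functions passed to it are stand-ins that do not reproduce the library's behavior.

```python
def run_checks(functions, data):
    """Call each function on data and return one status line per function."""
    report = []
    for func in functions:
        try:
            func(data)
        except Exception as exc:  # record the failure and keep going
            report.append(f"❌ {func.__name__} ({exc!r})")
        else:
            report.append(f"✅ {func.__name__}")
    return report


# Hypothetical stand-ins for analytical functions.
def fake_annual_production(rows):
    return sorted({row["PY"] for row in rows})


def broken_function(rows):
    raise ValueError("simulated failure")


report = run_checks([fake_annual_production, broken_function], [{"PY": 2020}])
```

A harness of this kind only shows that a function did not raise an exception; asserting on the returned values would give stronger evidence that the standardized data is interpreted correctly.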

6. Conclusion

The ETL pipeline meets both the Base Level (API extraction) and Advanced Level (full standardization) requirements of the exam. It produces a clean, WoS‑compliant CSV on which all tested bibliometrix‑python functions run without crashing.

Add ETL pipeline for OpenAlex -> WoS standardization

Add files via upload

75b9bce

hadimobini00-ship-it wants to merge 1 commit into PRAISELab-PicusLab:main from hadimobini00-ship-it:main
