
Implement ETL pipeline for OpenAlex data standardization #7

Open
hadimobini00-ship-it wants to merge 1 commit into PRAISELab-PicusLab:main from hadimobini00-ship-it:main


@hadimobini00-ship-it

Project Report: ETL Pipeline for Bibliometrix-Python

Student: [Your Name]
Course: Hardware & Software (2nd Semester)
Date: May 2026

1. Architectural Approach

I implemented a modular ETL pipeline (Extract → Transform → Load) with clear separation of concerns:

  • api_retriever.py – extracts raw data from the OpenAlex REST API.
  • dispatcher.py – orchestrates extraction of nested fields (authors, affiliations, keywords, concepts, cited references, source, abstract) using a mapping dictionary.
  • mapping.py – defines the mapping from OpenAlex source fields to standard Web of Science (WoS) tags.
  • validator.py – enforces data types, converts multi‑value fields to Python lists, replaces NaNs, and computes the Short Reference (SR).
  • main.py – single entry point to run the pipeline.
  • test_functions.py – validates that the standardized CSV works with the bibliometrix‑python analytical functions.

This design keeps the system maintainable (one module can be updated without breaking the others) and extensible (new data sources can be added by extending the mapping dictionary).
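The separation of stages can be sketched as below. This is a minimal illustration, not the code in main.py: `run_pipeline` and the `fake_*` stage functions are hypothetical names, and the real modules may be wired together differently.

```python
from typing import Any, Callable, Dict, Iterable, List

Record = Dict[str, Any]


def run_pipeline(
    extract: Callable[[], Iterable[Record]],
    transform: Callable[[Record], Record],
    validate: Callable[[Record], Record],
    load: Callable[[List[Record]], None],
) -> List[Record]:
    """Run the stages in order; each stage only sees the previous stage's output."""
    rows = [validate(transform(raw)) for raw in extract()]
    load(rows)
    return rows


# Stand-in stages (hypothetical, for illustration only).
def fake_extract() -> List[Record]:
    return [{"title": "A study", "publication_year": 2020}]


def fake_transform(raw: Record) -> Record:
    return {"TI": raw.get("title"), "PY": raw.get("publication_year")}


def fake_validate(row: Record) -> Record:
    return {**row, "PY": int(row["PY"] or 0)}


saved: List[Record] = []
result = run_pipeline(fake_extract, fake_transform, fake_validate, saved.extend)
```

Because each stage is passed in as a function, any one of them can be swapped (for example, a different extractor) without changing the others.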

2. Mapping Strategy

The mapping.py dictionary maps OpenAlex fields to WoS tags. For nested fields (e.g., primary_location.source.display_name → SO), the dispatcher contains custom extraction functions that safely traverse the JSON‑like structures.
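A minimal sketch of this idea follows, assuming the mapping is keyed by dotted paths. `FIELD_MAP`, `get_nested`, and `apply_mapping` are hypothetical names; only the `primary_location.source.display_name` → SO pair comes from the report, and the PY and TC entries are assumed for illustration.

```python
from typing import Any, Dict

# Hypothetical subset of the mapping: dotted OpenAlex path -> WoS tag.
FIELD_MAP: Dict[str, str] = {
    "primary_location.source.display_name": "SO",
    "publication_year": "PY",   # assumed pairing
    "cited_by_count": "TC",     # assumed pairing
}


def get_nested(record: Any, dotted_path: str, default: Any = None) -> Any:
    """Walk a dotted path through nested dicts, returning default on any gap."""
    current = record
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return default if current is None else current


def apply_mapping(record: Dict[str, Any]) -> Dict[str, Any]:
    """Build a flat row keyed by WoS tag from one OpenAlex-style record."""
    return {tag: get_nested(record, path) for path, tag in FIELD_MAP.items()}


# A record whose source is missing: traversal stops early instead of raising.
work = {
    "primary_location": {"source": None},
    "publication_year": 2021,
    "cited_by_count": 5,
}
row = apply_mapping(work)
```

Returning a default instead of raising keeps a single incomplete record from aborting the whole batch; the validator can then decide how to fill the gap.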

3. Type Enforcement & Validation

validator.py ensures:

  • TC (times cited) and PY (publication year) are integers (NaNs become 0).
  • Multi‑value columns (AU, C1, CR, DE, ID) are stored as Python lists.
  • SR is calculated following the logic of the original format_sr_column function from the library.
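The first two rules can be sketched with the standard library alone. This is an illustration under assumptions, not the contents of validator.py: `to_int` and `to_list` are hypothetical helpers, and the `;` separator for multi‑value strings is assumed. The SR computation is omitted because it depends on the library's format_sr_column logic.

```python
import math
from typing import Any, List


def to_int(value: Any) -> int:
    """Coerce to int; None, NaN, infinities, and unparseable values become 0."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return 0
    return int(number) if math.isfinite(number) else 0


def to_list(value: Any, sep: str = ";") -> List[str]:
    """Normalize a multi-value field to a list of stripped, non-empty strings."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return []
    if isinstance(value, (list, tuple)):
        items = [str(item).strip() for item in value]
    else:
        items = [part.strip() for part in str(value).split(sep)]  # assumed separator
    return [item for item in items if item]


row = {"TC": float("nan"), "PY": "2019", "AU": "Rossi M; Bianchi L;"}
clean = {"TC": to_int(row["TC"]), "PY": to_int(row["PY"]), "AU": to_list(row["AU"])}
```

Note that mapping a missing PY to 0 keeps the column integer‑typed, at the cost of a sentinel year that downstream plots may need to filter out.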

4. Patching Example

The original get_relevantauthors.py failed due to a NaN in max_x. I added a guard clause:

if pd.isna(max_x):
    max_x = 0

5. Evidence

All assigned analytical functions run without errors on the standardized CSV.

Test output:

✅ get_annual_production
✅ get_average_citations
✅ get_relevant_sources
✅ get_relevant_authors
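A smoke test of this kind could be structured roughly as follows. This is a sketch, not the actual test_functions.py: `run_checks` is a hypothetical helper, and the two stand‑in functions replace the real bibliometrix‑python calls, whose signatures are not shown in this report.

```python
from typing import Any, Callable, Dict


def run_checks(functions: Dict[str, Callable[[Any], Any]], data: Any) -> Dict[str, bool]:
    """Call each function on the data and record whether it ran without raising."""
    results: Dict[str, bool] = {}
    for name, func in functions.items():
        try:
            func(data)
        except Exception as exc:  # a smoke test only cares that nothing crashes
            print(f"FAIL {name}: {exc!r}")
            results[name] = False
        else:
            print(f"PASS {name}")
            results[name] = True
    return results


# Stand-ins for the analytical functions (hypothetical).
def always_ok(rows: Any) -> int:
    return len(rows)


def always_fails(rows: Any) -> None:
    raise ValueError("NaN encountered")


outcome = run_checks(
    {"always_ok": always_ok, "always_fails": always_fails},
    [{"PY": 2020, "TC": 3}],
)
```

A check like this only shows that the functions do not raise; it does not verify that their numeric output is correct.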

The Jupyter notebook (Bibliometrix_ETL_Documentation.ipynb) contains the full code, output, and explanations.

6. Conclusion

The implemented ETL pipeline meets both the Base Level (API extraction) and Advanced Level (full standardization) requirements of the exam. It produces a clean, WoS‑compliant CSV on which all tested bibliometrix‑python functions run without crashing.

Add ETL pipeline for OpenAlex -> WoS standardization
