ktalpay/CarbonOps-Parser

CarbonOps-Parser

Auditable public carbon emission factor ingestion and validation for climate-tech data infrastructure.

CarbonOps-Parser is a public, reviewable climate-tech data ingestion project for carbon accounting source data. It focuses on auditable ingestion, parsing, validation, diagnostics, and PostgreSQL operation for public emission factors from GHG Protocol, DEFRA/DESNZ, and IPCC EFDB, with parallel Python and .NET runtime evidence. The Python package includes a configured PostgreSQL ingestion runtime for operator-managed deployments.

At the project level, CarbonOps-Parser is production-ready only within the narrow supported scope documented in the final verdict: operator-run or scheduled Python ingestion, PostgreSQL-backed source-specific persistence, and .NET parity evidence covering the service entrypoint, config/redaction, schema/year-state, source-cycle orchestration, persistence, Docker PostgreSQL E2E, and persisted parity validation.

The repository is intentionally conservative. Default examples are deterministic and local-only, and production runs require explicit configuration and credentials. The project makes no claim of production carbon-accounting, legal, compliance, or source-owner correctness, and no claim that any factor is correct.

The project is independent from carbonops-assistant. It is not a continuation, module, plugin, or dependency of that project.

Start Here

Problem Statement

Public carbon emissions workflows often depend on emission factor spreadsheets, databases, and reference documents that change over time and vary by source family. CarbonOps-Parser exists to make carbon factor ingestion reviewable: source identity, version or checksum evidence, parser output, validation issues, persistence readiness, and diagnostics should be visible before any operational use. The project is infrastructure for data ingestion and validation, not an emissions calculator or compliance decision engine.

Current Status

CarbonOps-Parser is in Phase 1 and has a narrow project-level production-ready status for the documented operator path. The repository contains an active Python ingestion runtime, .NET parity evidence, PostgreSQL schema/runtime boundaries, deterministic examples, local dry-run validation, and production operator documentation. The production-ready claim applies only to the scope documented in Final Project Production-Ready Verdict and Production Parity Contract. It is not a published package release.

Explicit Non-Claims

CarbonOps-Parser does not claim to be:

  • A production carbon-accounting calculator or emissions reporting engine.
  • Legal, compliance, audit, or regulatory advice.
  • A source-owner correctness guarantee for GHG Protocol, DEFRA/DESNZ, IPCC EFDB, or any source document.
  • A universal carbon factor model across all source families.
  • A published package, unless release/package files and repository releases prove otherwise.

| Area | Phase 1 completed capabilities | Phase 2 roadmap |
| --- | --- | --- |
| Source families | Local fixture and contract coverage for GHG Protocol, DEFRA/DESNZ, and IPCC EFDB boundaries. | Broader source onboarding rules, fixture policy, and source-family hardening slices. |
| Python | Source acquisition contracts, parser contracts, DEFRA/DESNZ fixture parser path, normalization handoff, persistence previews, diagnostics, and local dry-run CLI. | Runtime hardening, richer validation, controlled source expansion, and opt-in execution boundaries. |
| .NET | Service entrypoint, config/redaction, PostgreSQL schema/year-state, source-cycle orchestration, source-specific persistence, Docker PostgreSQL E2E, and persisted parity validation baselines. | Runtime parity review where shared behavior changes; package/service promotion remains separately scoped. |
| PostgreSQL | Schema descriptors, DDL preview, additive runtime bootstrap, configured Python source-family writes, idempotent duplicate skipping, and opt-in integration boundaries. | Broader migration, rollback, recovery, and operational hardening slices. |
| Safety posture | Local-only examples, non-destructive dry runs, preview-only SQL, no default network calls, and no production credentials. | Release-gate expansion and production-readiness reviews before live source or write-path promotion. |

Users who clone or fork the repository should be able to inspect either implementation path without relying on production infrastructure.

Phase 1 Scope

Phase 1 focuses on scheduled ingestion and parsing for:

| Source family | Public discovery value | Phase 1 posture |
| --- | --- | --- |
| GHG Protocol | Greenhouse Gas Protocol tools and factor workbooks used in carbon accounting workflows. | Source discovery/download contracts, parser contracts, normalized content parser boundaries, and parity tests. |
| DEFRA/DESNZ | UK government conversion factors used for carbon emissions and greenhouse gas reporting workflows. | Deterministic local fixture parser and normalization path plus source discovery/download contracts. |
| IPCC EFDB | IPCC Emission Factor Database source family with heterogeneous emission factor records. | Source discovery/download contracts, parser contracts, normalized content parser boundaries, and parity tests. |

The intended Phase 1 workflow is:

  1. Read configuration.
  2. Validate the database provider.
  3. Connect to PostgreSQL.
  4. Check whether required tables exist.
  5. Create missing tables if needed.
  6. Initialize source schedules.
  7. Check source version and file hash.
  8. Download a source document when a new version or hash is detected.
  9. Archive the raw source file.
  10. Parse source-specific structures.
  11. Validate parsed records.
  12. Persist shared ingestion metadata and source-specific records.
  13. Store import summaries and validation issues.
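Steps 7 and 8 depend on comparing a newly observed file hash with the last recorded one. The following is a minimal sketch of that decision, assuming SHA-256 checksums; `sha256_hex` and `needs_download` are hypothetical names used for illustration and are not the project's API.

```python
import hashlib
from typing import Optional


def sha256_hex(content: bytes) -> str:
    """Return the SHA-256 hex digest of already-loaded content."""
    return hashlib.sha256(content).hexdigest()


def needs_download(last_recorded_hash: Optional[str], candidate_hash: str) -> bool:
    """Treat a source as changed when no hash is recorded or the hashes differ."""
    return last_recorded_hash is None or last_recorded_hash != candidate_hash


# Artificial in-memory content; no files are read and no network calls are made.
first = sha256_hex(b"factor,value\nexample,1.0\n")
revised = sha256_hex(b"factor,value\nexample,1.1\n")

print(needs_download(None, first))     # first run: nothing recorded yet
print(needs_download(first, first))    # unchanged content
print(needs_download(first, revised))  # changed content
```

Keeping the comparison as a pure function lets the change decision be reviewed and tested without a database or a live source.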

Architecture At A Glance

source schedule
  -> version/hash check
  -> download when changed
  -> raw file archive
  -> source-specific parser
  -> validation
  -> PostgreSQL persistence
  -> import summary and validation issues

Phase 1 uses shared ingestion metadata tables plus source-specific master/detail tables. It does not force GHG Protocol, DEFRA/DESNZ, and IPCC EFDB into one canonical factor table. A normalized or search-oriented projection may be considered in a later phase.

The Python path under src/carbonfactor_parser holds the current implementation boundaries for source acquisition, parser execution, normalization, PostgreSQL persistence previews, configured PostgreSQL ingestion, local dry-run composition, and diagnostics. The .NET path under src/dotnet holds shared contract records and parity tests for the same public concepts. PostgreSQL support includes schema descriptors, bootstrap/readiness checks, DDL previews, opt-in integration boundaries, and the Python configured cycle runner. Parity, validation, diagnostics, and non-destructive dry-run behavior are part of the public architecture so reviewers can inspect the handoff from source artifact to parser output to persistence input without connecting to a database or making network calls.

Implementation Options

Python

The Python implementation is the active Phase 1 path for source discovery contracts, parser mapping, validation, normalization handoff, persistence previews, and data engineering workflows.

The active Python runtime path lives under src/carbonfactor_parser and exposes the local dry-run CLI plus the configured carbonops-parser run-ingestion operator command for PostgreSQL-backed source-family ingestion. The initial Python source adapter contracts and in-memory registry live under src/carbonfactor_parser/source_adapters.

.NET

The .NET implementation is an independent Worker Service path that follows the same conceptual workflow with .NET-oriented application structure. The reviewed production scope treats .NET as parity-validated through its service entrypoint, configuration/redaction, PostgreSQL schema/year-state, source-cycle orchestration, source-specific persistence, Docker PostgreSQL E2E, and persisted parity baselines.

See src/dotnet/README.md.

Install And Local Dry-Run Quickstart

From a fresh checkout or local working copy:

git clone <REPOSITORY_URL> CarbonOps-Parser
cd CarbonOps-Parser
python -m pip install -e .

Run the test suite if you want a quick local smoke check:

python -m pytest

Run the checked-in DEFRA/DESNZ fixture through the local dry-run CLI:

carbonops-parser local-dry-run \
  --local-path examples/fixtures/defra_desnz_minimal.csv \
  --source-family defra_desnz \
  --source-id defra-desnz-minimal-fixture \
  --content-type text/csv \
  --format-hint csv

Expected summary:

status=success
parsed_record_count=2
normalization_record_count=2
persistence_input_record_count=2
ddl_preview_present=True
issue_count=0

Run the JSON variant:

carbonops-parser local-dry-run \
  --local-path examples/fixtures/defra_desnz_minimal.csv \
  --source-family defra_desnz \
  --source-id defra-desnz-minimal-fixture \
  --content-type text/csv \
  --format-hint csv \
  --json

Key output fields:

  • status: dry-run outcome such as success, failed, unsupported, or no_records
  • parsed_record_count: records parsed by the minimal local DEFRA/DESNZ fixture parser
  • normalization_record_count: records produced by the minimal fixture normalization mapper
  • persistence_input_record_count: records prepared as PersistenceInput
  • ddl_preview_present: whether review-only PostgreSQL DDL preview text is attached
  • issues: structured local loader, parser, normalization, or persistence-input issues
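In text mode these fields arrive as key=value lines, so a smoke check can read them without JSON. The sketch below parses the expected summary shown earlier and checks that the three record counts agree; `parse_summary` is a hypothetical helper, and treating equal counts as the sign of a clean run is an assumption drawn only from this fixture's expected output.

```python
def parse_summary(text: str) -> dict[str, str]:
    """Parse key=value lines into a dict, skipping blank or malformed lines."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key:
            fields[key] = value
    return fields


# Copied from the expected text summary for the checked-in fixture.
expected_summary = """\
status=success
parsed_record_count=2
normalization_record_count=2
persistence_input_record_count=2
ddl_preview_present=True
issue_count=0
"""

summary = parse_summary(expected_summary)
counts = {
    summary["parsed_record_count"],
    summary["normalization_record_count"],
    summary["persistence_input_record_count"],
}
# Assumption: a clean fixture run carries every parsed record through each stage.
clean_run = (
    summary["status"] == "success"
    and len(counts) == 1
    and summary["issue_count"] == "0"
)
print(clean_run)
```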

Optionally include PostgreSQL insert preview data in text output:

carbonops-parser local-dry-run \
  --local-path examples/fixtures/defra_desnz_minimal.csv \
  --source-family defra_desnz \
  --source-id defra-desnz-minimal-fixture \
  --content-type text/csv \
  --format-hint csv \
  --include-postgresql-preview

Trimmed expected preview lines:

postgresql_preview_included=True
postgresql_preview_status=ready
postgresql_preview_only=True
postgresql_preview_sql_execution=False
postgresql_preview_database_connection=False
postgresql_preview_target_table=normalized_records
postgresql_preview_record_count=2
postgresql_preview_sql=INSERT INTO normalized_records (source_family, source_id, record_id, record_index, row_number, normalized_fields, source_reference, source_artifact_reference, source_checksum_sha256, parser_metadata, normalization_metadata, created_at, updated_at) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
postgresql_preview_issue_count=0

Run the JSON PostgreSQL preview variant:

carbonops-parser local-dry-run \
  --local-path examples/fixtures/defra_desnz_minimal.csv \
  --source-family defra_desnz \
  --source-id defra-desnz-minimal-fixture \
  --content-type text/csv \
  --format-hint csv \
  --json \
  --include-postgresql-preview

Trimmed expected JSON preview section:

{
  "postgresql_persistence_preview": {
    "included": true,
    "preview_only": true,
    "sql_execution": false,
    "database_connection": false,
    "status": "ready",
    "target_table": "normalized_records",
    "record_count": 2,
    "ordered_columns": [
      "source_family",
      "source_id",
      "record_id",
      "record_index",
      "row_number",
      "normalized_fields",
      "source_reference",
      "source_artifact_reference",
      "source_checksum_sha256",
      "parser_metadata",
      "normalization_metadata",
      "created_at",
      "updated_at"
    ],
    "idempotency_key_fields": [
      "source_family",
      "source_id",
      "record_id",
      "source_artifact_reference",
      "source_checksum_sha256"
    ],
    "issues": []
  }
}

The postgresql_persistence_preview section is preview-only. It includes the target table, ordered columns, parameter rows, record count, SQL text with placeholders, and idempotency metadata, but it does not execute SQL or persist records. No PostgreSQL server, database configuration, or credentials are required.
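To show how the preview fields relate to each other, the sketch below rebuilds the placeholder SQL from ordered_columns and derives an idempotency key from idempotency_key_fields. It is an illustration and not the project's insert SQL builder: `build_insert_sql` and `idempotency_key` are hypothetical names, and the record values (including `example-record-1`) are made up.

```python
ordered_columns = [
    "source_family", "source_id", "record_id", "record_index", "row_number",
    "normalized_fields", "source_reference", "source_artifact_reference",
    "source_checksum_sha256", "parser_metadata", "normalization_metadata",
    "created_at", "updated_at",
]
idempotency_key_fields = [
    "source_family", "source_id", "record_id",
    "source_artifact_reference", "source_checksum_sha256",
]


def build_insert_sql(table: str, columns: list[str]) -> str:
    """Build INSERT text with one %s placeholder per column.

    Only fixed identifiers are formatted into the text; record values are
    never interpolated and would be passed separately as parameters.
    """
    column_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})"


def idempotency_key(record: dict, key_fields: list[str]) -> tuple:
    """Project a record onto the key fields, in the listed order."""
    return tuple(record[field] for field in key_fields)


sql = build_insert_sql("normalized_records", ordered_columns)

# Hypothetical record: every value here is illustrative, not fixture output.
record = {column: None for column in ordered_columns}
record.update(
    source_family="defra_desnz",
    source_id="defra-desnz-minimal-fixture",
    record_id="example-record-1",
    source_artifact_reference="examples/fixtures/defra_desnz_minimal.csv",
    source_checksum_sha256="a" * 64,
)
key = idempotency_key(record, idempotency_key_fields)
print(sql)
print(key)
```

Two records with the same key tuple would count as duplicates under this reading, which is one way the idempotent duplicate skipping listed among the PostgreSQL capabilities could be checked in review.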

This quickstart is local dry-run only. It does not connect to PostgreSQL, write records, execute SQL, run migrations, perform network calls, trigger source acquisition, load config files, or require credentials. It does not make production DEFRA/DESNZ correctness claims.

Production Operator Command

The supported Python production entrypoint is:

carbonops-parser run-ingestion \
  --config /etc/carbonops-parser/ingestion.production.json \
  --cycles 1

Before running it, operators must provide explicit CARBONOPS_POSTGRESQL_* environment values, including the password through an external secret boundary, and validate the config:

carbonops-parser validate-ingestion-config \
  --config /etc/carbonops-parser/ingestion.production.json \
  --cycles 1

See Production Packaging And Operator Runbook for install, configuration, PostgreSQL readiness, cron scheduling, verification SQL, rerun/idempotency checks, and troubleshooting.
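Before invoking either command, an operator may want to confirm which CARBONOPS_POSTGRESQL_* values are present without printing them. The sketch below is a hypothetical pre-flight helper, not part of the package: only the prefix comes from the text above, and the variable names in the sample mapping are invented for illustration.

```python
from typing import Mapping

PREFIX = "CARBONOPS_POSTGRESQL_"


def redacted_postgresql_settings(environ: Mapping[str, str]) -> dict[str, str]:
    """Return prefixed settings with every value masked for safe logging."""
    return {
        name: "<set>" if value else "<empty>"
        for name, value in sorted(environ.items())
        if name.startswith(PREFIX)
    }


# Hypothetical variable names; the real names are defined by the runbook.
sample_environ = {
    "CARBONOPS_POSTGRESQL_HOST": "db.internal.example",
    "CARBONOPS_POSTGRESQL_PASSWORD": "not-a-real-secret",
    "CARBONOPS_POSTGRESQL_DATABASE": "",
    "PATH": "/usr/bin",
}

redacted = redacted_postgresql_settings(sample_environ)
for name, state in redacted.items():
    print(f"{name}={state}")
```

In real use the mapping would be `os.environ`. Masking every value, not only the password, keeps host and database names out of logs as well.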

This is the supported Python runtime production operator path. Project-level production-ready is limited to the scope in Final Project Production-Ready Verdict and Production Parity Contract.

For boundary details, see Local Dry-Run CLI Boundary, Local File Normalized Persistence Dry-Run Boundary, PostgreSQL Persistence Preview Boundary, and Local Dry-Run Troubleshooting.

To run the packaged Python ingestion cycle against local Docker PostgreSQL with the three checked-in source fixture families, see Python Ingestion Local Runbook.

Developer Tests

Run the lightweight Python test suite from the repository root:

python -m pytest

Pytest configuration is kept in pyproject.toml, including the src package import path used by the tests.

Public API Examples

The carbonfactor_parser.source_adapters package exposes source adapter contracts and lightweight helpers for tests, prototypes, and implementation slices.

Hash source content without reading or downloading files:

from carbonfactor_parser.source_adapters import (
    sha256_hex_from_bytes,
    sha256_hex_from_text,
)

content_hash = sha256_hex_from_bytes(b"sample source content")
note_hash = sha256_hex_from_text("sample metadata note")

Create and validate metadata for an existing local file:

from pathlib import Path

from carbonfactor_parser.source_adapters import (
    SourceFamily,
    build_source_document_from_file,
    validate_source_document_metadata,
)

document = build_source_document_from_file(
    source_family=SourceFamily.DEFRA_DESNZ,
    source_name="Example local factor file",
    file_path=Path("data/raw/example/source.csv"),
)

metadata_issues = validate_source_document_metadata(document)

Create and validate an ingestion summary contract:

from carbonfactor_parser.source_adapters import (
    SourceFamily,
    create_ingestion_run_summary,
    validate_ingestion_run_summary,
)

summary = create_ingestion_run_summary(
    ingestion_id="example-run-001",
    source_family=SourceFamily.DEFRA_DESNZ,
    source_name="Example local factor file",
)

summary_issues = validate_ingestion_run_summary(summary)

Use the artificial-only source acquisition validation pipeline with in-memory metadata:

from carbonfactor_parser import (
    create_artificial_source_acquisition_metadata,
    validate_and_summarize_artificial_source_acquisition_metadata,
)

metadata = create_artificial_source_acquisition_metadata(
    source_family="artificial_source_acquisition",
    logical_source_name="artificial-in-memory-source",
    declared_content_type="text/csv",
    checksum_sha256="a" * 64,
    acquired_at_label="static-artificial-acquisition-label",
)

pipeline_result = validate_and_summarize_artificial_source_acquisition_metadata(
    metadata,
)
issue_count = pipeline_result.summary.total_issue_count

This pipeline is limited to artificial metadata shape checks and deterministic summaries. It does not acquire real sources, read files, validate real source URLs, run parsers or normalization, or check factor correctness, and it makes no compliance, legal, or carbon accounting correctness claims. See docs/artificial-source-acquisition-validation-pipeline.md, docs/artificial-source-acquisition-module-recap.md, and examples/example_artificial_source_acquisition_validation_pipeline.py.

Source acquisition CLI quickstart

Use the carbonops-source-acquisition CLI for local source descriptor checks and acquisition flow previews.

  • Default run mode is noop and offline.
  • HTTP mode is opt-in with --client http.
  • validate checks local descriptor metadata only; it does not verify live URLs.
  • run --dry-run plans targets only and does not acquire content or write files/manifests.
  • Parser execution and database persistence are outside this CLI boundary at this phase.
carbonops-source-acquisition validate
carbonops-source-acquisition list
carbonops-source-acquisition list --source-id defra_desnz
carbonops-source-acquisition run --dry-run --base-directory ./data/source-acquisition
carbonops-source-acquisition run --output-format json
carbonops-source-acquisition run --client http --source-id ghg_protocol
carbonops-source-acquisition run --client http --source-id ghg_protocol --persist-content --base-directory ./data/source-acquisition

See examples/example_acquisition_artifact_parser_input_mapping.py for a deterministic in-memory example of mapping acquisition artifact metadata into a future parser input boundary without executing a parser.

The parser package exposes:

  • ParserInputContract, create_parser_input_contract(), and validate_parser_input_contract().
  • ParserFileContentInput, local parser file content loading helpers, and parser file content validation helpers.
  • parse_defra_desnz_file_content() and raw parsed record payload contracts.
  • The ParserAdapter protocol, NoopParserAdapter, ArtificialParserAdapter, DefraDesnzParserAdapter, and parser adapter registry helpers.
  • Parser execution planning and runner helpers, and parser execution result contracts for future parser adapter input handoff.

The normalization package exposes parser execution handoff helpers, normalization input helpers for successful parser results with raw payloads, and a minimal DEFRA/DESNZ fixture normalization mapper.

The persistence package exposes:

  • Normalized result persistence input contracts.
  • A logical PostgreSQL schema descriptor.
  • A review-only DDL preview helper.
  • A deterministic insert SQL builder.
  • PostgreSQL persistence preview helpers.
  • Repository protocol/result contracts.
  • An explicit caller-provided PostgreSQL options contract.
  • A default-disabled PostgreSQL integration test boundary.
  • A PostgreSQL repository skeleton that returns unsupported results without database runtime behavior.

The pipeline package exposes a local DEFRA/DESNZ fixture dry-run helper that composes those boundaries to produce PersistenceInput plus DDL preview metadata without DB or network behavior.

These contracts keep acquisition metadata, already-loaded content, raw parser output, parser output metadata, normalization input, normalization handoff metadata, persistence input metadata, schema metadata, repository options metadata, integration test metadata, preview metadata, and repository result metadata separate. They do not include database connection behavior or full source-specific correctness claims.

Examples And Fixtures

The examples entry point is examples/README.md. It identifies deterministic local examples, including the checked-in DEFRA/DESNZ fixture used by the local dry-run quickstart, and separates real examples from future placeholders.

Source Support

Each Phase 1 source family will have its own schedule, source version/hash check, parser, validation rules, archive layout, and source-specific tables.

| Source family | Phase 1 role | Table group |
| --- | --- | --- |
| GHG Protocol | Source-specific parser and workbook/tool mapping | ghg_* |
| DEFRA/DESNZ | Active checked-in fixture and source-specific ingestion slice | defra_* |
| IPCC EFDB | Heterogeneous source discovery and parser mapping | ipcc_* |

See docs/source-support.md and docs/source-discovery.md.

Configuration Summary

The conceptual configuration model includes:

  • Database provider and connection settings.
  • Raw archive path.
  • Source-specific enabled flags.
  • Source-specific schedules with day, week, month, year, time, and timezone support.

Phase 1 implements only postgres as the database provider. mysql and mssql are recognized as conceptual provider names but are not implemented in Phase 1.
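The provider rule can be expressed as a small classification step. This is a sketch under stated assumptions: `check_database_provider` and its return labels are hypothetical, and normalizing case and surrounding whitespace is an assumption that the configuration model may not share.

```python
IMPLEMENTED_PROVIDERS = frozenset({"postgres"})
RECOGNIZED_PROVIDERS = frozenset({"postgres", "mysql", "mssql"})


def check_database_provider(provider: str) -> str:
    """Classify a provider name as implemented, not_implemented, or unknown."""
    name = provider.strip().lower()  # assumption: names are case-insensitive
    if name in IMPLEMENTED_PROVIDERS:
        return "implemented"
    if name in RECOGNIZED_PROVIDERS:
        return "not_implemented"
    return "unknown"


for candidate in ("postgres", "mysql", "mssql", "sqlite"):
    print(candidate, check_database_provider(candidate))
```

Separating "recognized but not implemented" from "unknown" lets a validator tell an operator whether a name is a typo or simply outside Phase 1.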

See docs/configuration-model.md.

The shared conceptual example lives at config/carbonops.config.example.yaml.

Database Model Summary

PostgreSQL is the Phase 1 persistence target. The model includes:

  • Shared ingestion metadata tables: carbon_sources, carbon_source_versions, carbon_import_runs, carbon_raw_files, carbon_validation_issues, and carbon_job_locks.
  • DEFRA/DESNZ tables: defra_categories, defra_subcategories, defra_factor_sets, and defra_factor_values.
  • GHG Protocol tables: ghg_tools, ghg_factor_sheets, ghg_factor_groups, and ghg_factor_values.
  • IPCC EFDB tables: ipcc_sectors, ipcc_categories, ipcc_references, ipcc_factor_records, and ipcc_factor_values.

See docs/database-model.md, docs/database-startup.md, and database/postgres/README.md.

PostgreSQL persistence uses shared ingestion metadata plus source-specific master/detail table groups for GHG Protocol, DEFRA/DESNZ, and IPCC EFDB. That layout preserves source-family structure for reviewable carbon emission factor ingestion instead of claiming one universal carbon accounting factor model.
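The layout can be summarized as a lookup from table name to owning group. The table names below are the ones listed above; the family keys and the `owning_group` helper are illustrative assumptions (`defra_desnz` and `ghg_protocol` appear as identifiers in the CLI examples, while `ipcc_efdb` is assumed by analogy).

```python
from typing import Optional

SHARED_TABLES = (
    "carbon_sources", "carbon_source_versions", "carbon_import_runs",
    "carbon_raw_files", "carbon_validation_issues", "carbon_job_locks",
)

SOURCE_TABLE_GROUPS = {
    "defra_desnz": (
        "defra_categories", "defra_subcategories",
        "defra_factor_sets", "defra_factor_values",
    ),
    "ghg_protocol": (
        "ghg_tools", "ghg_factor_sheets",
        "ghg_factor_groups", "ghg_factor_values",
    ),
    "ipcc_efdb": (
        "ipcc_sectors", "ipcc_categories", "ipcc_references",
        "ipcc_factor_records", "ipcc_factor_values",
    ),
}


def owning_group(table_name: str) -> Optional[str]:
    """Return 'shared', a source family key, or None for an unlisted table."""
    if table_name in SHARED_TABLES:
        return "shared"
    for family, tables in SOURCE_TABLE_GROUPS.items():
        if table_name in tables:
            return family
    return None


print(owning_group("carbon_import_runs"))
print(owning_group("ipcc_factor_records"))
print(owning_group("normalized_factors"))
```

No table belongs to more than one source family, which mirrors the decision not to force all three sources into one canonical factor table.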

Documentation Map

Roadmap Summary

Near-term work keeps the narrow production-ready scope conservative while separating package publication, infrastructure ownership, live-source expansion, and future runtime promotion into separately reviewed tasks.

See docs/roadmap.md and docs/task-breakdown.md.

Governance

Issues and pull requests are welcome for documentation, examples, parser mappings, source discovery, database schema notes, and implementation improvements.

Non-Goals

CarbonOps-Parser does not:

  • Calculate carbon inventories.
  • Produce emissions reports.
  • Replace source-owner documentation or source files.
  • Guarantee source data correctness.
  • Provide a deployment platform.
  • Normalize all source families into one shared factor table during Phase 1.

License

CarbonOps-Parser is licensed under the Apache License 2.0.

About

Scheduled carbon factor ingestion and parsing reference project with independent Python and .NET implementations for GHG Protocol, DEFRA/DESNZ, and IPCC EFDB datasets.
