Geospatial and machine learning workflows for DataKind's Kenya farmland analysis work. The repository includes legacy experiment code and a Kedro project for vegetation-index preprocessing, time-series feature engineering, and crop/land-cover classification.
```
.
├── conf/                     # Kedro catalog, parameters, credentials, logging
│   ├── base/
│   └── local/
├── crop_classification/      # Legacy time-series classification experiments
├── data/                     # Kedro data layers, not committed
├── notebooks/                # Exploratory notebooks
├── src/
│   ├── classification/       # Legacy/non-Kedro classification helpers
│   ├── datakind_geospatial/  # Kedro project package
│   ├── generate_rasters/     # Raster-generation utilities
│   └── segmentation/         # Segmentation workflow code
├── pyproject.toml
└── uv.lock
```
Requirements:
- Python 3.13+
- uv
- Workflow-specific credentials as needed for Earth Engine, AWS/S3/SageMaker, Supabase, or remote MLflow
Install the project:
```shell
uv sync
```

Install notebook extras:

```shell
uv sync --extra notebooks
```

Run Kedro commands through the project entrypoint:

```shell
uv run datakind-geospatial --help
```

Equivalent module entrypoint:

```shell
uv run python -m datakind_geospatial --help
```

The active Kedro pipelines are registered in `src/datakind_geospatial/pipeline_registry.py`.
| Pipeline | Command | Purpose |
|---|---|---|
| `vi_preprocessing` | `uv run datakind-geospatial run --pipeline vi_preprocessing` | Cleans raw NDVI and NDMI partitioned time series. |
| `feature_engineering` | `uv run datakind-geospatial run --pipeline feature_engineering` | Loads training time series and labels, reindexes panel data, and encodes classes. |
| `training` | `uv run datakind-geospatial run --pipeline training` | Runs feature engineering plus classifier training. |
| `classification_training` | `uv run datakind-geospatial run --pipeline classification_training` | Alias for the full classification training workflow. |
| `__default__` | `uv run datakind-geospatial run` | Same as `classification_training`. |
Useful inspection commands:
```shell
uv run datakind-geospatial registry list
uv run kedro catalog list
uv run kedro viz
```

The catalog is defined in `conf/base/catalog.yml`.
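For orientation, a partitioned CSV input in a Kedro catalog typically looks like the sketch below; the dataset name and path are illustrative guesses, not copied from the project's `catalog.yml`:

```yaml
# Hypothetical catalog entry; the project's actual entries may differ.
ndvi_series_raw:
  type: partitions.PartitionedDataset
  path: data/01_raw/ndvi_series_raw
  dataset:
    type: pandas.CSVDataset
  filename_suffix: ".csv"
```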
Expected inputs:
```
data/01_raw/ndvi_series_raw/*.csv
data/01_raw/ndmi_series_raw/*.csv
data/04_train/*.csv
```
Training currently expects partition names configured in `conf/base/parameters.yml`:

```yaml
feature_engineering:
  reindex_train_data:
    data: Trans_Nzoia_1_ndvi_train
    label: Trans_Nzoia_1_label_train
    value_column: ndvi
```

Outputs:

```
data/02_clean/ndvi_series_clean/
data/02_clean/ndmi_series_clean/
data/06_models/trained_classifier_pipeline.pkl
data/08_reporting/training_summary.json
```
Run:
```shell
uv run datakind-geospatial run --pipeline vi_preprocessing
```

Configuration lives under:

```yaml
ndvi_preprocessing:
ndmi_preprocessing:
```

Both inherit defaults from `vi_preprocessing_defaults` in `conf/base/parameters.yml`. The pipeline supports partition and region filtering through:

```yaml
selected_partitions: null
selected_regions: null
```

Set either value to a list when you want to process only a subset.
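As a rough sketch of the subsetting behavior (the function name and the region-prefix convention are assumptions for illustration, not the project's actual implementation):

```python
# Hypothetical illustration of selected_partitions / selected_regions filtering.
def filter_partitions(partition_names, selected_partitions=None, selected_regions=None):
    """Keep only partitions matching the configured subsets (None = keep all)."""
    kept = []
    for name in partition_names:
        if selected_partitions is not None and name not in selected_partitions:
            continue
        # Assume the region is the leading part of the partition name,
        # e.g. "Trans_Nzoia_1_ndvi" belongs to region "Trans_Nzoia".
        if selected_regions is not None and not any(
            name.startswith(region) for region in selected_regions
        ):
            continue
        kept.append(name)
    return kept

names = ["Trans_Nzoia_1_ndvi", "Trans_Nzoia_2_ndvi", "Kakamega_1_ndvi"]
print(filter_partitions(names, selected_regions=["Trans_Nzoia"]))
# ['Trans_Nzoia_1_ndvi', 'Trans_Nzoia_2_ndvi']
```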
Run the full training workflow:
```shell
uv run datakind-geospatial run --pipeline classification_training
```

This workflow performs:

- Load partitioned training data and labels from `data/04_train`.
- Reindex the time-series frame to an sktime-compatible panel index.
- Encode labels from configured class names to integers.
- Extract Catch22/Catch24 features.
- Run stratified cross-validation.
- Optionally run Optuna hyperparameter search.
- Fit a final sklearn pipeline on all training data.
- Save local Kedro outputs and log MLflow metrics/artifacts/model.
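The panel-reindexing step can be sketched with pandas; the column names (`uuid`, `date`, `ndvi`) are assumptions for illustration, since sktime accepts panel data as a DataFrame with a two-level (instance, timepoint) MultiIndex:

```python
import pandas as pd

# Illustrative only: column names are assumed, not the project's actual schema.
raw = pd.DataFrame({
    "uuid": ["a", "a", "b", "b"],
    "date": pd.to_datetime(["2024-01-01", "2024-01-09",
                            "2024-01-01", "2024-01-09"]),
    "ndvi": [0.41, 0.44, 0.30, 0.28],
})

# Reindex to a panel: one row per (instance, timepoint), sorted for determinism.
panel = raw.set_index(["uuid", "date"]).sort_index()
print(panel.index.nlevels)  # 2
```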
Current class encoding:
```yaml
Farm: 0
Field: 1
Other: 2
Tree: 3
```

The training code aligns labels to the feature panel UUID order before CV and final fitting. This is important because the panel can be sorted or filtered independently from the label CSV.
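A minimal sketch of the encoding-and-alignment idea (function and variable names are illustrative, not the project's code):

```python
# Map configured class names to integers, then emit labels in panel UUID order.
CLASS_TO_INT = {"Farm": 0, "Field": 1, "Other": 2, "Tree": 3}

def align_and_encode(panel_uuids, labels_by_uuid):
    """Return encoded labels ordered to match the feature panel's UUIDs."""
    return [CLASS_TO_INT[labels_by_uuid[uuid]] for uuid in panel_uuids]

labels = {"u1": "Farm", "u2": "Tree", "u3": "Other"}
# The panel order may differ from the label CSV order after sorting/filtering.
print(align_and_encode(["u3", "u1", "u2"], labels))
# [2, 0, 3]
```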
Model parameters live under:
```yaml
training:
  active_model: xgboost
  classifiers:
    xgboost:
    lightgbm:
```

Hyperparameter search is controlled by:

```yaml
training:
  hyperparameter_search:
    enabled: true
    n_trials: 10
```

Disable search when you want to run the base model parameters directly:

```yaml
training:
  hyperparameter_search:
    enabled: false
```

MLflow configuration lives under:

```yaml
training:
  mlflow:
    enabled: true
    tracking_uri: http://localhost:5000
    experiment_name: timeseries_classification_local
    artifact_path: timeseries_classifier
    registered_model_name: null
    log_model: true
```

Start a local MLflow server from the repo root:

```shell
uv run mlflow ui --backend-store-uri ./mlruns --host 127.0.0.1 --port 5000
```

Then run training:

```shell
uv run datakind-geospatial run --pipeline classification_training
```

What gets logged:
- Optuna trial runs log trial parameters and `trial.selection_metric`.
- The final training run logs summary metrics, `training_summary.json`, `confusion_matrix.png`, and `precision_recall_curves.png`.
- The final fitted sklearn pipeline is logged as an MLflow model named by `artifact_path`.
With MLflow 3, logged models may appear under the run's Models / Outputs section instead of as a normal folder in the run artifact tree. On disk they can appear under:
```
mlruns/<experiment_id>/models/<model_id>/artifacts/model.pkl
```
If Kedro or kedro-mlflow already has an active parent run, final artifacts may be logged to that active Kedro experiment while Optuna trials appear in `training.mlflow.experiment_name`. If the UI shows only `xgboost-trial-*` runs, check the `datakind_geospatial` experiment for the parent `classification_training` run.
To register the model in the MLflow Model Registry, set:
```yaml
registered_model_name: timeseries_classifier
```

and rerun training.
Run the default classification workflow:
```shell
uv run datakind-geospatial run
```

Run only feature engineering:

```shell
uv run datakind-geospatial run --pipeline feature_engineering
```

Run training without MLflow:

```yaml
training:
  mlflow:
    enabled: false
```

Then:

```shell
uv run datakind-geospatial run --pipeline classification_training
```

Run a faster local smoke test by lowering the trial count:

```yaml
training:
  hyperparameter_search:
    enabled: true
    n_trials: 1
```
Compile the Kedro training package:

```shell
uv run python -m compileall -q src/datakind_geospatial/pipelines/training
```

Run linting on changed files:

```shell
uv run ruff check src/datakind_geospatial conf
```

Check git state before committing:

```shell
git status --short
```

Some workflows are still maintained outside Kedro:
- Legacy classification experiments: `crop_classification/time_series_analyses/mlflow_experiments/classification`
- Segmentation workflow: `src/segmentation/`
- Raster generation utilities: `src/generate_rasters/`
- Exploratory workflows: `notebooks/`
Use the Kedro pipelines for reproducible preprocessing and training where possible. Use legacy scripts as references for older experiments and comparison runs.
- Shared configuration belongs in `conf/base/`.
- Local credentials and machine-specific config belong in `conf/local/`.
- Do not commit secrets.
- The authoritative dependency list is `pyproject.toml`.
No final plots in `timeseries_classification_local`:

- Check whether the final run is in the `datakind_geospatial` experiment.
- Trial runs named `xgboost-trial-*` do not log plots by default.
- Final plots are named `confusion_matrix.png` and `precision_recall_curves.png`.
Model not visible in the artifact tree:
- In MLflow 3, check the run's Models / Outputs section.
- On disk, check `mlruns/<experiment_id>/models/<model_id>/artifacts/model.pkl`.
- Set `registered_model_name` if you need a Model Registry entry.
- `conf/base/catalog.yml`
- `conf/base/parameters.yml`
- `src/datakind_geospatial/pipeline_registry.py`
- `src/datakind_geospatial/pipelines/`
- `src/segmentation/README.md`