
POST /api/stac/v1/search: persistent timeouts (~70% failure rate) starting 2026-05-02 ~00:00 UTC #476

@stevenpbrumby

Description


Endpoint: https://planetarycomputer.microsoft.com/api/stac/v1/search
Symptom: ~70% of POST /search calls with realistic spatial+temporal parameters either time out at 30 s with no response, or eventually return HTTP 504 after urllib3 exhausts internal retries.
Status pages: Azure Service Health is green; nothing posted on microsoft/PlanetaryComputer/issues in the prior 48 h.
Reporter: steve@impactobservatory.com (Impact Observatory)

Repro (curl)

Picked a non-trivial AOI (Northern Alberta, S2-L2A summer 2021):

# 10 probes; print the HTTP code per attempt (000 = curl gave up at --max-time)
for i in $(seq 1 10); do
  code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 30 \
    -H 'Content-Type: application/json' \
    -X POST -d '{"collections":["sentinel-2-l2a"],
                 "intersects":{"type":"Point","coordinates":[-115.0,57.0]},
                 "datetime":"2021-06-01/2021-08-31","limit":50}' \
    "https://planetarycomputer.microsoft.com/api/stac/v1/search")
  echo -n "$code "
done; echo

Two consecutive runs from a US-East egress, ~02:30–02:55 UTC 2026-05-02:

000 000 200 000 000 200 000 200 000 000   # 3 ok / 7 timeouts
000 000 200 000 000 200                    # 2 ok / 4 timeouts

000 = curl timed out at 30 s with no response received from the LB. Trivial requests (e.g. GET /api/stac/v1/collections, or POST /search with {"collections":["sentinel-2-l2a"],"limit":1} and no spatial/temporal filter) succeed reliably and quickly (~600 ms), so the degradation appears specific to non-trivial searches that scan multiple partitions.

Application-level traceback (one of many)

From pystac_client==0.7.x driving the same query shape:

urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='planetarycomputer.microsoft.com', port=443):
  Max retries exceeded with url: /api/stac/v1/search
  (Caused by ResponseError('too many 504 error responses'))
…
pystac_client.exceptions.APIError: HTTPSConnectionPool(host='planetarycomputer.microsoft.com', port=443):
  Max retries exceeded with url: /api/stac/v1/search (Caused by ResponseError('too many 504 error responses'))
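For reference, the query shape we drive through pystac_client (a sketch assuming its documented `Client.open`/`Client.search` API; the network call itself is commented out so the shape can be inspected offline):

```python
# Same search body as the curl repro above.
SEARCH_PARAMS = {
    "collections": ["sentinel-2-l2a"],
    "intersects": {"type": "Point", "coordinates": [-115.0, 57.0]},
    "datetime": "2021-06-01/2021-08-31",
    "limit": 50,
}

# from pystac_client import Client
# catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
# items = list(catalog.search(**SEARCH_PARAMS).items())
# ^ raises pystac_client.exceptions.APIError after urllib3 sees repeated 504s
```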

Individual attempts return 504 before urllib3's internal retry budget is exhausted, and MaxRetryError surfaces once it is. This points to origin-side timeouts rather than LB throttling, which typically returns 429 or 503.
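For context, urllib3's retry budget is what ultimately raises the MaxRetryError above; an explicit policy of the same shape (a sketch, assuming urllib3's documented `Retry` API, not pystac_client's exact defaults) looks like:

```python
from urllib3.util.retry import Retry

# Retry on gateway errors with exponential backoff; once `total` attempts
# are spent, urllib3 raises MaxRetryError("too many 504 error responses").
retry_policy = Retry(
    total=5,
    status_forcelist=[502, 503, 504],
    backoff_factor=1.0,          # exponential backoff between attempts
    allowed_methods=["POST"],    # POST is not retried by default
)
```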

Impact

We hit this in a multi-tile geospatial regression workflow (Sentinel-2 → land-cover predictions). With a 5× retry-with-exponential-backoff wrapper around every pystac_client.Client.search(...) call, per-call success probability rises from ~30% to ~99.8%, but tail latency grows by ~2.5 minutes per affected call, and roughly 1 in 600 calls still exhausts the wrapper's retries and fails the entire downstream pipeline.
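The retry wrapper described above can be sketched as follows (names are illustrative, not our production code):

```python
import random
import time

def with_retries(fn, attempts=5, base_delay=2.0, retry_on=(Exception,)):
    """Call fn() with up to `attempts` tries and exponential backoff + jitter.

    Under an independent-failure model with per-call success probability p,
    overall success would be 1 - (1 - p) ** attempts; the observed ~99.8%
    suggests backoff also helps by riding out correlated failure bursts.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # budget exhausted; caller's pipeline sees the failure
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
```

The ~2.5 min added tail latency per affected call is roughly the sum of the backoff sleeps plus the repeated 30 s timeouts themselves.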

Asks

  1. Is there an in-progress incident or capacity event we can track? (Couldn't find one anywhere public.)
  2. Would Microsoft consider publishing an incident-status page for the public PC STAC endpoint, similar to GitHub Status or Cloudflare Status? Right now consumers have to probe to determine health.
  3. If the 504s are coming from a specific backend (e.g. pgstac instance pool), is there a known query shape we should avoid (e.g. intersects + datetime + multiple-orbit collections at once)?

Happy to share more curl/traceback samples if useful.
