
POST /api/stac/v1/search: persistent timeouts (~70% failure rate) starting 2026-05-02 ~00:00 UTC #476

@stevenpbrumby

Description


Endpoint: https://planetarycomputer.microsoft.com/api/stac/v1/search
Symptom: ~70% of POST /search calls with realistic spatial+temporal parameters either time out at 30 s with no response, or eventually return HTTP 504 after urllib3 exhausts internal retries.
Status pages: Azure Service Health is green; nothing posted on microsoft/PlanetaryComputer/issues in the prior 48 h.
Reporter: steve@impactobservatory.com (Impact Observatory)

Repro (curl)

Picked a non-trivial AOI (Northern Alberta, S2-L2A summer 2021):

# 10 probes; print the HTTP code per attempt (000 = curl gave up at --max-time)
for i in $(seq 1 10); do
  code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 30 \
    -H 'Content-Type: application/json' \
    -X POST -d '{"collections":["sentinel-2-l2a"],
                 "intersects":{"type":"Point","coordinates":[-115.0,57.0]},
                 "datetime":"2021-06-01/2021-08-31","limit":50}' \
    "https://planetarycomputer.microsoft.com/api/stac/v1/search")
  echo -n "$code "
done; echo

Two consecutive runs from a US-East egress, ~02:30–02:55 UTC 2026-05-02:

000 000 200 000 000 200 000 200 000 000   # 3 ok / 7 timeouts
000 000 200 000 000 200                    # 2 ok / 4 timeouts

000 = curl timed out at 30 s with no response received from the LB. Trivial requests (e.g. GET /api/stac/v1/collections, or POST /search with {"collections":["sentinel-2-l2a"],"limit":1} and no spatial/temporal filter) succeed reliably and quickly (~600 ms), so the degradation appears specific to non-trivial searches that scan multiple partitions.

Application-level traceback (one of many)

From pystac_client==0.7.x driving the same query shape:

urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='planetarycomputer.microsoft.com', port=443):
  Max retries exceeded with url: /api/stac/v1/search
  (Caused by ResponseError('too many 504 error responses'))
…
pystac_client.exceptions.APIError: HTTPSConnectionPool(host='planetarycomputer.microsoft.com', port=443):
  Max retries exceeded with url: /api/stac/v1/search (Caused by ResponseError('too many 504 error responses'))
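For reference, the query shape we drive through pystac_client (a sketch assuming its documented `Client.open`/`Client.search` API; the network call itself is commented out so the shape can be inspected offline):

```python
# Same search body as the curl repro above.
SEARCH_PARAMS = {
    "collections": ["sentinel-2-l2a"],
    "intersects": {"type": "Point", "coordinates": [-115.0, 57.0]},
    "datetime": "2021-06-01/2021-08-31",
    "limit": 50,
}

# from pystac_client import Client
# catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
# items = list(catalog.search(**SEARCH_PARAMS).items())
# ^ raises pystac_client.exceptions.APIError after urllib3 sees repeated 504s
```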

Individual attempts return 504 before urllib3's internal retry budget is exhausted, and MaxRetryError surfaces once it is. This points to origin-side timeouts rather than LB throttling, which typically returns 429 or 503.
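For context, urllib3's retry budget is what ultimately raises the MaxRetryError above; an explicit policy of the same shape (a sketch, assuming urllib3's documented `Retry` API, not pystac_client's exact defaults) looks like:

```python
from urllib3.util.retry import Retry

# Retry on gateway errors with exponential backoff; once `total` attempts
# are spent, urllib3 raises MaxRetryError("too many 504 error responses").
retry_policy = Retry(
    total=5,
    status_forcelist=[502, 503, 504],
    backoff_factor=1.0,          # exponential backoff between attempts
    allowed_methods=["POST"],    # POST is not retried by default
)
```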

Impact

We hit this in a multi-tile geospatial regression workflow (Sentinel-2 → land-cover predictions). With a 5× retry-with-exponential-backoff wrapper around every pystac_client.Client.search(...) call, per-call success probability rises from ~30% to ~99.8%, but tail latency grows by ~2.5 minutes per affected call, and roughly 1 in 600 calls still exhausts the wrapper's retries and fails the entire downstream pipeline.
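The retry wrapper described above can be sketched as follows (names are illustrative, not our production code):

```python
import random
import time

def with_retries(fn, attempts=5, base_delay=2.0, retry_on=(Exception,)):
    """Call fn() with up to `attempts` tries and exponential backoff + jitter.

    Under an independent-failure model with per-call success probability p,
    overall success would be 1 - (1 - p) ** attempts; the observed ~99.8%
    suggests backoff also helps by riding out correlated failure bursts.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # budget exhausted; caller's pipeline sees the failure
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
```

The ~2.5 min added tail latency per affected call is roughly the sum of the backoff sleeps plus the repeated 30 s timeouts themselves.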

Asks

  1. Is there an in-progress incident or capacity event we can track? (Couldn't find one anywhere public.)
  2. Would Microsoft consider publishing an incident-status page for the public PC STAC endpoint, similar to GitHub Status or Cloudflare Status? Right now consumers have to probe to determine health.
  3. If the 504s are coming from a specific backend (e.g. pgstac instance pool), is there a known query shape we should avoid (e.g. intersects + datetime + multiple-orbit collections at once)?

Happy to share more curl/traceback samples if useful.
