Julia client for Groq's OpenAI-compatible Chat Completions
API. Groq-only, by design. A sibling to
AnthropicClient.jl — same
public surface and Reply layout — built for long-running batch and pipeline
workloads where rate limiting and cost accounting are what matter. Defaults
target the open-weight openai/gpt-oss-20b model.
- chat / chat_async against /openai/v1/chat/completions with HTTP keep-alive pooling.
- reasoning_effort passthrough for gpt-oss reasoning models (defaults to "low" for fast, cheap output). The reasoning channel is dropped from Reply.text, so you get the final answer only.
- response_format passthrough for JSON Object and strict JSON Schema structured output, giving guaranteed-parseable JSON.
- Per-client sliding-window RPM semaphore shared across concurrent calls.
- Per-reply token + USD cost accounting (uncached input, cached reads, output) against a bundled per-model price table.
- Budget wrapper that throws BudgetExceeded on cap.
- retry-after-aware 429 handling; bounded exponential backoff on 5xx.
- Stub-friendly: body-building and reply-parsing are pure functions, so tests run with no network and no API key.
- Base.show never prints the API key.
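The retry behaviour listed above (honour retry-after on 429, bounded exponential backoff on 5xx) can be illustrated with a small pure function. This is only a sketch under assumptions: the name retry_delay, its keyword arguments, and the default base, cap, and attempt limit are hypothetical and are not taken from GroqClient.jl.

```julia
# Hypothetical helper, not GroqClient.jl's actual code.
# Returns the number of seconds to wait before retrying, or `nothing` to give up.
function retry_delay(status::Integer, attempt::Integer;
                     retry_after::Union{Nothing,Real} = nothing,
                     base::Real = 0.5, cap::Real = 8.0,
                     max_attempts::Integer = 5)
    attempt >= max_attempts && return nothing      # out of attempts
    if status == 429
        # Prefer the server's Retry-After hint (seconds) when one was sent.
        return Float64(something(retry_after, base))
    elseif 500 <= status <= 599
        # Exponential backoff: base, 2*base, 4*base, ... bounded by `cap`.
        return Float64(min(cap, base * 2.0^(attempt - 1)))
    end
    return nothing                                 # other statuses are not retried
end

retry_delay(429, 1; retry_after = 2)       # 2.0
retry_delay(503, 3)                        # 0.5 * 2^2 = 2.0
retry_delay(503, 10; max_attempts = 20)    # capped at 8.0
retry_delay(400, 1)                        # nothing
```

Keeping the decision in a pure function like this means it can be tested without a network, in the same spirit as the stub-friendly design described in the list.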
While pre-1.0, use as a git dependency:
using Pkg
Pkg.add(url="https://github.com/PelehAI/GroqClient.jl")

Set your API key in the environment:

export GROQ_API_KEY=gsk_...

using GroqClient
c = Client(
api_key = ENV["GROQ_API_KEY"],
model_default = "openai/gpt-oss-20b",
rpm = 30,
)
reply = chat(c;
system = "You are a helpful assistant.",
messages = [(:user, "Say hi.")],
max_tokens = 64,
)
@show reply.text reply.cost_usd reply.input_tokens reply.output_tokens

messages accepts Msg, (:user, "...") tuples, or :user => "..." pairs.
system accepts String, SystemPrompt(text), or (text="...",) and is sent
as a leading role:"system" message.
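To make the accepted input shapes concrete, here is a minimal sketch of how tuples and pairs could be normalised into the role/content objects that a Chat Completions request expects, with the system prompt placed first. The helper names to_message and build_messages are hypothetical; the package's real body-building function may look quite different, and the Msg and SystemPrompt types are left out.

```julia
# Hypothetical helpers; the names are illustrative, not GroqClient.jl's API.
to_message(m::Tuple{Symbol,AbstractString}) =
    Dict("role" => String(m[1]), "content" => String(m[2]))
to_message(m::Pair{Symbol,<:AbstractString}) =
    Dict("role" => String(m.first), "content" => String(m.second))

# Build the `messages` array, with the system prompt (if any) as the first entry.
function build_messages(messages; system::Union{Nothing,AbstractString} = nothing)
    out = Dict{String,String}[]
    system === nothing || push!(out, Dict("role" => "system", "content" => String(system)))
    for m in messages
        push!(out, to_message(m))
    end
    return out
end

msgs = build_messages([(:user, "Say hi."), :assistant => "Hi!"]; system = "Be brief.")
# msgs[1] is the system message; the user and assistant turns follow in order.
```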
gpt-oss emits a separate reasoning channel. This client reads only the final
answer (message.content) into Reply.text and ignores message.reasoning.
Tune depth vs. speed/cost with reasoning_effort:
reply = chat(c;
messages = [(:user, "Plan a 3-step outline.")],
max_tokens = 512,
reasoning_effort = "low", # "low" (default) | "medium" | "high"
)

For pipelines that parse the model's output, ask for schema-constrained JSON.
strict = true uses constrained decoding, so the output is guaranteed valid
against your schema:
schema = Dict(
"type" => "json_schema",
"json_schema" => Dict(
"name" => "outline",
"strict" => true,
"schema" => Dict(
"type" => "object",
"properties" => Dict("steps" => Dict("type" => "array",
"items" => Dict("type" => "string"))),
"required" => ["steps"],
"additionalProperties" => false,
),
),
)
reply = chat(c;
messages = [(:user, "Outline a talk on caching.")],
max_tokens = 512,
response_format = schema,
)
# reply.text is valid JSON matching the schema

Dict("type" => "json_object") is the looser mode (valid JSON syntax, no
schema enforcement) and works on all models.
Groq does automatic prompt caching server-side — there is no per-block
cache_control marker to set. Cache hits show up as reply.cached_read_tokens
and are billed at the discounted cache-read rate; reply.cached_write_tokens
is always 0 (no write surcharge). The cache flag on Msg/SystemPrompt
exists only for signature parity with AnthropicClient.jl and is ignored.
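As a rough illustration of how cached reads affect the bill, the sketch below prices a single reply from its three token classes. Everything in it is an assumption: the prices are made-up placeholders (not Groq's rates or the bundled price table), the function name reply_cost_usd is hypothetical, and it assumes input_tokens counts all prompt tokens, including those served from cache.

```julia
# Hypothetical per-million-token prices in USD. These are NOT Groq's real rates.
const EXAMPLE_PRICE = (input = 0.10, cached_read = 0.05, output = 0.50)

# Cost of one reply, assuming `input_tokens` includes the cached tokens.
function reply_cost_usd(; input_tokens::Integer, cached_read_tokens::Integer,
                        output_tokens::Integer, price = EXAMPLE_PRICE)
    uncached = input_tokens - cached_read_tokens   # billed at the full input rate
    return (uncached * price.input +
            cached_read_tokens * price.cached_read +  # discounted cache-read rate
            output_tokens * price.output) / 1_000_000
end

cost = reply_cost_usd(input_tokens = 1_000, cached_read_tokens = 400, output_tokens = 200)
# (600 * 0.10 + 400 * 0.05 + 200 * 0.50) / 1e6, roughly 0.00018 USD
```

There is no cache-write term because, as noted above, cached_write_tokens is always 0.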
chat_async returns a Task that runs on a background thread. Many concurrent
tasks share one rate budget — the per-client sliding-window semaphore blocks
tasks that would exceed rpm requests in the trailing 60s.
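For readers curious how a sliding-window limit of this kind can work, here is a minimal sketch. It is not the package's implementation: the SlidingWindow type and the try_acquire! and acquire! functions are hypothetical names, and the real semaphore may be structured differently.

```julia
# Hypothetical sketch; not GroqClient.jl's actual limiter.
mutable struct SlidingWindow
    limit::Int                 # max requests allowed per window
    window::Float64            # window length in seconds
    stamps::Vector{Float64}    # start times of requests still inside the window
    lk::ReentrantLock          # guards `stamps` across concurrent tasks
end
SlidingWindow(limit; window = 60.0) =
    SlidingWindow(limit, window, Float64[], ReentrantLock())

# Try to claim a slot at time `now`. Returns 0.0 on success, otherwise the
# number of seconds until the oldest request leaves the window.
function try_acquire!(sw::SlidingWindow, now::Float64 = time())
    lock(sw.lk) do
        # Drop timestamps that have aged out of the trailing window.
        while !isempty(sw.stamps) && now - first(sw.stamps) >= sw.window
            popfirst!(sw.stamps)
        end
        if length(sw.stamps) < sw.limit
            push!(sw.stamps, now)
            return 0.0
        end
        return sw.window - (now - first(sw.stamps))
    end
end

# Block the calling task until a slot is free.
function acquire!(sw::SlidingWindow)
    while true
        wait_s = try_acquire!(sw)
        wait_s == 0.0 && return
        sleep(wait_s)
    end
end

sw = SlidingWindow(2; window = 60.0)
r1 = try_acquire!(sw, 0.0)    # 0.0: slot granted
r2 = try_acquire!(sw, 1.0)    # 0.0: slot granted
r3 = try_acquire!(sw, 10.0)   # 50.0: window is full until t = 60
r4 = try_acquire!(sw, 60.0)   # 0.0: the request from t = 0 has aged out
```

Passing the clock in as an argument keeps the limiter testable without real sleeping; acquire! is the blocking form a concurrent caller would use.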
tasks = [chat_async(c; messages=[(:user, "Q$i")], max_tokens=32) for i in 1:20]
replies = fetch.(tasks)

Each Reply carries token counts and a USD cost computed against the bundled
price table. Use known_models() to list what's billable; update
src/pricing.jl when Groq changes pricing or you add models.
budget = Budget(c; max_usd = 0.10)
for prompt in prompts
reply = chat(budget; messages=[(:user, prompt)], max_tokens=128)
# raises BudgetExceeded once spent_usd(budget) crosses max_usd
end
@show spent_usd(budget)

c = Client(api_key="", rpm=30)
has_key(c) # false

Library code can degrade to identity passes / placeholders without a key.
Calling chat on a keyless client throws, so guard with has_key.
has_key only tells you a key string is set, not that it works. Two live
probes go further — both make minimal real calls (a few output tokens) and
never throw:
hc = healthcheck(c) # one minimal call, classified
hc.ok, hc.status # e.g. (true, :ok) or (false, :billing)
sp = speedtest(c; n = 5) # n concurrent calls under the rpm cap
sp.throughput_rps, sp.latency_median_ms

healthcheck returns a HealthStatus whose status is one of :ok,
:no_key, :auth, :quota, :billing, :bad_request, :server,
:network, :error, which is enough for a dashboard to show green/red and say why.
speedtest returns a SpeedResult (ok / rate-limited / failed counts, achieved
throughput_rps, and min/median/max latency). Both short-circuit on a keyless
client.
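The kind of summary speedtest reports can be sketched from a list of per-call latencies. This is an illustration only: the summarize function is hypothetical, the latency_min_ms and latency_max_ms field names are guesses modelled on latency_median_ms, and defining throughput as successful calls divided by wall-clock time is an assumption about how the package measures it.

```julia
using Statistics: median

# Hypothetical summary; the formulas are assumptions, not the package's code.
function summarize(latencies_ms::Vector{Float64}, wall_seconds::Float64)
    isempty(latencies_ms) && return nothing        # nothing succeeded
    return (
        throughput_rps    = length(latencies_ms) / wall_seconds,
        latency_min_ms    = minimum(latencies_ms),
        latency_median_ms = median(latencies_ms),
        latency_max_ms    = maximum(latencies_ms),
    )
end

s = summarize([120.0, 95.0, 210.0, 101.0, 130.0], 0.5)
# s.throughput_rps == 10.0, s.latency_median_ms == 120.0
```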
julia --project=. -e 'using Pkg; Pkg.instantiate(); Pkg.test()'

All tests are pure-function / wiring-only, with no live API calls.
- Streaming (SSE) responses
- Tool use / function calling
- service_tier (flex/performance) selection
- More Groq-hosted models in the price table
- peleh.ai — academic paper to slide deck.
MIT. See LICENSE.