AI Agents Are Distributed Systems. Why Scala’s Type Safety Matters More Than Prompts

18.06.2026 / By Piotr Borkowicz

The alert fires at 3:17 AM.

payment-service · prod · ERROR

AttributeError: 'NoneType' object has no attribute 'choices'
Traceback (most recent call last):
  File "adapters/llm.py", line 88
    return response.choices[0].message
  File "adapters/llm.py", line 71
    response = openai.chat.completions.create(…)
  File "openai/client.py", line 203
    raise RateLimitError(…)
Retries exhausted. Falling back to null.
500 requests failed. customer_id=usr_8f3a2..usr_b19c7

The bug made it through tests and code review because the retry wrapper returned None instead of surfacing the rate limit, and the next line assumed success. Python can catch this with Optional[Response] and mypy or pyright, but only if every layer of the wrapper is annotated and checked. Here the assumption stayed invisible until the rate limiter fired.
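The failure pattern described above can be sketched in a few lines. This is a minimal illustration, not the code from the incident: `Response`, `RateLimitError`, `always_rate_limited`, and both retry wrappers are hypothetical stand-ins defined here, and no real SDK is involved.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's rate-limit exception."""


@dataclass
class Response:
    """Hypothetical stand-in for an SDK response object."""
    choices: list[str]


def retry_swallowing(call: Callable[[], Response], attempts: int = 3) -> Optional[Response]:
    """The risky pattern: when retries run out, the failure becomes None."""
    for _ in range(attempts):
        try:
            return call()
        except RateLimitError:
            continue
    return None  # the rate limit is now invisible to the caller


def retry_surfacing(call: Callable[[], Response], attempts: int = 3) -> Response:
    """Same retries, but the last error is chained and raised instead of hidden."""
    last_error: Optional[RateLimitError] = None
    for _ in range(attempts):
        try:
            return call()
        except RateLimitError as error:
            last_error = error
    raise RuntimeError(f"retries exhausted after {attempts} attempts") from last_error


def always_rate_limited() -> Response:
    """Simulates a provider that rejects every attempt."""
    raise RateLimitError("429")


if __name__ == "__main__":
    response = retry_swallowing(always_rate_limited)
    try:
        response.choices[0]  # mypy/pyright flag this line; the runtime fails only here
    except AttributeError as error:
        print(error)
```

With the `Optional[Response]` annotation in place, a type checker rejects the unguarded `response.choices` access. Remove the annotation from any one layer and the check silently disappears, which is the gap the paragraph above describes.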

This is not a story about Python being bad. Python built the AI ecosystem. It is a story about how optional discipline scales differently from enforced discipline. In a two-call pipeline you remember every check. In a production pipeline with retries, wrappers, and fallback paths, you do not.

In Scala, the same operation would return IO[Either[LlmError, Response]]. You cannot call .choices on an Either. The compiler refuses. The rate limit still happens. The difference is where you find out.

Caught in the editor

Before the code left the developer’s machine — 0 seconds in production

LlmClient.scala

def complete(prompt: Prompt): Either[LlmError, Response] = …

val response = complete(prompt)
val message = response.choices.head.message // ← compile error: choices is not a member of Either[LlmError, Response]
send(message)

Caught at 3:17 AM

After deploy, in production — 8 minutes of failed transactions

datadog / payment-service / prod ERROR

2026-03-14 03:17:42 ERROR payment-worker
AttributeError: 'NoneType' object has no attribute 'choices'
Traceback (most recent call last):
  File "workers/payment.py", line 142
    response = llm_client.complete(prompt)
  File "adapters/llm.py", line 88
    return response.choices[0].message
  File "adapters/llm.py", line 71
    response = openai.chat.completions.create(…)
  File "openai/client.py", line 203
    raise RateLimitError(…)
2026-03-14 03:17:43 ERROR retry-handler Retries exhausted. Falling back to null.
2026-03-14 03:17:43 ERROR payment-worker Transaction abandoned. customer_id=usr_8f3a2

Compile time — the type system refuses to treat a failed LLM call as a successful response. The error path must be handled before deploy.

Runtime — the system runs until it doesn’t. Someone gets paged.

The Python-first AI stack, and where it gets complicated

Python won the AI tooling race because it moved fast. LangChain, Pydantic, Jupyter, HuggingFace. Every new model, API, and capability showed up there first. A team could prototype an LLM pipeline in a day.

That speed comes from dynamic typing. It also means the safety is optional. Type hints are annotations. mypy and pyright enforce them only if you opt in, only where they are complete, and only if the code around them plays along. Pydantic v2 is faster and stricter than v1. TypedDict and TypeAdapter help with dict-heavy code. These are real improvements.

But optional discipline does not scale the same way as enforced discipline. A two-call pipeline is fine. Five services, three models, retry logic, tool calls, streaming chunks, structured outputs, and fallback paths create too many corners for a None to hide. Each new integration point is another place where the type model and the runtime model can drift apart.

Eightfold.ai wrote about adding static type checking to a 4M-line Python monorepo. They used MonkeyType, a custom mypy config, and CI enforcement. The takeaway was not that Python cannot be safe. It was that safety in Python is a continuous project: inference, configuration, and keeping untyped escape hatches from multiplying.

A well-maintained Python pipeline with pyright strict mode, Pydantic v2, and disciplined schema management catches a lot. What it still misses is compile-time exhaustiveness: the guarantee that every response variant is handled before the code runs, not just the ones in the test suite.

Where types meet non-determinism

LLMs fail in ways that regular APIs do not. A traditional API returns a success response or an HTTP error. An LLM call can:

hit a rate limit after retries,
time out during a long generation,
return a streaming chunk without the field the next step expects,
emit a tool-call variant the orchestration code does not recognize,
return syntactically valid JSON that is semantically wrong.

The dangerous failures look like ordinary data. The response object exists. The JSON parses. The problem surfaces two steps later, when code assumes too much.
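A small sketch of the last failure in the list above: output that parses and has the right shape but is wrong in meaning. The `Refund` type, its field names, and the order total are hypothetical, and the structural check is hand-written with the standard library rather than Pydantic to keep the example self-contained.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Refund:
    """Hypothetical structured output the pipeline expects from the model."""
    order_id: str
    amount_cents: int


def parse_refund(raw: str) -> Refund:
    """Structural check only: valid JSON, required keys, expected types."""
    data = json.loads(raw)
    order_id, amount = data["order_id"], data["amount_cents"]
    if not isinstance(order_id, str) or not isinstance(amount, int):
        raise TypeError("refund payload has the wrong field types")
    return Refund(order_id, amount)


def check_refund(refund: Refund, order_total_cents: int) -> list[str]:
    """Semantic checks: business rules that shape validation cannot know."""
    problems = []
    if refund.amount_cents <= 0:
        problems.append("refund must be positive")
    if refund.amount_cents > order_total_cents:
        problems.append("refund exceeds order total")
    return problems


if __name__ == "__main__":
    raw = '{"order_id": "ord_1", "amount_cents": 990000}'
    refund = parse_refund(raw)         # parses and type-checks without complaint
    print(check_refund(refund, 9900))  # the problem only shows up here
```

The structural step accepts the payload. Only the separate business-rule check notices that the amount is a hundred times the order total.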

Pydantic and structured outputs constrain the shape of model output. They do not turn the orchestration layer into a compile-time contract. The API call happens before validation. The rest of the pipeline still has to model retries, refusals, tool calls, partial streams, and provider errors explicitly.

The 3 AM alert was not purely a typing problem. It was an error-handling problem: the retry logic fell back to null instead of surfacing the failure. Python can handle that correctly with explicit checks, typed SDK models, tenacity, and discipline.

What Scala adds is enforcement. Either[LlmError, Response] or a sealed response type makes the error path hard to ignore. The compiler cannot prevent rate limits. It can prevent you from treating a failed call as a successful one.

Three boundaries where types move failures left

Figure 02. Same code path. Different checkpoint.

Scala moves this class of structural errors left, before deploy. Provider failures still happen at runtime.

Python steps: Write, LLM, Logic, Deploy. Checkpoint: 3:17 AM. AttributeError in production. 500 requests failed before anyone noticed.

Scala steps: Write, LLM, Logic, Deploy. Checkpoint: compile error. Unhandled failure case: Either[LlmError, Response]. The pipeline stops here. Deploy never happens.

The greyed-out steps don’t fail quietly. They never run. Rate limits and provider errors still do.

Most production LLM failures happen at three boundaries. At each one, Python validates at runtime. Scala can push the check earlier.

Input to prompt. In Python, you can type the input with Pydantic and validate it. The contract is explicit in the function signature, but the error path is still optional. In Scala, if the previous step returns Either[UserInputError, UserInput], you cannot build the prompt until you handle the error. The difference is not that Python cannot validate. It is that Scala makes the validation unavoidable.

LLM output to structured data. Scala’s sealed traits model response variants exhaustively:

sealed trait LlmResponse
case class TextResponse(content: String) extends LlmResponse
case class ToolCall(name: String, arguments: Json) extends LlmResponse
case class Refusal(reason: String) extends LlmResponse

def handle(response: LlmResponse): IO[Unit] = response match
  case TextResponse(text)   => processText(text)
  case ToolCall(name, args) => executeTool(name, args)
  case Refusal(reason)      => logRefusal(reason)
// Compiler warns if a case is missing

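Add a new variant to LlmResponse and the compiler flags every match that does not handle it. Python 3.10+ with match and assert_never (in typing since 3.11, in typing_extensions before that) can enforce exhaustiveness too, but only when pyright or mypy runs, so it depends on checker configuration and team discipline. In Scala, the check is part of the normal compile: a warning by default, and a build failure with -Werror.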

Tool calls to external APIs. Both ecosystems can validate arguments before the tool runs. The difference is how the schema stays in sync with the code. In Python, the JSON Schema sent to OpenAI is often generated once and then copied, patched, or versioned separately from the Pydantic model. In sttp-ai, the schema and the case class are one definition. Change a field name and the compiler shows every broken call site.

sttp-ai in practice

The Scala AI tooling stack is ready for production use cases that fit it. sttp-ai covers OpenAI, Anthropic, Google Gemini, Azure, Ollama, DeepSeek, and OpenAI-compatible endpoints. It supports structured outputs, tool calling, and streaming across fs2, ZIO, Akka/Pekko Streams, and Ox. Before adopting, check whether the latest model features you need are already exposed and whether your team accepts a smaller ecosystem than Python’s.

Here is a type-safe OpenAI call with ZIO:

import sttp.ai.openai.OpenAI
import sttp.client4.httpclient.zio.HttpClientZioBackend
import zio.*

val program = ZIO.scoped {
  for {
    backend  <- HttpClientZioBackend.scoped()
    client    = OpenAI.fromEnv
    // body: the chat completion request, built before this block
    response <- client.createChatCompletion(body).send(backend)
    _        <- response.body match {
      case Right(r) =>
        r.choices.headOption match
          case Some(choice) => Console.printLine(choice.message.content.getOrElse(""))
          case None         => ZIO.fail("Empty choices")
      case Left(e) => ZIO.fail(s"OpenAI error: ${e.getMessage}")
    }
  } yield ()
}

response.body is Either[ResponseError, ChatCompletionResponse]. The caller must decide what to do with both cases. The split is in the type, not in comments or conventions.

Effects make retry, timeout, and circuit-breaker logic composable:

def callWithRetry(
  prompt: String,
  maxRetries: Int = 3
): ZIO[Any, LlmError, Response] =
  callLLM(prompt)
    .retry(Schedule.recurs(maxRetries) && Schedule.exponential(100.millis))
    .catchAll { error =>
      ZIO.logError(s"LLM call failed after $maxRetries retries: $error") *>
      ZIO.fail(LlmError.MaxRetriesExceeded(error))
    }

The retry schedule, error type, and success type are visible at the function boundary. You still need runtime policy and observability, but the failure shape is not hidden in decorators or exception paths.

Where tool calls break: schema drift and unknown tools

The tool-call boundary is where Python pipelines most often surprise you in production. The issue is not that Python cannot type a function. It is that the contract between the LLM, the schema, and the business logic drifts over time.

Suppose you have a get_weather tool. The Pydantic model and the schema sent to OpenAI both start from the same definition:

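from pydantic import BaseModel

class WeatherInput(BaseModel):
    location: str
    unit: str = "celsius"

# `tool` is the tool-registration decorator of whichever agent framework is in use
@tool
def get_weather(input: WeatherInput) -> str:
    return f"{input.location}: 22°C"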

This works until someone changes the field name in the model but forgets to regenerate the schema, or copies the schema into a prompt template, or wraps the tool in another decorator that does not propagate types. Then the model sends "location" while Pydantic expects "city". The error is a runtime ValidationError, discovered after the model call has already run.

In sttp-ai, the same definition does both jobs:

case class WeatherInput(location: String, unit: String = "celsius")
  derives SnakePickle.ReadWriter, Schema

val weatherTool = AgentTool.fromFunction("get_weather", "Get current weather") {
  (input: WeatherInput) => s"${input.location}: 22°C"
}

Change location to city in the case class and three things happen: the business type changes, the JSON Schema sent to OpenAI changes, and the compiler shows every place that still expects location. The schema and the type are one definition.

The same pattern works for structured outputs with Tapir. You define the case class once, then derive the JSON Schema for the model and the types for the rest of the pipeline:

case class PersonInfo(
  name:       String,
  age:        Int,
  occupation: Option[String] = None
) derives ReadWriter, Schema

// Tapir derives the JSON Schema from the same case class
// and sends it to OpenAI as response_format.
// Decode error is explicit in the return type.

Pydantic v2 can generate a schema with PersonInfo.model_json_schema(), and that is the right approach in Python. The difference is what happens when the contract moves across services, wrappers, prompts, and tool registries. In Scala, a single change propagates through the compiler. In Python, it propagates through discipline.
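One way to make that discipline mechanical is to derive the schema from the type and compare any hand-maintained copy against it in CI. A sketch under assumptions: it uses standard-library dataclasses in place of Pydantic, `schema_for` covers only str and int fields, and `COPIED_SCHEMA` stands for a schema that was pasted into a prompt template or tool registry and never updated.

```python
from dataclasses import MISSING, dataclass, fields

_JSON_TYPES = {str: "string", int: "integer"}


@dataclass
class WeatherInput:
    city: str  # renamed from `location`
    unit: str = "celsius"


def schema_for(cls) -> dict:
    """Derive a minimal JSON-Schema-like dict from dataclass fields."""
    properties = {f.name: {"type": _JSON_TYPES[f.type]} for f in fields(cls)}
    required = [f.name for f in fields(cls)
                if f.default is MISSING and f.default_factory is MISSING]
    return {"type": "object", "properties": properties, "required": required}


# Hypothetical hand-copied schema that was not updated after the rename.
COPIED_SCHEMA = {
    "type": "object",
    "properties": {"location": {"type": "string"}, "unit": {"type": "string"}},
    "required": ["location"],
}


def drifted_fields(derived: dict, copied: dict) -> set[str]:
    """Field names present in one schema but not the other."""
    return set(derived["properties"]) ^ set(copied["properties"])
```

A CI test that asserts `drifted_fields(schema_for(WeatherInput), COPIED_SCHEMA)` is empty would fail on this rename. That is still a test someone has to write and keep, not a property of the build.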

A second failure mode is schema drift inside the agent loop. The LLM returns arguments for a tool, but the field names no longer match the Pydantic model because someone updated the schema independently. In Python, this surfaces as a ValidationError after the model call. In Scala, changing the case class forces the compiler to show every affected call site before deploy.

def agentLoop(messages: List[Message]): IO[String] =
  for {
    response <- llmClient.chat(messages, tools = List(weatherTool))
    result   <- response match {
      case TextResponse(text)   => IO.pure(text)
      case ToolCall(name, args) =>
        // Simplified: in production, name maps to a typed handler.
        executeTool(name, args).flatMap { toolResult =>
          val nextMessages = messages ++ List(
            Message.assistantToolCall(name, args),
            Message.tool(toolResult)
          )
          agentLoop(nextMessages)
        }
      case Refusal(reason)      => IO.pure(s"Refused: $reason")
    }
  } yield result

// Start the loop with a user question
agentLoop(List(Message.user("What is the weather in Warsaw?")))

The loop keeps the full conversation state. After the tool returns “Warsaw: 22°C”, the next LLM call sees both the original question and the tool result. The match is also exhaustive: add a StreamingResponse variant and the compiler lists every call site that missed it.
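The same loop shape can be sketched in Python with a scripted fake model, which makes the state handling easy to test in isolation. Everything here is hypothetical: the `Text` and `ToolCall` classes, the `TOOLS` registry, the `max_steps` bound, and `scripted_model`. The unknown-tool branch is an explicit runtime check, since nothing in this version ties the tool name to a handler at type-check time.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class Text:
    content: str


@dataclass(frozen=True)
class ToolCall:
    name: str
    arguments: dict


Reply = Union[Text, ToolCall]
Message = tuple[str, str]  # (role, content)

TOOLS: dict[str, Callable[[dict], str]] = {
    "get_weather": lambda args: f"{args['location']}: 22°C",
}


def agent_loop(messages: list[Message],
               model: Callable[[list[Message]], Reply],
               max_steps: int = 5) -> str:
    """Call the model until it answers in text, feeding tool results back."""
    for _ in range(max_steps):
        reply = model(messages)
        if isinstance(reply, Text):
            return reply.content
        tool = TOOLS.get(reply.name)
        if tool is None:
            raise LookupError(f"model requested unknown tool: {reply.name}")
        # Extend the history so the next call sees the question and the tool result.
        messages = messages + [("assistant", f"call {reply.name}"),
                               ("tool", tool(reply.arguments))]
    raise RuntimeError("agent loop exceeded max_steps")


def scripted_model(messages: list[Message]) -> Reply:
    """Fake model: asks for the tool once, then answers from its result."""
    if messages[-1][0] == "tool":
        return Text(f"Forecast: {messages[-1][1]}")
    return ToolCall("get_weather", {"location": "Warsaw"})
```

Calling `agent_loop([("user", "What is the weather in Warsaw?")], scripted_model)` runs one tool round trip and then returns the text answer.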

Effects: IO as a contract, not a wrapper

Teams that add IO or ZIO “once the project stabilizes” usually rewrite core logic later. Effects are easier to design in from the start than to retrofit.

A Scala signature like this is a complete specification:

def callLLM(prompt: String): IO[Either[LlmError, Response]]

It says: this function performs an effect, it can fail with LlmError, and on success it returns Response. The caller cannot pretend otherwise.

The Python equivalent can be typed too:

def call_llm(prompt: str) -> ChatCompletion:
    # may raise requests.Timeout
    # may raise RateLimitError
    # may return a response with missing content
    # the surrounding code must model these paths explicitly
    ...

But the failure modes live in exceptions, comments, or conventions. The Scala signature makes the failure channel part of the value the caller must handle. Retry, timeout, and circuit-breaker logic compose because the effect is explicit.
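Python code can approximate a value-level failure channel with a union return type. A sketch under assumptions: `Ok`, `Err`, `LlmError`, `call_llm_result`, and `render` are hypothetical names, the provider is passed in as a plain function, and `ConnectionError` stands in for a provider-specific rate-limit exception. Nothing forces callers to inspect the result unless a type checker is enforced in CI.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Union


class LlmError(Enum):
    RATE_LIMITED = "rate_limited"
    TIMEOUT = "timeout"
    EMPTY_CONTENT = "empty_content"


@dataclass(frozen=True)
class Ok:
    content: str


@dataclass(frozen=True)
class Err:
    error: LlmError


Result = Union[Ok, Err]


def call_llm_result(prompt: str, provider: Callable[[str], str]) -> Result:
    """Translate exceptions and empty output into values at the boundary."""
    try:
        content = provider(prompt)
    except TimeoutError:
        return Err(LlmError.TIMEOUT)
    except ConnectionError:
        # Stand-in for a provider-specific rate-limit exception.
        return Err(LlmError.RATE_LIMITED)
    if not content:
        return Err(LlmError.EMPTY_CONTENT)
    return Ok(content)


def render(result: Result) -> str:
    # Callers branch on the variant; Err has no `content` attribute.
    if isinstance(result, Ok):
        return result.content
    return f"LLM call failed: {result.error.value}"
```

The signature now carries the failure cases, which is closer to the Scala version. The enforcement still comes from the checker rather than from the language.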

For long-running LLM orchestration services, the JVM is a good fit. Most of the time is spent waiting on model APIs, databases, queues, or tools, not executing CPU-bound code. For serverless deployments, GraalVM native-image can reduce startup time and memory usage, though the exact gain depends on the app and dependencies.

When AI writes the code, the compiler is the first reviewer

AI coding assistants changed where the bottleneck sits. Writing is fast now. Reviewing is where the time goes.

That makes static types more useful. The compiler catches structural mistakes before a human reviewer sees the diff. Many AI-generated errors are boring but expensive: wrong field names, missing variants, unhandled errors, invalid return types, optimistic assumptions about optional data. In a dynamically typed pipeline, these can stay syntactically valid until tests or production traffic hit the wrong path.

Metals’ MCP integration pushes this further. AI agents can use compiler feedback, inspect symbols, and run tests inside the development loop. The compiler and the assistant operate against the same structured representation of the code. The loop becomes write, compile, fix — not write, deploy, discover.

When to choose Scala, when to stay with Python

Start with the trade-offs.

Scala’s learning curve is real. An engineer coming from Python needs time before effect systems and type inference feel natural. The Scala talent pool is smaller than Python’s, and smaller than TypeScript’s. Hiring, onboarding, and build times are part of the cost.

TypeScript gives you compile-time types with a larger ecosystem and more available engineers. With discriminated unions and strict compiler settings, it can model many response variants safely. Scala’s edge is different: sealed hierarchies with compiler-backed exhaustiveness, a mature effect ecosystem, and functional error handling via Either as a standard backend idiom. If your team already writes TypeScript, Scala is a bigger organizational bet. If your team already writes Scala, the cost of adding AI tooling is lower.

Python stays for GPU-bound work. vLLM, TGI, and the major model serving stacks are Python-first. The practical split: Python for inference and experimentation, Scala for orchestration, connected via HTTP or gRPC. Scala owns the boundary layer. Python owns the layer where GPU utilization matters.

Scala makes sense when:

you are moving an LLM pipeline from prototype to production,
you run multiple LLM services with typed inter-service contracts,
you already run Scala for other backend services,
runtime failures in Python pipelines are eating engineering time,
you can absorb the adoption cost: hiring, training, longer builds, and slower access to bleeding-edge model features.

Scala makes less sense for a small team with one pipeline that needs to move fast, where the adoption and build-time overhead outweighs the reliability gain at the current scale.

What this looks like in practice

Python built the AI prototypes. The industry needed that speed. The tooling we have exists because Python made experimentation cheap.

Production reliability requires something stricter. The three boundaries (input, output, and tool calls) are where non-deterministic LLM behavior meets typed code. Scala’s compiler surfaces structural failures at those boundaries before the code ships. It does not prevent semantic errors or hallucinations: the LLM can still return JSON that is structurally valid and semantically wrong. What types guarantee is the shape of the data, and that every structural variant is handled.

If your team is moving LLM pipelines from prototype to production and wants to know where a type-safe foundation fits, get in touch or email projects@scalac.io. We will look at your current pipeline, identify the boundaries where failures are most likely to surface, and tell you where Scala helps and where Python should stay.

FAQ

What is sttp-ai and how does it differ from the Python OpenAI SDK?

sttp-ai is a type-safe Scala client for OpenAI, Anthropic, and OpenAI-compatible endpoints. The Python OpenAI SDK is also typed. The difference is how far the type contract reaches into the orchestration layer. sttp-ai exposes response bodies as Either[ResponseError, ChatCompletionResponse], so callers must handle both paths.

Do I need to rewrite my entire Python pipeline in Scala?

No. Start with the boundary layer: typed inputs, structured outputs, tool schemas, retry and error contracts, and inter-service APIs. Keep Python for GPU-bound model serving and experimentation. Use Scala where orchestration, typed contracts, and failure handling matter most.

Is Scala fast enough for production LLM workloads?

For long-running LLM orchestration services, the JVM is a strong fit because most time is spent waiting on APIs, databases, and queues. For serverless, GraalVM native-image can reduce startup time and memory usage, though results depend on the application.

What is an effect system and why does it matter for LLM pipelines?

An effect system like ZIO or Cats Effect IO makes operational behavior explicit in the return type. IO[Either[LlmError, Response]] says the function performs an effect, can fail with LlmError, and returns Response on success. Python’s async/await handles concurrency, but failure modes usually live in exceptions. Effect systems make retry, timeout, and circuit-breaker logic composable and visible.

How does Scala compare to TypeScript for building type-safe LLM pipelines?

TypeScript gives you compile-time types with a larger ecosystem and talent pool. With discriminated unions and strict settings, it can model many variants safely. Scala’s advantage is sealed hierarchies with compiler-backed exhaustiveness, a mature effect ecosystem, and Either as a standard backend idiom. If your team already uses TypeScript, Scala is a larger bet. If your team already uses Scala, the cost of adding AI tooling is lower.
