Scalac blog hero image showing AI agents as a distributed system with connected service nodes.

Scala in the AI Era: Why Type Safety Matters for LLM Systems


The alert fires at 3:17 AM.

payment-service · prod
ERROR
AttributeError: 'NoneType' object has no attribute 'choices'
Traceback (most recent call last):
  File "adapters/llm.py", line 88
    return response.choices[0].message
  File "adapters/llm.py", line 71
    response = openai.chat.completions.create(…)
  File "openai/client.py", line 203
    raise RateLimitError(…)
Retries exhausted. Falling back to null.
500 requests failed. customer_id=usr_8f3a2..usr_b19c7

The bug had been there since the last deploy. It passed tests. It passed code review. Python had no objection to the line that finally crashed — the failure contract was not in the function signature. The code looked as if it returned a completion. In reality, it could return a completion, raise, retry, time out, or fall back to None. The difference stayed invisible until the rate limiter fired at 3 AM.

In a Scala version of the same pipeline, this exact failure mode would be modeled as data: Either[LlmError, Response], Option[Response], or a sealed response type. The compiler cannot prevent rate limits. What it can prevent is treating a failed LLM call as if it were a successful response.

Caught in the editor

Before the code left the developer’s machine — 0 seconds in production
LlmClient.scala
def complete(prompt: Prompt): Either[LlmError, Response] = …
val response = complete(prompt)
val message = response.choices.head.message // ← does not compile
send(message)
compile error: value choices is not a member of Either[LlmError, Response]

Caught at 3:17 AM

After deploy, in production — 8 minutes of failed transactions
datadog / payment-service / prod ERROR
2026-03-14 03:17:42 ERROR payment-worker AttributeError: 'NoneType' object has no attribute 'choices'
Traceback (most recent call last):
  File "workers/payment.py", line 142
    response = llm_client.complete(prompt)
  File "adapters/llm.py", line 88
    return response.choices[0].message
  File "adapters/llm.py", line 71
    response = openai.chat.completions.create(…)
  File "openai/client.py", line 203
    raise RateLimitError(…)
2026-03-14 03:17:43 ERROR retry-handler Retries exhausted. Falling back to null.
2026-03-14 03:17:43 ERROR payment-worker Transaction abandoned. customer_id=usr_8f3a2
Compile time — the type system refuses to treat a failed LLM call as a successful response. The error path must be handled before deploy.
Runtime — the system runs until it doesn’t. Someone gets paged. The failure is the same shape — caught in two very different places.
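The compile-time panel can be sketched with the standard library alone. This is a minimal illustration, not the code from any SDK: LlmError, Response, complete, and firstMessage are hypothetical names, and the stub stands in for a real client call.

```scala
// Hypothetical domain types; the names are illustrative, not taken from any SDK.
enum LlmError:
  case RateLimited(retryAfterSeconds: Int)
  case Timeout
  case EmptyCompletion

final case class Response(choices: List[String])

// Stand-in for the real client call: a failure comes back as a value, never as null.
def complete(prompt: String): Either[LlmError, Response] =
  if prompt.isEmpty then Left(LlmError.EmptyCompletion)
  else Right(Response(List(s"echo: $prompt")))

// `complete(prompt).choices` does not compile, because Either has no `choices` member.
// The caller has to unwrap the Either and decide what each branch means.
def firstMessage(prompt: String): String =
  complete(prompt) match
    case Right(response) =>
      response.choices.headOption.getOrElse("completion had no choices")
    case Left(LlmError.RateLimited(seconds)) =>
      s"rate limited, retry in ${seconds}s"
    case Left(LlmError.Timeout) =>
      "timed out"
    case Left(LlmError.EmptyCompletion) =>
      "empty completion"
```

The string results are only there to keep the sketch short. In a real pipeline each Left branch would more likely map to a retry, a queue, or a typed error returned to the caller.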

The Python-first AI stack, and where it gets complicated

Python won because it moved fast. LangChain, Pydantic, Jupyter, HuggingFace — the entire AI tooling ecosystem was built in Python, which meant every new model, every new API, every new capability was available there first. A team could prototype an LLM pipeline in a day.

That speed comes with a structural cost that’s easy to miss early on. Python’s type hints are annotations, not contracts. mypy and pyright enforce them if you opt in, and by 2026 both tools are meaningfully better than they were three years ago. Pydantic v2 is faster and stricter. TypedDict and TypeAdapter give structural typing for dict-heavy codebases. These are real improvements.

What they don’t fully change is the architecture. Modern Python SDKs are much better than raw dictionaries: OpenAI’s Python SDK exposes typed Pydantic response models, and strict type checking can catch many mistakes earlier. But the guarantee still depends on gradual typing discipline across the surrounding codebase: retries, streaming chunks, tool-call variants, optional fields, and fallback paths. And validation happens late. By the time the LLM response is parsed or validated, the request has already been made, the model has already run, and the tokens have already been billed.

This works fine for a two-call pipeline. The problems compound when the pipeline grows — five services, three models, retry logic, tool calls, structured output parsing, streaming responses. The number of places where a None can appear without anyone expecting it multiplies at every new integration point.

Eightfold.ai published a case study on introducing static type checking into a 4M+ line Python monorepo. The rollout used MonkeyType for automated type inference, a custom mypy configuration system, and CI enforcement. The case illustrates the structural cost of gradual typing: Python can be made much safer, but the safety depends on adoption, configuration, and keeping untyped or loosely typed parts of the codebase from becoming escape hatches.

To be fair to the modern Python stack: with pyright in strict mode, Pydantic v2’s TypeAdapter, and disciplined schema management, a well-maintained pipeline mitigates most of this. What it still lacks is compile-time exhaustiveness checking — the guarantee that every response variant is handled before the code runs, not just the ones you remembered to test.

Where types meet non-determinism

LLMs add a layer that makes runtime validation harder. A traditional API usually gives you a predictable success response or a clear HTTP error. An LLM integration can fail in less obvious ways: a provider error, a timeout, a streaming chunk without the field you expected, a tool-call variant you did not handle, or a response that is syntactically valid JSON but semantically wrong. The dangerous part is not just that failures happen. It is that they often look like ordinary data until a later step assumes too much.

Pydantic, structured outputs, and strict schemas reduce this problem significantly. They constrain and validate the shape of model output. But they do not turn the whole orchestration layer into a compile-time contract. The API call still happens before output validation, and the rest of the pipeline still has to model retries, refusals, tool calls, partial streams, and provider errors explicitly.

One clarification worth making: the 3 AM alert in the intro is not purely a typing problem. It is also an error handling failure — the retry logic fell back to null instead of surfacing the error. Python can implement this correctly with explicit checks, typed SDK models, tenacity, and disciplined validation.

What Scala adds is that Either[LlmError, Response] or a sealed response type makes the error path impossible to ignore by accident. Types and resilience patterns are separate concerns — Scala’s contribution is that it can encode those resilience patterns directly in the function contract.
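As a rough sketch of what encoding a resilience pattern in the contract can look like, the retry helper below returns the exhausted-retries case as a value instead of falling back to null. It uses only the standard library and has no backoff or delay. The names withRetries and RetriesExhausted are assumptions made for this example, not part of any library mentioned here.

```scala
import scala.annotation.tailrec

// Hypothetical error model for this sketch.
enum LlmError:
  case RateLimited
  case Timeout
  case RetriesExhausted(attempts: Int, last: LlmError)

final case class Response(text: String)

// Calls `call` up to `maxAttempts` times (at least once). When the budget runs
// out, the last error is wrapped and returned as a Left, so there is no null
// fallback for a caller to forget about.
def withRetries(maxAttempts: Int)(
    call: () => Either[LlmError, Response]
): Either[LlmError, Response] =
  @tailrec
  def loop(attempt: Int): Either[LlmError, Response] =
    call() match
      case Right(response) => Right(response)
      case Left(error) if attempt >= maxAttempts =>
        Left(LlmError.RetriesExhausted(attempt, error))
      case Left(_) => loop(attempt + 1)
  loop(1)

// A provider that is always rate limited yields a typed failure.
val alwaysLimited = withRetries(3)(() => Left(LlmError.RateLimited))
```

Because the result is still an Either, the caller of withRetries faces the same obligation as the caller of a single request: the exhausted case has to be matched or explicitly passed on.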

The compiler can’t make LLMs deterministic. What it can do is make every boundary between your code and the LLM’s output explicit — so that structural mismatches are caught before the pipeline runs, not after it fails.

Three boundaries where types prevent failures

Figure 02. Same code path, different checkpoint. The invalid happy-path assumption does not reach production; it stops at the type boundary.

Python: Write → LLM → Logic → Deploy → 3:17 AM. AttributeError in production. 500 requests failed before anyone noticed.

Scala: the same steps (Write, LLM, Logic, Deploy), interrupted by a compile error: unhandled failure case: Either[LlmError, Response]. The pipeline stops there. Deploy never happens, and neither does the 3 AM error. The greyed-out steps don’t fail quietly. They never run.

Every LLM pipeline has three places where types surface failures earlier — before they reach a running system instead of after.

Input to prompt. In Python, f"Summarize: {user_input}" will run as long as something printable reaches the template. With strict typing and validation, you can catch many bad inputs earlier, but that discipline has to be applied consistently across every upstream service. In Scala, summarize(userInput: UserInput) can make the boundary explicit: if the previous step returns Either[UserInputError, UserInput], the compiler forces you to handle the error before building the prompt.
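A minimal sketch of that input boundary, assuming a hypothetical UserInput type with a validating constructor. The 4000-character limit and the error cases are invented for illustration.

```scala
enum UserInputError:
  case Empty
  case TooLong(length: Int, max: Int)

// The private constructor means the only way to obtain a UserInput is `parse`.
final case class UserInput private (text: String)

object UserInput:
  val MaxLength = 4000 // assumed limit, chosen for illustration

  def parse(raw: String): Either[UserInputError, UserInput] =
    val trimmed = raw.trim
    if trimmed.isEmpty then Left(UserInputError.Empty)
    else if trimmed.length > MaxLength then
      Left(UserInputError.TooLong(trimmed.length, MaxLength))
    else Right(UserInput(trimmed))

// Accepts only validated input; passing a raw String does not type-check.
def summarize(input: UserInput): String =
  s"Summarize: ${input.text}"

// The error has to be carried or handled before a prompt can exist.
def buildPrompt(raw: String): Either[UserInputError, String] =
  UserInput.parse(raw).map(summarize)
```

The design choice is that validation runs once, at the edge, and every later step receives a type that can only hold validated text.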

LLM output to structured data. Scala’s sealed traits model response variants exhaustively:

sealed trait LlmResponse
case class TextResponse(content: String) extends LlmResponse
case class ToolCall(name: String, arguments: Json) extends LlmResponse
case class Refusal(reason: String) extends LlmResponse

def handle(response: LlmResponse): IO[Unit] = response match
  case TextResponse(text)   => processText(text)
  case ToolCall(name, args) => executeTool(name, args)
  case Refusal(reason)      => logRefusal(reason)
// The compiler warns if a case is missing; with -Werror the warning fails the build

When a new response variant is added to the sealed hierarchy, every non-exhaustive pattern match can be flagged by the compiler. Python can model variants with Union types and static analyzers, but exhaustiveness depends on tool configuration and coding discipline. In Scala, this check is part of the normal compiler feedback loop.

Tool calls to external APIs. Python frameworks such as LangChain can generate JSON Schema from tool functions, and that works well for many use cases. The weak point is the runtime boundary: malformed or unexpected LLM arguments surface during the agent loop, after the model call has already happened. In sttp-ai, AgentTool.fromFunction[T] derives the tool schema from a case class. The schema and the function input stay tied to one definition, and malformed LLM arguments decode into an explicit error before the tool function executes.

The pattern is the same at all three boundaries. Python can validate around dynamic behavior. Scala can rule out many structurally invalid states at compile time, before the code runs.

sttp-ai in practice

The Scala AI tooling stack in 2026 is production-ready. Multiple Scala client libraries cover OpenAI, Anthropic, Google Gemini, Azure, Ollama, DeepSeek, and OpenAI-compatible endpoints. sttp-ai 0.4.x supports native Claude integration, Claude 4.1+ structured outputs, tool calling, and streaming across fs2, ZIO, Akka/Pekko Streams, and Ox. Metals’ MCP integration also gives AI assistants access to compiler feedback, symbol inspection, and test execution inside the development loop.

A type-safe OpenAI call with ZIO:

import sttp.ai.openai.OpenAI
import sttp.client4.httpclient.zio.HttpClientZioBackend
import zio.*

val program = ZIO.scoped {
  for {
    backend  <- HttpClientZioBackend.scoped()
    client    = OpenAI.fromEnv
    response <- client.createChatCompletion(body).send(backend)
    _        <- response.body match {
      case Right(r) => Console.printLine(r.choices.head.message.content)
      case Left(e)  => ZIO.fail(s"OpenAI error: ${e.getMessage}")
    }
  } yield ()
}

The return type of response.body is Either[ResponseError, ChatCompletionResponse]. The success/error split is explicit in the type, so caller code has to decide what happens in both cases instead of treating every response as successful by default.

With Cats Effect:

import sttp.ai.openai.OpenAI
import sttp.client4.httpclient.cats.HttpClientCatsBackend
import cats.effect.IO

val program: IO[Unit] = HttpClientCatsBackend.resource[IO].use { backend =>
  val client = OpenAI.fromEnv
  client.createChatCompletion(body).send(backend).flatMap { response =>
    response.body match
      case Right(r) => IO.println(r.choices.head.message.content)
      case Left(e)  => IO.raiseError(new RuntimeException(e.toString))
  }
}

Both ZIO and Cats Effect track effects explicitly in the return type. The IO wrapper is not an advanced feature to be added later. It is the foundation that makes retry logic, timeouts, and circuit breakers composable:

def callWithRetry(
  prompt: String,
  maxRetries: Int = 3
): ZIO[Any, LlmError, Response] =
  callLLM(prompt)
    .retry(Schedule.recurs(maxRetries) && Schedule.exponential(100.millis))
    .catchAll { error =>
      ZIO.logError(s"LLM call failed after $maxRetries retries: $error") *>
      ZIO.fail(LlmError.MaxRetriesExceeded(error))
    }

The retry schedule, the error type, and the success type are visible at the boundary of the function. You still need good runtime policies and observability, but the shape of failure is no longer hidden in comments, decorators, or undocumented exception paths.

The agent loop: typed tools vs **kwargs

The tool call boundary is where the difference between Python and Scala is most visible in practice.

A LangChain tool in Python:

from langchain_core.tools import tool

@tool
def get_weather(location: str, unit: str = "celsius") -> str:
    """Return the current weather for a location."""
    return f"{location}: 22°C"

The @tool decorator can generate a JSON Schema for the function, which is useful and often enough for prototypes. The problem is that the contract still lives at a runtime boundary: if the LLM passes malformed arguments, the agent discovers the issue during execution, not when the orchestration code is compiled.

In sttp-ai:

case class WeatherInput(location: String, unit: String = "celsius")
  derives SnakePickle.ReadWriter, Schema

val weatherTool = AgentTool.fromFunction("get_weather", "Get current weather") {
  (input: WeatherInput) => s"${input.location}: 22°C"
}

AgentTool.fromFunction derives the JSON Schema from the case class. If the LLM returns arguments that do not decode into WeatherInput, for example a missing required field or a value of the wrong type, the decoder fails explicitly (for example as Either[DecodeError, WeatherInput]) before the tool function is called. The tool schema and the function input are derived from the same definition.
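The decode-before-execute idea can be sketched without any library. The example below is hand-written and does not show how sttp-ai implements decoding: DecodeError, decodeWeatherInput, and runTool are hypothetical, the arguments are assumed to be already parsed into a flat string map, and the set of accepted units is an assumption.

```scala
enum DecodeError:
  case MissingField(name: String)
  case InvalidValue(name: String, value: String)

final case class WeatherInput(location: String, unit: String = "celsius")

// Assumes the raw tool-call arguments were already parsed into a flat string map.
def decodeWeatherInput(args: Map[String, String]): Either[DecodeError, WeatherInput] =
  for
    location <- args.get("location").toRight(DecodeError.MissingField("location"))
    unit      = args.getOrElse("unit", "celsius")
    _        <- Either.cond(
                  Set("celsius", "fahrenheit").contains(unit), // assumed allowed units
                  (),
                  DecodeError.InvalidValue("unit", unit)
                )
  yield WeatherInput(location, unit)

// The tool body only ever sees a fully decoded WeatherInput.
def getWeather(input: WeatherInput): String = s"${input.location}: 22°C"

// Decoding failures short-circuit: getWeather is not called on a Left.
def runTool(args: Map[String, String]): Either[DecodeError, String] =
  decodeWeatherInput(args).map(getWeather)
```

A derived decoder replaces the hand-written for-comprehension, but the shape of the result is the point: the tool function is unreachable unless decoding succeeded.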

Figure 03. One definition, fewer drift points.

Python (schema drift risk):
① Pydantic model
class PersonInfo(BaseModel):
    name: str
    age: int
    occupation: Optional[str] = None
② JSON Schema sent to OpenAI, after being exported, copied, wrapped, or maintained separately
{
  "name": { "type": "string" },
  "age": { "type": "integer" },
  "job": { "type": "string" }  // field renamed, Pydantic model not updated
}
The copied schema and the model are out of sync. OpenAI returns "job". Pydantic expects "occupation". ValidationError at runtime.

Scala (one definition):
case class PersonInfo(
  name: String,
  age: Int,
  occupation: Option[String] = None
) derives ReadWriter, Schema
The compiler derives both the JSON Schema sent to OpenAI and the types used in business logic. Change the field name once, the schema updates automatically, and the compiler verifies the rest.

Python can keep this safe when the model and schema stay generated together. Drift starts when teams copy, patch, or version schemas separately. Scala keeps the schema and business type closer to one definition, and fewer moving pieces means fewer places for contracts to diverge.

This is the same pattern as structured outputs with Tapir. You define the case class once, then derive the JSON Schema sent to the model and the types used in business logic from that same source:

case class PersonInfo(
  name:       String,
  age:        Int,
  occupation: Option[String] = None
) derives ReadWriter, Schema

// Tapir derives JSON Schema from the case class
val jsonSchema = TapirSchemaToJsonSchema(
  implicitly[Schema[PersonInfo]],
  markOptionsAsNullable = true
)
// Sent to OpenAI as response_format.
// The model is constrained to return valid JSON matching this structure.
// Decode error is explicit in the return type — not a runtime surprise.

The Python equivalent with Pydantic v2 can also generate a JSON Schema via PersonInfo.model_json_schema(), and that is the right approach in Python. The structural difference is what happens when contracts move across services, wrappers, prompts, and tool registries. In Scala, changing the case class forces the compiler to show every call site affected by the new shape.

The full agent loop with typed tools and explicit error handling:

def agentLoop(userQuery: String): IO[String] =
  for {
    response <- llmClient.chat(userQuery, tools = List(weatherTool))
    result   <- response match {
      case TextResponse(text)   => IO.pure(text)
      case ToolCall(name, args) =>
        executeTool(name, args).flatMap(result =>
          agentLoop(s"Tool returned: $result")
        )
      case Refusal(reason)      => IO.pure(s"Refused: $reason")
    }
  } yield result

The pattern match on response is exhaustive. Add a new variant to LlmResponse and every call site gets a compiler warning.
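The loop above recurses for as long as the model keeps requesting tools. One possible variation, sketched here with the standard library and synchronous stubs instead of IO, adds a step budget and returns the exhausted case as a typed error. The names runAgent, AgentError, and StepBudgetExceeded are assumptions for this example, and the chat and executeTool parameters are placeholders for a real client and tool registry.

```scala
import scala.annotation.tailrec

sealed trait LlmResponse
final case class TextResponse(content: String) extends LlmResponse
final case class ToolCall(name: String, arguments: String) extends LlmResponse
final case class Refusal(reason: String) extends LlmResponse

enum AgentError:
  case StepBudgetExceeded(maxSteps: Int)

// `chat` and `executeTool` are placeholders for the real client and tool registry.
def runAgent(
    query: String,
    maxSteps: Int,
    chat: String => LlmResponse,
    executeTool: (String, String) => String
): Either[AgentError, String] =
  @tailrec
  def loop(message: String, stepsLeft: Int): Either[AgentError, String] =
    if stepsLeft <= 0 then Left(AgentError.StepBudgetExceeded(maxSteps))
    else
      chat(message) match
        case TextResponse(text)   => Right(text)
        case Refusal(reason)      => Right(s"Refused: $reason")
        case ToolCall(name, args) =>
          // Feed the tool result back to the model and spend one step.
          loop(s"Tool returned: ${executeTool(name, args)}", stepsLeft - 1)
  loop(query, maxSteps)
```

With a budget in the signature, a model that never stops calling tools becomes one more case the caller has to handle, not a service that hangs.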

Effects: IO as a contract, not a wrapper

Teams that treat IO and ZIO as advanced features to be added “once the project stabilizes” typically end up rewriting core logic when they need to add retries, timeouts, or parallel calls. Effects are easier to add at the start than to retrofit.

The reason is simple. A function typed IO[Either[LlmError, Response]] is a complete specification:

def callLLM(prompt: String): IO[        // async — runs on a fiber
  Either[                                 // explicit error channel
    LlmError,                             // what can go wrong
    Response                              // what you get on success
  ]
]

The Python equivalent:

def call_llm(prompt: str) -> ChatCompletion:
    # may raise openai.APITimeoutError
    # may raise openai.RateLimitError
    # may return a response with optional / missing content
    # may fail during streaming or tool-call parsing
    # the surrounding code must still model these paths explicitly
    ...

The Python signature can be typed, especially with modern SDKs. But unless the surrounding code models failures as part of the contract, important paths still live in exceptions, comments, framework conventions, or runtime validation. The Scala signature makes the failure channel part of the value every caller must handle.
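When the underlying client throws, the failure channel can be built once at the boundary. The sketch below uses only the standard library and a synchronous call in place of IO. The callLlm wrapper, the unsafeCall parameter, and the error cases are hypothetical names for this example.

```scala
import java.util.concurrent.TimeoutException
import scala.util.{Failure, Success, Try}

enum LlmError:
  case Timeout
  case EmptyContent
  case Unexpected(message: String)

final case class Response(content: String)

// `unsafeCall` stands in for any client that may throw or return empty content.
// Non-fatal exceptions are caught here once and converted into typed errors.
def callLlm(prompt: String)(unsafeCall: String => String): Either[LlmError, Response] =
  Try(unsafeCall(prompt)) match
    case Success(content) if content != null && content.nonEmpty =>
      Right(Response(content))
    case Success(_) =>
      Left(LlmError.EmptyContent)
    case Failure(_: TimeoutException) =>
      Left(LlmError.Timeout)
    case Failure(other) =>
      Left(LlmError.Unexpected(Option(other.getMessage).getOrElse(other.getClass.getName)))
```

Everything downstream of callLlm works with Either[LlmError, Response], so the exception paths listed in the Python comments become cases in a match.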

On the operational side, long-running LLM orchestration services are a good fit for the JVM: throughput is rarely the bottleneck when most of the time is spent waiting on model APIs, databases, queues, or external tools. For serverless deployments with strict cold-start requirements, GraalVM native-image can reduce startup time and memory usage significantly, but the exact numbers depend on the application, dependencies, and deployment target.

When AI writes the code, the compiler is the first reviewer

AI-generated code changed where the bottleneck sits. Writing used to be slow; reviewing was fast. Now writing is fast and reviewing is where the time goes.

That shift makes static types more valuable, not less. The compiler reviews structural correctness before human review begins. A 2025 paper on type-constrained code generation found that 94% of LLM-generated compilation errors in its benchmark were type-check failures; GitHub’s 2026 analysis used that result to explain why typed languages are becoming more attractive in AI-assisted development. This is not a Scala-only statistic, and it should not be read as “94% of production bugs disappear.” The useful takeaway is narrower and more practical: many AI-generated mistakes are boring but dangerous — wrong field names, missing variants, unhandled errors, invalid return types, or optimistic assumptions about optional data. In a dynamically typed pipeline, these can remain syntactically valid until tests or production traffic exercise the wrong path.

The TypePilot paper points in the same direction: type-guided workflows can improve the trustworthiness of LLM-generated code in high-assurance domains. The practical takeaway is simple. The type system narrows the space of valid outputs, and the compiler gives the AI assistant immediate feedback. The loop becomes: write, compile, fix — not write, deploy, discover.

Metals’ MCP integration takes this further. AI agents can use compiler feedback, inspect symbols, and run tests inside the development loop. The compiler and the AI assistant operate against the same structured representation of the code.

When to choose Scala, when to stay with Python

The honest trade-offs first.

Scala’s learning curve is real. An engineer coming from Python will need time before the effect system and type inference feel natural. The Scala talent pool is smaller than Python’s — and smaller than TypeScript’s, which is worth naming here.

TypeScript gives you compile-time types with a much larger ecosystem and a bigger talent pool. It can also model discriminated unions and enforce exhaustiveness with the right patterns. Scala’s advantage is different: a mature effect ecosystem, sealed hierarchies, functional error handling as a standard idiom, and compiler feedback that backend teams already use to model domain and infrastructure boundaries. If your team already writes TypeScript, Scala is a larger organizational bet. If your team already writes Scala, the AI tooling is production-ready.

Python stays on the table for GPU-bound work. vLLM, TGI, and the major model providers’ Python SDKs are the standard for model serving. The practical architecture: Python for inference, Scala for orchestration, connected via HTTP or gRPC. Scala owns the boundary layer — the three places where types prevent failures. Python owns the layer where GPU utilization matters.

Scala makes the most sense for teams that are:

  • Moving an LLM pipeline from prototype to production
  • Running multiple LLM services with typed inter-service contracts
  • Already running Scala for other backend services
  • Finding that runtime failures in Python pipelines are eating engineering time

It makes less sense for a two-person team with a single pipeline, moving fast, where the compile-time overhead outweighs the reliability benefits at current scale.

What this looks like in practice

Python built the AI prototypes. The industry needed that speed. The tooling we have today exists because Python made experimentation cheap and fast.

Production reliability requires something different. The three boundaries — input, output, tool calls — are where non-deterministic LLM behavior intersects with typed code. Scala’s compiler can surface structural failures at each of those boundaries before the code ships. It does not prevent semantic errors or hallucinations — the LLM can return valid JSON that is structurally correct and meaningfully wrong. What types give you is a guarantee that the shape is what you expect, and that every structural variant is handled.

The vibecoding shift makes this more relevant, not less. When AI generates most of the code, the compiler becomes the first reviewer. That’s a meaningful change to where errors get caught.

If your team is moving LLM pipelines from prototype to production and wants to evaluate whether a type-safe foundation fits your stack, leave your email or get in touch at projects@scalac.io. We’ll look at your current pipeline, identify the boundaries where failures are most likely to surface, and give you a concrete assessment of where Scala helps — and where Python should stay.

FAQ

What is sttp-ai and how does it differ from the Python OpenAI SDK?

sttp-ai is a type-safe Scala client library for OpenAI, Anthropic, and OpenAI-compatible endpoints. The Python OpenAI SDK is also typed and uses Pydantic response models, so the difference is not “typed vs untyped SDK”. The difference is how far the type contract extends into the orchestration layer. In sttp-ai, response bodies can expose the success/error split as Either[ResponseError, ChatCompletionResponse], so caller code has to handle both paths. sttp-ai 0.4.x supports native Claude integration, Claude 4.1+ structured outputs, streaming via fs2/ZIO/Akka/Pekko/Ox, and tool calling.

Do I need to rewrite my entire Python pipeline in Scala?

No. The recommended path is incremental: start with the boundary layer — typed inputs, structured outputs, tool schemas, retry/error contracts, and inter-service APIs. Keep Python for GPU-bound model serving, experimentation, and the parts of the ecosystem where Python is strongest. Use Scala where orchestration, typed contracts, and failure handling matter most, communicating with Python services via HTTP or gRPC. Scala does not need to replace Python — it needs to own the boundary layer where type mismatches and unhandled failure states become production incidents.

Is Scala fast enough for production LLM workloads?

For long-running LLM orchestration services, the JVM with JIT compilation is a strong fit — services spend most of their time waiting on LLM API calls, databases, queues, or tools, not executing CPU-bound code. For serverless deployments, GraalVM native-image can reduce startup time and memory usage, but exact numbers depend on the application and deployment target. Incremental compilation adds feedback time compared with Python, but it also catches a class of structural failures before runtime.

What is an effect system and why does it matter for LLM pipelines?

An effect system (ZIO or Cats Effect IO) makes the operational behavior of a function explicit in its return type. A function typed IO[Either[LlmError, Response]] communicates: it performs an effect, it can fail with LlmError, and on success it returns Response. Python’s async/await handles concurrency, but failure modes usually live in exceptions, conventions, or validation code around the call. Effect systems make retry logic, timeouts, and circuit breakers composable in a way that is visible before the code runs.

How does Scala compare to TypeScript for building type-safe LLM pipelines?

TypeScript provides compile-time types with a larger ecosystem and talent pool, and with discriminated unions plus strict compiler settings it can model many response variants safely. Scala’s advantage is specific: sealed hierarchies, compiler-backed exhaustiveness checks, a mature effect ecosystem, and functional error handling via Either as a standard backend idiom. If your team already writes TypeScript, Scala is a larger organizational bet. If your team already writes Scala, the AI tooling is production-ready as of 2026.

