
LLMs are mediocre at Scala. If you don’t push back, your codebase will be too.

By Piotr Kosecki, Senior Scala Engineer at Scalac
Everyone is talking about AI making developers 10x more productive. After several months of using it daily on production Scala systems, here’s what I actually think: AI makes me faster at delivering features. It makes me slower at keeping the codebase worthy of the language I’m using. The two things are not the same and confusing them is how teams accumulate debt they don’t see coming.
The real speedup: onboarding and boilerplate
When I join a new repository, I now start by having an agent analyze it. I use Antigravity with Gemini and Codex with GPT interchangeably, depending on the task. This alone cuts my onboarding time significantly, not because the agent understands the system better than I would, but because it answers the obvious questions faster. Where is the configuration? What does this module actually do? Why does this dependency exist?
I also generate a lot of code in natural language instead of typing it out. Boilerplate, simple data transformations, test scaffolding, routine configuration. This is where the speedup is real and I don’t pretend otherwise.
Research backs this. A JetBrains longitudinal study presented at ICSE 2026, analyzing two years of logs from 800 developers, found that AI doesn’t uniformly reduce effort; instead, it redistributes effort across more fragmented, reactive, and cognitively demanding workflows. That matches exactly what I experience: the time saved on generation went into verification, not free time. Developers now spend over 50% of their time evaluating and editing AI-generated output. Of the code initially accepted, 18% is later deleted and another 6% heavily rewritten.

The real problem: AI doesn’t write Scala
More precisely: it writes Java that compiles in Scala. And in a language with Scala’s expressiveness, that is a specific failure mode.
The generated code works. It passes tests. It does what the ticket says. But it doesn’t use the type system to make illegal states unrepresentable. It doesn’t reach for a typeclass where a typeclass is the right abstraction. It defaults to mutable state when pure functions would do. It produces null checks instead of Option, try-catch blocks instead of Either. It writes the solution a competent Java developer would write – which, in Scala, is often not good enough.
This isn’t a minor aesthetic complaint. Scala’s value in production systems (the reason we use it at Scalac for financial platforms, data pipelines, and distributed systems) comes precisely from the discipline of the type system. AI-generated code that ignores that discipline compiles today and becomes the module nobody wants to touch in six months.
Ox Security’s 2025 analysis of over 300 repositories confirmed this pattern industry-wide. They found ten recurring anti-patterns present in 80–100% of AI-generated code: incomplete error handling, weak concurrency management, inconsistent architecture. Their conclusion: the core risk is not that the code resembles junior-level output, but that it reaches production faster than traditional review processes can safely manage,
The good news: when you point this out to the agent, it usually fixes it. The bad news: you need to know what to point out.
This constant friction is why, at scalac.ai, we’ve stopped treating AI as a standalone creator. We’ve integrated it into a strict engineering framework where the LLM is just a tool to populate our Scala-native blueprints. When we build for FinTech or AdTech, we use Akka and ZIO to enforce the rules that AI tends to ignore. We’re not here to “generate more code”, we’re here to ship systems that survive production, which is a nuance most models haven’t learned yet.
Where it genuinely doesn’t help (yet)
UI design. I’ve noticed that different models have a preferred style, and that style bleeds into every project they touch. Ask the same model to design UI for a logistics dashboard and a healthcare app, and both will look like variations of the same template. AI doesn’t have taste in the way a designer does; it has a statistical mode of what “good UI” looks like across its training data.
Anything underspecified. Before AI, a feature ticket was a full user story with subtasks, discussed and refined. Implementation took a week. Now I get a one-line description with no subtasks. Implementation takes a day, but the time saved in implementation was transferred to figuring out what the feature actually should do, which now falls on me as the engineer. Implementation time fell; total task time didn’t. The industry is confusing the two.
My actual setup: local models where it matters
I’m a local model enthusiast, but not a fundamentalist. The decision between SaaS and self-hosted is an engineering decision, not an ideology.
For on-demand LLM calls in web applications, I use SaaS APIs. The operational overhead of self-hosting isn’t worth it until token costs become significant at scale — roughly when self-hosting pays back in 6–12 months at your volume. That tipping point varies significantly; one approach is to calculate your monthly token spend and compare it against infrastructure costs at your expected utilization.
For pipelines where I control the execution environment, I use local models wherever possible. My side project, podcastoteka.pl, is a concrete example: I pull a podcast episode, push it through a local transcription model, then through a local LLM for summarization. Only the embedding model is remote, and that’s a deliberate choice. Embedding models must be identical between the pipeline that creates the index and the production service that queries it. Changing the embedding model means re-indexing everything. Once you standardize on a remote embedding model, you’re committed.
The hybrid approach is increasingly the professional standard. A BenchLM analysis from April 2026 put it clearly: many teams use proprietary APIs for complex tasks (coding, reasoning, agents) and self-hosted open models for high-volume, simpler workloads. This isn’t a compromise, it’s the right architecture for most production setups.
One non-obvious observation from using these tools in production: Claude performs significantly better than other models at cloud CLI deployment tasks – GCP in particular. The practical implication is that model selection should be task-specific, not model-loyal. The model that handles your architecture discussion isn’t necessarily the model that handles your deployment pipeline.
The thing that actually frustrates me
People overestimate AI capabilities in a specific way: they assume that underdefined requirements plus AI equals a working feature. They reduce the ticket to a single line of text and expect the output to match what they had in mind.
This puts the engineer in an impossible position. You’re no longer just implementing, but you’re also doing product discovery, requirement analysis, and design, then implementing. The cognitive load didn’t decrease; it changed shape. And when the feature doesn’t match expectations, the gap is attributed to the engineer’s execution, not the requirement’s incompleteness.
Before AI: a well-specified ticket took a week to implement. After AI: a poorly specified ticket takes a day to generate and two days to revise when it turns out nobody agreed on what it should actually do.
This is the fake promise of AI in development workflows — not that it can’t code, but that it removes the need for clear thinking upstream. It doesn’t.
What this means for teams using Scala seriously
Scala allows you to write code at very different levels of quality. AI consistently aims for the median: functional, readable by Java standards, idiomatic by none. For teams where the type system is a genuine design tool rather than a syntax requirement, this creates a specific discipline challenge: using AI without losing what makes the language worth using.
The workflow that works for me: generate freely, review aggressively, and when the generated code is structurally wrong, don’t just fix it, instead teach the agent what’s wrong and why. It usually gets it. That feedback loop is the actual productivity gain. Not “AI writes the code” but “AI writes a draft I can critique faster than I could write from scratch.”
The agent is fast. Judgment is still yours.
Piotr Kosecki is a Senior Scala Engineer at Scalac, specializing in production-grade AI integration. He helps clients move beyond AI prototypes into stable, scalable JVM architectures. If you’re worried about AI technical debt in your project, check out Scalac’s AI Architecture Stress Test.



