Top 15 Scala Libraries for Data Science in 2023

30.03.2021 / By Marcin Krykowski

Python and R are usually the first choice when it comes to data science. However, if we look a bit deeper, Scala is steadily gaining popularity. Thanks to its functional nature, it is a common choice for building powerful applications that need to process vast amounts of data. Moreover, thanks to tools like Spark, Scala is becoming one of the top choices for creating modern machine-learning models. In this article, we will briefly look at the libraries that can help us with our first custom ML algorithm.

Breeze (stars: 3200)

Breeze is one of the most popular Scala libraries for numerical data processing. It combines the functionalities of Matlab and the NumPy library from Python. A result of a project merging ScalaNLP and Scalala, it also contains lots of functions that are helpful with linear algebra, optimization, matrix operations, analytical computing, and statistics.
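To give a feel for the NumPy-like style, here is a minimal sketch of basic linear algebra in Breeze. It assumes the Breeze artifact ("org.scalanlp" %% "breeze") is on the classpath; the object name BreezeSketch and the sample values are made up for illustration.

```scala
// Assumption: "org.scalanlp" %% "breeze" is on the classpath.
import breeze.linalg.{DenseMatrix, DenseVector, sum}
import breeze.stats.mean

// Hypothetical object name, used only for this illustration.
object BreezeSketch {
  // A 2x2 matrix built row by row, and a vector of length 2.
  val m: DenseMatrix[Double] = DenseMatrix((1.0, 2.0), (3.0, 4.0))
  val v: DenseVector[Double] = DenseVector(1.0, 1.0)

  // Matrix-vector product: each entry is the dot product of a row of m with v.
  val product: DenseVector[Double] = m * v

  // Simple reductions over the resulting vector.
  val total: Double = sum(product)
  val average: Double = mean(product)

  def main(args: Array[String]): Unit = {
    println(product)
    println(total)
    println(average)
  }
}
```

The operators read much like their Matlab or NumPy counterparts, while the element types are still checked by the Scala compiler.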

Spire (stars: 1600)

A numerical library for Scala aiming to be nimble, precise, and generic. By using type classes, implicits, macros, and other Scala features, Spire tries to overcome the usual trade-off between performance and precision by delivering both. The library also allows its users to write efficient numeric code without having to “bake in” particular numeric representations.
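The idea of not baking in a numeric representation can be sketched with a single generic function. This is a minimal illustration, assuming the Spire artifact ("org.typelevel" %% "spire") is on the classpath; SpireSketch and mean are hypothetical names chosen for this example.

```scala
// Assumption: "org.typelevel" %% "spire" is on the classpath.
import spire.algebra.Field
import spire.implicits._
import spire.math.Rational

// Hypothetical object name, used only for this illustration.
object SpireSketch {
  // Generic arithmetic mean: works for any type A that has a Field instance
  // (addition, multiplication, division), not for one hard-coded number type.
  def mean[A: Field](xs: Seq[A]): A = {
    require(xs.nonEmpty, "mean of an empty sequence is undefined")
    xs.reduce(_ + _) / Field[A].fromInt(xs.size)
  }

  def main(args: Array[String]): Unit = {
    // Same function, two representations: floating point and exact rationals.
    println(mean(Seq(1.0, 2.0, 3.0)))
    println(mean(Seq(Rational(1, 3), Rational(2, 3))))
  }
}
```

The caller decides whether to trade precision for speed (Double) or keep exact results (Rational); the function body stays the same.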

Saddle (stars: 513)

Another tool for high-performance data manipulation written in Scala. Saddle provides array-backed, indexed, one- and two-dimensional data structures, vectorized numerical calculations, automatic data alignment, and robustness to missing values. Its authors claim it is the easiest and most expressive way to create programs with structured data on the JVM.

Scalalab (stars: 126)

Scalalab is so named because it is a Matlab-like Scala DSL. It brings static type safety together with a large set of mathematical functions for researchers. Another major advantage is its friendly interface, which can help you produce valuable output right away. The library itself can be easily integrated with other JVM tools.

Smile (stars: 5100)

Now finally, you can do NLP, linear algebra, graph computation, machine learning, and visualization with a Smile! No jokes here :) This library covers everything necessary in ML, such as classification, regression, clustering, feature selection, and so on. It is also commonly used for data visualization. Really fast and comprehensive, the project is well structured and documented with pieces of code that will set you up with a really pleasant start.

Vegas (stars: 707)

Another tool used for data visualization. More feature-rich than the other visualization tools on this list, Vegas allows you not only to create custom plots but also to manipulate data. You can use its declarative API to work with files, raw data, or Spark DataFrames. In the end, everything is compiled to a strongly typed JSON specification.

Breeze-viz (stars: 37)

A project created for data visualization. It can help to build pretty-looking and colorful graphs, charts, and plots. Its syntax is similar to Matlab, even though it offers far fewer functions. Breeze-viz is built on top of JFreeChart, a Java library for creating visualizations.

Epic (stars: 471)

Epic is a tool mostly used for NLP purposes. Even though the project is now archived, it is pretty well known as a structured prediction framework for Scala applications. It also contains classes for training high-accuracy syntactic parsers, part-of-speech taggers, named entity recognizers, and more. With this tool, you can choose whether you want to use pre-trained models or train custom ones.

Puck (stars: 240)

A lightning-fast, high-accuracy library used for NLP purposes. Puck is designed for use with grammars trained with the Berkeley Parser and runs on NVIDIA cards. It is most useful when parsing a large number of sentences, say a couple of thousand, because it is focused on throughput, not latency.

Apache Spark MLlib & ML (stars: 28700)

This tool is built on top of Apache Spark and provides lots of ready-to-use ML algorithms. Although originally written in Scala, it’s possible to use its API with Python, R or even Java. The library has two separate modules: MLlib and ML. The first includes core machine learning algorithms for tasks such as classification, regression, and clustering, while the second is used to create data pipelines that chain different data transformations.
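The pipeline idea can be sketched as follows. This is a minimal example, assuming the "org.apache.spark" %% "spark-mllib" artifact is on the classpath and a local Spark session is acceptable; PipelineSketch, buildPipeline, and the tiny inline dataset are made up for illustration.

```scala
// Assumption: "org.apache.spark" %% "spark-mllib" is on the classpath.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

// Hypothetical object name, used only for this illustration.
object PipelineSketch {
  // Three chained stages: split text into words, hash the words into a
  // feature vector, then fit a logistic regression on those features.
  def buildPipeline(): Pipeline = {
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF()
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)
    new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pipeline-sketch")
      .master("local[*]") // local mode, for experimentation only
      .getOrCreate()

    // Toy training data: (id, text, label).
    val training = spark.createDataFrame(Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0)
    )).toDF("id", "text", "label")

    // Fitting the pipeline runs every stage in order and returns a model.
    val model = buildPipeline().fit(training)

    val unseen = spark.createDataFrame(Seq(
      (4L, "spark i j k"),
      (5L, "l m n")
    )).toDF("id", "text")

    model.transform(unseen).select("id", "text", "prediction").show()
    spark.stop()
  }
}
```

Because the stages are declared once and fitted together, the same transformations are applied to the training data and to any data scored later.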

Apache PredictionIO (stars: 12500)

This tool, delivered by Apache, fits perfectly into its ecosystem. PredictionIO has three main components: the PredictionIO platform, a stack for building and deploying ML algorithms; EventServer, an analytics layer responsible for unifying events; and TemplateGallery, a component that lets you download engine templates to match your calculation needs. Thanks to this architecture, you can quickly build your engine from a predefined template, deploy it to a production cluster, respond to dynamic queries in real time, simplify data infrastructure management, and much more.

DeepLearning4j (stars: 11900)

Eclipse provides a whole ecosystem intended to serve as the default deep learning toolkit for the JVM. It is an end-to-end solution, meaning you can start with raw data, transform it, run it through complex neural networks, and get predictions at the end. To do this, the Eclipse team created a set of projects, each intended for a different purpose. For example, we might use DL4J for multi-layer neural networks, ND4J for algebra operations, or DataVec for ETL processes.

BigDL (stars: 3700)

This Intel tool is another library for deep learning, integrated with Apache Spark. BigDL offers rich support for creating deep learning algorithms and is built for high performance. Its integration with other Intel tools makes it possible to build complete end-to-end AI solutions.

Summingbird (stars: 2100)

A tool developed by Twitter. It allows users to write MapReduce programs that look like transformations of native Scala or Java collections. These programs can later be executed on well-known platforms such as Storm and Scalding. Summingbird consumes two types of data: streams and snapshots, where a snapshot is the complete state of a dataset at some point in time.
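To show what "looks like a transformation of a native collection" means, here is a word count written with plain Scala collections only. This is deliberately not the Summingbird API; WordCountSketch and wordCount are hypothetical names, and the sketch only illustrates the map-then-reduce-by-key shape described above.

```scala
// Standard library only: this is NOT Summingbird code, just the
// collection-style shape (a flatMap followed by a per-key sum).
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Long] =
    lines
      .flatMap(_.toLowerCase.split("\\s+")) // "map" phase: lines to words
      .filter(_.nonEmpty)                   // drop empty tokens from blank lines
      .map(word => word -> 1L)              // emit (word, 1) pairs
      .groupBy(_._1)                        // group the pairs by word
      .map { case (word, pairs) => word -> pairs.map(_._2).sum } // "reduce" phase

  def main(args: Array[String]): Unit =
    println(wordCount(Seq("to be or", "not to be")))
}
```

A framework in this style keeps the same logical description of the job and decides separately whether it runs over a batch snapshot or a live stream.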

H2O Sparkling Water (stars: 888)

Sparkling Water helps to integrate H2O’s ML engine with Spark. While using this library, you can easily transform Spark structures such as RDDs, DataFrames, and Datasets to H2O structures and back again. Even more exciting is that, thanks to its built-in DSL, you can create ML apps using both Spark and H2O APIs.

Deeplearning.scala (stars: 746)

This tool was created by ThoughtWorks. It can be used to create either simple or more complex neural networks using a functional or object-oriented approach. Thanks to its functional approach and heavy use of composable monads, you can use map or reduce to build your complex networks. By adding plugins to Deeplearning.scala, you can extend your models with more sophisticated algorithms, hyperparameters, or other functions.

Conjecture (stars: 359)

A framework created by Etsy that allows users to implement a range of ML models running on Hadoop and using the Scalding DSL. The main purpose of this project is to make statistical models viable components in a wide range of product settings. Thanks to the integrations with Hadoop and Scalding, it can consume large amounts of data and plug into established ETL processes.

Why have I decided to choose these libraries?

The main reason behind my decision to choose these libraries is that each of them covers more than a few needs you might have. No matter whether you are focused on natural language processing or creating some kind of machine learning algorithm, you can easily choose from a handful of Scala libraries for data science to tackle the problems you need to solve. And most importantly, all of these libraries are open-source projects, which makes them even more valuable.

To sum up

As you can see, there are plenty of tools and the number is still growing. To ensure you are fully up to date, check out this repo: https://github.com/lauris/awesome-scala#science-and-data-analysis, where most libraries are mentioned. All of this is thanks to Scala’s growing popularity and demand. We encourage you to jump in and give it a go. It’s much more fun that way!

Authors

Marcin Krykowski

Seasoned software engineer specializing in distributed systems, functional programming, data-driven systems, and ML. Worked for various clients across industries like FinTech, Telco, AdTech, and Healthcare. On the mission to empower the Scala community. During my free time trying to connect 20+ years of being a professional athlete and entrepreneurship.