Top 15 Scala Libraries for Data Science in 2021

Python and R seem to be the first choice when it comes to data engineering. However, if we look a bit deeper, Scala is steadily gaining popularity. Thanks to its functional nature, it is an increasingly common choice for building powerful applications that need to process vast amounts of data. Moreover, thanks to tools like Spark, Scala itself is becoming the number-one choice for creating modern machine-learning models. In this article, we will briefly look at which libraries can help us with our first custom ML algorithm.

Which Scala library would I recommend for data analysis?

Breeze (stars: 3200)

Breeze is one of the most popular Scala libraries for numerical data processing. It combines functionality familiar from Matlab and from Python's NumPy library. The result of merging the ScalaNLP and Scalala projects, it contains many functions for linear algebra, optimization, matrix operations, analytical computing, and statistics.
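
To give a feel for the NumPy-like style, here is a minimal sketch of a matrix-vector product and two reductions. It assumes the `breeze` artifact from `org.scalanlp` is on the classpath; the object name `BreezeSketch` and the sample values are made up for illustration.

```scala
import breeze.linalg._
import breeze.stats.mean

// Hypothetical wrapper object, only here to keep the sketch self-contained.
object BreezeSketch {
  // A 2x2 matrix (built row by row) and a vector of ones.
  val m: DenseMatrix[Double] = DenseMatrix((1.0, 2.0), (3.0, 4.0))
  val v: DenseVector[Double] = DenseVector(1.0, 1.0)

  // Matrix-vector product: with a vector of ones, each entry is a row sum of m.
  val product: DenseVector[Double] = m * v

  // Reductions over all elements of the resulting vector.
  val total: Double = sum(product)
  val average: Double = mean(product)

  def main(args: Array[String]): Unit = {
    println(product)
    println(s"total = $total, average = $average")
  }
}
```

The wildcard import of `breeze.linalg._` brings in both the data structures and functions such as `sum`, which keeps the code close to what a Matlab or NumPy user would write.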

Spire (stars: 1600)

A numerical library for Scala that aims to be nimble, precise, and generic. By using type classes, implicits, macros, and other Scala features, Spire tries to defy the usual trade-off between performance and precision and deliver both. The library also allows its users to write efficient numeric code without having to “bake in” particular numeric representations.
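
A small sketch can show what not "baking in" a numeric representation means in practice: one generic function that works for both exact rationals and doubles. It assumes the `spire` artifact from `org.typelevel` is available; `SpireSketch` and `mean` are hypothetical names chosen for this example.

```scala
import spire.algebra.Field
import spire.implicits._
import spire.math.Rational

// Hypothetical wrapper object, only here to keep the sketch self-contained.
object SpireSketch {
  // Generic arithmetic mean for any type A that has a Field instance
  // (addition, multiplication, and division). Assumes xs is non-empty.
  def mean[A: Field](xs: Seq[A]): A =
    xs.foldLeft(Field[A].zero)(_ + _) / Field[A].fromInt(xs.size)

  // Same code, two representations: exact rationals and floating point.
  val exact: Rational = mean(Seq(Rational(1, 3), Rational(2, 3)))
  val approx: Double = mean(Seq(1.0, 2.0, 4.0))

  def main(args: Array[String]): Unit = {
    println(exact)
    println(approx)
  }
}
```

The caller picks the representation at the call site, so the precision versus speed decision stays outside the algorithm itself.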

Saddle (stars: 513)

Another tool for high-performance data manipulation written in Scala. Saddle provides array-backed, indexed, one- and two-dimensional data structures, vectorized numerical calculations, automatic data alignment, and robustness to missing values. Its authors claim it is the easiest and most expressive way to create programs with structured data on the JVM.

Scalalab (stars: 126)

Scalalab is so named because it is a Matlab-like Scala DSL. It brings the type safety of a statically typed language together with a large set of mathematical functions for researchers. Another major advantage is its friendly interface, which can help you produce valuable output right away. The library itself can be easily integrated with other JVM tools.

Which Scala library would I recommend for visualization?

Smile (stars: 5100)

Now, finally, you can do NLP, linear algebra, graph computation, machine learning, and visualization with a Smile! No jokes here :) This library covers everything necessary in ML, such as classification, regression, clustering, feature selection, and so on. It is also commonly used for data visualization. Really fast and comprehensive, the project is well structured and documented with code samples that will give you a really pleasant start.

Vegas (stars: 707)

Another tool used for data visualization. Way more functional than those mentioned earlier, Vegas allows you not only to create custom plots but also to manipulate data. You can use its declarative API to work with files, raw data, or Spark DataFrames. In the end, everything is compiled to a strongly typed JSON specification.

Breeze-viz (stars: 37)

A project created for data visualization. It can help to build pretty-looking and colorful graphs, charts, and plots. Its syntax is similar to Matlab's, even though it offers far fewer functions. Breeze-viz is built on top of JFreeChart, a Java library for creating visualizations.
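
The Matlab-like flavor is easiest to see in a short plotting sketch. It assumes the `breeze` and `breeze-viz` artifacts are on the classpath and that the program runs in an environment with a display, since creating a figure opens a window. The object name, the data, and the output file name `parabola.png` are made up for illustration.

```scala
import breeze.linalg._
import breeze.plot._

// Hypothetical wrapper object, only here to keep the sketch self-contained.
object BreezeVizSketch {
  // 50 evenly spaced points on [0, 1] and their squares.
  val x: DenseVector[Double] = linspace(0.0, 1.0, 50)
  val y: DenseVector[Double] = x.map(value => value * value)

  def main(args: Array[String]): Unit = {
    val figure = Figure()          // opens a plotting window
    val panel = figure.subplot(0)  // first (and only) plot in the figure
    panel += plot(x, y)            // line plot of y against x
    panel.xlabel = "x"
    panel.ylabel = "x squared"
    figure.saveas("parabola.png")  // hypothetical output path
  }
}
```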

Which Scala library would I recommend for NLP?

Epic (stars: 471)

Epic is a tool mostly used for NLP purposes. Even though the project is now archived, it is pretty well known as a structured prediction framework for Scala applications. It also contains classes for training high-accuracy syntactic parsers, part-of-speech taggers, named entity recognizers, and more. With this tool, you can choose whether to use pre-trained models or train your own.

Puck (stars: 240)

A lightning-fast, high-accuracy library used for NLP purposes. Puck is designed for use with grammars trained with the Berkeley Parser and runs on NVIDIA cards. It is focused on throughput, not latency, so it is most useful when parsing a large number of sentences, say a couple of thousand.

Which Scala library would I recommend for Machine Learning?

Apache Spark MLlib & ML (stars: 28700)

This tool is built on top of Apache Spark and provides lots of ready-to-use ML algorithms. Although it was originally written in Scala, its API can also be used from Python, R, or Java. The library has two separate packages: MLlib and ML. The first, the original RDD-based API, includes core machine learning algorithms for tasks such as classification, regression, and clustering, while the second, built on DataFrames, is used to compose pipelines of data transformations and models.
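
The pipeline idea can be sketched in a few lines. The example below assumes the `spark-sql` and `spark-mllib` artifacts are on the classpath and runs Spark in local mode; the object name, the tiny training set, and the column names are made up for illustration.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object, only here to keep the sketch self-contained.
object SparkPipelineSketch {
  // Three stages: split text into words, hash words into a feature vector,
  // then fit a logistic regression on the "features" and "label" columns.
  def buildPipeline(): Pipeline = {
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pipeline-sketch")
      .master("local[*]") // local mode, for experimentation only
      .getOrCreate()

    // Toy labelled data: 1.0 for documents that mention "spark".
    val training = spark.createDataFrame(Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0)
    )).toDF("id", "text", "label")

    // Fitting the pipeline fits every stage in order and returns a model
    // that applies the same transformations at prediction time.
    val model = buildPipeline().fit(training)
    model.transform(training).select("id", "text", "prediction").show()

    spark.stop()
  }
}
```

Keeping feature extraction and the estimator in one `Pipeline` means the exact same preprocessing is applied to training and scoring data.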

Apache PredictionIO (stars: 12500)

This tool, delivered by Apache, fits perfectly into its ecosystem. PredictionIO has three main components: the PredictionIO platform, a stack for building and deploying ML algorithms; EventServer, an analytics layer responsible for unifying events; and TemplateGallery, a component that lets you download engine templates to match your calculation needs. Thanks to this architecture, you can quickly build an engine from a predefined template, deploy it to a production cluster, respond to dynamic queries in real time, simplify data infrastructure management, and much more.

DeepLearning4j (stars: 11900)

Eclipse provides a whole ecosystem that aims to serve as the default deep learning toolkit for the JVM. It is an end-to-end solution, meaning you can start with raw data, transform it, operate on it using complex neural networks, and get predictions at the end. To do this, the Eclipse team created a set of projects, each intended for a different purpose. For example, we might use DL4J for multi-layer neural networks, ND4J for algebra operations, or DataVec for ETL processes.

BigDL (stars: 3700)

This Intel tool is another library for deep learning, integrated with Apache Spark. BigDL offers rich support for creating deep learning algorithms, along with being extremely performant. Its integration with other Intel tools also makes it a candidate for building complete end-to-end AI solutions.

Summingbird (stars: 2100)

A tool developed by Twitter. It allows users to write MapReduce programs that look like transformations of native Scala or Java collections. These programs can later be executed on well-known platforms such as Storm and Scalding. Summingbird consumes two types of data: streams and snapshots, where a snapshot is the complete state of a dataset at some point in time.
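
To illustrate the "looks like a collection transformation" idea, here is a word count written against plain Scala collections. This is deliberately not Summingbird's API: it uses only the standard library, and the names `WordCountSketch`, `toWords`, and `wordCount` are made up. In Summingbird, a job of the same shape would be written against a platform-specific source and store instead of an in-memory `Seq` and `Map`.

```scala
// Standard library only; a conceptual sketch, not Summingbird code.
object WordCountSketch {
  // "Map" side helper: split a sentence into lower-cased words.
  def toWords(sentence: String): Seq[String] =
    sentence.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  // Emit (word, 1) pairs, then sum the counts per key ("reduce" side).
  def wordCount(sentences: Seq[String]): Map[String, Long] =
    sentences
      .flatMap(sentence => toWords(sentence).map(word => word -> 1L))
      .groupBy { case (word, _) => word }
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }

  def main(args: Array[String]): Unit =
    println(wordCount(Seq("to be or not to be", "to do")))
}
```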

H2O Sparkling Water (stars: 888)

Sparkling Water helps to integrate H2O’s ML engine with Spark. While using this library, you can easily transform Spark structures such as RDDs, DataFrames, and Datasets into H2O structures and back again. Even more exciting is that, thanks to its built-in DSL, you can create ML apps using both the Spark and H2O APIs.

Deeplearning.scala (stars: 746)

This tool was created by ThoughtWorks Inc. It can be used to create either simple or more complex neural networks using a functional or object-oriented approach. Because it relies heavily on composable monads, you can use familiar operations such as map or reduce to build your complex networks. By adding plugins to Deeplearning.scala, you can extend your models with more sophisticated algorithms, hyperparameters, or other functions.

Conjecture (stars: 359)

A framework created by Etsy that allows users to implement a range of ML models running on Hadoop and using the Scalding DSL. The main purpose of this project is to make statistical models viable components in a wide range of product settings. Thanks to its integration with Hadoop and Scalding, it can consume large amounts of data and plug into established ETL processes.

Why have I decided to choose these libraries?

The main reason behind my decision to choose these libraries is that they all fulfill more than a few needs you might have. Whether you are focused on natural language processing or on creating machine learning algorithms, you can easily choose from a handful of Scala libraries for Data Science to tackle the problems you need to solve. And most importantly, all of these libraries are open-source projects, which makes them even more valuable.

To sum up

As you can see, there are plenty of tools, and the number is still growing. To ensure you are fully up to date, check out this repo: https://github.com/lauris/awesome-scala#science-and-data-analysis, where most libraries are mentioned. All of this is thanks to Scala’s growing popularity and demand. We encourage you to jump in and give it a go. It’s much more fun that way!

Authors

Marcin Krykowski

Seasoned software engineer specializing in distributed systems, functional programming, data-driven systems, and ML. Worked for various clients across industries like FinTech, Telco, AdTech, and Healthcare. On the mission to empower the Scala community. During my free time trying to connect 20+ years of being a professional athlete and entrepreneurship.
