Top 15 Scala Libraries for Data Science in 2021

Python and R seem to be the first choice when it comes to data engineering. However, if we look a bit deeper, Scala is steadily gaining popularity. Thanks to its functional nature, it is a common choice for building powerful applications that need to work with vast amounts of data. Moreover, thanks to tools like Spark, Scala itself is becoming the number one choice for creating modern machine learning models. In this article, we will take a brief look at the libraries that can help us with our first custom ML algorithm.

Which Scala library would I recommend for data analysis?

Breeze (stars: 3200)

Breeze is one of the most popular Scala libraries for numerical data processing. It combines functionality familiar from Matlab and from Python's NumPy library. The result of merging the ScalaNLP and Scalala projects, it contains lots of functions that help with linear algebra, optimisation, matrix operations, analytical computing and statistics.
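To give a feel for the NumPy-like style, here is a minimal sketch, assuming Breeze is on the classpath. The object name and the sample values are made up for illustration; the calls themselves (DenseVector, DenseMatrix, dot, sum, mean) come from Breeze's linear algebra and statistics packages.

```scala
import breeze.linalg.{DenseMatrix, DenseVector, sum}
import breeze.stats.mean

// Illustrative only: BreezeSketch and its values are not part of Breeze.
object BreezeSketch {
  val v: DenseVector[Double] = DenseVector(1.0, 2.0, 3.0)
  val m: DenseMatrix[Double] = DenseMatrix((1.0, 2.0), (3.0, 4.0))

  // Inner product of v with itself: 1*1 + 2*2 + 3*3
  val dot: Double = v dot v

  // Element-wise addition of two vectors
  val doubled: DenseVector[Double] = v + v

  // Matrix-vector product: each row of m summed
  val rowSums: DenseVector[Double] = m * DenseVector(1.0, 1.0)

  // Simple reductions
  val total: Double = sum(v)
  val average: Double = mean(v)

  def main(args: Array[String]): Unit = {
    println(s"dot = $dot, total = $total, average = $average")
    println(s"doubled = $doubled, rowSums = $rowSums")
  }
}
```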

Spire (stars: 1600)

A numerical library for Scala that aims to be nimble, precise and generic. By using type classes, implicits, macros and other Scala features, Spire tries to resolve the usual trade-off between performance and precision by delivering both. The library also allows its users to write efficient numeric code without having to “bake in” particular numeric representations.
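A small sketch of what not "baking in" a representation can look like, assuming Spire is on the classpath. The average function and the object name are my own illustration; Field, Rational and the operator syntax from spire.implicits are Spire's.

```scala
import spire.algebra.Field
import spire.implicits._
import spire.math.Rational

// Illustrative only: SpireSketch and average are not part of Spire.
object SpireSketch {
  // Written once against the Field type class; the caller decides
  // whether the numbers are Doubles, Rationals or something else.
  def average[A: Field](xs: Seq[A]): A =
    xs.reduce(_ + _) / Field[A].fromInt(xs.size)

  def main(args: Array[String]): Unit = {
    // Fast, approximate floating point arithmetic
    println(average(Seq(1.0, 2.0, 3.0)))
    // Exact rational arithmetic using the very same function
    println(average(Seq(Rational(1, 3), Rational(2, 3))))
  }
}
```

The same function body serves both cases, which is the point of the type class approach: precision is chosen at the call site rather than inside the algorithm.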

Saddle (stars: 513)

Another tool for very performant data manipulation written in Scala. Saddle provides array-backed, indexed, one and two-dimensional data structures, vectorized numerical calculations, automatic data alignment and robustness to missing values. Its authors claim it is the easiest and most expressive way to create programs with structured data on the JVM. 
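The automatic alignment is easiest to see on two series with partly overlapping keys. This is a sketch modelled on the Saddle quick start, assuming the original 1.3.x release (later forks may need extra imports for the arithmetic operators). The object name and the data are invented for illustration.

```scala
import org.saddle._

// Illustrative only: SaddleSketch and its data are not part of Saddle.
object SaddleSketch {
  // Two indexed series whose keys only partly overlap
  val a: Series[String, Double] = Series(Vec(1.0, 2.0, 3.0), Index("a", "b", "c"))
  val b: Series[String, Double] = Series(Vec(10.0, 20.0, 30.0), Index("b", "c", "d"))

  // Addition aligns on the index first; keys present in only one
  // series are expected to come back as missing values (NA).
  val total: Series[String, Double] = a + b

  def main(args: Array[String]): Unit =
    println(total)
}
```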

Scalalab (stars: 126)

Scalalab is so named because it is a Matlab-like Scala DSL. It brings the type safety of a static language together with a bunch of mathematical functions for researchers. Another major advantage is its friendly interface, which can help you produce valuable output right away. The library itself can be easily integrated with other JVM tools.

Which Scala library would I recommend for visualisation?

Smile (stars: 5100)

Now finally you can do NLP, linear algebra, graph computation, machine learning and visualisation with a Smile! No jokes here :) This library covers everything necessary in ML, such as classification, regression, clustering, feature selection and so on. It is also commonly used for data visualisation. Really fast and comprehensive, the project is well structured and documented with pieces of code that will set you up with a really pleasant start.

Vegas (stars: 707)

Another tool used for data visualisation. Way more functional than the other tools in this section, Vegas allows you not only to create custom plots but also gives you a chance to manipulate data. You can use its declarative API to work with files, raw data or Spark DataFrames. In the end, everything you write in its strongly typed DSL is compiled to a JSON specification.
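The declarative API is probably best shown with the bar chart from the project's README, reproduced here as a sketch. It assumes Vegas and a JavaFX runtime are available (the window renderer needs one), and the object name is mine.

```scala
import vegas._
import vegas.render.WindowRenderer._

// Illustrative only: VegasSketch is not part of Vegas.
object VegasSketch {
  // Raw data is passed in as a sequence of maps; column types are
  // declared in the encoding (Nom = nominal, Quant = quantitative).
  val plot = Vegas("Country Pop")
    .withData(
      Seq(
        Map("country" -> "USA", "population" -> 314),
        Map("country" -> "UK", "population" -> 64),
        Map("country" -> "DK", "population" -> 80)
      )
    )
    .encodeX("country", Nom)
    .encodeY("population", Quant)
    .mark(Bar)

  // Opens a window with the rendered chart
  def main(args: Array[String]): Unit = plot.show
}
```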

Breeze-viz (stars: 37)

A project created for data visualisation. It can help to build pretty-looking and colourful graphs, charts and plots. Its syntax is similar to Matlab's, even though it offers far fewer functions. Breeze-viz is built on top of JFreeChart, a Java library for creating visualisations.
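A short sketch of the Matlab-like plotting style, adapted from the Breeze quick start and assuming breeze and breeze-viz are on the classpath and a display is available. The object name and output file name are placeholders.

```scala
import breeze.linalg.linspace
import breeze.plot.{Figure, plot}

// Illustrative only: BreezeVizSketch and curves.png are placeholders.
object BreezeVizSketch {
  def main(args: Array[String]): Unit = {
    // 50 evenly spaced points between 0 and 1
    val x = linspace(0.0, 1.0, 50)

    val figure = Figure()
    val panel = figure.subplot(0)

    // A solid line for x^2 and a dotted series for x^3
    panel += plot(x, x.map(v => v * v))
    panel += plot(x, x.map(v => v * v * v), '.')
    panel.xlabel = "x"
    panel.ylabel = "y"

    // Writes the chart to an image file
    figure.saveas("curves.png")
  }
}
```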

Which Scala library would I recommend for NLP?

Epic (stars: 471)

Epic is a tool mostly used for NLP purposes. Even though the project is now archived, it is pretty well known as a structured prediction framework for Scala applications. It also contains classes for training high-accuracy syntactic parsers, part-of-speech taggers, named entity recognizers, and more. With this tool, you can choose whether to use pre-trained models or train your own.

Puck (stars: 240)

A lightning-fast, high-accuracy library used for NLP purposes. Puck is designed for use with grammars trained with the Berkeley Parser and runs on NVIDIA cards. It is most useful when parsing a large number of sentences, say a couple of thousand, because it is focused on throughput, not latency.

Which Scala library would I recommend for Machine Learning?

Apache Spark MLlib & ML (stars: 28700)

This tool is built on top of Apache Spark and provides lots of ready-to-use ML algorithms. Although originally written in Scala, its API can also be used from Python, R or even Java. The library has two separate modules: MLlib and ML. The first includes core machine learning algorithms for tasks such as classification, regression and clustering, while the second is used to create data pipelines out of different data transformations.
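A pipeline in the ML module chains transformations and a model into a single unit. The sketch below follows the text classification example from the official Spark ML guide, assuming spark-sql and spark-mllib are on the classpath. The object name, the train helper and the tiny data set are illustrative.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

// Illustrative only: SparkPipelineSketch and train are not part of Spark.
object SparkPipelineSketch {
  def train(spark: SparkSession): PipelineModel = {
    // Toy training data: (id, text, label)
    val training = spark.createDataFrame(Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0)
    )).toDF("id", "text", "label")

    // Stage 1: split text into words
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    // Stage 2: hash words into a fixed-size feature vector
    val hashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    // Stage 3: the classifier itself
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)

    new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pipeline-sketch")
      .master("local[*]")
      .getOrCreate()

    val model = train(spark)
    val test = spark.createDataFrame(Seq(
      (4L, "spark i j k"),
      (5L, "l m n")
    )).toDF("id", "text")

    // The fitted pipeline applies all three stages to new data
    model.transform(test).select("id", "text", "prediction").show()
    spark.stop()
  }
}
```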

Apache PredictionIO (stars: 12500)

This tool, delivered by Apache, fits perfectly into its ecosystem. PredictionIO has three main components: the PredictionIO platform, a stack for building and deploying ML algorithms; EventServer, an analytics layer responsible for unifying events; and TemplateGallery, a component that lets you download engine templates depending on your calculation needs. Thanks to its architecture, you can quickly build your engine using a predefined template, deploy it to a production cluster, respond to dynamic queries in real time, simplify data infrastructure management and much more.

DeepLearning4j (stars: 11900)

Eclipse provides a whole ecosystem which should serve as a default deep learning tool for the JVM. It is an end-to-end solution, meaning you can start with raw data, transform it, operate on it using complex neural networks and get a prediction at the end. To do this, the Eclipse team created a set of projects, each intended for a different purpose. For example, we might use DL4J for multilayer neural networks, ND4J for algebra operations or DataVec for ETL processes.
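As a rough sketch of the DL4J part, this is how a small multilayer network could be declared from Scala through the Java builder API. It assumes deeplearning4j-core and an ND4J backend are on the classpath; the layer sizes, seed and learning rate are arbitrary choices for illustration.

```scala
import org.deeplearning4j.nn.conf.NeuralNetConfiguration
import org.deeplearning4j.nn.conf.layers.{DenseLayer, OutputLayer}
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork
import org.nd4j.linalg.activations.Activation
import org.nd4j.linalg.learning.config.Adam
import org.nd4j.linalg.lossfunctions.LossFunctions.LossFunction

// Illustrative only: Dl4jSketch and the network shape are made up.
object Dl4jSketch {
  def build(): MultiLayerNetwork = {
    val conf = new NeuralNetConfiguration.Builder()
      .seed(42)
      .updater(new Adam(1e-3))
      .list()
      // Hidden layer: 4 inputs -> 8 units
      .layer(new DenseLayer.Builder()
        .nIn(4).nOut(8)
        .activation(Activation.RELU)
        .build())
      // Output layer: 8 units -> 3 classes
      .layer(new OutputLayer.Builder(LossFunction.MCXENT)
        .nIn(8).nOut(3)
        .activation(Activation.SOFTMAX)
        .build())
      .build()

    val model = new MultiLayerNetwork(conf)
    model.init()
    model
  }

  def main(args: Array[String]): Unit =
    // Prints a table of layers and parameter counts
    println(build().summary())
}
```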

BigDL (stars: 3700)

This Intel tool is another library for deep learning, integrated with Apache Spark. BigDL offers rich support for creating deep learning algorithms and is extremely performant. Its integration with other Intel tools makes it powerful in practice and lets it form part of a complete end-to-end AI solution.

Summingbird (stars: 2100)

A tool developed by Twitter. It allows users to write MapReduce programs that look like transformations of native Scala or Java collections. These programs can later be executed on well-known platforms such as Storm and Scalding. Summingbird consumes two types of data: streams, and snapshots, which represent the complete state of a dataset at some point in time.
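The word count from the Summingbird README shows how such a program reads like ordinary collection code while staying independent of the platform. This sketch assumes summingbird-core is on the classpath; the toWords helper and the object name are my own.

```scala
import com.twitter.summingbird.{Platform, Producer}

// Illustrative only: WordCountSketch and toWords are not part of Summingbird.
object WordCountSketch {
  // Hypothetical helper: lower-case a sentence and split it on whitespace
  def toWords(sentence: String): Seq[String] =
    sentence.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  // The job is generic in the platform P, so the same definition
  // can be handed to a batch or a streaming backend.
  def wordCount[P <: Platform[P]](
      source: Producer[P, String],
      store: P#Store[String, Long]
  ) =
    source
      .flatMap { sentence => toWords(sentence).map(_ -> 1L) }
      .sumByKey(store)
}
```

Only the source and the store are platform specific; the flatMap and sumByKey steps stay the same whichever backend runs the job.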

H2O Sparkling Water (stars: 888)

Sparkling Water helps to integrate H2O’s ML engine with Spark. While using this library, you can easily transform Spark structures such as RDDs, DataFrames and Datasets to H2O structures and back again. Even more exciting is that, thanks to its built-in DSL, you can create ML apps using both Spark and H2O APIs.

Deeplearning.scala (stars: 746)

This tool was created by ThoughtWorks. It can be used to create either simple or more complex neural networks using a functional or object-oriented approach. Because it is functional and makes heavy use of composable monads, you can use map or reduce to build your complex networks. By adding plugins to Deeplearning.scala, you can extend your models with more sophisticated algorithms, hyperparameters or other functions.

Conjecture (stars: 359)

A framework created by Etsy which allows users to implement a bunch of ML models that run on Hadoop and use the Scalding DSL. The main purpose of this project is to make statistical models viable components in a wide range of product settings. Thanks to its integration with Hadoop and Scalding, it can consume large amounts of data and plug into established ETL processes.

Why have I decided to choose these libraries?

The main reason behind my decision to choose these libraries is that each of them fulfils more than a few of the needs you might have. No matter whether you are focused on natural language processing or on creating machine learning algorithms, you can easily choose from a handful of Scala libraries for Data Science to tackle the problems you need to solve. And most importantly, all of these libraries are open source projects, which makes them even more valuable.

To sum up

As you can see, there are plenty of tools and the number is still growing. To make sure you are fully up to date, check out this repo, where most of these libraries are listed: https://github.com/lauris/awesome-scala#science-and-data-analysis. All of this is thanks to Scala’s growing popularity and demand. We encourage you to jump in and give it a go. It’s much more fun that way!

Author

Marcin Krykowski

Seasoned software engineer specializing in distributed systems, functional programming, data-driven systems, and ML. Worked for various clients across industries like FinTech, Telco, AdTech, and Healthcare. On the mission to empower the Scala community. During my free time trying to connect 20+ years of being a professional athlete and entrepreneurship.

30.03.2021 / By Marcin Krykowski