The 21st century has brought us a lot of wonderful inventions: new technical solutions that allow us to travel, learn about the world, and stay in touch with our friends and family – even if they live in a different country and a different time zone. Putting up tall, impressive buildings, driving fast cars, changing clothes every season, buying a new smartphone every year – it all seems very natural to our generation. Unlike in the past, we can go to the local supermarket whenever we want, just to buy a pack of perfect-looking nuts or jelly beans in a portion-sized plastic bag. 

Ecology in the times of coronavirus 

All of these examples, and more, are part of our everyday routines. They're something we have simply got used to and cannot live without. This has become even more obvious now, during these strange times when the quarantine forced on us by COVID-19 can make us feel anxious and lost. And the only things that seem to make us happier are prepackaged foods, new clothes, or 1,930,490 useless items from Wish. 

We're a consumer society, there's no doubt about it. But what is the real cost of our unquenchable thirst for things?  

Earth Day 

On the day I'm posting this article – the 22nd of April – it is Earth Day. This special occasion has motivated me to share a little about how we can reduce the price we pay for all the goods I've mentioned. Working alone, none of us can do much, but by setting a good example as individuals, we can inspire a whole population.

Reusable thermal Scalac bottles on vacay #DoTheRightThing and #HaveFun

So let me tell you now why I think it’s so important to take action, and also give you a few examples of how we at Scalac are trying to be more responsible. And if you end up stealing some of these ideas, it’ll make us more than happy! 

Why is sustainability so important? A few numbers

The idea of sustainable growth

There is a concept that each generation should be able to satisfy its needs without limiting the ability of future generations to meet theirs. Put briefly, we can take what we need from the world's natural resources, while still keeping in mind that life on earth won't stop after we die. There will be future generations after us, our children and grandchildren, and they should have the right to use those resources as well. And we at Scalac believe that this "theory" is actually something that should be put into practice. 

(Un)sustainable world 

Sustainable development is when there is a balance between people, the environment, and industry. Unfortunately, nature is currently losing. The state of the Earth is at its worst in living memory.

We have known about the "plastic problem" in the oceans since as far back as 1970. But now, 50 years later, we still don't know what to do about it. The only thing we can say for sure is that the pollution is getting worse. 

Fire, earth, air, water, and plastic?

Plastic 

Annual plastic production has exploded, growing from 2 Mt in 1950 to 380 Mt. The total amount of plastic manufactured from 1950 through 2015 is 7,800 Mt, and half of this – 3,900 Mt – was produced during just the past 13 years. It is estimated that over the last 50 years the amount of plastic waste has grown twentyfold, and that over the next 15 years it will double yet again. Surprisingly, 49% of this waste comes from disposable packaging – the part of the product you don't even use (in fact, it isn't really part of the product at all) and the part you probably throw away after 5 minutes. 

49% of this waste comes from disposable packaging.

Water

Water is equally important – it is essential to life. However, to take the situation in Poland as an example, scientists estimate that the 2020 drought in Poland will be the worst in 50 years. This is a very clear sign of how our climate is changing. In addition, in 2020 the global need for fresh water will exceed its supply by as much as 40%. 

In 2020, the global need for fresh water will exceed its supply by as much as 40%. 

Air

What about air? In 2018 alone, global warming increased by 3.3%. On top of that, 7 million people die every year because of bad air quality! It has become so bad that poor air quality now sits in fourth place on the list of the most serious threats to human life.

Taking care of one ton of garbage costs Irish citizens roughly 1,116 euros.

Costs

In Ireland, the national government doesn't cover the costs of cleaning and refuse collection from household bins. Every year, the amount of trash left in public spaces, forests, and so on comes to as much as 100,000 tons. Local budgets (i.e. local taxpayers) have to cover the costs of dealing with this problem. Between 2016 and 2018, the waste in Dublin amounted to 17,147 tons, and taxpayers had to pay about €20 million to deal with it. That means getting rid of just one ton of trash has cost Irish citizens roughly €1,116.

In comparison, a private brand in the same country that uses reusable packaging and recycling processes pays just €89 per ton. Just think about all the good things that could be done with the money saved! 

Sustainable T-shirts eco cotton Earth Day Scalac

How we do it at Scalac

Think about materials

Swag, personalized gifts, gadgets. These are part of building a company's identity – they can really make people feel like they belong to the same team, wearing the same war paint, in a way. But in 2019, we decided to tackle this particular issue in the most responsible way we could, keeping in mind that in the end, a team is all about the people, not things. 

Some nice examples: our hoodies are now made from high-quality materials. As a happy side effect, they last a long, long time, with no need to replace them after just a few washes. We've also changed our T-shirt suppliers, and now we're sure our T-shirts are made from eco-friendly cotton. 

Useful stuff only

On special occasions like Christmas, we face the great challenge of choosing the right gifts for our employees and our clients. Of course, the main challenge is to #DoTheRightThing and stay as environmentally responsible as possible. 

We now always get the team together and discuss what type of gifts we should consider buying. The point is to choose gifts that are useful and won't just be thrown away immediately. Obviously, it's impossible to get it right every single time, because each of our 122 employees – not to mention our clients – is different. But we do our best, and I dare say we're getting better and better at it.

Here are a few practical examples of gifts we have purchased in the past:

  • metal thermal bottles, 
  • glass lunch boxes,
  • notebooks made from recycled paper,
  • metal, reusable straws,
  • Christmas cards that can be planted (they were printed with special non-toxic paint on paper that contained seeds). 

I'm also glad to say that we are working on a completely new way of giving gifts to our people, based on collecting points on a special platform, which everyone can then use to buy the best gift for themselves, at the time most convenient to them. Sometimes, giving people a choice is simply the best option.

Local suppliers

Supporting local manufacturers is an integral part of supporting the environment. It is also the perfect way to reduce our carbon footprint – one of the main causes of global warming. How? By collecting goods ourselves instead of shipping things all the way from China or other distant countries, and also by skipping unnecessary packaging.

Remember, it's always a good idea to use a local sewing service, gadget manufacturer, etc. if you have the chance! It's not only eco-friendly but also people-friendly – these kinds of businesses are often smaller, so you'll also be helping your neighbor. 

Some examples of our local companies:

  • The Dalba Brewery, which hires people with disabilities and produces an amazing beer we drink during our Pizza Days (knowledge-sharing events) or while just hanging out together after hours,
  • For the Scala Wave conference (which Scalac organizes), we chose a local sewing service, Panato (instead of "eastern" producers), to sew beautiful bags for our attendees. Panato is not only local but, like Dalba, also hires people with disabilities,
  • Our T-shirts owe their branding to a local workshop in Gdańsk,
  • We also had our Scalac shopping bags sewn by local manufacturers such as Pakuj Worki and Pakuj do Swojego.

Less waste

There are a lot of everyday adjustments that do not cost much – or cost nothing at all – but can have a big impact on the environment as well. Below are some of the practices that help us produce less waste on a daily basis. 

  • We buy water in glass bottles and recycle them or give them back to the distributor (we use bottled water mostly for business meetings); in the office, we also use filtered tap water,
  • During our company retreats, we order catering that brings food in ceramic dishes and uses metal cutlery,
  • We use reusable plates and cutlery in the office every day,
  • We fill our postal parcels with some natural or waste materials like scrap paper, 
  • We choose "less waste" and recycled gadgets – e.g. we designed a special cable organizer made from our old roll-ups, 
  • For events (like Scala Wave), we make badges from recycled paper (and we'll probably drop them completely and replace them with electronic ones). We have also given up canned sodas as well as unnecessary gadgets, 
  • If we want to organize an event with a dedicated gift package, we make sure attendees register, so we know exactly how many packages we need. If we plan to give away things like T-shirts, we also ask about sizes, to avoid over-ordering, 
  • We also organized "How to not waste food" workshops during one of our team retreats, and we're planning to share this kind of knowledge even more in the future!
“How to not waste food” workshops at Scalac

Of course, these are only examples. You can find your own way to become more “eco”. Just find one little thing you can change and start today. 

Learn from your mistakes

He who makes no mistakes makes nothing. 

In the end, even if you really want to #DoTheRightThing, sometimes you will fail. It might be because of the law, transport, or cost. Whatever happens, the best thing you can do is simply learn from your mistakes – and go on to try something different. 

We have had a few failures on our path too. But one is definitely worth mentioning.

We once ordered new notebooks. Wonderful graphics, practical design, and of course made from recycled paper. Everything seemed perfect. So what was the problem? EVERY single notebook came wrapped in a plastic case, which was supposed to protect it during transport. Obviously, this was completely unnecessary. The notebooks would have been perfectly safe traveling in a box without these additional plastic covers.

What have we learned? That you have to talk through every single detail of the gadget – and its packaging – with the producer, and explain why you care. 

Nobody’s perfect

We're not an ideal, eco-friendly company. But you know what? We will probably never be perfect; there is still so much to do in this area. But we do our best to stand by our values, both inside and outside the company, and we are constantly finding ways to implement new solutions. Scalac has three main mottos: #WorkHard, #HaveFun and #DoTheRightThing. The third is about social responsibility. We approach this in many different ways – not only by supporting ecology, but also by supporting local suppliers and companies that hire people with disabilities, organizing fundraisers for charity, and taking part in multiple events. But those are topics for a completely different article. 

To sum up: being ecologically responsible is not about certificates and formal proofs. It is about awareness, everyday actions, and habits. We are all building the world around us with the small everyday decisions we make – and it's up to us what our world is going to look like tomorrow. 

Happy Earth Day! 

#DoTheRightThing and share our message. 


Here at Scalac, we know how to take your business to the next level. We specialize in IT staff augmentation, custom software development, data engineering, blockchain, and analytical dashboards to help give your company the best technology to keep growing. We work with clients of all sizes but focus mainly on helping mid-sized companies all around the globe accelerate time to market and expand. 

How we keep customer satisfaction as high as 4.9/5

Our team of over 120 tech and business experts works every single day to make sure we're the best partner for growing our clients' businesses.

We are one of the Top Scala software houses, a Gold Lightbend Partner and we offer a wide range of services to build end-to-end solutions. 


Our team includes a Scala Team, Frontend Team, Data Engineering Team, DevOps Team, UX/UI, and QA, to guarantee that our clients get the right services for their projects.

Take a look at some of the recommendations we have received.

Adtech & Data

One of our projects consisted of providing a recommendation for an advertising technology company using Scala. We made sure to really understand the company and blend into their own team and working habits. It was also important to make sure we delivered the product in an efficient and timely manner. 

Scalac Top IT Company San Francisco Clutch

The product was developed in a timely manner. There were no delays. Everyone involved was quick to pick up the challenge we were solving and the milestones we were trying to reach. Overall, their contribution was greatly appreciated by the entire company and its customers.

Product Manager at an advertising tech company

Another review, left by one of our New York partners, gives an excellent overview of our expertise regarding remote work and tech stack:

Scalac team was highly collaborative during the entire lifecycle of the project. Even though all the engineers worked in a distributed, remote environment, there were no issues with regular communication. The team was also highly knowledgeable in the technology stack that they used.

Paweł Cejrowski, Senior Software Engineer, Tapad

Cloud computing & Data engineering

As we mentioned before, our company covers both functional programming and data engineering. A project we finished at the end of last year involved pipeline development for a data insights company. We used mostly Scala, together with Python, to design beta engineering pipelines. By adapting to the client's needs, we ensured that once the partnership ended, they would be able to move forward successfully.

The collaboration was a success. They’re highly competent engineers. They’re oriented to provide the best possible development and highest quality results. We’ve now moved the work internally, and everything was handed over to us perfectly documented.

Chief Data Scientist at a data insights company

Given the impact that proper software development can have on a product, it is important to find a partner that will deliver. Our clients have found that our team consistently produces remarkable outcomes. Our reviews are clear evidence of this. We are thankful for the positive feedback we have received from this particular client we’ve been working with for almost 3 years: 

One of the most impressive aspects of working with Scalac, Inc. is the fact that they are incredibly flexible. They also listen to requirements. And I don’t just mean technical project requirements, but also from management and organization perspective. Even when I have tasked engineers with tasks outside of their comfort zones, they have risen to the challenge. When I have asked them to work with unfamiliar technology they have passed the bar with flying colors.

Pawel Gieniec, CEO & Founder, CloudAdmin

About Clutch

Clutch is an independent B2B platform that features reviews from our past clients. The Clutch team assesses a company's market presence, project history, and quality of services, informed by verified client reviews. It is a great place to find the right IT company for you. 

TOP Software Developers in San Francisco according to The Manifest

Scalac has also been featured on Clutch's sister site, The Manifest. The Manifest shortlists the best-performing companies in every industry imaginable. Our profile there includes valuable information about our work with past clients.


To wrap up

We’re grateful to our clients for providing their valuable feedback to Clutch. As a result, the team at Scalac continues to be recognized for the efficiency and high-quality services we provide.

Our team is always looking for new partners and would love to hear about any great project ideas you have. Take a look at our site to learn more about what we’re all about. Reach out to us today to help turn your wildest ideas into reality!

Want to know more about our projects?


This article is based on the story of one of our developers – an ex-Java, now Scala developer – who decided to follow the Scala path because he found writing Scala both genuinely interesting and a great way to grow.

Scala programming is definitely one of the main topics that comes to mind when you think of Scalac (the name itself is actually Scala + C = Scala compiler). As you can probably tell from the name of our brand, before we developed teams like Data Engineering, DevOps, UX/UI and Frontend, our company was powered by Scala experts – and even now, we are one of the biggest Scala companies in the world, and a Gold Lightbend Partner. 

Some time ago I started wondering: what is it about Scala that makes developers prefer it over other languages? Why do we need Scala? This post was originally meant to focus on a Scala vs Kotlin comparison – so we interviewed our senior engineer (who was previously a Java engineer), only to find out that there is a much better story to tell: how Scala makes developers enjoy their work again. Let me take you on a journey to discover the ultimate truth! 

On the first day, there was Java 

Java was developed in the early 1990s, and it changed the whole IT world. Back then, the idea of extending the power of network computing to the activities of everyday life was totally abstract. Today, with technology being such a large part of our daily lives, we take it for granted. As a matter of fact, Java is still, after all these years, an engine that runs many applications – from games to business solutions.

The real question is: is Java enough?


The slow evolution of the Java programming language has led to quite a few inventions among JVM programming languages. As a result, there are now over 50 JVM languages! They were all created to fill in the gaps in existing versions of Java, by developers who wouldn't settle for "good enough". 

In theory, Java can be used for everything – a mobile app, a server, an e-commerce platform. In practice, if you want a mobile app you're better off with Kotlin, and if your solution is more complex and requires handling large amounts of data, Scala is the way to go.

On the second day, there was Kotlin

Sorry to have to put it this way, but Kotlin is just Java in a nicer package – still an object-oriented language, but fresher and funkier. It's so similar that it doesn't require a different mindset: Java developers are able to switch to Kotlin quickly and program smoothly in it. You can also call Java code from Kotlin, and vice versa, without any hassle. 

Apart from that, it has several other advantages, such as the way the collection library is built, null safety, and coroutines. Kotlin supports functional constructs, but it is still closer to Java stripped of its old-fashioned legacy than to Scala. It gives Java programmers freshness and constructs that are not available in Java, as well as brevity in expressing thoughts through code.

Kotlin's domain, in contrast to Scala's, is mobile. It can be used for both server and client (mobile) development – frontend and backend. Android supports Kotlin out of the box. You can also use Scala for that purpose, but it is not as hassle-free as Kotlin.

Scala vs Kotlin popularity
Scala vs Kotlin popularity in the USA
Source: Google Trends

I chose Scala as a junior because in a way it can be easy to start with.
You don’t have to jump straight into object-oriented programming (it can be introduced in a procedural paradigm at first), but at the same time, it has all the necessary concepts to learn it. You can move over to functional programming without having to switch to other functional languages. Scala also shares C-like familiarities so it’s easy to learn C-style languages too. 

Patryk Kirszenstein,
Junior Scala Developer 

And on the third day… there was Scala

As mentioned before, some Java developers knew that they couldn’t quite settle for what they had. They wanted more. They were looking for an alternative, where nicer syntax could be used. 

Scala, as a programming language with functional facilities, requires a different mindset and much more knowledge than Java or Kotlin, because, among other things, code is written differently.

In Scala, you can write both functional and object-oriented code. It has a very rich syntax that lets you write code in many ways. This is also why people who are not involved with Scala may get the impression that it's over-complicated: in the case of Java and Kotlin, programmers know what to expect, whereas with languages like Scala – you go big or you go home. 
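Just to illustrate (this is a generic sketch, not code from any particular project), here is the same little task written first in an imperative, Java-like style and then in a functional style – both are perfectly valid Scala:

def sumOfSquaresImperative(numbers: List[Int]): Int = {
  // Java-like style: mutable state and an explicit loop
  var total = 0
  for (n <- numbers) total += n * n
  total
}

// Functional style: no mutation, just transformation and composition
def sumOfSquaresFunctional(numbers: List[Int]): Int =
  numbers.map(n => n * n).sum

// Both calls return 30
sumOfSquaresImperative(List(1, 2, 3, 4))
sumOfSquaresFunctional(List(1, 2, 3, 4))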

Systems written in strongly typed languages like Scala are easier to maintain in the long term. Another big business-related advantage is that Scala compiles to JVM bytecode, so you can use all the existing Java libraries. Did you know that Scala, although not as popular as Java, is also used by worldwide enterprises that seem to see its advantages – Zalando, LinkedIn, Twitter, Foursquare, Netflix, Tumblr, Walmart, PayPal, Intel and Samsung among them?
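For example – and this is just an illustrative sketch with made-up names – plain Java standard-library classes can be used from Scala directly, with no wrappers:

import java.time.LocalDate
import java.time.temporal.ChronoUnit
import java.util.UUID

// Java standard-library classes used from Scala exactly as they would be from Java
val reportId: UUID  = UUID.randomUUID()
val published       = LocalDate.of(2020, 4, 22)
val daysSince: Long = ChronoUnit.DAYS.between(published, LocalDate.now())
println(s"Report $reportId generated $daysSince days after publication")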


What are the business reasons to use Scala?

If you're working with complex environments, applications, or analytics, it's way more efficient than Java or Kotlin. Scala is used to implement Spark (the leading platform for distributed data processing) as well as to define the computations themselves, so it's really data-scientist-friendly. Not to mention that Scala is a great language for domains like Machine Learning and Blockchain, which are extremely important in today's business and will become even more significant in the future. 

Scala is pure 

Challenges may be the key to why Scala has such a big community – especially for a seemingly niche language – with over 95,919 questions asked about Scala alone on StackOverflow (not to mention related topics).

Scala is a powerful language with highly advantageous features and flexible syntax. It is, however, quite difficult for developers to get a grip on this JVM language at first – but with time, it becomes really natural and straightforward. What Scala represents, with its functional approach, is passion-driven work leading to pure self-development. To fall in love with it, you need to first start writing in it, and then dig a little deeper. Merely scratching the surface may only give you the wrong idea. 


“While many languages share some of Scala’s features, few share so many of them, making Scala unique in ease of development.”

Dorian Sarnowski,
Senior Scala developer 

At Scalac we use Scala for a variety of reasons, which mostly come down to reducing development time and reducing the number of bugs in code. While many languages share some of Scala’s features, few share so many of them, making Scala unique in ease of development.

Scala – simply beautiful.

Scala really encourages switching from mutable data structures to immutable ones, and from regular methods to pure functions (without getting crazy about it like Haskell). It's simply beautiful. It provides a good balance between conciseness, extensibility, and performance. Keeping the JVM's promise of "write once, run anywhere" and combining it with functional coding, Scala lets you build the best of both worlds into your code.
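A tiny, generic sketch of what that shift looks like in practice (nothing Scalac-specific here):

// Mutable, side-effecting approach: hidden state that changes over time
class MutableCounter {
  private var value = 0
  def increment(): Unit = value += 1
  def current: Int = value
}

// Immutable, pure approach: every "change" is simply a new value
final case class Counter(value: Int = 0) {
  def increment: Counter = copy(value = value + 1)
}

val counter = Counter().increment.increment
// counter.value == 2, and the original Counter(0) is left untouched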

It’s functional

I don’t think there is any programmer who wouldn’t find Functional Programming interesting, but sometimes they just let it go because they have other priorities at work. Fortunately, some jobs actually let you do exactly that for a living and that’s quite amazing. 

Scala is a very appealing choice for developers who are looking to strike a balance between writing pure code and writing solid code quickly – by building upon existing solutions and libraries when available. That  – combined with interoperability with the Java ecosystem and the JVM – makes Scala look very promising.

Scala is more influenced by functional programming languages like Haskell than Kotlin is. It encourages the use of functional coding, along with additional features such as pattern matching and currying. Not only that, but functional programming is also more prominent in the Scala ecosystem. 
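Two small, self-contained examples of those features (purely illustrative – the domain types are invented for this snippet):

// Pattern matching: deconstructing data instead of writing chains of if/else
sealed trait Payment
final case class Card(last4: String)    extends Payment
final case class Transfer(iban: String) extends Payment

def describe(payment: Payment): String = payment match {
  case Card(last4)    => s"card ending in $last4"
  case Transfer(iban) => s"bank transfer from $iban"
}

// Currying: a function applied one argument list at a time
def applyDiscount(rate: Double)(price: Double): Double = price * (1 - rate)
val tenPercentOff: Double => Double = applyDiscount(0.10)
List(100.0, 250.0).map(tenPercentOff) // List(90.0, 225.0)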

“Scala lets you build the best of both worlds in your code.”

It’s love

Scala is functional and type-safe, but you encounter the difficulties progressively (you can still write Java-like code), so at no point will you get bored. It’s more likely you will feel like you are always on an intellectual adrenaline rush. You could poetically compare it to the feeling of falling in love for the first time. I think that’s truly one of the reasons Scala is definitely worth the time you will need to learn it. It will change the way you think forever. 

Share some love for Scala in the comments and feel free to ask any questions.

Check out other Scala-related articles on our blog:

At Scalac, we always try to use the technology most appropriate for the requirements set by the business. We're not afraid to change it once we realize it's not fit for purpose. By making these changes we learn, we grow, we evolve. In this article, we want to share what we learned during our migration from Apache Spark + Apache Cassandra to Apache Druid.

 

Why did we decide to do it?

What can you expect after the change?

How did this change reduce costs and simplify our solution, and why did we fail to meet expectations with the first architecture?

Let’s find out together!

 

The Law of Expectation

Defining expectations is a crucial part of every architecture, application, module, or feature development. The expectations presented to us were as follows.

 

We need to process and deliver data to the UI in near real-time.

Since our project is based on Blockchain, we want to react to all changes as soon as they are made. This can only be achieved with a real-time architecture. It is one of the most important decisions to be made at the beginning of a project, because it limits the range of technologies we can take with us.

 

We need to be able to display aggregated data with different time granularities.

To improve performance and readability in the UI, we want to return aggregated data. This should make displaying the data more responsive and lightweight. The solution should be flexible enough to allow us to configure multiple time ranges, because they are not yet defined and may depend on performance or UI-layer findings.

 

We need to introduce a retention policy.

Since the data become meaningless to us after a specific time, we do not want to keep them forever; we want to discard them after 48 hours. 

We want to deploy our solution on the cloud.

The cloud is the way to go for most companies, especially at the beginning of a project. It allows us to scale at any time and to pay only for the resources we are using. Often just a small number of developers start a project, so there is no time to focus on machine management. The entire focus should be on something that brings value to the business.

 

We want to be able to display data about wallets and transactions.

Since this project is all about Blockchain, we want to display information on different levels. The most detailed is the transaction level, where you can see data exchange and payments between different buyers and sellers. A wallet groups multiple transactions for the same buyer or seller. This allows us to observe the changes in the Blockchain world.

 

We want to display TOP N for specific dimensions.

In order to display the wallets with the biggest participation in day-by-day payments on the Blockchain, we need to be able to create tables that contain wallets sorted by the amount of data changes or payments during a certain time.

 

Let’s take Spark along with us!

Since the road to a real-time architecture was a bit dark for us at the time, we decided to take Apache Spark along with us! Nowadays, Spark is advertised as a tool that can do everything with data. We thought it would work for us… but we couldn't see what was coming. It was dark, remember?

 

Spark Cassandra Architecture

Architecture

Initially, we came up with the above architecture. There are a couple of modules that are important, but they are not crucial for further discussion.

  1. Collector – this was responsible for gathering the data from different APIs, such as Tezos and Ethereum, unifying the format between the two, and pushing the changes to Apache Kafka as JSON. Additionally, since we were dealing with streaming, we needed to know where we were, so that we could resume from a specific place and not read the same data twice. That's why the Collector saves the number of processed blocks to the database – Apache Cassandra in our case. We already had it in place; we simply used it for a slightly different purpose.
  2. API – this was a regular API, simply exposing a couple of resources to return results from Cassandra to our UI. It is important to note that it was not only a window onto the data in Cassandra; through this API we could also create users and roles and write them back to the database.

 

Motivation

It's always important to ask yourself why you are choosing particular technologies and what you want them to do for you. Here are the motivations for the tools we decided to use:

 

Why did we decide to use Apache Kafka?

At the very beginning, we wanted to create a buffer in case Spark turned out to be too slow to process the data, so that we would not lose them. Additionally, we wanted to decouple the two tools from each other: direct integration between the data source and Spark might be hard, and a deployment of either component could break the entire pipeline. With Apache Kafka between them, it's unnoticeable. Since the entire solution was designed to work with Blockchain, we knew that we couldn't lose any transactions, so Kafka was a natural choice here. Additionally, to fulfill the requirement of handling different time series, we knew that we could read the same data from the same topic with multiple consumers. Apache Kafka is also very popular, with good performance and documentation.
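Concretely – as a generic illustration, not our actual consumer code, with placeholder broker and topic names – two consumers in different consumer groups each get their own full copy of the stream:

import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

def consumerFor(groupId: String): KafkaConsumer[String, String] = {
  val props = new Properties()
  props.put("bootstrap.servers", "kafka:9092") // placeholder address
  props.put("group.id", groupId)               // a different group id means an independent copy of the data
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(List("transactions").asJava)  // placeholder topic name
  consumer
}

// Each group keeps its own offsets, so both jobs read the full stream independently
val oneMinuteJob = consumerFor("aggregates-1m")
val tenMinuteJob = consumerFor("aggregates-10m")
val records      = oneMinuteJob.poll(Duration.ofSeconds(1)).asScala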

 

Why did we decide to use Spark Streaming?

In our opinion, Spark Streaming was a good choice because it has good integration with Apache Kafka, it supports streaming (in the form of micro-batches), it has good integration with Apache Cassandra, and aggregating by a specific window is easy. Spark Streaming stores the results of the computation in its internal state, so we could configure it to precompute the TOP N wallets by transactions there. Scaling in Spark is not a problem. Spark supports multiple modes for emitting events (OutputMode) and for writing operations (SaveMode). We thought that OutputMode.Append and SaveMode.Append would be a good option for us. 
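To make this more concrete, here is a minimal sketch of the kind of job we are describing – not our production code. It assumes a Kafka topic called transactions with a JSON payload of wallet_id, amount and ts, plus the spark-sql-kafka and spark-cassandra-connector dependencies on the classpath; the addresses, keyspace and table names are placeholders:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.streaming.OutputMode

val spark = SparkSession.builder.appName("wallet-aggregates").getOrCreate()
import spark.implicits._

// Expected shape of the JSON messages on the topic (assumed for this sketch)
val schema = new StructType()
  .add("wallet_id", StringType)
  .add("amount", DoubleType)
  .add("ts", TimestampType)

// Read the raw transactions from Kafka as a stream
val transactions = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092") // placeholder address
  .option("subscribe", "transactions")             // placeholder topic name
  .load()
  .select(from_json($"value".cast("string"), schema).as("tx"))
  .select("tx.*")

// Sum the amounts per wallet over a 10-minute event-time window
val windowed = transactions
  .withWatermark("ts", "10 minutes")
  .groupBy(window($"ts", "10 minutes"), $"wallet_id")
  .agg(sum($"amount").as("amount"))
  .select(
    $"wallet_id",
    lit("10m").as("window"),          // label of the time dimension
    $"window.start".as("window_ts"),  // start of the aggregation window
    $"amount"
  )

// Append each finalized aggregate to Cassandra – the mode that later caused us trouble
val query = windowed.writeStream
  .outputMode(OutputMode.Append())
  .foreachBatch { (batch: DataFrame, _: Long) =>
    batch.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "blockchain", "table" -> "wallets")) // placeholder names
      .mode("append")
      .save()
  }
  .start()

query.awaitTermination()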

 

Why did we decide to use Apache Cassandra?

We decided to use Apache Cassandra mostly because of its fast reads and good integration with Spark. Additionally, it's quite easy to scale Cassandra. Cassandra is also a good fit for storing time-series data. The queries should not be too hard in our case, because the data would already be prepared. Cassandra puts more of the burden on the writing side than on the reading side.

 

1. First bump. Stream Handling in Spark.

At the beginning, we didn't realise that one of the problems would be that Spark can read/write only from/to one stream at a time. This meant that each time we wanted to add another time dimension, we would need to spawn another Spark job. This had an impact on the code, because it had to be generic enough to handle multiple dimensions through its configuration. Adding more and more jobs always increases the complexity of the entire solution.
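Continuing the sketch from the Spark Streaming section above, such a job can take its time dimension from configuration, so that the same code is deployed once per window; the configuration key below is made up for the example:

// One deployment of this job per time dimension, e.g.
//   spark-submit ... --conf spark.wallet.window=1m
//   spark-submit ... --conf spark.wallet.window=10m
val windowLabel = spark.conf.get("spark.wallet.window", "10m") // assumed config key

val windowLength = windowLabel match {
  case "1m"  => "1 minute"
  case "10m" => "10 minutes"
  case "1h"  => "1 hour"
  case other => sys.error(s"Unsupported window: $other")
}

// The same aggregation as before, parameterized by the configured window
val aggregated = transactions
  .withWatermark("ts", windowLength)
  .groupBy(window($"ts", windowLength), $"wallet_id")
  .agg(sum($"amount").as("amount"))
  .select($"wallet_id", lit(windowLabel).as("window"), $"window.start".as("window_ts"), $"amount")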

 

2. Second bump. Cluster requires more and more resources.

Along with the generalization of the Spark job, we had to change our infrastructure. We were using Kubernetes for this, but the problem was that with Spark we could not assign less than 1 core, even though Kubernetes allows us to assign 0.5 CPU. This led to poor core utilization of the Kubernetes cluster. We were forced to increase the resources within the cluster multiple times to run new components. At some point, we were using almost 49 cores!

 

3. Third bump. Spark’s OutputMode.

We didn’t have to wait too long for the first problems. That was actually a good thing! We realized that OutputMode.Append was not going to work for us. You may ask yourself: why? Glad you asked! We’re happy to explain.

We needed to be able to ask Apache Cassandra for the TOP N results – in our case, the TOP 10 wallets by the number of transactions made by them.

 

Firstly, let’s think about the table structure:

CREATE TABLE IF NOT EXISTS wallets (
   wallet_id       TEXT,
   window          TEXT,
   window_ts       TIMESTAMP,
   amount          DOUBLE,
   PRIMARY KEY ((wallet_id, window, window_ts), amount)
) WITH CLUSTERING ORDER BY (amount DESC);

Secondly, let’s think about the query:

SELECT * FROM wallets WHERE window = '10m' AND window_ts = '2020-01-01 10:00:00.000' LIMIT 10;

Due to the definition of primary key + WITH CLUSTERING ORDER BY, the data will already be sorted by amount.

 

Thirdly, let’s insert some data into it:

INSERT INTO wallets (wallet_id, window, window_ts, amount) VALUES ('WALLET_A', '10m', '2020-01-01 10:00:00.000', 1000);
INSERT INTO wallets (wallet_id, window, window_ts, amount) VALUES ('WALLET_B', '10m', '2020-01-01 10:00:00.000', 1020);
INSERT INTO wallets (wallet_id, window, window_ts, amount) VALUES ('WALLET_C', '10m', '2020-01-01 10:00:00.000', 1050);
INSERT INTO wallets (wallet_id, window, window_ts, amount) VALUES ('WALLET_D', '10m', '2020-01-01 10:00:00.000', 800);
INSERT INTO wallets (wallet_id, window, window_ts, amount) VALUES ('WALLET_B', '1m', '2020-01-01 10:00:00.000', 500);

Now imagine that Spark is grouping the data over a 10-minute window and that it emits an event once its internal state changes. Only the change itself will be emitted. 

A new transaction comes in for wallet WALLET_A, at time 2020-01-01 10:03:00.000, with an amount of 500.

After that, Spark is going to emit an event and ask Cassandra to insert:

INSERT INTO wallets (wallet_id, window, window_ts, amount) VALUES ('WALLET_A', '10m', '2020-01-01 10:00:00.000', 1500);

Normally, we would expect the first row from the data we inserted to be updated – but it wasn't, because the amount, which changed, is part of the primary key. Why is this important? Because with an insert operation instead of an update, we end up with multiple rows for the same wallet. And we cannot change the primary key, because we want to sort the data by amount – and so the circle closes. 

 

Can OutputMode.Complete push us forward?

Since the road with OutputMode.Append was closed to us, we decided to change the output mode to Complete. As the documentation says, the entire state from Spark will periodically be sent to Cassandra. This would, of course, put more pressure on Cassandra, but at the same time we would ensure that no duplicates were inserted into the database. In the beginning everything worked well, but after some time we noticed that with each insert, Spark was putting in more and more rows. 

 

Let’s turn back. Spark + Cassandra does not meet our expectations.

Why did we decide to turn back and change the current architecture? There were two main reasons.

Firstly, we investigated why Spark was only ever adding new rows, and it turned out that the watermark configured in Spark was not discarding the data when it should. This meant that Spark would collect data in its internal state indefinitely. (In fairness, for Complete output mode the Structured Streaming documentation states that all aggregate state has to be preserved, so watermarking cannot be used to drop it there.) Obviously, this was something we could not live with, and it was one of the most important reasons for moving away from Spark + Cassandra. Spark could not produce the results for one of our most important queries, so we had to turn back.

Secondly, we noticed that maintenance was getting more complicated with each time-series dimension, and the resource configuration for the Spark cluster was not flexible enough for us. This led to costs that were too high for the entire solution. 

 

The New Beginning, with Apache Druid.

You may ask yourself: why did we decide to choose Apache Druid? As it turned out, when you are deep in the forest, a Druid is the only creature that can navigate you through it.

Apache Druid Architecture

Architecture

At first glance the change does not look that significant, but when you read the motivation, you will see how much good it did for us.

Motivation

  1. Firstly, Apache Druid allows us to explore the data, transform them, and filter out what we do not want to store. It includes storage as well, so we can configure a data retention policy, and we can use SQL to query the collected data (see the sketch just after this list). This means that Druid provides a complete solution for processing and storing the data, where in the previous architecture we had been using Apache Spark + Apache Cassandra. By adopting this one tool, we could remove two components from our initial architecture and replace them with a single one, which reduced the complexity of the entire solution. Additionally, it gave us a single place to configure it all. As a trade-off, this solution is not as flexible as Apache Spark + Apache Cassandra may be. The lack of flexibility shows up mostly in data transformation and filtering rather than in storage: even though Druid provides simple mechanisms for this, it will never be as flexible as writing code. Since we mostly aggregate by time, this is a price we can pay.
  2. Secondly, configuring Apache Druid on top of Kubernetes works much better. Druid consists of multiple separate services that handle different parts of the processing. This allows us to configure resources at multiple levels, so we can use the resources we have more efficiently.
  3. Thirdly, in Apache Druid we can use the same data source multiple times to create different tables. Since we used Apache Druid mostly to fulfill the TOP N query, we re-used the same Kafka topic twice: with and without rollup. Since the smallest time granularity we have is 1 minute, that is how the rollup was configured. This speeds up queries, because the data are already pre-aggregated. It can have a huge impact when the rollup ratio is high; when it's pretty low, rollup should be disabled.
  4. Fourthly, since Apache Druid has a UI, you do not need to write any code to make it work – unlike what we had to do with Apache Spark. At the end of the configuration process in Apache Druid, we get a JSON spec, which can be persisted and versioned somewhere in an external repository. Note that some of the configuration (such as retention policies) is versioned by Apache Druid automatically.
  5. Fifthly, Apache Cassandra is not the best database for storing data about users, permissions, and other website-management data. This is because Cassandra is made for fast reads, and operations of this kind can have a negative performance impact; to avoid this, they are not supported by design. Relational databases are better for this kind of data, so we decided to introduce PostgreSQL for it.
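As an illustration of point 1 above (querying the collected data with SQL), here is a hedged sketch of issuing the TOP N query against Druid's SQL endpoint from Scala, using only the JDK's HTTP client; the host, datasource, and column names are assumptions made for this example:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// TOP 10 wallets by total amount over the last 10 minutes (datasource and column names assumed)
val sql = "SELECT wallet_id, SUM(amount) AS total FROM wallets " +
  "WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '10' MINUTE " +
  "GROUP BY wallet_id ORDER BY total DESC LIMIT 10"

val body = s"""{"query": "$sql"}"""

// Druid's SQL API is an HTTP POST to /druid/v2/sql (the Broker listens on 8082 by default)
val request = HttpRequest.newBuilder(URI.create("http://druid-broker:8082/druid/v2/sql"))
  .header("Content-Type", "application/json")
  .POST(HttpRequest.BodyPublishers.ofString(body))
  .build()

val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
println(response.body()) // JSON array of rows: [{"wallet_id": "...", "total": ...}, ...]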

What did Druid tell us?

Most likely, you don’t want to run Druid in local mode.

By default, Druid is configured to run only one task at a time (local mode). This is problematic when you have more data sources to process, i.e. when you are creating more than one table in Druid. We had to change druid.indexer.runner.type to "remote" to make it work as expected. If you have only one task running and multiple tasks pending, make sure that you have changed this option to "remote". You can read more about it here.

 

The Historical process needs to see segments.

We encountered an issue with the Historical process, which did not notice the segments (files) created by the Middle Manager. This happened because we had not configured proper storage for the segments. Druid works in such a way that the segments need to be accessible by both the Historical service and the Middle Manager, because only the ownership is transferred from the Middle Manager to the Historical service. That is why we decided to use storage on the Google Cloud Platform. You can read more about deep storage here.

 

What resources did we burn?

 

 

           Apache Spark + Apache Cassandra    Apache Druid
CPU        49 cores                           6 cores
RAM        38GB                               10GB
Storage    135GB                              20GB

As you can see, we managed to drastically decrease the number of cores used to run our infrastructure, as well as the memory and storage. The biggest reason this was possible is that we previously had multiple Spark jobs running to do the same work that a single Apache Druid deployment can do. The resources needed for the initial architecture had to be multiplied by the number of jobs running, which in our case was 24.

 

Summary

To sum up what we learned during this venture: 

Apache Spark is not a silver bullet.

Spark allows us to do many, many things with data: reading data in many formats, cleaning it, transforming it, aggregating it, and finally writing it out in multiple formats. It has modules for machine learning and graph processing. But Spark does not solve every problem! As you can see, in our case it was impossible to produce results that could fulfill the requirements of a TOP N query. 

Additionally, you need to keep in mind that the architecture of your code, its generalization, and its complexity will significantly affect the resources needed to process the data. Because we had 24 jobs to execute, we needed a huge amount of resources to make it all work.

 

Apache Spark is made for big-data.

When you have many gigabytes of data to process, you could consider Spark. One rule of thumb says that you should consider using Spark when even the biggest machine available cannot process the data on its own. Keep in mind that processing data on a single machine will always be simpler than doing it on multiple machines (I'm not saying always faster, but sometimes even that is true). Nowadays everyone wants to use Spark, even when there are just a couple of megabytes of data. Bear in mind that choosing the right tool for the right job is half the battle.

 

There is always a tradeoff.

There are no silver bullets in the software development world, and as you have seen above, Spark is no exception. There is always a tradeoff. 

You can drive faster, but you will burn more fuel. You can write your code faster, but the quality will be poorer. You can be more flexible and generalize your code better, but it will become harder to read, extend, and debug. Finally, you can pick a database that allows you to read fast, but you will need to work more on its structure, and writing to it will be slower.

You need to find the right balance and see if you can cope with the drawbacks that a specific technology brings along with its positives.

 

Use the right tool for the right job.

Good research is the key to making sure that the appropriate technology has been chosen. You need to remember that sometimes it is better to reserve a bit more time for investigation and for writing a Proof of Concept before diving headfirst into one technology or another; otherwise, you may well end up wasting time rewriting your solution. Obviously, this depends on where you spot the first obstacles. In our example, you can see that initially our infrastructure was quite complicated and required a lot of resources. We managed to change that dramatically by using a tool that fitted the purpose better.

 

How it looks now. A working solution.

Even though this article sounds a bit theoretical, the project we took this experience from is real and still in progress. 

The analytical dashboard that we created to visualize the data from different blockchains is based on React with Hooks, Redux, and Saga on the front end, and Node.js, Apache Druid, and Apache Kafka on the back end. 

At the time of writing this article, we are planning to rewrite our data-source component from Node.js to Scala + ZIO, so you can expect some updates, either in the form of a blog post or an open-source project on GitHub. We have a couple of blockchains integrated, and we are planning to add more. We are also planning to make the API public so that everyone can experiment and learn from it just as we did.

 

You can check our Analytical Dashboard out right now at https://www.flowblock.io

Check out also our websites on: