More and More Data: How We’re Confronting the Challenges

This post was originally published on Brandwatch's main blog on 1st August 2014.

Working at Brandwatch over the past three years has been exciting.

Brandwatch is a world-leading social media monitoring tool: we crawl the web in real time for mentions that match our customers’ queries. There are various facets to the challenges we face, the first being that our business is booming. As mentioned in a previous blog post by our Architecture Lead Steve Mason, the number of client queries is doubling every year.

In addition to this, the amount of data that we crawl is always increasing.

With news articles, forum comments, tweets and Facebook posts as just some of our sources, we’re already past 70 million query matches per day from 40+ web crawlers, and we process and store every one of them.

This is a big challenge for our Java developers.

New features have to scale out of the box to keep up with the demands on our infrastructure, and vertical scaling only gets you so far.

Horizontal scaling is a big focus of our work at the moment. This affects how we move data around our infrastructure and also how we process it. We need to think carefully about the current requirements of our system and also how effective it will be in the years to follow.

Scaling messages

Once upon a time, when Brandwatch was much smaller, we were able to cope by having applications poll the database for updates to the jobs they were working on.

As we grew, various parts of the system started using queues to process the ever-increasing number of messages. These messages ranged from notifications of new tasks for the system to run through to heavy data consumption from Twitter.

However, we found that configuring clusters of queues was cumbersome, not very flexible and difficult for developers to learn and experiment with. We needed a better way.

During the prototyping of a new feature we decided to try out Apache Kafka.

Kafka is a publish-subscribe messaging system rethought as a distributed commit log, running over a cluster of nodes that can grow without bound. It was developed at LinkedIn and is in heavy production use there: a 350+ node cluster of brokers processes over 278 billion messages per day.

Kafka is extremely easy to install and play with, and we recommend having a go at the Quick Start guide. Based on our positive experience with Kafka so far, we are beginning to replace our queue instances, including those that process Twitter data. Sending a message to Kafka means it is written to disk, rather than kept in RAM as most queues do.

In fact, writing messages contiguously to disk dramatically reduces seek overhead, and sequential reads and writes can even outperform random access to RAM! Another benefit of disk persistence is that processes are less affected by large input spikes: consumers simply work through the backlog at their own pace and catch up in quieter times.

Really interesting stuff.
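To give a feel for how little ceremony is involved, here’s a minimal sketch of publishing a message with Kafka’s Java producer client. The broker address, topic name and payload below are placeholders rather than our real configuration.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MentionPublisher {

    public static void main(String[] args) {
        Properties props = new Properties();
        // Address of one or more Kafka brokers (placeholder value)
        props.put("bootstrap.servers", "kafka1:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // The message is appended to the topic's log on disk rather than held in RAM
            producer.send(new ProducerRecord<>("mentions", "query-42", "{\"text\":\"example mention\"}"));
        }
    }
}
```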

Scaling processing

Now that we’re using Kafka in order to move gigabytes of data around, we need to make sure that we’re able to process it. A single JVM will not be able to process everything, so we need to spread the load over multiple JVMs.

This is a common problem with any new data processing features that we release.

Our solution has been to build applications that share their load and handle failover via leader election, and it has been working well. Leader election is a way of deciding which node in a group of distributed nodes is the leader for a given task. We have been using Apache ZooKeeper coupled with the Apache Curator libraries to coordinate our jobs.

Imagine that you have jobs being created, each with a unique ID, in a distributed system of JVMs.

Each JVM needs to be notified when there is a new job to take, and each job should be taken by exactly one JVM. Without the help of existing open source systems, writing the job scheduling would be a complete pain.

However, Curator and ZooKeeper tame this complexity.

ZooKeeper is a way of coordinating and managing distributed systems, and it can conceptually be thought of as a tree of nodes that your application can read from and write to.

If we’re implementing a system to support a new type of recurring job called foo, then we would create a node in the tree called foo.

Then, each time a new foo job is created, the system creates a new child node of foo with a unique identifier; we suggest using the ID of that job in the system. For example, with two active foo jobs with IDs 143 and 523, the tree would contain the child nodes /foo/143 and /foo/523.
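Here’s a rough sketch of what creating those nodes looks like with Curator; the ZooKeeper connection string and the job ID are illustrative rather than taken from our setup.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class FooJobRegistrar {

    public static void main(String[] args) throws Exception {
        // Connect to ZooKeeper (connection string is a placeholder)
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Register a new foo job by creating a child node named after its ID;
        // creatingParentsIfNeeded() will create /foo first if it doesn't exist yet
        client.create().creatingParentsIfNeeded().forPath("/foo/143");
    }
}
```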

Next, we need our JVMs to be assigned jobs. This happens via leader election.

Leader election with Curator works by having each JVM watch the parent foo node and receive a callback each time that a child node is added or removed.
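That watch-and-callback piece is what Curator’s PathChildrenCache recipe gives us. A rough sketch (the logging is purely for illustration):

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.cache.PathChildrenCache;
import org.apache.curator.framework.recipes.cache.PathChildrenCacheEvent;
import org.apache.curator.framework.recipes.cache.PathChildrenCacheListener;

public class FooJobWatcher {

    public static void watch(CuratorFramework client) throws Exception {
        // Cache the children of /foo and fire a callback whenever one is added or removed
        PathChildrenCache cache = new PathChildrenCache(client, "/foo", true);
        cache.getListenable().addListener(new PathChildrenCacheListener() {
            @Override
            public void childEvent(CuratorFramework client, PathChildrenCacheEvent event) {
                switch (event.getType()) {
                    case CHILD_ADDED:
                        System.out.println("New foo job: " + event.getData().getPath());
                        break;
                    case CHILD_REMOVED:
                        System.out.println("foo job finished: " + event.getData().getPath());
                        break;
                    default:
                        break;
                }
            }
        });
        cache.start();
    }
}
```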

When a child is added, each JVM races to write an ephemeral sequential child node underneath that new job node. An ephemeral node only exists as long as the connection with its creator exists.

A sequential node is automatically given a number that corresponds to the order it was written in.

Whichever JVM gets there first (i.e. has sequential node 0) wins the leader election and gets a callback to begin working on that job. The others wait. If the leader dies, the leader’s ephemeral node is removed, and leadership passes to the next sequential node in line.
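Curator packages this election pattern up in its LeaderLatch recipe, so we don’t have to manage the ephemeral sequential nodes by hand. A minimal sketch for a single foo job might look like this; the election path and the handler bodies are illustrative.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;

public class FooJobElection {

    public static LeaderLatch electFor(CuratorFramework client, final String jobId) throws Exception {
        // Each participating JVM writes an ephemeral sequential node under the job's
        // election path; the JVM holding the lowest-numbered node is the leader
        LeaderLatch latch = new LeaderLatch(client, "/foo/" + jobId + "/election");
        latch.addListener(new LeaderLatchListener() {
            @Override
            public void isLeader() {
                // This JVM won the election (or inherited leadership after a failure)
                System.out.println("Now working on foo job " + jobId);
            }

            @Override
            public void notLeader() {
                // Leadership was lost, e.g. the ZooKeeper connection dropped
                System.out.println("No longer the leader for foo job " + jobId);
            }
        });
        latch.start();
        return latch;
    }
}
```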

Using this leader election technique we have been able to scale single applications into clusters of applications that coordinate their work automatically and deal with failover reliably, since there are always spare JVMs in the leadership queue.

We are also in the process of moving more of our configuration and discovery mechanisms away from the database and into ZooKeeper itself.

Scaling engineers and careers

Building for scale is a skill in ever-increasing demand.

As Brandwatch grows, we are always looking for talented engineers to join us and contribute towards building something amazing. There’s lots to do and learn. Head over to our careers page to see where your next opportunity could be.