Blending Brandwatch data with Apache Drill

Using Brandwatch, you can collect some fascinating data. Our clients use our platform to gain intelligence on what the world is saying about what matters to them most. Often, we notice that users will export data from the app and then compare and contrast it with other types of data like their sales figures. Sometimes this can be a tricky process, requiring a lot of Excel wizardry or worse: manual editing, cutting and pasting!

Recently, in my Funky Friday time, I was experimenting with Apache Drill, which has just announced its 1.0 release. Drill is a tool for doing SQL queries over unstructured data, meaning that you can go from data to insights in next to no time. Being able to be so agile with data opens up a lot of possibilities for data scientists and analytics to get their hands dirty without needing any support from engineers or sysadmins. Drill also works on top of Hadoop, Amazon S3, and Google Cloud Platform (to name a few), meaning that it can scale to extremely large datasets. But, for today, let's just have a look at some of the things that you can do with Drill on your own machine working from plain old data files.

First, let's install Apache Drill. Head over to the downloads page, and download it. Unzip it to a location of your choosing, and you're ready to go.

Next, I thought I'd grab some Brandwatch data. I happen to be using our platform to look into conversation about tax in the UK; in particular: what was the world saying about HMRC? I exported a CSV of the last two weeks of data in the usual way. This consisted of about 12,000 mentions.

Now, it's worth mentioning that there were two preparations I had to do before I could use this with Drill. Firstly, I had to delete the initial CSV header line, as Drill can't automatically detect it yet. Secondly, I had to replace carriage returns in the mention text with spaces, which I did using the excellent csvquote. After that, we were good to go.

If you know SQL, you can use Drill. For example, in our CSV exports, column 23 is the country that the mention was posted from. We can select all of the countries in Drill by using the following statement:

select mentions.columns[23] as country  
from dfs.`/Users/jamess/libraries/apache-drill-1.0.0/hmrc-cleaned.csv` mentions;  

We run this by executing the embedded Drill server, which always gives us a lovely random quote when starting up.

Then we paste in the query above and hit return:

Cool! So this is starting to look very much like SQL, and I feel comfortable here. How about counting the top 10 countries from this CSV file? Let's use the following statement:

select mentions.columns[23] as country,  
count(mentions.*) as no_mentions  
from dfs.`/Users/jamess/libraries/apache-drill-1.0.0/hmrc-cleaned.csv` mentions  
where mentions.columns[23] <> ''  
group by mentions.columns[23]  
order by count(mentions.*)  
desc limit 10;  

This gives us the following result:

So far, so good. But it gets more interesting when we bring in other data sources, and we don't even need to use CSV. That's the beauty of Drill - it seamlessly works with JSON, HBase, TSV and more. Let's see how many people live in the countries that we're selecting here. You can head over to the World Bank and find population data to export. Let's export it in JSON to show the power of Drill. The data looks like this, with a JSON map for each year:

  …
  {
    "Country Name": "Venezuela, RB",
    "Country Code": "VEN",
    "Year": "1989",
    "Value": "19294590.4"
  },
  {
    "Country Name": "Venezuela, RB",
    "Country Code": "VEN",
    "Year": "1990",
    "Value": "19750000"
  },
  {
    "Country Name": "Venezuela, RB",
    "Country Code": "VEN",
    "Year": "1991",
    "Value": "20197000"
  },
  …

Using Drill, we can select the 5 smallest population values for 2005:

select *  
from dfs.`/Users/jamess/libraries/apache-drill-1.0.0/population.json` population  
where population.`Year` = 2005  
order by cast(population.`Value` as double)  
asc limit 5;  

Which gives us this (I certainly hadn't heard of Tuvalu before!):

Time for the fun bit: let's mix this up with our Brandwatch data, and provide the population values for the top 10 countries talking about HMRC:

select  
mentions.columns[23] as country,  
count(mentions.*) as no_mentions,  
population.`Value` as no_people  
from dfs.`/Users/jamess/libraries/apache-drill-1.0.0/hmrc-cleaned.csv` mentions  
join dfs.`/Users/jamess/libraries/apache-drill-1.0.0/population.json` population  
on mentions.columns[23] = population.`Country Name`  
where population.`Year` = 2005  
group by mentions.columns[23], population.`Value`  
order by count(mentions.*)  
desc limit 10;  

And the results:

Insights in mere minutes from Brandwatch and the World Bank with Apache Drill, in a way that works identically on my laptop and across an entire Hadoop cluster. I'm definitely going to be using Drill more to explore our existing data. It's great for prototyping new features and generally exploring what you already have, without having to write or deploy any real code.

If you're interested in following the development around Drill, then do check out @ApacheDrill and their website. What could you find within your own data?