Mar 3, 2017

Exploring the GDELT data set with Amazon Athena

The Global Database of Events, Language and Tone (GDELT) Project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organisations, counts, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day.

Data set v1.0 is publicly available in S3 and I figured that this would be as good an excuse as any to play with my new favorite analytics service, Amazon Athena. Yes, Redshift, I still love you, baby ;)

The data set contains (at the time to writing) 1,542 uncompressed CSV files: 58 columns, 440+ million lines, 140+ GB. It is updated daily.

Data is formatted according to a couple of specs (here and here, PDF). A few lookup tables are also available here (country codes, organisation names, etc.).

All right, let's get to business. As you may know, Athena is able to query data hosted in S3 : no infrastructure to launch or manage, no data preparation, no loading time. All we have to do is create a table using the Hive DDL.

This is how we do it. All scripts and queries are available on Github, including how to create lookup tables.

Quite a mouthful, but pretty straightforward: just read the doc, define as many columns as needed with the right type and point to the S3 bucket holding all files. One immediate benefit of this is that whenever we run queries, Athena will automatically use all available files, including the additional one that's delivered daily. Zero work!

OK, let's run some queries. I'm using SQL Workbench/J with the Athena JDBC driver.

None of these took more than 30 seconds and that's with uncompressed CSV, the least performing data format possible. Converting the data set columnar formats such as Parquet or Orc would yield a massive improvement, but it's extra work, so why bother? ;)

That's it for today. Thanks for reading!