How to Process a Million Songs in Seconds with Shark

Appuri is hiring data scientists, Java/Scala developers, and front-end devs proficient with modern frameworks like AngularJS. Please drop me a line at bilal at appuri dot com if you are interested in learning more.

TL;DR

Shark gives SQL-like access to big data with the speed advantages of a distributed cache.

In my previous blog post, I showed how to process the Million Song Dataset using Spark. This post will be a brief introduction to using Shark to do the same.

What is Shark?

Shark is a project by the same team that brought us Spark. Shark has its own REPL (or ‘shell’, as they call it), where you can write SQL-ish queries against large datasets. Shark is built on top of Hive, and implements the Hive Query Language (HiveQL).

The key benefit of Shark vs. Hive is that Shark queries run on Spark, so they can take advantage of Spark’s distributed cache.

Setup

Follow the instructions in my previous blog post to set up the cluster. Once you are ssh’ed in, run these commands to start the Shark shell in cluster mode:

export MASTER=`cat /root/spark-ec2/cluster-url`
/root/ephemeral-hdfs/shark/shark-shell
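
Once the shell comes up, a quick sanity check is to list the (currently empty) set of tables. This is just a suggestion, not part of the original setup, but SHOW TABLES is standard HiveQL and works in Shark:

-- should return an empty list on a fresh cluster, confirming the shell can reach the metastore
SHOW TABLES;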

Just like Hive, you need to create a table in Shark before you can query it. The command below creates a table in Shark – note that it doesn’t actually copy the data into the Hive warehouse; it just populates metadata:

CREATE EXTERNAL TABLE msd (
  track_id string, analysis_sample_rate int, artist_7digitalid int,
  artist_familiarity double, artist_hotttnesss double, artist_id string,
  artist_latitude double, artist_location string, artist_longitude double,
  artist_mbid string, artist_mbtags string, artist_mbtags_count int,
  artist_name string, artist_playmeid string, artist_terms array<string>,
  artist_terms_freq array<double>, artist_terms_weight array<double>,
  audio_md5 string, bars_confidence array<double>, bars_start array<double>,
  beats_confidence array<double>, beats_start array<double>,
  danceability double, duration double, end_of_fade_in double, energy double,
  key int, key_confidence double, loudness double, mode int,
  mode_confidence double, release string, release_7digitalid int,
  sections_confidence array<double>, sections_start array<double>,
  segments_confidence array<double>, segments_loudness_max array<double>,
  segments_loudness_max_time array<double>, segments_loudness_start array<double>,
  segments_pitches array<double>, segments_start array<double>,
  segments_timbre array<double>, similar_artists array<string>,
  song_hotttnesss string, song_id string, start_of_fade_out double,
  tatums_confidence array<double>, tatums_start array<double>, tempo double,
  time_signature int, time_signature_confidence double, title string,
  track_7digitalid int, year int
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
LOCATION '/msd';

This looks like a regular Hive CREATE TABLE command, but it’s worth breaking down:

  • You have to define each field’s name and type, e.g. track_id string
  • You can declare collection fields, e.g. artist_terms array<string>, for fields whose values consist of multiple elements – for example, a song’s genre tags might be “rock,hard rock” (see the query sketch after this list)
  • You can specify how fields are delimited, e.g. ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  • You can specify how collection items are delimited, e.g. COLLECTION ITEMS TERMINATED BY ','
  • You have to specify where the data lives on HDFS, e.g. LOCATION '/msd'
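
Since artist_terms is declared as array<string>, standard HiveQL array syntax applies to it. Here is a hedged sketch – array indexing and LATERAL VIEW explode are stock Hive features, but the exact queries are illustrative rather than taken from the dataset docs:

-- peek at the first term for a handful of tracks (arrays are zero-indexed in HiveQL)
SELECT track_id, artist_terms[0] FROM msd LIMIT 10;

-- flatten the array with LATERAL VIEW explode to count how often each term appears
SELECT term, COUNT(*) AS cnt
FROM msd LATERAL VIEW explode(artist_terms) terms AS term
GROUP BY term
ORDER BY cnt DESC
LIMIT 20;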

Running queries

Queries are really simple:

SELECT title, artist_location FROM msd WHERE year >= 1970 AND year < 1980;

You just ran your first Shark query! Notice that if you run this query again, it takes just as long, because you have not enabled caching yet. Let’s create another table, this time with caching enabled:

CREATE EXTERNAL TABLE msd_cached...

Note that by appending _cached to the table name, you instruct Shark to keep the table in Spark’s in-memory cache between queries. Now if you run the same query twice, the second run will be much faster.
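
If you would rather build the cached table from the external table you already defined, one common pattern is CREATE TABLE ... AS SELECT. This is a hedged sketch – the _cached naming convention comes from Shark, but the exact statements below are illustrative:

-- Shark keeps tables whose names end in _cached in Spark's in-memory cache
CREATE TABLE msd_cached AS SELECT * FROM msd;

-- the same query as before, now served from memory instead of HDFS
SELECT title, artist_location FROM msd_cached WHERE year >= 1970 AND year < 1980;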

Final Thoughts

Shark is an interesting project, and a great complement to Spark. We don’t use it as much as I’d like at Appuri, mainly because Spark fits the bill. I can see programmers using Spark more, and data scientists who are not very strong in Scala using Shark.

Do you have thoughts on a killer use case for Shark? Please share them in the comments below.