In this post we’ll be taking a look at Athena. Launched towards the end of 2016 and complementing the existing Amazon Analytics suite (which features tools such as Elastic Map Reduce, Kinesis and Redshift), It is a ‘serverless’ technology which in short means you don’t need to make provisions in EC2 or VPC to run it. Using Presto behind the scenes, it allows you to access data files stored in S3 (in formats such as CSV, Parquet and JSON) using standard SQL. We’ll walk through configuring users, access, and table creation, and run some simple queries and visualisations.
We’ll start by creating a new S3 bucket, in which we’ll store our data files:
In IAM, create a new user and access policy. We’ll create the user with an access ID and secret key (required later)
Next we’ll create and attach a new policy to the user, which gives access to the S3 bucket we created earlier, a bucket we’ll use for staging Athena result sets and permits various actions in Athena.
Upload and Prepare Data
For this walk through we are using UK road traffic incident data, one of many public datasets which are available from gov.uk. A quick line count before we start –
Using the AWS CLI, we push those files up to our S3 bucket, in a sub-folder called traffic-data:
Now that the files are uploaded, we switch over to the AWS Console and use the Athena management Window to set up our table. N.B. for this step I’m using a high-privileged AWS Console User, not the one we created earlier.
First take a quick look in settings and check where our query results will be stored:
Good, this matches what was set in the IAM policy (you can of course change these if you wish). For the next step, we go into The Catalog Manager Menu, create a new database and table, specifying the S3 location that contains our target data. We’ll name the database traffic_db and the table incident_history.
On the next step we specify the format as CSV
and then set the column names and types. We’ll go with the following:
Step 4 is for Configuring Partitions – as this is optional we’ll leave it for now and revisit this in a future blog. All being well, the table is created successfully and we can see the DDL:
CREATE EXTERNAL TABLE IF NOT EXISTS traffic_db.incident_history (
ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe’
WITH SERDEPROPERTIES (
‘serialization.format’ = ‘,’,
‘field.delim’ = ‘,’
) LOCATION ‘s3://ss-athena-demo/traffic-data/’;
We can inspect the table properties:
Finally, before we leave the console we’ll run a test query.
Great! The row count matches our earlier line count in unix.
Analysis and Visualisation
So we have our data staged as a table, what now? We can query it using standard tools like SQLWorkbench. First we download the driver – as per documentation at the time of writing, it’s available from https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.0.jar. Once downloaded, simply go into Manage Drivers and create a new one using the jar file:
Once this is set, we can create a new connection specifying this JDBC driver and using the access key and secret ID for the athena_demo_user we created earlier. There are a number of extended properties you can set here which are documented in the Athena User Guide. The essential one is the s3_staging_dir, which should be set to the bucket where the result sets are stored (specified earlier under Settings in the Athena Console).
Once connected, we can issue SQL statements as you would for any database session.
Of course, this leads quite nicely to QuickSight, Amazon’s BI / visual analysis tool which also sits in their Analytics offering. QuickSight is compatibile with Athena databases and tables after granting permission to access the required AWS resources – ensure you select the required S3 buckets.
Once configured, you can create a new connection, and start adding data sets:
At this stage you can optionally preview the data, and then select to have the dataset loaded into SPICE. By doing so this speeds up data manipulation operations, allowing you to return result sets quickly and create visualisations in seconds.
So in summary, Athena provides a quick route to making your data files available in a SQL-queryable format as well as exposing it to QuickSight (amongst other tools) for Visual Analysis. We hope you find this walk through useful, and we’ll be posting more about our experiences with these technologies over the coming weeks. For more information on how we can help you work with these technologies, please don’t hesitate to get in touch.