In an age where companies are heavily investing in data to gain insights into their business, more and more are building data lakes to store unstructured data for analysis. This data serves an audience of diverse interests, from data scientists using it to power ML models or generate benchmarks, to business analysts viewing metrics on company and project performance. At Flow, we have adopted AWS Athena and S3 for storing and querying the data in our data lake.
This blog post shows how simple it is for companies on AWS to hook up a data lake and start gaining insights from their data. It assumes you are already on AWS, have some data in a data lake stored correctly (we'll get to that later), and have a data analytics/BI platform like Domo, Tableau, or Looker set up. We use Domo at Flow, but any BI tool that supports the JDBC driver for Athena should work.
First things first, let's talk about the data lake. I won't go into the hows and whys of data lakes in general, so let's start with a simple example: you have been capturing JSON, CSV, Avro, ORC, or Parquet files and storing them on S3. For more background on what a data lake is, check out the AWS write-up on the topic.
Depending on the volume of the data in its raw format, you may want to refine it further by processing it or building aggregations, but for the purpose of this example, we'll assume raw is fine. One reason raw data may not be suitable for Athena is when it's captured as many small files on S3. This can cause a lot of throttling of S3 requests and poor Athena query times. A general rule of thumb is to ensure data files are compressed and aim for a file size of around 128 MB.
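To make the 128 MB rule concrete, here's a small sketch (names are mine, not from the original) that works out how many files to compact a dataset into; you'd feed the result to something like Spark's df.coalesce(n) before writing back to S3:

```python
def target_file_count(total_bytes: int, target_mb: int = 128) -> int:
    """Estimate how many output files a dataset should be compacted into,
    aiming for roughly `target_mb` per file. Always returns at least 1."""
    target_bytes = target_mb * 1024 * 1024
    # Ceiling division so no file ends up much larger than the target.
    return max(1, -(-total_bytes // target_bytes))
```

For a 1 GiB dataset this suggests 8 files; in practice you'd measure the compressed size on S3, since Athena scans the compressed bytes.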
So let's look at the schema for cool_data_source. Opening it up in Spark and investigating (df.printSchema() is fine at this stage), you are left with the following schema:
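The original schema isn't reproduced here, so the output below is purely illustrative (the field names are assumptions); a df.printSchema() for a date-partitioned event source might look like:

```
root
 |-- id: string (nullable = true)
 |-- event_type: string (nullable = true)
 |-- created_at: timestamp (nullable = true)
 |-- payload: string (nullable = true)
 |-- dt: string (nullable = true)    <- partition column
```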
By leveraging the AWS stack, we can use the AWS Glue Data Catalog (the metastore behind Athena) to build out a table that represents this data as a Hive-compatible external table. Here's the example DDL for the data source. We'll use the default database for this example, but you can easily create your own.
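The original DDL isn't shown, so here's a representative sketch; the column names, types, and S3 location are placeholders, not the real cool_data_source definition:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS default.cool_data_source (
  id         string,
  event_type string,
  created_at timestamp,
  payload    string
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://my-data-lake/cool_data_source/';
```

Note that the partition column (dt) lives in the PARTITIONED BY clause, not the column list, and the table points at the S3 prefix rather than any single file.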
You can hand-craft these DDLs, but it's also pretty easy to build them out dynamically via Spark.
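As a rough sketch of the dynamic approach (plain Python here so it stays self-contained; in practice you'd derive the column name/type pairs from a Spark DataFrame's df.schema):

```python
def build_athena_ddl(table, location, columns, partitions):
    """Render a CREATE EXTERNAL TABLE statement for Athena/Glue.

    `columns` and `partitions` are dicts of column name -> Hive type,
    e.g. {"id": "string", "created_at": "timestamp"}.
    """
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns.items())
    parts = ", ".join(f"{name} {dtype}" for name, dtype in partitions.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}';"
    )
```

You can then submit the rendered statement through Athena however you normally run queries.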
Since the data is partitioned, we need to tell the metastore to load any partitions it hasn't seen before. This is pretty easy to do using Hive's MSCK REPAIR TABLE command, but if your table has a lot of partitions, it's sometimes better to add the specific partitions instead (bonus points if you are using AWS Glue crawlers or have something like Airflow doing this automatically):
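Both approaches are sketched below; the table name, partition column, and S3 path are the placeholders from earlier, so swap in your own:

```sql
-- Scan the table's LOCATION and register every partition found
-- (can be slow on tables with many partitions):
MSCK REPAIR TABLE default.cool_data_source;

-- Or register just the partitions you know are new (much faster):
ALTER TABLE default.cool_data_source ADD IF NOT EXISTS
  PARTITION (dt = '2019-01-01')
  LOCATION 's3://my-data-lake/cool_data_source/dt=2019-01-01/';
```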
So far, we have data in a data lake that's loaded into Athena and ready for querying. Athena, like Presto, is accessible via a JDBC connection, which means we can hook it up to most BI tools (personally I've used Domo, Tableau, Looker, DBeaver, and Superset, and found them all easy enough to hook up).
Bonus section - Connecting Athena to Domo
At Flow, we've adopted Domo as our BI tool, and connecting Domo to Athena is quite simple by leveraging the Athena Data Connector. Note that we will need to set up an AWS access key and secret access key: Domo accesses Athena over the internet (via HTTPS), and we need a way to authenticate its queries. We'll save these keys in Domo so that users won't have to enter them each time they access data from the connection.
The first step in the connector setup is to add your AWS credentials, ensuring that your staging directory is specified in the correct region (this is where Domo will look for the results of the Athena queries it runs).
The next step is to specify the data catalog, database, and the query used to pull the information from Athena. If you specified the correct permissions in the credentials stage, Domo will list the available data catalogs; select the one we created the cool_data_source table in. Then specify the table and the columns you want to pull into Domo. Domo has a nice little query helper guide, but you can provide your own query here. My recommendation is to use a plain SELECT with no manipulation, as introducing transforms at this point can be tricky to debug.
The penultimate step is to schedule the query. Domo allows you to customize the schedule and what it does with the result set. Depending on the data you have in Athena, you may want to replace the entire source every time (a full refresh), or just pull the latest results of the query and let the data build up in Domo each run.
The final step is to give the data source a name and description. These steps are often rushed and not given much thought, but they're a great opportunity to give users of your new data source some context. Once you've specified a name and description, you are all set to start pulling data from your data lake into Domo and generating insights.