Integrating S3 with SQL Server
Author: Tom Carpentier
Imagine this scenario: you are an average Joe SQL Server or ETL developer, and your customer wants you to use this amazing new technology called a “data lake”. It lives in AWS, it is amazing, and you absolutely need to use it to store the data... What do you do?
You are correct in thinking that you just need to query the data from SQL Server through a linked server object. How, you ask? That is exactly what I will explain in this tutorial.
There are a couple of things that you will need. These are listed below:
- Structured or semi-structured data in S3 (Parquet/JSON/… files)
- An AWS Glue Crawler
- Amazon Athena
- The Simba Athena ODBC driver, installed on your SQL Server machine
- A linked server object in SQL Server
(Note: this is just one way of getting to your stuff, I'm sure there are many others, but this one is fairly simple.)
The first thing you need is data lake (S3) data. A data lake can store many types of data, so to make this work you will need structured or semi-structured data. File types that can be used include Parquet, JSON, XML, CSV and more. For this tutorial, I will use a Parquet export of some AdventureWorks data.
The second thing you need to do is to create a Glue Crawler.
A Glue Crawler can turn your data into something everyone understands: a table. Point the Glue Crawler at your S3 data lake using the correct path (i.e. the folder where you have stored your data).
Tell the crawler where it needs to store the result. If necessary, click the “Add database” button, and give it a prefix for the tables that are about to be generated.
In addition, for the purposes of this tutorial, we do not need it to run automatically: you only need to run a crawler again when the data structure changes, not when the data itself changes. Continue through the rest of the wizard until you are done. You will return to an overview screen containing all of your crawlers.
Click the checkmark in front of your new crawler, and let it run! This might take a minute, so go get a coffee! I will still be here when you get back.
Next, go to Athena. You should be able to see the table(s) in the database where you told the crawler to store them.
This was generated by the Glue Crawler, is that not amazing? This table automatically contains the data from all files and subfolders in the folder you assigned your crawler to check. If new files are added in there, your table will also get their data.
Now, I say "contains the data", but that is an oversimplification. This table does not contain any data. It has the metadata needed to query S3. So, basically, this is just a semantic layer, nothing more.
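You can sanity-check the new table straight from the Athena query editor. The database and table names below are illustrative placeholders; use the ones your crawler generated:

```sql
-- Quick check that the Glue table is queryable; Athena reads the
-- underlying S3 files on the fly and returns them as rows.
SELECT *
FROM adventureworks.salesorderheader
LIMIT 10;
```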
Now, we all know that linked servers have a few distinct disadvantages… To limit those as much as possible, we need a fourth step: writing a view in Athena that already performs as much of the column selection and filtering as possible. With Parquet files especially, the column selection will help your performance tremendously, because Parquet stores data in a columnar format. Luckily, Athena understands SQL. Unfortunately, we cannot use T-SQL, but for our purpose normal SQL will do...
Write a query that creates a view containing only the columns and records that you need. Because of this step, loading time will lessen significantly!
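As a sketch, such a view could look like the following in the Athena query editor. The database, table, and column names are illustrative placeholders:

```sql
-- Keep only the columns and rows the downstream ETL actually needs.
CREATE OR REPLACE VIEW adventureworks.vw_recent_orders AS
SELECT salesorderid,
       orderdate,
       customerid,
       totaldue
FROM   adventureworks.salesorderheader
WHERE  orderdate >= DATE '2014-01-01';
```

Because Parquet is columnar, listing only the columns you need means Athena scans (and bills you for) far less data.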
You are almost ready to get the data from the AWS cloud into your local server. However, we need to make sure the server can actually reach AWS (firewall rules). You will also need to install a driver on your SQL Server machine. Installing this driver is our fifth step. You can download it via this link: https://docs.aws.amazon.com/athena/latest/ug/connect-with-odbc.html
The driver, once installed, needs a couple of things configured.
(Note: all screenshots are for the 64-bit version of the driver.)
First, RDP into your server, and open your ODBC data source administrator. Mine looks like this:
Click the "Add" button to create a new data source. Select the "Simba Athena ODBC Driver", and click "Finish"
This is where it gets interesting because we need to configure the setup. To make things a little bit easier, I have listed the necessary actions below.
For data source name, you can fill in practically anything. Make it something you can remember like "Athena_AWS".
Description: again, whatever you want, something like "AdventureWorks on Athena" will work perfectly fine.
AWS region: this is the first really important one! Make sure to set this to the same region where your Athena tables/views are stored!
Schema: not important (for our purposes)
Workgroup: not important (for our purposes)
S3 output location: this one is tricky. For every query you send it, Athena creates an "output" file with the results of the query. The default location contains a bunch of numbers (because it is generated and has to be globally unique). Make sure that the location you input here is accessible by your user.
For the authentication options, it is easiest to select IAM Credentials. You can also use an IAM Profile, but for the purposes of this post we will use credentials. For User and Password, input an AWS access key and an AWS secret access key respectively. This is the user that needs access to the S3 output location. Unsurprisingly, this user also needs read permissions on the AWS database you want to query.
Now that we have sorted that out, click "OK" and go into "Advanced Options". Depending on your network settings, the bottom checkbox ("Use resultset streaming") may need to be unchecked. This can be an annoying issue to encounter if you do not know it exists.
Finally, click the "Test…" button and you will be greeted by this cheerful window:
We are almost at the final step! Now we just need to connect from SQL Server to our Athena database! All you need here is a new Linked Server Object.
As a name, choose something easy, and remember your choice.
For Server type, select "other data source". Then, select the "Microsoft OLE DB Provider for ODBC Drivers" provider, and input the same name you gave your Athena connection in the "Data Source:" textbox.
In the security tab, check this option:
There are no other settings required!
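If you prefer scripting over clicking, the same linked server can be created in T-SQL. This is a sketch: the linked server and data source names are the example names used above, and it assumes the ODBC DSN itself stores the IAM credentials (i.e. connections are made without a mapped security context):

```sql
-- MSDASQL = the Microsoft OLE DB Provider for ODBC Drivers.
-- @datasrc must match the ODBC data source name configured earlier.
EXEC master.dbo.sp_addlinkedserver
     @server     = N'ATHENA_AWS',
     @srvproduct = N'',
     @provider   = N'MSDASQL',
     @datasrc    = N'Athena_AWS';

-- No mapped login: the DSN holds the AWS credentials.
EXEC master.dbo.sp_addlinkedsrvlogin
     @rmtsrvname  = N'ATHENA_AWS',
     @useself     = N'False',
     @locallogin  = NULL,
     @rmtuser     = NULL,
     @rmtpassword = NULL;
```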
Check under "Server Objects" > "Linked Servers" whether you can open the ATHENA_AWS linked server. There will be only one catalog, "AwsDataCatalog". This is normal: an Athena "database" is seen in SQL Server as a schema. If you click on the schema, you will be able to see your tables.
You are now ready for the final step: to query that Athena table!
For querying the Athena data, I strongly suggest using the OPENQUERY syntax. It may depend on which servers are used, but this gave me FAR superior performance compared to using four-part names.
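For example (the linked server, database, and view names here are the illustrative ones used earlier in this tutorial):

```sql
-- OPENQUERY sends the inner query to Athena as-is, so the filtering
-- happens on the AWS side and only the results travel back.
SELECT *
FROM OPENQUERY(ATHENA_AWS,
     'SELECT salesorderid, orderdate, totaldue
      FROM adventureworks.vw_recent_orders');

-- The four-part name equivalent also works, but was much slower for me:
-- SELECT * FROM ATHENA_AWS.AwsDataCatalog.adventureworks.vw_recent_orders;
```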
(If you're not getting any data, go back to your Simba ODBC driver settings, and uncheck that "resultset streaming" checkbox!)
That’s it! You are now able to fetch data from S3 and integrate it into your ETL according to your own preferences. You can use stored procedures, even source queries... The sky is the limit!
This process does not need to be repeated any time you add a table or view in Athena. The only part that needs to be repeated is the Glue Crawler part (and even then, only for new tables).
We are experts in our fields but we also like to learn!
If you have any suggestions for future tutorials or questions about this specific tutorial, please contact us! We welcome feedback!