Very frequently I resort to quick-and-dirty learning shortcuts. One reason is that it helps me avoid, whenever possible, reinventing the wheel; the second, more important reason is that these shortcuts give me a starting point for diving deep. Here is a write-up I created for a learning demo about Snowpipe.
Snowpipe is a very exciting feature of Snowflake that enables near-real-time data loading into a Snowflake table by reading incoming files from AWS S3 (and other similar cloud storage services). I have used AWS S3 in this write-up.
My quick demo will have files placed in an S3 bucket automatically picked up and loaded into a pre-defined table in Snowflake. The source files come from Kaggle; I have given some details about this dataset in a prior write-up about DynamoDB.
The table structure is pretty straightforward:
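A minimal sketch of what the table definition might look like. The actual columns depend on the Kaggle IMDB dataset; the names and types below are assumptions for illustration only.

```sql
-- Illustrative only: column names/types are assumed, not taken from the dataset.
CREATE OR REPLACE TABLE IMDB_MOVIES (
    TITLE        VARCHAR,
    RELEASE_YEAR NUMBER,
    GENRE        VARCHAR,
    RATING       FLOAT
);
```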
Next, we create the file format Snowflake will use to read the file. I have used a very common format suitable for CSV. I am assuming the file has a header row, so the first line is skipped; we will see the impact of this later on.
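A sketch of such a file format, assuming the format name `CSV_FORMAT`; note the `SKIP_HEADER = 1` option, which is the setting that matters later in the demo.

```sql
-- A common CSV format: comma-delimited, first line treated as a header and skipped.
CREATE OR REPLACE FILE FORMAT CSV_FORMAT
    TYPE = 'CSV'
    FIELD_DELIMITER = ','
    SKIP_HEADER = 1
    FIELD_OPTIONALLY_ENCLOSED_BY = '"';
```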
Let’s create a stage now. It points to a location in an AWS S3 bucket. I have used an AWS key ID/secret key pair that I created specifically for this demo.
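A sketch of the stage definition, assuming the stage name `IMDB_STAGE`; the bucket name is the one used later in this write-up, and the credentials are placeholders.

```sql
-- Credentials are placeholders; use a key pair scoped to this bucket only.
CREATE OR REPLACE STAGE IMDB_STAGE
    URL = 's3://imdb-data-bucket-demo-001/'
    CREDENTIALS = (AWS_KEY_ID = '<key_id>' AWS_SECRET_KEY = '<key_secret>');
```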
When I list the stage, I see one file that fits my demo, namely imdb_test.csv. There are other CSV files, but they play no role here. (On second thought, I should have removed them before taking the screenshot.)
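Listing a stage is a one-liner (assuming the stage name `IMDB_STAGE` from above):

```sql
LIST @IMDB_STAGE;
```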
I try reading the file directly from S3 to confirm the content is readable. (Take that, AWS Athena: Snowflake allowed querying S3 before you made it fashionable. :-D)
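A sketch of querying the staged file directly, assuming the stage and file format names used above. In a staged-file query, `$1`, `$2`, … refer to CSV columns by position.

```sql
-- Read the raw file from the stage without loading it into a table.
SELECT t.$1, t.$2, t.$3, t.$4
FROM @IMDB_STAGE/imdb_test.csv (FILE_FORMAT => CSV_FORMAT) t;
```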
Next, create a Snowpipe. I named it SNOWPIPE_IMDB and defined it with a COPY command that reads from the S3 stage into the table.
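A sketch of the pipe definition, assuming the table, stage, and file format names from earlier. `AUTO_INGEST = TRUE` is what allows the S3 event notification (set up below) to trigger the pipe.

```sql
CREATE OR REPLACE PIPE SNOWPIPE_IMDB
    AUTO_INGEST = TRUE  -- enables triggering via S3 event notifications
AS
COPY INTO IMDB_MOVIES
FROM @IMDB_STAGE
FILE_FORMAT = (FORMAT_NAME = CSV_FORMAT);
```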
Now that pipe is successfully created, the data should get loaded. Right?
Wrong. We have told Snowflake about the S3 bucket via the stage. However, the setup is incomplete, since we have not told S3 about Snowflake yet. The next step is to create an event notification on the S3 bucket so that the pipe runs whenever a new file arrives.
Let’s go to the AWS console and open the S3 page.
Next, click on the bucket we want to configure. For us, it is imdb-data-bucket-demo-001.
If in doubt, refer to the bucket name you specified when you created the stage. Then go to the Properties tab.
Keep scrolling down till you reach the Event notifications section.
It will most likely show no existing notifications. For me it shows two, because I have done some experiments for AWS Lambda practice.
Click the button on the right: Create event notification.
Once the event notification form is open, we start by filling in the General configuration. This includes a name for the event (I chose snowpipe_imdb, but you can choose anything you wish).
We also need to give a prefix, which is the name of the folder within the bucket where your files will arrive. If you are not using a folder, leave this field empty. Make sure you place a slash (‘/’) after the folder name.
Finally, a suffix is needed if you want to restrict processing to certain files. For instance, I want to process only files with the csv extension, so I enter ‘.csv’. Notice that I did not enter ‘*.csv’.
For the event type, choose All object create events. Depending on your needs, you can fine-tune the options; for our use case, this is sufficient.
Ok, we have not yet discussed how to get the SQS queue value. For that, we go back to the Snowflake GUI and run the command “show pipes”.
From the output, look for the entry for the pipe we just created and its corresponding notification_channel.
Copy the channel name IN FULL and paste it into the SQS queue field.
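The command itself:

```sql
SHOW PIPES;
-- The notification_channel column holds the SQS queue ARN
-- to paste into the S3 event notification form.
```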
Finally save changes.
Now, let me deviate a bit from the actual topic and talk about a possible error. If you get an error like the one below, it means you are trying to create a notification for an event that already has one defined. For instance, if you already have a notification for the same bucket, the same folder, and the same event type, AWS will not allow another. I spent some time trying to figure out what I was doing wrong when I first saw this, so maybe this will save you some time.
Now, we come back to the original topic. After a few seconds, I check the table to see if records got loaded.
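The check is a simple query against the target table (assuming the table name `IMDB_MOVIES` used earlier):

```sql
SELECT COUNT(*) FROM IMDB_MOVIES;
```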
I see nothing in the table. So what is going on?
Recall that the file was uploaded prior to the creation of the Snowpipe, and the event notification fires only on object creation. So let’s upload another file; I used a similar file with 21 records this time.
Here is how the contents look.
After uploading to the S3 bucket, the file is visible in the stage.
And it is readable too.
And this time, after a few seconds, the file got loaded into the Snowflake table, thanks to Snowpipe.
We see only 20 records, since the first row was dropped. The reason is our file format definition, where we declared the first line of the file to be a header. Just a lesson in how small details can have an impact much later.
I hope this demo and quick guide were useful.