Large files within Queues & Streams
This is a short article about a couple of architectural techniques I have used in my career to pass large files through streams and queues. If you try this, you will quickly realise that most tools put a limit on record size. For example, the maximum record size for AWS Kinesis is 1MB (at the time of writing), and for AWS SQS it is even smaller at just 256KB. Pushing large payloads through can also get very expensive if the charging model of the tool you are using takes into account the number of bytes being put and read from it. It could even require you to have more shards/topics, which again has a cost impact. Finally, it may slow down the throughput of your system, with knock-on effects further down the chain.
Notifications
To get around these problems, you can put a ‘notification’ onto the stream, which points to the location of the larger file. For example, this could be a file within AWS S3, which is generally a very cheap storage option. The consuming processors can then ingest the notification and download the file, without the file itself ever having to be stored within the pipe.
{ itemId:'79786f4e-2685-497b-8696-ed4c38810350', key:'myfile.json'}
The problem with the above notification, however, is that it does not specify everything needed to get the file. Instead, it relies upon the consuming processor knowing which S3 bucket to pull the file from. This makes the consuming processor fragile, and if there is more than one consumer, we would have to update multiple components whenever the name of the bucket changed. Instead, we might want to have the following:
{ itemId:'79786f4e-2685-497b-8696-ed4c38810350', bucket:'my0-bucket', key:'myfile.json'}
There are additional values we might want to include within our notification message to allow the consuming application to filter out items it does not care about, without having to download a large file first. For example, if we were only interested in items that were red, we might have the following:
{ itemId:'79786f4e-2685-497b-8696-ed4c38810350', bucket:'my0-bucket', colour:'red', key:'myfile.json'}
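To make this concrete, below is a minimal producer sketch (TypeScript, AWS SDK for JavaScript v3) that uploads the large payload to S3 and then puts only the small notification onto a Kinesis stream. The bucket, stream and key names are illustrative assumptions rather than anything prescribed above.

import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';
import { KinesisClient, PutRecordCommand } from '@aws-sdk/client-kinesis';

const s3 = new S3Client({});
const kinesis = new KinesisClient({});

export async function publishLargeItem(itemId: string, colour: string, body: string): Promise<void> {
  const bucket = 'my0-bucket';   // illustrative bucket name from the examples above
  const key = `${itemId}.json`;  // store the large payload under the item id

  // 1. Put the large file into S3.
  await s3.send(new PutObjectCommand({ Bucket: bucket, Key: key, Body: body }));

  // 2. Put only the small notification onto the stream.
  const notification = { itemId, bucket, colour, key };
  await kinesis.send(new PutRecordCommand({
    StreamName: 'my-large-file-stream', // assumed stream name
    PartitionKey: itemId,
    Data: new TextEncoder().encode(JSON.stringify(notification)),
  }));
}

Using the item id as the partition key keeps all records for the same item on the same shard, which preserves ordering per item.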
Large File Queue
Within this example, we have used S3 as our file store, which gives us an added bonus. We can set up an event within the bucket to fire off a notification to an SQS queue. This notification will have everything needed to identify where the file has come from, including the key and bucket name. If we want more control over the notification content, however, we may still need to write some code to put an item onto the queue.
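If the standard S3 event payload is not enough, that extra code could look something like the sketch below, which puts a custom notification onto the queue; the queue URL is a made-up placeholder.

import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';

const sqs = new SQSClient({});

// Assumed queue URL; replace with the URL of your own large file queue.
const queueUrl = 'https://sqs.eu-west-1.amazonaws.com/123456789012/large-file-queue';

export async function queueNotification(itemId: string, bucket: string, key: string, colour: string): Promise<void> {
  await sqs.send(new SendMessageCommand({
    QueueUrl: queueUrl,
    MessageBody: JSON.stringify({ itemId, bucket, colour, key }),
  }));
}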
After the item has been removed from the queue, we still need to delete the file itself from the S3 bucket to tidy things up.
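A consuming processor then looks something like the following sketch: receive a notification, filter on its attributes, download the file if it is of interest, and tidy up both the S3 object and the queue message afterwards. The queue URL and the ‘red’ filter are assumptions carried over from the earlier examples.

import { SQSClient, ReceiveMessageCommand, DeleteMessageCommand } from '@aws-sdk/client-sqs';
import { S3Client, GetObjectCommand, DeleteObjectCommand } from '@aws-sdk/client-s3';

const sqs = new SQSClient({});
const s3 = new S3Client({});
const queueUrl = 'https://sqs.eu-west-1.amazonaws.com/123456789012/large-file-queue'; // assumed

export async function pollOnce(): Promise<void> {
  const { Messages } = await sqs.send(new ReceiveMessageCommand({
    QueueUrl: queueUrl,
    MaxNumberOfMessages: 1,
    WaitTimeSeconds: 20, // long polling
  }));

  for (const message of Messages ?? []) {
    const { itemId, bucket, colour, key } = JSON.parse(message.Body ?? '{}');

    // The notification alone is enough to filter on; only download files we care about.
    if (colour === 'red') {
      const object = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
      const file = await object.Body?.transformToString();
      console.log(`Processing item ${itemId} (${file?.length ?? 0} bytes)`);

      // Tidy up the large file once it has been processed.
      await s3.send(new DeleteObjectCommand({ Bucket: bucket, Key: key }));
    }

    // Remove the notification from the queue either way.
    await sqs.send(new DeleteMessageCommand({ QueueUrl: queueUrl, ReceiptHandle: message.ReceiptHandle }));
  }
}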
Large File Streams
One of the main differences between a queue and a stream is that a stream generally has more than one consumer reading from it. The large file will also need to exist for as long as the data retention period of the stream, rather than being deleted as soon as it has been read. For example, the default retention period for Kinesis is 24 hours.
If the first consuming processor were to delete the file from the bucket, then the second processor would not be able to find it. To solve this problem, we need a way to delete the file once the retention period is up. Within S3, this could be as simple as setting a lifecycle rule, which automatically deletes the file after a period of time. If it is important to clear out the files as soon as possible, then a component could be built to clean them up once an “I’ve read this” flag has been set by each consuming processor.
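As a sketch of the lifecycle approach, the rule below expires every object in the bucket after one day, which lines up with the default 24 hour Kinesis retention mentioned above; the bucket name and rule ID are assumptions.

import { S3Client, PutBucketLifecycleConfigurationCommand } from '@aws-sdk/client-s3';

const s3 = new S3Client({});

export async function applyExpiryRule(): Promise<void> {
  await s3.send(new PutBucketLifecycleConfigurationCommand({
    Bucket: 'my0-bucket', // assumed bucket name from the earlier examples
    LifecycleConfiguration: {
      Rules: [
        {
          ID: 'expire-large-file-payloads',
          Status: 'Enabled',
          Filter: { Prefix: '' },   // apply to every object in the bucket
          Expiration: { Days: 1 },  // matches the default 24 hour Kinesis retention
        },
      ],
    },
  }));
}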
Event Sourcing & Replaying Data
Although I have detailed some techniques for deleting files, many applications will not need them at all. For example, within event sourcing and CQRS patterns, you are probably always going to want to store the raw data that comes through. The idea being that you can easily replay data through the system and create new read models for future functionality. You might also find it useful to keep the files in case any of the consuming processors fail; data can then be replayed through the system once the consuming processors have been fixed. To find out more about these concepts, check out this video.
Please share with all your friends on social media and hit that clap button below to spread the word. 👏
If you liked this post, then please follow me and check out some of my other articles.
About
Matthew Bill is a passionate technical leader and agile enthusiast from the UK. He enjoys disrupting the status quo to bring about transformational change and technical excellence. With a strong technical background (Node.js/.NET), he solves complex problems by using innovative solutions and excels in implementing strong DevOps cultures.
He is an active member of the tech community, writing articles, presenting talks, contributing to open source and co-founding the Norwich Node User Group. If you would like him to speak at one of your conferences or write a piece for your publication, then please get in touch.
Find out more about Matthew and his projects at matthewbill.github.io
Thanks for reading!