This project came about while helping a friend with a new podcast he helps run. Rather than paying upfront for a podcast-hosting SaaS, he wanted to host it himself and figure out whether the show was popular enough to justify that additional cost. I helped him set it up, put together the RSS feed, and host it all on AWS S3 the same way you would any other static website. We turned on logging to get an idea of the number of requests coming in for the RSS feed and for each episode, and we turned on lifecycle management on the bucket to delete the log files after x days (for us, x = 10).
The AWS S3 logs are interesting. It takes a couple of hours for them to start showing up once logging is turned on, but then you get a new file roughly every four minutes, and each file can hold a single entry or a handful of requests. It makes sense when you think about what's happening behind the scenes: S3 sits behind endpoints and load balancers that spread requests out to keep it highly available and scalable. A particular user's request hits a certain machine in the pool and is logged there; periodically a script or tool parses the requests across all those instances and drops them into the accounts and buckets that have logging turned on. But that's not immediately helpful: for a single day I can have a thousand log files. Therefore, I wrote a Python script, available on GitHub, that goes through the bucket, downloads any new files it doesn't have locally, and combines them into a single log file per day. I wrote it in a slightly different way so that I can run it periodically in Jenkins against multiple accounts and buckets without changing the code or managing configuration files; I did this with simple environment variables.
First off, here’s the GitHub repo: https://github.com/kmkingsbury/python-s3-logs
The code should be pretty straightforward. A README covers the assumptions and prerequisites: a few environment variables for the AWS credentials, bucket name, and prefix, plus two directories for the log files to go into.
Environment variables are super easy to pull in Python, and I wrote the AWS connection two ways: it looks for the environment variables and uses those, otherwise it falls back to the .aws/credentials file. Those are just two common ways of connecting; there are more that may fit different requirements.
import os
import boto3

prefix = os.environ['LOG_PREFIX']
bucket = os.environ['MY_BUCKET']

# Create a client
client = None
if 'AWS_ACCESS_KEY_ID' in os.environ:
    # pull from env vars
    client = boto3.client('s3', region_name='us-east-1',
                          aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
                          aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'])
else:
    # pull from ~/.aws/credentials file
    client = boto3.client('s3', region_name='us-east-1')
As a future development I'd like to use HashiCorp's Vault to inject these credentials, adding an additional layer of security so that those critical AWS credentials are not visible in the Jenkins job and can even be rotated periodically.
A key step in listing the files is using a paginator; otherwise the call only returns 1,000 results, which won't be the complete set. We then iterate through the pages, and through each object in a page, pulling the key name and downloading the file if we don't already have a copy.
# Create a reusable Paginator
paginator = client.get_paginator('list_objects')

# Create a PageIterator from the Paginator
page_iterator = paginator.paginate(Bucket=bucket)

# Download all files to src, skip if we have them already
for page in page_iterator:
    for file in page['Contents']:
        print("key: " + str(file['Key']))
        local_path = os.path.join('src', file['Key'])
        if not os.path.exists(local_path):
            try:
                client.download_file(bucket, file['Key'], local_path)
            except Exception as e:
                print("download failed: " + str(e))
Each log file is downloaded to the src directory, and the download is wrapped in a try/except block just in case something goes wrong.
We then build a string of today's date so we don't process today's logs, since more would still be coming in, and we build a list of the combined log files we've already created so that, once a day has been processed, we don't keep reprocessing it. As we iterate through the files we'll create new combined log files, but those won't be in the list, so the individual logs for an unfinished day can continue to be processed and added to that day's combined file.
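The skip logic can be sketched roughly like this (the directory names, prefix, and function name here are illustrative, not taken verbatim from the repo; it assumes the standard S3 access log key format of prefix followed by YYYY-MM-DD-HH-MM-SS and a unique string):

```python
import datetime
import glob
import os

# Build today's date string so today's (still-growing) logs are skipped.
today = datetime.date.today().strftime('%Y-%m-%d')

# Collect the combined log files we've already produced, so a finished
# day is never reprocessed.
processed = set(os.path.basename(p) for p in glob.glob('combined/*.log'))

def should_process(log_name):
    """Decide whether an individual S3 log file should be processed."""
    # Keys look like 'LogPrefix_2024-05-01-12-00-00-ABCDEF',
    # so the date is the chunk right after the prefix.
    date_part = log_name.replace('LogPrefix_', '')[:10]  # 'YYYY-MM-DD'
    if date_part == today:
        return False  # today's logs are still arriving
    if (date_part + '.log') in processed:
        return False  # that day is already combined
    return True
```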
For each new log we process, we simply cat it onto the combined log file.
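In Python, that cat step amounts to appending one file's bytes onto another. A minimal sketch (the function name is mine, not from the repo):

```python
# Append one downloaded log file onto the day's combined log file,
# the Python equivalent of `cat src/logfile >> combined/day.log`.
def append_log(src_path, combined_path):
    with open(src_path, 'rb') as src, open(combined_path, 'ab') as dest:
        dest.write(src.read())
```

Opening the destination in append mode ('ab') creates the combined file on the first log of the day and tacks later logs onto the end.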
As I mentioned, I run this in a Jenkins job: it pulls from the Git repo and runs once a day in the early morning to process the previous day's worth of logs. The heart of the job looks like this:
# from CLI: sudo apt-get install python3-venv
# python3 -m venv aws
. aws/bin/activate
# pip install boto3
export AWS_ACCESS_KEY_ID=Axxx
export AWS_SECRET_ACCESS_KEY=yyyyyy
export AWS_DEFAULT_REGION=us-east-1
export LOG_PREFIX=LogPrefix_
export MY_BUCKET=mylogbucket
python3 grab-process-s3-logs.py
I used a Python virtual environment to keep the modules and dependencies tidy. The initial run creates the environment and installs the boto3 module for talking to AWS; after that I comment those steps out so later runs don't redo them.
Overall, I'm happy with the results: I get a combined log file that I can search and grep to pull some basic counts and statistics. This is a starting point, and there may be later developments to start binning the podcast files and getting downloads per day per episode, but that'll require some more scripts. I might also add a post-build action in Jenkins to automatically email the new log file to my friend each day.
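As an example of the kind of basic counting those later scripts might do, here's a small sketch that tallies requests per object key from a combined log. It assumes the standard S3 server access log layout, where (after the bracketed timestamp splits in two) the object key is the ninth whitespace-aware field; the function name is illustrative:

```python
from collections import Counter
import shlex

def downloads_per_key(combined_log_path):
    """Count requests per object key in a combined S3 access log."""
    counts = Counter()
    with open(combined_log_path) as f:
        for line in f:
            # shlex handles the quoted Request-URI and User-Agent fields.
            fields = shlex.split(line)
            if len(fields) > 8:
                counts[fields[8]] += 1  # object key field
    return counts
```

Running this over a day's combined file would give a rough downloads-per-episode tally to hand over each morning.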