Another short “today I learned” post from the analytics mines. If you have written any kind of data-munging or analytics job before, you have almost certainly encountered Python, Pandas, and AWS S3 in some combination.

These jobs usually follow the same structure:

  1. download the files from S3.
  2. deserialize them into Python objects & create Pandas dataframes.
  3. perform calculations over these dataframes.

Normally #1 and #2 are repetitive boilerplate that every job reimplements, but there is a better way.

Pandas allows other Python packages to extend the set of URL schemes it understands beyond the common http:// and file:// defaults. The s3fs package implements this API, and once it is installed (pip install s3fs) Pandas understands URLs that begin with s3://.

This means that steps #1 and #2 can be collapsed into one.

before

import json

import boto3
import pandas as pd

# download the object from S3 and parse its JSON body
response = boto3.client('s3').get_object(Bucket="$YOUR_BUCKET_NAME", Key="$YOUR_KEY")
json_contents = json.loads(response['Body'].read().decode('utf-8'))

# create a Pandas DataFrame
df = pd.DataFrame(json_contents)

after

import pandas as pd

# read the JSON file from S3 and parse its contents into a dataframe
df = pd.read_json("s3://$YOUR_BUCKET_NAME/$YOUR_KEY")
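
The s3:// scheme is not specific to read_json: any Pandas reader or writer that accepts a path goes through the same machinery, and credentials do not have to live in the environment, since Pandas forwards a storage_options dict straight through to s3fs. A minimal sketch, reusing the post's placeholder names plus hypothetical $YOUR_CSV_KEY, $ACCESS_KEY_ID, and $SECRET_ACCESS_KEY placeholders, assuming a reasonably recent Pandas (1.2+) with s3fs installed:

import pandas as pd

# the same scheme works for the other readers too, e.g. CSV
df_csv = pd.read_csv("s3://$YOUR_BUCKET_NAME/$YOUR_CSV_KEY")

# by default s3fs resolves credentials the same way boto3 does
# (environment variables, ~/.aws/credentials, instance roles);
# to pass them explicitly, hand Pandas a storage_options dict
# and it is forwarded to s3fs untouched
df = pd.read_json(
    "s3://$YOUR_BUCKET_NAME/$YOUR_KEY",
    storage_options={"key": "$ACCESS_KEY_ID", "secret": "$SECRET_ACCESS_KEY"},
)

Writing works the same way: df.to_csv("s3://$YOUR_BUCKET_NAME/$YOUR_KEY") uploads the result without any boto3 ceremony.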