Processing Parquets 101
Introduction
Apache Parquet is a powerful file format widely used in large-scale data analytics. Because it stores data column by column rather than row by row, it offers efficient compression and fast reads of individual columns. In this guide, we'll walk through processing Parquet files with Python: we'll filter a dataset, download the images associated with the data, and generate captions from product details. We'll be using a Gucci product dataset for this demonstration.
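As a quick taste of why the columnar layout matters: Pandas can read just the columns you ask for and skip the rest of the file entirely. A minimal sketch, assuming a local gucci.parquet like the one used below:

import pandas as pd

# Only the "name" column is read and decoded; the other columns in the
# file are never touched, which is cheap thanks to columnar storage.
names = pd.read_parquet("gucci.parquet", columns=["name"])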
Getting Started
Requirements
Before we start, you'll need to install the following Python packages:
pip install curl-cffi pandas pillow pillow-avif-plugin
- curl-cffi: A Python binding to libcurl with a requests-like API that can impersonate real browsers, which helps when servers reject plain HTTP clients (we rely on this when downloading images).
- pandas: A powerful data manipulation library.
- Pillow: The Python Imaging Library used for image processing.
- pillow-avif-plugin: A plugin to handle AVIF image files, which are increasingly popular due to their high compression rates.
Dataset
We'll be using the Gucci dataset, which contains product details such as images, names, categories, descriptions, and more. You can download the Parquet file from the following link:
Setting Up the Environment
Start by setting up the directories where you'll store the images:
import pathlib
base_path = "/your/base/path"
BASE = pathlib.Path(base_path)
IMAGES = BASE.joinpath("images")
IMAGES.mkdir(exist_ok=True, parents=True)
This code creates a base directory and an images subdirectory, where we'll save the product images and captions.
Loading the Dataset
Load the dataset into a Pandas DataFrame:
import pandas as pd
df = pd.read_parquet(BASE / "gucci.parquet")
You can preview the first few rows of the dataset with df.head() to understand its structure. The dataset contains columns such as product codes, images, product names, and more.
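Before filtering, it's worth confirming which columns and dtypes the file actually contains; the rest of this guide relies on the department, detailParts, primaryImage, and productCode columns:

# Inspect the columns and their dtypes.
print(df.columns)
print(df.dtypes)

# List the departments available for filtering.
print(df.department.unique())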
Filtering Data by Category
We want to focus on specific categories, such as men's and women's shoes. Here's how to filter the data:
mens_shoes = df[df.department == "MENS SHOES"]
womens_shoes = df[df.department == "WOMENS SHOES"]
shoes = pd.concat([mens_shoes, womens_shoes]).reset_index(drop=True)
This code selects every row in either the "MENS SHOES" or "WOMENS SHOES" department and combines them into a single DataFrame with a fresh index.
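Equivalently, you can make the same selection in a single step with isin:

# One-step version of the filter above.
shoes = df[df.department.isin(["MENS SHOES", "WOMENS SHOES"])].reset_index(drop=True)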
Generating Captions
Each product in the dataset includes detailed parts that describe its features. We'll create a caption for each product by combining its name with the details:
shoes['caption'] = shoes.apply(lambda row: ", ".join([row['name']] + [part for part in row.detailParts if part not in row['name']]), axis=1)
This applies a small function to each row, building the caption from the product name plus any detail parts not already contained in the name; skipping those avoids repeating text in the caption.
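To see what the substring check does, here's the same logic on a hypothetical row (the values are illustrative, not from the dataset):

# Hypothetical row for illustration only.
row = {"name": "Ace white leather sneaker",
       "detailParts": ["white leather", "rubber sole", "Ace"]}

# "white leather" and "Ace" already appear in the name, so only
# "rubber sole" is appended.
caption = ", ".join([row["name"]] + [p for p in row["detailParts"] if p not in row["name"]])
print(caption)  # Ace white leather sneaker, rubber sole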
Downloading Images
Next, we download the primary image associated with each product:
from curl_cffi import requests
import io
from PIL import Image
import pillow_avif  # importing this registers AVIF support with Pillow

def download(url: str, productCode: str):
    # Every image is re-encoded as JPEG below, so always save as .jpg,
    # whatever format the CDN serves (AVIF, PNG, ...).
    file_path = IMAGES.joinpath(f"{productCode}.jpg")
    # Skip files already downloaded on a previous run.
    if file_path.exists():
        print(f"{productCode}: {file_path}")
        return str(file_path)
    try:
        # impersonate="chrome" makes the request look like a real browser,
        # which avoids blocks that plain HTTP clients often run into.
        r = requests.get(url, timeout=15, impersonate="chrome")
    except (requests.errors.RequestsError, requests.errors.CurlError):
        print(f"{productCode}: request error")
        return None
    if not r.ok:
        print(f"{productCode}: {r.status_code}")
        return r.status_code
    # Convert to RGB first so images with an alpha channel survive the JPEG save.
    Image.open(io.BytesIO(r.content)).convert("RGB").save(file_path, "JPEG")
    print(f"{productCode}: {file_path}")
    return str(file_path)
shoes['file'] = shoes.apply(lambda row: download(row['primaryImage'], row['productCode']), axis=1)
The download function fetches the image from the provided URL and saves it locally. On success it returns the file path, which is stored in a new column called file.
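Row-by-row apply downloads the images one at a time. For a large catalogue you might parallelize with a thread pool instead; here's a minimal sketch using the standard library (the worker count of 8 is an arbitrary choice, and download is the function defined above):

from concurrent.futures import ThreadPoolExecutor

# Fetch up to 8 images concurrently. pool.map preserves input order,
# so the resulting paths line up with the DataFrame's rows.
with ThreadPoolExecutor(max_workers=8) as pool:
    shoes['file'] = list(pool.map(download, shoes['primaryImage'], shoes['productCode']))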
Writing Captions to Files
Finally, let's save the generated captions to text files:
def write_caption(caption: str, productCode: str):
    # One .txt file per product, named after the product code so it sits
    # next to the matching image.
    file_path = IMAGES.joinpath(f"{productCode}.txt")
    file_path.write_text(caption, encoding="utf-8")
    return caption

shoes['caption'] = shoes.apply(lambda row: write_caption(row['caption'], row['productCode']), axis=1)
This function writes each caption to a text file named after the product's code.
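Optionally, you can persist the processed DataFrame back to Parquet so later runs don't have to repeat the filtering and captioning (the output file name here is just a suggestion):

# Write the filtered, captioned dataset back out as a new Parquet file.
shoes.to_parquet(BASE / "gucci_shoes.parquet")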
Conclusion
Processing Parquet files is straightforward with the right tools. Pandas handles the filtering and manipulation, and an imaging library like Pillow handles the media, so you can extract images, captions, and other resources from a large dataset with very little code. This guide covered a basic pipeline, but the same principles apply to more complex datasets and projects.