Processing Parquets 101
Introduction
Apache Parquet is a powerful file format widely used in large-scale data analytics. Because it stores data column by column rather than row by row, it offers efficient compression and fast reads of individual columns. In this guide, we'll walk through processing Parquet files with Python: we'll filter a dataset, download the images associated with the data, and generate captions from product details. We'll be using a Gucci product dataset for this demonstration.
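As a quick taste of why the columnar layout matters: Pandas can read just the columns you ask for and skip the rest of the file entirely. A minimal sketch, assuming a local gucci.parquet like the one used below:

import pandas as pd

# Only the "name" column is read and decoded; the other columns in the
# file are never touched, which is cheap thanks to columnar storage.
names = pd.read_parquet("gucci.parquet", columns=["name"])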
Getting Started
Requirements
Before we start, you'll need to install the following Python packages:
pip install curl-cffi pandas pillow pillow-avif-plugin
- curl-cffi: A Python binding to libcurl with a requests-like API that can impersonate real browsers, which helps when servers reject plain HTTP clients (we rely on this when downloading images).
- pandas: A powerful data manipulation library.
- Pillow: The Python Imaging Library used for image processing.
- pillow-avif-plugin: A plugin to handle AVIF image files, which are increasingly popular due to their high compression rates.
Dataset
We'll be using the Gucci dataset, which contains product details such as images, names, categories, descriptions, and more. You can download the Parquet file from the following link:
Setting Up the Environment
Start by setting up the directories where you'll store the images:
import pathlib
base_path = "/your/base/path"
BASE = pathlib.Path(base_path)
IMAGES = BASE.joinpath("images")
IMAGES.mkdir(exist_ok=True, parents=True)
This code creates a base directory and an images subdirectory, where we'll save the product images and captions.
Loading the Dataset
Load the dataset into a Pandas DataFrame:
import pandas as pd
df = pd.read_parquet(BASE / "gucci.parquet")
You can preview the first few rows of the dataset with df.head() to understand its structure. The dataset contains columns such as product codes, images, product names, and more.
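Before filtering, it's worth confirming which columns and dtypes the file actually contains; the rest of this guide relies on the department, detailParts, primaryImage, and productCode columns:

# Inspect the columns and their dtypes.
print(df.columns)
print(df.dtypes)

# List the departments available for filtering.
print(df.department.unique())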
Filtering Data by Category
We want to focus on specific categories, such as men's and women's shoes. Here's how to filter the data:
mens_shoes = df[df.department == "MENS SHOES"]
womens_shoes = df[df.department == "WOMENS SHOES"]
shoes = pd.concat([mens_shoes, womens_shoes]).reset_index(drop=True)
This code selects every row in either the "MENS SHOES" or "WOMENS SHOES" department and combines them into a single DataFrame with a fresh index.
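Equivalently, you can make the same selection in a single step with isin:

# One-step version of the filter above.
shoes = df[df.department.isin(["MENS SHOES", "WOMENS SHOES"])].reset_index(drop=True)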
Generating Captions
Each product in the dataset includes detailed parts that describe its features. We'll create a caption for each product by combining its name with the details:
shoes['caption'] = shoes.apply(lambda row: ", ".join([row['name']] + [part for part in row.detailParts if part not in row['name']]), axis=1)
This applies a small function to each row, building the caption from the product name plus any detail parts not already contained in the name; skipping those avoids repeating text in the caption.
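To see what the substring check does, here's the same logic on a hypothetical row (the values are illustrative, not from the dataset):

# Hypothetical row for illustration only.
row = {"name": "Ace white leather sneaker",
       "detailParts": ["white leather", "rubber sole", "Ace"]}

# "white leather" and "Ace" already appear in the name, so only
# "rubber sole" is appended.
caption = ", ".join([row["name"]] + [p for p in row["detailParts"] if p not in row["name"]])
print(caption)  # Ace white leather sneaker, rubber sole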
Downloading Images
Next, we download the primary image associated with each product:
from curl_cffi import requests
import io
from PIL import Image
import pillow_avif  # importing this registers AVIF support with Pillow

def download(url: str, productCode: str):
    # Every image is re-encoded as JPEG below, so always save as .jpg,
    # whatever format the CDN serves (AVIF, PNG, ...).
    file_path = IMAGES.joinpath(f"{productCode}.jpg")
    # Skip files already downloaded on a previous run.
    if file_path.exists():
        print(f"{productCode}: {file_path}")
        return str(file_path)
    try:
        # impersonate="chrome" makes the request look like a real browser,
        # which avoids blocks that plain HTTP clients often run into.
        r = requests.get(url, timeout=15, impersonate="chrome")
    except (requests.errors.RequestsError, requests.errors.CurlError):
        print(f"{productCode}: request error")
        return None
    if not r.ok:
        print(f"{productCode}: {r.status_code}")
        return r.status_code
    # Convert to RGB first so images with an alpha channel survive the JPEG save.
    Image.open(io.BytesIO(r.content)).convert("RGB").save(file_path, "JPEG")
    print(f"{productCode}: {file_path}")
    return str(file_path)
shoes['file'] = shoes.apply(lambda row: download(row['primaryImage'], row['productCode']), axis=1)
The download function fetches the image from the provided URL and saves it locally. On success it returns the file path, which is stored in a new column called file.
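Row-by-row apply downloads the images one at a time. For a large catalogue you might parallelize with a thread pool instead; here's a minimal sketch using the standard library (the worker count of 8 is an arbitrary choice, and download is the function defined above):

from concurrent.futures import ThreadPoolExecutor

# Fetch up to 8 images concurrently. pool.map preserves input order,
# so the resulting paths line up with the DataFrame's rows.
with ThreadPoolExecutor(max_workers=8) as pool:
    shoes['file'] = list(pool.map(download, shoes['primaryImage'], shoes['productCode']))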
Writing Captions to Files
Finally, let's save the generated captions to text files:
def write_caption(caption: str, productCode: str):
    # One .txt file per product, named after the product code so it sits
    # next to the matching image.
    file_path = IMAGES.joinpath(f"{productCode}.txt")
    file_path.write_text(caption, encoding="utf-8")
    return caption

shoes['caption'] = shoes.apply(lambda row: write_caption(row['caption'], row['productCode']), axis=1)
This function writes each caption to a text file named after the product's code.
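Optionally, you can persist the processed DataFrame back to Parquet so later runs don't have to repeat the filtering and captioning (the output file name here is just a suggestion):

# Write the filtered, captioned dataset back out as a new Parquet file.
shoes.to_parquet(BASE / "gucci_shoes.parquet")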
Conclusion
Processing Parquet files is straightforward with the right tools. Pandas handles the filtering and manipulation, and an imaging library like Pillow handles the media, so you can extract images, captions, and other resources from a large dataset with very little code. This guide covered a basic pipeline, but the same principles apply to more complex datasets and projects.