Getty Images Brings High-Quality, Commercially Safe Dataset to Hugging Face

Community Article Published September 6, 2024

Andrea Gagliano, Head of AI/ML at Getty Images

Hey Hugging Face community! We are Getty Images, and we’re excited to partner with Hugging Face to share something we think you’ll love: a new sample dataset of our own wholly owned creative images and associated structured metadata, now available to AI/ML scientists right here on Hugging Face.

The Getty Images Sample Dataset includes 3,750 high-quality images from 15 categories, providing a wide range of visuals for various applications. If you’re building generative AI models or enhancing ML capabilities, and you want results that not only look good but are also built responsibly and safe for commercial use, this is for you.

For those who might not be familiar with Getty Images, or are scratching your heads wondering why you’ve found us on Hugging Face, know that we’re passionate about visual content, and we know many of you are too. By way of introduction, we are a leading global visual content creator and marketplace, and the first place people turn to discover, purchase and share powerful visual content from the world’s best photographers and videographers.

What you may not know about us is that we also think building AI/ML capabilities is as much about the data as it is about the algorithms. You can have the best model architecture, but if your data isn’t up to par, your outputs won’t be either.

That’s why we’ve curated a sample dataset that’s packed with high-quality images and rich metadata. Our data represents the cleanest, highest-quality open dataset of creative photos available, offering you:

  • Consistently high-quality images, free from low-resolution issues

  • Rich structured metadata that helps your models understand context better

  • A curated selection without excessive infographics and NSFW content

  • No unwanted celebrity images, trademarked brands, products or characters, or identifiable people or locations in your training data

  • Detailed information on usage rights, ensuring peace of mind

Building with Responsibility

We are also passionate about respecting the rights of creators and sustaining ongoing creation by obtaining consent from rights holders for AI training. As a result, this sample dataset is commercially safe, so you can focus on building and innovating without worrying about accidentally infringing on someone’s rights.

But what does ‘commercially safe’ really mean? To us, it means that our datasets are free from misappropriated training data. It means our dataset is clean and made up of licensed creative pre-shot visuals (not editorial). It means that the resulting outputs will not include trademarked brands, products or characters, or identifiable people or locations.

Plus, if you go on to license a full dataset from us, you will be contributing to a more sustainable ecosystem. Revenue from our training data licensing goes back to the creators, supporting the artists and photographers who made these images possible. It’s a way to innovate responsibly and ensure that everyone involved in the creative process benefits.

We’re not just dropping this sample dataset and disappearing; we want to be part of the conversation on the Hub. We’re here to collaborate, share insights, and see what incredible things the Hugging Face community will create with this data. Whether you’re refining an existing model or starting from scratch, we’re excited to see how you’ll push the boundaries.