movie-genre-prediction

Possible issue with the data

by pgabrys - opened Jun 13, 2023

Jun 13, 2023

I believe there is a problem with the data labels. Records with the same title and synopsis can carry different labels in the dataset. That would be fine if such records appeared only in the train set or only in the test set, but they do not. For example, "Sironia" appears only in the train set, with two different labels. "Iron Man" has the adventure genre in the train set. Does that mean it is expected to have any genre except adventure in the test set? There are 8540 (out of 70291) movies with the same title and synopsis that appear in both the train and test sets.
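A check like the one described above could be sketched as follows. This is only an illustration: the field names `title`, `synopsis`, `genre`, and `id`, the helper names, and the toy rows are all assumptions made for the example, not the actual competition files.

```python
from collections import defaultdict

def conflicting_labels(rows):
    """Return (title, synopsis) keys that occur with more than one genre."""
    genres = defaultdict(set)
    for row in rows:
        genres[(row["title"], row["synopsis"])].add(row["genre"])
    return {key: sorted(g) for key, g in genres.items() if len(g) > 1}

def train_test_overlap(train_rows, test_rows):
    """Return test rows whose (title, synopsis) also occurs in the train rows."""
    train_keys = {(r["title"], r["synopsis"]) for r in train_rows}
    return [r for r in test_rows if (r["title"], r["synopsis"]) in train_keys]

# Made-up rows, for illustration only (not taken from the real dataset).
train = [
    {"title": "Movie A", "synopsis": "s1", "genre": "drama"},
    {"title": "Movie A", "synopsis": "s1", "genre": "family"},
    {"title": "Movie B", "synopsis": "s2", "genre": "adventure"},
]
test = [
    {"id": 1, "title": "Movie B", "synopsis": "s2"},
    {"id": 2, "title": "Movie C", "synopsis": "s3"},
]

conflicts = conflicting_labels(train)
overlap = train_test_overlap(train, test)
print(conflicts)                    # keys labelled with several genres
print([r["id"] for r in overlap])   # test ids that also appear in train
```

On the real data, the length of the overlap list would be the number to compare against the 8540 figure quoted above.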

abhishek

Competitions org Jun 13, 2023

@pgabrys Thank you for pointing this out.
A movie can be associated with multiple genres. We'll leave handling the training data up to the user. When making predictions, users should choose the most probable genre for each test sample.

pgabrys

Jun 13, 2023

OK. Let's take "Iron Man" as an example. The record has the "adventure" genre in the train set. Is it possible for it to have the same genre in the test set? My concern is that it is impossible, in which case we should predict the most probable genre other than "adventure".

sagar-thacker

Jul 10, 2023

A follow-up question:

Consider the toy dataset below, with actual and predicted labels. We know there are duplicated entries in both train and test. Focusing on the test set: in the scenario below, the predictions are correct but in a different order. The submission file contains the id and the predicted genre, so even though the model identified the correct labels, the accuracy will be low because the genres are mapped to the ids in a different order.

What can one do in such a scenario?

id	title	genre (actual)	prediction
1	ABC	family	drama
2	ABC	drama	family
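To make the scoring effect concrete, here is a minimal sketch. It assumes the submission is scored by plain per-row accuracy, which the thread does not confirm; the `accuracy` helper is written for this example only.

```python
def accuracy(actual, predicted):
    """Fraction of rows where the predicted genre equals the actual genre."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)

actual = ["family", "drama"]     # ids 1 and 2 from the toy table
swapped = ["drama", "family"]    # the predictions from the toy table
constant = ["drama", "drama"]    # one label repeated for both duplicate rows

acc_swapped = accuracy(actual, swapped)
acc_constant = accuracy(actual, constant)
print(acc_swapped, acc_constant)
```

Under that assumption the swapped predictions score 0.0, while repeating a single label scores 0.5. If the model sees only the title and synopsis and is deterministic, it returns the same label for both rows anyway, so 0.5 would be the best achievable score on such a pair.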
