Possible issue with the data
I believe there is a problem with the data labels. Some records with the same title and synopsis have different labels in the dataset. This would be fine if such records appeared in only the train set or only the test set, but that is not the case. For example, "Sironia" appears only in the train set, with two different labels. "Iron Man" has the adventure genre in the train set. Does that mean it is expected to have anything but the adventure genre in the test set? There are 8540 (out of 70291) movies with the same title and synopsis that appear in both the train and test sets.
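For anyone who wants to reproduce this kind of check, here is a rough sketch of how the conflicting labels and the train/test overlap could be found. It assumes each record is a dict with `title`, `synopsis` and `genre` keys (the field names and the toy rows are my own placeholders, not the real competition files), so adjust it to however you load the data.

```python
from collections import defaultdict

def labels_by_key(rows):
    """Map each (title, synopsis) pair to the set of genres it appears with."""
    genres = defaultdict(set)
    for row in rows:
        genres[(row["title"], row["synopsis"])].add(row["genre"])
    return genres

def conflicting_keys(rows):
    """Return the (title, synopsis) pairs that carry more than one genre."""
    return {key: g for key, g in labels_by_key(rows).items() if len(g) > 1}

def overlapping_keys(train_rows, test_rows):
    """Return the (title, synopsis) pairs present in both train and test."""
    train_keys = {(r["title"], r["synopsis"]) for r in train_rows}
    test_keys = {(r["title"], r["synopsis"]) for r in test_rows}
    return train_keys & test_keys

# Toy rows, made up for illustration only (not taken from the real dataset).
train = [
    {"title": "Movie A", "synopsis": "s1", "genre": "drama"},
    {"title": "Movie A", "synopsis": "s1", "genre": "family"},
    {"title": "Movie B", "synopsis": "s2", "genre": "adventure"},
]
test = [
    {"title": "Movie B", "synopsis": "s2"},
    {"title": "Movie C", "synopsis": "s3"},
]

conflicts = conflicting_keys(train)      # "Movie A" has two different genres
overlap = overlapping_keys(train, test)  # "Movie B" is in both sets
print(conflicts)
print(overlap)
```

On the real data, `len(overlap)` would be the number to compare against the 8540 figure above, assuming the same definition of a duplicate (exact match on title and synopsis).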
Ok. Let's take the example of "Iron Man". The record has the "adventure" genre in the train set. Is it possible for it to have the same genre in the test set? My concern is that this is impossible, and that we should instead predict the most probable genre other than "adventure".
A follow-up question:
Consider the toy dataset below with actual and predicted labels. We know there are duplicated entries in both train and test. Suppose we focus on the test set and, as in the scenario below, the predictions are correct but in a different order. The submission contains the id and the predicted genre, so even though the model identified the correct labels, the accuracy will be low because the id-to-genre mapping is in a different order.
What can one do in such a scenario?
id | title | genre (actual) | prediction |
---|---|---|---|
1 | ABC | family | drama |
2 | ABC | drama | family |
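
To make the toy example concrete, here is a small sketch that scores the table above in two ways: id by id (which is how I assume the submission is scored) and as an order-insensitive comparison of labels per title. The function names are made up for this illustration.

```python
from collections import Counter, defaultdict

# The two rows from the toy table above.
rows = [
    {"id": 1, "title": "ABC", "actual": "family", "prediction": "drama"},
    {"id": 2, "title": "ABC", "actual": "drama", "prediction": "family"},
]

def idwise_accuracy(rows):
    """Fraction of ids whose predicted genre equals the actual genre."""
    correct = sum(r["actual"] == r["prediction"] for r in rows)
    return correct / len(rows)

def labels_match_per_title(rows):
    """For each title, check whether the predicted genres equal the
    actual genres when order (the id assignment) is ignored."""
    actual = defaultdict(Counter)
    predicted = defaultdict(Counter)
    for r in rows:
        actual[r["title"]][r["actual"]] += 1
        predicted[r["title"]][r["prediction"]] += 1
    return {title: actual[title] == predicted[title] for title in actual}

acc = idwise_accuracy(rows)           # 0.0
match = labels_match_per_title(rows)  # {'ABC': True}
print(acc, match)
```

So the id-wise accuracy is 0.0 even though the set of predicted genres for "ABC" is exactly right, which is the mismatch described in the question.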