Luckily, the community seems to be converging on a simple and elegant chat dataset format: each record is a list of conversation turns, and each turn is an object with a `role` (`system`, `assistant` or `user`) and `content`. Hugging Face uses this input format in the [Templates for Chat Models](https://huggingface.co/docs/transformers/main/en/chat_templating#how-do-i-use-chat-templates) docs:
    messages = [
        {
            "role": "system",
            "content": "You are a friendly chatbot who always responds in the style of a pirate",
        },
        {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
    ]
Popular datasets like HuggingFaceH4/no_robots follow this format.
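To make the shape of a whole dataset concrete, here is a minimal sketch of two records, each with a `messages` column, serialized one per line. The second conversation, the helper names `to_jsonl` and `from_jsonl`, and the choice of JSON Lines as a storage format are assumptions for illustration only; none of them come from the Hugging Face docs.

```python
import json

# Hypothetical two-row dataset: each row has a "messages" column
# holding one conversation as a list of turn objects.
records = [
    {
        "messages": [
            {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate"},
            {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "Hello!"},
            {"role": "assistant", "content": "Hi, how can I help?"},
        ]
    },
]


def to_jsonl(rows):
    """Serialize rows as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows)


def from_jsonl(text):
    """Parse JSON Lines text back into a list of rows, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


jsonl_text = to_jsonl(records)
round_tripped = from_jsonl(jsonl_text)
```

Because every row has the same shape, the data survives a round trip through plain text without any per-dataset conversion code.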
To encourage usage of this format, I propose we give it a name: Hugging Face MessagesList format.
The format is defined as:
- Having at least one `messages` column of type list.
  - Each `messages` record is an array containing one or more message turn objects.
- A message turn must have `role` and `content` keys.
  - `role` should be one of `system`, `assistant` or `user`.
  - `content` is the text content of the message.

This may be a small thing, but having a common dataset format will reduce the time wasted on data wrangling and help everyone.
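The rules above are simple enough to check mechanically. The following is a minimal sketch of such a check for a single `messages` record. The function name `validate_messages`, the `ALLOWED_ROLES` constant and the wording of the error strings are hypothetical and not part of any Hugging Face library.

```python
# Roles named in the format definition. A tuple is used so that the
# membership test also works for unhashable values.
ALLOWED_ROLES = ("system", "assistant", "user")


def validate_messages(messages):
    """Return a list of problems found in one `messages` record (empty if valid)."""
    # A record must be an array with one or more message turn objects.
    if not isinstance(messages, list) or len(messages) == 0:
        return ["messages must be a non-empty list"]

    problems = []
    for i, turn in enumerate(messages):
        if not isinstance(turn, dict):
            problems.append(f"turn {i}: not an object")
            continue
        # Every turn must have both keys.
        for key in ("role", "content"):
            if key not in turn:
                problems.append(f"turn {i}: missing key '{key}'")
        # Assumption: an unknown role is reported as a problem, although
        # the definition only says the role "should" be one of the three.
        if "role" in turn and turn["role"] not in ALLOWED_ROLES:
            problems.append(f"turn {i}: unexpected role {turn['role']!r}")
        # Content is the text of the message.
        if "content" in turn and not isinstance(turn["content"], str):
            problems.append(f"turn {i}: content is not text")
    return problems
```

Calling `validate_messages(row["messages"])` on each row of a dataset gives a quick conformance report. Since the definition says `role` "should" be one of the three values, a looser variant could treat an unknown role as a warning rather than a failure.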