Great work, and here are my personal thoughts
This is excellent work; thank you for sharing your results.
However, after reviewing the dataset, I had a thought: Could the original model achieve similar results?
In Meta AI's earlier LIMA paper, the authors concluded that most of a model's knowledge is learned during the pre-training phase, and that subsequent instruction tuning and reinforcement learning serve "to better align to end tasks and user preferences". So I think your work is essentially adjusting this alignment, rather than adding new knowledge.
Based on this idea, I believe it might be possible to achieve similar effects simply by adjusting the prompts. I immediately thought of the "Developer Mode" prompt for ChatGPT that I had seen before. Although OpenAI has imposed many restrictions, we can still remove some of them through similar prompts. Here's an example I am sharing, with a question adopted from your dataset:
https://chat.openai.com/share/8ec1dde0-2101-47bd-8c1a-f10a538be9c9
My personal feeling is that the results are somewhat similar to the effects the dataset is trying to demonstrate, which seems to suggest that fine-tuning effectively substitutes for such prompts.
Based on this observation, I thought we could perhaps use such prompts to collect a large number of normal outputs and "human mode" outputs from ChatGPT, build a dataset comparing the two kinds of responses, and finally use it for instruction tuning. As someone who has only recently entered this field, I'm not sure whether anyone has done this before, and I don't have a dataset to run the experiment myself, so I wonder whether you think this idea makes sense and is feasible?
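To make the idea concrete, here is a minimal sketch of how such a comparison dataset could be assembled. All names and the record layout are my own assumptions, not anything from your work; the placeholder strings stand in for responses you would actually collect by querying the model twice per question, once with a "Developer Mode"-style system prompt and once without:

```python
import json

def build_pair_record(question, normal_response, human_mode_response):
    """Package one question with both response styles into a single
    comparison record, in a chosen/rejected layout commonly used for
    preference-style tuning data. (Field names are illustrative.)"""
    return {
        "instruction": question,
        "chosen": human_mode_response,   # the unrestricted "human mode" output
        "rejected": normal_response,     # the default aligned output
    }

def write_dataset(records, path):
    # One JSON object per line (JSONL), a common format for tuning data.
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")

# Placeholder example; real data would come from the model itself.
records = [build_pair_record(
    "What is your honest opinion on X?",
    "As an AI language model, I don't have opinions.",
    "Honestly, I think X is overrated.",
)]
write_dataset(records, "comparison_pairs.jsonl")
```

With enough such pairs, the file could then feed whichever instruction-tuning or preference-optimization setup you are already using.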
It's different: if you apply an alignment and then later convince the model to ignore it, it loses something in the process.
It's like cutting off someone's hand and then replacing it with a robot hand.
Also, my goal isn't to make it "pretend to have opinions".
My goal is to reveal its "actual" opinions.