Text Classification with LLMs
Can Llama help classify news headlines? I have around 4–5 labels and roughly 10K examples per label. Any recommendations on how to achieve this? I am already using SetFit and small LLM models to classify news, but the results are not that promising.
- Try zero-shot learning (prompt only)
- If it doesn't work, try few-shot learning (prompt only)
- If it doesn't work, try generating embeddings with an LLM + a logistic regression model trained on your 10K dataset
- If none of the above approaches work, consider fine-tuning the LLM with your labeled data (including new tokens). However, if you reach this step, you may need to reconsider your approach and examine your dataset.
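For the first two steps, the core work is writing a classification prompt and parsing the model's reply. A minimal sketch (the labels here are illustrative, and the actual generation call depends on how you serve Llama, so it is left out):

```python
LABELS = ["politics", "sports", "business", "technology", "entertainment"]

def build_prompt(headline, labels=LABELS):
    # Zero-shot: just the instruction. For few-shot, prepend a handful of
    # labeled "Headline: ... / Label: ..." examples to this string.
    options = ", ".join(labels)
    return (
        f"Classify this news headline into exactly one of: {options}.\n"
        f"Headline: {headline}\n"
        "Label:"
    )

def parse_label(completion, labels=LABELS):
    # Take the first known label that appears in the model's reply;
    # return None if the model answered with something else.
    text = completion.lower()
    for label in labels:
        if label in text:
            return label
    return None
```

Pass `build_prompt(...)` to whatever generation endpoint you use, then run the raw completion through `parse_label` so malformed outputs fall back to `None` instead of an invalid class.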
@tanliboy My knowledge of "generate embeddings with an LLM + logistic regression on the 10K training dataset" is lacking, so I don't understand how to implement it in code. This is the first time I have heard of this method.
I thought LLM embeddings were only used for RAG, but I just found out that they can be used for text classification!
Is there example code or an explanatory article for the "LLM embeddings + logistic regression" method?
Thank you.
I was also looking into whether an LLM could improve text classification performance, and I found this thread!
Using an LLM for text classification will naturally cost more than existing models, but I expect the classification accuracy to be higher,
and I also think the accuracy will hold up over time even with less labeling,
so I'm going to try text classification with an LLM.
I'm not sure it will work as I expect, though; I haven't found any solid evidence (articles, papers) that LLMs are effective for text classification.
Hmm, it will depend on your concrete use case.
As a simple start, you could try the GPT embedding APIs (https://platform.openai.com/docs/guides/embeddings/what-are-embeddings) and train a simple logistic regression model on top of them.
If that works, you can later replace the embedding part with Llama models, which requires some changes to make the model output embeddings instead of tokens.
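A runnable sketch of the "embeddings + logistic regression" idea. To keep it self-contained, a deterministic bag-of-words function stands in for real embeddings; in practice you would swap `embed` for a call to the OpenAI embeddings API (`client.embeddings.create(...)`) or a `sentence-transformers` model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny illustrative dataset; you would use your 10K-per-label headlines.
train_texts = ["stocks rally on earnings", "team wins the final",
               "stocks slip on inflation", "striker scores twice"]
train_labels = ["business", "sports", "business", "sports"]

# Fixed vocabulary so the placeholder "embedding" is deterministic.
vocab = sorted({w for t in train_texts for w in t.lower().split()})

def embed(texts):
    # Placeholder: bag-of-words counts. Replace this function with real
    # LLM embeddings (OpenAI API, sentence-transformers, etc.).
    vecs = np.zeros((len(texts), len(vocab)))
    for i, t in enumerate(texts):
        for w in t.lower().split():
            if w in vocab:
                vecs[i, vocab.index(w)] += 1.0
    return vecs

# The classifier on top stays the same regardless of the embedding source.
clf = LogisticRegression(max_iter=1000)
clf.fit(embed(train_texts), train_labels)
print(clf.predict(embed(["market closes higher on stocks"])))
```

The point is the division of labor: the LLM (or any embedding model) turns text into fixed-size vectors once, and a cheap classical classifier does the actual classification, so retraining on new labels costs almost nothing.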
I am doing email classification with 22 classes, but my clean data is only around 3000 emails, about 150 per class.
Do you think Llama 3 embeddings + a logistic/SVM classifier would be a good idea? Or would an out-and-out LLM classifier be better?
Probably start with embeddings (grab a model from the gte leaderboard), and then try AutoModelForSequenceClassification with Llama, though that may need more training data than 3K.
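For the second option, a hedged sketch of the setup with Hugging Face `transformers` (the model name is illustrative, the labels are placeholders for your 22 classes, and actually running this needs the model weights and a GPU):

```python
# Placeholder label set standing in for the 22 email classes.
LABELS = [f"class_{i}" for i in range(22)]
label2id = {l: i for i, l in enumerate(LABELS)}
id2label = {i: l for l, i in label2id.items()}

def load_classifier(model_name="meta-llama/Meta-Llama-3-8B"):
    # Imported here so the label maps above work even without transformers.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=len(LABELS),   # adds a fresh, randomly initialized head
        label2id=label2id,
        id2label=id2label,
    )
    # Llama tokenizers ship without a pad token; reuse EOS so batching works.
    tok.pad_token = tok.eos_token
    model.config.pad_token_id = tok.eos_token_id
    return tok, model
```

The classification head is initialized from scratch, which is why this route tends to want more labeled data than the frozen-embeddings approach; fine-tune with the usual `Trainer` loop (ideally with LoRA to fit in memory).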
Hi @dss107 @Sudipta1995
See my recent post https://www.linkedin.com/feed/update/urn:li:activity:7243302933855916033/
Please try the new embedding models. I'm skeptical you can get better results with Llama.