Session Outline
This session at the NDSML Summit 2023 describes how Gavagai improved sentiment analysis for a specific domain by using generative models (GPT-3.5/4) to produce synthetic examples, which were then used to fine-tune an existing transformer-based model. In initial experiments, the method scaled 450 domain-specific, severely skewed texts up to a corpus of 500,000+ balanced, labeled texts, which circumvented privacy issues in the original data and improved the predictive power of the final model. The fine-tuned model showed an F1 score improvement of 8 to 10 points when evaluated on a held-out, non-synthetic dataset. The talk also addresses the two major challenges in generating labeled synthetic training data: label noise, and ensuring that the generated data is “similar enough” to the original data to be useful as training data.
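The talk description does not include code, but the core idea can be sketched. Below is a minimal, illustrative example of prompting an OpenAI chat model to expand a few seed texts into synthetic labeled examples; it is not Gavagai's actual pipeline, and the prompt wording, label set, model name, and helper function are assumptions for illustration.

```python
# Illustrative sketch only: expand a small set of seed texts into
# synthetic labeled sentiment examples via an OpenAI chat model.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = ["positive", "negative", "neutral"]  # hypothetical label set


def generate_synthetic_examples(seed_texts, label, n=10):
    """Ask the model for n new in-domain texts that all carry `label`."""
    seeds = "\n".join(f"- {t}" for t in seed_texts)
    prompt = (
        f"Here are example texts from our domain:\n{seeds}\n\n"
        f"Write {n} new, distinct texts in the same style and domain that "
        f"express a clearly {label} sentiment. Return one text per line."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # higher temperature for more varied outputs
    )
    lines = response.choices[0].message.content.strip().splitlines()
    # Each generated text inherits the label it was conditioned on.
    return [(t.lstrip("- ").strip(), label) for t in lines if t.strip()]
```

Generating per label in this way is what makes it possible to rebalance a severely skewed dataset: the sampling budget per label is chosen by the caller rather than inherited from the original distribution.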
Key Takeaways
- Generative LLMs can be used to generate synthetic, labeled training data for NLP tasks.
- Two challenges with synthetic data: Can we trust the labels that the LLM outputs? And can we control the distribution of the generated data so that it stays similar enough to the original data? (One possible similarity check is sketched after this list.)
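The talk names these challenges without prescribing a fix, but a common way to screen for the second one is an embedding-similarity filter: discard synthetic texts that sit too far from everything in the original corpus. The sketch below assumes the sentence-transformers library; the model name and threshold are illustrative, not from the talk.

```python
# Illustrative sketch only: keep synthetic texts that are close, in
# embedding space, to at least one original (non-synthetic) text.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice


def filter_by_similarity(synthetic_texts, original_texts, threshold=0.5):
    """Keep synthetic texts whose max cosine similarity to any original >= threshold."""
    orig = encoder.encode(original_texts, normalize_embeddings=True)
    synth = encoder.encode(synthetic_texts, normalize_embeddings=True)
    sims = synth @ orig.T  # cosine similarity (embeddings are normalized)
    keep = sims.max(axis=1) >= threshold
    return [t for t, k in zip(synthetic_texts, keep) if k]
```

A symmetric idea applies to the first challenge: label noise can be reduced by re-scoring synthetic examples with an existing classifier and dropping those whose predicted label disagrees with the label they were generated under.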