【Whispering AI】Finetune LLMs with Direct Preference Optimization

In this tutorial we discuss a powerful alignment technique called Direct Preference Optimization (DPO), which was used to fine-tune Mistral 7B and is rapidly becoming the de facto method for boosting the performance of open chat models.
DPO is a replacement for Reinforcement Learning from Human Feedback (RLHF): it aligns Mistral, or any LLM in general, directly on preference pairs without training a separate reward model.
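For reference, the core of DPO is a simple loss over preference pairs: the policy is nudged to assign relatively higher probability to the chosen answer than to the rejected one, measured against a frozen reference model. The snippet below is a minimal PyTorch sketch of the loss from the paper linked further down; it is illustrative only, and the function and argument names are ours, not from the video's notebook.

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Each argument is the summed log-probability that the policy or the
    # frozen reference model assigns to the chosen / rejected completion.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # The implicit reward margin is scaled by beta and passed through a
    # log-sigmoid: the loss shrinks as the chosen completion becomes more
    # likely than the rejected one, relative to the reference model.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()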

⏱️ Timestamps
0:00 Intro
2:18 Dataset for finetuning using DPO
3:49 Setup project in Colab
4:33 Understanding the code base
7:49 Training

Paper: https://arxiv.org/abs/2305.18290
Dataset: https://huggingface.co/datasets/argilla/distilabel-math-preference-dpo
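For those who want a starting point before opening the Colab, here is a hedged sketch of loading this dataset and training with TRL's DPOTrainer. It is not the exact notebook from the video: it assumes the DPOTrainer API from around the time of the Mistral 7B release (beta passed directly to the trainer; newer TRL releases move it into a DPOConfig), and the source column names mapped from the argilla dataset are assumptions.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_name)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)  # frozen reference copy
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

raw = load_dataset("argilla/distilabel-math-preference-dpo", split="train")

# DPOTrainer expects "prompt" / "chosen" / "rejected" columns; the source
# column names used here are assumptions about the argilla dataset schema.
def to_dpo_format(row):
    return {
        "prompt": row["instruction"],
        "chosen": row["chosen_response"],
        "rejected": row["rejected_response"],
    }

train_dataset = raw.map(to_dpo_format, remove_columns=raw.column_names)

training_args = TrainingArguments(
    output_dir="mistral-7b-dpo",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    num_train_epochs=1,
    logging_steps=10,
)

trainer = DPOTrainer(
    model,
    ref_model,
    args=training_args,
    beta=0.1,  # strength of the implicit KL-style penalty in the DPO loss
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    max_length=1024,
    max_prompt_length=512,
)
trainer.train()

On a free Colab GPU you would typically also add 4-bit loading and a LoRA/PEFT config to make the 7B model fit in memory; those details are covered in the video.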

#MistralFineTuning #alpaca #Mistral7B #MistralAI #Mistral #LargeLanguageModels #LLM #AI #LargeLanguageModel #LLMTrainingCustomDataset #LLMFinetuning #OpenSourceLLM #FinetuneLlama #FineTuneLlama2 #FinetuneModel #TrainLLMsWithOwnData #TrainYourLLM #FinetuneLLMs #TrainingLLMModels #FinetuningMistral7B #FineTune #FineTuning #FineTuneMistral7B #Llama2 #opensource #NLP #ArtificialIntelligence #datascience #langchain #llamaindex #vectorstore #textprocessing #deeplearning #deeplearningai #100daysofmlcode #neuralnetworks #generativeai #generativemodels #OpenAI #GPT #GPT3 #GPT4 #chatgpt4