Improving Machine Learning from Human Feedback
Erin Mikail Staples + Nikolai Liubimov
PyData DE 2023
Slide 2
Erin Mikail Staples (she/her)
Sr. Developer Community Advocate. Empowers the open source community through education, collaboration, and content creation.
Nikolai Liubimov (he/him) CTO
Helps customers debug and adopt Label Studio best practices.
Slide 3
Large Foundational Models have hit the cultural zeitgeist
Slide 4
Slide 5
We will not be creating Terminator here.
Slide 6
These large generative models are better with a human signal.
Slide 7
Why does this matter?
Slide 8
Bigger ≠ Better
Slide 9
Internet-trained models bring with them internet-scaled biases.
Slide 10
- Biases and social problems
- Poor data quality
- Limited applications
Slide 11
Slide 12
Power of Reinforcement Learning
Slide 13
Slide 14
Reinforcement Learning from Human Feedback helps adjust for problems that tend to come with large-scale foundational models.
Slide 15
Reinforcement Learning: a goal-oriented approach that seeks to identify the action or sequence of actions that maximizes future rewards, and is able to select the best output among a series of outputs.
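To make the "select the best output among a series of outputs" idea concrete, here is a minimal best-of-N sketch: a generator proposes several candidate completions and a reward signal picks the highest-scoring one. The toy reward and all names below are illustrative assumptions rather than the presenters' code; in a real RLHF setup the reward comes from a preference model trained on human feedback.

```python
# Hypothetical best-of-N selection: score candidate outputs with a reward
# signal and keep the one that maximizes it. The reward function here is a
# stand-in for a learned preference model trained on human comparisons.

def toy_reward(prompt: str, completion: str) -> float:
    # Placeholder heuristic: prefer short answers that stay on topic.
    on_topic = 1.0 if prompt.split()[0].lower() in completion.lower() else 0.0
    brevity = 1.0 / (1.0 + len(completion.split()))
    return on_topic + brevity

def best_of_n(prompt: str, candidates: list[str]) -> str:
    # Pick the candidate with the highest reward.
    return max(candidates, key=lambda c: toy_reward(prompt, c))

if __name__ == "__main__":
    prompt = "Summarize why human feedback helps large models."
    candidates = [
        "Human feedback aligns model outputs with what people actually want.",
        "It is complicated and depends on many factors and considerations...",
        "Summarize: feedback helps because humans rank outputs by preference.",
    ]
    print(best_of_n(prompt, candidates))
```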
Slide 16
Unsupervised Learning and Prompt Engineering focus on adapting to an existing model’s limitations.
Slide 17
Known limitations include:
- Harmful speech
- Overgeneralized data
- Out-of-date data
- Racial, gender, and religious biases
- Large computational resource requirements
Slide 18
Reinforcement Learning focuses on optimizing for the end goal by adapting the model itself to new and possibly uncertain information based on a human signal.
Slide 19
With RLHF, one can align model outputs with one’s specific needs while reducing bias, at a fraction of the original training cost.
Slide 20
BLOOM - ChatAlpaca - OpenLLaMA - CarperAI/trlx - PyTorch - instructGOOSE - Label Studio - Hugging Face
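Of the tools listed above, CarperAI/trlx is the one that drives the actual RL loop. The snippet below is a rough sketch (not the presenters' demo code) of how a PPO-style fine-tune can be kicked off with trlx, following the pattern in the project's README at the time of the talk; treat the exact call signature as an assumption, and note that the reward function is a toy stand-in for a learned preference model.

```python
# Rough sketch of starting an RLHF fine-tune with CarperAI/trlx
# (pip install trlx; see https://github.com/CarperAI/trlx).
import trlx

def reward_fn(samples, **kwargs):
    # Toy reward: favor completions that mention "helpful".
    # In practice this would call a reward model trained on human rankings.
    return [float("helpful" in s.lower()) for s in samples]

trainer = trlx.train(
    "gpt2",                      # base model to fine-tune
    reward_fn=reward_fn,         # scalar reward per generated sample
    prompts=["Explain RLHF in one sentence."] * 64,
)
```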
Slide 21
We’re already seeing RLHF used in the wild
Slide 22
So how did they do it?
Slide 23
Slide 24
Slide 25
Slide 26
Slide 27
The Importance of Reward (Preference) Models
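A reward (preference) model is typically trained on pairs of completions where annotators marked one as better, using a pairwise ranking loss that pushes the preferred completion's score above the rejected one's. Below is a minimal PyTorch sketch of that objective; the tiny scoring head and the random batch are illustrative assumptions standing in for a pretrained transformer and real annotation data.

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Toy scalar reward model: embeds tokens, mean-pools, and maps to a score.
    In practice the encoder is a pretrained transformer; it is stubbed out
    here to keep the sketch self-contained."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, token_ids):
        hidden = self.embed(token_ids).mean(dim=1)   # mean-pool over tokens
        return self.score(hidden).squeeze(-1)        # one scalar per sequence

model = RewardHead()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Fake batch: token ids for completions humans preferred vs. rejected.
chosen = torch.randint(0, 1000, (8, 32))
rejected = torch.randint(0, 1000, (8, 32))

# Pairwise ranking loss: chosen completions should score higher than rejected ones.
loss = -torch.nn.functional.logsigmoid(model(chosen) - model(rejected)).mean()
loss.backward()
optimizer.step()
```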
Slide 28
Preventing Unwanted Model Drift
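A common guardrail against drift in PPO-based RLHF pipelines (an assumption about the approach, not something stated on the slide) is to penalize the fine-tuned policy for moving too far from a frozen reference model, via a per-token KL-style term folded into the reward. A minimal sketch with made-up shapes:

```python
import torch
import torch.nn.functional as F

def kl_shaped_rewards(policy_logits, ref_logits, tokens, task_reward, beta=0.1):
    """Toy reward shaping: subtract a KL-style penalty between the fine-tuned
    policy and the frozen reference model so the policy cannot drift
    arbitrarily far while chasing the task reward. Placing the task reward on
    the final token follows common PPO-RLHF practice (an assumption)."""
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    # Log-prob of the tokens actually generated, under both models.
    idx = tokens.unsqueeze(-1)
    lp_policy = policy_logp.gather(-1, idx).squeeze(-1)
    lp_ref = ref_logp.gather(-1, idx).squeeze(-1)
    rewards = -beta * (lp_policy - lp_ref)    # per-token drift penalty
    rewards[:, -1] += task_reward             # preference-model reward at the end
    return rewards

# Fake shapes: batch of 4 sequences, 16 generated tokens, vocabulary of 100.
policy_logits = torch.randn(4, 16, 100)
ref_logits = torch.randn(4, 16, 100)
tokens = torch.randint(0, 100, (4, 16))
print(kl_shaped_rewards(policy_logits, ref_logits, tokens, task_reward=torch.ones(4)).shape)
```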
Slide 29
Final Stages of Model Development
Slide 30
Ready for Production
Slide 31
We know what this looks like theoretically…
Slide 32
… now let’s demonstrate this in real time.
Slide 33
See it in action! https://github.com/heartexlabs/RLHF
Slide 34
Problems with RLHF
Slide 35
Humans ruin everything.
Slide 36
RLHF relies on social engineering and data integrity as much as it does technical skill.
Slide 37
Keeping annotators well-informed and motivated
Slide 38
Try out RLHF for yourself. ➡ https://labelstud.io/pydata-berlin
@erinmikail | @liubimovnik | @labelstudioHQ | community@labelstud.io