ML//RL
Agent learns by trial and error, maximizing cumulative reward.
No labeled data: just states, actions, and rewards.
Q-learning: learn the value of each action in each state, then act greedily on those values. Policy gradient: skip the value estimates and directly adjust the policy's action probabilities to raise expected reward.
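A minimal sketch of tabular Q-learning, to make the "value of each action in each state" idea concrete. The 5-cell corridor environment, the `step` helper, and every hyperparameter value here are made up for illustration and are not from these notes.

```python
import random

# Hypothetical toy environment: a corridor of 5 cells.
# The agent starts in cell 0; reaching cell 4 ends the episode with reward 1.
N_STATES = 5
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Apply an action; return (next_state, reward, done)."""
    if action == 0:
        next_state = max(0, state - 1)
    else:
        next_state = min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Learn a table q[state][action] by trial and error."""
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy: usually exploit the table, sometimes explore.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[state])
                # Break ties at random so the untrained agent does not get stuck.
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Q-learning target: reward plus discounted best value of the next state.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


if __name__ == "__main__":
    for s, row in enumerate(q_learning()):
        print(s, [round(v, 3) for v in row])
```

After training, the "right" entry should dominate in every non-terminal cell, so acting greedily on the table walks straight to the goal. A policy-gradient method would drop the table and update action probabilities directly.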
Connected to language models through RLHF: human preference comparisons train a reward model, then RL (e.g. PPO) fine-tunes the LLM to maximize that reward.
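A rough sketch of the reward-model side of RLHF. In the usual setup the reward model is fit to human comparisons of two responses, commonly with a pairwise (Bradley-Terry style) loss. The function name and the scalar inputs are illustrative assumptions; a real reward model would produce these scores from a neural network.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Small when the human-preferred response scores higher, large otherwise.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    print(preference_loss(2.0, 0.0))  # preferred response scored higher
    print(preference_loss(0.0, 2.0))  # preferred response scored lower
```

Minimizing this loss over many labeled comparisons pushes the reward model to rank responses the way the annotators did. The RL step then treats that model's score as the reward signal for the LLM.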