ML//RL

Agent learns by trial and error, maximizing cumulative reward.

No labeled data — just actions, states, and outcomes.
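
A minimal sketch of the agent-environment loop, to make "cumulative reward" concrete. The corridor environment, the `step` and `run_episode` helpers, and the discount factor `gamma` are all assumptions made for illustration; the notes above do not specify any of them.

```python
import random

def step(state, action):
    # Hypothetical corridor with states 0..4; action is -1 (left) or +1 (right).
    # The only reward is 1.0 for reaching state 4, which also ends the episode.
    next_state = max(0, min(4, state + action))
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward, next_state == 4

def run_episode(policy, gamma=0.9, max_steps=50):
    """Roll out one episode and return the discounted cumulative reward."""
    state, ret, discount = 0, 0.0, 1.0
    for _ in range(max_steps):
        action = policy(state)                     # agent picks an action
        state, reward, done = step(state, action)  # environment responds
        ret += discount * reward                   # accumulate discounted reward
        discount *= gamma
        if done:
            break
    return ret

def always_right(state):
    return +1

def random_policy(state):
    return random.choice([-1, +1])

print(run_episode(always_right))   # reward arrives on step 4, so about 0.9**3
print(run_episode(random_policy))  # varies from run to run
```

There are no labels anywhere in this loop: the only feedback is the reward the environment returns after each action.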

Q-learning: learn the value of each action in each state. Policy gradient: directly optimize the action probabilities.
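
The two approaches can be sketched side by side. This is a toy illustration under stated assumptions, not a reference implementation: the 5-state corridor, the 2-armed bandit with made-up payouts, and all hyperparameters (`alpha`, `gamma`, `epsilon`, `lr`) are hypothetical. The first function is tabular Q-learning with epsilon-greedy exploration; the second is a basic REINFORCE update (no baseline) on a softmax policy.

```python
import math
import random

# Hypothetical 5-state corridor: start at state 0, reward 1.0 on reaching state 4.
N_STATES, GOAL = 5, 4
MOVES = [-1, +1]  # action 0 = left, action 1 = right

def step(state, action):
    next_state = max(0, min(GOAL, state + MOVES[action]))
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning: estimate the value of each action in each state."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            best = max(q[state])
            greedy = [a for a in (0, 1) if q[state][a] == best]
            # Epsilon-greedy: mostly exploit, sometimes explore; ties broken at random.
            if rng.random() < epsilon:
                action = rng.choice((0, 1))
            else:
                action = rng.choice(greedy)
            next_state, reward, done = step(state, action)
            # Bootstrapped target: reward plus discounted best value of the next state.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

def reinforce_bandit(steps=2000, lr=0.1, seed=0):
    """Policy gradient (REINFORCE) on a hypothetical 2-armed bandit."""
    rng = random.Random(seed)
    arm_rewards = [0.2, 1.0]  # made-up payouts; arm 1 is better
    theta = [0.0, 0.0]        # action preferences (logits of a softmax policy)
    for _ in range(steps):
        exps = [math.exp(t) for t in theta]
        probs = [e / sum(exps) for e in exps]
        action = 0 if rng.random() < probs[0] else 1
        reward = arm_rewards[action]
        # d/d theta[j] of log pi(action) is 1[j == action] - probs[j].
        for j in range(2):
            indicator = 1.0 if j == action else 0.0
            theta[j] += lr * reward * (indicator - probs[j])
    exps = [math.exp(t) for t in theta]
    return [e / sum(exps) for e in exps]

q = q_learning()
greedy_policy = [max((0, 1), key=lambda a: q[s][a]) for s in range(GOAL)]
print(greedy_policy)       # greedy action per non-terminal state
print(reinforce_bandit())  # action probabilities after training
```

Q-learning never represents a policy directly: it reads one off the table by taking the highest-valued action. REINFORCE never estimates values: it nudges the action probabilities toward whatever earned reward.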

Connected to language models through RLHF: human preference comparisons train a reward model, then RL (e.g. PPO) fine-tunes the LLM to maximize that reward model's score.
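
Reward models for RLHF are commonly fit to pairwise human preferences with a Bradley-Terry style loss, -log sigmoid(r_chosen - r_rejected). The sketch below assumes that formulation (the notes do not name a specific loss), and `preference_loss` is a hypothetical helper operating on plain scalar scores rather than a real model.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log sigmoid(reward_chosen - reward_rejected)."""
    margin = reward_chosen - reward_rejected
    # log(1 + exp(-margin)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# The loss is small when the human-preferred response scores higher,
# and large when the reward model ranks the pair the wrong way round.
print(preference_loss(2.0, 0.0))
print(preference_loss(0.0, 2.0))
```

Once trained, the reward model's scalar score plays the role of the environment reward in the RL step that fine-tunes the LLM.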