ML//RL

2018-06-15

- Agent learns by trial and error, maximizing cumulative reward.

No labeled data: just actions, states, and outcomes.

Q-learning: learn the value (expected cumulative reward) of each action in each state, then act greedily on those values. Policy gradient: directly adjust the policy's action probabilities to increase expected reward.
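Sketch of the Q-learning half, to make "value of each action in each state" concrete. Assumptions, none from the note itself: a made-up 5-state corridor where only reaching the last state pays reward 1, a hypothetical `step` function standing in for the environment, and arbitrary values for `ALPHA`, `GAMMA`, `EPSILON`. Tabular only; not a definitive implementation.

```python
import random

N_STATES = 5          # states 0..4 on a line; state 4 is the goal (assumed toy setup)
ACTIONS = (-1, +1)    # index 0 = move left, index 1 = move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2   # illustrative values, not tuned


def step(state, action):
    """Hypothetical toy environment: returns (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    # q[state][action_index] = current estimate of cumulative discounted reward
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy trial and error; ties are broken at random.
            if rng.random() < EPSILON or q[state][0] == q[state][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, ACTIONS[a])
            # Q-learning target: reward plus discounted best value of the next state.
            target = reward if done else reward + GAMMA * max(q[next_state])
            q[state][a] += ALPHA * (target - q[state][a])
            state = next_state
    return q


q_table = train()
# Greedy action in each non-terminal state, read off the learned table.
greedy_policy = [ACTIONS[row.index(max(row))] for row in q_table[:-1]]
print(greedy_policy)
```

No labels anywhere: the table is filled in purely from (state, action, reward, next state) tuples. In this toy corridor the greedy policy should end up moving right in every state.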

Connected to language models through RLHF: human preference comparisons train a reward model, and RL (typically PPO) then fine-tunes the LLM to maximize that reward model's score.
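Sketch of the reward-model side of RLHF. In the usual setup the reward model is fit to human comparisons between two responses with a pairwise (Bradley-Terry style) loss; the function name `preference_loss` and the scalar inputs are assumptions for illustration, and a real reward model would produce these scores from a neural network.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(r_chosen - r_rejected)).

    Small when the human-preferred response already scores higher,
    large when the ranking is the wrong way round.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) is the same as log(1 + exp(-m))
    return math.log1p(math.exp(-margin))


agrees = preference_loss(2.0, 0.0)      # reward model agrees with the human label
disagrees = preference_loss(0.0, 2.0)   # reward model disagrees
print(agrees < disagrees)
```

The RL step then treats the trained reward model's score as the reward signal for the LLM's generated responses.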