Engineering
Learnings
Research
-
Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study
- shows the limitations of DPO and the key factors for using PPO correctly
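- For reference, a minimal sketch of the standard DPO loss the paper evaluates (generic implementation, not the paper's code; tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss on a batch of preference pairs; each input is a
    [batch]-shaped tensor of summed response log-probs."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy usage with made-up log-probs:
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                torch.tensor([-12.9]), torch.tensor([-14.8]))
```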
-
PRIME (Process Reinforcement through Implicit Rewards)
- train outcome reward model and use it as a process reward model
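- As I understand the trick, training the reward model on outcome labels under the implicit-reward parameterization r(y) = beta * log(pi_rm(y|x) / pi_ref(y|x)) makes per-token process rewards fall out as log-ratios. A minimal sketch (names are mine, not the paper's):

```python
import torch

def implicit_process_rewards(rm_token_logps: torch.Tensor,
                             ref_token_logps: torch.Tensor,
                             beta: float = 0.05) -> torch.Tensor:
    """Token-level process rewards from an outcome-trained implicit reward
    model: r_t = beta * (log pi_rm(y_t | x, y_<t) - log pi_ref(y_t | x, y_<t)).
    Inputs are [batch, seq_len] log-probs of the response tokens under the
    implicit RM and the frozen reference model."""
    return beta * (rm_token_logps - ref_token_logps)
```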
-
Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding
- LaTent Reasoning Optimization (LaTRO)
- from the prompt, sample a rationale, compute the likelihood of the correct response, and treat it as the self-reward ⇒ optimize with RLOO
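- A minimal sketch of the RLOO update with this self-reward, assuming K rationales are sampled per prompt (names are illustrative; LaTRO's full objective may include additional terms):

```python
import torch

def rloo_self_reward_loss(rationale_logps: torch.Tensor,
                          answer_logps: torch.Tensor) -> torch.Tensor:
    """REINFORCE leave-one-out with self-rewards, for one prompt.

    rationale_logps: [K] summed log-probs of K sampled rationales under the
                     current policy (gradients flow through these).
    answer_logps:    [K] log-likelihood of the correct answer given
                     prompt + rationale; used directly as the self-reward.
    """
    k = answer_logps.shape[0]
    rewards = answer_logps.detach()
    baseline = (rewards.sum() - rewards) / (k - 1)  # leave-one-out mean
    advantages = rewards - baseline
    return -(advantages * rationale_logps).mean()
```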
-
Enhancing Multi-Step Reasoning Abilities of Language Models Through Direct Q-Function Optimization
-
OREO: an offline RL method to improve LLM multi-step reasoning
- requires no preference data; jointly learns a policy and a value model by optimizing the soft Bellman equation
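- A rough sketch of a soft-Bellman residual objective in KL-regularized RL, assuming deterministic transitions, a sparse terminal reward, and a frozen reference model; OREO's actual policy and value losses follow the paper, this only shows the core equation:

```python
import torch

def soft_bellman_residual_loss(values: torch.Tensor,
                               policy_logps: torch.Tensor,
                               ref_logps: torch.Tensor,
                               rewards: torch.Tensor,
                               beta: float = 0.1) -> torch.Tensor:
    """Mean squared residual of the soft Bellman equation for one trajectory.

    values:       [T+1] value estimates V(s_0)..V(s_T); the terminal value
                  is typically fixed to 0.
    policy_logps: [T] log pi(a_t | s_t) of the taken steps.
    ref_logps:    [T] log pi_ref(a_t | s_t) under a frozen reference model.
    rewards:      [T] per-step rewards (often zero except at the final step).
    """
    # Soft Bellman: V(s_t) = r_t + V(s_{t+1}) - beta * log(pi / pi_ref)
    residual = (values[:-1] - rewards - values[1:]
                + beta * (policy_logps - ref_logps))
    return (residual ** 2).mean()
```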
-
Learning by Distilling Context
- Concretely, given a synthetic unlabeled input for the target task, we condition the model on [instructions] + [task-input] to predict [scratch-pad] + [final answer]; then we fine-tune the same model to predict its own [final answer] conditioned on the [task-input], without seeing the [instructions] or using the [scratch-pad].
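- A minimal sketch of the data construction, assuming a hypothetical teacher_generate helper and a "Final answer:" marker separating the scratch-pad from the answer:

```python
def build_distillation_example(teacher_generate, instructions: str,
                               task_input: str) -> dict:
    """Build one student training pair via context distillation.

    teacher_generate and the "Final answer:" marker are hypothetical.
    The teacher pass conditions on [instructions] + [task-input] and produces
    [scratch-pad] + [final answer]; the student pair keeps only
    ([task-input] -> [final answer]).
    """
    generation = teacher_generate(instructions + "\n" + task_input)
    _scratch_pad, final_answer = generation.rsplit("Final answer:", 1)
    return {"input": task_input, "target": final_answer.strip()}
```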
-
Data release of verification rationales for GSM- from the GenRM paper