Engineering
Learnings
Research
-
Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study
- shows the limitations of DPO and the key factors for using PPO correctly
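- For reference, a minimal sketch of the standard DPO loss the paper evaluates (generic implementation, not the paper's code; tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss on a batch of preference pairs; each input is a
    [batch]-shaped tensor of summed response log-probs."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy usage with made-up log-probs:
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                torch.tensor([-12.9]), torch.tensor([-14.8]))
```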
-
PRIME (Process Reinforcement through Implicit Rewards)
- train outcome reward model and use it as a process reward model
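- As I understand the trick, training the reward model on outcome labels under the implicit-reward parameterization r(y) = beta * log(pi_rm(y|x) / pi_ref(y|x)) makes per-token process rewards fall out as log-ratios. A minimal sketch (names are mine, not the paper's):

```python
import torch

def implicit_process_rewards(rm_token_logps: torch.Tensor,
                             ref_token_logps: torch.Tensor,
                             beta: float = 0.05) -> torch.Tensor:
    """Token-level process rewards from an outcome-trained implicit reward
    model: r_t = beta * (log pi_rm(y_t | x, y_<t) - log pi_ref(y_t | x, y_<t)).
    Inputs are [batch, seq_len] log-probs of the response tokens under the
    implicit RM and the frozen reference model."""
    return beta * (rm_token_logps - ref_token_logps)
```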
-
Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding
- LaTent Reasoning Optimization (LaTRO)
- from the prompt, sample a rationale, compute the likelihood of the correct response, and treat it as the self-reward ⇒ optimize with RLOO
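- A minimal sketch of the RLOO update with this self-reward, assuming K rationales are sampled per prompt (names are illustrative; LaTRO's full objective may include additional terms):

```python
import torch

def rloo_self_reward_loss(rationale_logps: torch.Tensor,
                          answer_logps: torch.Tensor) -> torch.Tensor:
    """REINFORCE leave-one-out with self-rewards, for one prompt.

    rationale_logps: [K] summed log-probs of K sampled rationales under the
                     current policy (gradients flow through these).
    answer_logps:    [K] log-likelihood of the correct answer given
                     prompt + rationale; used directly as the self-reward.
    """
    k = answer_logps.shape[0]
    rewards = answer_logps.detach()
    baseline = (rewards.sum() - rewards) / (k - 1)  # leave-one-out mean
    advantages = rewards - baseline
    return -(advantages * rationale_logps).mean()
```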
-
Enhancing Multi-Step Reasoning Abilities of Language Models Through Direct Q-Function Optimization
-
OREO: an offline RL method to improve LLM multi-step reasoning
- requires no preference data; jointly learns a policy and a value model by optimizing the soft Bellman equation
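- A rough sketch of a soft-Bellman residual objective in KL-regularized RL, assuming deterministic transitions, a sparse terminal reward, and a frozen reference model; OREO's actual policy and value losses follow the paper, this only shows the core equation:

```python
import torch

def soft_bellman_residual_loss(values: torch.Tensor,
                               policy_logps: torch.Tensor,
                               ref_logps: torch.Tensor,
                               rewards: torch.Tensor,
                               beta: float = 0.1) -> torch.Tensor:
    """Mean squared residual of the soft Bellman equation for one trajectory.

    values:       [T+1] value estimates V(s_0)..V(s_T); the terminal value
                  is typically fixed to 0.
    policy_logps: [T] log pi(a_t | s_t) of the taken steps.
    ref_logps:    [T] log pi_ref(a_t | s_t) under a frozen reference model.
    rewards:      [T] per-step rewards (often zero except at the final step).
    """
    # Soft Bellman: V(s_t) = r_t + V(s_{t+1}) - beta * log(pi / pi_ref)
    residual = (values[:-1] - rewards - values[1:]
                + beta * (policy_logps - ref_logps))
    return (residual ** 2).mean()
```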
-
Learning by Distilling Context
- Concretely, given a synthetic unlabeled input for the target task, we condition the model on [instructions] + [task-input] to predict [scratch-pad] + [final answer]; then we fine-tune the same model to predict its own [final answer] conditioned on the [task-input], without seeing the [instructions] or using the [scratch-pad].
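- A minimal sketch of the data construction, assuming a hypothetical teacher_generate helper and a "Final answer:" marker separating the scratch-pad from the answer:

```python
def build_distillation_example(teacher_generate, instructions: str,
                               task_input: str) -> dict:
    """Build one student training pair via context distillation.

    teacher_generate and the "Final answer:" marker are hypothetical.
    The teacher pass conditions on [instructions] + [task-input] and produces
    [scratch-pad] + [final answer]; the student pair keeps only
    ([task-input] -> [final answer]).
    """
    generation = teacher_generate(instructions + "\n" + task_input)
    _scratch_pad, final_answer = generation.rsplit("Final answer:", 1)
    return {"input": task_input, "target": final_answer.strip()}
```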
-
Data release of verification rationales for GSM- from the GenRM paper