• Camels in a Changing Climate: Enhancing LM Adaptation with TÜLU 2

  • 326,154 samples (https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture); a minimal loading sketch is given after the mixture list below.

  • Data mixture

  • FLAN [Chung et al., 2022]: We use 50,000 examples sampled from FLAN v2.

  • CoT: To emphasize chain-of-thought (CoT) reasoning, we sample another 50,000 examples from the CoT subset of the FLAN v2 mixture.

  • Open Assistant 1 [Köpf et al., 2023]: We isolate the highest-scoring paths in each conversation tree and use these samples, resulting in 7,708 examples. Scores are taken from the quality labels provided by the original annotators of Open Assistant 1.

  • ShareGPT: We use all 114,046 examples from our processed ShareGPT dataset, as we found that including ShareGPT resulted in strong performance in prior work.

  • GPT4-Alpaca [Peng et al., 2023]: We sample 20,000 examples from GPT-4 Alpaca to further include distilled GPT-4 data.

  • Code-Alpaca [Chaudhary, 2023]: We use all 20,022 examples from Code Alpaca, following our prior V1 mixture, in order to improve model coding abilities.

  • LIMA [Zhou et al., 2023]: We use 1,030 examples from LIMA as a source of carefully curated data.

  • WizardLM Evol-Instruct V2 [Xu et al., 2023]: We sample 30,000 examples from WizardLM, which contains distilled data of increasing diversity and complexity.

  • Open-Orca [Lian et al., 2023]: We sample 30,000 examples generated by GPT-4 from OpenOrca, a reproduction of Orca [Mukherjee et al., 2023], which augments FLAN data with additional model-generated explanations.

  • Science literature: We include 7,544 examples from a mixture of scientific document understanding tasks, including question answering, fact-checking, summarization, and information extraction. A breakdown of tasks is given in Appendix C.
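
  For reference, below is a minimal sketch of how the released mixture could be loaded and its per-source composition inspected. It uses the Hugging Face `datasets` library; the column names `dataset` (source subset) and `messages` (conversation turns) are assumptions about the released dataset's schema and are not stated above.

  ```python
  # Sketch (not the official TÜLU 2 pipeline): load the released SFT mixture
  # and check how many examples come from each source subset.
  from collections import Counter

  from datasets import load_dataset

  # Load the released mixture (326,154 samples) from the Hugging Face Hub.
  mixture = load_dataset("allenai/tulu-v2-sft-mixture", split="train")

  # Assumed schema: a "dataset" column naming the source subset
  # (FLAN, CoT, Open Assistant 1, ShareGPT, ...).
  counts = Counter(mixture["dataset"])
  for source, n in counts.most_common():
      print(f"{source}: {n}")

  # Assumed schema: each example stores a multi-turn conversation as a list of
  # {"role": ..., "content": ...} messages.
  print(mixture[0]["messages"])
  ```

  Inspecting the per-source counts this way is a quick check that a local copy matches the subset sizes listed above (e.g., 50,000 FLAN v2 examples, 7,708 Open Assistant 1 examples).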