Large Scale (multi-task)
FLAN-2
- 1836 tasks, 15M examples (NLP tasks, code, reasoning, …); extremely varied
- https://github.com/google-research/FLAN/tree/main/flan/v2
- Muffin = Multi-task finetuning with instructions.
- T0-SF = tasks from T0 that do not overlap with Muffin (SF stands for “sans Flan”)
- NIv2 = Natural Instructions v2, from “Benchmarking Generalization via In-Context Instructions on 1,600+ Language Tasks” (later published as “Super-NaturalInstructions”)
- CoT = tasks from prior work for which human raters manually wrote chain-of-thought annotations for the training corpus
UltraChat
- Geared towards chat-style tuning
- 1.5 million high-quality multi-turn dialogues covering a wide range of topics and instructions
- The data is curated across three sectors: Questions about the World, Creation and Generation, and Assistance on Existing Material
- To construct informative and realistic multi-turn conversations, two separate ChatGPT Turbo API instances are used for conversation generation: one plays the role of the user and generates queries, the other generates the responses (sketched below)
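For concreteness, a minimal sketch of this two-role setup in Python, assuming the openai client library; the model name (gpt-3.5-turbo), the system prompts, and the fixed turn count are illustrative stand-ins, not UltraChat's actual configuration or prompts.

```python
# Minimal sketch of UltraChat-style two-role dialogue generation (illustrative only).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-3.5-turbo"  # stand-in for the "ChatGPT Turbo" API mentioned above

USER_SIM_PROMPT = (
    "You are simulating a curious human user. Given the conversation so far, "
    "write the user's next message: a natural follow-up question or instruction."
)
ASSISTANT_PROMPT = "You are a helpful assistant. Answer the user's last message."


def chat(system_prompt: str, history: list[dict]) -> str:
    """One call to the chat API with a role-specific system prompt."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": system_prompt}] + history,
    )
    return resp.choices[0].message.content.strip()


def generate_dialogue(opening_question: str, n_turns: int = 3) -> list[dict]:
    """Alternate between an assistant model and a user-simulator model."""
    dialogue = [{"role": "user", "content": opening_question}]
    for turn in range(n_turns):
        # Assistant model answers the latest user message.
        dialogue.append({"role": "assistant", "content": chat(ASSISTANT_PROMPT, dialogue)})
        if turn == n_turns - 1:
            break  # end on an assistant reply
        # The user-simulator model sees the conversation with roles flipped, so that
        # its own past turns (the user turns) look like assistant turns to it.
        flipped = [
            {"role": "assistant" if m["role"] == "user" else "user", "content": m["content"]}
            for m in dialogue
        ]
        dialogue.append({"role": "user", "content": chat(USER_SIM_PROMPT, flipped)})
    return dialogue


if __name__ == "__main__":
    for turn in generate_dialogue("What causes the northern lights?"):
        print(f"{turn['role']}: {turn['content']}\n")
```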
Camel
- CAMEL: Communicative Agents for “Mind” Exploration of Large Scale Language Model Society.
SODA (old)
- “SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization”
- million-scale high-quality social dialogue dataset (synthetic)
Smaller scale (alignment)
Dump
Synthetic
- Alpaca was constructed using Self-Instruct and text-davinci-003. Self-Instruct starts from a small seed set of human-written tasks, prompts the model to generate new instruction-tuning tasks, and filters out low-quality or near-duplicate ones (sketched after this list)
- ShareGPT/Vicuna is a dataset of 70K voluntarily-shared ChatGPT conversations
- Evol-Instruct/WizardLM contains 70k single-turn instructions that are considered more complex than Alpaca’s. The dataset was derived from Alpaca by using ChatGPT to iteratively evolve the initial instructions (see the second sketch after this list)
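As referenced in the Alpaca bullet above, a rough sketch of the Self-Instruct bootstrapping loop, assuming the openai and rouge_score Python packages; the seed tasks, prompt template, 0.7 ROUGE-L threshold, and the substitute model gpt-3.5-turbo-instruct (standing in for the now-retired text-davinci-003) illustrate the general recipe rather than the exact Alpaca pipeline.

```python
# Rough sketch of the Self-Instruct loop used to build Alpaca (illustrative only).
import random

from openai import OpenAI
from rouge_score import rouge_scorer

client = OpenAI()
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

# In practice Self-Instruct starts from 175 human-written seed tasks.
seed_tasks = [
    "Explain why the sky is blue.",
    "Write a short poem about autumn.",
    "Convert 10 miles to kilometers.",
]


def complete(prompt: str) -> str:
    """LLM completion call (gpt-3.5-turbo-instruct stands in for text-davinci-003)."""
    resp = client.completions.create(
        model="gpt-3.5-turbo-instruct", prompt=prompt, max_tokens=100, temperature=1.0
    )
    return resp.choices[0].text.strip()


def too_similar(candidate: str, pool: list[str], threshold: float = 0.7) -> bool:
    """Reject candidates whose ROUGE-L overlap with any existing task is too high."""
    return any(
        scorer.score(existing, candidate)["rougeL"].fmeasure > threshold
        for existing in pool
    )


def self_instruct(pool: list[str], target_size: int = 20) -> list[str]:
    """Grow the task pool by prompting the model with sampled in-context examples."""
    pool = list(pool)
    while len(pool) < target_size:
        demos = random.sample(pool, k=min(3, len(pool)))
        prompt = (
            "Come up with a new task.\n"
            + "".join(f"Task: {d}\n" for d in demos)
            + "Task:"
        )
        candidate = complete(prompt).split("\n")[0].strip()
        if candidate and not too_similar(candidate, pool):
            pool.append(candidate)
    return pool


if __name__ == "__main__":
    for task in self_instruct(seed_tasks):
        print(task)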
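And, as referenced in the Evol-Instruct bullet, a minimal sketch of one “in-depth evolution” step, where ChatGPT is asked to rewrite an existing instruction into a more complex variant; the prompt wording, model name, and number of rounds are illustrative, not WizardLM’s exact ones.

```python
# Minimal sketch of an Evol-Instruct "in-depth evolution" step (illustrative only).
from openai import OpenAI

client = OpenAI()

EVOLVE_PROMPT = (
    "Rewrite the following instruction into a more complex version that a human can "
    "still understand and answer, for example by adding one constraint or requiring "
    "one extra reasoning step. Reply with the rewritten instruction only.\n\n"
    "Instruction: {instruction}"
)


def evolve(instruction: str, rounds: int = 2) -> list[str]:
    """Apply successive evolution steps, keeping every intermediate instruction."""
    lineage = [instruction]
    for _ in range(rounds):
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": EVOLVE_PROMPT.format(instruction=lineage[-1])}],
        )
        lineage.append(resp.choices[0].message.content.strip())
    return lineage


if __name__ == "__main__":
    for step, ins in enumerate(evolve("List three uses of baking soda.")):
        print(f"evolution {step}: {ins}")
```

WizardLM additionally uses in-breadth evolution (generating entirely new instructions rather than complicating existing ones) and an elimination step that discards failed evolutions.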
Real-data curated
- LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset (one million conversations with 25 LLMs, collected in the wild from the Vicuna demo and Chatbot Arena)
- LIMA (“Less Is More for Alignment”): only 1,000 carefully curated prompts and responses, showing that a small, high-quality dataset can suffice for alignment
- Open-Platypus is amalgamated from 11 open-source datasets, curated specifically towards improving LLM performance in STEM and logical domains. It contains 25k questions, of which ≈ 10% are LLM-generated and the remainder human-written