Large Scale (multi-task)

FLAN-2

  • The Flan 2022 collection of instruction-tuning tasks, mixing zero-shot, few-shot, and chain-of-thought prompt formats.

UltraChat

  • Tuned towards chat
  • 1.5 million high-quality multi-turn dialogues covering a wide range of topics and instructions.
  • Curates three sectors: Questions about the World, Creation and Generation, and Assistance on Existing Material.
  • To construct informative and realistic multi-turn conversations, two separate ChatGPT Turbo API instances are used during conversation generation: one plays the role of the user and generates queries, while the other generates the responses (sketched below).
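A minimal sketch of this dual-API self-chat loop, assuming an OpenAI-compatible Python client; the system prompts, model name, and turn count are illustrative stand-ins, not UltraChat's actual prompts or settings.

```python
# Sketch of UltraChat-style self-chat: one LLM instance simulates the user,
# another produces assistant replies. Prompts and parameters are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set
MODEL = "gpt-3.5-turbo"  # stand-in for the "ChatGPT Turbo" API in the paper

USER_SIM_SYSTEM = (
    "You are a curious human user. Given the conversation so far, "
    "ask one natural follow-up question or give one instruction."
)
ASSISTANT_SYSTEM = "You are a helpful, detailed assistant."

def chat(system: str, history: list[dict]) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": system}] + history,
        temperature=0.9,
    )
    return resp.choices[0].message.content

def generate_dialogue(opening_query: str, n_turns: int = 4) -> list[dict]:
    dialogue = [{"role": "user", "content": opening_query}]
    for turn in range(n_turns):
        # Assistant model answers the latest user query.
        dialogue.append({"role": "assistant", "content": chat(ASSISTANT_SYSTEM, dialogue)})
        if turn == n_turns - 1:
            break
        # The user-simulator model sees the conversation with roles flipped,
        # so the assistant's replies appear as the "other speaker", and
        # produces the next user query.
        flipped = [{"role": "assistant" if m["role"] == "user" else "user",
                    "content": m["content"]} for m in dialogue]
        dialogue.append({"role": "user", "content": chat(USER_SIM_SYSTEM, flipped)})
    return dialogue
```

The role flip is the key trick: each model only ever sees itself as the assistant, so a single chat endpoint can play both sides of the conversation.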

Camel

  • CAMEL: "Communicative Agents for 'Mind' Exploration of Large Language Model Society"; role-playing between an AI user agent and an AI assistant agent to generate task-oriented conversations.

SODA (old)

  • “SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization”
  • million-scale high-quality social dialogue dataset (synthetic)

Smaller scale (alignment)


Synthetic

  • Alpaca was constructed using Self-Instruct with text-davinci-003. Self-Instruct bootstraps from a small seed set of tasks to generate new instruction-tuning tasks and filters out low-quality or near-duplicate ones (see the sketch after this list).
  • ShareGPT/Vicuna is a dataset of 70K voluntarily shared ChatGPT conversations.
  • Evol-Instruct/WizardLM contains 70k single-turn instructions that are considered more complex than those in Alpaca. It was derived from the Alpaca dataset by using ChatGPT to iteratively evolve the initial instructions.
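A minimal sketch of a Self-Instruct-style generation loop, assuming an OpenAI-compatible client and the `rouge_score` package. The prompt wording, model name, and target pool size are illustrative; the 0.7 ROUGE-L novelty threshold follows the Self-Instruct paper, but the real pipeline also generates inputs/outputs for each instruction and applies further filters.

```python
# Sketch of a Self-Instruct-style loop: seed tasks bootstrap new instructions,
# which are kept only if they are not too similar to anything in the pool.
import random
from openai import OpenAI
from rouge_score import rouge_scorer

client = OpenAI()
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

seed_tasks = [
    "Write a short poem about autumn.",
    "Explain the difference between TCP and UDP.",
    "Summarize the plot of Hamlet in three sentences.",
]

def propose_instructions(pool: list[str], n_examples: int = 3) -> list[str]:
    """Ask the model for new instructions, conditioned on sampled examples."""
    examples = "\n".join(f"- {t}" for t in random.sample(pool, min(n_examples, len(pool))))
    prompt = (
        "Here are some example task instructions:\n"
        f"{examples}\n"
        "Write 5 new, diverse task instructions, one per line."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # stand-in; Alpaca used text-davinci-003
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    lines = resp.choices[0].message.content.splitlines()
    return [line.strip("- ").strip() for line in lines if line.strip()]

def is_novel(candidate: str, pool: list[str], max_rouge: float = 0.7) -> bool:
    """Drop candidates that overlap too much with anything already in the pool."""
    return all(
        scorer.score(existing, candidate)["rougeL"].fmeasure < max_rouge
        for existing in pool
    )

pool = list(seed_tasks)
while len(pool) < 100:  # target pool size, illustrative
    for cand in propose_instructions(pool):
        if is_novel(cand, pool):
            pool.append(cand)
```

Evol-Instruct follows the same generate-and-filter pattern, but instead of asking for brand-new tasks it prompts the model to rewrite an existing instruction into a more complex one (e.g. adding constraints or extra reasoning steps).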

Real-data curated

  • LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
  • LIMA ("Less Is More for Alignment"): 1,000 carefully curated prompt-response pairs, used to fine-tune LLaMA-65B.
  • Open-Platypus is a curated dataset amalgamated from 11 open-source datasets, aimed specifically at improving LLM performance in STEM and logic domains. It contains 25k questions, of which ≈10% are LLM-generated and the remainder human-written.
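Amalgamating many source datasets generally requires deduplication. Below is a minimal, hypothetical sketch of embedding-based near-duplicate removal with sentence-transformers; the model name and the 0.8 cosine threshold are assumptions for illustration, not the exact Open-Platypus procedure.

```python
# Sketch of greedy near-duplicate removal when merging instruction datasets.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

def dedupe(questions: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a question only if its cosine similarity to every
    previously kept question is below the threshold."""
    embeddings = model.encode(questions, convert_to_tensor=True, normalize_embeddings=True)
    kept: list[int] = []
    for i in range(len(questions)):
        if all(float(util.cos_sim(embeddings[i], embeddings[j])) < threshold for j in kept):
            kept.append(i)
    return [questions[i] for i in kept]

# Example: merged = dedupe(dataset_a + dataset_b)
```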