Large Scale (multi-task)

FLAN-2

  • The Flan 2022 collection of instruction-tuning tasks, mixing zero-shot, few-shot, and chain-of-thought prompt formats.

UltraChat

  • Tuned towards chat
  • 1.5 million high-quality multi-turn dialogues covering a wide range of topics and instructions.
  • Curates three sectors: Questions about the World, Creation and Generation, and Assistance on Existing Material.
  • To construct informative and realistic multi-turn conversations, two separate ChatGPT Turbo API instances are used during conversation generation: one plays the role of the user and generates queries, while the other generates the responses (sketched below).
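A minimal sketch of this dual-API self-chat loop, assuming an OpenAI-compatible Python client; the system prompts, model name, and turn count are illustrative stand-ins, not UltraChat's actual prompts or settings.

```python
# Sketch of UltraChat-style self-chat: one LLM instance simulates the user,
# another produces assistant replies. Prompts and parameters are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set
MODEL = "gpt-3.5-turbo"  # stand-in for the "ChatGPT Turbo" API in the paper

USER_SIM_SYSTEM = (
    "You are a curious human user. Given the conversation so far, "
    "ask one natural follow-up question or give one instruction."
)
ASSISTANT_SYSTEM = "You are a helpful, detailed assistant."

def chat(system: str, history: list[dict]) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": system}] + history,
        temperature=0.9,
    )
    return resp.choices[0].message.content

def generate_dialogue(opening_query: str, n_turns: int = 4) -> list[dict]:
    dialogue = [{"role": "user", "content": opening_query}]
    for turn in range(n_turns):
        # Assistant model answers the latest user query.
        dialogue.append({"role": "assistant", "content": chat(ASSISTANT_SYSTEM, dialogue)})
        if turn == n_turns - 1:
            break
        # The user-simulator model sees the conversation with roles flipped,
        # so the assistant's replies appear as the "other speaker", and
        # produces the next user query.
        flipped = [{"role": "assistant" if m["role"] == "user" else "user",
                    "content": m["content"]} for m in dialogue]
        dialogue.append({"role": "user", "content": chat(USER_SIM_SYSTEM, flipped)})
    return dialogue
```

The role flip is the key trick: each model only ever sees itself as the assistant, so a single chat endpoint can play both sides of the conversation.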

Camel

  • CAMEL: "Communicative Agents for 'Mind' Exploration of Large Language Model Society"; role-playing between an AI user agent and an AI assistant agent to generate task-oriented conversations.

SODA (old)

  • “SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization”
  • million-scale high-quality social dialogue dataset (synthetic)

Smaller scale (alignment)


Synthetic

  • Alpaca was constructed using Self-Instruct with text-davinci-003. Self-Instruct bootstraps from a small seed set of tasks to generate new instruction-tuning tasks and filters out low-quality or near-duplicate ones (see the sketch after this list).
  • ShareGPT/Vicuna is a dataset of 70K voluntarily shared ChatGPT conversations.
  • Evol-Instruct/WizardLM contains 70k single-turn instructions that are considered more complex than those in Alpaca. It was derived from the Alpaca dataset by using ChatGPT to iteratively evolve the initial instructions.
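A minimal sketch of a Self-Instruct-style generation loop, assuming an OpenAI-compatible client and the `rouge_score` package. The prompt wording, model name, and target pool size are illustrative; the 0.7 ROUGE-L novelty threshold follows the Self-Instruct paper, but the real pipeline also generates inputs/outputs for each instruction and applies further filters.

```python
# Sketch of a Self-Instruct-style loop: seed tasks bootstrap new instructions,
# which are kept only if they are not too similar to anything in the pool.
import random
from openai import OpenAI
from rouge_score import rouge_scorer

client = OpenAI()
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

seed_tasks = [
    "Write a short poem about autumn.",
    "Explain the difference between TCP and UDP.",
    "Summarize the plot of Hamlet in three sentences.",
]

def propose_instructions(pool: list[str], n_examples: int = 3) -> list[str]:
    """Ask the model for new instructions, conditioned on sampled examples."""
    examples = "\n".join(f"- {t}" for t in random.sample(pool, min(n_examples, len(pool))))
    prompt = (
        "Here are some example task instructions:\n"
        f"{examples}\n"
        "Write 5 new, diverse task instructions, one per line."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # stand-in; Alpaca used text-davinci-003
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    lines = resp.choices[0].message.content.splitlines()
    return [line.strip("- ").strip() for line in lines if line.strip()]

def is_novel(candidate: str, pool: list[str], max_rouge: float = 0.7) -> bool:
    """Drop candidates that overlap too much with anything already in the pool."""
    return all(
        scorer.score(existing, candidate)["rougeL"].fmeasure < max_rouge
        for existing in pool
    )

pool = list(seed_tasks)
while len(pool) < 100:  # target pool size, illustrative
    for cand in propose_instructions(pool):
        if is_novel(cand, pool):
            pool.append(cand)
```

Evol-Instruct follows the same generate-and-filter pattern, but instead of asking for brand-new tasks it prompts the model to rewrite an existing instruction into a more complex one (e.g. adding constraints or extra reasoning steps).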

Real-data curated

  • LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
  • LIMA ("Less Is More for Alignment"): 1,000 carefully curated prompt-response pairs, used to fine-tune LLaMA-65B.
  • Open-Platypus is a curated dataset amalgamated from 11 open-source datasets, aimed specifically at improving LLM performance in STEM and logic domains. It contains 25k questions, of which ≈10% are LLM-generated and the remainder human-written.
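Amalgamating many source datasets generally requires deduplication. Below is a minimal, hypothetical sketch of embedding-based near-duplicate removal with sentence-transformers; the model name and the 0.8 cosine threshold are assumptions for illustration, not the exact Open-Platypus procedure.

```python
# Sketch of greedy near-duplicate removal when merging instruction datasets.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

def dedupe(questions: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a question only if its cosine similarity to every
    previously kept question is below the threshold."""
    embeddings = model.encode(questions, convert_to_tensor=True, normalize_embeddings=True)
    kept: list[int] = []
    for i in range(len(questions)):
        if all(float(util.cos_sim(embeddings[i], embeddings[j])) < threshold for j in kept):
            kept.append(i)
    return [questions[i] for i in kept]

# Example: merged = dedupe(dataset_a + dataset_b)
```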