- Idea: discover naturally existing instruction data from the web
Pipeline for WebInstruct
- high-quality data recall from the web corpus
- Q-A pair extraction
- Q-A pair refinement
Recall from Common Crawl
- First fastText model
    - crawl 100K seed documents from known educational websites as positive training examples
    - randomly sample 100K negative examples from Common Crawl
    - recalled 100B tokens from an internal Common Crawl copy using the trained fastText model
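The seed-based training setup above amounts to preparing data for fastText's supervised mode, which expects one example per line prefixed with a `__label__` tag. A minimal sketch; the label names and toy documents are illustrative assumptions:

```python
def to_fasttext_line(text: str, label: str) -> str:
    """Format one example in fastText's supervised-training format:
    a `__label__<name>` prefix followed by the text, one example per line."""
    clean = " ".join(text.split())  # collapse newlines: one line per example
    return f"__label__{label} {clean}"

# toy stand-ins for the 100K seed documents and 100K negatives
seed_docs = ["Photosynthesis converts light energy into chemical energy."]
negative_docs = ["Buy cheap widgets online, free shipping today!"]

lines = [to_fasttext_line(d, "edu") for d in seed_docs]
lines += [to_fasttext_line(d, "other") for d in negative_docs]
```

Writing `lines` to a file and calling `fasttext.train_supervised(input=...)` from the `fasttext` package would then yield the recall classifier.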
- raw web documents are further grouped by their domains (root URL) and only domains with more than 1000 documents are retained
- extracted roughly 600K domains from the recalled documents
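The grouping-and-threshold step can be sketched as follows, assuming each recalled document is identified by its URL (the function name and threshold parameter are mine):

```python
from collections import defaultdict
from urllib.parse import urlparse

def retain_popular_domains(urls, min_docs=1000):
    """Group recalled documents by their root domain and keep only
    domains with more than `min_docs` documents, as in the pipeline."""
    counts = defaultdict(int)
    for url in urls:
        counts[urlparse(url).netloc] += 1
    return {domain for domain, n in counts.items() if n > min_docs}
```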
- Prompt GPT-3.5 to scan through the domains and automatically select those that might contain instruction data.
- Around 50K domains are further labeled as positive samples by GPT-3.5.
- Second, improved fastText model
    - sample documents from the selected domains as positive examples
    - sample documents from non-selected domains and general Common Crawl as negative examples
- recalled 40B tokens using the newly trained fastText model
    - prompt GPT-4 to sift through the recalled domains again, ultimately yielding 18M raw documents
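The LLM domain-screening steps (GPT-3.5, then GPT-4) could be realized with a prompt like the one constructed below. The wording is an illustrative assumption, not the paper's actual prompt, and the API call itself is omitted:

```python
def build_domain_screen_prompt(domains):
    """Assemble a prompt asking an LLM (GPT-3.5/GPT-4 in the pipeline)
    to flag domains that might contain instruction data. The wording is
    an illustrative assumption, not the actual prompt."""
    listing = "\n".join(f"{i}. {d}" for i, d in enumerate(domains, 1))
    return (
        "For each web domain below, answer yes/no: is it likely to "
        "contain educational question-answer or instruction data?\n"
        + listing
    )
```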
Q-A pair extraction
- carefully pre-process the HTML to pre-extract useful content from the recalled documents
- mostly rule-based filtering to clean site information, ads, HTML boilerplate, etc
- This step significantly reduces the document length for the next stage
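One way to realize the rule-based cleaning with Python's stdlib `html.parser`; the exact boilerplate tag set is an assumption, since the notes only mention site information, ads, and HTML boilerplate:

```python
from html.parser import HTMLParser

# tags treated as boilerplate; this rule set is an assumption,
# the notes only say "mostly rule-based filtering"
BOILERPLATE = {"script", "style", "nav", "header", "footer", "aside"}

class ContentExtractor(HTMLParser):
    """Keep text outside boilerplate tags, drop everything inside them."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside boilerplate subtrees
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_content(html: str) -> str:
    parser = ContentExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```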
- few-shot prompt Qwen-72B to identify question-answer pairs
- allow the model to return void if no natural question-answer pairs exist
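The few-shot extraction prompt, with the option to return void, might be sketched like this; the exemplars and the JSON output format are illustrative assumptions:

```python
# illustrative few-shot exemplars; not the paper's actual ones
FEW_SHOT = [
    ("Q: What is 2+2? A: 4.",
     '{"question": "What is 2+2?", "answer": "4."}'),
    ("Welcome to our homepage! Follow us on social media.",
     "void"),
]

def build_extraction_prompt(document: str) -> str:
    """Few-shot prompt asking the model (Qwen-72B in the pipeline) to
    extract a Q-A pair, or return `void` if none occurs naturally."""
    shots = "\n\n".join(f"Document: {d}\nExtraction: {e}"
                        for d, e in FEW_SHOT)
    return f"{shots}\n\nDocument: {document}\nExtraction:"
```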
Q-A pair refinement
- prompt Mixtral-8×22B and Qwen-72B (Bai et al., 2023) to reformat the extracted Q-A pairs
- if the answer does not contain any explanation, the two LLMs attempt to complete the intermediate reasoning steps leading to the given answer
- two models are adopted to increase the diversity of the dataset
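The refinement step could be sketched as a prompt builder plus an alternation between the two models. Both the prompt wording and the round-robin dispatch are assumptions made for illustration; the notes do not say how pairs are split between the models:

```python
REFINERS = ["Mixtral-8x22B", "Qwen-72B"]  # the two refinement models

def pick_refiner(pair_id: int) -> str:
    """Alternate Q-A pairs between the two LLMs so the reformatted
    dataset mixes both models' styles (the diversity motivation above).
    Round-robin dispatch is an assumption for illustration."""
    return REFINERS[pair_id % len(REFINERS)]

def build_refinement_prompt(question: str, answer: str) -> str:
    """Prompt (wording is an assumption) asking the refiner to reformat
    the pair and, when the answer is bare, fill in the intermediate
    reasoning steps leading to the given answer."""
    return (
        "Reformat this question-answer pair. If the answer contains no "
        "explanation, add the intermediate reasoning steps that lead to "
        "the given answer, without changing the final answer.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
```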