MEDITRON builds on Llama-2 (through an adaptation of Nvidia's Megatron-LM distributed trainer) and extends pretraining on a comprehensively curated medical corpus, including selected PubMed articles, abstracts, and internationally recognized medical guidelines.
Architecture
They adopt most pretraining settings and the model architecture from the Llama-2 paper (Touvron et al., 2023b). For optimization, they use the AdamW optimizer with a cosine learning rate scheduler. For the model architecture, they inherit the standard transformer architecture, RMSNorm, the SwiGLU activation function, and rotary positional embeddings directly from the Llama implementation. They use grouped-query attention (GQA), introduced with Llama-2, and a context length of 2048 for the 7B model and 4096 for the 70B model.
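As a rough illustration of the Llama-style components listed above, here is a minimal PyTorch sketch of RMSNorm and a SwiGLU feed-forward block; the class names, dimensions, and epsilon default are illustrative and not taken from MEDITRON's actual Megatron-LM code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer norm (no mean-centering, no bias), as used in Llama-style models."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root mean square of the last dimension, then rescale.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLUFeedForward(nn.Module):
    """SwiGLU MLP: silu(x W_gate) * (x W_up), projected back down, as in Llama-style blocks."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```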
Hyperparameters and Tokenization
The parameters for the AdamW optimizer are as follows: β₁ = 0.9, β₂ = 0.95, ε = 10⁻⁵. The cosine learning rate schedule uses 2000 warmup steps and decays the final learning rate to 10% of the maximum learning rate. They use a learning rate of 1.5 × 10⁻⁴ for the 70B model and 3 × 10⁻⁴ for the 7B and 13B models. The weight decay is set to 0.1, and gradient clipping is set to 1.0. They inherit the tokenizer from Llama, which uses the byte-pair encoding (BPE) algorithm implemented with SentencePiece, for a total vocabulary size of 32k tokens. Extra tokens are added to incorporate the new tokens introduced during pretraining data preprocessing.
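The optimizer and schedule described above can be sketched in PyTorch as follows; the model, the total step count, and the commented training-loop lines are placeholders, and only the hyperparameter values come from the paper.

```python
import math
import torch

# Placeholder model and step budget; only the hyperparameters below are from the paper.
model = torch.nn.Linear(4096, 4096)
max_lr, total_steps, warmup_steps = 1.5e-4, 100_000, 2_000  # 1.5e-4 is the 70B peak LR

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=max_lr,
    betas=(0.9, 0.95),
    eps=1e-5,
    weight_decay=0.1,
)

def cosine_with_warmup(step: int) -> float:
    """Linear warmup for 2000 steps, then cosine decay to 10% of the peak LR."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, cosine_with_warmup)

# Inside the training loop, gradients are clipped to a max norm of 1.0:
# loss.backward()
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# optimizer.step(); scheduler.step(); optimizer.zero_grad()
```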
Data
MEDITRON's domain-adaptive pretraining corpus GAP-REPLAY combines 48.1B tokens from four datasets:
Clinical Guidelines: a new dataset of 46K clinical practice guidelines from various healthcare-related sources,
Paper Abstracts: openly available abstracts from 16.1M closed-access PubMed and PubMed Central papers,
Medical Papers: full-text articles extracted from 5M publicly available PubMed and PubMed Central papers,
Replay dataset: general-domain data distilled to compose 1% of the entire corpus.
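As a hedged illustration of how a GAP-REPLAY-style mixture could be weighted, the sketch below pins general-domain replay data to 1% of the mix and splits the remaining 99% across the medical sources in proportion to their token counts; the function name and the per-dataset token counts are placeholders, not figures from the paper.

```python
from typing import Dict

def mixture_weights(medical_tokens: Dict[str, float], replay_fraction: float = 0.01) -> Dict[str, float]:
    """Return sampling weights summing to 1.0, reserving `replay_fraction` for replay data."""
    total_medical = sum(medical_tokens.values())
    weights = {
        name: (1.0 - replay_fraction) * count / total_medical
        for name, count in medical_tokens.items()
    }
    weights["replay"] = replay_fraction  # general-domain replay fixed at 1% of the corpus
    return weights

if __name__ == "__main__":
    # Placeholder token counts per medical source (not the paper's actual sizes).
    placeholder_counts = {"guidelines": 0.1e9, "abstracts": 5e9, "papers": 40e9}
    print(mixture_weights(placeholder_counts))
```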