Summary

  • They identify the key ingredients for building a top-tier code LLM:

    1. code-optimized heuristic rules for data cleaning and data-deduplication methods
    2. recall of code-related text corpora
    3. high-quality synthetic data in both annealing and supervised fine-tuning stages
  • They release not only the model weights and inference code, but also reproducible training data and the complete data-processing pipeline

Takeaways

  1. It is very important to remove non-informative data such as pure hexadecimal code and excessively short code snippets
  2. File-level deduplication proves more effective than repository-level deduplication: it maintains data diversity and improves model performance on downstream tasks
  3. Filtering data based on GitHub star count can reduce data diversity and distort the overall data distribution
  4. In the annealing phase, the use of high-quality data is crucial for further enhancing the model’s capabilities, indicating that data quality is more important than quantity in the later stages of model training.
  5. A two-stage instruction tuning strategy is shown to be effective, allowing the model to acquire broad capabilities initially and then refine them with code-specific tasks

Detailed

Pretraining data

  • They produce RefineCode,
    • 960B tokens
    • 607 programming languages
    • composed of two main parts
      • raw code (Github up to November 2023 + non-Github data from The Stack v2)
      • code-related web data (75B tokens)

Raw Code data processing

  • Preprocessing

    • exclude files exceeding 8 MB in size (non-text files, data, …)
    • They use the linguist file extension list to keep only file types related to programming languages.
      • They preserve 607 different types of files (code+data+text) (listed in Appendix E)
  • Deduplication

    • Due to the extremely high repetition of source code on GitHub, deduplication is done early in the pipeline
    • aggressive file-level deduplication strategy
      • exact deduplication
        • Due to the prevalence of forking and copy-pasting within the codebase, nearly 75% of files are completely duplicated
        • compute the SHA-256 hash of each file; files with identical hashes are grouped, and only the code file with the highest star count and the latest commit time is retained (a combined sketch of both deduplication steps follows this list)
      • fuzzy deduplication
        • split the raw text into 5-gram pieces, then compute 2048 MinHash functions
        • utilize LSH with 16 bands and 128 rows, retaining only the distinct files with the highest stars and latest commit time (more bands → more chances for similar items to become candidates; more rows → MinHash sequences must agree more closely to count as a match)
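A minimal sketch of this file-level deduplication, assuming each file record carries `content`, `stars`, and `commit_time` fields; the fuzzy step uses the `datasketch` library (the paper does not name its tooling) with 2048 permutations and 16 bands × 128 rows.

```python
import hashlib
from datasketch import MinHash, MinHashLSH  # one possible MinHash/LSH implementation

def exact_dedup(files):
    """Keep one file per SHA-256 hash, preferring high stars and a recent commit time."""
    best = {}
    for f in files:  # f: {"content": str, "stars": int, "commit_time": ISO timestamp str}
        h = hashlib.sha256(f["content"].encode("utf-8")).hexdigest()
        if h not in best or (f["stars"], f["commit_time"]) > (best[h]["stars"], best[h]["commit_time"]):
            best[h] = f
    return list(best.values())

def minhash_5gram(text, num_perm=2048):
    """MinHash signature over word-level 5-gram shingles (shingling unit is an assumption)."""
    tokens = text.split()
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(tokens) - 4, 1)):
        m.update(" ".join(tokens[i:i + 5]).encode("utf-8"))
    return m

def fuzzy_dedup(files, num_perm=2048, bands=16, rows=128):
    """LSH with 16 bands x 128 rows; keep only the best file per near-duplicate cluster."""
    lsh = MinHashLSH(num_perm=num_perm, params=(bands, rows))
    kept = []
    # Insert the highest-star / most recent files first so they are the ones retained.
    for idx, f in enumerate(sorted(files, key=lambda x: (x["stars"], x["commit_time"]), reverse=True)):
        sig = minhash_5gram(f["content"], num_perm)
        if not lsh.query(sig):          # no near-duplicate kept yet
            lsh.insert(str(idx), sig)
            kept.append(f)
    return kept
```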
  • Transformation

    • certain issues, though small in text size, are pervasive across numerous files; they are removed by transformation rather than filtering, so that otherwise high-quality files are not discarded in the later filtering steps
    • Copyright removal
      • over 15% of code files begin with a copyright notice such as “Copyright Intel Corporation (C) 2014-2016”
      • these notices are highly repetitive and irrelevant to the coding task, so they are stripped from the file content
    • PII Reduction
      • they use complex regular expressions to detect such information and replace it with placeholders such as “name” and “password”
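The paper does not publish its regular expressions, so the sketch below uses simplified stand-in patterns for both transformations: stripping a leading copyright header and masking a couple of common PII patterns with placeholders.

```python
import re

# Simplified stand-ins for the (unpublished) production patterns.
COPYRIGHT_RE = re.compile(r"^(\s*(#|//|/\*|\*)?\s*copyright.*(\n|$))+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IP_RE = re.compile(r"\b(\d{1,3}\.){3}\d{1,3}\b")

def transform(code: str) -> str:
    """Remove a leading copyright notice and replace obvious PII with placeholders."""
    code = COPYRIGHT_RE.sub("", code)
    code = EMAIL_RE.sub("<email>", code)
    code = IP_RE.sub("<ip>", code)
    return code
```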
  • Filtering

    • extends and refines the existing rules from StarCoder to better align with the unique properties of the code dataset
    • They consider the following guidelines when designing their filters:
      1. Filter out files with poor self-containment
      2. Filter out files with poor or minimal logical structure
      3. Remove files that deviate significantly from standard formatting.
    • The actual rules
      • Natural Language Filtering Rules: These rules filter data based on common properties for all text files, such as file size, number of lines, and other general metrics. Both text and code files share these filtering rules.
      • General Code Filtering Rules: These rules apply to all code files by filtering data based on general code characteristics such as the number of variables, average function length, and other common features
      • Language-Specific Filtering Rules: These rules are designed according to the unique characteristics of specific programming languages, such as the frequency of “pass” statements in Python or the use of “goto” statements in C. They developed these rules for eight commonly used programming languages: Python, C, C++, C#, Java, JavaScript, Go, and HTML
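The rule categories are listed but the thresholds are not, so this sketch only illustrates the shape of a language-specific filter; all thresholds below are hypothetical.

```python
def keep_python_file(code: str,
                     max_pass_ratio: float = 0.2,   # hypothetical threshold
                     max_avg_line_len: int = 200,   # hypothetical threshold
                     min_lines: int = 5) -> bool:
    """Toy Python-specific filter in the spirit of the guidelines above."""
    lines = [l for l in code.splitlines() if l.strip()]
    if len(lines) < min_lines:
        return False                                  # minimal logical structure
    if sum(len(l) for l in lines) / len(lines) > max_avg_line_len:
        return False                                  # deviates from standard formatting
    pass_ratio = sum(l.strip() == "pass" for l in lines) / len(lines)
    return pass_ratio <= max_pass_ratio               # mostly "pass" stubs -> drop
```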
  • Data sampling

    • downsample Java data from 409GB to 200GB, due to its excessive volume compared to other common languages.
    • downsample HTML data from 213GB to 64GB, as HTML files often contain a significant amount of non-informative structured content and lack substantial coding logic.
  • Collect a high-quality code-related data corpus from Common Crawl

  • FastText Model

    • Due to the lack of an open-source fine-grained code corpus, they first annotate 500,000 high-quality code-like samples from Common Crawl using the Autonomous Data Selection with Language Models method (which assigns a scalar score to each sample) as seed data for training a fastText model
    • To keep the fastText vocabulary size controllable and to enable space-based tokenization of Chinese text, they first apply a BPE tokenizer to segment the corpus
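A minimal sketch of training the recall classifier with the official `fasttext` package, assuming a training file in its `__label__` format built from the seed samples (positives) and random web pages (negatives); the hyperparameters are illustrative, not the paper's.

```python
import fasttext

# train.txt lines look like: "__label__code <BPE-segmented page text>"
# positives: the 500k seed samples; negatives: random Common Crawl pages (assumption).
model = fasttext.train_supervised(
    input="train.txt",
    lr=0.1, epoch=5, wordNgrams=2, dim=100,  # illustrative hyperparameters
)
model.save_model("code_recall.bin")
```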
  • Recall from text corpus

    • retain web pages with scores above a certain threshold
    • applied on Common Crawl, FineWeb, Skypile, and web part of AutoMathText
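Applying the classifier to recall pages could then look like the following; the 0.5 threshold and the `__label__code` label name are placeholders for unspecified values.

```python
def recall_code_pages(pages, model, threshold=0.5):   # threshold is a placeholder
    kept = []
    for text in pages:
        # fastText predicts one line at a time, so newlines are flattened first.
        labels, probs = model.predict(text.replace("\n", " "))
        if labels[0] == "__label__code" and probs[0] >= threshold:
            kept.append(text)
    return kept
```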
  • Code-related Domain Discovery

    • Define a domain as the set of web pages sharing the same base URL (e.g. stackoverflow.com); domains where over 10% of the web pages are recalled as code-related are classified as code-related domains
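A sketch of that statistic: group crawled pages by base URL and flag domains where more than 10% of the pages were recalled as code-related. The input format is an assumption.

```python
from collections import Counter
from urllib.parse import urlparse

def code_related_domains(pages, min_ratio=0.10):
    """pages: iterable of (url, is_code_related) pairs."""
    total, code = Counter(), Counter()
    for url, is_code in pages:
        domain = urlparse(url).netloc          # base URL, e.g. "stackoverflow.com"
        total[domain] += 1
        code[domain] += int(is_code)
    return {d for d in total if code[d] / total[d] > min_ratio}
```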
  • Manual url annotation

    • manually annotate the URLs associated with code content within these identified domains. For instance, all content under “stackoverflow.com/questions” is identified as computer-technology questions. Samples whose URLs match “stackoverflow.com/questions” but were not correctly classified by fastText are then added to the code seed corpus
    • this recall process is repeated 3 times, each iteration enlarging the seed corpus and retraining the fastText classifier

Annealing Data

  • bridge between the general pretraining stage and the supervised fine-tuning (SFT) stage
  • data mixture
    • need to ensure the distribution shift is not too large (84% of the annealing data still comes from RefineCode)

Algorithmic Corpus

  • Algorithmic code files exhibit strong code logic and minimal dependency on external files, demonstrating excellent self-containment.
  • more aligned with the distribution of smaller, independent tasks commonly encountered in real-world interactive scenarios.
  • sample a certain proportion of the original pretraining data containing keywords such as “leetcode”, “def solution”, or “class solution” to create this corpus
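A sketch of this keyword-based selection; the keyword list follows the paper, while the sampling rate is a placeholder for the unspecified proportion.

```python
import random

KEYWORDS = ("leetcode", "def solution", "class solution")

def sample_algorithmic(files, rate=0.5, seed=0):      # rate is a placeholder
    """Keep a random fraction of pretraining files that contain algorithmic keywords."""
    rng = random.Random(seed)
    hits = [f for f in files if any(k in f["content"].lower() for k in KEYWORDS)]
    return [f for f in hits if rng.random() < rate]
```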

Synthetic Data

  • Algorithmic Corpus as the seed
  • They do two forms of enhancement
    • High Quality Code Snippet
      • employ a strong LLM to synthesize a batch of self-contained, independent functions along with their corresponding test cases
      • retain the data that passes the test cases and include it in the annealing-stage dataset (see the execution-filter sketch after this list)
    • Code Textbooks
      • constructed educational text snippets based on the hqcode dataset using Qwen2-72B-Instruct
      • Hqcode is a multilingual code dataset synthesized with GPT-4o-Mini, where each entry describes an independent task and provides a corresponding function as a solution
      • ask LLM to perform interactive analysis on the code within the dataset
        • extract and elaborate on abstract code knowledge
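A hedged sketch of the execution-based filter for the synthesized function/test pairs: each function is run together with its generated tests in a subprocess, and only samples whose tests exit cleanly are kept. The record fields and the timeout are assumptions.

```python
import os
import subprocess
import sys
import tempfile

def passes_tests(sample, timeout=10):
    """sample: {"function": str, "tests": str}, both Python source strings (assumed format)."""
    program = sample["function"] + "\n\n" + sample["tests"]
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.remove(path)

# Only verified samples enter the annealing mixture:
# verified = [s for s in synthesized if passes_tests(s)]
```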

Post-Training

Open-source Training Data

  • Collect open-source instruction corpora

  • Response quality is assessed, and low-quality responses are regenerated using a robust LLM

Educational Instruction Synthesis (Python only)

  • Following the Magicoder OSS-Instruct recipe

  • they observe that the educational value of the synthesized data largely depends on the quality of the seed data

    • ⇒ use a scorer model to identify high-quality seed data
  • synthesize new instructions from the seed data

  • generate test cases for the problem

    • appended to the code snippets and executed using a Python interpreter
    • only retain data samples that successfully pass the tests
  • Due to a significant amount of outdated package usage in the pre-training data, LLM may sometimes employ methods from older versions of libraries when generating code
  • they synthesize a tool usage instruction tuning dataset using up-to-date external library documentation
    • analyzed commonly used external Python libraries and retrieved API signatures and usage examples for widely used syntax and tools via PyDoc
    • this information was used to prompt a teacher model, which generated accurate and up-to-date question-answer pairs reflecting current usage (a sketch of the documentation retrieval follows this list)
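A sketch of the documentation retrieval step, using the standard `pydoc`/`inspect` modules to build a short signature-plus-docstring card that can be placed in the teacher prompt; `api_card` is a hypothetical helper, not the paper's code.

```python
import inspect
import pydoc

def api_card(obj_path: str) -> str:
    """Return a short signature + docstring snippet for one API, e.g. "json.dumps"."""
    obj = pydoc.locate(obj_path)
    try:
        sig = str(inspect.signature(obj))
    except (TypeError, ValueError):
        sig = "(...)"
    doc = inspect.getdoc(obj) or ""
    return f"{obj_path}{sig}\n{doc[:500]}"

# The cards are inserted into the teacher-model prompt so the generated
# question-answer pairs reflect current library versions.
# print(api_card("json.dumps"))
```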

Large-scale Diverse Instruction Synthesis

  1. An LLM is first used to remove irrelevant context (e.g. advertisements) from the scraped web pages and to select useful sentences as seeds for question generation.
  2. A task-specification module defines programming languages, difficulty levels, and coding task types, using a configuration file for easy customization. The prompt-engineering component employs a template-based system to generate diverse, contextually rich prompts that incorporate real-world scenarios and software-development best practices. They set temperature T = 1.0 for diverse questions. (A hypothetical configuration sketch follows this list.)
  3. A larger, more capable LLM first generates the questions and then the corresponding answers. A validation module combines automated code execution and unit testing to check correctness.
  4. Finally, an LLM refines each response by adding code comments and further explanation.
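Step 2 mentions a configuration file and template-based prompts but does not publish their format, so the sketch below shows one plausible shape; every field name and value is an assumption.

```python
# Hypothetical task-specification config (format not published in the paper).
TASK_SPEC = {
    "languages": ["python", "cpp", "java"],
    "difficulty": ["easy", "medium", "hard"],
    "task_types": ["bug fixing", "algorithm design", "API usage"],
}

PROMPT_TEMPLATE = (
    "You are given the following web snippet:\n{seed}\n\n"
    "Write one {difficulty} {task_type} question in {language}, "
    "grounded in a realistic software-development scenario."
)

def build_prompt(seed, language, difficulty, task_type):
    """Fill the template for one sampled (language, difficulty, task type) combination."""
    return PROMPT_TEMPLATE.format(
        seed=seed, language=language, difficulty=difficulty, task_type=task_type)
```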

Two-Stage Instruction Tuning

  • first stage: broad, theory-oriented instruction data to build general capabilities
  • second stage: high-quality, code-specific, practice-oriented data to refine coding ability
  • ablations are reported to show the two-stage strategy is effective