Tokenization

  • The tokenizer defines several special tokens:

  • <|fim_prefix|>, <|fim_middle|>, and <|fim_suffix|> tokens are used to implement the Fill-in-the-Middle (FIM) technique, where a model predicts the missing parts of a code block. <|fim_pad|> is used for padding during FIM operations.

  • <|repo_name|>, which identifies repository names.

  • <|file_sep|>, used as a file separator to better manage repository-level information.

  • These tokens help the model learn from diverse code structures and handle longer, more complex contexts during both file-level and repo-level pretraining.
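A minimal sketch of how these special tokens could be registered with a Hugging Face tokenizer; Qwen tokenizers already ship with them, so this is illustrative only:

```python
from transformers import AutoTokenizer

# Illustrative checkpoint name; any causal-LM tokenizer would do for this sketch
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B")

special_tokens = [
    "<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>", "<|fim_pad|>",
    "<|repo_name|>", "<|file_sep|>",
]
# add_special_tokens is a no-op for tokens that are already registered
tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})
```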

Data

Pretraining data

Composition

  • Dataset comprises five key data types: Source Code Data, Text-Code Grounding Data, Synthetic Data, Math Data, and Text Data

  • Source Code: public repositories from GitHub, spanning 92 programming languages

  • Text-Code Grounding Data: code-related documentation, tutorials, and blog posts

  • Synthetic Data: CodeQwen1.5 was used to generate large-scale synthetic datasets; only executable code was retained.

  • Math Data: the pre-training corpus from Qwen2.5-Math; mathematical capability supports good code (algorithms, structured reasoning, …)

  • Text Data: pre-training corpus of the Qwen2.5 model

Mixture

Training policy

  • file-level pretraining: 5.2T tokens, individual code files, seq len 8k, next token & FIM

    • <|fim_prefix|>{code_pre}<|fim_suffix|>{code_suf}<|fim_middle|>{code_mid}<|endoftext|>
  • repo-level pretraining: 300B tokens, entire repos, seq len 32k (RoPE base frequency adjusted to 1M + YaRN extrapolation)

    • next token + repo-level FIM
      <|repo_name|>{repo_name}
      <|file_sep|>{file_path1}
      {file_content1}
      <|file_sep|>{file_path2}
      {file_content2}
      <|file_sep|>{file_path3}
      <|fim_prefix|>{code_pre}<|fim_suffix|>{code_suf}<|fim_middle|>{code_mid}<|endoftext|>
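A minimal sketch of assembling one repo-level FIM sample in the template above; the helper name and argument structure are my own:

```python
def build_repo_fim_sample(repo_name, context_files, target_path,
                          code_pre, code_suf, code_mid):
    """Assemble one repo-level FIM training string (illustrative helper).

    context_files: list of (file_path, file_content) pairs kept verbatim;
    the target file is split so the model learns to infill code_mid.
    """
    parts = [f"<|repo_name|>{repo_name}"]
    for path, content in context_files:
        parts.append(f"<|file_sep|>{path}\n{content}")
    parts.append(
        f"<|file_sep|>{target_path}\n"
        f"<|fim_prefix|>{code_pre}<|fim_suffix|>{code_suf}"
        f"<|fim_middle|>{code_mid}<|endoftext|>"
    )
    return "\n".join(parts)
```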

Post-training recipe

Multilingual Programming Code Identification

  • Fine-tune CodeBERT to perform language identification, categorizing documents into ~100 programming languages.
    • randomly discard samples of rare languages (to prevent overfitting?)
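A hedged sketch of the language-identification step: fine-tuning CodeBERT as a sequence classifier over ~100 language labels. The label set and training loop are omitted, and the paper does not publish its exact setup:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_LANGUAGES = 100  # ~100 programming-language classes, per the notes

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=NUM_LANGUAGES
)

def predict_language(code: str) -> int:
    """Classify one code document; assumes the model has been fine-tuned."""
    inputs = tokenizer(code, truncation=True, max_length=512,
                       return_tensors="pt")
    logits = model(**inputs).logits
    return int(logits.argmax(dim=-1))
```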

Instruction Synthesis from GitHub

  • Use an LLM to generate an instruction from code snippets of up to 1024 tokens
  • Use the code LLM to generate a universal code response (i.e. pseudocode)
  • Use an LLM scorer to filter out low-quality pairs
  • This yields an (instruction, universal code) pair
  • Then, for any target language, an (instruction, universal code, solution) triplet can be generated
  • For diversity across multiple languages, they also run a snippet-first variant (see the sketch after this list):
    • take a code snippet → generate pseudocode → create an instruction
    • filter with the LLM scorer
    • keep the resulting (instruction, universal code, solution) triplet
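The sketch below strings these steps together; `llm_generate` and `llm_score` are hypothetical stand-ins for calls to the code LLM and the LLM scorer, and the prompts are abbreviated:

```python
SCORE_THRESHOLD = 7  # hypothetical cutoff for the LLM scorer

def synthesize_pair(code_snippet, llm_generate, llm_score):
    """code snippet -> (instruction, universal code) pair, or None if filtered."""
    instruction = llm_generate(
        f"Write the programming task this code solves:\n{code_snippet}")
    universal = llm_generate(
        f"Answer with language-agnostic pseudocode:\n{instruction}")
    if llm_score(instruction, universal) < SCORE_THRESHOLD:
        return None  # drop low-quality pairs
    return instruction, universal

def to_triplet(instruction, universal, language, llm_generate):
    """Ground the universal (pseudocode) answer into a concrete language."""
    solution = llm_generate(
        f"Task: {instruction}\n"
        f"Implement this pseudocode in {language}:\n{universal}")
    return instruction, universal, solution
```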

Evolving and creating new instructions

  • Main takeaways:

    • They use a multilingual multi-agent collaborative framework to synthesize the multilingual instruction corpora. The agents are language-specific.

    • These agents are initialized with language-specific instruction data derived from the limited existing multilingual instruction corpora ("initialized" here means this is the seed data from which each agent evolves).

    • Each agent is responsible for generating instructions for its specific language; it keeps track of what it has generated using an

      • Adaptive Memory System: Each agent has a memory system that tracks what it has already generated. This prevents repetition, allowing each agent to keep a fresh perspective and avoid creating the same samples repeatedly.
    • Adaptive Instruction Generation: The framework also includes a mechanism to dynamically generate new instructions based on identified knowledge gaps across languages.

      • This can be done by consistently keeping track of a fixed set of concepts/tags and identifying gaps in the overall observed distribution
    • The collaborative part:

      • Multiple language-specific agents engage in a structured dialogue to formulate new instructions and solutions.
      • Agents share insights and patterns across language boundaries, fostering a more comprehensive understanding of programming concepts
  • What to keep track of in your instruction dataset

    • general programming concepts (patterns, algorithms, …)

    • language-specific concepts (Python decorators, Scala pattern matching, …)

    • Keeping track of such tags gives a structured view over your corpus (a minimal memory sketch follows)
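A minimal sketch of such an adaptive memory, assuming a fixed tag inventory and a simple under-coverage rule (both assumptions are mine; tag extraction is left abstract):

```python
from collections import Counter

class AdaptiveMemory:
    """Tracks concept tags of generated instructions and flags coverage gaps."""

    def __init__(self, known_tags):
        self.known_tags = set(known_tags)  # fixed concept/tag inventory
        self.counts = Counter()

    def record(self, instruction_tags):
        """Register the tags of a newly generated instruction."""
        self.counts.update(t for t in instruction_tags if t in self.known_tags)

    def gaps(self, min_count=5):
        """Tags seen fewer than min_count times: targets for new instructions."""
        return [t for t in self.known_tags if self.counts[t] < min_count]

memory = AdaptiveMemory(["dynamic programming", "Python decorators",
                         "Scala pattern matching"])
memory.record(["dynamic programming"])
print(memory.gaps())  # under-covered concepts to generate next
```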

How to score code instruction data (checklist-based)

  1. Question&Answer Consistency: Whether Q&A are consistent and correct for fine-tuning.
  2. Question&Answer Relevance: Whether Q&A are related to the computer field.
  3. Question&Answer Difficulty: Whether Q&A are sufficiently challenging.
  4. Code Exist: Whether code is provided in the question or answer.
  5. Code Correctness: Evaluate whether the provided code is free from syntax errors and logical flaws; consider factors like proper variable naming, code indentation, and adherence to best practices.
  6. Code Clarity: Assess how clear and understandable the code is. Evaluate if it uses meaningful variable names, proper comments, and follows a consistent coding style.
  7. Code Comments: Evaluate the presence of comments and their usefulness in explaining the code's functionality.
  8. Easy to Learn: Determine its educational value for a student whose goal is to learn basic coding concepts.
  • After obtaining all scores $s_1, \dots, s_8$, the final score is $s = \sum_{i=1}^{8} w_i s_i$, where the $w_i$ are a series of pre-defined weights.
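In code, the final score is a plain weighted sum; the weight values below are illustrative, not the paper's:

```python
weights = [0.2, 0.1, 0.15, 0.1, 0.15, 0.1, 0.1, 0.1]  # illustrative w_i (sum to 1)
scores  = [9, 8, 6, 10, 7, 8, 5, 9]                   # s_i from the checklist above

final_score = sum(w * s for w, s in zip(weights, scores))
```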

Multilingual sandbox for code verification

  • Only self-contained code snippets (e.g. algorithm problems) are fed into the multilingual sandbox.

  • What it does:

    • Abstract Syntax Tree parsing & static analysis module
      • can filter out code with parsing errors (unclosed brackets, missing keywords, …)
    • Unit test generator
      • Analyzes sample code to identify key functionalities and edge cases
      • Automatically generates unit tests based on the expected behavior
    • Code Execution Engine
      • isolated environment for executing code snippets securely (handles resource allocation and timeout mechanisms)
      • parallel execution of test cases
    • Result Analyzer
      • Compares the output of code snippets against expected results from the unit tests
      • Generates detailed reports on test case successes and failures
      • Provides suggestions for improvements based on failed test cases
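A minimal sketch of the static-check and execution stages for Python inputs; real sandboxing (resource limits, process isolation) is omitted:

```python
import ast
import subprocess
import sys

def passes_static_check(code: str) -> bool:
    """Stage 1: filter out code with parsing errors (unclosed brackets, ...)."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def run_in_sandbox(code: str, timeout_s: float = 5.0):
    """Simplified execution engine: separate process with a timeout."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return result.returncode == 0, result.stdout, result.stderr
    except subprocess.TimeoutExpired:
        return False, "", "timeout"
```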

Final mixture

  • Coarse-to-fine Fine-tuning
    • First stage: tens of millions of low-quality but diverse synthesized instruction samples are used to fine-tune the base model.
    • Second stage: millions of high-quality instruction samples improve the instruction model via rejection sampling and supervised fine-tuning.
      • For the same query, they generate multiple candidate responses, score them, and keep the best one for supervised fine-tuning
  • Mixed tuning
    • They use a combination of FIM (to keep the long-context capability of the base model) and standard SFT
    • For FIM, they parse code snippets with an Abstract Syntax Tree and extract basic logic blocks as the middle code to infill (a sketch follows)
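A sketch of the AST-based FIM construction for Python sources, assuming (my simplification) that the first function body found is the logic block to infill:

```python
import ast

def make_ast_fim_sample(source: str):
    """Split source into prefix/middle/suffix around a function body."""
    tree = ast.parse(source)
    funcs = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    if not funcs:
        return None
    node = funcs[0]  # simplification: take the first function found
    lines = source.splitlines(keepends=True)
    start = node.body[0].lineno - 1  # first statement of the body (1-indexed)
    end = node.end_lineno            # last line of the function
    prefix = "".join(lines[:start])
    middle = "".join(lines[start:end])
    suffix = "".join(lines[end:])
    return (f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}"
            f"<|fim_middle|>{middle}<|endoftext|>")
```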