Summary

  • Idea: craft diverse synthetic instruction data for code, using abundant open-source code as reference material

Detailed

Instruction sample generation

  • For each code document from the corpus, they randomly extract 1-15 consecutive lines as the seed snippet.
  • The seed snippet is fed into the following prompt template:
You are exceptionally skilled at crafting high-quality programming problems and offering precise solutions.

Please gain inspiration from the following random code snippet to create a
high-quality programming problem. Present your output in two distinct sections:
[Problem Description] and [Solution].
Code snippet for inspiration:
"```"
{code}
"```"
Guidelines for each section:
1. [Problem Description]: This should be **completely self-contained**, providing all the contextual information one needs to understand and solve the problem. Assume common programming knowledge, but ensure that any specific context, variables, or code snippets pertinent to this problem are explicitly included.
2. [Solution]: Offer a comprehensive, **correct** solution that accurately
addresses the [Problem Description] you provided.
  • They use greedy decoding when generating from this prompt
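
The seed extraction and prompt filling described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names (`extract_seed_snippet`, `build_prompt`), the uniform sampling of snippet length and start position, and the abridged template are all assumptions made for the example. The notes only state that 1-15 consecutive lines are picked at random and inserted into the template.

```python
import random

# Built programmatically so the fence characters do not clash with this block.
FENCE = "`" * 3

# Abridged version of the prompt quoted above; the per-section guidelines
# are omitted here for brevity.
PROMPT_TEMPLATE = (
    "You are exceptionally skilled at crafting high-quality programming "
    "problems and offering precise solutions.\n\n"
    "Please gain inspiration from the following random code snippet to create a "
    "high-quality programming problem. Present your output in two distinct "
    "sections: [Problem Description] and [Solution].\n"
    "Code snippet for inspiration:\n"
    "{fence}\n{code}\n{fence}\n"
)


def extract_seed_snippet(document: str, rng: random.Random,
                         min_lines: int = 1, max_lines: int = 15) -> str:
    """Return a random run of consecutive lines from `document`.

    Assumption: both the run length and the start offset are drawn uniformly.
    """
    lines = document.splitlines()
    if not lines:
        return ""
    # Clamp the bounds so short documents still yield a valid snippet.
    hi = min(max_lines, len(lines))
    lo = min(min_lines, hi)
    n = rng.randint(lo, hi)                   # inclusive on both ends
    start = rng.randint(0, len(lines) - n)    # last valid start offset
    return "\n".join(lines[start:start + n])


def build_prompt(document: str, rng: random.Random) -> str:
    """Fill the (abridged) template with a freshly sampled seed snippet."""
    snippet = extract_seed_snippet(document, rng)
    return PROMPT_TEMPLATE.format(code=snippet, fence=FENCE)


if __name__ == "__main__":
    doc = "\n".join(f"x{i} = {i}" for i in range(40))
    print(build_prompt(doc, random.Random(0)))
```

The resulting prompt would then be sent to the generator model with greedy decoding; that call is left out because the notes do not specify the model API.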

Keeping track of the content of the generated data