For each code document from the corpus, they randomly extract 1-15 consecutive lines as the seed snippet.
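The extraction step can be sketched as below. This is a minimal illustration, not the authors' code: the function name `extract_seed_snippet`, its parameters, and the choice to sample the length and start position uniformly are all assumptions.

```python
import random


def extract_seed_snippet(
    document: str,
    rng: random.Random,
    min_lines: int = 1,
    max_lines: int = 15,
) -> str:
    """Return a random run of consecutive lines from `document`.

    Hypothetical helper: the length is drawn uniformly from
    [min_lines, max_lines], capped at the document's line count,
    and the start offset is drawn uniformly among valid positions.
    """
    lines = document.splitlines()
    if not lines:
        return ""
    length = rng.randint(min_lines, min(max_lines, len(lines)))
    start = rng.randint(0, len(lines) - length)
    return "\n".join(lines[start:start + length])


if __name__ == "__main__":
    doc = "\n".join(f"line {i}" for i in range(40))
    print(extract_seed_snippet(doc, random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible across runs.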
The seed snippet is fed into the following prompt template:
You are exceptionally skilled at crafting high-quality programming problems and offering precise solutions.
Please gain inspiration from the following random code snippet to create a high-quality programming problem. Present your output in two distinct sections: [Problem Description] and [Solution].
Code snippet for inspiration:
"```"
{code}
"```"
Guidelines for each section:
1. [Problem Description]: This should be **completely self-contained**, providing all the contextual information one needs to understand and solve the problem. Assume common programming knowledge, but ensure that any specific context, variables, or code snippets pertinent to this problem are explicitly included.
2. [Solution]: Offer a comprehensive, **correct** solution that accurately addresses the [Problem Description] you provided.
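Filling the template with a seed snippet could look like the sketch below. The names `PROMPT_TEMPLATE`, `FENCE`, and `build_prompt` are hypothetical, and the template text is transcribed from the notes above. Plain string replacement is used instead of `str.format` so that braces inside the seed code are left untouched.

```python
# Build the code fence at runtime so this example contains no literal fence.
FENCE = "`" * 3

# Template text transcribed from the notes; {code} marks the seed snippet slot.
PROMPT_TEMPLATE = (
    "You are exceptionally skilled at crafting high-quality programming "
    "problems and offering precise solutions.\n"
    "Please gain inspiration from the following random code snippet to create "
    "a high-quality programming problem. Present your output in two distinct "
    "sections: [Problem Description] and [Solution].\n"
    "Code snippet for inspiration:\n"
    f"{FENCE}\n{{code}}\n{FENCE}\n"
    "Guidelines for each section:\n"
    "1. [Problem Description]: This should be **completely self-contained**, "
    "providing all the contextual information one needs to understand and "
    "solve the problem. Assume common programming knowledge, but ensure that "
    "any specific context, variables, or code snippets pertinent to this "
    "problem are explicitly included.\n"
    "2. [Solution]: Offer a comprehensive, **correct** solution that "
    "accurately addresses the [Problem Description] you provided.\n"
)


def build_prompt(seed_snippet: str) -> str:
    """Insert the seed snippet into the template (hypothetical helper)."""
    return PROMPT_TEMPLATE.replace("{code}", seed_snippet)


if __name__ == "__main__":
    print(build_prompt("counts = {word: 0 for word in words}"))
```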
They use greedy decoding to generate the output.
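For readers unfamiliar with the term, greedy decoding picks the highest-scoring next token at every step, so the same prompt always yields the same output. The sketch below illustrates the idea on a toy lookup table; `greedy_decode`, `TOY_SCORES`, and `toy_model` are hypothetical stand-ins for a real language model, not anything from the source.

```python
from typing import Callable, Dict, List


def greedy_decode(
    next_token_scores: Callable[[List[str]], Dict[str, float]],
    prompt: List[str],
    eos: str = "<eos>",
    max_new_tokens: int = 32,
) -> List[str]:
    """Repeatedly append the single highest-scoring next token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)  # argmax, no sampling
        if best == eos:
            break
        tokens.append(best)
    return tokens[len(prompt):]


# Toy "model": scores depend only on the previous token (an assumption
# made purely to keep the example small).
TOY_SCORES = {
    "def": {"solve": 0.9, "main": 0.1},
    "solve": {"(": 0.8, "<eos>": 0.2},
    "(": {")": 0.7, "x": 0.3},
    ")": {"<eos>": 1.0},
}


def toy_model(tokens: List[str]) -> Dict[str, float]:
    return TOY_SCORES[tokens[-1]]


if __name__ == "__main__":
    print(greedy_decode(toy_model, ["def"]))
```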
Keeping track of the content of the generated data
They use the Instructor embedding model to sort the generated data into 10 manually designed categories specific to coding.
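One way to picture embedding-based categorization is nearest-category matching by cosine similarity, sketched below. Everything here is an assumption for illustration: the bag-of-words `embed` function is a toy stand-in for the Instructor model, the three entries in `CATEGORIES` are invented placeholders (the notes do not list the actual 10 categories), and the matching rule is not confirmed by the source.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical category descriptions, NOT the categories from the source.
CATEGORIES = {
    "algorithms": "sorting searching graph dynamic programming algorithm",
    "web": "http server request html web api",
    "data": "database sql table query dataframe",
}


def categorize(problem: str) -> str:
    """Assign a problem to the category whose description is most similar."""
    vec = embed(problem)
    return max(CATEGORIES, key=lambda name: cosine(vec, embed(CATEGORIES[name])))


def category_counts(problems: List[str]) -> Counter:
    """Tally how many generated problems fall into each category."""
    return Counter(categorize(p) for p in problems)


if __name__ == "__main__":
    print(category_counts([
        "Write a sql query over a table",
        "Implement a graph searching algorithm",
        "Build an http server",
    ]))
```

Swapping `embed` for a real sentence-embedding model would leave the rest of the sketch unchanged, since only the vector representation differs.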