Guide on how to train new models on an existing codebase? #74
-
Sure. Models trained using CodeGen or GPT-J from HuggingFace can be used with FauxPilot by running them through the conversion scripts in the FauxPilot repository. To prepare a dataset to train on, see the rough sketch below; there are also a lot of good resources on the HuggingFace web site. As a warning, fine-tuning or training large models (like CodeGen 16B) takes a lot of GPU resources: we fine-tuned a 16B model on Verilog code, and it took 3xA100 GPUs with 80GB of VRAM each running for six days to do one pass over the 400MB dataset.
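To make that concrete, here is a minimal sketch of what dataset preparation and a fine-tuning pass can look like with HuggingFace `datasets` and `transformers`. The checkpoint, corpus path, and hyperparameters are illustrative assumptions rather than anything from this thread, and a run at CodeGen-16B scale would additionally need multi-GPU tooling such as DeepSpeed or FSDP.

```python
# Minimal fine-tuning sketch; checkpoint, corpus path, and hyperparameters are assumptions.
# Shown with the 350M CodeGen checkpoint so it runs on modest hardware; a 16B model needs
# multi-GPU tooling on top of this.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token          # CodeGen has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Dataset prep: load source files as text and tokenize. A real pipeline would concatenate
# files and chunk them into fixed-length blocks instead of training on individual lines.
raw = load_dataset("text", data_files={"train": "my_code_corpus/*.py"})  # hypothetical corpus
raw = raw.filter(lambda ex: ex["text"].strip())    # drop blank lines
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="codegen-finetuned",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,            # a single pass over the data, as described above
        fp16=True,
    ),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("codegen-finetuned")  # then convert the result for FauxPilot
```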
-
Ah, I see. It's doable, you'd just need to be using something pretty heavy duty like several RTX 4090s, and it may take a week for each pass over the data. Are there any specific links or videos for Hugging Face that you would recommend?
-
I have a question related to this (forgive my ML ignorance). Instead of training an entirely new model, is it possible to "tweak" the existing model? Something I have found myself wishing for a lot with Copilot is the ability to point it towards a folder containing repos of projects for a particular domain, so that it "gains knowledge" of that domain and offers better completions. For example, when writing a database, pointing it to a folder containing the source code of Postgres, SQLite, BerkeleyDB, etc. I know the source code of these projects is inside Copilot's dataset, but I want the "relevance" to be weighted higher. Something like:

    {
      "fauxpilot.project_additional_training_code": "/copilot/datasets/relational-databases"
    }

One hack I've been using for this is to keep multiple tabs open that contain the source code of similar projects so that they are in the model's "context" (or whatever you call it). Is there any way to accomplish something like this?
-
Tweaking the model on a small amount of code is what's called fine-tuning; it basically updates the model just a little bit based on a much smaller amount of data (the actual mechanics of the training process are identical to training from scratch, but fine-tuning takes much less time since it starts out much closer to the goal and doesn't have as much data to work on). Unfortunately, fine-tuning takes just as much VRAM as training from scratch, which makes it hard to do on a personal machine. There are some architectures out there like Memorizing Transformers (https://arxiv.org/abs/2203.08913) or RETRO (https://arxiv.org/abs/2112.04426) that would be able to incorporate new information without being retrained at all, but I don't know of anyone who has made a code model that makes use of this.
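This is not RETRO or Memorizing Transformers themselves, but the underlying idea of pulling in external knowledge can be roughly approximated at the prompt level with plain retrieval: index the domain code (the folder from the question above), then prepend the most similar snippets to each completion request, with no retraining at all. Everything in the sketch below (paths, chunk size, scikit-learn TF-IDF as the retriever, the `.c` extension) is an assumption for illustration.

```python
# Rough sketch of retrieval at the prompt, not RETRO or Memorizing Transformers:
# index a folder of domain code, pull the most similar chunks for the current file,
# and prepend them to the completion prompt. No retraining involved.
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_index(code_dir, chunk_lines=40):
    """Split every .c file under code_dir into line-based chunks and TF-IDF index them."""
    chunks = []
    for path in Path(code_dir).rglob("*.c"):          # extension is an assumption
        lines = path.read_text(errors="ignore").splitlines()
        for i in range(0, len(lines), chunk_lines):
            chunks.append("\n".join(lines[i:i + chunk_lines]))
    vectorizer = TfidfVectorizer(max_features=50000)
    matrix = vectorizer.fit_transform(chunks)
    return vectorizer, matrix, chunks

def augment_prompt(query_code, vectorizer, matrix, chunks, k=3, budget_chars=4000):
    """Prepend the k most similar indexed chunks to the prompt, within a rough size budget."""
    sims = cosine_similarity(vectorizer.transform([query_code]), matrix)[0]
    best = sims.argsort()[::-1][:k]
    context = "\n\n".join(chunks[i] for i in best)[:budget_chars]
    return f"# Related code from the domain corpus:\n{context}\n\n{query_code}"

# Usage: index e.g. /copilot/datasets/relational-databases once, then call
# augment_prompt(current_file_text, *build_index(...)) before sending the prompt to the model.
```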
-
@moyix I feel like most projects aren't large enough to be worth fine-tuning on. IMO it would be better to extend the context length and then put more of the project into the prompt. lucidrains says that one can replace the attention mechanism with Performer attention and fine-tune, so it seems like one could fine-tune the 16B model on sequences of length 8192 to get "16B-long". Then a preprocessing step could follow imports and inline relevant parts of the project into the context (sketched below). That way we'd only have to fine-tune once.
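As a rough illustration of that preprocessing step, the sketch below follows a Python file's imports and inlines any local modules it finds ahead of the file being completed. The resolution logic is deliberately naive (top-level `foo.py` only, Python only), and the function name, paths, and character budget are assumptions.

```python
# Naive sketch of "follow imports and inline relevant parts of the project into the context".
# Resolves `import foo` / `from foo import ...` to foo.py under the project root and
# prepends those files to the prompt, up to a size budget.
import ast
from pathlib import Path

def inline_local_imports(file_path, project_root, budget_chars=20000):
    source = Path(file_path).read_text()
    root = Path(project_root)
    pieces, used = [], 0
    for node in ast.walk(ast.parse(source)):
        names = []
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        for name in names:
            candidate = root / (name.replace(".", "/") + ".py")
            if candidate.exists():
                text = candidate.read_text()
                if used + len(text) > budget_chars:
                    continue  # stay within the (assumed) long-context budget
                pieces.append(f"# --- inlined from {candidate.name} ---\n{text}")
                used += len(text)
    return "\n\n".join(pieces + [source])

# Usage (hypothetical paths): prompt = inline_local_imports("app/views.py", "app/")
```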

-
This may be a super loaded topic, but I was wondering if there are any resources on training a custom model for software teams that are restricted in which licenses they are allowed to use for their software. Meaning, since they don't know the licensing of the software the model samples from, they aren't allowed to use it.