Guide on how to train new models on an existing codebase? #74
-
Sure. Models trained using CodeGen or GPT-J from HuggingFace can be used with FauxPilot by running them through the conversion scripts in the FauxPilot repository. To prepare a dataset to train on, see the rough sketch below; there are also a lot of good resources on the HuggingFace web site. As a warning, fine-tuning or training large models (like CodeGen 16B) takes a lot of GPU resources: we fine-tuned a 16B model on Verilog code, and it took 3xA100 GPUs with 80GB of VRAM each running for six days to do one pass over the 400MB dataset.
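To make that concrete, here is a minimal sketch of what dataset preparation and a fine-tuning pass can look like with HuggingFace `datasets` and `transformers`. The checkpoint, corpus path, and hyperparameters are illustrative assumptions rather than anything from this thread, and a run at CodeGen-16B scale would additionally need multi-GPU tooling such as DeepSpeed or FSDP.

```python
# Minimal fine-tuning sketch; checkpoint, corpus path, and hyperparameters are assumptions.
# Shown with the 350M CodeGen checkpoint so it runs on modest hardware; a 16B model needs
# multi-GPU tooling on top of this.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token          # CodeGen has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Dataset prep: load source files as text and tokenize. A real pipeline would concatenate
# files and chunk them into fixed-length blocks instead of training on individual lines.
raw = load_dataset("text", data_files={"train": "my_code_corpus/*.py"})  # hypothetical corpus
raw = raw.filter(lambda ex: ex["text"].strip())    # drop blank lines
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="codegen-finetuned",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,            # a single pass over the data, as described above
        fp16=True,
    ),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("codegen-finetuned")  # then convert the result for FauxPilot
```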
-
Ah, I see. It's doable, you'd just need to be using something pretty heavy duty like several RTX 4090s, and it may take a week for each pass over the data. Are there any specific links or videos for Hugging Face that you would recommend?
-
I have a question related to this (forgive my ML ignorance). Instead of training an entirely new model, is it possible to "tweak" the existing model? Something I have found myself wishing for a lot with Copilot is the ability to point it towards a folder containing repos of projects for a particular domain, so that it "gains knowledge" of that domain and offers better completions. For example, when writing a database, pointing it to a folder containing the source code of Postgres, SQLite, BerkeleyDB, etc. I know the source code of these projects is inside Copilot's dataset, but I want the "relevance" to be weighted higher. Something like:

    {
      "fauxpilot.project_additional_training_code": "/copilot/datasets/relational-databases"
    }

One hack I've been using for this is to keep multiple tabs open that contain the source code of similar projects so that they are in the model's "context" (or whatever you call it). Is there any way to accomplish something like this?
-
Tweaking the model on a small amount of code is what's called fine-tuning; it basically updates the model just a little bit based on a much smaller amount of data (the actual mechanics of the training process are identical to training from scratch, but fine-tuning takes much less time since it starts out much closer to the goal and doesn't have as much data to work on). Unfortunately, fine-tuning takes just as much VRAM as training from scratch, which makes it hard to do on a personal machine. There are some architectures out there like Memorizing Transformers (https://arxiv.org/abs/2203.08913) or RETRO (https://arxiv.org/abs/2112.04426) that would be able to incorporate new information without being retrained at all, but I don't know of anyone who has made a code model that makes use of this.
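This is not RETRO or Memorizing Transformers themselves, but the underlying idea of pulling in external knowledge can be roughly approximated at the prompt level with plain retrieval: index the domain code (the folder from the question above), then prepend the most similar snippets to each completion request, with no retraining at all. Everything in the sketch below (paths, chunk size, scikit-learn TF-IDF as the retriever, the `.c` extension) is an assumption for illustration.

```python
# Rough sketch of retrieval at the prompt, not RETRO or Memorizing Transformers:
# index a folder of domain code, pull the most similar chunks for the current file,
# and prepend them to the completion prompt. No retraining involved.
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_index(code_dir, chunk_lines=40):
    """Split every .c file under code_dir into line-based chunks and TF-IDF index them."""
    chunks = []
    for path in Path(code_dir).rglob("*.c"):          # extension is an assumption
        lines = path.read_text(errors="ignore").splitlines()
        for i in range(0, len(lines), chunk_lines):
            chunks.append("\n".join(lines[i:i + chunk_lines]))
    vectorizer = TfidfVectorizer(max_features=50000)
    matrix = vectorizer.fit_transform(chunks)
    return vectorizer, matrix, chunks

def augment_prompt(query_code, vectorizer, matrix, chunks, k=3, budget_chars=4000):
    """Prepend the k most similar indexed chunks to the prompt, within a rough size budget."""
    sims = cosine_similarity(vectorizer.transform([query_code]), matrix)[0]
    best = sims.argsort()[::-1][:k]
    context = "\n\n".join(chunks[i] for i in best)[:budget_chars]
    return f"# Related code from the domain corpus:\n{context}\n\n{query_code}"

# Usage: index e.g. /copilot/datasets/relational-databases once, then call
# augment_prompt(current_file_text, *build_index(...)) before sending the prompt to the model.
```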
-
@moyix I feel like most projects aren't large enough to be worth fine-tuning on. IMO it would be better to extend the context length and then put more of the project into the prompt. lucidrains says that one can replace the attention mechanism with Performer attention and fine-tune, so it seems like one could fine-tune the 16B model on sequences of length 8192 to get "16B-long". Then a preprocessing step could follow imports and inline relevant parts of the project into the context (sketched below). That way we'd only have to fine-tune once.
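As a rough illustration of that preprocessing step, the sketch below follows a Python file's imports and inlines any local modules it finds ahead of the file being completed. The resolution logic is deliberately naive (top-level `foo.py` only, Python only), and the function name, paths, and character budget are assumptions.

```python
# Naive sketch of "follow imports and inline relevant parts of the project into the context".
# Resolves `import foo` / `from foo import ...` to foo.py under the project root and
# prepends those files to the prompt, up to a size budget.
import ast
from pathlib import Path

def inline_local_imports(file_path, project_root, budget_chars=20000):
    source = Path(file_path).read_text()
    root = Path(project_root)
    pieces, used = [], 0
    for node in ast.walk(ast.parse(source)):
        names = []
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        for name in names:
            candidate = root / (name.replace(".", "/") + ".py")
            if candidate.exists():
                text = candidate.read_text()
                if used + len(text) > budget_chars:
                    continue  # stay within the (assumed) long-context budget
                pieces.append(f"# --- inlined from {candidate.name} ---\n{text}")
                used += len(text)
    return "\n\n".join(pieces + [source])

# Usage (hypothetical paths): prompt = inline_local_imports("app/views.py", "app/")
```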

-
This may be a super loaded topic, but I was wondering if there are any resources on training a custom model for software teams that are restricted in which licenses they are allowed to use for their software. Meaning, since they don't know the licensing of the software the model samples from, they aren't allowed to use it.