## Model Description

GPT-Neo-125M-Code-Clippy-Dedup is a [GPT-Neo-125M model](https://huggingface.co/EleutherAI/gpt-neo-125M) finetuned using causal language modeling on our deduplicated version of the Code Clippy Data dataset, which was scraped from public GitHub repositories (more information in the provided link). This model is specialized to autocomplete methods in multiple programming languages.
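Because this is a standard causal language model checkpoint, it can be loaded with the `transformers` library for quick autocompletion experiments. The snippet below is a minimal sketch, not part of the original card; the checkpoint id is an assumption and should be replaced with this repository's actual Hub id.

```python
# Minimal autocompletion sketch (not from the original card). The checkpoint
# id below is assumed; replace it with this repository's actual Hub id.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "flax-community/gpt-neo-125M-code-clippy-dedup"  # assumed id
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = "def fibonacci(n):\n    "
inputs = tokenizer(prompt, return_tensors="pt")
# Greedy decoding is enough for a short autocomplete demo.
outputs = model.generate(**inputs, max_new_tokens=48, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```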

## Training data

The model was finetuned on our deduplicated version of the [Code Clippy Data dataset](https://huggingface.co/datasets/code_search_net).

## Training procedure

To stabilize training, we limited the training data to files whose extensions correspond to popular programming languages, since the dataset also contains other file types, such as `.txt` and project configuration files; a sketch of this kind of extension filter is shown below.
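The following is a minimal illustration of such a filter, not the project's actual preprocessing code; the extension whitelist is a hypothetical example, since the exact extensions used are not listed in this card.

```python
import os

# Hypothetical whitelist for illustration only; the card does not list the
# exact extensions that were used for filtering.
POPULAR_EXTENSIONS = {".py", ".js", ".ts", ".java", ".c", ".cpp", ".go", ".rb", ".rs"}

def keep_file(path: str) -> bool:
    """Return True if the file's extension is in the whitelist."""
    return os.path.splitext(path)[1].lower() in POPULAR_EXTENSIONS

print(keep_file("src/model.py"))  # True
print(keep_file("notes.txt"))     # False
```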

The training script used to train this model can be found [here](https://github.com/ncoop57/gpt-code-clippy/blob/camera-ready/training/run_clm_apps.py).

The model was trained with the following command:
```bash
./run_clm_streaming_flax.py \
    --output_dir $HOME/gpt-neo-125M-code-clippy \
    --model_name_or_path="flax-community/gpt-neo-125M-code-clippy" \
    --dataset_name $HOME/gpt-code-clippy/code_clippy.py \
    --data_dir /home/shared/code_clippy_data \
    --text_column_name="text" \
    --do_train --do_eval \
    --block_size="2048" \
    --per_device_train_batch_size="8" \
    --per_device_eval_batch_size="16" \
    --preprocessing_num_workers="8" \
    --learning_rate="1e-4" \
    --max_steps 100000 \
    --warmup_steps 2500 \
    --decay_steps 25000 \
    --adam_beta1="0.9" \
    --adam_beta2="0.95" \
    --weight_decay="0.1" \
    --overwrite_output_dir \
    --logging_steps="100" \
    --eval_steps="500" \
    --push_to_hub="False" \
    --dtype="bfloat16" \
    --skip_memory_metrics="True" \
    --save_steps="500" \
    --save_total_limit 10 \
    --gradient_accumulation_steps 16 \
    --report_to="wandb" \
    --run_name="125m_1e-4lr_1024bs" \
    --max_eval_samples 2000 \
    --save_optimizer true
```
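For reference, the `1024bs` in the run name is consistent with an effective batch size of 8 (per-device batch) × 16 (gradient accumulation steps) × 8 (devices) = 1024; the eight-device count is an assumption (e.g., a TPU v3-8), not something stated in this card.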

## Intended Use and Limitations

The model is finetuned on text files from GitHub repositories (mostly programming languages, but also markdown and other project-related files).