NamCyan committed on
Commit ccf8da6
1 Parent(s): 2c543e8

first commit

README.md ADDED
@@ -0,0 +1,131 @@
+ ---
+ language:
+ - code
+ - en
+ task_categories:
+ - text-classification
+ metrics:
+ - accuracy
+ widget:
+ - text: |-
+     Sum two integers</s></s>def sum(a, b):
+         return a + b
+   example_title: Simple toy
+ - text: |-
+     Look for methods that might be dynamically defined and define them for lookup.</s></s>def respond_to_missing?(name, include_private = false)
+       if name == :to_ary || name == :empty?
+         false
+       else
+         return true if mapping(name).present?
+         mounting = all_mountings.find{ |mount| mount.respond_to?(name) }
+         return false if mounting.nil?
+       end
+     end
+   example_title: Ruby example
+ - text: |-
+     Method that adds a candidate to the party @param c the candidate that will be added to the party</s></s>public void addCandidate(Candidate c)
+     {
+         this.votes += c.getVotes();
+         candidates.add(c);
+     }
+   example_title: Java example
+ - text: |-
+     we do not need Buffer pollyfill for now</s></s>function(str){
+         var ret = new Array(str.length), len = str.length;
+         while(len--) ret[len] = str.charCodeAt(len);
+         return Uint8Array.from(ret);
+     }
+   example_title: JavaScript example
+
+ pipeline_tag: text-classification
+ ---
+
+
+
+ ## Table of Contents
+ - [Model Description](#model-description)
+ - [Model Details](#model-details)
+ - [Usage](#usage)
+ - [Limitations](#limitations)
+ - [Additional Information](#additional-information)
+   - [Licensing Information](#licensing-information)
+   - [Citation Information](#citation-information)
+
+
+ ## Model Description
+
+ This model is based on [CodeBERT](https://github.com/microsoft/CodeBERT) and trained on a 5M-example subset of [The Vault](https://huggingface.co/datasets/Fsoft-AIC/thevault-function-level) to detect inconsistencies between a docstring/comment and its function. It is used to remove noisy examples from The Vault dataset.
+
+ More information:
+ - **Repository:** [FSoft-AI4Code/TheVault](https://github.com/FSoft-AI4Code/TheVault)
+ - **Paper:** The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
+ - **Contact:** [email protected]
+
+
+ ## Model Details
+ * Developed by: [Fsoft AI Center](https://www.fpt-aicenter.com/ai-residency/)
+ * License: None specified
+ * Model type: Transformer-encoder-based language model
+ * Architecture: BERT-base
+ * Dataset: [The Vault](https://huggingface.co/datasets/Fsoft-AIC/thevault-function-level)
+ * Tokenizer: Byte-Pair Encoding
+ * Vocabulary size: 50265
+ * Sequence length: 512
+ * Languages: English and 10 programming languages (Python, Java, JavaScript, PHP, C#, C, C++, Go, Rust, Ruby)
+ * Training details (see the sketch below):
+   * Self-supervised learning, binary classification
+   * Positive class: the original code-docstring pair
+   * Negative class: the docstring randomly paired with another function's code
+
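+ The negative-class construction can be illustrated with a minimal sketch. It assumes a list of `(docstring, code)` records; the `make_training_pairs` helper, the seed, and the sampling details are illustrative, not the actual training code. Labels follow the model config (0 = Inconsistency, 1 = Consistency).
+ ```python
+ import random
+
+ def make_training_pairs(records, seed=42):
+     """Build binary-classification examples from (docstring, code) records.
+
+     Positive examples (label 1, "Consistency") keep the original pairing;
+     negative examples (label 0, "Inconsistency") pair each docstring with
+     a randomly chosen *other* function's code.
+     """
+     rng = random.Random(seed)
+     pairs = []
+     for i, (docstring, code) in enumerate(records):
+         # Positive: the original docstring-code pair, in the model's input template.
+         pairs.append((f"<s>{docstring}</s></s>{code}</s>", 1))
+         # Negative: the same docstring paired with code drawn from another record.
+         j = rng.randrange(len(records) - 1)
+         if j >= i:
+             j += 1  # skip index i so the sampled code never matches the docstring
+         pairs.append((f"<s>{docstring}</s></s>{records[j][1]}</s>", 0))
+     return pairs
+ ```
+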
+ ## Usage
+ The input to the model follows the template below:
+ ```python
+ """
+ Template:
+ <s>{docstring}</s></s>{code}</s>
+
+ Example:
+ from transformers import AutoTokenizer
+
+ # Load the tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")
+
+ input = "<s>Sum two integers</s></s>def sum(a, b):\n    return a + b</s>"
+ tokenized_input = tokenizer(input, add_special_tokens=False)
+ """
+ ```
+
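+ Equivalently, the tokenizer can assemble this template itself when the docstring and code are passed as a sentence pair (as the bundled test.ipynb does); the RoBERTa tokenizer inserts `<s> ... </s></s> ... </s>` automatically:
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")
+
+ docstring = "Sum two integers"
+ code = "def sum(a, b):\n    return a + b"
+
+ # Sentence-pair encoding reproduces the <s>{docstring}</s></s>{code}</s> template.
+ encoded = tokenizer(docstring, code, truncation=True, max_length=512, return_tensors="pt")
+ ```
+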
+ Using the model with JAX:
+ ```python
+ from transformers import AutoTokenizer, FlaxAutoModelForSequenceClassification
+
+ # Load the JAX model
+ model = FlaxAutoModelForSequenceClassification.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")
+ ```
+
+ Using the model with PyTorch:
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ # Load the PyTorch model
+ model = AutoModelForSequenceClassification.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")
+ ```
+
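+ A full classification pass can then be run with the `text-classification` pipeline, mirroring the bundled handler.py and test.ipynb. A minimal sketch (the example input and the printed label are illustrative):
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
+
+ model_id = "Fsoft-AIC/Codebert-docstring-inconsistency"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForSequenceClassification.from_pretrained(model_id)
+ classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
+
+ # Docstring and code joined with the </s></s> separator, as in the template above;
+ # the pipeline's tokenizer adds the outer <s> ... </s> special tokens itself.
+ inputs = "Sum two integers</s></s>def sum(a, b):\n    return a + b"
+ print(classifier(inputs))
+ # e.g. [{'label': 'Consistency', 'score': ...}]
+ ```
+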
+ ## Limitations
+ This model is trained on a 5M-example subset of The Vault in a self-supervised manner. Since the negative samples are generated artificially, the model's ability to identify instances that require a strong semantic understanding of the relationship between the code and the docstring may be limited.
+
+ The model is also hard to evaluate because no labeled dataset is available. ChatGPT was adopted as a reference to measure the correlation between the model's scores and ChatGPT's; however, this result could be influenced by ChatGPT's potential biases and ambiguous judgments. We therefore recommend building a human-labeled dataset and fine-tuning this model on it to achieve the best results.
+
+ ## Additional Information
+ ### Licensing Information
+ ### Citation Information
+
+ ```
+ @article{thevault,
+   title={The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation},
+   author={},
+   journal={},
+   pages={},
+   year={2023}
+ }
+ ```
config.json ADDED
@@ -0,0 +1,36 @@
+ {
+   "_name_or_path": "/datadrive/namlh31/Codebert-docstring-inconsistency",
+   "architectures": [
+     "RobertaForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "Inconsistency",
+     "1": "Consistency"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "Consistency": "1",
+     "Inconsistency": "0"
+   },
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "roberta",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "output_past": true,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.28.0",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 50265
+ }
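The `id2label` mapping above determines how the classifier's two logits are read. A minimal sketch of applying it directly, assuming `model` and `tokenizer` are loaded as in the README (the example input is illustrative):

```python
import torch

# Sentence-pair encoding of a docstring and its code, as in the README.
enc = tokenizer("Sum two integers", "def sum(a, b):\n    return a + b",
                return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits  # shape (1, 2)
probs = logits.softmax(dim=-1)
idx = probs.argmax(dim=-1).item()
# id2label from config.json: 0 -> "Inconsistency", 1 -> "Consistency"
print(model.config.id2label[idx], probs[0, idx].item())
```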
flax_model.msgpack ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d1cd31f97dc5d2ee4e85922acc7e7e352644436d57e4ff582d4d8df19192c938
+ size 498595901
handler.py ADDED
@@ -0,0 +1,33 @@
+ import torch
+ from typing import Dict, List, Any
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
+
+
+ # check for GPU
+ device = 0 if torch.cuda.is_available() else -1
+
+ # id2label = {
+ #     0: "Inconsistency",
+ #     1: "Consistency"
+ # }
+
+ class EndpointHandler:
+     def __init__(self, path=""):
+         # load the tokenizer and model
+         tokenizer = AutoTokenizer.from_pretrained(path)
+         model = AutoModelForSequenceClassification.from_pretrained(path, low_cpu_mem_usage=True)
+         # create the inference pipeline
+         self.pipeline = pipeline("text-classification", model=model, tokenizer=tokenizer, device=device)
+
+     def __call__(self, data: Any) -> List[List[Dict[str, float]]]:
+         inputs = data.pop("inputs", data)
+         parameters = data.pop("parameters", None)
+
+         # forward the inputs, with any extra kwargs, to the pipeline
+         if parameters is not None:
+             prediction = self.pipeline(inputs, **parameters)
+         else:
+             prediction = self.pipeline(inputs)
+         return prediction
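A sketch of how this handler is invoked (the payload shape follows the Hugging Face custom-handler convention; the path, input, and printed output are illustrative):

```python
handler = EndpointHandler(path="Fsoft-AIC/Codebert-docstring-inconsistency")

payload = {
    "inputs": "Sum two integers</s></s>def sum(a, b):\n    return a + b",
    "parameters": {"top_k": 2},  # optional kwargs forwarded to the pipeline
}
print(handler(payload))
# e.g. [{'label': 'Consistency', 'score': ...}, {'label': 'Inconsistency', 'score': ...}]
```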
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:481e4699a0589ea0af3e2b36671aa677662e830f597f5d1bc60f3cc8bc5cec45
+ size 498659253
requirements.txt ADDED
@@ -0,0 +1,2 @@
+ accelerate
+ jax
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "bos_token": "<s>",
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "unk_token": "<unk>"
+ }
test.ipynb ADDED
@@ -0,0 +1,186 @@
+ {
+  "cells": [
+   {
+    "cell_type": "code",
+    "execution_count": 2,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "import os \n",
+     "from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 3,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "model_name_or_path = \"/datadrive/namlh31/codebridge/Codebert-docstring-inconsistency\"\n",
+     "config = AutoConfig.from_pretrained(\n",
+     " model_name_or_path,\n",
+     ")\n",
+     "tokenizer = AutoTokenizer.from_pretrained(\n",
+     " model_name_or_path\n",
+     ")\n",
+     "model = AutoModelForSequenceClassification.from_pretrained(\n",
+     "model_name_or_path,\n",
+     "config=config,\n",
+     ")"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 5,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "examples = {'code': \"function(str){\\r\\n var ret = new Array(str.length), len = str.length;\\r\\n while(len--) ret[len] = str.charCodeAt(len);\\r\\n return Uint8Array.from(ret);\\r\\n}\",\n",
+     " 'docstring': 'we do not need Buffer pollyfill for now'}"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 17,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "texts = (\n",
+     " (examples['docstring'], examples['code'])\n",
+     " )\n",
+     "result = tokenizer(*texts, padding=\"max_length\", max_length=512, truncation=True, return_tensors= 'pt')"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 10,
+    "metadata": {},
+    "outputs": [
+     {
+      "name": "stdout",
+      "output_type": "stream",
+      "text": [
+       "512\n"
+      ]
+     }
+    ],
+    "source": [
+     "tokenizer.decode(result['input_ids'])\n",
+     "print(len(result['input_ids']))"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 22,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "input = \"\"\"we do not need Buffer pollyfill for now</s></s>function(str){\\r\\n var ret = new Array(str.length), len = str.length;\\r\\n while(len--) ret[len] = str.charCodeAt(len);\\r\\n return Uint8Array.from(ret);\\r\\n}\"\"\"\n",
+     "rs_2 = tokenizer(input, padding=\"max_length\", max_length=512, truncation=True, return_tensors= 'pt')"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 23,
+    "metadata": {},
+    "outputs": [
+     {
+      "data": {
+       "text/plain": [
+        "SequenceClassifierOutput(loss=None, logits=tensor([[ 0.2598, -0.2636]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)"
+       ]
+      },
+      "execution_count": 23,
+      "metadata": {},
+      "output_type": "execute_result"
+     }
+    ],
+    "source": [
+     "model(**rs_2)"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 24,
+    "metadata": {},
+    "outputs": [
+     {
+      "name": "stdout",
+      "output_type": "stream",
+      "text": [
+       "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+       "To disable this warning, you can either:\n",
+       "\t- Avoid using `tokenizers` before the fork if possible\n",
+       "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+       "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+       "To disable this warning, you can either:\n",
+       "\t- Avoid using `tokenizers` before the fork if possible\n",
+       "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+       "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+       "To disable this warning, you can either:\n",
+       "\t- Avoid using `tokenizers` before the fork if possible\n",
+       "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
+      ]
+     }
+    ],
+    "source": [
+     "from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline\n",
+     "import torch\n",
+     "device = 0 if torch.cuda.is_available() else -1\n",
+     "pipeline = pipeline(\"text-classification\", model=model, tokenizer=tokenizer, device=device)"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 28,
+    "metadata": {},
+    "outputs": [
+     {
+      "name": "stdout",
+      "output_type": "stream",
+      "text": [
+       "[{'label': 'Inconsistency', 'score': 0.5601343512535095}]\n"
+      ]
+     }
+    ],
+    "source": [
+     "inputs = \"\"\"we do not need Buffer pollyfill for now</s></s>function(str){\n",
+     " var ret = new Array(str.length), len = str.length;\n",
+     " while(len--) ret[len] = str.charCodeAt(len);\n",
+     " return Uint8Array.from(ret);\n",
+     "}\"\"\"\n",
+     "prediction = pipeline(inputs)\n",
+     "print(prediction)"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": []
+   }
+  ],
+  "metadata": {
+   "kernelspec": {
+    "display_name": "namlh31",
+    "language": "python",
+    "name": "python3"
+   },
+   "language_info": {
+    "codemirror_mode": {
+     "name": "ipython",
+     "version": 3
+    },
+    "file_extension": ".py",
+    "mimetype": "text/x-python",
+    "name": "python",
+    "nbconvert_exporter": "python",
+    "pygments_lexer": "ipython3",
+    "version": "3.11.2"
+   },
+   "orig_nbformat": 4
+  },
+  "nbformat": 4,
+  "nbformat_minor": 2
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,64 @@
+ {
+   "add_prefix_space": false,
+   "bos_token": {
+     "__type": "AddedToken",
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "__type": "AddedToken",
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "__type": "AddedToken",
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "errors": "replace",
+   "mask_token": {
+     "__type": "AddedToken",
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "model_max_length": 512,
+   "pad_token": {
+     "__type": "AddedToken",
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "__type": "AddedToken",
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "special_tokens_map_file": "/home/namlh31aic/.cache/huggingface/hub/models--microsoft--codebert-base/snapshots/3b0952feddeffad0063f274080e3c23d75e7eb39/special_tokens_map.json",
+   "tokenizer_class": "RobertaTokenizer",
+   "trim_offsets": true,
+   "unk_token": {
+     "__type": "AddedToken",
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff