Commit f52ce40 by AhmetZeer (parent: db6a8f3): readme.init( )

Files changed (1): README.md added (+85 lines)
# CosmosLLaMa GGUFs

## Objective
Real-time applications need quantized models, so we introduce our GGUF-formatted models. These models are part of the GGML ecosystem, released with the hope of democratizing the use of large models. Depending on the quantization type, there are more than 20 models.

### Features
* All quantization details are listed by Hugging Face in the panel on the right of the model page.
* All models have been tested in `llama.cpp` environments, with both `llama-cli` and `llama-server` (a query sketch follows this list).
* In addition, a YouTube video walks through the basics of using `lmstudio` with these models.
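
As a companion to the list above, here is a short sketch of querying a locally running `llama-server` from Python. This is illustrative and not part of the original workflow: it assumes a recent `llama.cpp` build whose server exposes the OpenAI-compatible `/v1/chat/completions` endpoint, and the GGUF filename, host, and port are placeholders.

```py
import requests

# Assumes a server started locally, for example:
#   llama-server -m Turkish-Llama-8b-Instruct-v0.1.Q4_K_M.gguf --port 8080
# (the GGUF filename and the port are placeholders for illustration)
url = "http://localhost:8080/v1/chat/completions"

payload = {
    "messages": [
        # Turkish system prompt: "You are an AI assistant. The user will give you a task.
        # Your goal is to complete the task as faithfully as possible."
        {"role": "system", "content": "Sen bir yapay zeka asistanısın. Kullanıcı sana bir görev verecek. Amacın görevi olabildiğince sadık bir şekilde tamamlamak."},
        # Turkish user question: "What is the capital of Türkiye?"
        {"role": "user", "content": "Türkiyenin başkenti neresidir?"},
    ],
    "temperature": 0.8,
    "max_tokens": 256,
}

response = requests.post(url, json=payload, timeout=120)
print(response.json()["choices"][0]["message"]["content"])
```
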
### Code Example
Usage example with `llama-cpp-python`:

```py
from llama_cpp import Llama

# Inference parameters (an lmstudio-style preset; only a subset is used below)
inference_params = {
    "n_threads": 4,
    "n_predict": -1,
    "top_k": 40,
    "min_p": 0.05,
    "top_p": 0.95,
    "temp": 0.8,
    "repeat_penalty": 1.1,
    "input_prefix": "<|start_header_id|>user<|end_header_id|>\n\n",
    "input_suffix": "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
    "antiprompt": [],
    # Turkish system prompt: "You are an AI assistant. The user will give you a task.
    # Your goal is to complete the task as faithfully as possible."
    "pre_prompt": "Sen bir yapay zeka asistanısın. Kullanıcı sana bir görev verecek. Amacın görevi olabildiğince sadık bir şekilde tamamlamak.",
    "pre_prompt_suffix": "<|eot_id|>",
    "pre_prompt_prefix": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n",
    "seed": -1,
    "tfs_z": 1,
    "typical_p": 1,
    "repeat_last_n": 64,
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "n_keep": 0,
    "logit_bias": {},
    "mirostat": 0,
    "mirostat_tau": 5,
    "mirostat_eta": 0.1,
    "memory_f16": True,
    "multiline_input": False,
    "penalize_nl": True
}

# Download the chosen quantization from the Hub and load it
llama = Llama.from_pretrained(
    repo_id="ytu-ce-cosmos/Turkish-Llama-8b-Instruct-v0.1-GGUF",
    filename="*Q4_K.gguf",
    n_threads=inference_params["n_threads"],
    verbose=False
)

# Example input (Turkish): "What is the capital of Türkiye?"
user_input = "Türkiyenin başkenti neresidir?"

# Construct the prompt in the Llama-3 chat format:
# system header + system prompt + <|eot_id|> + user header + question + assistant header
prompt = (
    f"{inference_params['pre_prompt_prefix']}{inference_params['pre_prompt']}{inference_params['pre_prompt_suffix']}"
    f"{inference_params['input_prefix']}{user_input}{inference_params['input_suffix']}"
)

# Generate the response with the sampling settings defined above
# (max_tokens <= 0 lets generation run until the model stops or the context fills)
response = llama(
    prompt,
    max_tokens=inference_params["n_predict"],
    temperature=inference_params["temp"],
    top_k=inference_params["top_k"],
    top_p=inference_params["top_p"],
    min_p=inference_params["min_p"],
    repeat_penalty=inference_params["repeat_penalty"],
)

# Output the response
print(response["choices"][0]["text"])
```
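
Alternatively, `llama-cpp-python` can assemble the prompt for you through its chat-completion API, which applies the chat template embedded in the GGUF instead of hand-building the special tokens. This is a minimal sketch, not part of the original README, and it assumes the downloaded GGUF carries the Llama-3 chat template metadata:

```py
# Reuses the `llama` object, `inference_params`, and `user_input` from above.
chat_response = llama.create_chat_completion(
    messages=[
        {"role": "system", "content": inference_params["pre_prompt"]},
        {"role": "user", "content": user_input},
    ],
    temperature=inference_params["temp"],
    top_k=inference_params["top_k"],
    top_p=inference_params["top_p"],
    repeat_penalty=inference_params["repeat_penalty"],
)
print(chat_response["choices"][0]["message"]["content"])
```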

The quantization was performed with `llama.cpp`; in our experience, this approach tends to give the most stable results. A rough sketch of that workflow is shown below.
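
The sketch below is illustrative only: the script and binary names (`convert_hf_to_gguf.py`, `llama-quantize`) follow recent `llama.cpp` releases and differ in older checkouts, and the file paths are placeholders.

```py
import subprocess

# 1. Convert the original Hugging Face checkpoint to a full-precision GGUF
#    (run from a local llama.cpp checkout; paths are placeholders).
subprocess.run(
    ["python", "convert_hf_to_gguf.py", "path/to/Turkish-Llama-8b-Instruct-v0.1",
     "--outfile", "model-f16.gguf", "--outtype", "f16"],
    check=True,
)

# 2. Quantize the F16 GGUF to the desired type, e.g. Q4_K_M.
subprocess.run(
    ["./llama-quantize", "model-f16.gguf", "model-Q4_K_M.gguf", "Q4_K_M"],
    check=True,
)
```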

As expected, the higher-bit models give better inference quality, while inference time is similar across the low-bit models.

Each model's memory footprint can be estimated from the quantization docs of either [Hugging Face](https://huggingface.co/docs/transformers/main/en/quantization/overview) or [llama.cpp](https://github.com/ggerganov/llama.cpp/tree/master/examples/quantize); a back-of-the-envelope estimate is sketched below.
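
As a rough rule of thumb (our illustration, not taken from those docs), the weight memory is roughly the parameter count times the effective bits per weight divided by 8, plus context/KV-cache overhead:

```py
# Rough, illustrative estimate of the weight file size / RAM
# (actual GGUF sizes also include metadata and mixed-precision tensors).
n_params = 8.03e9          # ~8B parameters for this model
bits_per_weight = 4.85     # approximate effective bits for a Q4_K_M-style quant

approx_gib = n_params * bits_per_weight / 8 / 1024**3
print(f"~{approx_gib:.1f} GiB for the weights")   # roughly 4.5 GiB
```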

## Contact
*Feel free to contact us whenever you run into any problems :)*

COSMOS AI Research Group, Yildiz Technical University, Computer Engineering Department
https://cosmos.yildiz.edu.tr/