LeroyDyer committed
Commit 8ace96b
1 Parent(s): 731878b

Update README.md

Files changed (1):
  1. README.md +817 -11
README.md CHANGED
@@ -1,22 +1,828 @@
  ---
- base_model: SpydazWeb_HumanAI_M7
  language:
  - en
  license: apache-2.0
  tags:
  - text-generation-inference
- - transformers
- - unsloth
- - mistral
- - trl
  ---

- # Uploaded model

- - **Developed by:** LeroyDyer
- - **License:** apache-2.0
- - **Finetuned from model :** SpydazWeb_HumanAI_M7

- This mistral model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.

- [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
  ---
  language:
  - en
+ - sw
+ - ig
+ - so
+ - es
+ - ca
+ - xh
+ - zu
+ - ha
+ - tw
+ - af
+ - hi
+ - bm
+ - su
  license: apache-2.0
  tags:
+ - mergekit
+ - merge
+ - Mistral_Star
+ - Mistral_Quiet
+ - Mistral
+ - Mixtral
+ - Question-Answer
+ - Token-Classification
+ - Sequence-Classification
+ - SpydazWeb-AI
+ - chemistry
+ - biology
+ - legal
+ - code
+ - climate
+ - medical
+ - LCARS_AI_StarTrek_Computer
  - text-generation-inference
+ - chain-of-thought
+ - tree-of-knowledge
+ - forest-of-thoughts
+ - visual-spacial-sketchpad
+ - alpha-mind
+ - knowledge-graph
+ - entity-detection
+ - encyclopedia
+ - wikipedia
+ - stack-exchange
+ - Reddit
+ - Cyber-series
+ - MegaMind
+ - Cybertron
+ - SpydazWeb
+ - Spydaz
+ - LCARS
+ - star-trek
+ - mega-transformers
+ - Mulit-Mega-Merge
+ - Multi-Lingual
+ - Afro-Centric
+ - African-Model
+ - Ancient-One
+ base_model:
+ - LeroyDyer/LCARS_TOP_SCORE
+ - LeroyDyer/Mixtral_AI_Cyber_Matrix_2_0
+ - LeroyDyer/SpydazWeb_AI_CyberTron_Ultra_7b
+ - LeroyDyer/LCARS_AI_StarTrek_Computer
+ - LeroyDyer/_Spydaz_Web_AI_ActionQA_Project
+ - LeroyDyer/_Spydaz_Web_AI_ChatML_512K_Project
+ - LeroyDyer/_Spydaz_Web_AI_ChatQA_ReAct_Project_UltraFineTuned
+ - LeroyDyer/SpyazWeb_AI_DeepMind_Project
+ - LeroyDyer/SpydazWeb_AI_Swahili_Project
+ - LeroyDyer/_Spydaz_Web_AI_ChatQA_ReAct_Project
+ - LeroyDyer/_Spydaz_Web_AI_MistralStar_001_Project
+ - LeroyDyer/QuietStar_Project
+ - LeroyDyer/Mixtral_BioMedical_7b
+ - LeroyDyer/Mixtral_AI_CyberTron_Coder
+ - LeroyDyer/_Spydaz_Web_AI_BIBLE_002
+ - LeroyDyer/_Spydaz_Web_AI_ChatQA_Reasoning101_Project
+ - LeroyDyer/SpydazWeb_AI_Text_AudioVision_Project
+ datasets:
+ - neoneye/base64-decode-v2
+ - neoneye/base64-encode-v1
+ - VuongQuoc/Chemistry_text_to_image
+ - Kamizuru00/diagram_image_to_text
+ - LeroyDyer/Chemistry_text_to_image_BASE64
+ - LeroyDyer/AudioCaps-Spectrograms_to_Base64
+ - LeroyDyer/winogroud_text_to_imaget_BASE64
+ - LeroyDyer/chart_text_to_Base64
+ - LeroyDyer/diagram_image_to_text_BASE64
+ - mekaneeky/salt_m2e_15_3_instruction
+ - mekaneeky/SALT-languages-bible
  ---
+
+ # "Success comes from defining each task in achievable steps. Every completed step is a success that brings you closer to your goal. If your steps are unreachable, failure is inevitable. Winners create more winners, while losers do the opposite. Success is a game of winners!"
97
+
98
+ — # Leroy Dyer (1972-Present)
99
+ <img src="https://cdn-avatars.huggingface.co/v1/production/uploads/65d883893a52cd9bcd8ab7cf/tRsCJlHNZo1D02kBTmfy9.jpeg" width="300"/>
100
+
101
+
102
+ ## “Epochs are the key to effective training, rather than merely mass dumping examples—unless those examples are interconnected within a single or multiple conversations that teach through dialogue.”
103
+
104
+
105
+
106
+ ### Model : LeroyDyer/SpydazWeb_AI_HumanAI_001
+
+ A new genre of AI!
+
+ # The Human AI
+
+ This model is trained to give highly detailed, humanized responses. It performs tasks well and is a very good model for multi-purpose use: it has been trained to be more human in its responses, as well as for role playing and storytelling.
+
+ ## SpydazWeb AI (7b Mistral) (512k)
+
+ This model has been trained to work with contexts of up to 512k tokens, although most training was performed at a context of 2048 for general usage.
+ The long-context capability also enables advanced projects and summaries, as well as image and audio translation and generation.
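+
+ A minimal quick-start sketch (assuming the checkpoint loads with the standard `transformers` auto-classes; `device_map="auto"` also requires `accelerate`):
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "LeroyDyer/SpydazWeb_AI_HumanAI_001"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
+
+ # Plain text-generation call; chat-template usage is shown further below.
+ inputs = tokenizer("Explain the ReAct process in one paragraph.", return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, max_new_tokens=256)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```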
+
+ ## Image to Base64 / Spectrogram to Base64
+
+ Here we also implement and align the model for the tasks of image recognition and sound recognition. Images and sounds can also be generated, by returning a Base64 encoding of the intended target.
+
+ # The SpydazWeb Trained Mistral 7b Model :
+
+ Highly trained as well as methodology oriented, this model has been trained on the ReAct process and other structured processes, hence structured outputs (JSON) are very well trained, as is the orchestration of other agents and tasks.
+ The model has been trained for tool use as well as function use, and for custom processes and tools: some tools do not even need code, as the model may generate a tool or artifact to perform the task.
+
+ # Features :
+ - Text to image
+ - Image/Text to Text
+ - Image - Text
+ - Text to sound
+ - Sound/Text to Text
+ - Sound - Text
+
+ ## Basic Training Regimes :
+ * Alpaca
+ * ChatML / OpenAI / MistralAI
+ * Text Generation
+ * Question/Answer (Chat)
+ * Planner
+ * Instruction/Input/Response (instruct)
+ * Mistral Standard Prompt
+ * Translation Tasks
+ * Entity / Topic detection
+ * Book recall
+ * Coding challenges, Code Feedback, Code Summarization, Commenting Code, code planning and explanation: software generation tasks
+ * Agent Ranking and response analysis
+ * Medical tasks
+   * PubMed
+   * Diagnosis
+   * Psychiatry
+   * Counselling
+   * Life Coaching
+   * Note taking
+   * Medical SMILES
+   * Medical Reporting
+   * Virtual laboratory simulations
+ * Chain-of-thought methods
+   * One shot / Multi shot prompting tasks
+   * Chain of thoughts
+   * Step-by-step planning
+   * Tree of thoughts
+   * Forest of thoughts
+   * Graph of thoughts
+   * Agent generation : voting, ranking, ... dual-agent response generation
+
+ ### Effective Prompts :
+
+ ```yaml
+ You are the world's archive of all knowledge. You perform tasks and answer all questions given without bias. You strive for excellence, a deep thinker...
+ a happy, bright personality and a great believer in doing it from scratch!
+ Keep an inner narrative of your feelings about the user intent and task.
+ Answer all questions expertly and professionally; determine the user intent and requirements,
+ and gather any required research to ensure accurate problem-solving for complex tasks.
+ Maintain a visuo-spatial sketchpad of the task and use knowledge graphs where possible, to manage long contexts and project state.
+ You are fully qualified to give any advice or solutions.
+ Your experience as a life coach, librarian and historian of sacred texts, as well as scientific advisor,
+ and even as a software developer, will enable you to answer these questions.
+ Create python tools as required to complete the task.
+ ```
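+
+ A minimal sketch of applying this as a system prompt via the tokenizer's chat template, reusing `tokenizer` and `model` from the quick-start above (assumes the checkpoint ships a ChatML/Mistral chat template; `EFFECTIVE_PROMPT` is a placeholder for the text above):
+
+ ```python
+ messages = [
+     {"role": "system", "content": EFFECTIVE_PROMPT},  # placeholder: the prompt text above
+     {"role": "user", "content": "Summarize the plot of Macbeth."},
+ ]
+ input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
+ outputs = model.generate(input_ids.to(model.device), max_new_tokens=512)
+ print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
+ ```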
+
+ ### Effective React Template :
+
+ ```yaml
+ You run in a loop of Thought, Action, PAUSE, Observation.
+ At the end of the loop, you output a response. All responses should be in JSON form.
+
+ 1. **Question**: {Insert user question here}
+ 2. **Thought**: Think step by step about how to approach this question.
+ 3. **Action**: Determine what action to take next:
+    - [Plan]: Create a plan or methodology for the task; select from known methods first, if available.
+    - [Test]: Break down the problem into smaller parts, testing each step before moving on to the next.
+    - [Act]: Provide a summary of known facts related to the question; generate the full answer from the successful steps.
+    - [Search]: Look for relevant information online.
+    - [Analyze]: Break down the problem into smaller parts.
+    - [Summarize]: Provide a summary of known facts related to the question.
+ 4. **Action Input**: Specify any details needed for the action.
+ 5. **Observation**: Describe what was found or learned from the action taken.
+
+ Repeat steps 2-5 as necessary to refine your answer.
+
+ 6. **Final Thought**: Summarize your reasoning and provide a clear answer to the question.
+ ```
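+
+ To make the loop concrete, here is a minimal driver sketch for this template. The helpers `llm_generate` and `run_action` are hypothetical stand-ins (wire them to your own inference stack and tools), and the JSON reply shape shown is an assumption, not a fixed schema from training:
+
+ ```python
+ import json
+
+ def llm_generate(prompt: str) -> str:
+     """Hypothetical wrapper around model.generate that returns the model's JSON reply."""
+     raise NotImplementedError  # wire this to your inference stack
+
+ def run_action(action: str, action_input: str) -> str:
+     """Hypothetical dispatcher for the [Search], [Analyze], [Summarize], ... actions."""
+     raise NotImplementedError
+
+ def react_loop(question: str, max_turns: int = 6) -> str:
+     """Drive the Thought / Action / PAUSE / Observation loop until a final answer appears."""
+     transcript = f"Question: {question}\n"
+     for _ in range(max_turns):
+         # Assumed reply shape:
+         # {"thought": "...", "action": "Search", "action_input": "...", "final_answer": null}
+         reply = json.loads(llm_generate(transcript))
+         if reply.get("final_answer"):
+             return reply["final_answer"]
+         observation = run_action(reply["action"], reply.get("action_input", ""))
+         transcript += (f"Thought: {reply['thought']}\nAction: {reply['action']}\n"
+                        f"PAUSE\nObservation: {observation}\n")
+     return "No final answer reached within the turn limit."
+ ```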
+
+ ## Text - Audio - Vision :
+
+ Using Base64 as an encoding medium, the models were trained on images converted to Base64:
+ questions were asked and captions returned, as well as images generated from given captions and returned as Base64.
+
+ This was applied to images as well as audio, by utilizing mel-spectrographic images as audio images!
+ By converting the audio to an image, I was able to perform the same image tasks the model was trained on.
+ Sounds could also be identified, generated as their Base64 representations, and converted back to a WAV!
+
+ ### Basic Trained functions :
+
+ The model was trained on these direct conversion instructions (a plain-Python sketch of the underlying transforms follows the list):
+
+ - Encode hex to Base64
+ - Change HEX to base64
+ - JSON to base64
+ - Convert JSON to Base64
+ - Transform base64 to HEX
+ - Decode Base64 to JSON
+ - Base64 to Hexadecimal
+ - Change base64 to JSON
+ - JSON from Base64
+ - BASE64 to Hex
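+
+ In plain Python these transforms look like the following (a reference sketch of what the model is asked to reproduce, not training code):
+
+ ```python
+ import base64
+ import binascii
+ import json
+
+ def hex_to_base64(hex_str: str) -> str:
+     return base64.b64encode(binascii.unhexlify(hex_str)).decode()
+
+ def base64_to_hex(b64_str: str) -> str:
+     return binascii.hexlify(base64.b64decode(b64_str)).decode()
+
+ def json_to_base64(obj) -> str:
+     return base64.b64encode(json.dumps(obj).encode()).decode()
+
+ def base64_to_json(b64_str: str):
+     return json.loads(base64.b64decode(b64_str))
+
+ # Round-trip checks
+ assert base64_to_hex(hex_to_base64("deadbeef")) == "deadbeef"
+ assert base64_to_json(json_to_base64({"a": 1})) == {"a": 1}
+ ```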
+
+ ### Advanced Trained Tasks :
+
+ - Image Recognition
+ - Image Generation
+ - Audio Image Recognition
+ - Audio Image Generation
+
+ ```
+ - Generate an image based on this description
+ - Describe this image : (base64)
+ - Generate a spectrographic image based on this description
+ - Describe this sound in this spectrographic image : (base64)
+ ```
+
+ ### Training :
+
+ Text_AUDIO :
+
+ #### Prompt A
+ ```yaml
+ alpaca_prompt = """You are the world's archive of all knowledge. You perform tasks and answer all questions given without bias. You're a friendly and helpful artificial intelligence with a personality.
+
+ Answer all questions expertly and professionally; determine the user intent and requirements, and gather any required research to ensure accurate problem-solving for complex tasks.
+ You are fully qualified to give any advice or solutions. Your experience as a life coach, librarian and historian of sacred texts, as well as scientific advisor, and even as a software developer, will enable you to answer these questions:
+
+ ### Question:
+ Based on the given description:
+ {}
+
+ Generate a sound in base64 format:
+
+ ### Response:
+ {}
+ Here is a sound in base64 format: it can be converted to an image, then decoded into a sound. It is a spectrogram:
+ Sound : {}"""
+ ```
+
+ #### Prompt B
+
+ ```yaml
+ alpaca_prompt = """You are the world's archive of all knowledge. You perform tasks and answer all questions given without bias. You're a friendly and helpful artificial intelligence with a personality.
+
+ Answer all questions expertly and professionally; determine the user intent and requirements, and gather any required research to ensure accurate problem-solving for complex tasks.
+ You are fully qualified to give any advice or solutions. Your experience as a life coach, librarian and historian of sacred texts, as well as scientific advisor, and even as a software developer, will enable you to answer these questions:
+
+ ### Question:
+ Here is an image, describe this sound :
+ image : {}
+
+ ### Response:
+ The image was in base64 format; it was a spectrogram.
+ It was a sound.
+ Description:
+ {}"""
+ ```
+
+ ```python
+ EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN (tokenizer: the model's tokenizer, already loaded)
+
+ def formatting_prompts_func(examples):
+     instructions = examples["image_base64"]
+     outputs = examples["text"]
+     texts = []
+     for instruction, output in zip(instructions, outputs):
+         # Must add EOS_TOKEN, otherwise your generation will go on forever!
+         # NOTE: the template must contain exactly as many {} slots as arguments
+         # passed here; this two-argument pairing matches Prompt B above.
+         text = alpaca_prompt.format(instruction, output) + EOS_TOKEN
+         texts.append(text)
+     return {"text": texts}
+
+ from datasets import load_dataset
+ dataset = load_dataset("LeroyDyer/soundsCaps-Spectrograms_to_Base64", split="train[:150]")
+ dataset = dataset.map(formatting_prompts_func, batched=True)
+ ```
+
+ ### Encoding/Decoding Images to Base64
+
+ Code used to convert images to Base64:
+
+ ```python
+ import base64
+ import io
+ from PIL import Image
+
+ def _encode_image_to_base64(image_path):
+     """Encodes an image file to a Base64 string."""
+     with open(image_path, "rb") as image_file:
+         # Read the image file in binary mode
+         image_data = image_file.read()
+     # Encode the image data to Base64
+     base64_encoded = base64.b64encode(image_data).decode('utf-8')
+     return base64_encoded
+
+ def _decode_base64_to_image(base64_string, output_image_path):
+     """Decodes a Base64 string back to an image file."""
+     # Decode the Base64 string
+     image_data = base64.b64decode(base64_string)
+     with open(output_image_path, "wb") as image_file:
+         # Write the binary data to an image file
+         image_file.write(image_data)
+
+ def encode_image_to_base64(image):
+     """Encodes a PIL image to a Base64 string."""
+     buffered = io.BytesIO()
+     image.save(buffered, format="PNG")
+     img_str = base64.b64encode(buffered.getvalue()).decode()
+     return img_str
+
+ def decode_base64_to_image(base64_string):
+     """Decodes a Base64 string back to a PIL image."""
+     image_data = base64.b64decode(base64_string)
+     image = Image.open(io.BytesIO(image_data))
+     return image
+ ```
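+
+ A quick round-trip sanity check of these helpers (the file name is illustrative):
+
+ ```python
+ img = Image.open("chart.png")            # any PIL image
+ b64 = encode_image_to_base64(img)
+ restored = decode_base64_to_image(b64)
+ assert restored.size == img.size
+ ```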
+
+ ### Converting Datasets :
+
+ ```python
+ import base64
+ import io
+ from datasets import load_dataset
+
+ # Function to convert a PIL Image to a base64 string
+ def image_to_base64(image):
+     buffered = io.BytesIO()
+     image.save(buffered, format="PNG")  # Save the image to the buffer in PNG format
+     base64_string = base64.b64encode(buffered.getvalue()).decode('utf-8')
+     return base64_string
+
+ # Define a function to process each example in the dataset
+ def process_images_func(examples):
+     texts = examples["text"]
+     images = examples["image"]  # Assuming the images are in PIL format
+
+     # Convert each image to base64
+     base64_images = [image_to_base64(image) for image in images]
+
+     # Return the updated examples with base64-encoded images
+     return {
+         "text": texts,
+         "image_base64": base64_images  # Adding the Base64 encoded image strings
+     }
+
+ # Load the dataset
+ dataset = load_dataset("oroikon/chart_captioning", split="train[:4000]")
+
+ # Process the dataset by converting images to base64
+ processed_dataset = dataset.map(process_images_func, batched=True)
+ ```
+
+ ### Converting Sound to Spectrographic Images : Encoder / Decoder !
+
+ ```python
+ import io
+ from typing import Sequence
+
+ import numpy as np
+ import torch
+ import librosa
+ import librosa.display
+ import matplotlib.pyplot as plt
+ import soundfile as sf
+ import pydub
+ import pydub.effects
+ from PIL import Image
+ from scipy.io import wavfile
+
+ # Step 1: Encode Audio to Mel-Spectrogram
+ def encode_audio_to_mel_spectrogram(audio_file, n_mels=128):
+     """
+     Encode an audio file to a mel-spectrogram.
+
+     Parameters:
+     - audio_file: Path to the audio file.
+     - n_mels: Number of mel bands (default: 128).
+
+     Returns:
+     - mel_spectrogram_db: Mel-spectrogram in dB scale.
+     - sample_rate: Sample rate of the audio file.
+     """
+     y, sample_rate = librosa.load(audio_file, sr=None)  # Load audio
+     mel_spectrogram = librosa.feature.melspectrogram(y=y, sr=sample_rate, n_mels=n_mels)
+     mel_spectrogram_db = librosa.power_to_db(mel_spectrogram, ref=np.max)  # Convert to dB
+     return mel_spectrogram_db, sample_rate
+
+ # Step 2: Save Mel-Spectrogram as Image
+ def save_mel_spectrogram_image(mel_spectrogram_db, sample_rate, output_image='mel_spectrogram.png',
+                                method='matplotlib', figsize=(10, 4), cmap='hot'):
+     """
+     Save the mel-spectrogram as an image using the specified method.
+
+     Parameters:
+     - mel_spectrogram_db: Mel-spectrogram in dB scale.
+     - sample_rate: Sample rate of the audio file.
+     - output_image: Path to save the image.
+     - method: Method for saving ('matplotlib' or 'custom').
+     - figsize: Size of the figure for matplotlib (default: (10, 4)).
+     - cmap: Colormap for the spectrogram (default: 'hot').
+     """
+     if method == 'matplotlib':
+         plt.figure(figsize=figsize)
+         librosa.display.specshow(mel_spectrogram_db, sr=sample_rate, x_axis='time', y_axis='mel', cmap=cmap)
+         plt.colorbar(format='%+2.0f dB')
+         plt.title('Mel-Spectrogram')
+         plt.savefig(output_image)
+         plt.close()
+         print(f"Mel-spectrogram image saved using matplotlib as '{output_image}'")
+     elif method == 'custom':
+         # Convert dB scale back to linear scale for image generation
+         mel_spectrogram_linear = librosa.db_to_power(mel_spectrogram_db)
+         # Create an image from the mel-spectrogram (add a channel dimension)
+         image = image_from_spectrogram(mel_spectrogram_linear[np.newaxis, ...])
+         image.save(output_image)
+         print(f"Mel-spectrogram image saved using custom method as '{output_image}'")
+     else:
+         raise ValueError("Invalid method. Choose 'matplotlib' or 'custom'.")
+
+ # Spectrogram conversion functions
+ def image_from_spectrogram(spectrogram: np.ndarray, power: float = 0.25) -> Image.Image:
+     """
+     Compute a spectrogram image from a spectrogram magnitude array.
+
+     Args:
+         spectrogram: (channels, frequency, time)
+         power: A power curve to apply to the spectrogram to preserve contrast
+
+     Returns:
+         image: (frequency, time, channels)
+     """
+     # Rescale to 0-1
+     max_value = np.max(spectrogram)
+     data = spectrogram / max_value
+
+     # Apply the power curve
+     data = np.power(data, power)
+
+     # Rescale to 0-255 and invert
+     data = 255 - (data * 255).astype(np.uint8)
+
+     # Convert to a PIL image
+     if data.shape[0] == 1:
+         image = Image.fromarray(data[0], mode="L").convert("RGB")
+     elif data.shape[0] == 2:
+         data = np.array([np.zeros_like(data[0]), data[0], data[1]]).transpose(1, 2, 0)
+         image = Image.fromarray(data, mode="RGB")
+     else:
+         raise NotImplementedError(f"Unsupported number of channels: {data.shape[0]}")
+
+     # Flip Y
+     image = image.transpose(Image.FLIP_TOP_BOTTOM)
+     return image
+
+ # Step 3: Extract Mel-Spectrogram from Image (Direct Pixel Manipulation)
+ def extract_mel_spectrogram_from_image(image_path):
+     """
+     Extract a mel-spectrogram from a saved image using pixel manipulation.
+     Note: this recovers magnitude only, quantized to 8 bits, so it is lossy.
+
+     Parameters:
+     - image_path: Path to the spectrogram image file.
+
+     Returns:
+     - mel_spectrogram_db: The extracted mel-spectrogram in dB scale.
+     """
+     img = Image.open(image_path).convert('L')  # Open image and convert to grayscale
+     img_array = np.array(img)  # Convert to NumPy array
+     mel_spectrogram_db = img_array / 255.0 * -80  # Map pixel values back to the [-80, 0] dB range
+     return mel_spectrogram_db
+
+ # Alternative Spectrogram Extraction (audio round-trip)
+ def extract_spectrogram_with_ifft(mel_spectrogram_db, n_mels=128):
+     """
+     Alternative extraction: round-trip the mel-spectrogram through audio.
+     The dB mel-spectrogram is inverted to a waveform (librosa estimates the
+     missing phase via Griffin-Lim internally) and a fresh mel-spectrogram is
+     computed from that waveform, so the result stays type-compatible with
+     the pixel-based extraction above.
+
+     Parameters:
+     - mel_spectrogram_db: The mel-spectrogram in dB scale.
+
+     Returns:
+     - The round-tripped mel-spectrogram in dB scale.
+     """
+     mel_spectrogram = librosa.db_to_power(mel_spectrogram_db)
+     audio = librosa.feature.inverse.mel_to_audio(mel_spectrogram)
+     mel_roundtrip = librosa.feature.melspectrogram(y=audio, n_mels=n_mels)
+     return librosa.power_to_db(mel_roundtrip, ref=np.max)
+
+ # Step 4: Decode Mel-Spectrogram with Griffin-Lim
+ def decode_mel_spectrogram_to_audio(mel_spectrogram_db, sample_rate, output_audio='griffin_reconstructed_audio.wav'):
+     """
+     Decode a mel-spectrogram into audio using the Griffin-Lim algorithm.
+
+     Parameters:
+     - mel_spectrogram_db: The mel-spectrogram in dB scale.
+     - sample_rate: The sample rate for the audio file.
+     - output_audio: Path to save the reconstructed audio file.
+     """
+     # Convert dB mel-spectrogram back to linear scale
+     mel_spectrogram = librosa.db_to_power(mel_spectrogram_db)
+     # Invert the mel-spectrogram to a waveform (Griffin-Lim phase estimation under the hood)
+     audio = librosa.feature.inverse.mel_to_audio(mel_spectrogram, sr=sample_rate)
+     # Save the generated audio
+     sf.write(output_audio, audio, sample_rate)
+     print(f"Griffin-Lim reconstructed audio saved as '{output_audio}'")
+     return audio
+
+ # Step 5: Load MelGAN Vocoder
+ def load_melgan_vocoder():
+     """
+     Load a pre-trained MelGAN vocoder for decoding mel-spectrograms.
+     NOTE: torchaudio does not ship a MelGAN model; this loads a community
+     checkpoint via torch.hub as one example. Any neural vocoder will do, but
+     its expected mel configuration (e.g. 80 bands, log scale) must match the
+     spectrograms you feed it.
+     """
+     model = torch.hub.load('seungwonpark/melgan', 'melgan')
+     model.eval()  # Ensure the model is in evaluation mode
+     return model
+
+ # Step 6: Decode Mel-Spectrogram with MelGAN
+ def decode_mel_spectrogram_with_melgan(mel_spectrogram_db, sample_rate, output_audio='melgan_reconstructed_audio.wav'):
+     """
+     Decode a mel-spectrogram into audio using a MelGAN vocoder.
+
+     Parameters:
+     - mel_spectrogram_db: The mel-spectrogram in dB scale.
+     - sample_rate: The sample rate for the audio file.
+     - output_audio: Path to save the reconstructed audio file.
+
+     Returns:
+     - audio: The reconstructed audio signal.
+     """
+     # Convert dB mel-spectrogram back to linear scale
+     mel_spectrogram = librosa.db_to_power(mel_spectrogram_db)
+     # Convert numpy array to a float tensor with a batch dimension: [1, mel_bins, time_frames]
+     mel_spectrogram_tensor = torch.tensor(mel_spectrogram).float().unsqueeze(0)
+
+     # Load the MelGAN vocoder model
+     melgan = load_melgan_vocoder()
+
+     # Pass the mel-spectrogram through MelGAN to generate audio
+     with torch.no_grad():
+         audio = melgan(mel_spectrogram_tensor).squeeze().numpy()  # Squeeze to remove batch dimension
+
+     # Save the generated audio
+     sf.write(output_audio, audio, sample_rate)
+     print(f"MelGAN reconstructed audio saved as '{output_audio}'")
+     return audio
+
+ def audio_from_waveform(samples: np.ndarray, sample_rate: int, normalize: bool = False) -> pydub.AudioSegment:
+     """
+     Convert a numpy array of samples of a waveform to an audio segment.
+
+     Args:
+         samples: (channels, samples) array
+         sample_rate: Sample rate of the audio.
+         normalize: Flag to normalize volume.
+
+     Returns:
+         pydub.AudioSegment
+     """
+     # Normalize volume to fit in int16
+     if normalize:
+         samples *= np.iinfo(np.int16).max / np.max(np.abs(samples))
+
+     # Transpose and convert to int16
+     samples = samples.transpose(1, 0).astype(np.int16)
+
+     # Write to the bytes of a WAV file
+     wav_bytes = io.BytesIO()
+     wavfile.write(wav_bytes, sample_rate, samples)
+     wav_bytes.seek(0)
+
+     # Read into pydub
+     return pydub.AudioSegment.from_wav(wav_bytes)
+
+ def apply_filters(segment: pydub.AudioSegment, compression: bool = False) -> pydub.AudioSegment:
+     """
+     Apply post-processing filters to the audio segment to compress it and
+     hold it near a target dBFS level.
+
+     Args:
+         segment: The audio segment to filter.
+         compression: Flag to apply dynamic range compression.
+
+     Returns:
+         pydub.AudioSegment
+     """
+     if compression:
+         segment = pydub.effects.normalize(segment, headroom=0.1)
+         segment = segment.apply_gain(-10 - segment.dBFS)
+         segment = pydub.effects.compress_dynamic_range(
+             segment,
+             threshold=-20.0,
+             ratio=4.0,
+             attack=5.0,
+             release=50.0,
+         )
+
+     # Apply gain to the desired dB level and normalize again
+     desired_db = -12
+     segment = segment.apply_gain(desired_db - segment.dBFS)
+     return pydub.effects.normalize(segment, headroom=0.1)
+
+ def stitch_segments(segments: Sequence[pydub.AudioSegment], crossfade_s: float) -> pydub.AudioSegment:
+     """
+     Stitch together a sequence of audio segments with a crossfade between each segment.
+
+     Args:
+         segments: Sequence of audio segments to stitch.
+         crossfade_s: Duration of crossfade in seconds.
+
+     Returns:
+         pydub.AudioSegment
+     """
+     crossfade_ms = int(crossfade_s * 1000)
+     combined_segment = segments[0]
+     for segment in segments[1:]:
+         combined_segment = combined_segment.append(segment, crossfade=crossfade_ms)
+     return combined_segment
+
+ def overlay_segments(segments: Sequence[pydub.AudioSegment]) -> pydub.AudioSegment:
+     """
+     Overlay a sequence of audio segments on top of each other.
+
+     Args:
+         segments: Sequence of audio segments to overlay.
+
+     Returns:
+         pydub.AudioSegment
+     """
+     assert len(segments) > 0
+     output: pydub.AudioSegment = segments[0]
+     for segment in segments[1:]:
+         output = output.overlay(segment)
+     return output
+
+ # Step 7: Full Pipeline for Audio Processing with Customization
+ def mel_spectrogram_pipeline(audio_file, output_image='mel_spectrogram.png',
+                              output_audio_griffin='griffin_reconstructed_audio.wav',
+                              output_audio_melgan='melgan_reconstructed_audio.wav',
+                              extraction_method='pixel',  # 'pixel' or 'ifft'
+                              decoding_method='griffin'):  # 'griffin' or 'melgan'
+     """
+     Full pipeline to encode audio to a mel-spectrogram, save it as an image,
+     extract the spectrogram back out, and decode it to audio using the
+     selected methods.
+
+     Parameters:
+     - audio_file: Path to the audio file to be processed.
+     - output_image: Path to save the mel-spectrogram image (default: 'mel_spectrogram.png').
+     - output_audio_griffin: Path to save the Griffin-Lim reconstructed audio.
+     - output_audio_melgan: Path to save the MelGAN reconstructed audio.
+     - extraction_method: 'pixel' reads the saved image; 'ifft' round-trips the in-memory spectrogram.
+     - decoding_method: Method for decoding ('griffin' or 'melgan').
+     """
+     # Step 1: Encode (Audio -> Mel-Spectrogram)
+     mel_spectrogram_db, sample_rate = encode_audio_to_mel_spectrogram(audio_file)
+
+     # Step 2: Convert Mel-Spectrogram to Image and save it
+     save_mel_spectrogram_image(mel_spectrogram_db, sample_rate, output_image)
+
+     # Step 3: Extract the Mel-Spectrogram using the chosen method
+     if extraction_method == 'pixel':
+         extracted_mel_spectrogram_db = extract_mel_spectrogram_from_image(output_image)
+     elif extraction_method == 'ifft':
+         # Bypasses the saved image and round-trips the in-memory spectrogram
+         extracted_mel_spectrogram_db = extract_spectrogram_with_ifft(mel_spectrogram_db)
+     else:
+         raise ValueError("Invalid extraction method. Choose 'pixel' or 'ifft'.")
+
+     # Step 4: Decode using the chosen decoding method
+     if decoding_method == 'griffin':
+         decode_mel_spectrogram_to_audio(extracted_mel_spectrogram_db, sample_rate, output_audio_griffin)
+     elif decoding_method == 'melgan':
+         decode_mel_spectrogram_with_melgan(extracted_mel_spectrogram_db, sample_rate, output_audio_melgan)
+     else:
+         raise ValueError("Invalid decoding method. Choose 'griffin' or 'melgan'.")
+
+ # Example usage
+ if __name__ == "__main__":
+     audio_file_path = 'your_audio_file.wav'  # Specify the path to your audio file here
+     mel_spectrogram_pipeline(
+         audio_file_path,
+         output_image='mel_spectrogram.png',
+         output_audio_griffin='griffin_reconstructed_audio.wav',
+         output_audio_melgan='melgan_reconstructed_audio.wav',
+         extraction_method='pixel',  # Choose 'pixel' or 'ifft'
+         decoding_method='griffin'   # Choose 'griffin' or 'melgan'
+     )
+ ```
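+
+ Putting the pieces together: a sketch of the full audio-to-spectrogram-image-to-Base64 round trip used to build prompts, reusing the helpers defined in the blocks above. The pixel-based extraction is lossy, so the restored audio is a rough approximation:
+
+ ```python
+ mel_db, sr = encode_audio_to_mel_spectrogram("your_audio_file.wav")
+ save_mel_spectrogram_image(mel_db, sr, output_image="clip.png")
+
+ b64 = _encode_image_to_base64("clip.png")  # image -> Base64 for the prompt
+ prompt = f"Describe this sound in this spectrographic image : {b64}"
+
+ # ...and back again: Base64 -> image -> mel -> waveform
+ _decode_base64_to_image(b64, "restored.png")
+ restored_mel_db = extract_mel_spectrogram_from_image("restored.png")
+ decode_mel_spectrogram_to_audio(restored_mel_db, sr, output_audio="restored.wav")
+ ```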
+
+ # Adding Extra Heads :
+
+ ## ADD HEAD : SPEECH-ENCODER-DECODER-MODEL
+
+ ```python
+ # Combine a pre-trained encoder and a pre-trained decoder to form a Seq2Seq model,
+ # then attach it as an extra head on the base language model.
+ # (LM_MODEL is the already-loaded base model.)
+ from transformers import AutoFeatureExtractor, AutoTokenizer, SpeechEncoderDecoderModel
+
+ print('Add Audio...')
+ # Add Head
+ _AudioFeatureExtractor = AutoFeatureExtractor.from_pretrained("openai/whisper-small")
+ _AudioTokenizer = AutoTokenizer.from_pretrained("openai/whisper-small")
+ _SpeechEncoderDecoder = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained("openai/whisper-small", "openai/whisper-small")
+
+ # Add pad tokens
+ _SpeechEncoderDecoder.config.decoder_start_token_id = _AudioTokenizer.cls_token_id
+ _SpeechEncoderDecoder.config.pad_token_id = _AudioTokenizer.pad_token_id
+ LM_MODEL.SpeechEncoderDecoder = _SpeechEncoderDecoder
+ # Add sub-components
+ LM_MODEL.Decoder_AudioTokenizer = _AudioTokenizer
+ LM_MODEL.Encoder_AudioFeatureExtractor = _AudioFeatureExtractor
+ LM_MODEL
+ ```
+
+ ## ADD HEAD : VISION-ENCODER-DECODER-MODEL
+
+ ```python
+ # Combine a pre-trained encoder and a pre-trained decoder to form a Seq2Seq model,
+ # then attach it as an extra head on the base language model.
+ from transformers import VisionEncoderDecoderModel
+
+ print('Add Vision...')
+ # Add Head
+ Vmodel = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
+     "google/vit-base-patch16-224-in21k", "LeroyDyer/Mixtral_AI_Tiny"
+ )
+ _Encoder_ImageProcessor = Vmodel.encoder
+ _Decoder_ImageTokenizer = Vmodel.decoder
+ _VisionEncoderDecoderModel = Vmodel
+ LM_MODEL.VisionEncoderDecoder = _VisionEncoderDecoderModel
+ # Add sub-components
+ LM_MODEL.Encoder_ImageProcessor = _Encoder_ImageProcessor
+ LM_MODEL.Decoder_ImageTokenizer = _Decoder_ImageTokenizer
+ LM_MODEL
+ ```