JuanjoSG5 committed on
Commit
39cf578
1 Parent(s): 651c146

feat: increased the efficiency of the transcription

Files changed (3)
  1. README.md +4 -3
  2. app.py +29 -17
  3. requirements.txt +1 -1
README.md CHANGED
@@ -9,14 +9,15 @@ app_file: app.py
9
  pinned: false
10
  short_description: Transcribes an audio and creates a summary
11
  ---
 
12
  # Limitations
13
 
14
  I have tested the application with audio files of varying lengths. Initially, I attempted processing audios of 1 to 2 hours,
15
  but due to hardware constraints, my PC was unable to handle files of that size effectively.
 
 
16
 
17
- After testing, I found that the application works best with audio files under 20 minutes, and 20 minutes should be considered the longest length I would recommend, since the app processes shorter audio much more effectively. For example, a stereo audio file that is around 20 minutes long usually takes about 15 to 18 minutes to process. Processing time may vary depending on the capabilities of your PC.
18
-
19
- For users with high-performance computers, it may be possible to process longer audio files. However, for consistent and reliable results, I recommend audios around the length of 10 to 15 minutes.
20
 
21
  # Main Use
22
 
 
9
  pinned: false
10
  short_description: Transcribes an audio and creates a summary
11
  ---
12
+
13
  # Limitations
14
 
15
  I have tested the application with audio files of varying lengths. Initially, I attempted processing audios of 1 to 2 hours,
16
  but due to hardware constraints, my PC was unable to handle files of that size effectively.
17
+
18
+ After testing, I found that the application works best with audio files under 20 minutes, and 20 minutes should be considered the longest length I would recommend, since the app processes shorter audio much more effectively. For example, a stereo audio file that is around 20 minutes long usually takes about 10 to 12 minutes to process, but again, I wouldn't recommend using this model for files of that length. Processing time may vary depending on the capabilities of your PC.
19
 
20
+ For users with high-performance computers, it may be possible to process longer audio files. However, for consistent and reliable results, I recommend audio files of around 10 to 15 minutes; processing usually takes about 3 minutes for a 10-minute file and around 5 minutes for a 15-minute file.
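The timings quoted in this README can be compared as a real-time factor (processing time divided by audio length). The sketch below only restates the README's own figures; the name `real_time_factor` and the table layout are illustrative assumptions, not part of the app.

```python
# Processing times quoted in the README:
# (minutes of audio, fastest minutes to process, slowest minutes to process).
# Only the 20-minute case is given as a range (10 to 12 minutes).
README_TIMINGS = [
    (10, 3.0, 3.0),
    (15, 5.0, 5.0),
    (20, 10.0, 12.0),
]


def real_time_factor(audio_minutes, processing_minutes):
    """Processing time divided by audio duration; lower means faster."""
    return processing_minutes / audio_minutes


for audio_min, fastest, slowest in README_TIMINGS:
    low = real_time_factor(audio_min, fastest)
    high = real_time_factor(audio_min, slowest)
    print(f"{audio_min} min audio: real-time factor {low:.2f} to {high:.2f}")
```

On these numbers the cost per minute of audio rises with file length (about 0.30 at 10 minutes, 0.33 at 15, and 0.50 to 0.60 at 20), which is consistent with the recommendation to stay in the 10 to 15 minute range.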
 
 
21
 
22
  # Main Use
23
 
app.py CHANGED
@@ -1,7 +1,7 @@
1
  import gradio as gr
2
  from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC, AutoTokenizer, BartForConditionalGeneration
3
  import torch
4
- import librosa
5
 
6
  # Load BART tokenizer and model for summarization
7
  tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
@@ -16,24 +16,37 @@ device = "cuda" if torch.cuda.is_available() else "cpu"
16
  model.to(device)
17
  summarizer.to(device)
18
 
 
 
 
 
19
  def transcribe_and_summarize(audioFile):
20
-     # Load audio as an array
21
-     audio, sampling_rate = librosa.load(audioFile, sr=16000)  # Ensure it's 16kHz for Wav2Vec2
22
-     values = processor(audio, sampling_rate=sampling_rate, return_tensors="pt").input_values
23
 
24
-     # Move tensors to GPU if available
25
-     values = values.to(device)
26
 
27
-     # Transcription
28
-     with torch.no_grad():
29
-         logits = model(values).logits
30
-     predictedIDs = torch.argmax(logits, dim=-1)
31
-     transcription = processor.batch_decode(predictedIDs, skip_special_tokens=True)[0]
32
 
33
-     # Summarization
34
-     inputs = tokenizer(transcription, return_tensors="pt", truncation=True, max_length=1024)
35
-     inputs = inputs.to(device)  # Move inputs to GPU
36
 
37
      result = summarizer.generate(
38
          inputs["input_ids"],
39
          min_length=10,
@@ -41,12 +54,12 @@ def transcribe_and_summarize(audioFile):
41
          no_repeat_ngram_size=2,
42
          encoder_no_repeat_ngram_size=2,
43
          repetition_penalty=2.0,
44
-         num_beams=4,
45
          early_stopping=True,
46
      )
47
      summary = tokenizer.decode(result[0], skip_special_tokens=True)
48
 
49
-     return transcription, summary
50
 
51
  # Gradio interface
52
  iface = gr.Interface(
@@ -58,4 +71,3 @@ iface = gr.Interface(
58
  )
59
 
60
  iface.launch()
61
-
 
1
  import gradio as gr
2
  from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC, AutoTokenizer, BartForConditionalGeneration
3
  import torch
4
+ import torchaudio # Replace librosa for faster audio processing
5
 
6
  # Load BART tokenizer and model for summarization
7
  tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
 
16
  model.to(device)
17
  summarizer.to(device)
18
 
19
+ model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
20
+ summarizer = torch.quantization.quantize_dynamic(summarizer, {torch.nn.Linear}, dtype=torch.qint8)
21
+
22
+
23
  def transcribe_and_summarize(audioFile):
24
+     # Load audio using torchaudio
25
+     audio, sampling_rate = torchaudio.load(audioFile)
26
+
27
+     # Resample audio to 16kHz if necessary
28
+     if sampling_rate != 16000:
29
+         resample_transform = torchaudio.transforms.Resample(orig_freq=sampling_rate, new_freq=16000)
30
+         audio = resample_transform(audio)
31
+     audio = audio.mean(dim=0)  # Downmix to mono: stereo loads as (2, N), which squeeze() would leave 2-D
32
 
33
+     # Process audio in chunks for large files
34
+     chunk_size = int(16000 * 30)  # 30-second chunks
35
+     transcription = ""
36
 
37
+     for i in range(0, len(audio), chunk_size):
38
+         chunk = audio[i:i+chunk_size].numpy()
39
+         inputs = processor(chunk, sampling_rate=16000, return_tensors="pt").input_values.to(device)
40
 
41
+         # Transcription
42
+         with torch.no_grad():
43
+             logits = model(inputs).logits
44
+         predicted_ids = torch.argmax(logits, dim=-1)
45
+         transcription += processor.batch_decode(predicted_ids, skip_special_tokens=True)[0] + " "
46
 
47
+     # Summarization
48
+     inputs = tokenizer(transcription, return_tensors="pt", truncation=True, max_length=1024).to(device)
49
+
50
      result = summarizer.generate(
51
          inputs["input_ids"],
52
          min_length=10,
 
54
          no_repeat_ngram_size=2,
55
          encoder_no_repeat_ngram_size=2,
56
          repetition_penalty=2.0,
57
+         num_beams=2,  # Reduced beams for faster inference
58
          early_stopping=True,
59
      )
60
      summary = tokenizer.decode(result[0], skip_special_tokens=True)
61
 
62
+     return transcription.strip(), summary.strip()
63
 
64
  # Gradio interface
65
  iface = gr.Interface(
 
71
  )
72
 
73
  iface.launch()
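
The chunked transcription loop in the new `app.py` steps through the waveform in fixed windows of `16000 * 30` samples. A minimal sketch of just that index arithmetic, using plain Python so it runs without the models loaded; `chunk_bounds`, `SAMPLE_RATE`, and `CHUNK_SECONDS` are hypothetical names introduced here for illustration and do not appear in the commit.

```python
SAMPLE_RATE = 16000   # Hz, the rate app.py resamples to
CHUNK_SECONDS = 30    # matches the 16000 * 30 samples per chunk in app.py


def chunk_bounds(num_samples, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Return (start, end) sample indices for consecutive, non-overlapping chunks."""
    chunk_size = sample_rate * chunk_seconds
    return [
        (start, min(start + chunk_size, num_samples))
        for start in range(0, num_samples, chunk_size)
    ]


# A 65-second mono clip at 16 kHz: two full 30-second chunks and a 5-second tail.
print(chunk_bounds(65 * SAMPLE_RATE))
# [(0, 480000), (480000, 960000), (960000, 1040000)]
```

This arithmetic assumes a one-dimensional (mono) waveform, so that the length being chunked is the number of samples. Because the cuts are hard boundaries with no overlap, a word that straddles a boundary may be split between two chunks; overlapping windows would be one possible mitigation, though the commit does not use them.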
 
requirements.txt CHANGED
@@ -1,4 +1,4 @@
1
  gradio
2
  transformers
3
  torch
4
- librosa
 
1
  gradio
2
  transformers
3
  torch
4
+ torchaudio