m7mdal7aj committed on
Commit 7aaa8ae
1 Parent(s): 530fdf0

Update my_model/tabs/model_arch.py

Files changed (1)
  1. my_model/tabs/model_arch.py +44 -8
my_model/tabs/model_arch.py CHANGED
@@ -24,13 +24,49 @@ def run_model_arch() -> None:
  components.html(model_arch_html, height=1400)
  with col2:
  st.markdown("#### Abstract")
- st.write("""\n\nNavigating the frontier of the Visual Turing Test, this research delves into multimodal learning to bridge the gap between visual perception and linguistic interpretation, a foundational challenge in artificial intelligence. It scrutinizes the integration of visual cognition and external knowledge, emphasizing the pivotal role of the Transformer model in enhancing language processing and supporting complex multimodal tasks.
- This research explores the task of Knowledge-Based Visual Question Answering (KB-VQA), it examines the influence of Pre-Trained Large Language Models (PT-LLMs) and Pre-Trained Multimodal Models (PT-LMMs), which have transformed the machine learning landscape by utilizing expansive, pre-trained knowledge repositories to tackle complex tasks, thereby enhancing KB-VQA systems.
- \nAn examination of existing Knowledge-Based Visual Question Answering (KB-VQA) methodologies led to a refined approach that converts visual content into the linguistic domain, creating detailed captions and object enumerations. This process leverages the implicit knowledge and inferential capabilities of PT-LLMs. The research refines the fine-tuning of PT-LLMs by integrating specialized tokens, enhancing the models’ ability to interpret visual contexts. The research also reviews current image representation techniques and knowledge sources, advocating for the utilization of implicit knowledge in PT-LLMs, especially for tasks that do not require specialized expertise.
- \nRigorous ablation experiments conducted to assess the impact of various visual context elements on model performance, with a particular focus on the importance of image descriptions generated during the captioning phase. The study includes a comprehensive analysis of major KB-VQA datasets, specifically the OK-VQA corpus, and critically evaluates the metrics used, incorporating semantic evaluation with GPT-4 to align the assessment with practical application needs.
- \nThe evaluation results underscore the developed model’s competent and competitive performance. It achieves a VQA score of 63.57% under syntactic evaluation and excels with an Exact Match (EM) score of 68.36%. Further, semantic evaluations yield even more impressive outcomes, with VQA and EM scores of 71.09% and 72.55%, respectively. These results demonstrate that the model effectively applies reasoning over the visual context and successfully retrieves the necessary knowledge to answer visual questions.""")
-
  st.markdown("#### Design")
- st.write("""As illustrated in the architecture, the model operates through a sequential pipeline, beginning with the Image to Language Transformation Module, in this module, the image undergoes simultaneous processing via image captioning and object detection frozen models, aiming to comprehensively capture the visual context and cues. These models, selected for their initial effectiveness, are designed to be pluggable, allowing for easy replacement with more advanced models as new technologies develop, thus ensuring the module remains at the forefront of technological advancement. Following this, the Prompt Engineering Module processes the generated captions and the list of detected objects, along with their bounding boxes and confidence levels, merging these elements with the question at hand utilizing a meticulously crafted prompting template. The pipeline ends with a Fine-tuned Pre-Trained Large Language Model (PT-LLMs), which is responsible for performing reasoning and deriving the required knowledge to formulate an informed response to the question.
- """)
  components.html(model_arch_html, height=1400)
  with col2:
  st.markdown("#### Abstract")
+ st.markdown("""
+ <div style="text-align: justify;">
+ Navigating the frontier of the Visual Turing Test, this research delves into multimodal learning to bridge
+ the gap between visual perception and linguistic interpretation, a foundational challenge in artificial
+ intelligence. It scrutinizes the integration of visual cognition and external knowledge, emphasizing the
+ pivotal role of the Transformer model in enhancing language processing and supporting complex multimodal tasks.
+ This research explores the task of Knowledge-Based Visual Question Answering (KB-VQA), examining the influence
+ of Pre-Trained Large Language Models (PT-LLMs) and Pre-Trained Multimodal Models (PT-LMMs), which have
+ transformed the machine learning landscape by utilizing expansive, pre-trained knowledge repositories to tackle
+ complex tasks, thereby enhancing KB-VQA systems.
+ An examination of existing Knowledge-Based Visual Question Answering (KB-VQA) methodologies led to a refined
+ approach that converts visual content into the linguistic domain, creating detailed captions and object
+ enumerations. This process leverages the implicit knowledge and inferential capabilities of PT-LLMs. The
+ research refines the fine-tuning of PT-LLMs by integrating specialized tokens, enhancing the models’ ability
+ to interpret visual contexts. The research also reviews current image representation techniques and knowledge
+ sources, advocating for the utilization of implicit knowledge in PT-LLMs, especially for tasks that do not
+ require specialized expertise.
+ Rigorous ablation experiments were conducted to assess the impact of various visual context elements on model
+ performance, with a particular focus on the importance of image descriptions generated during the captioning
+ phase. The study includes a comprehensive analysis of major KB-VQA datasets, specifically the OK-VQA corpus,
+ and critically evaluates the metrics used, incorporating semantic evaluation with GPT-4 to align the assessment
+ with practical application needs.
+ The evaluation results underscore the developed model’s competent and competitive performance. It achieves a
+ VQA score of 63.57% under syntactic evaluation and excels with an Exact Match (EM) score of 68.36%. Further,
+ semantic evaluations yield even more impressive outcomes, with VQA and EM scores of 71.09% and 72.55%,
+ respectively. These results demonstrate that the model effectively applies reasoning over the visual context
+ and successfully retrieves the necessary knowledge to answer visual questions.
+ </div>
+ """, unsafe_allow_html=True)
  st.markdown("#### Design")
+ st.markdown("""
+ <div style="text-align: justify;">
+ As illustrated in the architecture, the model operates through a sequential pipeline, beginning with the Image to
+ Language Transformation Module. In this module, the image undergoes simultaneous processing via frozen image
+ captioning and object detection models, aiming to comprehensively capture the visual context and cues. These models,
+ selected for their initial effectiveness, are designed to be pluggable, allowing for easy replacement with more
+ advanced models as new technologies develop, thus ensuring the module remains at the forefront of technological
+ advancement.
+ Following this, the Prompt Engineering Module processes the generated captions and the list of detected objects,
+ along with their bounding boxes and confidence levels, merging these elements with the question at hand utilizing
+ a meticulously crafted prompting template. The pipeline ends with a Fine-tuned Pre-Trained Large Language Model
+ (PT-LLM), which is responsible for performing reasoning and deriving the required knowledge to formulate an
+ informed response to the question.
+ </div>
+ """, unsafe_allow_html=True)