update documentation
- README.md +16 -9
- streamlit_app.py +6 -11
README.md
CHANGED
@@ -1,19 +1,26 @@
-# DocumentIQA: Scientific Document Insight
+# DocumentIQA: Scientific Document Insight QA
 
 ## Introduction
 
-Question/Answering on scientific documents.
-
-
-This is just the beginning and publishing might help gather more feedback.
-
-**NOTE**: This project focuses on scientific articles. Uploading books or other large documents might not work as expected.
+Question/Answering on scientific documents using LLMs (OpenAI, Mistral, Llama 2).
+This application is the frontend for testing RAG (Retrieval Augmented Generation) on scientific documents, which we are developing at NIMS.
+Unlike most similar projects, we focus on scientific articles and use [Grobid](https://github.com/kermitt2/grobid) for text extraction instead of a raw PDF-to-text converter. This allows us to extract only the full text.
 
 **Work in progress**
 
-
+- Select the model+embedding combination you want to use.
+- Enter your API Key (OpenAI or HuggingFace).
+- Upload a scientific article as a PDF document. You will see a spinner or loading indicator while the processing is in progress.
+- Once the spinner stops, you can proceed to ask your questions.
+
+### Query mode (LLM vs Embeddings)
+By default, the mode is set to LLM (Language Model), which enables question/answering. You can directly ask questions related to the document content, and the system will answer the question using content from the document.
+If you switch the mode to "Embeddings", the system will return specific chunks from the document that are semantically related to your query. This mode helps diagnose why answers are sometimes unsatisfying or incomplete.
 
-
+## Demo
+The demo is deployed with Streamlit and, depending on the model used, requires either an OpenAI or a HuggingFace **API key**.
+
+https://document-insights.streamlit.app/
 
 
 ### Acknowledgement
streamlit_app.py
CHANGED
@@ -118,16 +118,14 @@ if not st.session_state['api_key']:
 else:
     is_api_key_provided = st.session_state['api_key']
 
-st.title("π Document
-st.subheader("Upload a PDF
+st.title("π Scientific Document Insight Q&A")
+st.subheader("Upload a scientific article in PDF, ask questions, get insights.")
 
 upload_col, radio_col, context_col = st.columns([7, 2, 2])
 with upload_col:
     uploaded_file = st.file_uploader("Upload an article", type=("pdf", "txt"), on_change=new_file,
                                      disabled=not is_api_key_provided,
-                                     help="The
-                                     "embeddings of each paragraph which are then stored to a Db for be picked "
-                                     "to answer specific questions. ")
+                                     help="The full-text is extracted using Grobid. ")
 with radio_col:
     mode = st.radio("Query mode", ("LLM", "Embeddings"), disabled=not uploaded_file, index=0,
                     help="LLM will respond the question, Embedding will show the "
@@ -147,20 +145,17 @@ with st.sidebar:
     st.header("Documentation")
     st.markdown("https://github.com/lfoppiano/document-qa")
     st.markdown(
-        """After entering your API Key (Open AI or Huggingface). Upload a scientific article as PDF document
-
-    st.markdown(
-        """After uploading, please wait for the PDF to be processed. You will see a spinner or loading indicator while the processing is in progress. Once the spinner stops, you can proceed to ask your questions.""")
+        """After entering your API Key (Open AI or Huggingface). Upload a scientific article as PDF document. You will see a spinner or loading indicator while the processing is in progress. Once the spinner stops, you can proceed to ask your questions.""")
 
     st.markdown("**Revision number**: [" + st.session_state[
         'git_rev'] + "](https://github.com/lfoppiano/grobid-magneto/commit/" + st.session_state['git_rev'] + ")")
 
     st.header("Query mode (Advanced use)")
     st.markdown(
-        """By default, the mode is set to LLM (Language Model) which enables question/answering. You can directly ask questions related to the
+        """By default, the mode is set to LLM (Language Model) which enables question/answering. You can directly ask questions related to the document content, and the system will answer the question using content from the document.""")
 
     st.markdown(
-        """If you switch the mode to "Embedding," the system will return specific
+        """If you switch the mode to "Embedding," the system will return specific chunks from the document that are semantically related to your query. This mode helps to test why sometimes the answers are not satisfying or incomplete. """)
 
 if uploaded_file and not st.session_state.loaded_embeddings:
     with st.spinner('Reading file, calling Grobid, and creating memory embeddings...'):
|
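The difference between the two query modes described in this change can be illustrated with a small sketch. This is not the application's actual implementation: the function names (`top_k_chunks`, `run_query`), the toy three-dimensional vectors, and the `llm` callable are all assumptions made for illustration. A real deployment would obtain vectors from an embedding model and send the prompt to a hosted LLM.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple, Union

# Hypothetical index entry: (chunk text, embedding vector).
IndexedChunk = Tuple[str, Sequence[float]]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec: Sequence[float], index: List[IndexedChunk], k: int = 2) -> List[str]:
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def run_query(question: str, query_vec: Sequence[float], index: List[IndexedChunk],
              mode: str = "LLM", llm: Optional[Callable[[str], str]] = None,
              k: int = 2) -> Union[str, List[str]]:
    """'Embeddings' mode returns the retrieved chunks; 'LLM' mode answers from them."""
    chunks = top_k_chunks(query_vec, index, k)
    if mode == "Embeddings":
        # Show exactly what retrieval found, so poor answers can be traced to poor context.
        return chunks
    if mode == "LLM":
        if llm is None:
            raise ValueError("LLM mode needs an `llm` callable")
        prompt = "Context:\n" + "\n".join(chunks) + "\n\nQuestion: " + question
        return llm(prompt)
    raise ValueError(f"Unknown mode: {mode}")


# Toy index with made-up vectors standing in for real paragraph embeddings.
INDEX: List[IndexedChunk] = [
    ("The sample becomes superconducting below 39 K.", [1.0, 0.0, 0.0]),
    ("Samples were annealed at 900 C for two hours.", [0.0, 1.0, 0.0]),
    ("Funding was provided by the host institute.", [0.0, 0.0, 1.0]),
]
```

Calling `run_query(..., mode="Embeddings")` returns the raw chunks, which makes it possible to check whether an unsatisfying LLM answer stems from retrieval rather than from generation.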