Spaces:

MachineLearningReply
/

docwhiz

Runtime error


Anirudh Madhigiri Gopinath committed on Apr 5

Commit

fa2034d

•

1 Parent(s): 4053944

pusing


Files changed (23)

.DS_Store +0 -0
README 2.md +111 -0
app.py +284 -0
generate_keys.py +15 -0
hashed_password.pkl +0 -0
ml_logo.png +0 -0
requirements.txt +127 -0
utils/.DS_Store +0 -0
utils/__pycache__/check_pydantic_version.cpython-310.pyc +0 -0
utils/__pycache__/check_pydantic_version.cpython-311.pyc +0 -0
utils/__pycache__/check_pydantic_version.cpython-39.pyc +0 -0
utils/__pycache__/config.cpython-310.pyc +0 -0
utils/__pycache__/config.cpython-311.pyc +0 -0
utils/__pycache__/config.cpython-39.pyc +0 -0
utils/__pycache__/haystack.cpython-310.pyc +0 -0
utils/__pycache__/haystack.cpython-311.pyc +0 -0
utils/__pycache__/haystack.cpython-39.pyc +0 -0
utils/__pycache__/ui.cpython-310.pyc +0 -0
utils/__pycache__/ui.cpython-311.pyc +0 -0
utils/check_pydantic_version.py +26 -0
utils/config.py +43 -0
utils/haystack.py +124 -0
utils/ui.py +16 -0

.DS_Store ADDED Viewed

Binary file (6.15 kB). View file

README 2.md ADDED Viewed

	@@ -0,0 +1,111 @@

+---
+title: Document Insights - Extractive & Generative Methods
+emoji: 👑
+colorFrom: indigo
+colorTo: indigo
+sdk: streamlit
+sdk_version: 1.23.0
+app_file: app.py
+pinned: false
+---
+# Template Streamlit App for Haystack Search Pipelines
+This template [Streamlit](https://docs.streamlit.io/) app is set up for simple [Haystack search applications](https://docs.haystack.deepset.ai/docs/semantic_search). The template is ready to do QA with **Retrieval Augmented Generation** or **Extractive QA**.
+See the ['How to use this template'](#how-to-use-this-template) instructions below to create a simple UI for your own Haystack search pipelines.
+Below you will also find instructions on how you could [push this to Hugging Face Spaces 🤗](#pushing-to-hugging-face-spaces-).
+## Installation and Running
+To run the bare application which does _nothing_:
+1. Install requirements: `pip install -r requirements.txt`
+2. Run the streamlit app: `streamlit run app.py`
+This will start up the app on `localhost:8501` where you will find a simple search bar. Before you start editing, you'll notice that the app will only show you instructions on what to edit.
+### Optional Configurations
+You can set optional configurations to set the:
+-  `--task` you want to start the app with: `rag` or `extractive` (default: rag)
+-  `--store` you want to use: `inmemory`, `opensearch`, `weaviate` or `milvus` (default: inmemory)
+-  `--name` you want to have for the app. (default: 'My Search App')
+E.g.:
+```bash
+streamlit run app.py -- --store opensearch --task extractive --name 'My Opensearch Documentation Search'
+```
+In a `.env` file, include all the config settings that you would like to use based on:
+- The DocumentStore of your choice
+- The Extractive/Generative model of your choice
+While `/utils/config.py` creates default values for some configurations, others, such as `OPENAI_KEY`, have to be set in the `.env` file.
+Example `.env`
+```
+OPENAI_KEY=YOUR_KEY
+EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L12-v2
+GENERATIVE_MODEL=text-davinci-003
+```
+## How to use this template
+1. Create a new repository from this template or simply open it in a codespace to start playing around 💙
+2. Make sure your `requirements.txt` file includes the Haystack and Streamlit versions you would like to use.
+3. Change the code in `utils/haystack.py` if you would like a different pipeline.
+4. Create a `.env` file with all of your configuration settings.
+5. Make any UI edits you'd like to and [share with the Haystack community](https://haystack.deepeset.ai/community)
+6. Run the app as shown in [installation and running](#installation-and-running)
+### Repo structure
+- `./utils`: This is where we have 3 files:
+    - `config.py`: This file extracts all of the configuration settings from a `.env` file. For some config settings, it uses default values. An example of this is in [this demo project](https://github.com/TuanaCelik/should-i-follow/blob/main/utils/config.py).
+    - `haystack.py`: Here you will find some functions already set up for you to start creating your Haystack search pipeline. It includes 2 main functions called `start_haystack()` which is what we use to create a pipeline and cache it, and `query()` which is the function called by `app.py` once a user query is received.
+    - `ui.py`: Use this file for any UI and initial value setups.
+- `app.py`: This is the main Streamlit application file that we will run. In its current state it has a simple search bar, a 'Run' button, and a response that you can highlight answers with.
+### What to edit?
+There are default pipelines both in `start_haystack_extractive()` and `start_haystack_rag()`
+- Change the pipelines to use the embedding models, extractive or generative models as you need.
+- If using the `rag` task, change the `default_prompt_template` to use one of our available ones on [PromptHub](https://prompthub.deepset.ai) or create your own `PromptTemplate`
+## Pushing to Hugging Face Spaces 🤗
+Below is an example GitHub action that will let you push your Streamlit app straight to the Hugging Face Hub as a Space.
+A few things to pay attention to:
+1. Create a New Space on Hugging Face with the Streamlit SDK.
+2. Create a Hugging Face token on your HF account.
+3. Create a secret on your GitHub repo called `HF_TOKEN` and put your Hugging Face token here.
+4. If you're using DocumentStores or APIs that require some keys/tokens, make sure these are provided as a secret for your HF Space too!
+5. This readme is set up to tell HF Spaces that it's using Streamlit and that the app runs from `app.py`. Make any changes to the frontmatter of this readme to display the title, emoji, etc. you desire.
+6. Create a file in `.github/workflows/hf_sync.yml`. Here's an example that you can change with your own information, and an [example workflow](https://github.com/TuanaCelik/should-i-follow/blob/main/.github/workflows/hf_sync.yml) working for the [Should I Follow demo](https://huggingface.co/spaces/deepset/should-i-follow)
+```yaml
+name: Sync to Hugging Face hub
+on:
+  push:
+    branches: [main]
+  # to run this workflow manually from the Actions tab
+  workflow_dispatch:
+jobs:
+  sync-to-hub:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v2
+        with:
+          fetch-depth: 0
+          lfs: true
+      - name: Push to hub
+        env:
+          HF_TOKEN: ${{ secrets.HF_TOKEN }}
+        run: git push --force https://{YOUR_HF_USERNAME}:$HF_TOKEN@{YOUR_HF_SPACE_REPO} main
+```
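
Note on the README's "Optional Configurations": the three flags it describes can be sketched with `argparse` as below. This is a minimal sketch, not the parser shipped in this commit: `utils/config.py` here defines only `--store` and `--name`, so the `--task` flag, the `build_parser` name, and the description string are assumptions taken from the README text.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the options listed in the README.
    parser = argparse.ArgumentParser(description="Haystack search app options")
    parser.add_argument("--task", choices=("rag", "extractive"), default="rag",
                        help="Task to start the app with (default: %(default)s)")
    parser.add_argument("--store",
                        choices=("inmemory", "opensearch", "weaviate", "milvus"),
                        default="inmemory",
                        help="DocumentStore selection (default: %(default)s)")
    parser.add_argument("--name", default="My Search App",
                        help="Title shown in the app (default: %(default)s)")
    return parser


# Equivalent to: streamlit run app.py -- --store opensearch --task extractive
example_args = build_parser().parse_args(
    ["--store", "opensearch", "--task", "extractive"]
)
print(example_args.task, example_args.store, example_args.name)
```

Streamlit forwards everything after the bare `--` to the script, which is why the README example places the flags there.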

app.py ADDED Viewed

	@@ -0,0 +1,284 @@

+from utils.check_pydantic_version import use_pydantic_v1
+use_pydantic_v1()  # Must run before importing haystack, because haystack requires pydantic v1
+from operator import index
+import streamlit as st
+import logging
+import os
+from annotated_text import annotation
+from json import JSONDecodeError
+from markdown import markdown
+from utils.config import parser
+from utils.haystack import start_document_store, query, initialize_pipeline, start_preprocessor_node, start_retriever, start_reader
+from utils.ui import reset_results, set_initial_state
+import pandas as pd
+import haystack
+from datetime import datetime
+import streamlit.components.v1 as components
+import streamlit_authenticator as stauth
+import pickle
+from streamlit_modal import Modal
+import numpy as np
+names = ['mlreply']
+usernames = ['docwhiz']
+with open('hashed_password.pkl','rb') as f:
+    hashed_passwords = pickle.load(f)
+# Whether the file upload should be enabled or not
+DISABLE_FILE_UPLOAD = bool(os.getenv("DISABLE_FILE_UPLOAD"))
+def show_documents_list(retrieved_documents):
+    data = []
+    for i, document in enumerate(retrieved_documents):
+        data.append([document.meta['name']])
+    df = pd.DataFrame(data, columns=['Uploaded Document Name'])
+    df.drop_duplicates(subset=['Uploaded Document Name'], inplace=True)
+    df.index = np.arange(1, len(df) + 1)
+    return df
+# Define a function to handle file uploads
+def upload_files():
+    uploaded_files = upload_container.file_uploader(
+            "upload", type=["pdf", "txt", "docx"], accept_multiple_files=True, label_visibility="hidden", key=1
+        )
+    return uploaded_files
+# Define a function to process a single file
+def process_file(data_file, preprocesor, document_store):
+    # read file and add content
+    file_contents = data_file.read().decode("utf-8")
+    docs = [{
+        'content': str(file_contents),
+        'meta': {'name': str(data_file.name)}
+    }]
+    try:
+        names = [item.meta.get('name') for item in document_store.get_all_documents()]
+        #if args.store == 'inmemory':
+        # doc = converter.convert(file_path=files, meta=None)
+        if data_file.name in names:
+            print(f"{data_file.name} already processed")
+        else:
+            print(f'preprocessing uploaded doc {data_file.name}.......')
+            #print(data_file.read().decode("utf-8"))
+            preprocessed_docs = preprocesor.process(docs)
+            print('writing to document store.......')
+            document_store.write_documents(preprocessed_docs)
+            print('updating embeddings.......')
+            document_store.update_embeddings(retriever)
+    except Exception as e:
+        print(e)
+# Define a function to upload the documents to haystack document store
+def upload_document():
+    if data_files is not None:
+        for data_file in data_files:
+            # Upload file
+            if data_file:
+                try:
+                    #raw_json = upload_doc(data_file)
+                    # Call the process_file function for each uploaded file
+                    if args.store == 'inmemory':
+                        processed_data = process_file(data_file, preprocesor, document_store)
+                    #upload_container.write(str(data_file.name) + " &nbsp;&nbsp; ✅ ")
+                except Exception as e:
+                    upload_container.write(str(data_file.name) + " &nbsp;&nbsp; ❌ ")
+                    upload_container.write("_This file could not be parsed, see the logs for more information._")
+# Define a function to reset the documents in haystack document store
+def reset_documents():
+    print('\nResetting documents list at ' + str(datetime.now()) + '\n')
+    st.session_state.data_files = None
+    document_store.delete_documents()
+try:
+    args = parser.parse_args()
+    preprocesor = start_preprocessor_node()
+    document_store = start_document_store(type=args.store)
+    document_store.get_all_documents()
+    retriever = start_retriever(document_store)
+    reader = start_reader()
+    st.set_page_config(
+        page_title="MLReplySearch",
+        layout="centered",
+        page_icon=":shark:",
+        menu_items={
+            'Get Help': 'https://www.extremelycoolapp.com/help',
+            'Report a bug': "https://www.extremelycoolapp.com/bug",
+            'About': "# This is a header. This is an *extremely* cool app!"
+        }
+    )
+    st.sidebar.image("ml_logo.png", use_column_width=True)
+    authenticator = stauth.Authenticate(names, usernames, hashed_passwords, "document_search", "random_text", cookie_expiry_days=1)
+    name, authentication_status, username = authenticator.login("Login", "main")
+    if authentication_status is False:
+        st.error("Username/Password is incorrect")
+    if authentication_status is None:
+        st.warning("Please enter your username and password")
+    if authentication_status:
+        # Sidebar for Task Selection
+        st.sidebar.header('Options:')
+        # OpenAI Key Input
+        openai_key = st.sidebar.text_input("Enter LLM-authorization Key:", type="password")
+        if openai_key:
+            task_options = ['Extractive', 'Generative']
+        else:
+            task_options = ['Extractive']
+        task_selection = st.sidebar.radio('Select the task:', task_options)
+        # Check the task and initialize pipeline accordingly
+        if task_selection == 'Extractive':
+            pipeline_extractive = initialize_pipeline("extractive", document_store, retriever, reader)
+        elif task_selection == 'Generative' and openai_key:  # Check for openai_key to ensure user has entered it
+            pipeline_rag = initialize_pipeline("rag", document_store, retriever, reader, openai_key=openai_key)
+        set_initial_state()
+        modal = Modal("Manage Files", key="demo-modal")
+        open_modal = st.sidebar.button("Manage Files", use_container_width=True)
+        if open_modal:
+            modal.open()
+        st.write('# ' + args.name)
+        if modal.is_open():
+            with modal.container():
+                if not DISABLE_FILE_UPLOAD:
+                    upload_container = st.container()
+                    data_files = upload_files()
+                    upload_document()
+                    st.session_state.sidebar_state = 'collapsed'
+                st.table(show_documents_list(document_store.get_all_documents()))
+        # File upload block
+       # if not DISABLE_FILE_UPLOAD:
+        #    upload_container = st.sidebar.container()
+         #   upload_container.write("## File Upload:")
+          #  data_files = upload_files()
+            # Button to update files in the documentStore
+           # upload_container.button('Upload Files', on_click=upload_document, args=())
+        # Button to reset the documents in DocumentStore
+        st.sidebar.button("Reset documents", on_click=reset_documents, args=(), use_container_width=True)
+        if "question" not in st.session_state:
+            st.session_state.question = ""
+        # Search bar
+        question = st.text_input("Question", value=st.session_state.question, max_chars=100, on_change=reset_results, label_visibility="hidden")
+        run_pressed = st.button("Run")
+        run_query = (
+            run_pressed or question != st.session_state.question #or task_selection != st.session_state.task
+        )
+        # Get results for query
+        if run_query and question:
+            if task_selection == 'Extractive':
+                reset_results()
+                st.session_state.question = question
+                with st.spinner("🔎 &nbsp;&nbsp; Running your pipeline"):
+                    try:
+                        st.session_state.results_extractive = query(pipeline_extractive, question)
+                        st.session_state.task = task_selection
+                    except JSONDecodeError as je:
+                        st.error(
+                            "👓 &nbsp;&nbsp; An error occurred reading the results. Is the document store working?"
+                        )
+                    except Exception as e:
+                        logging.exception(e)
+                        st.error("🐞 &nbsp;&nbsp; An error occurred during the request.")
+            elif task_selection == 'Generative':
+                reset_results()
+                st.session_state.question = question
+                with st.spinner("🔎 &nbsp;&nbsp; Running your pipeline"):
+                    try:
+                        st.session_state.results_generative = query(pipeline_rag, question)
+                        st.session_state.task = task_selection
+                    except JSONDecodeError as je:
+                        st.error(
+                            "👓 &nbsp;&nbsp; An error occurred reading the results. Is the document store working?"
+                        )
+                    except Exception as e:
+                        if "API key is invalid" in str(e):
+                            logging.exception(e)
+                            st.error("🐞 &nbsp;&nbsp; incorrect API key provided. You can find your API key at https://platform.openai.com/account/api-keys.")
+                        else:
+                            logging.exception(e)
+                            st.error("🐞 &nbsp;&nbsp; An error occurred during the request.")
+        # Display results
+        if (st.session_state.results_extractive or st.session_state.results_generative) and run_query:
+            # Handle Extractive Answers
+            if task_selection == 'Extractive':
+                results = st.session_state.results_extractive
+                st.subheader("Extracted Answers:")
+                if 'answers' in results:
+                    answers = results['answers']
+                    threshold = 0.2
+                    higher_than_threshold = any(ans.score > threshold for ans in answers)
+                    if not higher_than_threshold:
+                        st.markdown(f"<span style='color:red'>Please note that none of the answers achieved a score higher than {int(threshold * 100)}%, which probably means that the desired answer is not in the searched documents.</span>", unsafe_allow_html=True)
+                    for count, answer in enumerate(answers):
+                        if answer.answer:
+                            text, context = answer.answer, answer.context
+                            start_idx = context.find(text)
+                            end_idx = start_idx + len(text)
+                            score = round(answer.score, 3)
+                            st.markdown(f"**Answer {count + 1}:**")
+                            st.markdown(
+                                context[:start_idx] + str(annotation(body=text, label=f'SCORE {score}', background='#964448', color='#ffffff')) + context[end_idx:],
+                                unsafe_allow_html=True,
+                            )
+                        else:
+                            st.info(
+                                "🤔 &nbsp;&nbsp; Haystack is unsure whether any of the documents contain an answer to your question. Try to reformulate it!"
+                            )
+            # Handle Generative Answers
+            elif task_selection == 'Generative':
+                results = st.session_state.results_generative
+                st.subheader("Generated Answer:")
+                if 'results' in results:
+                    st.markdown("**Answer:**")
+                    st.write(results['results'][0])
+            # Handle Retrieved Documents
+            if 'documents' in results:
+                retrieved_documents = results['documents']
+                st.subheader("Retriever Results:")
+                data = []
+                for i, document in enumerate(retrieved_documents):
+                    # Truncate the content
+                    truncated_content = (document.content[:150] + '...') if len(document.content) > 150 else document.content
+                    data.append([i + 1, document.meta['name'], truncated_content])
+                # Convert data to DataFrame and display using Streamlit
+                df = pd.DataFrame(data, columns=['Ranked Context', 'Document Name', 'Content'])
+                st.table(df)
+except SystemExit as e:
+    os._exit(e.code)
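
Note on the extractive answer rendering in `app.py`: the context is sliced around the answer using `context.find(text)`. `str.find` returns `-1` when the answer string does not occur verbatim in the context, and the slices built from that index would then be wrong. A minimal sketch of a guarded version follows; the helper name `split_around_answer` is hypothetical and not part of the commit.

```python
def split_around_answer(context, answer):
    """Return (before, match, after) so the match can be highlighted.

    If the answer is not found verbatim, the whole context is returned
    as `before` and the other two parts are empty strings.
    """
    start = context.find(answer)
    if start == -1:
        return context, "", ""
    end = start + len(answer)
    return context[:start], context[start:end], context[end:]


before, match, after = split_around_answer("The sky is blue today", "blue")
print(repr(before), repr(match), repr(after))
```

The caller could then wrap only `match` in the annotation and skip highlighting when it is empty.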

generate_keys.py ADDED Viewed

	@@ -0,0 +1,15 @@

+# -*- coding: utf-8 -*-
+import pickle
+from pathlib import Path
+import streamlit_authenticator as stauth
+names = ['mlreply']
+usernames = ['docwhiz']
+passwords = ['Docwhiz']
+hashed_passwords = stauth.Hasher((passwords)).generate()
+with open('hashed_password.pkl','wb') as f:
+    pickle.dump(hashed_passwords, f)
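
Note on `generate_keys.py`: the script hashes a password once and stores only the hash, which `app.py` later loads for login. The hash-then-verify idea can be sketched with the standard library alone. This is a stand-in for illustration, using PBKDF2 from `hashlib`; it is not the scheme `streamlit_authenticator` uses internally, and the function names and the iteration count are assumptions.

```python
import hashlib
import hmac
import os


def hash_password(password, salt=None, iterations=200_000):
    # A fresh random salt is generated unless one is supplied.
    salt = os.urandom(16) if salt is None else salt
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, iterations, digest


def verify_password(password, salt, iterations, digest):
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(candidate, digest)


salt, iterations, digest = hash_password("example-password")
print(verify_password("example-password", salt, iterations, digest))
```

In a real deployment the plaintext would typically come from a prompt or a secret store rather than from a literal in a committed file.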

hashed_password.pkl ADDED Viewed

Binary file (78 Bytes). View file

ml_logo.png ADDED Viewed

requirements.txt ADDED Viewed

	@@ -0,0 +1,127 @@

+accelerate==0.24.1
+aiohttp==3.8.6
+aiosignal==1.3.1
+altair==5.1.2
+annotated-types==0.6.0
+appdirs==1.4.4
+argon2-cffi==23.1.0
+argon2-cffi-bindings==21.2.0
+async-timeout==4.0.3
+attrs==23.1.0
+Authlib==1.2.1
+backoff==2.2.1
+blinker==1.7.0
+boilerpy3==1.0.7
+cachetools==5.3.2
+canals==0.7.0
+cattrs==23.1.2
+certifi==2023.7.22
+cffi==1.16.0
+charset-normalizer==3.3.2
+click==8.1.7
+cryptography==41.0.5
+datasets==2.15.0
+dill==0.3.7
+docopt==0.6.2
+environs==9.5.0
+Events==0.5
+farm-haystack==1.20.0
+filelock==3.13.1
+frozenlist==1.4.0
+fsspec==2023.10.0
+gitdb==4.0.11
+GitPython==3.1.40
+grpcio==1.58.0
+htbuilder==0.6.2
+huggingface-hub==0.19.4
+idna==3.4
+importlib-metadata==6.8.0
+inflect==7.0.0
+Jinja2==3.1.2
+joblib==1.3.2
+jsonschema==4.20.0
+jsonschema-specifications==2023.11.1
+lazy-imports==0.3.1
+Markdown==3.5.1
+markdown-it-py==3.0.0
+MarkupSafe==2.1.3
+marshmallow==3.20.1
+mdurl==0.1.2
+milvus-haystack==0.0.2
+minio==7.2.0
+monotonic==1.6
+more-itertools==10.1.0
+mpmath==1.3.0
+multidict==6.0.4
+multiprocess==0.70.15
+networkx==3.2.1
+nltk==3.8.1
+num2words==0.5.13
+numpy==1.26.2
+opensearch-py==2.4.1
+packaging==23.2
+pandas==2.1.3
+Pillow==9.5.0
+platformdirs==4.0.0
+posthog==3.0.2
+prompthub-py==4.0.0
+protobuf==4.25.1
+psutil==5.9.6
+pyarrow==14.0.1
+pyarrow-hotfix==0.5
+pycparser==2.21
+pycryptodome==3.19.0
+pydantic==1.10.13
+pydantic_core==2.14.3
+pydeck==0.8.1b0
+Pygments==2.16.1
+pymilvus==2.3.3
+Pympler==1.0.1
+python-dateutil==2.8.2
+python-dotenv==1.0.0
+pytz==2023.3.post1
+pytz-deprecation-shim==0.1.0.post0
+PyYAML==6.0.1
+quantulum3==0.9.0
+rank-bm25==0.2.2
+referencing==0.31.0
+regex==2023.10.3
+requests==2.31.0
+requests-cache==0.9.8
+rich==13.7.0
+rpds-py==0.13.0
+safetensors==0.3.3.post1
+scikit-learn==1.3.2
+scipy==1.11.3
+sentence-transformers==2.2.2
+sentencepiece==0.1.99
+six==1.16.0
+smmap==5.0.1
+sseclient-py==1.8.0
+st-annotated-text==4.0.1
+streamlit==1.23.0
+sympy==1.12
+tenacity==8.2.3
+threadpoolctl==3.2.0
+tiktoken==0.5.1
+tokenizers==0.13.3
+toml==0.10.2
+toolz==0.12.0
+torch==2.1.1
+torchvision==0.16.1
+tornado==6.3.3
+tqdm==4.66.1
+transformers==4.32.1
+typing_extensions==4.8.0
+tzdata==2023.3
+tzlocal==4.3.1
+ujson==5.8.0
+url-normalize==1.4.3
+urllib3==2.1.0
+validators==0.22.0
+weaviate-client==3.25.3
+xxhash==3.4.1
+yarl==1.9.2
+zipp==3.17.0
+streamlit-authenticator==0.1.5
+streamlit-modal==0.1.0

utils/.DS_Store ADDED Viewed

Binary file (6.15 kB). View file

utils/__pycache__/check_pydantic_version.cpython-310.pyc ADDED Viewed

Binary file (1.04 kB). View file

utils/__pycache__/check_pydantic_version.cpython-311.pyc ADDED Viewed

Binary file (2.04 kB). View file

utils/__pycache__/check_pydantic_version.cpython-39.pyc ADDED Viewed

Binary file (1.02 kB). View file

utils/__pycache__/config.cpython-310.pyc ADDED Viewed

Binary file (1.51 kB). View file

utils/__pycache__/config.cpython-311.pyc ADDED Viewed

Binary file (2.51 kB). View file

utils/__pycache__/config.cpython-39.pyc ADDED Viewed

Binary file (1.51 kB). View file

utils/__pycache__/haystack.cpython-310.pyc ADDED Viewed

Binary file (3.61 kB). View file

utils/__pycache__/haystack.cpython-311.pyc ADDED Viewed

Binary file (5.81 kB). View file

utils/__pycache__/haystack.cpython-39.pyc ADDED Viewed

Binary file (3.61 kB). View file

utils/__pycache__/ui.cpython-310.pyc ADDED Viewed

Binary file (739 Bytes). View file

utils/__pycache__/ui.cpython-311.pyc ADDED Viewed

Binary file (1.14 kB). View file

utils/check_pydantic_version.py ADDED Viewed

	@@ -0,0 +1,26 @@

+import pydantic
+import os
+import fileinput
+def replace_string_in_files(folder_path, old_str, new_str):
+    for subdir, dirs, files in os.walk(folder_path):
+        for file in files:
+            file_path = os.path.join(subdir, file)
+            # Check if the file is a text file (you can modify this condition based on your needs)
+            if file.endswith(".txt") or file.endswith(".py"):
+                # Open the file in place for editing
+                with fileinput.FileInput(file_path, inplace=True) as f:
+                    for line in f:
+                        # Replace the old string with the new string
+                        print(line.replace(old_str, new_str), end='')
+def use_pydantic_v1():
+    module_file_path = pydantic.__file__
+    module_file_path = module_file_path.split('pydantic')[0] + 'haystack'
+    with open(module_file_path+'/schema.py','r') as f:
+        haystack_schema_file = f.read()
+    if 'from pydantic.v1' not in haystack_schema_file:
+        replace_string_in_files(module_file_path, 'from pydantic', 'from pydantic.v1')
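
Note on `replace_string_in_files`: a plain `str.replace('from pydantic', 'from pydantic.v1')` also matches `from pydantic_core` and would turn an already patched `from pydantic.v1` into `from pydantic.v1.v1` if it ran a second time. The commit guards against re-running by checking `schema.py` only. A minimal sketch of a per-line idempotent rewrite with a regular expression follows; the function name `to_pydantic_v1` is hypothetical.

```python
import re

# Match "from pydantic" only when it is followed by "." or whitespace
# (so "pydantic_core" is skipped) and not already followed by ".v1".
_PYDANTIC_IMPORT = re.compile(r"\bfrom pydantic(?!\.v1\b)(?=[.\s])")


def to_pydantic_v1(source):
    return _PYDANTIC_IMPORT.sub("from pydantic.v1", source)


sample = (
    "from pydantic import BaseModel\n"
    "from pydantic.fields import Field\n"
    "from pydantic_core import core_schema\n"
)
patched = to_pydantic_v1(sample)
print(patched)
```

Because already patched lines no longer match, applying the function twice gives the same result as applying it once.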

utils/config.py ADDED Viewed

	@@ -0,0 +1,43 @@

+import argparse
+import os
+from dotenv import load_dotenv
+load_dotenv()
+parser = argparse.ArgumentParser(description='This app lists animals')
+document_store_choices = ('inmemory', 'weaviate', 'milvus', 'opensearch')
+parser.add_argument('--store', choices=document_store_choices, default='inmemory', help='DocumentStore selection (default: %(default)s)')
+parser.add_argument('--name', default="Document Insights: Extractive & Generative Methods")
+model_configs = {
+    'EMBEDDING_MODEL': os.getenv("EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L12-v2"),
+    'GENERATIVE_MODEL': os.getenv("GENERATIVE_MODEL", "gpt-4"),
+    #'EXTRACTIVE_MODEL': os.getenv("EXTRACTIVE_MODEL", "deepset/roberta-base-squad2"),
+    'EXTRACTIVE_MODEL': os.getenv("EXTRACTIVE_MODEL", "deepset/gelectra-large-germanquad"),
+    #'EXTRACTIVE_MODEL': os.getenv("EXTRACTIVE_MODEL", "MachineLearningReply/bert-base-german-legal-qa"),
+    'OPENAI_KEY': os.getenv("OPENAI_KEY"),
+    'COHERE_KEY': os.getenv("COHERE_KEY"),
+}
+document_store_configs = {
+# Weaviate Config
+'WEAVIATE_HOST':  os.getenv("WEAVIATE_HOST", "http://localhost"),
+# os.getenv returns a string when the variable is set, so numeric settings are cast to int
+'WEAVIATE_PORT': int(os.getenv("WEAVIATE_PORT", 8080)),
+'WEAVIATE_INDEX': os.getenv("WEAVIATE_INDEX", "Document"),
+'WEAVIATE_EMBEDDING_DIM': int(os.getenv("WEAVIATE_EMBEDDING_DIM", 768)),
+# OpenSearch Config
+'OPENSEARCH_SCHEME': os.getenv("OPENSEARCH_SCHEME",  "https"),
+'OPENSEARCH_USERNAME': os.getenv("OPENSEARCH_USERNAME", "admin"),
+'OPENSEARCH_PASSWORD': os.getenv("OPENSEARCH_PASSWORD", "admin"),
+'OPENSEARCH_HOST': os.getenv("OPENSEARCH_HOST", "localhost"),
+'OPENSEARCH_PORT': int(os.getenv("OPENSEARCH_PORT", 9200)),
+'OPENSEARCH_INDEX':  os.getenv("OPENSEARCH_INDEX", "document"),
+'OPENSEARCH_EMBEDDING_DIM': int(os.getenv("OPENSEARCH_EMBEDDING_DIM", 768)),
+# Milvus Config
+'MILVUS_URI': os.getenv("MILVUS_URI", "http://localhost:19530/default"),
+'MILVUS_INDEX':  os.getenv("MILVUS_INDEX", "document"),
+'MILVUS_EMBEDDING_DIM': int(os.getenv("MILVUS_EMBEDDING_DIM", 768)),
+}
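
Note on environment-based settings: `os.getenv` always returns a string when the variable is set, so a numeric setting arrives as `"384"` rather than `384`, and a flag such as `DISABLE_FILE_UPLOAD=false` is truthy under `bool(...)` because the string is non-empty. A minimal sketch of typed readers follows; the helper names `env_int` and `env_flag` and the accepted flag spellings are assumptions, not part of the commit.

```python
import os


def env_int(name, default, environ=os.environ):
    # Fall back to the default when the variable is unset or blank.
    raw = environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    return int(raw)


def env_flag(name, default=False, environ=os.environ):
    # Only a small set of spellings counts as "on"; anything else is off.
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


fake_env = {"OPENSEARCH_EMBEDDING_DIM": "384", "DISABLE_FILE_UPLOAD": "false"}
print(env_int("OPENSEARCH_EMBEDDING_DIM", 768, fake_env))
print(env_flag("DISABLE_FILE_UPLOAD", environ=fake_env))
print(bool(fake_env["DISABLE_FILE_UPLOAD"]))
```

Passing the mapping explicitly, as with `fake_env` here, keeps the helpers easy to exercise without touching the real process environment.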

utils/haystack.py ADDED Viewed

	@@ -0,0 +1,124 @@

+import streamlit as st
+from utils.config import document_store_configs, model_configs
+from haystack import Pipeline
+from haystack.schema import Answer
+from haystack.document_stores import BaseDocumentStore
+from haystack.document_stores import InMemoryDocumentStore, OpenSearchDocumentStore, WeaviateDocumentStore
+from haystack.nodes import EmbeddingRetriever, FARMReader, PromptNode, PreProcessor
+#from haystack.nodes import TextConverter, FileTypeClassifier, PDFToTextConverter
+from milvus_haystack import MilvusDocumentStore
+#Use this file to set up your Haystack pipeline and querying
+@st.cache_resource(show_spinner=False)
+def start_preprocessor_node():
+    print('initializing preprocessor node')
+    processor = PreProcessor(
+        clean_empty_lines= True,
+        clean_whitespace=True,
+        clean_header_footer=True,
+        #remove_substrings=None,
+        split_by="word",
+        split_length=100,
+        split_respect_sentence_boundary=True,
+        #split_overlap=0,
+        #max_chars_check= 10_000
+    )
+    return processor
+    #return docs
+@st.cache_resource(show_spinner=False)
+def start_document_store(type: str):
+    #This function starts the documents store of your choice based on your command line preference
+    print('initializing document store')
+    if type == 'inmemory':
+        document_store = InMemoryDocumentStore(use_bm25=True, embedding_dim=384)
+        '''
+        documents = [
+            {
+                'content': "Pi is a super dog",
+                'meta': {'name': "pi.txt"}
+            },
+            {
+                'content': "The revenue of siemens is 5 milion Euro",
+                'meta': {'name': "siemens.txt"}
+            },
+        ]
+        document_store.write_documents(documents)
+        '''
+    elif type == 'opensearch':
+        document_store = OpenSearchDocumentStore(scheme = document_store_configs['OPENSEARCH_SCHEME'],
+                                                 username = document_store_configs['OPENSEARCH_USERNAME'],
+                                                 password = document_store_configs['OPENSEARCH_PASSWORD'],
+                                                 host = document_store_configs['OPENSEARCH_HOST'],
+                                                 port = document_store_configs['OPENSEARCH_PORT'],
+                                                 index = document_store_configs['OPENSEARCH_INDEX'],
+                                                 embedding_dim = document_store_configs['OPENSEARCH_EMBEDDING_DIM'])
+    elif type == 'weaviate':
+        document_store = WeaviateDocumentStore(host = document_store_configs['WEAVIATE_HOST'],
+                                                port = document_store_configs['WEAVIATE_PORT'],
+                                                index = document_store_configs['WEAVIATE_INDEX'],
+                                                embedding_dim = document_store_configs['WEAVIATE_EMBEDDING_DIM'])
+    elif type == 'milvus':
+        document_store = MilvusDocumentStore(uri = document_store_configs['MILVUS_URI'],
+                                            index = document_store_configs['MILVUS_INDEX'],
+                                            embedding_dim = document_store_configs['MILVUS_EMBEDDING_DIM'],
+                                            return_embedding=True)
+    return document_store
+# cached to make index and models load only at start
+@st.cache_resource(show_spinner=False)
+def start_retriever(_document_store: BaseDocumentStore):
+    print('initializing retriever')
+    retriever = EmbeddingRetriever(document_store=_document_store,
+                                   embedding_model=model_configs['EMBEDDING_MODEL'],
+                                   top_k=5)
+    #
+    #_document_store.update_embeddings(retriever)
+    return retriever
+@st.cache_resource(show_spinner=False)
+def start_reader():
+    print('initializing reader')
+    reader = FARMReader(model_name_or_path=model_configs['EXTRACTIVE_MODEL'])
+    return reader
+# cached to make index and models load only at start
+@st.cache_resource(show_spinner=False)
+def start_haystack_extractive(_document_store: BaseDocumentStore, _retriever: EmbeddingRetriever, _reader: FARMReader):
+    print('initializing pipeline')
+    pipe = Pipeline()
+    pipe.add_node(component=_retriever, name="Retriever", inputs=["Query"])
+    pipe.add_node(component= _reader, name="Reader", inputs=["Retriever"])
+    return pipe
+@st.cache_resource(show_spinner=False)
+def start_haystack_rag(_document_store: BaseDocumentStore, _retriever: EmbeddingRetriever, openai_key):
+    prompt_node = PromptNode(default_prompt_template="deepset/question-answering",
+                             model_name_or_path=model_configs['GENERATIVE_MODEL'],
+                             api_key=openai_key,
+                             max_length=500)
+    pipe = Pipeline()
+    pipe.add_node(component=_retriever, name="Retriever", inputs=["Query"])
+    pipe.add_node(component=prompt_node, name="PromptNode", inputs=["Retriever"])
+    return pipe
+#@st.cache_data(show_spinner=True)
+def query(_pipeline, question):
+    params = {}
+    results = _pipeline.run(question, params=params)
+    return results
+def initialize_pipeline(task, document_store, retriever, reader, openai_key = ""):
+    if task == 'extractive':
+        return start_haystack_extractive(document_store, retriever, reader)
+    elif task == 'rag':
+        return start_haystack_rag(document_store, retriever, openai_key)
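
Note on the preprocessor settings in `start_preprocessor_node` (`split_by="word"`, `split_length=100`, `split_respect_sentence_boundary=True`): the intent is to cut documents into chunks of roughly 100 words without breaking sentences. The toy function below illustrates that idea only. It is not Haystack's `PreProcessor` implementation, the sentence splitter is a naive regular expression, and the name `split_by_words` is hypothetical.

```python
import re


def split_by_words(text, split_length=100):
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and count + n_words > split_length:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


demo = split_by_words("One two three. Four five. Six seven eight nine.", split_length=5)
print(demo)
```

A sentence longer than `split_length` ends up alone in an oversized chunk in this sketch, since sentences are never cut.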

utils/ui.py ADDED Viewed

	@@ -0,0 +1,16 @@

+import streamlit as st
+def set_state_if_absent(key, value):
+    if key not in st.session_state:
+        st.session_state[key] = value
+def set_initial_state():
+    set_state_if_absent("question", "Ask something here?")
+    set_state_if_absent("results_extractive", None)
+    set_state_if_absent("results_generative", None)
+    set_state_if_absent("task", None)
+def reset_results(*args):
+    st.session_state.results_extractive = None
+    st.session_state.results_generative = None
+    st.session_state.task = None