This notebook shows how to upload files, create or update the vector index, and then ask the LLM questions.

In [1]:
import os, json
import requests
from rich import print
from rich.pretty import pprint

In [2]:
data_dir = "data"  # data to play with inside notebooks 

In [3]:
filelist = [
    'ATT_SEC_AnnualReport_2022.pdf',
    'ATT_StockAnalystNote_Annual_20230125.pdf',
    'ATT_CompanyReport_Annual_20230126.pdf',
    'AMZN_MS_CompanyReport_Annual_20230203.pdf',
    'AMZN_Morning Star_StockAnalystNote_20230203.pdf',
    'AMZN_Moodys_CreditRating_2023.pdf',
    'AMZN_Morning Star_Transcript_Annual.pdf'
 ]

## Upload files

In [10]:
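def upload_files(data_dir, filelist, url, show_content=False, n=2):
    """POST up to the first n files of filelist (a name or a list of names) to url."""
    if isinstance(filelist, str):
        filelist = [filelist]  # accept a single filename
    for filename in filelist[:n]:
        file_path = os.path.join(data_dir, filename)
        if not os.path.isfile(file_path):
            # report missing files instead of skipping them silently
            pprint(f"Skipped {filename}: not found in {data_dir}")
            continue
        with open(file_path, 'rb') as f:
            files = {'file': (filename, f)}
            response = requests.post(url, files=files)
        pprint(f"Uploaded {filename} with response {response.status_code}")
        if show_content:
            pprint(json.loads(response.text))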

In [7]:
!curl -X GET http://localhost:8003/ping/

{"answer":"78"}

In [40]:
# wipe everything: the files in 'data' and the vectorstore
!curl -X DELETE http://localhost:8003/erase_data/
!curl -X DELETE http://localhost:8003/empty_collection/

{"message":"All data has been erased"}{"message":"Collection erased!"}

In [12]:
upload_url = 'http://localhost:8003/upload/'

upload_files(data_dir, filelist[1], upload_url, show_content=True)

In [8]:
!curl -X GET http://localhost:8003/list_files/

{"files":["ATT_StockAnalystNote_Annual_20230125.pdf"]}

In [15]:
# uploading files stores their embeddings in a parquet file
# once enough files have been uploaded, create the index
# the parquet file is then deleted so that more files can be uploaded incrementally
!curl -X POST http://localhost:8003/create_index/

{"message":"Index creation successful"}

In [22]:
response = !curl -X POST http://localhost:8003/ask/ -H "Content-Type: application/json" -d '{"question": "Amazon 2024 forecast"}' 
pprint(json.loads(response[-1]))

In [None]:
response = !curl -X POST "http://localhost:8003/ragit/"  -H "Content-Type: application/json" -d '{"question": "Amazon 2024 forecast?"}'

In [26]:
print(json.loads(response[-1])['answer'])
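
The `!curl` cells above parse `response[-1]`, which assumes the JSON body is the last line curl prints. As a possible alternative, here is a minimal sketch that posts the question directly from Python using only the standard library. The helper names `build_question_request` and `ask` are hypothetical. The endpoints (`/ask/`, `/ragit/`), the `question` field and the `answer` key come from the cells in this notebook, and the default base URL assumes the same local server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8003"  # assumption: the local server used in this notebook


def build_question_request(question, endpoint="ask", base_url=BASE_URL):
    """Build (but do not send) the POST request issued by the curl cells."""
    payload = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{endpoint}/",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, endpoint="ask", base_url=BASE_URL, timeout=120):
    """Send the question and return the decoded JSON body as a dict."""
    request = build_question_request(question, endpoint, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `ask("Amazon 2024 forecast?", endpoint="ragit")["answer"]` should return the same text the curl cell prints.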

Despite high similarity scores, the vector search is completely off the mark, because the indexed data contains nothing about Amazon: only an AT&T file has been uploaded so far. I don't see how vector search alone could overcome this, since it returns the nearest chunks however irrelevant they are. With one more step, RAG, I can instruct the model to check the retrieved passages for relevance and decide whether to answer at all.
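
The relevance check described above can be sketched as a prompt that tells the model to refuse when the retrieved passages do not cover the question. This is only an illustration: the notebook does not show the prompt behind `/ragit/`, so the wording, the `NO_ANSWER` sentinel and the helper names `build_rag_prompt` and `is_refusal` are all hypothetical.

```python
NO_ANSWER = "NO_ANSWER"  # hypothetical sentinel, not taken from the service


def build_rag_prompt(question, passages):
    """Return a prompt asking the model to answer only from relevant passages."""
    context = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the passages below.\n"
        f"If the passages are not relevant to the question, reply exactly {NO_ANSWER}.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


def is_refusal(model_output):
    """True when the model judged the passages irrelevant."""
    return model_output.strip() == NO_ANSWER


# The passage is a placeholder, not real retrieved text.
prompt = build_rag_prompt(
    "Amazon 2024 forecast?",
    ["Placeholder text standing in for a retrieved AT&T chunk."],
)
print(prompt)
```

The caller would send `prompt` to the LLM and use `is_refusal` on the reply to decide whether to show an answer or report that the indexed documents do not cover the question.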

Now, let's add a file with Amazon's financial report.

In [None]:
upload_url = 'http://localhost:8003/upload/'
upload_files(data_dir, filelist[3], upload_url, show_content=True)
!curl -X POST http://localhost:8003/create_index/
response = !curl -X POST "http://localhost:8003/ask/"  -H "Content-Type: application/json" -d '{"question": "Amazon 2024 forecast?"}'

In [13]:
response = !curl -X POST "http://localhost:8003/ask/"  -H "Content-Type: application/json" -d '{"question": "Amazon 2024 forecast?"}'

We can see that Guardrails' 'QA Relevance LLM Eval' does not work well here. Since it uses an LLM anyway, I prefer to use my own prompting (with ragit).

In [14]:
print(json.loads(response[-1])['answer'])

In [20]:
filelist[3]

'AMZN_MS_CompanyReport_Annual_20230203.pdf'

In [12]:
upload_url = 'http://localhost:8003/upload/'
upload_files(data_dir, filelist[3], upload_url, show_content=True)
!curl -X POST http://localhost:8003/create_index/
response = !curl -X POST "http://localhost:8003/ragit/"  -H "Content-Type: application/json" -d '{"question": "Amazon 2024 forecast?"}'
print(json.loads(response[-1])['answer'])

{"message":"Index creation successful"}

### All files at once

In [16]:
# does not affect the vectorstore, but it will destroy the parquet file with the embeddings
# so make sure to create the index first
!curl -X DELETE http://localhost:8003/erase_data/

{"message":"All data has been erased"}

In [17]:
!curl -X DELETE http://localhost:8003/empty_collection/

{"message":"Collection erased!"}

In [23]:
upload_files(data_dir, filelist, upload_url, show_content=False, n=100)

All the files have been uploaded, and their embeddings are now in the parquet file. We can now push them into the vectorstore.

In [24]:
!curl -X GET http://localhost:8003/list_files/

{"files":["ATT_SEC_AnnualReport_2022.pdf","text_vectors.parquet",".DS_Store","ATT_StockAnalystNote_Annual_20230125.pdf","ATT_CompanyReport_Annual_20230126.pdf","AMZN_MS_CompanyReport_Annual_20230203.pdf","AMZN_Morning Star_StockAnalystNote_20230203.pdf","AMZN_Moodys_CreditRating_2023.pdf","AMZN_Morning Star_Transcript_Annual.pdf"]}
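
The listing above mixes the uploaded PDFs with two other entries: `text_vectors.parquet`, which appears to be the parquet file holding the embeddings, and `.DS_Store`, a macOS artifact. Below is a small sketch for checking the listing before calling `/create_index/`. The helper name `split_listing` is hypothetical, and treating `text_vectors.parquet` as the staging file is an assumption based on its name.

```python
import json

STAGING_FILE = "text_vectors.parquet"  # assumption: the staged embeddings file


def split_listing(list_files_body):
    """Split the /list_files/ JSON body into (documents, staging_present)."""
    names = json.loads(list_files_body)["files"]
    documents = [
        name for name in names
        if name != STAGING_FILE and not name.startswith(".")  # drop hidden files
    ]
    return documents, STAGING_FILE in names


# Shortened copy of the listing shown above.
body = '{"files":["ATT_SEC_AnnualReport_2022.pdf","text_vectors.parquet",".DS_Store"]}'
documents, staging_present = split_listing(body)
print(documents, staging_present)
```

If `staging_present` is false, there may be no embeddings left to index, for example because the data was erased or the index was already created.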

In [25]:
# uploading files stores their embeddings in a parquet file
# once enough files have been uploaded, create the index
# the parquet file is then deleted so that more files can be uploaded incrementally
!curl -X POST http://localhost:8003/create_index/

{"message":"Index creation successful"}

Now we can ask questions.

In [27]:
response = !curl -X POST "http://localhost:8003/ask/"  -H "Content-Type: application/json" -d '{"question": "Amazon 2024 forecast?"}'
print(json.loads(response[-1])['answer'])

In [28]:
response = !curl -X POST "http://localhost:8003/ragit/"  -H "Content-Type: application/json" -d '{"question": "Amazon 2024 forecast?"}'
print(json.loads(response[-1])['answer'])

In [14]:
!curl -X GET "https://jpbianchi-mr.hf.space/ping/"

{"answer":"3"}

In [12]:
!curl -X POST "https://jpbianchi-finrag.hf.space/ragit/"  -H "Content-Type: application/json" -d '{"question": "Does ATT have postpaid phone customers?"}'

{"answer":"Yes, AT&T does have postpaid phone customers. The company added 813,000 postpaid phone customers during the quarter, marking the strongest second quarter performance in a decade. Additionally, the average revenue per postpaid phone customer grew by 1.1% compared to the previous year, with further improvements expected in the second half."}

In [13]:
!curl -X POST "https://jpbianchi-finrag.hf.space/ragit/"  -H "Content-Type: application/json" -d '{"question": "what is Amazon loss?"}'

{"answer":"Amazon reported a pre-tax valuation loss of $2.3 billion in the fourth quarter, which was included in non-operating income from their common stock investment in Rivian Automotive. This loss was not related to Amazon's ongoing operations but rather stemmed from quarter-to-quarter fluctuations in Rivian's stock price."}