This notebook shows how to upload files, create or update the vector index, and then ask questions to the LLM.

In [16]:
import os, json
import requests
from rich import print
from rich.pretty import pprint

In [1]:
data_dir = "data"  # local folder with the assignment files (permanent, used for testing)
# Uploaded files are stored in a separate 'data' folder to keep track of them; that folder can be cleaned up.

In [18]:
filelist = [
    'ATT_SEC_AnnualReport_2022.pdf',
    'ATT_StockAnalystNote_Annual_20230125.pdf',
    'ATT_CompanyReport_Annual_20230126.pdf',
    'AMZN_MS_CompanyReport_Annual_20230203.pdf',
    'AMZN_Morning Star_StockAnalystNote_20230203.pdf',
    'AMZN_Moodys_CreditRating_2023.pdf',
    'AMZN_Morning Star_Transcript_Annual.pdf'
 ]

## Upload files

In [19]:
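def upload_files(data_dir, filelist, url, show_content=False):
    """POST each file in `filelist` (one name or a list of names) from `data_dir` to `url`."""
    if isinstance(filelist, str):  # accept a single filename as well as a list
        filelist = [filelist]
    for filename in filelist:
        file_path = os.path.join(data_dir, filename)
        if not os.path.isfile(file_path):
            # report missing files instead of skipping them silently
            print(f"Skipped {filename}: not found in {data_dir}")
            continue
        with open(file_path, 'rb') as f:
            files = {'file': (filename, f)}  # multipart form field named 'file'
            response = requests.post(url, files=files)
        print(f"Uploaded {filename} with response {response.status_code}")
        if show_content:
            pprint(response.json())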

In [None]:
# wipe everything: the files in 'data' and the vectorstore collection
!curl -X DELETE http://localhost:80/erase_data/
!curl -X DELETE http://localhost:80/empty_collection/

{"message":"All data has been erased"}

In [117]:
upload_url = 'http://localhost:80/upload/'

upload_files(data_dir, filelist[1], upload_url, show_content=True)

### All files at once

In [118]:
upload_files(data_dir, filelist, upload_url, show_content=False)

All files and their embeddings are now in the parquet file. We can push them into the vectorstore whenever we are ready.

In [114]:
# This does not affect the vectorstore, but it destroys the parquet file holding the embeddings,
# so make sure to create the index first.
!curl -X DELETE http://localhost:80/erase_data/

{"message":"No data to erase"}

In [119]:
!curl -X GET http://localhost:80/list_files/

{"files":["ATT_SEC_AnnualReport_2022.pdf","text_vectors.parquet","ATT_StockAnalystNote_Annual_20230125.pdf","ATT_CompanyReport_Annual_20230126.pdf","AMZN_MS_CompanyReport_Annual_20230203.pdf","AMZN_Morning Star_StockAnalystNote_20230203.pdf","AMZN_Moodys_CreditRating_2023.pdf","AMZN_Morning Star_Transcript_Annual.pdf"]}
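Because `erase_data` destroys the parquet file, it can help to confirm that embeddings are still waiting before creating the index. The sketch below inspects a `list_files` reply like the one above. The helper name `has_pending_embeddings` is hypothetical, and it assumes that the presence of `text_vectors.parquet` in the listing means the embeddings have not been indexed yet.

```python
import json

PARQUET_NAME = "text_vectors.parquet"  # name taken from the list_files output above


def has_pending_embeddings(list_files_reply: str) -> bool:
    """Return True if the listing still contains the parquet file of embeddings."""
    files = json.loads(list_files_reply).get("files", [])
    return PARQUET_NAME in files


# Shortened copy of the reply shown above
reply = '{"files":["ATT_SEC_AnnualReport_2022.pdf","text_vectors.parquet"]}'
print(has_pending_embeddings(reply))
```

If the check returns `False`, calling `create_index` would presumably have nothing new to index, so uploading again first may be the safer order.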

In [116]:
!curl -X DELETE http://localhost:80/empty_collection/

["message\": \"Collection erased!"]

In [120]:
# Uploading files creates their embeddings in a parquet file.
# Once enough files have been uploaded, create the index.
# The parquet file is then destroyed so that more files can be uploaded incrementally.
!curl -X POST http://localhost:80/create_index/

{"message":"Index creation successful"}

Now we can ask questions.

In [121]:
!curl -X POST http://localhost:80/ask/ -H "Content-Type: application/json" -d '{"question": "what is Amazon loss"}' 

{"answer":["Lastly, during the quarter, we increased our reserves for general product and automobile self-\ninsurance liabilities, driven by changes in our estimates about the cost of asserted and unasserted \nclaims, resulting in additional expense of $1.3 billion. This impact is primarily recorded in cost of \nsales on our income statement. As our business has grown quickly over the last several years, \nparticularly as we've built out our fulfillment and transportation network, and claim amounts have \nseen industry-wide inflation, we've continued to evaluate and adjust this reserve for both asserted \nclaims, as well as our estimate for unasserted claims.\nWe reported overall net income of $278 million in the fourth quarter. While we primarily focus our \ncomments on operating income, I'd point out that this net income includes a pre-tax valuation loss \nof $2.3 billion included in non-operating income from our common stock investment in Rivian \nAutomotive. As we've noted in recen

In [125]:
!curl -X POST http://localhost:80/ask/ -H "Content-Type: application/json" -d '{"question": "Is ATT financially healthy?"}' 

{"answer":["In addition, AT&T has only begrudgingly invested to \nexpand its fiber optic network in the past. New CEO John Stankey has increased investment to retain customers \nand has made fiber construction a top priority, which should improve AT&T’s position but will also dent cash flow \nover at least the next couple of years.  \nAT&T has placed a priority on debt reduction since the Time Warner merger closed, using asset sales as a part of \nthis effort. Not all these sales have made strategic sense, in our view. For example, the sale of its wireless assets \nin Puerto Rico seemed odd, given the territory’s strong ties to the U.S. and AT&T’s presence elsewhere in Latin \nAmerica. Management has also been less than forthright, in our view, concerning the debt load, using preferred \nshares, receivables securitization, and vendor financing to cloud its financial picture.  \nShareholders have suffered because of AT&T’s choices. The stock returned only 2% annually over the 20 years \

In [123]:
!curl -X POST http://localhost:80/ask/ -H "Content-Type: application/json" -d '{"question": "Is Google financially healthy?"}' 

{"answer":["Antitrust, data \nprivacy, and section 230 have been repeatedly invoked.\nFrom an environmental, social, and governance perspective, data breaches and service outages are a concern for \nany type of cloud service provider. As a retailer, Amazon has personal information for hundreds of millions of \nconsumers around the world, while AWS hosts proprietary mission-critical data for enterprises.\nFinancial Strength  Dan Romanoff, Senior Equity Analyst, 3 Feb 2023\nWe believe Amazon is financially sound. Revenue is growing rapidly, margins are expanding, the company has \nunrivaled scale, and the balance sheet is in great shape. In our view, the marketplace will remain attractive to \nthird-party sellers, as Prime continues to tightly weave consumers to Amazon. We also see AWS and advertising \ndriving overall corporate growth and continued margin expansion.\nAs of Dec. 31, 2022, Amazon had $70.0 billion in cash and marketable securities, offset by $67.2 billion in debt. \nWe al
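The curl calls above can also be issued from Python. This is a minimal sketch, not the service's own client: the helper names `build_question_request` and `ask` are hypothetical, and it uses the standard library's `urllib` so that it runs without extra dependencies (`requests.post(url, json=...)` would work just as well). The `endpoint` argument selects the route, and the reply is assumed to carry an `answer` field as in the outputs above.

```python
import json
from urllib import request

BASE_URL = "http://localhost:80"  # same local server as the curl calls


def build_question_request(question, endpoint="ask", base_url=BASE_URL):
    """Build (but do not send) the POST request for a question."""
    body = json.dumps({"question": question}).encode("utf-8")
    return request.Request(
        f"{base_url}/{endpoint}/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, endpoint="ask", base_url=BASE_URL, timeout=60):
    """Send the question and return the 'answer' field of the JSON reply."""
    req = build_question_request(question, endpoint, base_url)
    with request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["answer"]


# Example (requires the server to be running):
# ask("Is ATT financially healthy?")
```

Separating request construction from sending keeps the payload easy to inspect without a running server.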

In [134]:
!curl -X POST http://localhost:80/ragit/ -H "Content-Type: application/json" -d '{"question": "Is Google financially healthy?"}' 

{"answer":"The context provided does not contain specific information regarding Google's financial health."}

In [133]:
!curl -X POST http://localhost:80/ragit/ -H "Content-Type: application/json" -d '{"question": "what is Amazon loss"}' 

{"answer":"Amazon reported a pre-tax valuation loss of $2.3 billion included in non-operating income from their common stock investment in Rivian Automotive."}

In [136]:
!curl -X POST http://localhost:80/ragit/ -H "Content-Type: application/json" -d '{"question": "Does ATT have postpaid phone customers?"}' 

{"answer":"Yes, AT&T does have postpaid phone customers. The company added 813,000 postpaid phone customers during the quarter, marking the strongest second quarter performance in a decade. Additionally, the average revenue per postpaid phone customer grew by 1.1% compared to the previous year, indicating a positive trend in this customer segment."}

In [139]:
!curl -X POST http://localhost:80/ragit/ -H "Content-Type: application/json" -d '{"question": "Does Google have postpaid phone customers?"}' 

{"answer":"Yes, AT&T has postpaid phone customers according to the information provided in the context. The data shows that AT&T added a specific number of postpaid phone customers during a quarter, indicating that AT&T offers postpaid phone services."}

This error arises because the retrieved passages do not always mention 'ATT' explicitly, so the LLM cannot always tell which company a passage describes. Google sells phones too, so the mix-up is easy to make.
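One possible mitigation, sketched here under assumptions, is to label each chunk with the company it came from before it is embedded or passed to the LLM. The helpers `company_from_filename` and `tag_chunk` are hypothetical and rely on the `COMPANY_...` naming convention seen in `filelist`; the server's actual chunking pipeline is not shown in this notebook.

```python
def company_from_filename(filename: str) -> str:
    """Return the prefix before the first underscore, e.g. 'ATT' or 'AMZN'."""
    return filename.split("_", 1)[0]


def tag_chunk(chunk: str, filename: str) -> str:
    """Prefix a text chunk with the company label derived from its source file."""
    return f"[{company_from_filename(filename)}] {chunk}"


print(tag_chunk("added 813,000 postpaid phone customers",
                "ATT_SEC_AnnualReport_2022.pdf"))
```

With such a label in every retrieved passage, the LLM would have an explicit cue that a passage concerns AT&T rather than the company named in the question.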

In [141]:
!curl -X POST "https://jpbianchi-finrag.hf.space/ragit/"  -H "Content-Type: application/json" -d '{"question": "Does ATT have postpaid phone customers?"}'

Your space is in error, check its status on hf.co

In [2]:
!curl -X GET "https://jpbianchi-finrag.hf.space/ping/"

Your space is in error, check its status on hf.co
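The hosted Space replies with plain text rather than JSON when it is down, so a client that blindly parses JSON would raise an exception. A small sketch of a tolerant parser follows; `parse_reply` is a hypothetical helper, not part of the service.

```python
import json


def parse_reply(text: str):
    """Return (ok, payload): the parsed JSON on success, the raw text otherwise."""
    try:
        return True, json.loads(text)
    except json.JSONDecodeError:
        return False, text


# Both strings are copied from outputs shown in this notebook
print(parse_reply('{"message":"Index creation successful"}'))
print(parse_reply("Your space is in error, check its status on hf.co"))
```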