Optimizing Dataset Loading Time in Spaces
Hello HF Community,
I am reaching out for insights or solutions regarding an issue I have with dataset loading time in Hugging Face Spaces.
Context:
- Goal: I am working on a space designed for user corrections where real-time dataset updates are crucial.
- Issue: Currently, there is a significant delay (about 3 minutes) every time I enter the space, primarily due to the time taken to load the dataset.
- Current Setup: The dataset updates are managed using the parquet scheduler mentioned here.
Question:
- Is there a way to minimize or avoid this delay in loading the dataset whenever the space is accessed?
- What are the best practices or alternative approaches for managing real-time data updates in Hugging Face Spaces without experiencing such delays?
First, you can remove the `force_redownload` parameter; this way it will reuse cached data (unless the remote dataset changed, in which case it redownloads it). That alone should help, depending on how frequently the dataset is updated.
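For example, assuming the Space loads data with `datasets.load_dataset` (the dataset name below is just a placeholder):

```python
from datasets import load_dataset

# Forces a full redownload on every run: slow, avoid this in the Space
# ds = load_dataset("your-org/your-dataset", download_mode="force_redownload")

# Default behavior: reuse the local cache, and only fetch again
# when the remote dataset actually changed
ds = load_dataset("your-org/your-dataset")
```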
@Ali-C137 Please invalidate the token used in this Space as soon as possible. You can do that on this page: https://huggingface.co/settings/tokens by clicking on "Manage" > "Invalidate and refresh".
The correct way of setting a token on a Space is to set a secret called `HF_TOKEN` in the Space's Settings tab.
EDIT: we invalidated the token for you. You'll need to create a new one and use a secret for the Space. Sorry for being so quick but anyone finding it would have been capable of starting GPUs in your name (hence on your credit card).
EDIT 2: once passed as a `HF_TOKEN` secret, you won't need to explicitly log in at the beginning of your script (so no need for `login(token=access_token_write)`).
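To illustrate (a minimal sketch, not your actual code): once the secret is set, `huggingface_hub` picks up the `HF_TOKEN` environment variable automatically, so authenticated calls work without any `login()`:

```python
from huggingface_hub import HfApi

# No login() call needed: the HF_TOKEN secret defined in the Space settings
# is exposed as an environment variable and used automatically.
api = HfApi()
print(api.whoami())  # authenticated as the token's owner
```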
Thanks @Wauplin for taking care of it, but it was already invalidated (I think), I just forgot to update the code since I had so many bugs at first and later never paid attention to it again.
@Ali-C137 I quickly reviewed the Space code and here are a few tips:
- I think you already know, but by design Streamlit reruns the whole script each time you interact with the page. This is ok for light-weight interactions, but it becomes unusable for heavy tasks like redownloading a dataset. In that sense Gradio can be better suited, because code can be run once at startup time and then only a subset of the code is run on each user interaction. In Streamlit you could use a global boolean variable to run a certain action only once (not an expert here though); see the first sketch after this list.
- In general I would separate 2 things:
  - you want to persist data to a dataset on the Hub => this is good to avoid losing data labelled by one of your users. Using a `ParquetScheduler` is correct here, but please remember that it pushes the data to the Hub only once every 5 minutes, so the last 5 minutes of work can be lost in case of a full Space restart.
    - however, I don't think `ParquetScheduler` plays nicely with Streamlit. Under the hood it starts a separate thread that takes care of the saving, but I think Streamlit might shut down this thread each time the user interacts with the page. Also, I'm not sure whether the thread is persisted when the user quits the Space. These uncertainties are mainly due to how Streamlit works internally, and I honestly don't know for sure.
  - you also want to have real-time data for your users. What I would do is separate the real-time database (the data you use to decide what to show to your user) from the saved dataset. The first time you load the dataset, you download everything. Then each time a user saves a label, you need to save it twice: once in your local temporary dataset (it can be a simple pandas DataFrame) and once in your `ParquetScheduler`, which will take care of persisting it (see the second sketch after this list).
- another possibility is to use Persistent Storage to save the data locally in your Space. The data will be persisted between each Space restart, meaning you don't even need to export it to a remote dataset (or at least not in real time); see the third sketch below.
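Regarding the Streamlit rerun point above, here is a minimal sketch of the "run once" pattern using Streamlit's caching (I'm not a Streamlit expert, and the dataset name is a placeholder):

```python
import streamlit as st
from datasets import load_dataset

@st.cache_resource  # executed once, not on every Streamlit rerun
def get_dataset():
    # Heavy work lives here so page interactions don't re-trigger the download
    return load_dataset("your-org/your-dataset", split="train").to_pandas()

df = get_dataset()
st.write(f"{len(df)} rows loaded")
```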
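For the "save it twice" idea, a rough sketch is below. It uses a plain `CommitScheduler` writing JSON lines, which is simpler to show than the parquet version; with the `ParquetScheduler` from the guide the pattern is the same. The repo id and column names are placeholders:

```python
import json
from pathlib import Path

import pandas as pd
from huggingface_hub import CommitScheduler

# Real-time, in-memory view used to decide what to show to the user
local_df = pd.DataFrame(columns=["example_id", "correction"])

# Background scheduler: commits the watched folder to a Hub dataset every 5 minutes
labels_file = Path("labels") / "labels.jsonl"
labels_file.parent.mkdir(exist_ok=True)
scheduler = CommitScheduler(
    repo_id="your-org/your-dataset",
    repo_type="dataset",
    folder_path="labels",
    every=5,
)

def save_label(row: dict) -> None:
    global local_df
    # 1) update the local real-time view immediately
    local_df = pd.concat([local_df, pd.DataFrame([row])], ignore_index=True)
    # 2) append to the watched file; the scheduler persists it in the background
    with scheduler.lock:
        with labels_file.open("a") as f:
            f.write(json.dumps(row) + "\n")
```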
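And if you enable Persistent Storage, the disk is mounted under `/data`, so keeping the labels across restarts can be as simple as this (sketch, file name is a placeholder):

```python
from pathlib import Path

import pandas as pd

DATA_FILE = Path("/data/labels.parquet")  # /data is the Space's persistent disk

def load_labels() -> pd.DataFrame:
    if DATA_FILE.exists():
        return pd.read_parquet(DATA_FILE)
    return pd.DataFrame(columns=["example_id", "correction"])

def save_labels(df: pd.DataFrame) -> None:
    df.to_parquet(DATA_FILE)
```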
Hope those few points will help you build your Space!
Thanks a lot @Wauplin, I will go through these remarks once I'm home! We are currently exploring the transition to Gradio or maybe one of the Argilla templates; the solution I suggested to the team before was to have the dataset local in the Space, and apparently we will go back to implement this solution.
Good! Let me know how it goes :)
> was to have the dataset local in the Space
This is perfectly valid, but make sure you have Persistent Storage enabled, otherwise you'll lose everything each time the Space restarts.