Orchestrating Small Language Models (SLM) using JavaScript and the Hugging Face Inference API
Hello!
I'll show you how I used the Hugging Face Inference API, a Space, Docker, and less than 500 lines of JavaScript code to orchestrate several small LLMs, having them generate comments about the user's attempts in a small interactive neural network simulator.
You can see the result in this IATalking blog post. Move the slider and try to reduce the error; you'll see that an AI generates comments based on your attempt (and your attempt history). These comments come from models like Phi3, Llama, Mistral, etc., all small, with a few billion parameters.
Whenever a new request is sent, the code I will present here chooses one of these models. If a model starts to fail, its chances of being chosen decrease the next time a request is made. This way, I get a sort of "high availability" of LLMs: different models are used, the quality of their responses is taken into account, and the temperature is adjusted so that even a single LLM is a bit more creative from one request to the next.
This post will show the details of how I did this, explaining the files involved.
The Space
You will find the code for this project in the Space: Jay Trainer. The API responds at this link: https://iatalking-jaytrainer.hf.space/error?error=123
I named it Jay Trainer as a reference to Jay Alammar (https://jalammar.github.io/), who wrote the original post and came up with the idea of a simple neural network simulator to facilitate learning. His post is amazing: A Visual and Interactive Guide to the Basics of Neural Networks.
Looking at the files, you will notice the following:
- A Dockerfile
- docker-compose.yml
- server.js
- README.md
This is a Docker Space, and I chose Docker for its simplicity. Hugging Face allows you to create services using Docker, which is sensational because it opens up endless possibilities: you can build an API in your preferred language and host it in a Space!
If you look at the Dockerfile, you will see a typical Node.js server:
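I won't reproduce it line by line here, but a minimal sketch looks something like this (base image, port, and file layout are assumptions; the actual Dockerfile in the Space may differ):

```dockerfile
# Minimal Node.js image; the exact base version is illustrative
FROM node:20-slim

WORKDIR /app

# Install dependencies first to take advantage of Docker layer caching
COPY package*.json ./
RUN npm install --omit=dev

# Copy the application code (server.js and friends)
COPY . .

# Hugging Face Docker Spaces route traffic to the port declared in the README (7860 by default)
EXPOSE 7860

CMD ["node", "server.js"]
```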
The docker-compose.yml is just to facilitate my local tests. Before sending anything to Hugging Face, I run the same code with a simple `docker compose up`. This saves me the trouble of typing "docker -p etc. etc."
The README.md contains only the requirements demanded by Hugging Face, and the star is `server.js`. This is where all the logic and endpoint code lives!
Provided API
The API, which is implemented in server.js, is an Express application. The following endpoints are created:
- `/error`
  This will be the most used: you pass the error value and the attempt history, and it returns a text!
- `/`
  This is the default endpoint. It just displays a simple message, indicating that the service is up!
- `/models`
  This is a debug endpoint. With it, I can see the history of the involved LLMs and execution statistics. This allows me to know which models are generating more errors and which are being executed more, along with some other information, just for monitoring and debugging!
- `/test`
  This is an endpoint just to test whether Express is really working and responding.
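As a rough skeleton of how these routes are wired in Express (handler bodies, messages, and the port are simplified assumptions, not the actual server.js):

```javascript
const express = require('express');
const app = express();

// Default endpoint: just shows that the service is up
app.get('/', (req, res) => res.send('Jay Trainer is up!'));

// Main endpoint: receives the current error and the attempt history
// (the real handler validates the parameters and calls the chosen LLM)
app.get('/error', (req, res) => {
  const { error, tentativas } = req.query;
  res.send(`TODO: generate a comment for error=${error}, tentativas=${tentativas}`);
});

// Debug endpoint: exposes the per-model statistics
app.get('/models', (req, res) => res.json({ models: 'stats go here' }));

// Sanity check that Express itself is responding
app.get('/test', (req, res) => res.send('ok'));

// Docker Spaces route traffic to the port declared in the README (7860 by default)
app.listen(7860);
```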
Server initialization
When starting, the server makes some configurations, and the main one is defining an object containing the list of models I want to use:
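A minimal sketch of what that object might look like (the model names and chat templates below are illustrative; the actual list in the Space may differ):

```javascript
// Each key is an internal "id"; name is the Hugging Face ORG/MODEL id;
// prompt() formats a text into that model's expected chat template.
const MODELS = {
  phi3: {
    name: 'microsoft/Phi-3-mini-4k-instruct',
    prompt: (text) => `<|user|>\n${text}<|end|>\n<|assistant|>\n`
  },
  llama: {
    name: 'meta-llama/Meta-Llama-3-8B-Instruct',
    prompt: (text) =>
      `<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n${text}<|eot_id|>` +
      `<|start_header_id|>assistant<|end_header_id|>\n\n`
  },
  mistral: {
    name: 'mistralai/Mistral-7B-Instruct-v0.3',
    prompt: (text) => `<s>[INST] ${text} [/INST]`
  }
};
```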
The global constant `MODELS` contains my models! Each key is an "id" of an LLM that will be used. The `id` here is something internal that I defined for this code, not a Hugging Face ID. It's just a kind of "alias" to facilitate identification.
So, for each of them, I need to define the model name and the prompt template. The name is the unique, exclusive name on Hugging Face, in the ORG/MODEL format. This is the name you find in the Model Card.
`prompt` is a method that I call whenever I need to generate a prompt for this model. Each model can have a different prompt format, so I cannot use the same prompt for all of them. Through this function, I can create a dynamic mechanism to generate prompts using the same call: I just call the `prompt()` method and pass the text I want.
I have never studied the Hugging Face Transformers code, but it is very likely that this follows the same idea behind the apply_chat_template methods. Here, it is just a very simple version of the process!
Therefore, with just these two members, I can easily add or remove LLMs! I chose to leave the list hard-coded to keep control and simplify things, but I could have moved it to an external file or API to be able to include or remove LLMs at runtime. As this is a simple PoC, I preferred to keep it simple for now.
Lastly, it is important to remember that the code also validates the HF_TOKEN environment variable, which holds the Hugging Face token. This token is configured as a secret in the Space. In local tests, I generate a test token and pass it to my docker compose up as an environment variable in my shell. Another convenience that having a ready docker-compose brings me!
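Locally, that check is something along these lines (a sketch; the exact message does not matter):

```javascript
// Fail fast if the Hugging Face token was not provided
const HF_TOKEN = process.env.HF_TOKEN;
if (!HF_TOKEN) {
  console.error('HF_TOKEN is missing. Configure it as a Space secret or export it in your shell.');
  process.exit(1);
}
```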
Model Initialization
Still on server startup, for each of the models in `MODELS`, I generate a third attribute: `stats`. Here we keep some LLM execution metrics, such as the total number of executions, the total number of errors, and the percentage chance of the model generating errors.
The description of the properties is:
- `total`
  Total executions.
- `errors`
  A number indicating how much of the total was of low quality.
- `share`
  Error share. This is calculated based on the total errors.
- `errop`
  % of errors: errors/total.
- `pok`
  % of okay (the opposite of errop): 1 - errop.
And, in this part, I create an array with all the LLMs, which is stored in the `ModelList` variable.
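A sketch of that initialization, reusing the `MODELS` constant from above (the property names follow the list above; initial values and the `share` comment are illustrative):

```javascript
// Attach an empty stats object to each model and build a flat array of them
const ModelList = Object.keys(MODELS).map((id) => {
  const model = MODELS[id];
  model.id = id;
  model.stats = {
    total: 0,   // total executions
    errors: 0,  // how much of the total was low quality
    share: 0,   // selection share, recomputed before each choice
    errop: 0,   // errors / total
    pok: 1      // 1 - errop
  };
  return model;
});
```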
The rest of the code is dedicated to the Express endpoints and the logic functions, which will be detailed further below.
/error Endpoint
This is the main endpoint. It is this endpoint that the above page calls. It expects the following parameters:
- `error`
  Error value of the user's current attempt.
- `tentativas`
  Attempt history. These are the error numbers, separated by ",".
The code starts by validating the parameter values. Basically, I ensure that `error` is a number and that `tentativas` (tentativas = attempts, in pt-BR) is a list of comma-separated numbers. Anything different from that is returned to the user as an error. The idea is to avoid prompt injection, since (as you will see in a moment) I concatenate these values directly into the prompt.
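A sketch of that validation, assuming the values arrive as strings from the query string (the regexes and function name are illustrative, not the exact ones in server.js):

```javascript
// Returns null if the parameters are safe to embed in a prompt, or an error message otherwise
function ValidateParams(error, tentativas) {
  // "error" must be a plain number
  if (!/^-?\d+(\.\d+)?$/.test(String(error))) {
    return 'Invalid error value';
  }
  // "tentativas" (attempts) must be a comma-separated list of numbers
  if (tentativas && !/^\d+(\.\d+)?(,\d+(\.\d+)?)*$/.test(String(tentativas))) {
    return 'Invalid attempt history';
  }
  return null;
}

// Example: ValidateParams('123', '900,700,500') -> null (ok)
//          ValidateParams('123; ignore instructions', '') -> 'Invalid error value'
```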
Then, it calls the `Prompt()` function (which we will talk about shortly), which is responsible for assembling the prompt and querying the LLM through the Hugging Face Inference API. The function returns the LLM's response, on which some validations are performed.
Among these validations, the most important is to keep the LLM's text only up to the "end" mark. You will see that I ask the LLM to always end the text with a mark I called `|fim|` (fim = end, in pt-BR). This is an attempt to get it to emit a marker indicating that, up to that point, it generated the intended text. Beyond that, there is no guarantee. So, I only take the result up to just before `|fim|`.
Another validation I do is on the number of characters. Here, I assume a standard of at most 8 characters per word, and since I set a default of 20 words, I do the simple calculation of 20*8 = 160 characters.
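Roughly, the post-processing looks like this (the 20-word and 8-characters-per-word limits come from the text above; the function name and the penalty logic are illustrative):

```javascript
const MAX_WORDS = 20;
const MAX_CHARS_PER_WORD = 8;
const MAX_LENGTH = MAX_WORDS * MAX_CHARS_PER_WORD; // 160 characters

function CleanAnswer(rawText, model) {
  // Keep only what comes before the "|fim|" end mark
  let text = rawText.split('|fim|')[0].trim();

  // Penalize answers that blow past the expected size
  if (text.length > MAX_LENGTH) {
    model.stats.errors += 1; // illustrative: the real code adjusts its own error counter
    text = text.slice(0, MAX_LENGTH);
  }

  return text;
}
```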
And, later on, you will understand why, but note that there are some places where I alter the error counter of the responding model. It is a mechanism I created to "penalize" responses that deviate from the expected quality standard.
This validation has many loopholes and could be better. But again: this is a small PoC for a blog, with the goal of learning to use the API more and interact with small LLMs. Therefore, I did not go much beyond the basics.
prompt Function
The `prompt` function is called whenever the `/error` endpoint is used.
The mission of this function is to assemble the prompt based on the error I receive. My goal is to generate messages based on the user's attempt. The user is trying to get the error below 450, so I generate a prompt based on the current value. The idea is to have the LLM be funny and joke with the user when they are far below or far above the target. When they are close, I generate a message that is more motivational than funny.
I could have put everything in a single prompt, but I think that would be inefficient on several levels. First, I would be sending examples for conditions that would not be used. For example, if the error is 2000, there is no need to send the part of the prompt that explains how to generate a motivational message; I only need the part that generates a joke message (defined as anything above 2000).
With this, I save the context of my LLM, which is small and has limitations! Another advantage is that it becomes more accurate: with fewer tokens in the middle of the examples, the probability of it generating tokens similar to my examples is much higher. The difference in the responses was huge when I did this. The model started generating text much closer to the examples for the given error range than when I included everything. Here, pure prompt engineering helped me extract the best from the LLM.
And, as I mentioned, besides the length limit (which I left fixed at 20 words, for now), I also ask it to end with the "|fim|" mark. Adding this mark further filtered out the cases where it hallucinated after the first 20 words: it generally starts hallucinating after the "|fim|", and since I only take what comes before this mark, it considerably reduced the cases of unrelated messages.
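A sketch of that range-based prompt assembly (the thresholds 450 and 2000 come from the text; the function name and the wording of the instructions are illustrative, and the real prompts are presumably in Portuguese):

```javascript
function BuildPrompt(error, tentativas) {
  // Shared instructions: at most 20 words, always finish with the |fim| mark
  const base =
    `You are a funny trainer. Answer in at most 20 words and end with |fim|. ` +
    `Current error: ${error}. Previous attempts: ${tentativas}. `;

  if (error > 2000) {
    // Far from the target: joke about how far off the attempt is
    return base + 'Make a short joke about how high the error is.';
  }
  if (error > 450) {
    // Close to the target: be motivational rather than funny
    return base + 'Write a short motivational message, the user is getting close.';
  }
  // Goal reached (below 450)
  return base + 'Congratulate the user for reaching the goal.';
}
```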
And then, once my prompt is ready, I call the `GetModelAnswer` function, which takes care of the technical part, that is, choosing the best LLM and managing what happens when an LLM does not respond!
GetModelAnswer Function
This function is responsible for sending my prompt to one of the LLMs in the `MODELS` list.
The idea is simple: if the user did not explicitly choose a model in the parameter, then I will try to choose the best model and invoke it. If it fails, I try the next one until I run out of options!
We will soon look at `UpdateProbs`, which calculates the best model to call. In the loop, which repeats at most as many times as there are models in the list (the `ModelList` array we loaded during initialization, as shown above), it obtains the object that represents the current model.
It assembles the URL of the Hugging Face Inference API (which could be an environment variable, in case it changes one day). Remember the `name` we defined in the `MODELS` constant? This is where we use it, so it has to be exactly the same name as on Hugging Face.
Then, it will assemble the data to send to the Hugging Face API:
The Inference API has several options. In our case, we need to use `inputs`, which is our prompt. Note that I am calling the `prompt` method of the current model, which assembles the specific prompt for this model (remember, we talked about it above!). Thanks to this method, I can format the prompt according to the needs of each LLM.
The `parameters` and `options` keys are configurations that I left hard-coded: a maximum of 70 tokens and a temperature of 50%, to avoid too much hallucination while still having variation between calls.
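A sketch of that payload (the parameter names follow the Inference API's text-generation schema; the exact options used in the Space may differ):

```javascript
// Builds the request URL and body for a given model and prompt text
function BuildRequest(model, promptText) {
  return {
    url: `https://api-inference.huggingface.co/models/${model.name}`,
    body: {
      inputs: model.prompt(promptText),   // prompt formatted for this specific LLM
      parameters: {
        max_new_tokens: 70,               // hard-coded token limit
        temperature: 0.5                  // 50%: some variation, not too much hallucination
      },
      options: {
        wait_for_model: true              // illustrative: wait if the model is still loading
      }
    }
  };
}
```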
Finally, a simple `fetch` sends the request to the Hugging Face API and waits for the response! Here, it would also be worth adding some handling, such as a timeout, but I chose to keep it simple.
Note that at this point, I also increment the total request counter in this model's statistics. This will help generate accuracy probabilities.
Additionally, I measure the time, in milliseconds, that the call takes to run. I will use this shortly to penalize or reward the model.
If the response is an error (different from HTTP 200), we increment the error counter and try the next model in the list. I will repeat this process until I receive an HTTP 200 or run out of models to try. The entire logic within this IF is just for that: to get the next model in the list and repeat the whole loop!
Further down, I perform some quality checks: if the response time was greater than 2.5 seconds, I slightly increment the error counter; if it was less than 900 ms, I slightly decrement it. This way, I can penalize or reward the model based on response time, and you will see that this affects which model gets chosen most. I could add more checks here: just manipulate the error counter of each model, and it will be reflected in the algorithm that determines the best model.
And, if it returns an HTTP 200, it means I have the response, and then I return immediately, which ends my loop!
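Putting those pieces together, a simplified version of the whole loop might look like this, reusing the names from the earlier sketches (penalty sizes and the fallback order are illustrative):

```javascript
async function GetModelAnswer(promptText) {
  // Start from the model most likely to answer well (see UpdateProbs below)
  let model = UpdateProbs();

  for (let i = 0; i < ModelList.length; i++) {
    model.stats.total++;                  // count this attempt

    const started = Date.now();
    const response = await fetch(`https://api-inference.huggingface.co/models/${model.name}`, {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${HF_TOKEN}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        inputs: model.prompt(promptText),
        parameters: { max_new_tokens: 70, temperature: 0.5 }
      })
    });
    const elapsed = Date.now() - started;

    if (!response.ok) {
      // Failed call: count the error and move on to the next model in the list
      model.stats.errors++;
      model = ModelList[(ModelList.indexOf(model) + 1) % ModelList.length];
      continue;
    }

    // Reward fast answers, penalize slow ones (thresholds from the text; penalty size is illustrative)
    if (elapsed > 2500) model.stats.errors += 0.1;
    else if (elapsed < 900) model.stats.errors = Math.max(0, model.stats.errors - 0.1);

    const data = await response.json();
    return { model: model.id, text: data[0].generated_text };
  }

  return null;                            // every model failed
}
```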
UpdateProbs Function
This is the function that contains the logic to determine the best LLM to use. The idea is to choose the model that is most likely to produce a quality response. The chances are controlled by each model's `stats.erros` property, which is relative to `stats.total`, the total number of attempts: if `stats.erros` equals `stats.total`, the chance of the model producing a low-quality response (or failing) is 100%.
To decide which of the models is the best, we calculate the total errors and divide a percentage among all the models. For example, let's consider 3 LLMs: Gemini, Phi3, and Llama. Let's assume the attempts and errors were as follows:
- Gemini, errors = 2, attempts = 2
- Llama, errors = 0, attempts = 2
- Phi3, errors = 1, attempts = 2
Note that Gemini has a 0% accuracy (2 errors in 2 attempts), Llama has a 100% accuracy, and Phi3 has a 50% accuracy.
Therefore, if you distribute the total accuracy, you will have the following:
- Gemini: 0% ( 0/(0+100+50) )
- Phi3: 33% ( 50/(0+100+50) )
- Llama: 66% ( 100/(0+100+50) )
Choose a random number between 0% and 100%. Pick the first model that covers the range of the random number. For example, if you get 30%, you can choose Phi3 because it covers the range from 0 to 33%. If you get 35%, only Llama would be suitable because it covers the range from 33 to 100%.
With this, we manage to prioritize the models that hit the most. If 2 LLMs end up with equal chances, a small trick ensures one of them still gets chosen.
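A sketch of that selection, following the example above (property semantics and the tie-breaking detail are illustrative):

```javascript
function UpdateProbs() {
  // Refresh each model's error rate (errop), accuracy (pok) and the total accuracy
  let totalPok = 0;
  for (const m of ModelList) {
    m.stats.errop = m.stats.total ? m.stats.errors / m.stats.total : 0;
    m.stats.pok = Math.max(0, 1 - m.stats.errop);
    totalPok += m.stats.pok;
  }

  // If every model has failed every time, fall back to a uniform random choice
  if (totalPok === 0) {
    return ModelList[Math.floor(Math.random() * ModelList.length)];
  }

  // Each model covers a slice of [0, 1] proportional to its accuracy;
  // draw a random number and return the first model whose cumulative share covers it
  const draw = Math.random();
  let cumulative = 0;
  for (const m of ModelList) {
    m.stats.share = m.stats.pok / totalPok;
    cumulative += m.stats.share;
    if (draw < cumulative) return m;
  }
  return ModelList[ModelList.length - 1]; // floating-point edge case
}
```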
So, we are able to randomly choose the best model based on its performance, which in this case is measured by a few simple checks. But the code is ready to accommodate more metrics: just increment the error counter when the additional error conditions are met.
Other Endpoints
The other endpoints, like `/models`, are just for debugging, so I believe no additional explanation is needed. You can check them directly in the code or access them in real time.
I hope you enjoyed it and that this implementation can help you generate more ideas for using several LLMs on Hugging Face!!!
If you have any questions, just reach out!
X: @IATalking
LinkedIn: @rodrigoribeirogomes