Awww yes!
So much yes! Finally 128K context in the browser!
- Is any part of this WebGPU accelerated in V2 or V3?
- The file is surprisingly small at 1.45 GB. Since the original Microsoft file is 2.5 GB, does that mean it's heavily quantized? (https://huggingface.co/microsoft/Phi-3-mini-128k-instruct-onnx/tree/main/cpu_and_mobile/cpu-int4-rtn-block-32)
- Isn't implementation rather complicated? For example, llama.cpp still doesn't have support for this model. Or does the ONNX format solve those issues somehow? The llama.cpp issue on GitHub: https://github.com/ggerganov/llama.cpp/issues/6849
it's possible to overdo it on the "yes".
Unsupported model type: phi3
I guess I need to be patient a little bit longer.
Hey! Indeed, we're still working on this, and we'll make an announcement once it's working 100%! To answer your questions:
1) Yes, this will be part of v3 w/ WebGPU acceleration
2) The model is split into two parts: ~830 MB + 1.45 GB. Both need to be below 2 GB to be cacheable.
3) We're relying on MSFT's official ONNX export/implementation, which simplifies things a lot for us! :)
Super cool on all fronts. Thanks for explaining!
If I can help test, let me know.
I couldn't resist and tried to get it running with the latest version of v3, but only got 404s no matter which dtype I tried.
What will be the correct incantation? Is this close?
const generator = await pipeline('text-generation', 'Xenova/Phi-3-mini-128k-instruct', {
  dtype: 'q4', // fp32, fp16, q8, int8, uint8, q4, bnb4
  progress_callback: (data) => {
    if (data.status !== 'progress') return;
    setLoadProgress(data);
  },
});
You need to use this revision: https://huggingface.co/Xenova/Phi-3-mini-128k-instruct/discussions/3, by setting revision: 'refs/pr/3'. You also need to set use_external_data_format: true, which was introduced by the latest commits.
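In other words, your earlier call would become something like this (untested sketch, just your snippet with the two options added):

const generator = await pipeline('text-generation', 'Xenova/Phi-3-mini-128k-instruct', {
  dtype: 'q4',
  revision: 'refs/pr/3', // the weights currently live in this PR
  use_external_data_format: true, // the model is split across .onnx + .onnx_data files
});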
I can share some example code in a few hours, but I’m still getting erroneous output. Hopefully will get it working soon :)
Here's some HIGHLY EXPERIMENTAL WORK IN PROGRESS code:
import { env, AutoModelForCausalLM, AutoTokenizer } from '@xenova/transformers';

// disable proxying for now (much slower)
env.backends.onnx.wasm.proxy = false;

const model_id = 'Xenova/Phi-3-mini-128k-instruct';
const tokenizer = await AutoTokenizer.from_pretrained(model_id, {
  legacy: true, // TODO: update config
});

const prompt = `<|user|>
Tell me a joke<|end|>
<|assistant|>
`;
const inputs = tokenizer(prompt);

const model = await AutoModelForCausalLM.from_pretrained(model_id, {
  dtype: 'q4',
  // device: 'webgpu', // NOTE: webgpu produces incorrect results
  use_external_data_format: true,
  revision: 'refs/pr/3',
});

// { // warm up
//   const outputs = await model.generate({ ...inputs, max_new_tokens: 1 });
// }

{ // run + time execution
  const start = performance.now();
  const outputs = await model.generate({ ...inputs, max_new_tokens: 5 }); // TODO: increase max new tokens
  const end = performance.now();
  console.log(tokenizer.batch_decode(outputs, { skip_special_tokens: false }));
  console.log('Execution Time:', end - start);
}
NOTE: to get it working, you need to use the latest commit of transformers.js v3.
And just for now, you also need to replace this in src/models.js:
- const dtype = 'float32';
- const empty = [];
+ const dtype = 'float16';
+ const empty = new Uint16Array();
(Will not be necessary once we update the model).
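(Some context on the Uint16Array, as I understand it: JavaScript has no Float16Array, so onnxruntime-web carries fp16 data as raw 16-bit words. An illustrative construction, with made-up dimensions:)

import { Tensor } from 'onnxruntime-web';

// An empty fp16 past-key-values tensor: the Uint16Array holds the raw
// bits of half-precision floats, since JS has no Float16Array type.
const empty = new Tensor('float16', new Uint16Array(0), [1, 32, 0, 96]);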
WASM produces correct results, while WebGPU does not. Will continue to investigate.
Figured out the problem: the latest version of ONNXRuntime-web hadn't yet been published to NPM.
Here's a demo of phi-3-mini-128k-instruct running at ~20 tokens per second on an RTX 2080:
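(If you want to reproduce the tokens-per-second figure from the timing block above — a sketch, assuming generate returns a [batch, sequence] tensor as in the earlier snippet:)

const newTokens = outputs.dims[1] - inputs.input_ids.dims[1]; // count generated tokens only
console.log('Tokens/sec:', newTokens / ((end - start) / 1000));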
yup this model is a quanti... run on phone.
I trained this model: NickyNicky/Phi-3-mini-4k-instruct_orpo_V2 (https://huggingface.co/NickyNicky/Phi-3-mini-4k-instruct_orpo_V2)
Then I quantized it with ONNX, but it came out at 10 GB. How did you compress it so much?
The latest versions of ONNXRuntime support two forms of 4-bit quantization (for certain weights): q4 and bnb4 (the same dtypes listed above). You should also be able to quantize the other weights to fp16 or q8.
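On the transformers.js side you then pick whichever variant you exported via dtype. A sketch, assuming you upload the quantized ONNX weights to your repo:

import { AutoModelForCausalLM } from '@xenova/transformers';

// 'q4' / 'bnb4' select the 4-bit weight files; 'fp16' / 'q8' would select
// the half-precision / 8-bit ones instead.
const model = await AutoModelForCausalLM.from_pretrained(
  'NickyNicky/Phi-3-mini-4k-instruct_orpo_V2', // assumes the ONNX weights live in this repo
  { dtype: 'q4' },
);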
Hope that helps!
Awesome!
Is this still needed?
- const dtype = 'float32';
- const empty = [];
+ const dtype = 'float16';
+ const empty = new Uint16Array();
// yep :-)
Whoop!
<s><|user|> Why is the sky blue?<|end|><|assistant|> The sky appears blue to
@Xenova How did you split it into 2 pieces below 2 GB?
Would you say the 128K context version is now ready for implementation? Or are there still workarounds needed?
When can I use this model in web?
still "Unsupported model type: phi3"
@webjjin You need to install transformers.js v3 from the dev branch:
npm install xenova/transformers.js#v3
See here for example code: https://github.com/xenova/transformers.js/blob/e32d4ebb6fe715e6634335123c07a96d0dc62ac8/examples/webgpu-chat/src/worker.js
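The core of that example, stripped down (an untested sketch against the v3 dev branch, reusing the options shown earlier in this thread):

import { AutoModelForCausalLM, AutoTokenizer } from '@xenova/transformers';

const model_id = 'Xenova/Phi-3-mini-4k-instruct';
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const model = await AutoModelForCausalLM.from_pretrained(model_id, {
  dtype: 'q4',
  device: 'webgpu',
  use_external_data_format: true,
});

// Build the <|user|> ... <|assistant|> prompt from the chat template
// instead of writing it by hand.
const text = tokenizer.apply_chat_template(
  [{ role: 'user', content: 'Tell me a joke' }],
  { tokenize: false, add_generation_prompt: true },
);
const inputs = tokenizer(text);

const outputs = await model.generate({ ...inputs, max_new_tokens: 64 });
console.log(tokenizer.batch_decode(outputs, { skip_special_tokens: true }));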
I have tried it, but I got an error. The following file does not exist:
https://cdn.jsdelivr.net/npm/[email protected]/dist/ort-wasm-simd-threaded.jsep.mjs
There is no [email protected] on npm:
https://www.npmjs.com/package/onnxruntime-web
How can I get [email protected]?
I remember Xenova saying he had early access to those files. Perhaps you can download them via/from the demo?
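If you do manage to get the files, you can point the runtime at your own copy instead of the jsdelivr URL. A sketch, assuming you serve the ort-wasm-* files yourself under /ort/:

import { env } from '@xenova/transformers';

// Host onnxruntime-web's WASM artifacts yourself and override the
// default CDN location they are fetched from.
env.backends.onnx.wasm.wasmPaths = '/ort/';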
Thank you for the reply, but I'd better wait for the stable version of transformers.js v3.
I just tried today, with the following steps:
- Cloned the transformers.js project locally.
- Switched to the v3 branch.
- Installed dependencies and built the project: npm install and npm run build.
- Copied the dist folder to a test project folder.
- Tried running Phi-3 with this sample code:
import { pipeline, env } from './dist/transformers.js';

const model_id = 'Xenova/Phi-3-mini-4k-instruct';
env.backends.onnx.wasm.proxy = false;

const pipe = await pipeline('text-generation', model_id, {
  dtype: 'q4',
  device: 'webgpu',
  use_external_data_format: true,
});
But I'm blocked at this error:
Error: Can't create a session. ERROR_CODE: 1, ERROR_MESSAGE: Deserialize tensor model.layers.1.mlp.up_proj.MatMul.weight_Q4 failed.Failed to load external data file "./model_q4.onnx_data", error: Module.MountedFiles is not available.
at We (ort.webgpu.min.js:22:13223)
at Pd (ort.webgpu.min.js:2309:19615)
Any idea how I could solve this?