Awww yes!
So much yes! Finally 128K context in the browser!
- Is any part of this WebGPU accelerated in V2 or V3?
- The file is surprisingly small at 1.45 GB. Since the original Microsoft file is 2.5 GB, does that mean it's heavily quantized? (https://huggingface.co/microsoft/Phi-3-mini-128k-instruct-onnx/tree/main/cpu_and_mobile/cpu-int4-rtn-block-32)
- Isn't implementation rather complicated? For example, llama.cpp still doesn't have support for this model. Or does the ONNX format solve those issues somehow? The llama.cpp issue on GitHub: https://github.com/ggerganov/llama.cpp/issues/6849
it's possible to overdo it on the "yes".
Unsupported model type: phi3
I guess I need to be patient a little bit longer.
Hey! Indeed, we're still working on this, and we'll make an announcement once it's working 100%! To answer your questions:
1) Yes, this will be part of v3 w/ WebGPU acceleration
2) The model is split into two parts: ~830 MB + 1.45 GB. Both need to be below 2 GB to be cacheable.
3) We're relying on MSFT's official ONNX export/implementation, which simplifies things a lot for us! :)
Super cool on all fronts. Thanks for explaining!
If I can help test, let me know.
I couldn't resist and tried to get it running with the latest version of v3, but only got 404s no matter which dtype I tried.
What will be the correct incantation? Is this close?
const generator = await pipeline('text-generation', 'Xenova/Phi-3-mini-128k-instruct', {
  dtype: 'q4', // fp32, fp16, q8, int8, uint8, q4, bnb4
  progress_callback: (data) => {
    if (data.status !== 'progress') return;
    setLoadProgress(data);
  },
});
You need to use this revision: https://huggingface.co/Xenova/Phi-3-mini-128k-instruct/discussions/3, by setting revision: 'refs/pr/3'. You also need to set use_external_data_format: true, which was introduced by the latest commits.
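In other words, your earlier call would become something like this (untested sketch, just your snippet with the two options added):

const generator = await pipeline('text-generation', 'Xenova/Phi-3-mini-128k-instruct', {
  dtype: 'q4',
  revision: 'refs/pr/3', // the weights currently live in this PR
  use_external_data_format: true, // the model is split across .onnx + .onnx_data files
});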
I can share some example code in a few hours, but I’m still getting erroneous output. Hopefully will get it working soon :)
Here's some HIGHLY EXPERIMENTAL WORK IN PROGRESS code:
import { env, AutoModelForCausalLM, AutoTokenizer } from '@xenova/transformers';

// disable proxying for now (much slower)
env.backends.onnx.wasm.proxy = false;

const model_id = 'Xenova/Phi-3-mini-128k-instruct';
const tokenizer = await AutoTokenizer.from_pretrained(model_id, {
  legacy: true, // TODO: update config
});

const prompt = `<|user|>
Tell me a joke<|end|>
<|assistant|>
`;
const inputs = tokenizer(prompt);

const model = await AutoModelForCausalLM.from_pretrained(model_id, {
  dtype: 'q4',
  // device: 'webgpu', // NOTE: webgpu produces incorrect results
  use_external_data_format: true,
  revision: 'refs/pr/3',
});

// { // warm up
//   const outputs = await model.generate({ ...inputs, max_new_tokens: 1 });
// }

{ // run + time execution
  const start = performance.now();
  const outputs = await model.generate({ ...inputs, max_new_tokens: 5 }); // TODO: increase max new tokens
  const end = performance.now();
  console.log(tokenizer.batch_decode(outputs, { skip_special_tokens: false }));
  console.log('Execution Time:', end - start);
}
NOTE: to get it working, you need to use the latest commit of transformers.js v3.
And just for now, you also need to replace this in src/models.js:
- const dtype = 'float32';
- const empty = [];
+ const dtype = 'float16';
+ const empty = new Uint16Array();
(Will not be necessary once we update the model).
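(Some context on the Uint16Array, as I understand it: JavaScript has no Float16Array, so onnxruntime-web carries fp16 data as raw 16-bit words. An illustrative construction, with made-up dimensions:)

import { Tensor } from 'onnxruntime-web';

// An empty fp16 past-key-values tensor: the Uint16Array holds the raw
// bits of half-precision floats, since JS has no Float16Array type.
const empty = new Tensor('float16', new Uint16Array(0), [1, 32, 0, 96]);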
WASM produces correct results, while WebGPU does not. Will continue to investigate.
Figured out the problem: the latest version of ONNXRuntime-web hadn't yet been published to NPM.
Here's a demo of phi-3-mini-128k-instruct running at ~20 tokens per second on an RTX 2080:
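(If you want to reproduce the tokens-per-second figure from the timing block above — a sketch, assuming generate returns a [batch, sequence] tensor as in the earlier snippet:)

const newTokens = outputs.dims[1] - inputs.input_ids.dims[1]; // count generated tokens only
console.log('Tokens/sec:', newTokens / ((end - start) / 1000));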
yup this model is a quanti... run on phone.
I trained this model: NickyNicky/Phi-3-mini-4k-instruct_orpo_V2 (https://huggingface.co/NickyNicky/Phi-3-mini-4k-instruct_orpo_V2)
Then I quantized it with ONNX, but it came out at 10 GB. How did you compress it so much?
The latest versions of ONNXRuntime support two forms of 4-bit quantization (for certain weights): q4 and bnb4 (the same dtypes listed above). You should also be able to quantize the other weights to fp16 or q8.
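On the transformers.js side you then pick whichever variant you exported via dtype. A sketch, assuming you upload the quantized ONNX weights to your repo:

import { AutoModelForCausalLM } from '@xenova/transformers';

// 'q4' / 'bnb4' select the 4-bit weight files; 'fp16' / 'q8' would select
// the half-precision / 8-bit ones instead.
const model = await AutoModelForCausalLM.from_pretrained(
  'NickyNicky/Phi-3-mini-4k-instruct_orpo_V2', // assumes the ONNX weights live in this repo
  { dtype: 'q4' },
);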
Hope that helps!
Awesome!
Is this still needed?
- const dtype = 'float32';
- const empty = [];
+ const dtype = 'float16';
+ const empty = new Uint16Array();
// yep :-)
Whoop!
<s><|user|> Why is the sky blue?<|end|><|assistant|> The sky appears blue to
@Xenova How did you split it into 2 pieces below 2 GB?
Would you say the 128K context version is now ready for implementation? Or are there still workarounds needed?
When can I use this model in web?
still "Unsupported model type: phi3"
@webjjin You need to install transformers.js v3 from the dev branch:
npm install xenova/transformers.js#v3
See here for example code: https://github.com/xenova/transformers.js/blob/e32d4ebb6fe715e6634335123c07a96d0dc62ac8/examples/webgpu-chat/src/worker.js
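The core of that example, stripped down (an untested sketch against the v3 dev branch, reusing the options shown earlier in this thread):

import { AutoModelForCausalLM, AutoTokenizer } from '@xenova/transformers';

const model_id = 'Xenova/Phi-3-mini-4k-instruct';
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const model = await AutoModelForCausalLM.from_pretrained(model_id, {
  dtype: 'q4',
  device: 'webgpu',
  use_external_data_format: true,
});

// Build the <|user|> ... <|assistant|> prompt from the chat template
// instead of writing it by hand.
const text = tokenizer.apply_chat_template(
  [{ role: 'user', content: 'Tell me a joke' }],
  { tokenize: false, add_generation_prompt: true },
);
const inputs = tokenizer(text);

const outputs = await model.generate({ ...inputs, max_new_tokens: 64 });
console.log(tokenizer.batch_decode(outputs, { skip_special_tokens: true }));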
I have tried it, but I got an error. The following file does not exist:
https://cdn.jsdelivr.net/npm/[email protected]/dist/ort-wasm-simd-threaded.jsep.mjs
There is no [email protected] on npm:
https://www.npmjs.com/package/onnxruntime-web
How can I get [email protected]?
I remember Xenova saying he had early access to those files. Perhaps you can download them via/from the demo?
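If you do manage to get the files, you can point the runtime at your own copy instead of the jsdelivr URL. A sketch, assuming you serve the ort-wasm-* files yourself under /ort/:

import { env } from '@xenova/transformers';

// Host onnxruntime-web's WASM artifacts yourself and override the
// default CDN location they are fetched from.
env.backends.onnx.wasm.wasmPaths = '/ort/';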
Thank you for the reply, but I'd better wait for the stable version of transformers.js v3.
I just tried today, with the following steps:
- Cloned the transformers.js project locally.
- Switched to the v3 branch.
- Installed dependencies and built the project: npm install and npm run build.
- Copied the dist folder to a test project folder.
- Tried running Phi-3 with this sample code:
import { pipeline, env } from './dist/transformers.js';

const model_id = 'Xenova/Phi-3-mini-4k-instruct';
env.backends.onnx.wasm.proxy = false;

const pipe = await pipeline('text-generation', model_id, {
  dtype: 'q4',
  device: 'webgpu',
  use_external_data_format: true,
});
But I'm blocked at this error:
Error: Can't create a session. ERROR_CODE: 1, ERROR_MESSAGE: Deserialize tensor model.layers.1.mlp.up_proj.MatMul.weight_Q4 failed.Failed to load external data file "./model_q4.onnx_data", error: Module.MountedFiles is not available.
at We (ort.webgpu.min.js:22:13223)
at Pd (ort.webgpu.min.js:2309:19615)
Any idea how I could solve this?