All You Need To Know About Apple Intelligence Architecture And Models!!
One key challenge with running LLMs on device is balancing compute, performance, and model size. Apple Intelligence addresses this with small, task-specialized adapters that plug into the on-device foundation model only when needed.
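Roughly, you can picture it as one frozen base model with small adapters swapped in per task; a hypothetical sketch (none of these names are Apple API):

```swift
// Hypothetical sketch of adapter swapping on a shared base model.
// These types only illustrate the idea; they are not Apple's API.

struct LoRAAdapter {
    let task: String   // e.g. "writing", "summarization"
    let rank: Int      // low-rank dimension of the update matrices
}

final class OnDeviceFoundationModel {
    private var activeAdapter: LoRAAdapter?

    // The base weights stay fixed; only the small adapter is swapped in.
    func load(_ adapter: LoRAAdapter) {
        activeAdapter = adapter
    }

    func generate(_ prompt: String) -> String {
        // Base forward pass plus the active adapter's low-rank update.
        "response for task: \(activeAdapter?.task ?? "general")"
    }
}

let model = OnDeviceFoundationModel()
model.load(LoRAAdapter(task: "writing", rank: 16))
print(model.generate("Rewrite this email to sound friendlier."))
```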
For compute, they engineered a new framework that uses rank-16 LoRA adapters with a mixed 2-bit and 4-bit quantization configuration, averaging 3.5 bits per weight while achieving the same accuracy as the uncompressed models.
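For intuition: a LoRA adapter replaces a frozen weight matrix W with W + B·A, where B and A are thin rank-r matrices, so only a tiny fraction of the weights is task-specific. The 3.5 bits-per-weight average also falls out of the mix directly; for example, quantizing three quarters of the weights at 4 bits and the rest at 2 bits gives 0.75·4 + 0.25·2 = 3.5 (the exact split isn't stated). A toy sketch of the LoRA forward pass:

```swift
// Minimal LoRA forward sketch: y = W·x + B·(A·x), with rank r much smaller
// than the weight dimensions. Illustrative only, not Apple's implementation.

func matVec(_ m: [[Double]], _ v: [Double]) -> [Double] {
    m.map { row in zip(row, v).reduce(0) { $0 + $1.0 * $1.1 } }
}

let d = 4, k = 4, r = 2  // toy sizes; the real adapters use rank 16
let W = Array(repeating: Array(repeating: 0.1, count: k), count: d)  // frozen base weight (d×k)
let B = Array(repeating: Array(repeating: 0.01, count: r), count: d) // adapter up-projection (d×r)
let A = Array(repeating: Array(repeating: 0.02, count: k), count: r) // adapter down-projection (r×k)
let x = [1.0, 2.0, 3.0, 4.0]

// The adapter adds only (d·r + r·k) trainable weights on top of the
// frozen d·k base weights.
let base = matVec(W, x)
let delta = matVec(B, matVec(A, x))
let y = zip(base, delta).map { $0 + $1 }
print(y)
```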
With the help of Talaria, an interactive model latency and power analysis tool, they were able to guide the bit-rate selection for each operation. This, along with activation and embedding quantization plus efficient key-value caching, achieves up to 30 tokens/sec on iPhone 15 Pro.
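At 30 tokens/sec that's roughly 33 ms of compute per generated token, which is only feasible because past keys and values are cached rather than recomputed at every decoding step. A minimal sketch of the idea (illustrative, not Apple's implementation):

```swift
// Minimal key-value cache sketch: during decoding, each step appends its
// key/value pair so attention only runs against the cache instead of
// re-encoding the whole sequence. Illustrative only.

struct KVCache {
    private(set) var keys: [[Double]] = []
    private(set) var values: [[Double]] = []

    mutating func append(key: [Double], value: [Double]) {
        keys.append(key)
        values.append(value)
    }
}

var cache = KVCache()
// One append per generated token (per layer in a real model); at
// 30 tokens/sec that budget is ~33 ms per token end to end.
cache.append(key: [0.1, 0.2], value: [0.3, 0.4])
cache.append(key: [0.5, 0.6], value: [0.7, 0.8])
print(cache.keys.count) // 2 cached positions so far
```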
When the model is prompted (e.g., to rewrite an email in the Mail app), the app goes through the App Intents toolbox, which routes the prompt to the adapter specialized for writing; the model responds through the same pipeline, updating the text to rewrite in real time.
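On the app side, exposing an action like this goes through the App Intents framework. Here's a minimal sketch: AppIntent, @Parameter, and perform() are real framework API, while rewriteWithWritingAdapter is a hypothetical stand-in for the system-side writing-adapter pipeline:

```swift
import AppIntents

// Minimal App Intents sketch (iOS 16+). Only the framework types are real;
// the model call below is a hypothetical placeholder.
struct RewriteEmailIntent: AppIntent {
    static var title: LocalizedStringResource = "Rewrite Email"

    @Parameter(title: "Draft")
    var draft: String

    func perform() async throws -> some IntentResult & ReturnsValue<String> {
        // Hypothetical: the system routes the prompt to the writing adapter
        // and streams the rewritten text back to the app in real time.
        let rewritten = try await rewriteWithWritingAdapter(draft)
        return .result(value: rewritten)
    }
}

// Hypothetical placeholder for the on-device model call.
func rewriteWithWritingAdapter(_ text: String) async throws -> String {
    text // echo; the real pipeline would return the rewritten draft
}
```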
The coolest feature of these models is their ability to dynamically specialize for users' everyday activities. For this, the adapters modify the attention matrices, the attention projection matrix, and the fully connected layers in the point-wise feed-forward networks for a suitable subset of the transformer's decoding layers.
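In config terms, that means LoRA updates attached to the query/key/value projections, the attention output projection, and the FFN linears on a chosen subset of decoder layers; a hypothetical sketch (names and layer count are illustrative, not Apple's):

```swift
// Hypothetical adapter config: which sub-modules of each chosen decoder
// layer receive LoRA updates. Illustrative names only.
enum AdaptedModule: String, CaseIterable {
    case attentionQKV        // attention matrices
    case attentionOutputProj // attention projection matrix
    case feedForward         // fully connected layers in the FFN
}

struct AdapterConfig {
    let rank = 16
    let adaptedLayers: [Int]  // the "suitable set" of decoding layers
    let modules = AdaptedModule.allCases
}

let config = AdapterConfig(adaptedLayers: Array(0..<24)) // illustrative depth
print(config.modules.map(\.rawValue))
```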
For tasks that require more capable models, the architecture falls back to larger server models running on Private Cloud Compute, an infrastructure designed to deliver a secure and verifiably private experience.
More on Private Cloud Compute: https://developer.apple.com/videos/play/wwdc2024/102/