Jaward posted an update May 20
After spending some time practicing tokenization, I have come to realize that the difficulties we face in understanding each other are analogous to the challenges LLMs face in processing and interpreting tokens - untrained tokens lead to out-of-distribution qualms.

One could think of how we understand as a process in which trained tokens (known/learned facts) grapple with prompts/tweets/lessons from someone else. This process is distinct for each person - each with unique encoding, decoding, merging and splitting patterns.

This distinction might as well be categorized in GPT levels lol, which begs the question: what level of tokenizer are you? GPT-2, GPT-3, GPT-4 or GPT-4o? :)

Papers:
- Neural Machine Translation of Rare Words with Subword Units (https://arxiv.org/abs/1508.07909)
- Learning to Compress Prompts with Gist Tokens (https://arxiv.org/abs/2304.08467)
- Language Models are Few-Shot Learners (https://arxiv.org/abs/2005.14165)
- Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models (https://arxiv.org/abs/2405.05417)
- Language Models are Unsupervised Multitask Learners

Code:
https://github.com/karpathy/minbpe
https://github.com/openai/tiktoken
https://github.com/openai/gpt-2
https://github.com/google/sentencepiece

Yes:
I found out that, in fact, it does not really matter which tokenizer we use until it gets to long context and large sliding windows: the training of the tokenizer is very fast, so we have a big area to play with.

We can see that Karpathy used a simple char tokenizer! (it was enough and worked well in a toy model)..
and we can see that with the BPE algorithm we have something!
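A char-level tokenizer like the one in Karpathy's toy models really is just a few lines - a minimal sketch (the corpus string is a placeholder):

```python
# Minimal character-level tokenizer, in the spirit of Karpathy's toy models.
corpus = "hello world"  # placeholder training text

# Build the vocabulary: every distinct character gets an integer id.
chars = sorted(set(corpus))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(text: str) -> list[int]:
    """Map each character to its integer id."""
    return [stoi[c] for c in text]

def decode(ids: list[int]) -> str:
    """Map ids back to characters."""
    return "".join(itos[i] for i in ids)

print(encode("hello"))          # [3, 2, 4, 4, 5] for this toy corpus
print(decode(encode("hello")))  # "hello"
```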

But as we proceed to usage, we understand that word tokenization is key to understanding relations between words; we also need to understand relationships between sentences and chunks of information, hence sentence tokenizers:
that is, SentencePiece or WordPiece (BPE)... we can now understand the relationships between words - and yet this is not enough!
So we can go BIG! Paragraph to paragraph and document to document!

So when we tokenize from here... we tokenize down... i.e. we use a sub-sentence tokenizer (word) and a subword tokenizer...
So now we can tokenize any portion of text, from document to char... and discover meaning at each level of the vocabulary merges... i.e. the vocabulary is separated into chars, words, sentences, paragraphs etc...
Yes, that big!
As we scale upwards, we understand the desire to make the model small, but this big-boy tokenizer can handle even books!
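Training such a subword tokenizer is indeed quick - a minimal sketch with SentencePiece ("corpus.txt" and the vocab size are placeholders):

```python
import sentencepiece as spm

# Train a BPE subword model on a plain-text corpus (placeholder file name).
spm.SentencePieceTrainer.train(
    input="corpus.txt",      # one sentence per line
    model_prefix="toy_bpe",  # writes toy_bpe.model / toy_bpe.vocab
    vocab_size=2000,         # placeholder; production models use 32k-100k+
    model_type="bpe",
)

# Load the trained model and encode text at the subword level.
sp = spm.SentencePieceProcessor(model_file="toy_bpe.model")
print(sp.encode("hello world", out_type=str))  # subword pieces
print(sp.encode("hello world", out_type=int))  # their ids
```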
When we embed information we can now get very fast responses and can even discover whole similarity between books and manuals in the embedding space... the models need to be in sync! So that the tokenizer (ids/embeddings) can be used in the transformer... as if it takes embeddings as input and embeddings as output... voila?
The transformer has its own embedding table too, essentially chunking the data based on the tokenizer's vocabulary (hybrid)...
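Inside the transformer, that table is just an embedding lookup indexed by token ids - a minimal PyTorch sketch with placeholder sizes:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 2000, 64   # placeholder sizes
embedding = nn.Embedding(vocab_size, d_model)

# Token ids produced by the tokenizer (placeholder values).
token_ids = torch.tensor([[3, 17, 291, 5]])

# Each id is mapped to a learned d_model-dimensional vector.
vectors = embedding(token_ids)
print(vectors.shape)  # torch.Size([1, 4, 64])
```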

i.e.: what is a token?
It is only some form of information to which a value can be attributed... in our vocabulary we would be using the BPE algorithm, so only common sequences of tokens (and sequence lengths) would be in the tokenizer... so a phrase such as "hello world" may even become a single token (a commonly used sequence), but "world hello" may not!
If you understand how the tokenizer is "grown" after you have chosen your vocabulary size and learned its optimum, the tokenizer will contain hybrid chunks!
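To make the "growing" concrete, here is a toy version of the BPE merge loop - a minimal sketch over a placeholder corpus (real implementations like minbpe or tiktoken work the same way on bytes, just far more efficiently):

```python
from collections import Counter

def most_common_pair(ids):
    """Count adjacent pairs and return the most frequent one."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with a single new token id."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

corpus = "hello world hello world hello there"  # placeholder corpus
ids = list(corpus.encode("utf-8"))              # start from raw bytes
merges = {}
next_id = 256                                   # byte values occupy 0..255

for _ in range(10):                             # 10 merges for illustration
    pair = most_common_pair(ids)
    if pair is None:
        break
    merges[pair] = next_id
    ids = merge(ids, pair, next_id)
    next_id += 1

print(merges)  # learned merges: frequent adjacent pairs get their own new ids
print(ids)     # the repeated phrase collapses into far fewer tokens
```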

Sometimes we don't have to understand what the end product of the vocabulary is! We just need to train it, and afterwards we can look at what was created (corpus-dependent)... so when training the tokenizer on code, many (functions) may find themselves as a single chunk,
as well as keywords from the programming language as the common tokens... and a word like supercalifragilistic... would never appear as one, but due to the process it would be chunked the same as any other BPE output!
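You can see this corpus dependence directly with tiktoken - a minimal sketch (the exact ids and counts depend on the encoding chosen):

```python
import tiktoken

# GPT-4-style encoding; trained heavily on code, so common keywords are cheap.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["def ", "import numpy as np", "hello world", "supercalifragilistic"]:
    ids = enc.encode(text)
    print(f"{text!r:25} -> {len(ids)} token(s): {ids}")
```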

Understanding the process of tokenization only leaves you with more choices and more possible decisions to make,

but the technique of BPE <<<< is the icing on the cake! As it IS a one-fit-for-all and covers all bases!

Now that's solved... it's off to choose your embedding strategy!! (LOL)... in truth a lot has been spent on this topic and we can see some key figures rising...

So we need to be able to ADD the extra step!
An extra embeddings stage, adding even more meaning to the model... and it's not for the model's sake, because when we begin our training... these models are FIXED!

Hence it's strange that many models use BERT, but also many use Llama?? i.e. Mistral? Why not BERT, as this is where the main interop and research is focused? Many people use the BERT models as a starting point, only to find out later that these models are EXTRA resources... like embedding models which are also taking up SPACE! << but are providing a service which the LLM (can provide). In fact we should always use our own model to embed our data into our RAG, not an outside model: the query run against the RAG produces content which the model is supposed to provide, but an outside model would not draw on the same actual pool of information if asked. This augmentation could be made more efficient by tokenizing it with the model's own embeddings, essentially pre-embedding the query! <<
And still, with Python we should be able to merge these models together into a single model... i.e. train an embeddings model and install it into the LLM tensor stack, execute it the same way, and have the extra capability to fine-tune these internal tokenizer embeddings as well!
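A minimal sketch of that idea - reusing the same base model's hidden states as the retrieval embeddings instead of a separate embedding model (the checkpoint name and mean-pooling choice are assumptions, using the transformers library):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder checkpoint; in practice this would be the same base model as the LLM.
name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

@torch.no_grad()
def embed(text: str) -> torch.Tensor:
    """Mean-pool the last hidden state, so tokenizer and embeddings stay in sync."""
    inputs = tokenizer(text, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, d_model)
    return hidden.mean(dim=1).squeeze(0)        # (d_model,)

docs = ["BPE merges frequent byte pairs.", "SentencePiece trains subword models."]
doc_vecs = torch.stack([embed(d) for d in docs])

query_vec = embed("how does byte pair encoding work?")
scores = torch.nn.functional.cosine_similarity(doc_vecs, query_vec.unsqueeze(0))
print(docs[int(scores.argmax())])  # the most similar document
```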

Hmm...
with all the different model types...
who will win in the end! << as more focus is on vision now and trainable audio feature modalities, which should really also be included in the tensor stack and accessed with their various AutoModel classes!??

Hmm... anyway, a great journey and learning experience to have inside!

It makes the mind boggle!! LOL (inspires you to create amazing stuff)


Yep, all this just shows, at both low and high levels, how complex language is; even with abstractions we still fall at the mercy of undesired outcomes. So far tiktoken and SentencePiece are viable choices for larger models.