added info about CLIP interrogation results
README.md
CHANGED
@@ -11,7 +11,7 @@ To give you a sense of what I mean, this model is Stable Diffusion v1.5 fine-tun
 
 In a similar manner lists of words are presented with the concept/class "word" included in the caption, but the words themselves are not spelled out. Can Stable Diffusion learn to put together legible letters, words, and sentences simply by "learning" from data presented in this manner? This is what I aim to understand.
 
-The sample images here include those that contain legible/partially legible words, including words (usually) from the image generation prompt. I suspect these are words that are "abundant", visually, in the dataset, but at this point I cannot and will not draw any conclusions about whether or not, for instance, the Stable Diffusion model has learned to identify the word "tree" with its visual representation, both in picture form (a picture of a tree) and "written" form (a picture of the word "tree"). I have found by interrogating CLIP/BLIP to generate captions for some of these images that it picks up on the words written in the picture, even nonsense words and/or misspellings and those with letters whose form isn't well-defined.
+The sample images here include those that contain legible/partially legible words, including words (usually) from the image generation prompt. I suspect these are words that are "abundant", visually, in the dataset, but at this point I cannot and will not draw any conclusions about whether or not, for instance, the Stable Diffusion model has learned to identify the word "tree" with its visual representation, both in picture form (a picture of a tree) and "written" form (a picture of the word "tree"). I have found by interrogating CLIP/BLIP to generate captions for some of these images that it picks up on the words written in the picture, even nonsense words and/or misspellings and those with letters whose form isn't well-defined. One of those results is shown below.
 
 v1 of Storytime was trained using 130 512x512 images of alphabet letters, flashcards, word charts, pages of text, and pages of text with images on a computer running Windows 10 with a single Nvidia RTX a4000 GPU for 300 epochs with a batch size of 8. No prior preservations images were used.
 