---
license: other
license_name: stable-cascade-nc-community
license_link: https://huggingface.co/stabilityai/stable-cascade/blob/main/LICENSE
language:
- en
tags:
- stable-cascade
- SDXL
- art
- artstyle
- fantasy
- anime
- aiart
- ketengan
- SomniumSC
pipeline_tag: text-to-image
library_name: diffusers
---
# SomniumSC-v1.1 Model Showcase
<p align="center">
<img src="01.png" width=70% height=70%>
</p>
`Ketengan-Diffusion/SomniumSC-v1.1` is a fine-tuned Stage C model based on [stabilityai/stable-cascade](https://huggingface.co/stabilityai/stable-cascade).
This is a fine-tune of Stability AI's new Stable Cascade (also known as Würstchen v3) toward a 2D (cartoonish) style, trained on the 3.6B-parameter Stage C model. The text encoder is also trained for the 2D style, so the model can generate from booru-tag prompts as well as natural language.
The model uses the same dataset size and curation method as AnySomniumXL v2: 33,000+ curated images selected from hundreds of thousands of images from various sources. The dataset keeps only images with an aesthetic score of at least 19 and at most 50 (the upper bound keeps the style cartoonish rather than too realistic; the scale comes from our proprietary aesthetic scoring mechanism) and no text or watermarks such as signatures, and no comic/manga pages. Images falling outside that range, or containing watermarks or text, were discarded.
# Demo
Huggingface Space: [spaces/Ketengan-Diffusion/SomniumSC-v1.1-Demo](https://huggingface.co/spaces/Ketengan-Diffusion/SomniumSC-v1.1-Demo)
Our Official Demo (Temporary Backup): somniumscdemo.ketengan.com
# Training Process
SomniumSC v1.1 technical specifications:
- Epochs: 30 (the showcased results use epoch 40 of the SomniumSC run)
- Captioned by a proprietary multimodal LLM, with better results than LLaVA
- Trained with bucket sizes of 1024x1024 and 1536x1536 (multi-resolution)
- Shuffle caption: yes
- Clip skip: 0
- Hardware: 1x NVIDIA A100 80GB
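The multi-resolution bucketing listed above can be sketched as follows. This is a hypothetical illustration only (the actual training code is not part of this card): each image is assigned to the bucket whose pixel area is closest to its own.

```python
# Hypothetical sketch of multi-resolution bucketing; the bucket sizes
# match the training specifications above.
BUCKETS = [(1024, 1024), (1536, 1536)]

def nearest_bucket(width: int, height: int) -> tuple:
    """Pick the training bucket closest in pixel area to the source image."""
    area = width * height
    return min(BUCKETS, key=lambda b: abs(b[0] * b[1] - area))
```

A 1000x1400 image, for example, lands in the 1024x1024 bucket because its area is closer to 1024² than to 1536².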
# Our Dataset Process Curation
<p align="center">
<img src="Curation.png" width=70% height=70%>
</p>
Image source: [Source1](https://danbooru.donmai.us/posts/3143351) [Source2](https://danbooru.donmai.us/posts/3272710) [Source3](https://danbooru.donmai.us/posts/3320417)
Our dataset is scored with the pretrained CLIP+MLP aesthetic scoring model from https://github.com/christophschuhmann/improved-aesthetic-predictor, and we adjusted our script to detect text and watermarks using OCR via pytesseract.
<p align="center">
<img src="Chart.png" width=70% height=70%>
</p>
This scoring method uses a scale from -1 to 100. We take a minimum threshold of around 17-20 and a maximum of around 50-75 to retain the 2D style of the dataset; any image containing text returns a score of -1. Images scoring below 17 or above 65 are therefore deleted.
The dataset curation process ran on an NVIDIA T4 16GB machine and took about 7 days to curate 1,000,000 images.
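The curation rule above reduces to a small filter over aesthetic scores. This is a hedged sketch (the `keep_image` name is illustrative; the real pipeline combines the aesthetic predictor with the pytesseract OCR step):

```python
MIN_SCORE = 17  # lower aesthetic threshold (the card cites around 17-20)
MAX_SCORE = 65  # scores above this are deleted to keep the 2D style

def keep_image(aesthetic_score: float) -> bool:
    """Return True if an image survives curation.

    Images containing text are scored -1 by the OCR step, so they
    always fall below MIN_SCORE and are discarded automatically.
    """
    return MIN_SCORE <= aesthetic_score <= MAX_SCORE
```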
# Captioning process
We use a combination of a proprietary multimodal LLM and open-source multimodal LLMs such as LLaVA 1.5 for captioning, which produces richer results than plain BLIP2. Details such as clothing, atmosphere, situation, scene, place, gender, skin, and more are generated by the LLM.
# Tagging Process
We simply use booru tags retrieved from booru boards. Because these tags are assigned manually by humans, they are more accurate.
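Since the text encoder was trained on both caption styles, a prompt can combine a natural-language caption with booru tags. A hypothetical helper (`build_prompt` is illustrative, not part of the model):

```python
def build_prompt(caption: str, tags: list) -> str:
    """Join an LLM-style natural-language caption with booru tags
    into a single comma-separated prompt string."""
    return ", ".join([caption] + tags) if tags else caption

prompt = build_prompt(
    "a girl standing in a sunflower field at sunset",
    ["1girl", "sunflower", "outdoors", "smile"],
)
```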
# Limitations:
- Still requires broader dataset training for more variation in poses and styles
- Text cannot be generated correctly and tends to look ruined
- Optimized for human (or human-like) generation; non-human subjects such as SCPs, ponies, and others may not produce what you expect
- Faces may look compressed; generating at 1536px can give better results

A smaller half-size version and a Stable Cascade Lite version will be released soon.
# How to use SomniumSC:
Currently, Stable Cascade is only supported by ComfyUI.
You can follow the tutorial [here](https://gist.github.com/comfyanonymous/0f09119a342d0dd825bb2d99d19b781c#file-stable_cascade_workflow_test-json) or [here](https://medium.com/@codeandbird/run-new-stable-cascade-model-in-comfyui-now-officially-supported-f66a37e9a8ad).
To simplify which models you should download:
- Stage A: download from the [official stabilityai/stable-cascade repo](https://huggingface.co/stabilityai/stable-cascade).
- Stage B: download from the [official stabilityai/stable-cascade repo](https://huggingface.co/stabilityai/stable-cascade).
- Stage C: download the safetensors from this repository's Files tab.
- Text encoder: download from the `text_encoder` folder of this repository.
# Deploying SomniumSC v1.1 with Diffusers 🧨
⚠️ Warning: you must install this diffusers branch for the code below to work with the Stable Cascade architecture:
```
pip install git+https://github.com/kashif/diffusers.git@a3dc21385b7386beb3dab3a9845962ede6765887
```
A simple SomniumSC-v1.1 inference example:
```python
import torch
from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
num_images_per_prompt = 1
print(f"Running on: {device}")

# Point the prior to the fine-tuned Stage C model of your choice
prior = StableCascadePriorPipeline.from_pretrained(
    "Ketengan-Diffusion/SomniumSC-v1.1", torch_dtype=torch.bfloat16
).to(device)
# Point the decoder to the "mother" model from stabilityai (Stages A and B)
decoder = StableCascadeDecoderPipeline.from_pretrained(
    "stabilityai/stable-cascade", torch_dtype=torch.float16
).to(device)

prompt = "An astronaut riding a horse"
negative_prompt = ""

prior_output = prior(
    prompt=prompt,
    height=1024,
    width=1024,
    negative_prompt=negative_prompt,
    guidance_scale=12.0,
    num_images_per_prompt=num_images_per_prompt,
    num_inference_steps=50,
)
decoder_output = decoder(
    image_embeddings=prior_output.image_embeddings.half(),
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=1.0,
    output_type="pil",
    num_inference_steps=10,
).images

# Save the first generated image
decoder_output[0].save("output.png")
```
# SomniumSC Pro tips:
A negative prompt is a must for better-quality output. The recommended negative prompt is: `lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry, artist name`

If the model produces pointy ears on a character, add `elf` or `pointy ears` to the negative prompt.

If the model produces a "compressed face", use a 1536px resolution so the model can render the face clearly.
# Disclaimer:
This model is under the STABILITY AI NON-COMMERCIAL RESEARCH COMMUNITY LICENSE: the model cannot be sold, and derivative works cannot be commercialized. As far as I know, you can purchase a Stability AI membership to commercialize derivative works based on this model. Please support Stability AI so they can keep providing open-source models for us. You are still free to merge our model.