Trying Stable Diffusion 3.5 in ComfyUI (Part 2: Medium / Large Turbo)

November 4, 2024, 20:34

※ Last update 11-4-2024
※ SD 3.5 Medium is not compatible with Large/Large Turbo, but they share the same Text Encoders and VAE.
※ Medium is covered in sections 1. to 3., and Large Turbo in sections 4. to 5.
※ The Comfy Org version of the Medium model also bundles the Text Encoders, but the workflows in this article do not use them (the bundled t5xxl is the fp8 version).

■ 0. Overview

▼ 0-0. Introduction

This article tries out the "Medium" and "Large Turbo" models of "Stable Diffusion 3.5". For "Large", see the previous article.

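▼ 0-1. About Stable Diffusion 3.5

On the evening of October 22, 2024 (Japan time), "Stable Diffusion 3.5 Large" and "Stable Diffusion 3.5 Large Turbo" were released, and "Stable Diffusion 3.5 Medium" followed one week later.

For reference, ComfyUI_examples provides images with embedded workflows for Large and Medium (for Large Turbo, it only mentions how to change the settings).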

■ 1. Setup Procedure (Medium)

▼ 1-0. Notes

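The flow is the same as in the previous article, so skip any parts you do not need.

To keep the different model types from getting mixed up, everything except the Text Encoders goes into a subdirectory created under the default location.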

▼ 1-1. Downloading the Models (Comfy Org Version)

Go to the URL below.

stable-diffusion-3.5-fp8
https://huggingface.co/Comfy-Org/stable-diffusion-3.5-fp8

Download the model and put it in place.

sd3.5_medium_incl_clips_t5xxlfp8scaled.safetensors
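→ Move to ComfyUI\models\checkpoints\sd35m
(※ Note: this model includes clip_g, clip_l, and t5xxl_fp8, so it can be used in place of the external Text Encoders.)

If you have already downloaded the Text Encoders, you can reuse them. Regarding "t5xxl_fp8_e4m3fn_scaled", the description said "the scaled one is an experimental checkpoint by us", but the details are unknown.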

text_encoders\clip_g.safetensors
text_encoders\clip_l.safetensors
text_encoders\t5xxl_*.safetensors (any one of them; choose fp16 if you have enough main RAM)
→ Move to ComfyUI\models\clip
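
As a quick sanity check of the placement described above, the following is a minimal Python sketch that reports which of the expected files are not yet in place. The helper name `find_missing` and the install path `ComfyUI/models` are assumptions for illustration; only the relative locations come from this article.

```python
from pathlib import Path

# Relative locations under ComfyUI's "models" directory, as used in this article.
EXPECTED_MEDIUM_FILES = [
    "checkpoints/sd35m/sd3.5_medium_incl_clips_t5xxlfp8scaled.safetensors",
    "clip/clip_g.safetensors",
    "clip/clip_l.safetensors",
    "clip/t5xxl_*.safetensors",  # any one T5-XXL variant is enough
]


def find_missing(models_dir, patterns=EXPECTED_MEDIUM_FILES):
    """Return the patterns that match no file under models_dir."""
    root = Path(models_dir)
    return [pattern for pattern in patterns if not any(root.glob(pattern))]


if __name__ == "__main__":
    # Hypothetical install location; adjust it to your own setup.
    for pattern in find_missing(Path("ComfyUI") / "models"):
        print("missing:", pattern)
```

The wildcard entry is treated as satisfied as soon as one matching T5-XXL file exists, which mirrors the "any one of them" note above.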

Finally, the workflows.

sd3.5-t2i-fp16-workflow.json
sd3.5-t2i-fp8-scaled-workflow.json
→ Download either one (the only difference seems to be the default Text Encoder model?)

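▼ 1-2. Generating Images

Update ComfyUI and then start it. Drag and drop the workflow you just downloaded onto the UI.

You must set "ckpt_name" in the model settings and "t5xxl" in the Text Encoder settings to file names that actually exist; otherwise an error occurs.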
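
The file name check above can also be done before loading the workflow. The sketch below assumes the UI-format workflow JSON, where each entry in "nodes" has a "type" and a "widgets_values" list; the function names and the sample fragment are hypothetical and are not taken from the actual downloaded workflows.

```python
import json  # used when reading a real workflow file, e.g. json.load(open(path))


def referenced_model_files(workflow):
    """Collect (node type, file name) pairs for widget values that look like model files."""
    found = []
    for node in workflow.get("nodes", []):
        values = node.get("widgets_values")
        if not isinstance(values, list):
            continue  # nodes without a plain widget list are skipped in this sketch
        for value in values:
            if isinstance(value, str) and value.lower().endswith(".safetensors"):
                found.append((node.get("type", "?"), value))
    return found


def unresolved_files(workflow, available_names):
    """Return the referenced files that are not among the installed file names."""
    available = set(available_names)
    return [(t, name) for t, name in referenced_model_files(workflow) if name not in available]


# Illustrative fragment only; a real workflow contains many more fields.
sample_workflow = {
    "nodes": [
        {"type": "CheckpointLoaderSimple",
         "widgets_values": ["sd3.5_medium_incl_clips_t5xxlfp8scaled.safetensors"]},
        {"type": "TripleCLIPLoader",
         "widgets_values": ["clip_g.safetensors", "clip_l.safetensors", "t5xxl_fp16.safetensors"]},
    ]
}
installed = [
    "sd3.5_medium_incl_clips_t5xxlfp8scaled.safetensors",
    "clip_g.safetensors",
    "clip_l.safetensors",
    "t5xxl_fp8_e4m3fn_scaled.safetensors",
]
print(unresolved_files(sample_workflow, installed))
```

In this sample only the fp8 scaled T5-XXL is installed while the workflow asks for the fp16 one, so the sketch flags the "t5xxl" setting as the value to change.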

Finally, click "Queue Prompt" to start generation.

As mentioned above, this model includes the Text Encoders (t5xxl is fp8), so it also works if you connect the Checkpoint's CLIP output instead.

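■ 2. A Simple Workflow (Medium)

▼ 2-1. Overview

I created a simple workflow based on the Comfy Org and Stability AI workflows. It adds nodes that are not in the Comfy Org version, which I believe improves image quality. You are free to modify and republish it.

▼ 2-2. Generating Images

With the workflow above, you should be able to generate the same image as the cover image. If you use a different model, the output will also differ.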

Create a vibrant, anime-style illustration of a crisp autumn day in a maple-filled park. In the foreground, place a cheerful little girl with big, expressive eyes and auburn braids, wearing a cozy sweater dress and a red scarf. Have her holding a wooden sign that's slightly too big for her, decorated with falling maple leaves. On the sign, write in adorable, wobbly handwritten text: "SD 3.5 Medium" Include seasonal park elements in the background, such as maple trees with red and gold leaves, a stone path covered in fallen leaves, and other visitors taking photos of the autumn colors. The scene should be illuminated by warm afternoon sunlight filtering through the colorful canopy, creating a magical autumn atmosphere.

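▼ 2-3. Split-Model Version

I also prepared a version for Models that do not include the Text Encoders or VAE. Select the Model, Text Encoder, and VAE before using it. For downloading the VAE, see 4-4.

Model → Place in ComfyUI\models\unet\sd35m
VAE → Place in ComfyUI\models\vae\sd35

By swapping the Model loader, you can also use GGUF and NF4 models. For how to set this up in ComfyUI, refer to the article below.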

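■ 3. Extras (Medium)

▼ 3-1. Bonus Images (SD 3.5 M)

The first two images use the same prompts as the previous article, that is, the ones used with the Large model. The remaining two come from the FLUX.1 article. Like FLUX.1, SD 3.5 appears to support a wide variety of styles, so exploring them can be part of the fun.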

A young Japanese woman hiking in the mountains. She's a 'yama girl' (mountain girl), wearing fashionable yet functional hiking gear - colorful quick-dry pants, a lightweight jacket, and a cute mountaineering hat. She carries a stylish backpack and trekking poles. The background shows a breathtaking Japanese mountain landscape with lush greenery, perhaps a glimpse of Mt. Fuji in the distance. The hiker is smiling, enjoying the scenic view from a mountain trail. The scene captures the essence of the yama girl trend - a blend of outdoor activity and fashion.

A cheerful anime girl with long, flowing dark brown hair, crouching with knees pulled close. She's wearing a cute outfit: a pastel pink frilly blouse draping over her knees, and a white tulle mini skirt with lace trim. Her bow-accented ankle boots are visible as she balances on her toes. Viewed diagonally, her bright eyes and warm smile are prominent as she looks at the viewer with friendliness and curiosity. Behind her, an alpine meadow stretches out, dotted with wildflowers. Snow-capped Alps pierce the sky, with a quaint resort town in the valley. The late afternoon sun casts a warm glow, highlighting her silhouette. The image is in a soft, watercolor style, emphasizing the dreamy atmosphere of the scene.

A delicate watercolor painting on textured, rough paper. Large, blurred blooms of various summer flowers dominate the composition. The colors are muted and pastel, creating a soft, ethereal atmosphere. The edges of the flowers are slightly undefined, giving a dreamlike quality to the image.

A vibrant live stream screen featuring a 3D animated character in the style of a Three Kingdoms era general. They are positioned in the center of the frame against a backdrop of traditional Chinese architecture. A chat window is located on the right side of the screen, displaying colorful emotes and messages.

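■ 4. Setup Procedure (Large Turbo)

▼ 4-1. Overview

Stable Diffusion 3.5 Large Turbo is a Large model that can generate images in 4 steps. It may be able to share LoRAs with the regular Large model.

There is no Comfy Org version of this model, so the Model, Text Encoders, and VAE are prepared separately.

▼ 4-2. Downloading the Models (Model)

(※ ComfyUI seems to treat the official model as a "checkpoints" file. However, I am reluctant to put a half-complete file there, one that includes the VAE but not the Text Encoders. Fortunately the VAE is small enough to ignore, so please note that I treat the official model as a "unet" file. The Comfy Org versions of the Large and Medium models include both, so they have no such problem.)

The official model is at the URL below. You can download it once you have been granted access.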

Stable Diffusion 3.5 Large Turbo
https://huggingface.co/stabilityai/stable-diffusion-3.5-large-turbo
sd3.5_large_turbo.safetensors
→ Move to ComfyUI\models\unet\sd35l

A copy is available on Civitai. I have confirmed that this one (Hash AUTOV1: 294D4BDF) is identical to the original. The file name differs, so rename it if that bothers you.

https://civitai.com/models/878645/stable-diffusion-35-large-turbo?modelVersionId=983611
stableDiffusion35_largeTurbo.safetensors
→ Move to ComfyUI\models\unet\sd35l
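
If you have both the official file and the Civitai copy, you can confirm for yourself that they are identical. The sketch below compares full SHA-256 digests; this is an assumption about how one might verify it, not necessarily how I checked, and it does not reproduce Civitai's short AUTOV1 hash. The function names are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so multi-GB models need not fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True when both files have the same SHA-256 digest, regardless of file name."""
    return sha256_of(path_a) == sha256_of(path_b)
```

For example, `same_content("sd3.5_large_turbo.safetensors", "stableDiffusion35_largeTurbo.safetensors")` should return True if the copy is unmodified. The same check applies to the VAE files in 4-4.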

▼ 4-3. Downloading the Models (Text Encoder)

Skip this if you already have them in place. As in the previous article, I use the SD 3.5 Text Encoders uploaded by Comfy Org.

stable-diffusion-3.5-fp8
https://huggingface.co/Comfy-Org/stable-diffusion-3.5-fp8
text_encoders\clip_g.safetensors
text_encoders\clip_l.safetensors
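text_encoders\t5xxl_*.safetensors (any one of them; choose fp16 if you have enough main RAM)
→ Move to ComfyUI\models\clip

Regarding "t5xxl_fp8_e4m3fn_scaled", the description said "the scaled one is an experimental checkpoint by us", but the details are unknown.

▼ 4-4. Downloading the Models (VAE)

There is no fixed naming rule for the file, but the workflows in this article use "sd3.5-vae.safetensors".

The official model is at the URL below. You can download it once you have been granted access. Whether to rename the file is up to you.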

https://huggingface.co/stabilityai/stable-diffusion-3.5-large-turbo
vae\diffusion_pytorch_model.safetensors
→ Move to ComfyUI\models\vae\sd35

A copy is available on Civitai. I have confirmed that this one (Hash AUTOV1: 36EEBF78) is identical to the original. Whether to rename the file is up to you.

https://civitai.com/models/880208?modelVersionId=986506
sd35Fusion8StepsMergeFull_vae.safetensors
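→ Move to ComfyUI\models\vae\sd35

▼ 4-5. Workflow (Split-Model Version)

This workflow has its settings tuned for Stable Diffusion 3.5 Large Turbo. Select the Model, Text Encoder, and VAE before using it.

▼ 4-6. Generating Images

With the workflow above, you should be able to generate the same image as the cover image. If you use a different model, the output will also differ.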

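■ 5. Extras (Large Turbo)

▼ 5-1. Bonus Images (SD 3.5 L Turbo)

I used the same prompts as in 3-1. and set the sampler to euler for all of them.

■ 6. Impressions

▼ 6-1. About SD 3.5 Medium

In my environment, generation takes 36 seconds (1024x1024, 30 steps, with the model already loaded). That is reasonably fast compared with the 98 seconds Large takes, but malformed results are noticeable.

It would be good if SD 3.5 could mature over time the way SDXL did, but I am concerned that it is split into two mutually incompatible lines, Large and Medium. Depending on how they are adopted, one of them may fade out.

The Medium model is compact. As a test I built an fp8 all-in-one model, and it came to 8.41 GB (the Comfy Org version is 10.8 GB).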

▼ 6-2. About SD 3.5 Large Turbo

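In my environment, generation takes 8 seconds (1024x1024, 4 steps, with the model already loaded). Its defining trait is sheer speed, and it looks useful for purposes such as generating a large number of images to brainstorm ideas.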
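
To put the reported timings side by side, here is a small arithmetic sketch in Python. The figures are the ones quoted in this article for my environment; the per-step value is a naive division that lumps text encoding and VAE decoding into the step cost, so treat it as a rough indication only.

```python
# Timings reported in this article (1024x1024, model already loaded).
REPORTED = {
    "Medium": {"seconds": 36, "steps": 30},
    "Large Turbo": {"seconds": 8, "steps": 4},
}
LARGE_SECONDS = 98  # the step count for Large is not restated in this article


def seconds_per_step(seconds, steps):
    """Average wall-clock time per sampling step (includes fixed overhead)."""
    return seconds / steps


def images_per_hour(seconds):
    """How many images fit in one hour at the given time per image."""
    return 3600 / seconds


for name, t in REPORTED.items():
    print(name, round(seconds_per_step(**t), 2), "s/step,",
          round(images_per_hour(t["seconds"])), "images/hour")
print("Large Turbo vs Large speedup:", LARGE_SECONDS / REPORTED["Large Turbo"]["seconds"])
```

By this estimate Medium costs about 1.2 seconds per step and Large Turbo about 2.0, so the advantage of Turbo comes from needing only 4 steps rather than from cheaper steps. At 8 seconds per image that is roughly 450 images per hour, against 100 for Medium.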

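Like Medium, it often produces malformed results, and the texture quality is also underwhelming. Still, unless spending more time reliably yields a high hit rate, I think fast, low-cost generation is essential (FLUX 1.1 Pro also became faster). I hope a model eventually appears that combines further training on Large with the speed of Turbo, taking the best of both.

■ 7. Other

You can find my other articles via the menu.

My note account is linked to my main account @Mayu_Hiraizumi, but for anything regarding the articles, please contact my sub account @riddi0908.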