Latent Space Denoising & VAE Optimization
Unlike pixel-space diffusion, our implementation uses a highly optimized Variational Autoencoder (VAE) to compress 512×512 or 1024×1024 images into 64×64 or 128×128 latents. Running the 1000-step Gaussian denoising process within this compressed manifold reduced computational overhead by 88% while preserving high-frequency structural detail. A custom U-Net backbone with cross-attention layers conditioned on a CLIP (Contrastive Language–Image Pre-training) text encoder ensures precise semantic alignment between natural-language prompts and synthesized visual features.
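The mechanics above can be sketched in a few lines. This is a minimal, self-contained NumPy illustration, not the production implementation: the 8× spatial downscale (512→64, 1024→128) and 4 latent channels are assumptions typical of latent diffusion, the DDPM-style linear beta schedule is a stand-in for whatever noise schedule is actually used, and random noise substitutes for the U-Net's predicted noise.

```python
import numpy as np

rng = np.random.default_rng(0)

DOWNSCALE = 8        # assumed VAE spatial factor: 512x512 -> 64x64
LATENT_CHANNELS = 4  # assumed latent channel count, typical for latent diffusion

def latent_shape(h, w, c=LATENT_CHANNELS, f=DOWNSCALE):
    """Latent tensor shape for an h x w image after VAE encoding."""
    return (c, h // f, w // f)

# DDPM-style linear beta schedule over the 1000 steps mentioned above.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def ddpm_denoise_step(z_t, eps_pred, t):
    """One reverse-diffusion step in latent space (DDPM posterior mean
    plus noise for t > 0). eps_pred would come from the U-Net."""
    a_t, ab_t = alphas[t], alpha_bars[t]
    mean = (z_t - (betas[t] / np.sqrt(1.0 - ab_t)) * eps_pred) / np.sqrt(a_t)
    if t > 0:
        mean = mean + np.sqrt(betas[t]) * rng.standard_normal(z_t.shape)
    return mean

# Denoising operates on the small latent, not the full-resolution image.
z = rng.standard_normal(latent_shape(512, 512))
eps = rng.standard_normal(z.shape)  # stand-in for the U-Net noise prediction
z_prev = ddpm_denoise_step(z, eps, T - 1)
print(z_prev.shape)  # (4, 64, 64)
```

The point of the sketch is the cost argument: every one of the 1000 denoising steps touches a 64×64 (or 128×128) latent rather than the full pixel grid, which is where the reported compute savings come from.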