Black Forest Labs
Member of Technical Staff - Multimodal VLM/LLM
Job Description
What if the future of generative AI isn't just better images or better text, but models that understand both—and use that understanding to create in ways neither modality could alone?
Our founding team pioneered Latent Diffusion and Stable Diffusion - breakthroughs that made generative AI accessible to millions. Today, our FLUX models power creative tools, design workflows, and products across industries worldwide.
Our FLUX models are best-in-class not only for their capability but also for their ease of use when building production applications. We top public benchmarks and compete at the frontier - and in most instances we're winning.
If you're relentlessly curious and driven by high agency, we want to talk.
With a team of ~50, we move fast and punch above our weight. From our labs in Freiburg - a university town in the Black Forest - and San Francisco, we're building what comes next.
But here's the frontier we're exploring: vision-language models that don't just caption images or generate from prompts, but truly understand the relationship between visual and linguistic information. Models that can enhance prompts intelligently, moderate content contextually, and unlock generative capabilities we haven't imagined yet. That's the research you'll lead.
What You'll Pioneer
You'll run cutting-edge projects in multimodal vision-language and large language models, integrating them into our media generation pipeline in ways that push beyond what either modality could achieve alone. This isn't about implementing existing VLMs—it's about developing novel approaches that make FLUX more powerful, more controllable, and more aligned with what creators actually need.
You'll be the person who:
- Leads the development and training of state-of-the-art multimodal vision-language models within the FLUX technology stack—not just applying existing architectures, but innovating on them
- Designs and implements specialized fine-tuning strategies for VLMs to address specific use cases and performance requirements that general-purpose models can't handle
- Develops and optimizes LLM implementations for prompt enhancement, content moderation, and novel applications that improve how people interact with generative models
- Drives innovation by integrating VLM/LLM capabilities into our media generation pipeline in creative ways that enhance generative capabilities
- Conducts research to creatively combine vision and language models—exploring questions about how these modalities can inform and improve each other
- Maintains cutting-edge knowledge of the latest developments in multimodal AI and LLM research, evaluating emerging models and architectures for potential integration
- Collaborates with cross-functional teams to implement and deploy models at scale, contributing to architectural decisions and technical roadmap planning
- Documents and shares research findings with the broader team, translating breakthroughs into practical improvements
Questions We're Wrestling With
- How can vision-language models improve prompt understanding in ways that make generation more controllable and aligned with user intent?
- What's the right architecture for integrating VLMs into diffusion model workflows without creating computational bottlenecks?
- How do you fine-tune vision-language models for specialized creative tasks that weren't in the training data?
- Where can LLMs enhance the generative pipeline—prompt rewriting, content moderation, parameter suggestion—and where would they add more friction than value?
- What novel capabilities emerge when you deeply integrate vision and language understanding into generative workflows?
- How do you evaluate whether multimodal models are actually improving generation quality versus just adding complexity?
These aren't solved problems—they're research directions we're actively exploring.
Who Thrives Here
You've trained and fine-tuned large-scale vision-language models and understand the nuances of multimodal learning. You have strong intuitions about what makes VLMs work well, backed by either publications or practical projects that pushed the field forward. You're comfortable operating at the intersection of research and production, where models need to be both innovative and deployable.
You likely have:
- Demonstrated expertise in training and fine-tuning large-scale vision-language models—not just using pre-trained ones, but developing them
- A strong publication record or hands-on experience with multimodal AI research projects that shows you can push the frontier
- Proficiency in PyTorch or similar deep learning frameworks with deep understanding of their capabilities and limitations
- Experience with distributed training systems and large-scale model optimization—because VLMs don't fit on one GPU
- Track record of implementing and scaling AI models in production environments where research meets real-world constraints
We'd be especially excited if you:
- Have experience with diffusion models and generative AI architectures alongside autoregressive modeling—understanding how different paradigms can complement each other
- Bring a background in computer vision that informs your approach to multimodal models
- Contribute to open-source AI projects and understand the community
- Have worked in fast-paced startup environments where iteration speed matters
- Bring strong software engineering practices and system design skills
- Have experience with open-source VLM inference frameworks like vLLM
What We're Building Toward
We're not just adding VLMs to our stack—we're exploring fundamental questions about how vision and language understanding can make generative models more powerful and more aligned with human intent. Every model you train teaches us something about multimodal learning. Every integration reveals new capabilities. Every research finding shapes where the field goes next. If that sounds more compelling than applying existing techniques, we should talk.
We're based in Europe and value depth over noise, collaboration over hero culture, and honest technical conversations over hype. Our models have been downloaded hundreds of millions of times, but we're still a ~50-person team learning what's possible at the edge of generative AI.