vibecon 2026 · workshop
prompt engineering for image & video generation
Derrick Schultz · Canyon NYC · June 17–18 2026
speaker



Ultra-realistic vertical photo (9:16). A stylish young man with a lean, toned physique (around 57 kg, 5’6 tall); his beard remains unchanged, and his head is clean-shaven, giving him a modern, confident bald look. He wears one small, silver hoop earring in each ear, adding a subtle yet refined touch of individuality and style. He sits on indoor stairs beside a matte concrete wall. A rectangular beam of golden sunlight from a window hits the wall, creating a crisp shadow silhouette inside the bright frame. He wears a black ribbed knit sweater, tapered grey chinos, and chunky white sneakers. Pose: seated, elbows on thighs, hands loosely clasped, chin slightly lifted, eyes looking toward the light, calm and confident expression. Lighting: hard warm sunlight from camera-right as key, soft ambient bounce fill, high contrast with long shadows, cinematic golden-hour mood. Camera & look: low-mid angle from a few steps below, 50–85mm f/2.2 lens, shallow depth of field, clean optics, realistic skin texture, fine film grain, subtle vignette. Style: minimalist background, no clutter, fashion editorial realism. Exclude: cartoon, CGI, AI-artifacts, over-smoothing, plastic skin, excessive sharpening, motion blur, warped anatomy, extra fingers, disfigured hands, double shadow, blown highlights, banding, watermark, logo, text, bad perspective, dirty wall, clutter.
schedule
01 how prompting works
02 explore
03 expand
04 more to explore
01
The model starts from random noise and refines it, step by step, into an image.
how prompting works
It all begins as noise — pure random static.
Generation is just carving an image out of it.
Your prompt conditions each step — it biases what the noise resolves into.
You don’t draw the image; you steer the denoising.
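A toy sketch of the steering idea, not a real diffusion model: the "image" is a short list of numbers, the "prompt" is a hypothetical target vector, and each step nudges the noise a fraction of the way toward it. The names `denoise`, `guidance`, and the prompt vectors are assumptions made up for illustration.

```python
import random

def denoise(target, steps=20, guidance=0.3, seed=0):
    """Toy stand-in for prompt-conditioned denoising (illustration only)."""
    rng = random.Random(seed)
    # Start from pure random static.
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for _ in range(steps):
        # Each step moves the sample a fraction of the way toward
        # what the prompt asks for; the prompt biases every step.
        x = [xi + guidance * (ti - xi) for xi, ti in zip(x, target)]
    return x

# Hypothetical "prompt embeddings": each prompt is just three numbers here.
prompts = {
    "sunset": [1.0, 0.5, 0.0],
    "ocean": [0.0, 0.4, 1.0],
}

# Same starting noise (same seed), different prompts, different results.
sunset = denoise(prompts["sunset"], seed=42)
ocean = denoise(prompts["ocean"], seed=42)
print([round(v, 2) for v in sunset])
print([round(v, 2) for v in ocean])
```

With `steps=0` both calls return the identical static; the prompt only matters because it is applied at every step.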
tags — keywords
natural language — sentences
structured data — JSON
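The three formats can carry the same information. A minimal sketch, assuming a made-up set of fields (`subject`, `lighting`, `lens`, `style`); which format a given model responds to best is something to test, not something this code decides.

```python
import json

# Hypothetical prompt fields, chosen only for illustration.
fields = {
    "subject": "man seated on indoor stairs",
    "lighting": "hard warm sunlight",
    "lens": "85mm f/2.2",
    "style": "fashion editorial",
}

def as_tags(f):
    # tags: comma-separated keywords
    return ", ".join(f.values())

def as_sentence(f):
    # natural language: one descriptive sentence
    return (f"A {f['style']} photo of a {f['subject']}, "
            f"lit by {f['lighting']}, shot on an {f['lens']} lens.")

def as_json(f):
    # structured data: the same fields serialized as JSON
    return json.dumps(f, indent=2)

print(as_tags(fields))
print(as_sentence(fields))
print(as_json(fields))
```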
02
titles.xyz
one model · explore together
Type nothing (on titles, that means a single comma). See what the model makes from nothing: its raw default.
A subject with a style — two or three words is enough to point it somewhere.
03
Pick a direction instead of writing it all out by hand.
[ your Replit tool ]
rough idea in · expanded prompt out
Lighting, lens, mood, composition — added for you, without writing it all out.
Structured, repeatable prompts — change one field, hold the rest constant.
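One way to sketch "change one field, hold the rest constant" is a frozen dataclass plus `dataclasses.replace`. The field names and default values below are assumptions for illustration; `expand` is a hypothetical stand-in for whatever the expansion tool does, not its actual behavior.

```python
from dataclasses import dataclass, replace, asdict

@dataclass(frozen=True)
class Prompt:
    subject: str
    lighting: str = "soft window light"   # assumed default
    lens: str = "50mm"                    # assumed default
    mood: str = "calm"                    # assumed default
    composition: str = "centered"         # assumed default

    def render(self) -> str:
        # Join the fields in declaration order into one prompt string.
        return ", ".join(asdict(self).values())

def expand(rough_idea: str) -> Prompt:
    # Hypothetical expansion: the rough idea becomes the subject,
    # and every other field is filled from the defaults above.
    return Prompt(subject=rough_idea)

base = expand("man on stairs")

# Vary only the lighting; subject, lens, mood, composition stay fixed.
variants = [
    replace(base, lighting=light)
    for light in ("hard golden-hour sun", "overcast daylight", "neon at night")
]

print(base.render())
for v in variants:
    print(v.render())
```

Because only one field changes per variant, any difference between the generated images can be attributed to that field.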
04
Now you direct motion and camera — not just the frame, but what happens over time.



Bake a style or subject into the model so you barely have to describe it.
bonus
inversion
Input: an image — reference, frame, or target.
Output: a prompt that approximates it.
A VLM reads the image and produces the text.
definition
Vision-Language Model — accepts image input, returns text.
(GPT-4o, Claude, Gemini, Qwen-VL…)
Query: “produce a prompt that would generate this image.”
pipeline
image → VLM (“describe as a prompt for this model”) → base model → generate → compare → revise the query.
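The pipeline above can be sketched as a loop over candidate queries. Everything model-related is passed in as a function, so no real API is assumed: `describe`, `generate`, and `similarity` are hypothetical placeholders, and the `fake_*` stand-ins treat an image as a plain text description purely so the sketch runs.

```python
from typing import Callable, List, Tuple

def invert(
    image: str,
    queries: List[str],
    describe: Callable[[str, str], str],      # VLM: (image, query) -> prompt
    generate: Callable[[str], str],           # base model: prompt -> image
    similarity: Callable[[str, str], float],  # compare: (target, candidate) -> score
) -> Tuple[str, str, float]:
    """Try each query, generate from the resulting prompt, keep the best match."""
    best = ("", "", float("-inf"))
    for query in queries:  # "revise the query" is modeled as trying the next one
        prompt = describe(image, query)
        candidate = generate(prompt)
        score = similarity(image, candidate)
        if score > best[2]:
            best = (query, prompt, score)
    return best

# Stand-ins so the sketch runs without any model (images are text here).
def fake_describe(image: str, query: str) -> str:
    return "grainy 35mm, teal shadows" if "style" in query else "a woman in a red coat"

def fake_generate(prompt: str) -> str:
    return prompt  # pretend the generated image is fully described by its prompt

def fake_similarity(target: str, candidate: str) -> float:
    a, b = set(target.split()), set(candidate.split())
    return len(a & b) / len(a | b)  # word overlap as a crude score

target = "grainy 35mm, teal shadows, blown highlights"
query, prompt, score = invert(
    target,
    ["describe this image", "describe only the rendering style"],
    fake_describe, fake_generate, fake_similarity,
)
print(query, "|", prompt)
```

In practice the compare step is often done by eye; a scoring function only matters if the loop is automated.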
failure mode
VLMs default to content — “a woman in a red coat.”
You often want style — “grainy 35mm, blown highlights, teal shadows.”
Specify which to extract.
constrain the VLM
“describe only lighting and color.”
“ignore the subject; capture rendering style.”
“output as sdxl tags.”
The VLM’s output is controllable — constrain it.
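A minimal sketch of assembling such a constrained query from parts. `build_query` and its parameters are hypothetical helpers, not part of any VLM API; the returned string is what would be sent alongside the image.

```python
def build_query(extract=(), ignore=(), output_format=None):
    """Compose a VLM instruction that says what to extract and how to format it."""
    parts = []
    if extract:
        # Name the aspects to capture instead of accepting the content default.
        parts.append("Describe only " + " and ".join(extract) + ".")
    else:
        parts.append("Produce a prompt that would generate this image.")
    if ignore:
        parts.append("Ignore " + " and ".join(ignore) + ".")
    if output_format:
        parts.append(f"Output as {output_format}.")
    return " ".join(parts)

print(build_query())
print(build_query(extract=("lighting", "color"),
                  ignore=("the subject",),
                  output_format="sdxl tags"))
```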
[ your Replit tool ]
image in · formatted prompt out · regenerate
applications
• reproduce a target look
• maintain one style across many images
• convert a film still into a generatable prompt
find me