Setting up your very first prompts
When I first tried to test Midjourney prompt variations for A/B visual testing, I thought it was as simple as plugging two slightly different sentences into a box and clicking go. The output felt random, though, and sometimes not even comparable. For someone like me with about twenty browser tabs open and three different Notion docs half edited at once, it was messy. A single-word tweak to an otherwise identical prompt produced way more than a small difference; sometimes the whole style shifted. That was my first sign that testing with Midjourney is more like working with a fussy design intern than a math equation.
A good trick I learned early was to keep the shared core of the prompt completely consistent. I literally copy-paste the base phrase into two text lines and only swap a single adjective. The table below shows how I tested this, isolating one variation word at a time:
| Prompt A | Prompt B |
| --- | --- |
| a rainy street watercolor | a rainy street oil painting |
| desert mountain at sunrise | desert mountain at sunset |
By holding everything else the same, you can actually see the AI shift gears only on the swapped-out part. It sounds obvious, but if you start tinkering with two or three words at once you can’t tell what caused what. That was my first workflow checkpoint 🙂
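If you run a lot of these, it helps to generate the pairs from one base template so the shared core can never drift. Here is a minimal Python sketch; the template, slot name, and prompt_pair helper are my own illustration, not anything Midjourney-specific:

```python
# Build prompt pairs that differ by exactly one term.
BASE = "a rainy street, {style}, moody lighting"

def prompt_pair(base: str, slot: str, a: str, b: str) -> tuple[str, str]:
    """Return two prompts identical except for one swapped term."""
    return base.format(**{slot: a}), base.format(**{slot: b})

pair = prompt_pair(BASE, "style", "watercolor", "oil painting")
print(pair[0])  # a rainy street, watercolor, moody lighting
print(pair[1])  # a rainy street, oil painting, moody lighting
```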
Watching unpredictable image drift
I ran into a funny pattern where I would run the same prompt, literally copy-pasted from my history, and Midjourney would output a slightly new style even though nothing had changed. Think of it like ordering the same fast food combo on different days: sometimes the fries are warm, sometimes soggy. It can drive you crazy during A/B testing, since you might think your word swap caused the shift when really the AI just gave you a fresh seed.
If you actually want strict one-to-one testing, the secret is to lock in the seed number. In Midjourney you can add the --seed parameter so each run starts from the same base randomness. Once I locked that in, the comparisons actually made sense. Without the seed lock, my A was occasionally drifting into B anyway. Sort of like a control group that refuses to behave. ¯\_(ツ)_/¯
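For example, both prompts in a pair get the same fixed seed (the number itself is arbitrary; any value works as long as it matches):

```
a rainy street watercolor --seed 1234
a rainy street oil painting --seed 1234
```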
Another trick: write the two prompts side by side and generate them in the same batch, rather than running them hours apart. When I left huge time gaps, I started second-guessing whether Midjourney had silently updated its model in the background and was messing with my tests.
Keeping track of multiple test pairs
After about three experiments I realized I was drowning in clutter. Image grids in the Discord feed, random copied prompts in Google Docs, some screenshots on my desktop: it was a mess. I ended up building a simple log table in Airtable where I pasted my pair of prompts and the image grid link, then added a single-sentence reaction like “B looks closer to a paperback cover illustration.” The act of writing one line per pair actually forced me to pay attention, because otherwise I would just scroll through images mindlessly.
If you do not want to pay for Airtable, even a basic spreadsheet works: one column for Prompt A, one for Prompt B, a link to the exported images, and a notes column. The important part is having something structured enough to stop you from running in loops, because Midjourney makes it too easy to keep rerolling forever.
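If you would rather script the log than click around a spreadsheet, a few lines of Python appending to a CSV do the same job. A minimal sketch; the filename, columns, and log_pair helper are my own choices:

```python
import csv
from datetime import date
from pathlib import Path

LOG = Path("prompt_tests.csv")

def log_pair(prompt_a: str, prompt_b: str, grid_link: str, notes: str) -> None:
    """Append one A/B test row, writing a header on first use."""
    is_new = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["date", "prompt_a", "prompt_b", "grid_link", "notes"])
        writer.writerow([date.today().isoformat(), prompt_a, prompt_b, grid_link, notes])

log_pair(
    "a rainy street watercolor --seed 1234",
    "a rainy street oil painting --seed 1234",
    "https://example.com/grid",  # placeholder link
    "B looks closer to a paperback cover illustration",
)
```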
I even color-coded with conditional formatting: green where A won visually, red where B won, yellow for a wash. When you see a whole column of green down one specific adjective change, you start building real insight instead of just vibes.
Common mistakes that trick your results
One of the most common traps I hit was overcomplicating the variation. I would change too many parts at once: lighting, camera angle, and medium. Later I would look back and couldn't tell which change made it better. If you are doing visual A/B testing, you have to pretend you are a science student testing only one variable at a time. Otherwise you are staring at cool images but not learning anything.
Another mistake is ignoring aspect ratio. The same prompt forced to square (--ar 1:1) versus widescreen (--ar 16:9) will produce completely different compositions. Half of my so-called “wins” were just one ratio framing the subject better.
Also beware of shiny new update days. Once when Midjourney announced a big upgrade, running the exact same test gave different stylistic outputs across the board. I thought I had discovered some magic word choice, but actually the whole system had updated overnight.
How to collect reactions from real people
Sometimes I thought “B is absolutely better.” Then I showed both images to a friend and they instantly picked A. That humbled me fast. A/B testing is about choosing against a target audience, not just personal taste. For visual testing, I set up a quick Google Form where I pasted two images side by side and asked friends to click whichever they liked better. A simple forced choice. It gave me a tally to weigh against my own bias.
Having around ten people vote taught me that words like “minimalist” attracted positive reactions from some groups but not others. People into photography tended to prefer terms like “cinematic lens.” Gathering feedback keeps you from sinking time into prompt changes that only you find appealing.
If you are running brand work for a client, you can even embed this kind of side-by-side in something like Typeform. Just be prepared for brutally honest comments about colors or subject accuracy.
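To turn the clicks into numbers, export the form responses as a CSV and run a quick tally. A sketch under the assumption that each pair gets one column holding the voter's pick, “A” or “B”; the filename is made up:

```python
import csv
from collections import Counter

with open("responses.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Tally the A/B picks in every column of the export.
for column in rows[0]:
    votes = Counter(row[column] for row in rows if row[column] in ("A", "B"))
    if votes:
        print(f"{column}: A={votes['A']}  B={votes['B']}")
```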
Testing with consistent scales of detail
Midjourney has parameters, like --quality and --stylize, that push detail and polish higher or lower. I ran the same prompts at very low detail versus very high detail, and the difference was like sketch thumbnails compared to hyper-polished covers. For testing, you have to decide which level you are comparing at. If Prompt A is tested on high detail and Prompt B on default, it is not a fair fight.
A clean way is to set your detail preference once up front and then leave it alone while you swap out descriptive adjectives. That way results live on the same tier. In early tests I forgot to standardize this, and my notes literally had lines like “A feels muted but could just be the detail setting.” Classic rookie mistake.
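In practice this just means every prompt in a series carries the same parameter tail. The values below are arbitrary examples; pick your own once and freeze them for the whole test:

```
a rainy street watercolor --seed 1234 --ar 3:2 --q 1 --s 100
a rainy street oil painting --seed 1234 --ar 3:2 --q 1 --s 100
```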
When automation helps and when it breaks
Since I’m obsessed with testing, I tried to automate the runs. My fragile Zapier flow attempted to send paired prompts to Midjourney through Discord bots and dump the results into Google Drive. It technically worked, but Discord rate limits made half the attempts throw errors. A single Zap run would sometimes post nothing, then fire twice the next hour. Not reliable at all. Eventually I accepted that some parts are faster done manually.
Where automation did help was on the logging side. Automatically saving every image grid into a pre-named Google Drive folder prevented tons of lost files. Even if the bot hiccuped on generation, at least the storage part ran smoothly. So I divided the steps: generate images manually, archive automatically.
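The archiving half does not need Zapier at all. A short script that drops every grid into a dated folder inside a Drive-synced directory does the same thing; this is a sketch that assumes you collect the grid URLs yourself, and the paths and helper name are mine:

```python
from datetime import date
from pathlib import Path

import requests

# Point this at a folder your Google Drive client syncs automatically.
ARCHIVE = Path.home() / "Drive" / "midjourney-tests" / date.today().isoformat()
ARCHIVE.mkdir(parents=True, exist_ok=True)

def archive_grid(url: str, name: str) -> None:
    """Download one image grid into today's archive folder."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    (ARCHIVE / f"{name}.png").write_bytes(resp.content)

archive_grid("https://example.com/grid-a.png", "rainy-street-watercolor")  # placeholder URL
```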
That split focus makes the system bearable without chasing bugs forever.
Comparing results across different sessions
One last habit I had to force on myself: test A and B right next to each other in the same session. If you run one image today and the other tomorrow, your brain will trick you into remembering it brighter or cooler than it was. Human memory ruins comparisons. The best fix is literally having them side by side. I often paste both outputs into a Keynote slide just so I can flip through the deck quickly.
Once in a while I line up six versions in a row like a police lineup, and the odd one out jumps right at you. It is way faster than opening files one by one. 😛
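If you would rather script the lineup than drag files into Keynote, Pillow can paste the exports side by side. A minimal sketch; the folder name and lineup helper are mine:

```python
from pathlib import Path

from PIL import Image

def lineup(paths, out_path="lineup.png", height=512):
    """Paste images side by side at a common height for quick comparison."""
    imgs = []
    for p in paths:
        im = Image.open(p)
        width = int(im.width * height / im.height)  # keep aspect ratio
        imgs.append(im.resize((width, height)))
    sheet = Image.new("RGB", (sum(im.width for im in imgs), height), "white")
    x = 0
    for im in imgs:
        sheet.paste(im, (x, 0))
        x += im.width
    sheet.save(out_path)

lineup(sorted(Path("exports").glob("*.png")))  # "exports" is a made-up folder
```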
That habit saved me from running endless tests when actually the first three made the pattern clear.
Small words that cause big changes
After running dozens of prompt pairs, I discovered that some adjectives carry a massive visual swing. Words like “retro” or “cinematic” transform outputs far more drastically than softer terms like “serene” or “warm.” As you test, you start memorizing which adjectives act like levers and which act like seasoning. It is a fun mindset shift: not all words carry equal weight in Midjourney.
I keep a running shortlist taped to my second monitor now: the heavy-lifter adjectives versus the subtle nudgers. That personal toolkit makes A/B testing faster because I don't waste time swapping in words that Midjourney tends to ignore. The feeling of hitting one of those reliable levers is still satisfying every time.