How scoring works
What the severity, commercial-intent, opportunity and idea scores mean, and how far to trust them
Every report comes with numbers attached, and we want to be straight about what they are. They're estimates, kept in check by guardrails, meant to point you in a direction. They are not promises, and they're not as precise as the decimals make them look. Here's how each one is built and how much weight to put on it.
Pain point scores
Every pain point we surface has three scores, and they describe the problem, not a product.
Severity asks one thing: does this actually block someone's work? We score functional impact, not how loudly someone complains. "I'm so frustrated with this" is volume. "We lost three clients to invoicing delays" is severity. The scores sit on fixed, behaviour-based bands, because giving raters concrete reference points is the part of this with real research behind it (Nielsen on usability severity ratings).
Commercial intent asks whether there's a buying signal: someone naming a paid tool they already use, mentioning a budget, or a problem that's eating their billable hours. Reading commercial intent from how people write is a well-studied, doable thing (Jansen, Booth & Spink, 2008). What it is not is a dollar figure. You can't get a real willingness-to-pay without a pricing study where people react to actual prices, and public discussion simply can't give you that. Even properly run surveys overstate what people will pay (Schmidt & Bijmolt, 2020). So a high score means "there's a buying signal here," not "they'll pay you $X." (We used to call this "willingness to pay." We renamed it because that claimed more than the text can support.)
Opportunity combines the two: high when both are strong, medium when one is, low when neither. We weight them equally on purpose. For inputs that point the right way, simple equal weighting is famously hard to beat (Dawes, 1979). The cutoff for what counts as "strong" is a reasonable rule of thumb, not a threshold we've tuned to the decimal. We don't have the outcome data yet (which ideas actually made money), so we'd rather tell you it's a heuristic than dress it up as something calibrated.
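As a rough sketch of that rule in Python: the cutoff value of 0.6 and the function name are placeholders for illustration, not the real threshold.

```python
# Hypothetical cutoff for what counts as a "strong" score.
# The real value is a rule of thumb and is not published here.
STRONG_CUTOFF = 0.6


def opportunity(severity: float, commercial_intent: float,
                cutoff: float = STRONG_CUTOFF) -> str:
    """Equal weighting: count how many of the two inputs are strong."""
    strong = sum(score >= cutoff for score in (severity, commercial_intent))
    return ("low", "medium", "high")[strong]


print(opportunity(0.8, 0.7))  # both strong
print(opportunity(0.8, 0.3))  # one strong
print(opportunity(0.2, 0.3))  # neither strong
```

Counting strong inputs rather than multiplying or averaging them keeps the two signals on an equal footing, which is the point of the equal weighting.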
The guardrails
We built every guardrail to do one job: pull a score down when it's been too generous. None of them can push a score up.
A pain with thin or missing evidence can't hold onto a high severity, however it was worded. If software can't realistically move the needle on a problem (something lifestyle, cultural, or structural), its commercial-intent score gets capped and it's kept out of idea generation, because there's no sense pricing a problem software can't touch. And generic emotional themes like burnout or stress, or anything that would read identically for any audience, get capped too. They don't tell you where the real opportunity is.
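A minimal sketch of the "downward only" property, with loud assumptions: every cap value, the evidence minimum, and the field names below are invented for illustration, and we've assumed the generic-theme cap applies to both scores. The one thing taken from the text is that each guardrail is a `min`, so it can lower a score but never raise it.

```python
from dataclasses import dataclass, replace

# Illustrative values only; the real caps are not stated in this document.
THIN_EVIDENCE_SEVERITY_CAP = 0.5
NON_SOFTWARE_INTENT_CAP = 0.3
GENERIC_THEME_CAP = 0.4
MIN_EVIDENCE = 3


@dataclass(frozen=True)
class PainPoint:
    severity: float
    commercial_intent: float
    evidence_count: int
    software_addressable: bool = True
    generic_theme: bool = False
    eligible_for_ideas: bool = True


def apply_guardrails(p: PainPoint) -> PainPoint:
    """Apply each guardrail as a ceiling; min() can never raise a score."""
    severity, intent = p.severity, p.commercial_intent
    eligible = p.eligible_for_ideas
    if p.evidence_count < MIN_EVIDENCE:
        # Thin evidence cannot hold a high severity.
        severity = min(severity, THIN_EVIDENCE_SEVERITY_CAP)
    if not p.software_addressable:
        # Software can't fix it: cap intent and keep it out of idea generation.
        intent = min(intent, NON_SOFTWARE_INTENT_CAP)
        eligible = False
    if p.generic_theme:
        # Assumption: the generic-theme cap applies to both scores.
        severity = min(severity, GENERIC_THEME_CAP)
        intent = min(intent, GENERIC_THEME_CAP)
    return replace(p, severity=severity, commercial_intent=intent,
                   eligible_for_ideas=eligible)
```

A pain point that already scores below every cap passes through unchanged, which is what "none of them can push a score up" looks like in practice.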
The idea scores
Each generated idea gets its own set of scores, shown as percentages, with an overall composite (coloured green, amber, or red) and a conservative go / no-go signal. The five you'll read are: market fit (does this solve a validated problem for a reachable market?), feasibility (can it be built with today's tools, and can a solo founder actually get the data it needs, reliably and in bulk?), solo-dev feasibility (could one person ship it and keep it running, operating cost included?), SEO (can it grow organic traffic at scale? this one's a preliminary estimate, firmed up later with real keyword data), and originality (how non-obvious it is, where higher means fewer builders would land on the same thing).
The numbers don't come out of one pass. A creative pass floats the concepts first, and each one has to name the concrete data it would run on, or admit it doesn't need any. Anything that only gestures at its data gets caught right here. Then an independent reviewer pressure-tests each concept: it has to name the real route by which the data can be obtained in bulk, and if it can't, we treat the data as unverified and mark it down. That step kills most of the easy optimism. The survivors get written up into full ideas.
Here's the part worth knowing about the numbers you actually see: they aren't the idea's own self-grade. A model scoring its own work marks it generously (Zheng et al., 2023), so once an idea is written up, a separate model re-grades the main scores from scratch (market fit, feasibility, SEO, how original it is, and how realistic it is for one person to run) against the same bands, with its reasoning written out before each number. For the solo-dev score it weighs the ongoing burden first — support, uptime, moderation, the marketing slog — because that, not the initial build, is what actually buries solo founders. It leans conservative when the evidence is thin. So a 70% for market fit is a second opinion that already talked the first number down, not the idea's first impression of itself.
Sitting on top of that re-grade is a handful of hard caps, and these can only ever cut a score, never lift it. A named data source is a claim, not a fact: if the data is reachable only one record at a time, sits behind a login, or has no real bulk route, its data-feasibility score gets capped. Build feasibility can't run far ahead of data feasibility, since you can't build on data you can't get. Solo-dev feasibility can't run ahead of build feasibility either: if a thing is hard to build at all, it can't be easy for one person to build and run. Running cost is weighed like build cost, so anything that needs constant moderation or hand-seeding takes a hit. An idea whose whole mechanism is publishing claims about named people or businesses gets marked down for the legal exposure, though we still show it to you with the concern flagged. And the SEO score gets pulled back whenever the "thousands of pages" story falls apart, since login-gated pages can't be indexed and a pile of hand-written blog posts isn't programmatic SEO.
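The feasibility chain in that paragraph can be sketched as a sequence of ceilings. This is an illustration under assumptions, not the production logic: the cap of 40, the margin of 10 (our reading of "far ahead"), and all names are hypothetical, and scores are whole percentages.

```python
def apply_hard_caps(data: int, build: int, solo: int, *,
                    bulk_route_verified: bool,
                    unverified_data_cap: int = 40,   # hypothetical value
                    build_margin: int = 10) -> dict[str, int]:  # hypothetical value
    """Chain the caps so each score is limited by the one it depends on."""
    if not bulk_route_verified:
        # A named source with no real bulk route is only a claim.
        data = min(data, unverified_data_cap)
    # Build feasibility may not run far ahead of data feasibility.
    build = min(build, data + build_margin)
    # Solo-dev feasibility may not run ahead of build feasibility at all.
    solo = min(solo, build)
    return {"data": data, "build": build, "solo": solo}


print(apply_hard_caps(90, 90, 95, bulk_route_verified=False))
# {'data': 40, 'build': 50, 'solo': 50}
```

Because the caps are applied in dependency order, one unverified data source pulls down everything built on top of it.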
Ideas are ranked on the composite of market fit, feasibility, novelty (the originality score above), and SEO. The piece that matters most for trust: that build-feasibility cap feeds the ranking too, so "can you actually build this?" drags fragile ideas down the list. What floats to the top is the part you can lean on, not the part with the best pitch.
Novelty means a different thing per idea
We don't grade every idea on the same axis. Each one is matched to the way it actually wins: by being found (SEO and reach), by being genuinely different (a mechanism rivals can't easily copy), or by owning a workflow for one kind of user. Novelty then means something different depending on which of those an idea is going for. For an idea whose whole case is a clever mechanism, a low novelty score is a real problem. For a directory or catalog, the edge is the data it holds and how it's presented, not a trick, so a low novelty score is normal there and we don't treat it as a flaw. The ranking works the same way: each idea is weighed by what its own angle rewards, so a strong catalog isn't pushed down the list for scoring low on something it was never trying to do. The novelty score carries a one-line note explaining why it reads the way it does for that type of idea.
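To make the angle-aware ranking concrete, here is a small sketch. The three angle names and every weight are made up for illustration; only the idea that each angle weighs the same four scores differently comes from the text.

```python
# Hypothetical weights per winning angle. Each row sums to 1.0.
ANGLE_WEIGHTS = {
    "discovery":       {"market_fit": 0.30, "feasibility": 0.30, "novelty": 0.10, "seo": 0.30},
    "differentiation": {"market_fit": 0.25, "feasibility": 0.25, "novelty": 0.40, "seo": 0.10},
    "workflow":        {"market_fit": 0.40, "feasibility": 0.35, "novelty": 0.15, "seo": 0.10},
}


def composite(scores: dict[str, float], angle: str) -> float:
    """Weighted sum of the four scores, using the weights for this idea's angle."""
    weights = ANGLE_WEIGHTS[angle]
    return sum(weights[name] * scores[name] for name in weights)


def rank(ideas: list[dict]) -> list[dict]:
    """Sort ideas by their own angle's composite, best first."""
    return sorted(ideas, key=lambda i: composite(i["scores"], i["angle"]), reverse=True)


catalog = {"market_fit": 80, "feasibility": 80, "novelty": 20, "seo": 80}
ideas = [
    {"name": "clever-mechanism", "angle": "differentiation", "scores": catalog},
    {"name": "niche-directory", "angle": "discovery", "scores": catalog},
]
print([i["name"] for i in rank(ideas)])
```

With identical raw scores, the directory ranks above the mechanism idea, because a low novelty score costs little under the discovery weights and a lot under the differentiation weights.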
How the prompts are built and tested
None of this runs on casually worded instructions. The prompt behind each stage is written deliberately, and a change to one is tested against real saved runs before it ships: the old wording and the new one are compared on the same niches, and the change is kept only if the output genuinely improves. The care is there because language models have predictable weak spots: they grade their own work generously (the reason for the separate re-grade above), and they latch onto any specific number put in front of them, drifting toward it instead of reasoning from the evidence (Lou & Sun, 2024). So the instructions are built not to hand the model an answer to copy: a market size or a price is worked out from your niche's own data, never seeded with an example figure to anchor on. The aim throughout is that a number you read was reasoned to, not parroted.
And where a step leans on the model to judge rather than generate — deciding which discussions or search terms are genuinely about your idea versus its broad category — we don't take one model's word for it. Before trusting a judge, we check its calls against several independent models and against what actually ranks in search, and we keep only the parts they agree on. A single model turns out to be unreliable on exactly the borderline calls that matter most, so the agreement of independent checks, not one model's confidence, is what we build on.
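One simple way to picture "keep only the parts they agree on" is a set intersection across checks. This is a sketch under assumptions: the judge names, the search-ranking source, and the strict all-must-agree rule are illustrative, and the real agreement rule may be looser.

```python
def agreed_relevant(verdicts: dict[str, set[str]]) -> set[str]:
    """Keep only the items that every independent check marked as relevant."""
    marked = list(verdicts.values())
    if not marked:
        return set()
    return set.intersection(*marked)


# Hypothetical verdicts: which search terms each check considers on-topic.
verdicts = {
    "judge_a": {"invoice reminders", "late payment tool", "accounting software"},
    "judge_b": {"invoice reminders", "late payment tool"},
    "search_ranking": {"invoice reminders", "late payment tool", "freelance tips"},
}
print(sorted(agreed_relevant(verdicts)))
# ['invoice reminders', 'late payment tool']
```

The broad-category term that only one judge accepted drops out, which is the borderline case a single model tends to get wrong.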
How to read them
Read these as bands, not decimals. A 0.63 and a 0.61 are the same thing. They reflect the discussion we found for your niche on the day we ran it, so more and better source material means better-calibrated scores. They're guides drawn from self-selected public conversation, not instruments measured against ground truth.
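If it helps to see it spelled out, reading a score as a band might look like the sketch below. The band edges (0.4 and 0.7) and the labels are assumptions chosen for the example, not the product's real bands.

```python
def band(score: float, edges: tuple[float, float] = (0.4, 0.7)) -> str:
    """Collapse a decimal score into a coarse band (edges are hypothetical)."""
    low_edge, high_edge = edges
    if score >= high_edge:
        return "high"
    if score >= low_edge:
        return "medium"
    return "low"


print(band(0.63), band(0.61))  # medium medium
```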
The point was never a perfect number. We'd rather hand you a cautious score with the reasoning attached than a confident one that falls apart the moment you push on it.
Sources
- Jakob Nielsen, Severity Ratings for Usability Problems (Nielsen Norman Group)
- Jansen, Booth & Spink (2008), Determining the informational, navigational, and transactional intent of web queries, Information Processing & Management
- Schmidt & Bijmolt (2020), Accurately measuring willingness to pay: a meta-analysis of the hypothetical bias, Journal of the Academy of Marketing Science
- Dawes (1979), The robust beauty of improper linear models in decision making, American Psychologist
- Zheng et al. (2023), Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
- Lou & Sun (2024), Anchoring Bias in Large Language Models: An Experimental Study