In 1954, two years before the Dartmouth workshop that launched artificial intelligence as a field, a psychologist named John C. Flanagan published a paper that would quietly reshape hiring for the next seventy years. He called it the critical incident technique. The idea was simple: instead of asking candidates abstract questions about their traits and tendencies, ask them to describe specific situations where their behavior made a real difference. What happened. What they did. What the outcome was. The specificity was the point. Flanagan had noticed, through years of studying military pilot performance, that the gap between good performers and poor ones showed up not in their self-assessments but in the specific moments they could recall and describe with clarity.

That insight became the foundation of behavioral interviewing. It spread through HR departments worldwide through the 1980s and 1990s, producing frameworks like STAR, SOAR, and CAR. Consulting firms built entire practices around it. Thousands of books were written about it. And it worked, up to a point. Structured behavioral interviews did produce better hiring signal than unstructured conversations. The research was clear on that.

The problem was execution. A framework is only as good as the person running it. And most interviewers, however well-intentioned, do not follow the framework consistently. They accept vague answers. They skip follow-up probes because the conversation is flowing well and it feels rude to interrupt. They let candidates pivot from a weak answer to a stronger adjacent story without noting the pivot. By the time the interview ends, the structured framework has collapsed back into an unstructured conversation with a thin veneer of methodology on top.

What AI interview platforms do is take Flanagan's original insight, the one about specificity and behavioral evidence, and execute it without the human failure modes. The question generation is role-aware and adaptive. The scoring is rubric-based and consistent. The follow-up probing does not stop because the conversation feels uncomfortable. This is not a replacement for good interview design. It is a mechanism that finally makes good interview design run the way it was supposed to.
Summary of key concepts
| Concept | What it means | Why it matters |
| --- | --- | --- |
| Question generation | How the AI builds a question set from the job description, seniority level, and competency framework | Generic questions produce generic answers; role-specific questions produce real signal |
| Adaptive follow-up | The AI generates the next question based on what the candidate actually said in their last answer | Separates candidates who know the answer from those who know how to describe knowing it |
| Rubric-based scoring | Each competency has a defined scale with behavioral anchors describing what each score looks like | Makes scoring consistent across candidates, interviewers, and time |
| Evidence extraction | The AI pulls specific quotes from the transcript to justify each score | Every score is auditable and reversible by a human reviewer |
| Seniority calibration | The depth and complexity of questions adjusts based on the candidate's level | A principal engineer and a junior engineer should not be asked the same question at the same depth |
| Vagueness detection | The AI identifies when an answer is a principle rather than a specific experience and probes accordingly | Catches rehearsed answers and forces candidates to produce real evidence or reveal they cannot |
How questions are generated before the interview starts
The quality of an AI interview is determined almost entirely before the conversation begins. The question generation process is where most platforms either earn their value or waste it.

A well-built system starts with three inputs. The job description, which tells it what the role actually does and what domain knowledge it requires. The seniority level, which tells it how deep to probe and what complexity of answer to expect. And the competency framework, which tells it which dimensions of skill and judgment actually matter for this role at this company.

From those inputs, the system builds a question set that is role-specific, not generic. A question like "tell me about a time you solved a difficult problem" is useless because every candidate has a rehearsed answer for it and the answers are not comparable across roles. A question like "walk me through a time you had to redesign a system component under time pressure, what you changed and why you made those trade-offs" is specific enough that the answer either demonstrates real experience or it does not. The difference between those two questions is the difference between data and noise.

Good platforms also generate a question tree, not a question list. The initial question is the entry point. Branching follow-up questions are pre-built for the most common answer patterns, and the system selects among them based on what the candidate actually says. This is different from a fixed script. The conversation can go in multiple directions depending on the candidate's answers, but it always goes in a direction that produces more evidence, not less.
question_quality_score = role_specificity + competency_alignment + follow_up_depth
role_specificity: does the question require domain knowledge for this role?
competency_alignment: does it directly test a competency that predicts performance?
follow_up_depth: are there at least 3 branching follow-ups pre-built for each answer type?
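The three criteria above can be turned into a quick check when reviewing a question bank. The following is a minimal sketch, assuming each criterion is scored 0 or 1; that scale, the `QuestionCheck` class, and its field names are illustrative assumptions, not something any particular platform defines.

```python
from dataclasses import dataclass

MIN_FOLLOW_UPS = 3  # "at least 3 branching follow-ups" per answer type


@dataclass
class QuestionCheck:
    """A reviewer's answers to the three checks for one question (hypothetical fields)."""
    requires_domain_knowledge: bool             # role_specificity
    tests_predictive_competency: bool           # competency_alignment
    follow_ups_per_answer_type: dict[str, int]  # answer type -> pre-built follow-ups


def question_quality_score(check: QuestionCheck) -> int:
    """Sum the three criteria, scoring each as 0 or 1 (the 0/1 scale is an assumption)."""
    role_specificity = int(check.requires_domain_knowledge)
    competency_alignment = int(check.tests_predictive_competency)
    counts = list(check.follow_ups_per_answer_type.values())
    # Depth only counts if every answer type has enough follow-ups behind it.
    follow_up_depth = int(len(counts) > 0 and all(n >= MIN_FOLLOW_UPS for n in counts))
    return role_specificity + competency_alignment + follow_up_depth


generic = QuestionCheck(False, True, {"vague": 1})
specific = QuestionCheck(True, True, {"vague": 3, "specific": 3, "strong": 3})
print(question_quality_score(generic), question_quality_score(specific))  # prints: 1 3
```

A question scoring below 3 under this scheme is a candidate for rewriting or removal before it reaches an interview.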
How adaptive follow-up questioning works in real time
This is the part that separates a real AI interviewer from a sophisticated form. And it is the part that most people underestimate until they see it in action.

When a candidate answers a question, the AI is not just transcribing what they said and moving to the next item on the list. It is processing the response for three things simultaneously.

First, did they answer the actual question or did they answer an easier adjacent question? Candidates frequently do this without realizing it. You ask about a failure and they tell you about a challenge that turned out fine. The AI catches the pivot and redirects.

Second, is the answer specific or generic? Generic answers, which are answers stated as principles rather than experiences, get a probe that requires a specific example.

Third, are there claims in the answer that are unsubstantiated? If someone says they improved team velocity by 40%, the AI asks how they measured it, what it was before, and what specifically changed. If they cannot answer, that inability is now on the record.

Here is what that looks like in a real exchange. A candidate is asked about managing a project that was going off track.

Candidate: "I believe in transparent communication with stakeholders and making sure the team has clear priorities. When things go off track I always make sure to surface issues early and work collaboratively to find solutions."

That answer is entirely principle-based. No specifics, no evidence, nothing checkable. A human interviewer often accepts this and moves on because it sounds good and they do not want to seem aggressive. The AI does not accept it. It responds:

"Can you walk me through a specific project where that happened? What was the project, what went wrong, and what did you actually do in the first 48 hours when you realized it was in trouble?"

Now the candidate has to produce a real story or reveal they do not have one.
Every time an interviewer accepts a principle-based answer without requiring a specific example, they have collected zero evidence about that candidate. They have only confirmed that the candidate knows what a good answer sounds like.
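The three checks described in this section map naturally onto a lookup in a question tree. Here is a minimal sketch, assuming the judgments "did they answer the question" and "is it specific" arrive as booleans from some upstream analysis (in practice a language model); the `QuestionNode` class, the pattern keys, and the percentage regex are all hypothetical stand-ins.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class QuestionNode:
    """One node in a question tree: a prompt plus follow-ups keyed by answer pattern."""
    prompt: str
    follow_ups: dict[str, str] = field(default_factory=dict)


# Crude stand-in for claim detection: a percentage is treated as a claim
# that needs substantiating. A real system would detect far more than this.
PERCENT_CLAIM = re.compile(r"\b\d+(?:\.\d+)?\s?%")


def next_probe(node: QuestionNode, answer: str,
               answered_question: bool, is_specific: bool) -> Optional[str]:
    """Apply the three checks in order and return the matching follow-up, if any."""
    if not answered_question:
        return node.follow_ups.get("pivot")
    if not is_specific:
        return node.follow_ups.get("vague")
    if PERCENT_CLAIM.search(answer):
        return node.follow_ups.get("unsubstantiated")
    return node.follow_ups.get("strong")


node = QuestionNode(
    prompt="Tell me about a project that was going off track.",
    follow_ups={
        "pivot": "That one turned out fine. Can you tell me about one that did not?",
        "vague": "Can you walk me through a specific project where that happened?",
        "unsubstantiated": "How did you measure that, and what was it before?",
        "strong": "What would you do differently next time?",
    },
)
print(next_probe(node, "I improved team velocity by 40%.", True, True))
```

The ordering is a design choice: a pivot is handled before vagueness, and vagueness before unsupported claims, so the candidate is always asked for the most basic missing piece of evidence first.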
How rubric-based scoring works
Scoring in most human-led interviews is impressionistic. The interviewer finishes the conversation, reflects on how it felt overall, and assigns a number that reflects their general sense of the candidate. The number is a summary of an impression, not an aggregation of evidence. Two interviewers watching the same candidate will often score them differently because they weighted different moments and interpreted the same answers differently.

Rubric-based scoring works differently. Each competency has a defined scale, typically one to five, with behavioral anchors describing exactly what a score of one looks like, what a score of three looks like, and what a score of five looks like. A score of five on "communication clarity" is not "communicated well." It is "explained a complex concept using a concrete analogy, adjusted their explanation when the follow-up question suggested confusion, and confirmed understanding without being asked." Every point on the scale is behavioral and specific.

The AI maps what happened in the transcript to those anchors. A score of three means the evidence in the transcript matches the behavioral description for three. The score is not the AI's opinion of the candidate. It is the AI's assessment of which behavioral anchor the transcript evidence most closely matches. The human reviewer can read the same transcript and the same anchor descriptions and decide whether they agree. The disagreement, when it happens, is now a conversation about evidence rather than a debate about impressions.
- Define the competency: one sentence describing what this skill is and why it matters for this role
- Write behavioral anchors for scores 1, 3, and 5: what does the interview transcript look like at each level?
- Build two to three questions per competency that require a specific behavioral example
- For each question, write follow-ups for: the vague answer, the specific answer that needs more depth, and the strong answer that should be probed on edge cases
- After the interview, map the transcript evidence to the anchor descriptions before assigning a score
- Attach the specific transcript quotes to the score so any reviewer can verify the reasoning
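The last two steps in the list above, mapping evidence to an anchor and attaching quotes, can be enforced in the data structure itself. This is a minimal sketch under stated assumptions: the `Rubric` and `ScoredCompetency` classes and the `assign_score` function are hypothetical names, the choice of score is made elsewhere (by a reviewer or a model), and only scores with a written anchor are accepted.

```python
from dataclasses import dataclass


@dataclass
class Rubric:
    competency: str
    definition: str          # one sentence: what the skill is and why it matters
    anchors: dict[int, str]  # score -> behavioral description of the transcript


@dataclass
class ScoredCompetency:
    competency: str
    score: int
    anchor: str
    quotes: list[str]


def assign_score(rubric: Rubric, score: int, quotes: list[str],
                 transcript: str) -> ScoredCompetency:
    """Record a score only if it has an anchor and verbatim transcript evidence."""
    if score not in rubric.anchors:
        raise ValueError(f"no behavioral anchor written for score {score}")
    if not quotes:
        raise ValueError("no quote, no score")
    missing = [q for q in quotes if q not in transcript]
    if missing:
        raise ValueError(f"quotes not found in transcript: {missing}")
    return ScoredCompetency(rubric.competency, score, rubric.anchors[score], list(quotes))


rubric = Rubric(
    competency="communication clarity",
    definition="Explains complex ideas so that others can act on them.",
    anchors={
        1: "Placeholder anchor for 1; write your own.",
        3: "Placeholder anchor for 3; write your own.",
        5: "Explained a complex concept using a concrete analogy, adjusted their "
           "explanation when the follow-up question suggested confusion, and "
           "confirmed understanding without being asked.",
    },
)
transcript = "Think of the cache as a library index. Does that match what you expected?"
result = assign_score(rubric, 5, ["Think of the cache as a library index."], transcript)
print(result.score, result.quotes)
```

Because the quotes must appear verbatim in the transcript, a reviewer can find each one and judge whether it really supports the anchor it is attached to.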
Seniority calibration and why it changes everything
One of the most common failures in interview design is using the same questions for every level of a role. I made this mistake for the first two years of hiring. We had a question bank for engineers, and we used it for junior hires and senior hires and principal hires. The senior candidates found it easy and performed well. The junior candidates struggled with the depth of probing and performed poorly. We concluded we were not finding strong junior engineers. What we were actually doing was administering an exam pitched at the wrong level.

Seniority calibration means the AI adjusts the complexity, depth, and framing of questions based on the level of the role. A junior engineer is asked about a bug they found and fixed, and the follow-up probing checks whether they understood why it happened and what they learned. A senior engineer is asked about a systemic failure they prevented or diagnosed, and the follow-up probing checks whether they understood the second- and third-order implications. A principal engineer is asked about a technical decision that had organizational consequences, and the probing checks whether they can reason about the trade-offs at a systems level, not just a code level.

The same competency, assessed at the right depth for the level, produces comparable and meaningful scores. The same competency assessed with the wrong depth produces noise in both directions: seniors who are bored and underperform, juniors who are overwhelmed and underperform, and a middle ground of candidates who happen to match whatever depth you chose by accident.
prompt template: seniority-calibrated question framing
Junior: "Tell me about a time you [did X]. What happened and what did you learn?"
Mid: "Tell me about a time you [led X or owned X]. What was your approach and what would you change?"
Senior: "Tell me about a time [X created systemic impact]. How did you diagnose it and what were the trade-offs in your solution?"
Principal: "Tell me about a decision you made about [X] that had consequences beyond your immediate team. How did you reason through the trade-offs and what did you get wrong?"
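The templates above can be stored as data and filled in per competency. A minimal sketch, assuming the bracketed slot in each template is replaced by a single `{x}` placeholder; the `frame_question` function name and the lowercase level keys are illustrative choices.

```python
# The four framings from the template above, with the bracketed slot as {x}.
TEMPLATES = {
    "junior": "Tell me about a time you {x}. What happened and what did you learn?",
    "mid": "Tell me about a time you {x}. What was your approach and what would you change?",
    "senior": ("Tell me about a time {x}. How did you diagnose it and what were "
               "the trade-offs in your solution?"),
    "principal": ("Tell me about a decision you made about {x} that had consequences "
                  "beyond your immediate team. How did you reason through the "
                  "trade-offs and what did you get wrong?"),
}


def frame_question(level: str, x: str) -> str:
    """Return the question framed for the given seniority level."""
    key = level.lower()
    if key not in TEMPLATES:
        raise ValueError(f"unknown level {level!r}; expected one of {sorted(TEMPLATES)}")
    return TEMPLATES[key].format(x=x)


print(frame_question("junior", "found and fixed a bug"))
print(frame_question("principal", "our deployment pipeline"))
```

Keeping the framing separate from the competency content means one competency definition can yield a comparable question at every level you hire at.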
How vagueness detection separates rehearsed from real
Rehearsed answers have a recognizable signature. They are smooth. They follow a structure. They hit the right keywords. They do not have rough edges or moments of uncertainty. Real answers have rough edges. The person pauses when they try to remember something specific. They correct themselves when they realize they are not being precise. They acknowledge what went wrong without being asked because the real memory of the event includes the awkward parts.

Vagueness detection is the mechanism that surfaces this difference. The AI flags four patterns that indicate a rehearsed or unsupported answer.

- Principle statements presented as evidence: "I always prioritize transparency" is not an example of being transparent.
- Abstract outcomes without specifics: "the team was much more aligned after that" does not tell you what changed or how you know.
- Deflection to a different question: when a candidate answers a harder question by pivoting to a story that answers an easier one.
- Absence of failure or difficulty: real stories almost always include something that did not go as planned. A story where everything worked perfectly is usually a story that has been edited down to the highlights.

When any of these patterns appear, the AI probes. The probe is not accusatory. It is simply a request for more specificity. "Can you give me the specific numbers?" "What did that actually look like day to day?" "What was the hardest part of that?" "Was there a moment where you thought it might not work?" These questions do not feel aggressive in a conversation. They feel like genuine curiosity. But they reliably separate candidates who have real experience from candidates who have well-prepared descriptions of experience.
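To make the flag-then-probe loop concrete, here is a deliberately crude sketch that approximates three of the four patterns with keyword rules. The regexes, flag names, and the `vagueness_flags` function are assumptions for illustration only; a real system would rely on a language model rather than keywords, and deflection is left out because it cannot be judged without comparing the answer to the question.

```python
import re

# Keyword heuristics (assumptions, not any platform's actual rules).
PRINCIPLE = re.compile(r"\bI (?:always|believe|make sure|try to)\b", re.IGNORECASE)
HAS_NUMBER = re.compile(r"\d")
DIFFICULTY = re.compile(
    r"\b(?:failed|mistake|wrong|struggled|hardest|missed|didn't work)\b", re.IGNORECASE
)

# Probes quoted from the section above, keyed by flag.
PROBES = {
    "principle_statement": "What did that actually look like day to day?",
    "no_specifics": "Can you give me the specific numbers?",
    "no_difficulty": "What was the hardest part of that?",
}


def vagueness_flags(answer: str) -> list[str]:
    """Return the flags raised by an answer, in a fixed order."""
    flags = []
    if PRINCIPLE.search(answer):
        flags.append("principle_statement")
    if not HAS_NUMBER.search(answer):
        flags.append("no_specifics")   # no digits at all: nothing checkable
    if not DIFFICULTY.search(answer):
        flags.append("no_difficulty")  # nothing went wrong in the story
    return flags


answer = ("I believe in transparent communication with stakeholders and making sure "
          "the team has clear priorities. When things go off track I always make sure "
          "to surface issues early and work collaboratively to find solutions.")
for flag in vagueness_flags(answer):
    print(flag, "->", PROBES[flag])
```

Run on the principle-based answer from the earlier exchange, all three flags fire, which matches the point made there: the answer contains nothing checkable.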
Common mistakes in AI question and scoring setup
Importing a generic question bank without editing it. Most AI interview platforms come with pre-built question libraries. Those libraries are starting points, not finished products. A question that works for a software company hiring a product manager does not necessarily work for a healthcare company hiring a clinical operations manager. Review every question before deploying it. Remove the ones that do not require role-specific knowledge. Add follow-up probes that reflect the actual complexity of your role. This is two hours of work that determines whether your interviews produce signal or noise.

Building a rubric without behavioral anchors. A scoring scale of one to five means nothing if one means "bad" and five means "good." Write behavioral anchors for at least scores one, three, and five. Describe what the transcript actually looks like at each level. This is the only way to get consistent scoring across multiple reviewers or across time.

Using the same question depth for every seniority level. Map your question trees to the levels you hire at. A junior, mid, senior, and principal engineer should each have a version of the same competency question calibrated to the complexity of decision-making expected at that level. Using senior-level questions on junior candidates or junior-level questions on senior candidates produces invalid scores in both directions.

Scoring the full interview before reading the transcript. The overall impression of a candidate at the end of a conversation is a real signal, but it is a biased one. Score each competency from the transcript evidence before forming an overall view. You will catch cases where a strong overall impression is masking weak evidence on the competency that actually matters most for the role.
Quick reference: question and scoring design cheat sheet
| Design decision | Rule of thumb | Threshold |
| --- | --- | --- |
| Questions per competency | One primary question plus three branching follow-ups per competency | 1 primary + 3 follow-ups |
| Competencies per interview | Three to four competencies assessed deeply beats eight assessed shallowly | Max 4 per session |
| Rubric anchor requirement | Write behavioral descriptions for scores 1, 3, and 5 before running any interviews | Anchors for 1, 3, 5 minimum |
| Vagueness probe trigger | Any answer that is principle-based rather than example-based triggers a specificity probe | 0 examples = probe immediately |
| Seniority calibration | Questions should require decision complexity proportional to the level being assessed | Separate tracks for junior, mid, senior, principal |
| Evidence requirement per score | Every score must have at least one direct transcript quote attached as evidence | No quote, no score |
| Rubric review frequency | Review anchor descriptions against actual transcripts after every 15 to 20 interviews | Every 15-20 interviews |
| Question bank review | Remove any question a candidate can answer without role-specific knowledge | Zero generic questions |
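Several of these thresholds are mechanical enough to check before an interview plan goes live. The sketch below covers three of them (competency count, anchors, follow-ups); the plan layout as a nested dictionary and the `design_problems` function are hypothetical, chosen only to keep the example short.

```python
REQUIRED_ANCHORS = {1, 3, 5}  # anchors for 1, 3, 5 minimum
MAX_COMPETENCIES = 4          # max 4 per session
MIN_FOLLOW_UPS = 3            # 1 primary + 3 follow-ups


def design_problems(plan: dict) -> list[str]:
    """Return a list of cheat-sheet violations found in an interview plan."""
    problems = []
    competencies = plan["competencies"]
    if len(competencies) > MAX_COMPETENCIES:
        problems.append(f"{len(competencies)} competencies; max is {MAX_COMPETENCIES}")
    for name, spec in competencies.items():
        missing = REQUIRED_ANCHORS - set(spec["anchors"])
        if missing:
            problems.append(f"{name}: missing anchors for scores {sorted(missing)}")
        if len(spec["follow_ups"]) < MIN_FOLLOW_UPS:
            problems.append(f"{name}: fewer than {MIN_FOLLOW_UPS} follow-ups")
    return problems


plan = {
    "competencies": {
        "incident response": {
            "anchors": {1: "...", 5: "..."},  # anchor for 3 not written yet
            "follow_ups": ["vague", "needs depth", "edge cases"],
        },
    },
}
print(design_problems(plan))
```

An empty list means the plan passes these three checks; the remaining rows of the table, such as question bank review, still need human judgment.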
What this looks like with real numbers
A team that switched from unstructured human-led interviews to AI interviews with a properly built competency framework compared the three months before the change with the three months after.

In the three months before, they conducted 94 interviews across two open roles. Interviewer agreement, measured by having two reviewers independently score the same candidate, was 49%. Roughly half the time, two experienced interviewers watching the same candidate disagreed by more than one point on a five-point scale. Time from interview completion to hiring decision averaged 8 days because debriefs kept getting rescheduled.

In the three months after, with rubric-based AI scoring and transcript evidence attached to every score, interviewer agreement on reviewing the same transcripts rose to 84%. Time to decision dropped to 2 days because reviewers were looking at evidence rather than trying to reconstruct a memory. And the two hires made from the new process were still with the company six months later, while one of the three hires from the previous quarter had already left.

Better questions, better scoring, better evidence. That is the chain that produces better hires.
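If you want to track the same agreement figure for your own process, the calculation is short. A minimal sketch, assuming two reviewers "agree" when their scores differ by at most one point, which is how the comparison above describes disagreement; the `agreement_rate` function and the sample scores are hypothetical.

```python
def agreement_rate(score_pairs: list[tuple[int, int]], tolerance: int = 1) -> float:
    """Share of candidates whose two reviewer scores differ by at most `tolerance`."""
    if not score_pairs:
        raise ValueError("need at least one pair of scores")
    agreed = sum(1 for a, b in score_pairs if abs(a - b) <= tolerance)
    return agreed / len(score_pairs)


# Made-up scores on a five-point scale: (reviewer A, reviewer B) per candidate.
pairs = [(3, 4), (2, 5), (4, 4), (1, 3)]
print(f"{agreement_rate(pairs):.0%}")  # prints: 50%
```

Computing this per competency rather than per interview shows which anchor descriptions reviewers read differently, which is the input the periodic rubric review needs.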
Building this question and scoring architecture manually is possible and worth doing even if you never use an AI platform. If you are running interviews at scale and want the adaptive follow-up probing and rubric scoring to happen automatically without engineering time on every interview, TheCognitive runs 45 to 60 minute live video interviews with competency-specific question trees, real-time adaptive follow-up, and evidence-based scorecards across technical, behavioral, and managerial rounds. Any industry, any role. Details at thecognitive.io or book a walkthrough at calendly.com/cgmeet/30min.
Ask better questions. Score the evidence, not the impression.
Related Resources