In 2011, a company called HireVue raised its first serious round of funding on the premise that recording candidates answering interview questions on video and having humans review those recordings later would make hiring faster and more consistent. The idea was not wrong. Async video did solve a real problem: scheduling. You no longer needed two humans free at the same time. The candidate recorded their answers, the hiring manager watched when they had time, and the process moved faster.

What nobody talked about much was what was lost. The follow-up question. The moment where a candidate says something interesting and you ask them to go deeper. The back-and-forth that reveals how someone thinks, not just what they have prepared. Async video solved the scheduling problem and quietly killed the conversation. But it looked like progress because it was faster, and faster is easy to measure while conversation quality is not.

Over the next decade, every major hiring platform added video. Then they added AI scoring on top of the video. Then they added facial expression analysis, tone analysis, eye contact scoring. By 2019, HireVue itself was using facial analysis to score candidates. In 2021, they quietly dropped it after researchers pointed out it had no empirical validity and introduced significant bias. The whole arc of that decade was a lesson in what happens when you optimize for the metric that is easiest to measure rather than the outcome that actually matters.

The current generation of AI interview platforms is genuinely different in capability from what existed five years ago. Large language models changed what is possible in terms of real-time conversation, adaptive questioning, and evidence-based scoring. But the market is now crowded with products that use the same language to describe very different things. "AI-powered," "adaptive," "evidence-based" all appear in the marketing of platforms that range from genuinely useful to dressed-up screening forms.
Choosing the wrong one does not just waste money. It produces bad data that leads to bad hires, which is worse than having no data at all.
Summary of key concepts
| Concept | What it means | Why it matters |
| --- | --- | --- |
| Conversation depth | Whether the platform conducts a real two-way exchange or delivers a fixed script with recorded responses | Depth determines whether you get evidence of thinking or evidence of preparation |
| Adaptive follow-up | Whether the AI generates the next question based on what the candidate actually said | The only way to separate real experience from rehearsed answers |
| Scoring transparency | Whether scores come with transcript evidence or just a number | A score without evidence is an opinion, not data |
| Round type fit | Whether the platform is built for screening or for substantive assessment rounds | Using a screening tool for a deep round produces shallow data on a decision that deserves depth |
| Role and industry flexibility | Whether the platform works only for technical roles or across all functions and industries | A platform that only works for engineers does not scale across your whole hiring operation |
| Candidate experience | Completion rate, NPS, and whether candidates feel the process was fair | A platform with a 50% completion rate is losing half your candidate pipeline before you see a single score |
The first question to ask: what round is this for
This is the question most buying decisions skip, and it is the reason so many teams end up with a platform that technically works but does not produce the data they actually need. AI interview platforms broadly split into two categories, even if they do not describe themselves that way.

Screening tools are designed for high volume, early-stage filtering. They are optimized for speed, low friction, and moving large numbers of candidates through a quick assessment to produce a shortlist. Voice-based platforms like Ribbon sit here. They work well for roles where you need to confirm basic eligibility fast, particularly in high-turnover industries like retail, hospitality, and logistics where the assessment criteria are relatively straightforward and the conversation does not need to go deep.

Assessment tools are designed for substantive rounds. They are optimized for depth, evidence quality, and producing a scorecard that a hiring manager can use to make a real hire or no-hire decision. These are the platforms you use when you want to replace or augment a human-led technical interview, a behavioral deep-dive, or a managerial assessment round. The conversation needs to be long enough, deep enough, and adaptive enough to produce signal about how someone actually thinks and performs under pressure.

Most teams that end up dissatisfied with AI interview platforms bought an assessment tool and used it as a screener, or bought a screening tool and expected assessment-level depth from it. Clarify which round you are solving for before you look at a single vendor.
If you cannot answer "which specific round in our current hiring process does this replace," you are not ready to buy an AI interview platform. Figure out the round first. Then find the tool that fits it.
How to evaluate conversation quality
Conversation quality is the hardest thing to evaluate in a demo because every platform looks good in a demo. The demo candidate gives clean answers, the AI asks sensible follow-ups, the report looks organized. What you need to see is what happens when the candidate gives a bad answer.

Ask the vendor to show you a recording where a candidate gave a vague, principle-based answer. Not a strong candidate, not a weak candidate who obviously struggles. A candidate who sounds good but is not being specific. Watch what the AI does. Does it accept the answer and move on? Does it generate a follow-up that pushes for specifics? Does the follow-up make sense given what the candidate actually said, or does it feel like a generic probe that would have been asked regardless of the answer?

The follow-up question quality in that moment tells you more about the platform than any feature list. A platform that accepts vague answers is a platform that will produce scores based on how well candidates perform a soft skills monologue, not on whether they actually have the skills. That is a form with extra steps, not an interview.

The second thing to look at is the scoring report for that same candidate. Is the score for the vague answer lower than the score for the specific one? Can you see the transcript evidence attached to each score? If the platform gave a 3 out of 5 to a candidate who produced no specific examples, and attached a transcript quote of a generic principle statement as the evidence for that 3, the scoring engine is not working. It is summarizing impressions, not extracting evidence.
conversation_quality_test:
Step 1: Ask vendor to show a recording where candidate gave a vague answer
Step 2: Check whether the AI follow-up required specificity or accepted the vague answer
Step 3: Check the scorecard: does the score reflect the lack of evidence?
Step 4: Check the evidence field: is there a real transcript quote or just a summary?
Step 5: Ask what happens if a candidate deflects to a different question entirely
Pass criteria: follow-up required specifics + score reflects evidence quality + quote is verbatim
Voice versus video and when each is the right choice
The format of the interview changes what the interview can assess, and most buying decisions underweight this.

Voice-based AI interviews feel like phone calls. The candidate does not need to be on camera. The interaction is lower friction and generally has higher completion rates for early-stage screening because there is less psychological weight to it. Platforms built on voice work well when your goal is confirming basic eligibility, language fluency, and communication fundamentals at high volume. The speed and accessibility are real advantages for the right use case.

Video-based AI interviews feel like real interviews because they are. The candidate is on camera. The AI interviewer is visually present. The format signals to the candidate that this is a substantive round and they prepare accordingly, which means you see a more genuine version of how they perform in a real interview setting. The conversation can go deeper because the format supports it. Platforms like TheCognitive run 45 to 60 minute live video conversations with adaptive follow-up across technical, behavioral, and managerial rounds. The round type and industry do not matter. What matters is that the format creates the conditions for a real conversation rather than a quick eligibility check.

The mistake is using a voice screening tool when you need video assessment depth, or requiring candidates to do a full 60-minute video interview for a role where a 10-minute voice screen would have told you everything you needed to know. Match the format to the depth the decision requires.
| Format | Best for | Not suited for | Typical length |
| --- | --- | --- | --- |
| Voice AI | High-volume screening, eligibility confirmation, early funnel filtering | Deep behavioral assessment, technical evaluation, managerial rounds | 10-20 minutes |
| Async video | Candidate presentation review, culture fit signals, communication style | Adaptive follow-up, probing under pressure, evidence-based scoring | 15-30 minutes |
| Live video AI | Substantive assessment rounds, technical, behavioral, managerial depth | Fast high-volume screening where depth is not required | 45-60 minutes |
What to look for in the scoring and reporting
The report is what you actually use to make decisions, so evaluating the report quality matters as much as evaluating the conversation quality. They are related but not the same thing. A platform can conduct a reasonable conversation and produce a terrible report, or conduct an adequate conversation and produce a report that is genuinely useful for decision-making.

Three things in a report determine whether it is useful. First, whether every score has a direct transcript quote attached to it as evidence. A score without a quote is an opinion. It might be a well-calibrated opinion, but you cannot verify it, share it with a skeptical hiring manager, or use it as a basis for a defensible decision. Second, whether the scoring is competency-specific rather than a single overall rating. An overall score of 3.4 out of 5 tells you nothing about where the candidate is strong and where they are weak. A score of 4 on technical depth and 2 on communication clarity tells you exactly what to probe in the next round. Third, whether the transcript is searchable and the recording is timestamped. You should be able to jump directly to the moment in the recording where a specific exchange happened without watching the whole thing.

When you are evaluating platforms, ask for three sample reports from real interviews, not demo interviews with planted candidates. Read the transcripts. Check whether the evidence quotes actually justify the scores. Look for cases where a candidate gave a weak answer and see whether the score reflects it. That is the test.
- Request three sample reports from real interviews, not demo candidates
- Check that every score has a direct transcript quote attached as evidence
- Verify scores are broken down by competency, not just an overall rating
- Confirm the transcript is searchable and the recording is timestamped
- Find a weak answer in the transcript and check whether the score reflects it
- Ask whether you can add your own competency definitions or only use the platform's defaults
- Check the candidate NPS and completion rate data for the platform, not just the vendor's claims
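If a vendor lets you export sample reports, the first two checks in the list above can be partly automated. The sketch below is one way to do it in Python, under assumptions the article does not specify: the `CompetencyScore` shape, the `audit_report` helper, and the sample transcript are all made up for illustration, and no real platform's export format is implied. It confirms that every evidence quote appears in the transcript and that there is more than one competency score. It cannot judge whether a quote actually justifies its score, so you still have to read the transcripts.

```python
from dataclasses import dataclass


@dataclass
class CompetencyScore:
    """Hypothetical shape of one scored competency in an exported report."""
    competency: str  # e.g. "technical depth"
    score: int       # 1-5 scale, matching the article's examples
    evidence: str    # the quote the platform attached to the score


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not hide a match."""
    return " ".join(text.lower().split())


def audit_report(scores: list[CompetencyScore], transcript: str) -> list[str]:
    """Return the problems found in a report; an empty list means it passed."""
    problems = []
    # Fewer than two scores means no per-competency breakdown was provided.
    if len(scores) < 2:
        problems.append("scores are not broken down by competency")
    haystack = _normalize(transcript)
    for s in scores:
        quote = _normalize(s.evidence)
        if not quote:
            problems.append(f"{s.competency}: no evidence attached")
        elif quote not in haystack:
            # The "evidence" is a summary or paraphrase, not a transcript quote.
            problems.append(f"{s.competency}: evidence is not a verbatim transcript quote")
    return problems


# Made-up sample data for illustration only.
transcript = (
    "I believe in clear communication. "
    "In March I cut deploy time from 40 to 12 minutes by caching builds."
)
scores = [
    CompetencyScore("technical depth", 4, "I cut deploy time from 40 to 12 minutes"),
    CompetencyScore("communication clarity", 2, "Candidate communicates well overall"),
]
print(audit_report(scores, transcript))
# ['communication clarity: evidence is not a verbatim transcript quote']
```

The second score fails because its evidence is a summary of an impression, not something the candidate said, which is the failure mode this section warns about.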
Role flexibility and industry fit
This one matters more than most teams realize until they are six months into a platform that works for one team and not for anyone else in the company.

Some platforms are built specifically for technical hiring. Their question banks are engineering-heavy, their competency frameworks assume software development context, and their follow-up probing is calibrated for code and systems conversations. They work well for engineering roles and produce thin, often irrelevant data for everything else. If you are only hiring engineers, this is fine. If you are hiring across sales, operations, customer success, finance, or leadership, you need a platform whose conversation and scoring engine is genuinely role-agnostic.

A role-agnostic platform means the competency framework can be customized for any function. The question generation can be built from a sales manager job description as well as a backend engineer job description. The follow-up probing is not defaulting to technical language when the role has nothing to do with technology.

Ask vendors directly: show me a sample interview and report for a non-technical role. A sales manager, a clinical operations lead, a finance business partner. If they struggle to produce one, the platform is not as flexible as the marketing suggests.
A platform that only works for engineers is a platform your engineering team will love and every other hiring manager in the company will ignore. That is not a hiring infrastructure. That is a departmental tool with enterprise pricing.
Common mistakes when choosing an AI interview platform
- Buying based on the demo candidate rather than the difficult candidate. Demos show the platform at its best with a cooperative, articulate candidate who gives clean answers. That is not your candidate pool. Ask to see what happens with a vague answer, a deflection, a candidate who struggles. The platform's behavior in those moments tells you more than any feature walkthrough.
- Optimizing for price per interview without accounting for report quality. A platform that costs $5 per interview and produces a score with no transcript evidence is not cheaper than a platform that costs $30 and produces a scorecard with evidence attached. The $5 platform produces data you cannot use. The cost of a bad hire from acting on bad data is orders of magnitude higher than the per-interview price difference.
- Not testing completion rate before committing. Completion rate is a proxy for candidate experience quality. If candidates are dropping out of AI interviews at rates of 30 or 40%, something is wrong with the platform's UX, instructions, or format. A completion rate below 80% means more than one in five candidates drops out before you ever see a score. Ask for the vendor's completion rate data across their customer base, not just their best-performing customer.
- Choosing a platform that cannot be customized. Generic question banks and fixed competency frameworks work for the vendor's average customer. They do not work for your specific roles, your specific industry, or your specific definition of what strong looks like. If a vendor cannot show you how to build custom competency definitions and custom question trees, you will be scoring candidates against someone else's rubric forever.
Quick reference: AI interview platform evaluation cheat sheet
| Evaluation criterion | What to check | Minimum bar |
| --- | --- | --- |
| Conversation depth | Ask to see what happens when a candidate gives a vague answer | AI must probe for specifics, not accept the vague answer |
| Scoring evidence | Check that every score has a direct transcript quote attached | No score without a verbatim transcript quote |
| Competency breakdown | Confirm scores are per competency, not just an overall rating | Minimum 3 competencies scored separately |
| Completion rate | Ask for platform-wide completion rate data, not best-case | 80% minimum across customer base |
| Customization | Check whether you can define your own competencies and question trees | Full custom competency and question bank required |
| Non-technical role fit | Ask for a sample interview and report for a sales or operations role | Must work outside engineering with role-appropriate questions |
| Recording and transcript | Confirm recording is timestamped and transcript is searchable | Both required for efficient hiring manager review |
| Round type fit | Confirm the platform is built for the depth of round you are replacing | Screening tools for screening, assessment tools for assessment |
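If you are comparing several vendors, it can help to record what you observed for each one and apply the same minimum bars to all of them. The following is a minimal Python sketch of the cheat sheet as a checklist. The thresholds (80% completion, 3 competencies) come from the table above. Everything else is an assumption: the `VendorFacts` fields are hypothetical names for observations you would record by hand, and the example vendor is invented.

```python
from dataclasses import dataclass


@dataclass
class VendorFacts:
    """Hypothetical record of what you observed during one vendor's evaluation."""
    probes_vague_answers: bool       # AI pushed for specifics on a vague answer
    scores_with_verbatim_quote: int  # scores in sample reports that carry a verbatim quote
    total_scores: int                # all scores in those sample reports
    competencies_scored: int         # competencies scored separately
    completion_rate: float           # platform-wide, as a fraction from 0.0 to 1.0
    custom_competencies: bool
    custom_question_bank: bool
    non_technical_sample_ok: bool    # sample for a sales or operations role held up
    timestamped_recording: bool
    searchable_transcript: bool
    built_for_round: bool            # designed for the depth of round you are replacing


def failed_bars(v: VendorFacts) -> list[str]:
    """Return the cheat-sheet criteria this vendor does not meet, in table order."""
    checks = {
        "Conversation depth": v.probes_vague_answers,
        "Scoring evidence": v.total_scores > 0
        and v.scores_with_verbatim_quote == v.total_scores,
        "Competency breakdown": v.competencies_scored >= 3,
        "Completion rate": v.completion_rate >= 0.80,
        "Customization": v.custom_competencies and v.custom_question_bank,
        "Non-technical role fit": v.non_technical_sample_ok,
        "Recording and transcript": v.timestamped_recording and v.searchable_transcript,
        "Round type fit": v.built_for_round,
    }
    return [name for name, ok in checks.items() if not ok]


# Invented example vendor: scores without evidence and a low completion rate.
example = VendorFacts(
    probes_vague_answers=True,
    scores_with_verbatim_quote=0,
    total_scores=5,
    competencies_scored=4,
    completion_rate=0.74,
    custom_competencies=True,
    custom_question_bank=True,
    non_technical_sample_ok=True,
    timestamped_recording=True,
    searchable_transcript=True,
    built_for_round=True,
)
print(failed_bars(example))
# ['Scoring evidence', 'Completion rate']
```

Treating every row as pass or fail is a deliberate simplification. It keeps a strong result on one criterion from averaging away a failure on another, which matches the table's framing of these as minimum bars.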
What this looks like with real numbers
A team that ran a three-platform evaluation before choosing an AI interview tool documented the process in detail. They tested each platform on the same role with the same five candidates.

Platform one, a well-known voice screening tool, had a 91% completion rate and produced a score and a three-sentence summary per candidate. Fast, clean, usable for screening. Useless for the substantive behavioral round they were trying to replace.

Platform two, a video platform with AI-generated scores, had a 74% completion rate. Three of the five candidates dropped out before finishing. The two who completed received scorecards with numerical scores and no transcript evidence attached. The scores could not be verified or explained to the hiring manager.

Platform three had a 90% completion rate. Each report had a full transcript, timestamped recording, and competency-specific scores with verbatim quotes attached. Hiring manager review time per candidate was 12 minutes. The decision to move forward or not on each candidate took under a day.

The price per interview across the three platforms varied by less than $15. The difference in decision quality was not close.
Evaluating platforms is worth doing carefully because the choice compounds over time. Every interview you run on the wrong platform produces data you cannot use, or worse, data that misleads you. If you are running substantive assessment rounds across technical, behavioral, or managerial tracks and want to see what evidence-based AI interviewing actually looks like in practice, TheCognitive runs 45 to 60 minute live video interviews with adaptive follow-up, competency-specific scoring, and full transcript evidence across any role and any industry. The first 100 interviews are free. Details at thecognitive.io or book a 30-minute walkthrough at calendly.com/cgmeet/30min.
Buy for the round you are replacing, not for the demo you were shown.