
Cool Isn’t Enough: The Case for Evidence-Based Assessments in the Age of AI

25 June 2025

Explore the trade-offs between AI-powered tools and evidence-based assessments. Learn why validity, reliability, and fairness still matter in the age of innovation.

Author: Kleinjan Redelinghuys

Imagine being forced to choose between two options: proven and cool. Proven refers to psychological assessments rooted in decades of research, designed to be valid, reliable, and fair. Cool refers to AI-powered tools that promise sleek design, fast results, and modern appeal, but often lack the scientific rigour typically (though not always) associated with traditional methods.

Which one would you choose?

While weighing your answer, you're probably wondering why you can't choose both.

In an ideal world, we wouldn’t have to choose. We’d have assessments that are scientifically sound and ooze innovation. However, in reality, people often choose between proven and cool without even realising it, and that quiet trade-off can have serious consequences.

Before delving into the potential trade-offs between proven and cool, let’s briefly look at what’s meant by valid, reliable, and fair and why many find AI-powered assessments appealing.

A Quick Refresher: What Do We Mean by Valid, Reliable, and Fair?

  • Validity: Does the assessment measure what it claims to measure?

  • Reliability: Are the results consistent across time, raters, and alternate test forms? (A minimal computation is sketched below.)

  • Fairness: Does the assessment yield unbiased results across demographic groups?

These aren’t just academic niceties. They’re critical in making accurate, just, and legally defensible decisions.
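
To make the reliability point concrete, internal consistency (one common facet of reliability) is often summarised with Cronbach's alpha. The sketch below is a minimal, self-contained illustration using hypothetical item scores; it shows the statistic itself, not a full psychometric analysis.

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) score matrix.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    k = item_scores.shape[1]                         # number of items
    item_vars = item_scores.var(axis=0, ddof=1)      # per-item variance
    total_var = item_scores.sum(axis=1).var(ddof=1)  # variance of summed scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 6 respondents answering 4 Likert-type items (1-5).
scores = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 5, 5, 4],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
    [1, 2, 1, 2],
])
print(f"Cronbach's alpha: {cronbach_alpha(scores):.2f}")
```

By common convention, values around .70 or higher are treated as acceptable for low-stakes use, with stricter thresholds expected where decisions carry real consequences.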

What’s the Appeal of “AI-Powered”?

  • Efficiency and Speed: Automated scoring and reporting can reduce turnaround time from days to seconds.

  • Scalability: Large-scale assessments can be deployed with minimal human effort.

  • Data Richness: AI systems, especially those used in affective computing, behavioural biometrics, and digital phenotyping, can incorporate novel inputs that aren’t typically captured by traditional tests.

  • Adaptivity: Some AI-based assessments adjust dynamically to a test-taker’s responses, making the experience more tailored and efficient.

  • Perceived Innovation: For many, “We apply AI” is synonymous with “We’re ahead of the curve”.

  • Improved Candidate Experience: Some AI tools are game-like and more interactive than traditional tests. This can boost engagement and reduce test anxiety, especially among younger or digitally savvy users.

On dimensions like these, AI assessments can genuinely outperform traditional tools in various applications.

These features are compelling, but is there a catch?

The Potential Trade-Off Between Proven and Cool

In psychology, education, and hiring, “AI-based assessment” is increasingly used as a marketing badge, implying sophistication, objectivity, and modernity. While some AI tools meet high scientific standards, many don’t, as innovation often takes precedence over the quieter principles of validity, reliability, and fairness. Still, caution shouldn't become blanket rejection. The problem isn’t AI itself; it’s untested or unvalidated AI.

Hence, some of the key concerns involving AI in the assessment space revolve around:

  • Transparency: Black-box/opaque models often provide no insight into how decisions are made, making auditing and accountability difficult. In contrast, interpretable models allow for inspection and justification.

  • Bias and Fairness: Bias in AI systems can arise from unrepresentative training data, flawed feature selection, or constraints imposed by designers.

  • Insufficient Scientific Grounding: Some systems rely on superficial correlations or secondary indicators (e.g. word choice, facial expressions, typing speed) without benchmarking against validated constructs or established clinical standards.

  • Overfitting and Lack of Generalisability: AI assessment models often overfit to their training data (i.e. they perform well only on the data they were trained on) and generalise poorly, leading to unreliable results across different populations, cultures, or contexts (a toy illustration follows this list). This concern isn't unique to AI; legacy psychometric tools can also overfit when poorly designed or outdated.

  • Reductionist Thinking: All assessments, including traditional ones, simplify complex human traits. However, when AI systems infer psychological constructs from narrow cues without theoretical grounding, the risk of invalid conclusions grows.

  • Overemphasis on Tech Appeal: AI tools are sometimes chosen for their tech appeal (e.g. sleek design, novelty) rather than proven effectiveness or fairness, which may lead to misguided decisions.

  • Automation Without Oversight: Automated scoring exists in AI and traditional assessments. What matters is not whether a system automates judgement, but whether it's paired with appropriate human review.
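
To make the overfitting and generalisability concern tangible, here is a toy simulation (all data and the “model” are hypothetical, not any vendor's method): a cut score calibrated on one population is applied to a second population whose feature distribution has shifted.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_population(n: int, shift: float) -> tuple[np.ndarray, np.ndarray]:
    """Hypothetical data: one feature predicts a binary criterion,
    but the feature's distribution differs between populations."""
    x = rng.normal(loc=shift, scale=1.0, size=n)
    y = (x + rng.normal(scale=0.5, size=n) > shift).astype(int)
    return x, y

def accuracy(x: np.ndarray, y: np.ndarray, cut: float) -> float:
    """Agreement between a fixed cut-score decision and the criterion."""
    return float(((x > cut).astype(int) == y).mean())

# "Train" a trivial model (a single cut score) on population A.
x_a, y_a = simulate_population(5000, shift=0.0)
cut = float(np.median(x_a))

# Evaluate on a fresh sample from A versus a shifted population B.
x_a2, y_a2 = simulate_population(5000, shift=0.0)
x_b, y_b = simulate_population(5000, shift=1.5)
print(f"Accuracy on population A: {accuracy(x_a2, y_a2, cut):.2f}")  # high
print(f"Accuracy on population B: {accuracy(x_b, y_b, cut):.2f}")    # noticeably lower
```

The specific numbers don't matter; the point is that a model validated only on the population it was built from can look excellent in-house and still misfire elsewhere. That is precisely why evidence of cross-population validation should be demanded.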

This is a timely reminder that without the grounding of good measurement science, even the most innovative tools can lead us astray.

We should ask ourselves: can a hiring tool that hasn’t been tested for racial or gender bias really support fair decisions? Can a chatbot that detects depression without benchmarking against clinical tools be trusted?

These concerns aren’t just theoretical. Let’s look at some real-world examples where AI assessments missed the mark.

Real-World Examples of AI Falling Short

Case Study 1: HireVue

Once marketed as a cutting-edge hiring solution, HireVue used AI to analyse candidates' facial expressions and vocal tone. The company claimed to identify traits linked to success, but critics raised concerns that:

  • The science behind facial cues was unclear.

  • The method risked bias against neurodivergent individuals and people from different cultures.

  • The process lacked transparency.

Following pressure from advocacy groups and privacy experts, HireVue dropped its facial analysis feature in 2021. This is an example of how innovation without accountability can falter.

Case Study 2: Mental Health AI Platforms

Several mental health platforms have promoted AI’s ability to detect anxiety or depression through text or speech analysis. Yet many:

  • have not been benchmarked against validated clinical tools.

  • don’t disclose how their models were trained.

  • offer little evidence of accuracy across diverse populations.

Although these tools may offer helpful prompts, they are not substitutes for a qualified, evidence-based diagnosis.

These examples highlight the risks of relying on AI assessments without solid scientific validation and oversight. With these real-world pitfalls in mind, the question remains:

What Would You Choose?

Let’s come back to the original question: proven or cool?

But before you answer, consider that:

Traditional psychometric tools are not immune to criticism. Some are outdated, poorly normed, or never validated for diverse populations. Additional concerns may include questionable statistical or methodological practices, lack of transparency in test development, and insufficient consideration of cultural and contextual factors. Therefore, both AI and legacy tools require ongoing validation, scrutiny, and adaptation.

Consequently, the question shouldn’t be proven or cool; it should be:

Does this assessment (AI-based or otherwise) meet accepted validity, reliability, and fairness standards?

In high-stakes contexts like employment, education, and mental health, that’s the only question that matters.

The Ideal Future Isn’t Either/Or—It’s Both

AI has the potential to transform psychological and educational assessments, but only if it meets the same scientific and ethical standards expected from traditional methods. That means:

  • Robust validation

  • Ongoing human oversight

  • Independent audits for bias (a minimal example check is sketched below)

  • Transparent methods

  • Clear documentation of model design
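
As an illustration of what the simplest layer of a bias audit can look like, the sketch below computes adverse impact ratios, the basis of the “four-fifths rule” used in US employment guidance, from hypothetical selection outcomes. A real audit goes much further, but even this basic check can surface red flags.

```python
from collections import Counter

def adverse_impact_ratios(outcomes: list[tuple[str, bool]]) -> dict[str, float]:
    """Each group's selection rate divided by the highest group's rate.

    outcomes: (group label, was_selected) pairs. By the conventional
    "four-fifths rule", a ratio below 0.80 warrants closer scrutiny.
    """
    applied = Counter(group for group, _ in outcomes)
    selected = Counter(group for group, sel in outcomes if sel)
    rates = {g: selected[g] / applied[g] for g in applied}
    best = max(rates.values())
    return {g: rate / best for g, rate in rates.items()}

# Hypothetical screening outcomes from an AI-scored assessment.
outcomes = ([("A", True)] * 40 + [("A", False)] * 60
            + [("B", True)] * 24 + [("B", False)] * 76)
for group, ratio in adverse_impact_ratios(outcomes).items():
    flag = "review" if ratio < 0.80 else "ok"
    print(f"Group {group}: impact ratio {ratio:.2f} ({flag})")
```

An impact ratio below 0.80 does not by itself prove bias, but it is the conventional trigger for deeper investigation, and the kind of evidence any vendor claiming fairness should be able to produce.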

Until then, the smartest path forward is cautious optimism. Although AI can enhance assessments, it should never replace the critical judgement, empathy, and ethical responsibility humans bring to high-stakes decisions. With careful science and ethical guardrails, we can move beyond “proven vs. cool” and toward a future that is both.

What’s your take on AI-powered assessments? Share your experiences below!
