Evaluating LLMs - Labellerr AI

Evaluating LLMs

A collection of 7 posts

MiniMax-M1, the World's First Open-Source, Large-Scale, Hybrid-Attention Reasoning Model

MiniMax‑M1: 1M‑Token Open‑Source Hybrid‑Attention AI

Meet MiniMax-M1: a 456B-parameter, hybrid-attention reasoning model released under the Apache 2.0 license. By combining a hybrid Mixture-of-Experts architecture with lightning attention, it handles 1M-token contexts with 75% fewer FLOPs, delivering top-tier performance in math, coding, long-context, and RL-based reasoning.

Phi-4-Reasoning: Building Smarter AI Agents with 14B Param

Discover how Phi-4-Reasoning, a 14B-parameter model, enhances AI agent intelligence through curated data and reinforcement learning. Learn about its performance in complex reasoning tasks and how it outperforms larger models.

Experiment and Learn: What's New in Qwen-3

Qwen 3 Breakdown: What’s New & How It Performs

Explore Alibaba's latest AI model, Qwen 3, featuring hybrid reasoning capabilities and multilingual support. Discover its innovative design, performance benchmarks, and how it stands out in the competitive AI landscape.

GPT 4.1: Better and Cheaper Than GPT-4o?

GPT-4.1, OpenAI's latest model, surpasses GPT-4o with improved coding abilities, a 1-million-token context window, and more affordable pricing. This article explores the advancements and benefits of GPT-4.1 for developers and businesses alike.

Meta Launched Llama 4

Llama 4 Unleashed: What’s New in This LLM?

Llama 4 is Meta’s latest large language model (LLM), bringing better reasoning, longer context, and smarter responses. Explore how it compares to other LLMs and what it means for developers, researchers, and businesses using AI.

Evaluating LLMs

Opik Is Changing How You Evaluate LLMs — Find Out How

Opik by Comet automates LLM evaluation and detects errors and hallucinations. It tracks decisions, flags mistakes, and removes the need for manual testing.

Is Your Reasoning Model Any Better? Check with ARC AGI v2!

Is Your AI Smart Enough? Test It with ARC AGI v2!

ARC AGI v2 tests AI reasoning with abstract tasks that go beyond memorization. It evaluates how well models recognize patterns, solve problems, and generalize knowledge.