How a small team of 3 took on the world’s top AI labs… and beat them at their own game.

Every company has a pivotal moment.

For some, it’s a product breakthrough. For others, a market explosion.

For us it was this:

We just ran the toughest deep-research benchmark on the planet... and finished ahead of the biggest AI labs with a team of three.

DeepWriter’s proprietary Abraxas engine did the heavy lifting. Our agentic system maps the problem, interrogates itself, and then lands the truth (more on this below).

Thanks to that, we placed ahead of:

  • Grok 4 Heavy (in its long-run setting)
  • Gemini 3.0 (which we love... you’ll see why below)
  • Kimi K2-Thinking
  • Claude Opus 4.5
  • GPT-5 Pro (with tools)
  • Perplexity, Mistral, Meta, etc.

Unlike others, we prove it with audited trace logs you can check below. It's one complete run, first pass only, and no retries. So if transparency counts, we set the standard.

And yes... we really are just three folks.

Meet Garrett, Josh, and Denis:

Guess what. 

The guy on the right doesn’t even know how to code.

So technically... count two.

But what we lacked in headcount, we made up for in obsession.

It all started two years ago, when Garrett had this crazy idea.

Picture this:

An engineer with no big title or corporate badge (but decades of experience) is sitting in his basement, surrounded by hardware.

He’s alone and it’s 3am.

He’s running advanced LLM experiments on uncharted ground while most are shipping wrappers.

That’s Garrett.

For a full picture, here's his actual workplace:

Garrett wasn’t building a chatbot or trying to start a company. He was obsessed with a single question:

Could AI think... discover new ideas... and shift paradigms?

Einstein showed how a simple equation could unify time, space, and light. This revolutionary thought changed how we see the universe.

Garrett believed AI could someday do something similar. That it could create structure that turns talk into thought.

He wasn’t looking for a glorified Wikipedia writer that just regurgitates what’s already been said. He also wasn’t impressed by LLMs capped at a few pages.

Garrett would only settle for AI that can reason, verify, and produce output at any length... even book-length if that was required by the user.

He also wasn’t sleeping much, but the results were enough to keep going.

And that's how DeepWriter was born.

Two years later, that obsession brought us here. To the most brutal test a deep research tool could ever face:

Meet Humanity’s Last Exam: the most sophisticated test of AI reasoning ever built.

Most AI benchmarks measure knowledge. They’ll assess if a model can recall facts, define a term or translate a phrase.

But Humanity’s Last Exam (or HLE for short) is a completely different animal.

It doesn’t test what you know. It tests how you think.

The test has been designed to push AI systems into the same corner as a human expert.

It includes thousands of questions across science, history, logic, economics, and more. The kind of problems you can't solve by skimming Wikipedia.

Each of these questions demands true expertise in that field, multi-step problem solving, synthesis and the ability to find some of the most obscure information on Earth.

In practice, most humans get almost none right. Even top specialists only nail some of them.

That's exactly why we created DeepWriter... to help individuals reach expert-level outcomes on problems that would otherwise stall.

Base models aren't always the right solution here... they can't reliably research answers this complex or connect the dots.

Many hallucinate and look for shortcuts.

Others fail outright.

That’s how HLE exposes the gap between sounding smart and being smart.

And it does it at scale:

There are 3,000 total questions but only 1,000 are public. In our case, we’ve attempted 878 questions.

Why 878?

Because 122 are image-based, and we don’t process visuals (yet).

Most public HLE reports we’ve reviewed focus on the text-only subset. We aligned with that to keep the comparison fair.

Some quick math.

From a statistical point of view, 878 questions is a large enough sample to estimate performance on the full 3,000 with 95%+ confidence and a margin of error of only a few percentage points. That’s why researchers worldwide treat the public set as a valid proxy.
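
If you want to sanity-check that claim yourself, here’s a rough back-of-the-envelope calculation (a sketch assuming simple random sampling with the worst-case proportion; the public subset isn’t literally a random draw, so treat it as an approximation):

    import math

    # Rough 95% margin of error for a sample of n questions from a pool of N,
    # using the worst-case proportion p = 0.5 and a finite-population correction.
    def margin_of_error(n, N, p=0.5, z=1.96):
        fpc = math.sqrt((N - n) / (N - 1))
        return z * math.sqrt(p * (1 - p) / n) * fpc

    print(round(margin_of_error(878, 3000) * 100, 1))  # ~2.8 percentage points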

Each question is graded in black and white: right or wrong. In other words, you can't get an "A-", or take partial credit.

We even reviewed each answer by hand for logical equivalence, and if it failed... it failed.

That’s how brutal this test is.

If an AI can pass HLE, it means it can survive the real world of deep research.

Because let’s be honest… the world doesn’t give you neatly packaged trivia. It throws you messy, interlocking problems.

Problems that demand reasoning chains, comparisons, and synthesis across multiple domains.

That’s exactly what DeepWriter was built for.

Spoiler alert: this is the second time we’ve proven the power of DeepWriter’s agentic engine with numbers.

Weeks ago, we attempted HLE and achieved the impressive score of 37.4.

That was enough to top Grok 4 Heavy on its shorter-run setting, ChatGPT 5, Mistral, Meta, and many others.

For a tiny team of 3, this was already impressive.

Here's how it looked in comparison to other solutions:

For your reference, you can check the public logs of this previous benchmark here.

Though that’s old data at this point (our new score is even more impressive).

You see, we wanted to achieve more.

Our dream was to create the most sophisticated agentic system in the world.

So when we sat down to plan the second run, we already had a battle-tested platform.

The problem? 

Running HLE is nothing like piping a few prompts through a chatbot.

It’s a gauntlet that demands infrastructure, resilience… more resilience… and cash.

So, the first time we ran it, we had to build a benchmark harness from scratch.

We used a script to load questions and enforce strict timeouts. The same script had to log everything, and crucially, resume from exactly where it left off if anything failed.
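
To give you a feel for it, here’s a minimal sketch of that kind of harness (illustrative only: the file names, the answer_fn call, and the timeout value are placeholders, not our actual code):

    import json
    import logging

    logging.basicConfig(filename="hle_run.log", level=logging.INFO)

    def run_benchmark(answer_fn, questions_path="questions.jsonl",
                      results_path="results.jsonl", timeout_s=1800):
        # Resume support: skip every question that already has a logged result.
        done = set()
        try:
            with open(results_path) as f:
                done = {json.loads(line)["id"] for line in f}
        except FileNotFoundError:
            pass

        with open(questions_path) as qf, open(results_path, "a") as rf:
            for line in qf:
                q = json.loads(line)
                if q["id"] in done:
                    continue
                try:
                    # answer_fn stands in for the actual agent call; it is
                    # expected to enforce the per-question timeout itself.
                    answer = answer_fn(q, timeout_s=timeout_s)
                except Exception as e:
                    logging.error("question %s failed: %s", q["id"], e)
                    answer = None
                rf.write(json.dumps({"id": q["id"], "answer": answer}) + "\n")
                rf.flush()  # persist immediately so a crash never loses progress
                logging.info("finished question %s", q["id"])

Nothing fancy... but the append-and-resume part is exactly what matters when an overnight run dies halfway through.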

We also had to build multiple systems to ensure our infra could support a test like this.

You see, we weren’t just chasing a number.

We wanted an audit trail we could publicly stand behind.

But nothing can ever be perfect the first time.

Remember... we're a small team, not some billion-dollar lab in a secret location.

Our actual location is the basement.

That's why we had to face struggles many others can't even imagine:

  • The first time we ran HLE, a scheduled OS update almost nuked an overnight batch of questions.
  • We also had to keep up with hundreds of heavy questions, tight timeouts, reruns... hammering rate limits while we babysat uptime.
  • Our first time, Garrett didn't sleep for weeks making sure everything was running smoothly.
  • During our first attempt, we also restarted the test a few times... not because DeepWriter was weak, but because each run hardened it.

You see, big labs throw millions of dollars at this problem... but our budget was tighter than what most labs spend on lunch.

And it wasn’t just money. It was precision.

Remember when we said HLE uses binary grading?

(Aka the answer is either right or wrong). 

In other words, you can’t take partial credit or “close enough”.

We had cases where the reasoning was correct but an extra character or space in the final string meant a zero.

Brutal…but fair. That’s the game.

Here’s an example from HLE.

Question:

A number mask is a sequence of digits in which the special characters “?” and “*” can occur.

Official Answer: 1076020 10760200 107602000 1076020000 1576026930

Our Answer: 1076020 10760200 107602000 1076020000 1576026530

Notice our result was almost right... but we had one bad digit.

In that case, that question means no credit. 🙁

You see, HLE’s questions are extremely complex.

That's why early stress tests taught us a hard lesson: timing mattered.

Since DeepWriter's Abraxas engine is agentic by nature, we hardwired structured thought into its workflow.

For example, we learned that if an agent jumped too fast into one approach, the answers got worse. 

So we retrained its discipline. First, it would slow down and reason, and only then would it switch gears at the exact moment it mattered.

We also taught the system to challenge itself… to explore competing ideas and align on the one that made the most sense.

In other words, we built an AI that argues with itself before it answers.
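
Conceptually, that self-challenge loop looks something like the sketch below (the propose, critique, and judge callables are illustrative placeholders, not our actual pipeline):

    def answer_with_debate(question, propose, critique, judge, rounds=2):
        # propose/critique/judge are placeholder LLM-backed callables.
        candidates = [propose(question) for _ in range(3)]  # explore competing ideas
        for _ in range(rounds):
            reviews = [critique(question, c, [o for o in candidates if o is not c])
                       for c in candidates]
            # Revise each candidate in light of the objections raised against it.
            candidates = [propose(question, prior=c, feedback=r)
                          for c, r in zip(candidates, reviews)]
        return judge(question, candidates)  # commit to the answer that survived

The exact loop isn’t the point... the point is that nothing gets committed until it has survived its own critics.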

It took us countless iterations until we were ready for the final challenge: to deliver the smartest agentic AI on the planet.

We wanted to create a solution that gives users exactly what they need:

Cohesive, long-form documents with real insight and beautifully crafted charts.

In Garrett's words:

From the outside our solution may look simple.

At the end of the day, it's just one button that says "start generation".

But behind it, we run one of the most complex agent orchestrations on Earth. Sometimes, that means tens of millions of micro-steps per job.

And with all that, we paused the test only twice, for a few hours each time, so it wouldn’t churn overnight. For transparency, we logged everything in the trace logs below.

That’s what we call a solid system.

The benchmark forced us to look inward again and again.

Each review made us rethink the pipeline, the agents, and how they hand off work.

Then the quiet upgrades followed.

For example, DeepWriter now runs its own advanced coding experiments.

Instead of superficial search, we taught DeepWriter to properly evaluate sources and encouraged it to go down rabbit holes.

It sounds small... but that’s what a “smart” tool should do.

Here are some ways that DeepWriter stands out from other tools: 

  • While LLMs always regurgitate the consensus view, DeepWriter thinks laterally and questions everything.
  • Other Deep Research tools are limited to ground truth. But that doesn’t allow the discovery of new ideas. DeepWriter uses ground truth, but it’s capable of building upon it.
  • It can compare multiple pieces of data, run code experiments and complex simulations of all kinds.
  • While other products lose the script after a few pages (and are incapable of much longer outputs), DeepWriter can generate hundreds of pages and actually gets smarter the longer it writes.

These quiet gains made the tool much more powerful.

Also, we used Gemini 3.0 Pro as our base model for this benchmark (although we’re capable of using any base model out there).

So today...

DeepWriter's final HLE score is 50.91.

At first glance, 50.91 is just a number.

But here’s what it really means:

We had just beaten Grok 4 Heavy, GPT-5 Pro, Kimi K2-Thinking, Claude Opus 4.5, and many others.

For a team of three, that wasn’t just validation.

It was a statement.

You see, compared with a small company like us, big labs are giants with billions of dollars in R&D.

They're also guarded by an army of engineers.

DeepWriter?

We were the crew nobody would ever bet on. Some passionate and quirky geeks slipping in through the side door, armed with nothing but creativity and a lot of stubbornness.

We just refused to quit... and somehow, we cracked the safe.

That’s the power of obsession.

Building lean, we did what they couldn’t... and it felt so good!

DeepWriter's HLE score in practice and why it matters for the future of AI.

Benchmarks like HLE don’t care about marketing, branding, or hype cycles.

They don’t reward flashy demos or hand out bonus points for having billions in funding or a trillion parameters.

They expose one thing: can your system actually think?

For example: 

  • Can it take a medical study, compare it against historical data, and then explain whether a new treatment with a novel drug is more likely to succeed?
  • Can it read a 90-page financial report, spot the hidden risks, and explain them in plain English?
  • Can it analyze climate datasets, reconcile conflicting models, and predict which regions are most at risk over the next 18 months?

That’s the bar.

And on that metric, here’s where DeepWriter stands:

But here’s the twist: that’s just the surface.

What nobody tells you is that the real story isn’t the raw score. It’s how we got there.

Because DeepWriter isn’t just another chatbot bolted to an API.

It’s an agentic system that was built to reason, plan, use tools, and challenge its own answers before committing.

The secret behind the secret: DeepWriter's agentic lift and what it means to you

Truth moment.

We work hard... very hard.

But there’s no way three people beat billion-dollar labs without the one kind of leverage that matters in this era... agents.

Scoring high on HLE isn’t about raw model power.

It’s about what happens when you wrap that model in a disciplined agentic system that plans, uses tools, checks itself, and only then commits.

Look at the chart below:

The dark blue shows the base model.

(In our case, we used Gemini 3.0)

The light blue is the agentic lift... the extra performance you unlock when orchestration kicks in.

In our case, that lift is the difference between the base model’s score and DeepWriter’s final score (that’s the part shown in orange).

So what does “agentic” mean in plain English?

Think of LLMs as neurons:

They process signals, but each only sees a slice of the picture. They answer a question, but can’t reliably connect beyond their own scope.

DeepWriter's agentic nature is the growing brain that connects those neurons. It makes them question and reason together. 

As base LLMs improve, the whole brain upgrades with them.

So more connections and more reasoning mean more correct answers. They also mean tighter timelines and more complex results you can trust.

Transparency note:

We couldn't fit Grok 4 Heavy in our charts because they have a somewhat unconventional way of measuring things.

At the same time, excluding them would've been convenient for us, but not fair.

Based on their own charts, Grok’s curve shows:

  • 20.41 on shorter run setting
  • 50.5 on long run setting

It’s possible those numbers were achieved on a single pass that simply took more time. Timing isn’t published, so we can’t compare.

For reference, our single first pass scored 50.91.

Therefore, under Grok Heavy’s own settings, DeepWriter leads on all fronts:

Here's the most important part:

Where others bolt on tools, we built an architecture that turns a model into a researcher.

One that doesn’t just answer...

it reasons, checks, and converges on the truth.

This is why we’re not just competing with billion-dollar labs...

We’re outthinking them.

When AI agents answer real-world questions, there isn't a keyword they can look up, or some neat little article that gives the exact answer.

This is where most models collapse under pressure. They grab the first shiny shortcut and fail.

Folks call it "AI slop", and it's the low-quality output everyone and their dog hates.

But in our case, we slow down and work through each problem. Our system also challenges itself before committing.

That’s the difference between a chatbot that sounds smart and a research agent that, in fact, is smart.

In today's world, it’s not just about how a base model performs on its own...

It’s about what happens when you give it structure, discipline, and the ability to actually think.

That’s what the agentic nature is all about: reasoning.

Not guessing or parroting... reasoning.

Aka...
We just flipped the whole game.

For the last five years, the AI race has been measured in one dimension: size.

Think bigger models, datasets and GPU clusters.

This is because we needed size above all... until today.

You see, size isn’t intelligence. That's why even the biggest models can't ace Humanity’s Last Exam.

What HLE exposed is this: thinking > size

Because when the test demands reasoning, planning, and synthesis, sheer parameter count won’t save you.

You need architecture and smart agents capable of talking to each other.

This proves that intelligence is no longer the exclusive domain of trillion-dollar labs. Small teams can now compete (and even win) by building smarter, not bigger.

In our case, it may look like this:

With each new improved base model, our lift will be exponentially bigger.

This will give us the competitive advantage nobody else has.

For users, this means a coming shift: when it comes to hard problems, they'll walk away from chatbots that “sound” impressive toward agents that can withstand real-world complexity.

In practice, this means the future of AI won’t be decided by who has the most GPUs. It will be decided by who builds systems that can actually think.

And that levels the playing field.

Judge us by receipts.

Below you'll find our HLE results with time stamps.

The proof

Access DeepWriter's Humanity's Last Exam logs below (GitHub). This is the ultimate proof that DeepWriter is the most capable agentic AI system in the world.