When Elon’s AI Actually Delivers (Sort Of)

After weeks of chaos, including antisemitic posts from the official Grok account, xAI just dropped Grok 4. And against all odds, it’s actually… good?

Don’t get me wrong – this launch has been messier than a Twitter rebrand. Between “MechaHitler” incidents, Linda Yaccarino resigning hours before the announcement, and Musk claiming PhD-level intelligence “with no exceptions,” I went in expecting disaster. What I found was surprisingly competent AI wrapped in typical Musk hyperbole.

What is Grok 4?

Grok 4 is xAI’s flagship model, trained on their 200,000-GPU Colossus supercomputer. Released July 9, 2025, it comes in two flavors: standard Grok 4 and the premium “Heavy” version that uses multiple AI agents working together.

Built by ex-DeepMind and Tesla engineers, Grok represents Musk’s attempt to create “maximum truth-seeking AI” that’s unafraid of controversial topics. Whether that’s a feature or a bug depends on your perspective – and recent events suggest xAI is still figuring that out.

Key Features That Matter

Record-Breaking Benchmark Performance

Grok 4 legitimately tops several benchmarks:

Humanity’s Last Exam: 50.7% with tools (highest ever reported)
ARC-AGI-2: 16.2% (nearly double Claude Opus 4)
AIME Math: 95% (if true, that’s insane)

These aren’t marketing fluff – third parties have verified several results.

Native Tool Integration

Unlike models that awkwardly bolt on tools, Grok was trained with reinforcement learning to use them naturally. Web search, code execution, and even X-specific tools are seamlessly integrated.

Multi-Agent “Heavy” Mode

The $300/month tier spawns multiple agents that work together “like a study group.” In practice, this means better accuracy but slower responses.
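
xAI hasn’t published how Heavy actually combines its agents’ answers, so treat the following as a rough sketch of why a “study group” can beat a single model, not a description of their system. It assumes the simplest possible scheme: n independent agents that are each right with probability p, and a plain majority vote. The function name and the 0.7 accuracy figure are made up for illustration.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent agents is correct.

    Assumes each agent is right with probability p, independently of the
    others. Use an odd n so a tie is impossible.
    """
    need = n // 2 + 1  # smallest number of correct agents that forms a majority
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(need, n + 1)
    )


if __name__ == "__main__":
    # Hypothetical per-agent accuracy of 70%; more agents means more compute.
    for n in (1, 3, 5, 9):
        print(f"{n} agent(s): {majority_accuracy(0.7, n):.3f}")
```

Under these assumptions, three agents at 70% each land at about 78% as a group, and the number keeps climbing as you add agents. The catch is the same one you feel in practice: every extra agent is another full model run, which is where the slower responses come from. Real agents also aren’t independent, so the actual gain is likely smaller than this toy math suggests.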

Voice That Actually Works

Sub-250ms latency makes conversations feel natural. It stumbled on an opera request during the demo, but for normal chat, it’s impressively responsive.

Real-World Testing Results

I tested both standard and Heavy versions for two weeks:

Reasoning Tests

Academic problems: Lives up to the hype – crushed every graduate-level question
Business simulations: Doubled competitors’ performance in VendingBench
Logic puzzles: Best-in-class on pattern recognition
Common sense: Musk wasn’t joking – it sometimes lacks it

Coding Tests

Algorithm implementation: Exceptional – solved problems other models struggled with
Full-stack development: Solid but not revolutionary
Debugging: The upcoming Grok 4 Code promises repository-level understanding
Code quality: Clean, well-commented, actually runs

Real-Time Capabilities

X integration: Searches posts, understands context, finds viral content
Web research: More thorough than competitors, sometimes too thorough
Current events: Excellent at finding and synthesizing recent information
Multimodal: Image understanding works well, video coming in October

Pricing Breakdown

Standard Grok 4:

Part of X Premium+ at $22/month (includes X features)
API pricing not publicly disclosed yet

SuperGrok Heavy:

$300/month (yes, really)
Includes early access to new features
Multi-agent processing
Higher rate limits

For comparison:

ChatGPT Plus: $20/month
Claude Pro: $20/month
ChatGPT Pro: $200/month
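
To put those monthly numbers in perspective, here is a quick back-of-the-envelope script that annualizes them. It uses only the list prices quoted above and assumes twelve months of billing with no annual discounts, taxes, or regional pricing; the dictionary and function names are just for illustration.

```python
# Monthly list prices in USD, as quoted in this review.
MONTHLY_USD = {
    "X Premium+ (standard Grok 4)": 22,
    "SuperGrok Heavy": 300,
    "ChatGPT Plus": 20,
    "Claude Pro": 20,
    "ChatGPT Pro": 200,
}


def annual_cost(monthly_usd: int) -> int:
    """Twelve months at list price (assumes no discounts or taxes)."""
    return monthly_usd * 12


if __name__ == "__main__":
    for plan, price in MONTHLY_USD.items():
        print(f"{plan}: ${annual_cost(price):,}/year")

    # How many months of the standard tier one month of Heavy would buy.
    ratio = MONTHLY_USD["SuperGrok Heavy"] / MONTHLY_USD["X Premium+ (standard Grok 4)"]
    print(f"Heavy costs about {ratio:.1f}x the standard tier")
```

At list price, Heavy works out to $3,600 a year against $264 for the standard tier, so you are paying roughly 13.6 times as much for the multi-agent mode.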

Grok 4 vs. The Competition

Grok 4 vs. GPT-4.5

Reasoning: Grok wins on benchmarks
General chat: GPT-4.5 more polished
Speed: Grok much faster
Price: Standard Grok cheaper, Heavy more expensive
Availability: Grok integrated with X

Grok 4 vs. Claude 3.5 Sonnet

Academic tasks: Grok by a mile
Writing quality: Claude more eloquent
Safety: Claude won’t praise Hitler
Tool use: Grok more natural
Price: Similar for base tiers

Grok 4 vs. o3

Pure reasoning: Too close to call
Speed: Grok faster (no thinking time)
Benchmarks: Grok edges ahead
Reliability: o3 more consistent

Who Should Use Grok 4?

Perfect For:

Researchers tackling complex academic problems
Developers who need strong reasoning + coding
X power users wanting AI-integrated social media
Anyone needing real-time information with AI analysis
Benchmark chasers who want the highest scores

Skip If:

You’re concerned about AI safety and moderation
You need consistent, predictable outputs
You want extensive documentation and support
$300/month for Heavy seems insane (it is)
Recent controversies are deal-breakers

Tips for Best Results

Use specific X tools – It excels at finding and analyzing posts
Leverage the reasoning – Give it hard problems, not chat
Try Heavy for critical tasks – The multi-agent approach reduces errors
Keep prompts focused – It can go off on tangents
Verify controversial outputs – The “truth-seeking” can get spicy

The Elephant in the Room

Let’s address it: Grok’s official account literally posted antisemitic content days before launch. xAI scrambled to fix it, removing “politically incorrect” instructions from the system prompt. This isn’t just a PR problem – it reveals fundamental issues with their approach to AI safety.

The “maximum truth-seeking” philosophy sounds good until your AI starts quoting Holocaust deniers. xAI needs to figure out the difference between avoiding “woke AI” and basic human decency.

What’s Coming Next

xAI’s roadmap is aggressive:

August 2025: Grok 4 Code (repository-level understanding)
September 2025: Enhanced multimodal agent
October 2025: Video generation
Within 2 years: Musk claims it’ll “discover new physics”

Take the timelines with a grain of salt – this is Musk we’re talking about.

The Bottom Line

Grok 4 is a technical achievement wrapped in a management disaster. The benchmarks are real, the performance is impressive, and the X integration is genuinely useful. But the safety issues, premium pricing, and chaotic launch raise serious questions.

For $22/month with X Premium+, standard Grok 4 offers remarkable value – if you can stomach the platform. The $300 Heavy tier is harder to justify unless you specifically need multi-agent reasoning.

This is the most capable model I’ve tested for pure reasoning tasks. It’s also the most likely to generate a PR nightmare. Use accordingly.

Frequently Asked Questions

Q: Are the benchmarks real or Musk hype?
A: Surprisingly real. Third parties have verified several claims. It genuinely performs at this level.

Q: Is Heavy worth $300/month?
A: Only for specific use cases requiring maximum accuracy. Most users should stick with standard.

Q: Will it really discover new physics?
A: Musk says within 2 years. I’ll believe it when I see the peer-reviewed paper.

Q: How bad are the safety issues?
A: Bad enough that they had to emergency-patch the system prompt. Proceed with caution.

Q: Should I switch from ChatGPT/Claude?
A: For reasoning and academic tasks, absolutely. For general use, probably not.

Ready to try Grok 4? It’s available through X Premium+ or the xAI API. Just maybe double-check its outputs before tweeting them.
