Grok 4 Review

When Elon’s AI Actually Delivers (Sort Of)

After weeks of chaos, including antisemitic posts from the official Grok account, xAI just dropped Grok 4. And against all odds, it’s actually… good?

Don’t get me wrong – this launch has been messier than a Twitter rebrand. Between “MechaHitler” incidents, Linda Yaccarino resigning hours before the announcement, and Musk claiming PhD-level intelligence “with no exceptions,” I went in expecting disaster. What I found was surprisingly competent AI wrapped in typical Musk hyperbole.

What is Grok 4?

Grok 4 is xAI’s flagship model, trained on their 200,000-GPU Colossus supercomputer. Released July 9, 2025, it comes in two flavors: standard Grok 4 and the premium “Heavy” version that uses multiple AI agents working together.

Built by ex-DeepMind and Tesla engineers, Grok represents Musk’s attempt to create “maximum truth-seeking AI” that’s unafraid of controversial topics. Whether that’s a feature or bug depends on your perspective – and recent events suggest xAI is still figuring that out.

Key Features That Matter

Record-Breaking Benchmark Performance

Grok 4 legitimately tops several benchmarks:

  • Humanity’s Last Exam: 50.7% with tools (highest ever reported)
  • ARC-AGI-2: 16.2% (nearly double Claude Opus 4)
  • AIME Math: 95% (if true, that’s insane)

This isn’t just marketing fluff – third parties have verified several of the results.

Native Tool Integration

Unlike models that awkwardly bolt on tools, Grok was trained with reinforcement learning to use them naturally. Web search, code execution, and even X-specific tools are seamlessly integrated.

Multi-Agent “Heavy” Mode

The $300/month tier spawns multiple agents that work together “like a study group.” In practice, this means better accuracy but slower responses.
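To see why a "study group" can trade speed for accuracy, here is a toy sketch in Python. It rests on assumptions of mine, not on anything xAI has published: that the agents answer independently, that each is right with the same probability, and that the final answer is picked by simple majority vote. The names majority_vote and majority_accuracy are made up for illustration.

```python
from collections import Counter
from math import comb


def majority_vote(answers):
    """Return the most common answer (hypothetical aggregation rule).

    On a tie, the answer that appeared first in the list wins.
    """
    return Counter(answers).most_common(1)[0][0]


def majority_accuracy(p, n):
    """Chance that more than half of n agents are right.

    Assumes each agent is independently correct with probability p
    and that n is odd, so there is always a strict majority.
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


if __name__ == "__main__":
    print(majority_vote(["42", "41", "42"]))
    for n in (1, 3, 5):
        print(n, round(majority_accuracy(0.7, n), 3))
```

Under those assumptions, three agents that are each right 70% of the time get the majority answer right about 78% of the time, and five get it right about 84% of the time. Every agent has to finish before the vote, though, which fits the slower responses I saw. Real agents are unlikely to be fully independent, so treat the numbers as an upper-bound intuition rather than a measurement.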

Voice That Actually Works

Sub-250ms latency makes conversations feel natural. It stumbled on an opera request during the demo, but for normal chat, it’s impressively responsive.

Real-World Testing Results

I tested both standard and Heavy versions for two weeks:

Reasoning Tests

  • Academic problems: Lives up to the hype – crushed every graduate-level question
  • Business simulations: Doubled competitors’ performance in VendingBench
  • Logic puzzles: Best-in-class on pattern recognition
  • Common sense: Musk wasn’t joking – it sometimes lacks it

Coding Tests

  • Algorithm implementation: Exceptional – solved problems other models struggled with
  • Full-stack development: Solid but not revolutionary
  • Debugging: The upcoming Grok 4 Code promises repository-level understanding
  • Code quality: Clean, well-commented, actually runs

Real-Time Capabilities

  • X integration: Searches posts, understands context, finds viral content
  • Web research: More thorough than competitors, sometimes too thorough
  • Current events: Excellent at finding and synthesizing recent information
  • Multimodal: Image understanding works well, video coming in October

Pricing Breakdown

Standard Grok 4:

  • Part of X Premium+ at $22/month (includes X features)
  • API pricing not publicly disclosed yet

SuperGrok Heavy:

  • $300/month (yes, really)
  • Includes early access to new features
  • Multi-agent processing
  • Higher rate limits

For comparison:

  • ChatGPT Plus: $20/month
  • Claude Pro: $20/month
  • ChatGPT Pro: $200/month
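If you want the yearly picture, the arithmetic is easy to script. This is a quick Python sketch using only the monthly prices listed above; the plan labels and the helper names annual_cost and price_ratio are my own, and the figures ignore taxes, regional pricing, and annual-billing discounts.

```python
# Monthly prices in USD, taken from the lists above.
MONTHLY_USD = {
    "Grok 4 (X Premium+)": 22,
    "SuperGrok Heavy": 300,
    "ChatGPT Plus": 20,
    "Claude Pro": 20,
    "ChatGPT Pro": 200,
}


def annual_cost(plan):
    """Twelve months at the listed monthly price (no discounts assumed)."""
    return MONTHLY_USD[plan] * 12


def price_ratio(plan, baseline):
    """How many times more expensive `plan` is than `baseline`."""
    return MONTHLY_USD[plan] / MONTHLY_USD[baseline]


if __name__ == "__main__":
    for plan in MONTHLY_USD:
        print(f"{plan}: ${annual_cost(plan)}/year")
    print(price_ratio("SuperGrok Heavy", "ChatGPT Pro"))
```

At these prices Heavy runs $3,600 a year against $264 for standard Grok 4, and it costs 1.5 times what ChatGPT Pro does.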

Grok 4 vs. The Competition

Grok 4 vs. GPT-4.5

  • Reasoning: Grok wins on benchmarks
  • General chat: GPT-4.5 more polished
  • Speed: Grok much faster
  • Price: Standard Grok cheaper, Heavy more expensive
  • Availability: Grok integrated with X

Grok 4 vs. Claude 3.5 Sonnet

  • Academic tasks: Grok by a mile
  • Writing quality: Claude more eloquent
  • Safety: Claude won’t praise Hitler
  • Tool use: Grok more natural
  • Price: Similar for base tiers

Grok 4 vs. o3

  • Pure reasoning: Too close to call
  • Speed: Grok faster (no thinking time)
  • Benchmarks: Grok edges ahead
  • Reliability: o3 more consistent

Who Should Use Grok 4?

Perfect For:

  • Researchers tackling complex academic problems
  • Developers who need strong reasoning + coding
  • X power users wanting AI-integrated social media
  • Anyone needing real-time information with AI analysis
  • Benchmark chasers who want the highest scores

Skip If:

  • You’re concerned about AI safety and moderation
  • You need consistent, predictable outputs
  • You want extensive documentation and support
  • $300/month for Heavy seems insane (it is)
  • Recent controversies are deal-breakers

Tips for Best Results

  1. Use the X-specific tools – It excels at finding and analyzing posts
  2. Leverage the reasoning – Give it hard problems, not chat
  3. Try Heavy for critical tasks – The multi-agent approach reduces errors
  4. Keep prompts focused – It can go off on tangents
  5. Verify controversial outputs – The “truth-seeking” can get spicy

The Elephant in the Room

Let’s address it: Grok’s official account literally posted antisemitic content days before launch. xAI scrambled to fix it, removing “politically incorrect” instructions from the system prompt. This isn’t just a PR problem – it reveals fundamental issues with their approach to AI safety.

The “maximum truth-seeking” philosophy sounds good until your AI starts quoting Holocaust deniers. xAI needs to figure out the difference between avoiding “woke AI” and basic human decency.

What’s Coming Next

xAI’s roadmap is aggressive:

  • August 2025: Grok 4 Code (repository-level understanding)
  • September 2025: Enhanced multimodal agent
  • October 2025: Video generation
  • Within 2 years: Musk claims it’ll “discover new physics”

Take the timelines with a grain of salt – this is Musk we’re talking about.

The Bottom Line

Grok 4 is a technical achievement wrapped in a management disaster. The benchmarks are real, the performance is impressive, and the X integration is genuinely useful. But the safety issues, premium pricing, and chaotic launch raise serious questions.

For $22/month with X Premium+, standard Grok 4 offers remarkable value – if you can stomach the platform. The $300 Heavy tier is harder to justify unless you specifically need multi-agent reasoning.

This is the most capable model I’ve tested for pure reasoning tasks. It’s also the most likely to generate a PR nightmare. Use accordingly.

Frequently Asked Questions

Q: Are the benchmarks real or Musk hype?
A: Surprisingly real. Third parties have verified several claims. It genuinely performs at this level.

Q: Is Heavy worth $300/month?
A: Only for specific use cases requiring maximum accuracy. Most users should stick with standard.

Q: Will it really discover new physics?
A: Musk says within 2 years. I’ll believe it when I see the peer-reviewed paper.

Q: How bad are the safety issues?
A: Bad enough that they had to emergency-patch the system prompt. Proceed with caution.

Q: Should I switch from ChatGPT/Claude?
A: For reasoning and academic tasks, absolutely. For general use, probably not.

Ready to try Grok 4? It’s available through X Premium+ or the xAI API. Just maybe double-check its outputs before tweeting them.