Evaluating Google's New Gemini Language Model: How Does It Stack Up Against GPT-3.5 and GPT-4?

Google recently unveiled its new Gemini language model, claiming it can rival OpenAI's GPT-3.5 and GPT-4 models in language understanding and generation. But how does Gemini actually perform against these leading AI systems?

Researchers from Carnegie Mellon University and BerriAI set out to answer that question by benchmarking Gemini against GPT-3.5 Turbo, GPT-4 Turbo, and Mixtral on 10 diverse language tasks. Their goal was an impartial, in-depth analysis of Gemini's strengths and weaknesses.

The Tests: A Range of Language Abilities

The researchers tested Gemini Pro (comparable to GPT-3.5), GPT-3.5 Turbo, GPT-4 Turbo, and the open-source Mixtral model. The test suite required strong language understanding, reasoning, and generation abilities.

The Results: Gemini Lags Behind GPT-3.5 and GPT-4 Overall

Across all the benchmarks, Gemini Pro performed worse than GPT-3.5 Turbo and significantly worse than GPT-4 Turbo. However, it did surpass the open-source Mixtral model on every task.

Table: Main results of the benchmarking. The best model is listed in bold, and the second-best model is underlined.

Image Source: Akter, Syeda Nahida, et al. "An In-depth Look at Gemini's Language Abilities." arXiv preprint arXiv:2312.11444 (2023).
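
The bold and underline convention used in the results table is easy to reproduce for any set of benchmark scores. The following is a minimal sketch in Python, assuming a hypothetical SCORES dictionary that maps each task to per-model accuracy. The task names, model names, and numbers are placeholders invented for illustration; they are not results from the paper, and the helper names rank_models and format_row are likewise made up here.

```python
from typing import Dict, List, Tuple

# Placeholder accuracies for illustration only (NOT results from the paper).
SCORES: Dict[str, Dict[str, float]] = {
    "task_1": {"model_a": 0.62, "model_b": 0.71, "model_c": 0.55},
    "task_2": {"model_a": 0.48, "model_b": 0.45, "model_c": 0.52},
}


def rank_models(task_scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (model, score) pairs sorted from best to worst."""
    return sorted(task_scores.items(), key=lambda kv: kv[1], reverse=True)


def format_row(task: str, task_scores: Dict[str, float]) -> str:
    """Render one table row, marking the best score with ** and the runner-up with _."""
    ranked = rank_models(task_scores)
    best = ranked[0][0]
    second = ranked[1][0] if len(ranked) > 1 else None

    cells = []
    for model in sorted(task_scores):  # fixed column order: alphabetical by model name
        text = f"{task_scores[model]:.2f}"
        if model == best:
            text = f"**{text}**"   # stands in for bold
        elif model == second:
            text = f"_{text}_"     # stands in for underline
        cells.append(f"{model}={text}")
    return f"{task}: " + ", ".join(cells)


if __name__ == "__main__":
    for task, task_scores in SCORES.items():
        print(format_row(task, task_scores))
```

Marking the top two models per task, rather than reporting only an overall average, makes it easier to see whether one model leads everywhere or only on certain kinds of tasks.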

The key finding: Gemini Pro achieved accuracy comparable to, but slightly below, GPT-3.5 Turbo overall. The researchers concluded that it still has weaknesses to address but also exhibits strengths in handling complexity and reasoning depth.

The Takeaways: Closing the Gap on GPT-3.5 and GPT-4

While Gemini does not yet match GPT-3.5 or surpass GPT-4 as claimed, this analysis provides an objective look at where Google's model excels and where it needs improvement.

With fine-tuning, Gemini's upcoming Ultra version may close the gap and provide true competition to these other AI systems. But more impartial testing will be needed to verify its capabilities across a diverse range of language understanding and generation tasks.

Citation: Akter, Syeda Nahida, et al. "An In-depth Look at Gemini's Language Abilities." arXiv preprint arXiv:2312.11444 (2023).

