June 5, 2024
|
7 mins to read

How does GPT-4o measure up against its competitors?

OpenAI has recently released their new model GPT-4o, so we're comparing and contrasting three Gen AI models to see how they measure up.
Cameron Flanagan
AI Engineer

If you’ve been paying any attention to Artificial Intelligence, you’ll recognise how quickly the news surrounding it moves. The AI landscape of 2022 – the launch of ChatGPT, how novel it was, and the knock-on effects it caused – seems almost unrecognisable compared to what we’re faced with today.

With multiple vendors and companies competing to be placed first in the AI Race, it sometimes feels like new developments are being announced faster than we humans can keep up with them.

One such development comes from the aforementioned poster child of Artificial Intelligence, OpenAI (the company behind ChatGPT). Their latest release, “GPT-4 Omni” (or “GPT-4o”, as it’s also known), has been the topic of a lot of discussion.

At Thrive, we believe it’s important for our collective future and safety to keep abreast of new developments in the AI landscape (as quickly as they might be moving) so in this blog, we’ll be comparing GPT-4o to Google’s Gemini Pro 1.5 and Meta’s Llama 3.

Read on to discover their features, key differentiators and potential limitations – and see how all three models compare.

GPT-4o

“GPT-4 Omni” builds upon OpenAI’s GPT-4 Turbo model.

Similar to Google (who we’ll discuss shortly), OpenAI has reportedly implemented a Mixture of Experts (MoE) architecture, resulting in a model that is “half as cheap, and twice as fast” as GPT-4 Turbo. The model also supports a context window of 128,000 tokens.

Beyond this, it boasts better vision and audio understanding, improved performance in non-English languages, the futuristic ability to detect users’ emotions, and – as one user pointed out – the “unspoken” ability to act as an AI Girlfriend.

Despite being reminiscent of Spike Jonze’s sci-fi film Her – a resemblance underscored by the controversy over a demo voice strikingly similar to Scarlett Johansson’s, allegedly used without her permission – reception to GPT-4o has, broadly speaking, been marked by awe and praise.

But it’s not without its critics. Some are understandably concerned – as many have been since the advent of AI – about the ethical questions that this new development brings up.

As pointed out in a Forbes article, ChatGPT's policies have historically been "contrary to the EU General Data Protection Regulation (GDPR)", and GPT-4o does little to temper those concerns. With capabilities that walk the line between awe-inspiring and dystopian – including the ability to access users’ screens, and the fact that the model uses personal data as training information – privacy advocates are hesitant to celebrate the new release.

For their part, OpenAI has responded to concerns by closely monitoring any potential issues: “We are working with red teamers – domain experts in areas like misinformation, hateful content, and bias – who will be adversarially testing the model.”


Key differentiators:

Multimodal processing and emotional intelligence: GPT-4o is able to process multimodal inputs and gather an understanding of emotional cues, tone, and sentiment in language, allowing for empathetic responses.

Real-time multimodal generation: This model is the only one of the three able to both process and generate all of these media formats.

Potential limitations:

Early development stages: The model is a recent release, so it may be prone to bugs, and some features are still in alpha testing ahead of wider release.

Customisation: The model is only accessible via the OpenAI or Azure APIs, so the ability to optimise and access the model is restricted by those vendors.
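Access through either vendor looks much the same at the wire level: a JSON body POSTed to a chat-completions endpoint. Here's a minimal sketch of building such a request – the endpoint and field names follow OpenAI's documented API, while the prompt text and parameter values are purely illustrative:

```python
import json

# Minimal sketch of a Chat Completions request for GPT-4o.
# The payload is POSTed to https://api.openai.com/v1/chat/completions
# with an "Authorization: Bearer <API key>" header; the Azure OpenAI
# endpoint accepts an equivalent body at its own URL.

def build_chat_request(prompt: str, model: str = "gpt-4o") -> str:
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0.7,  # sampling temperature; lower = more deterministic
        "max_tokens": 256,   # cap on billed output tokens
    }
    return json.dumps(payload)

body = build_chat_request("Summarise GPT-4o's key features in one sentence.")
print(body)
```

Because both vendors meter usage per input and output token, the `max_tokens` cap doubles as a cost control.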

Now that we’ve explored the most recent development, how does it compare to other LLMs in the AI Race? Let’s see how the other contenders measure up.

Google’s Gemini Pro 1.5

Announced in February of this year, prior to the splashy release of GPT-4o, Google’s Gemini Pro 1.5 is the latest step in the company’s AI journey. This model maintains Gemini's multi-modal capabilities while delivering enhanced performance and efficiencies through a new Mixture of Experts (MoE) architecture.

Unlike traditional Transformers, which operate as a single large neural network, MoE models divide the network into smaller "expert" neural networks and activate only the relevant ones for each input. The most notable advancement, though, is the significant increase in context window size: from 32,000 tokens to up to 1 million tokens in production.
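The routing idea behind MoE can be shown in a few lines. This is a toy sketch (random weights, four experts, top-2 routing – none of these figures reflect Gemini's actual architecture, which is not public): a gating network scores the experts, and only the highest-scoring ones run, which is why an MoE model can be cheaper per token than a dense network of the same total size.

```python
import math
import random

random.seed(0)

NUM_EXPERTS, TOP_K, DIM = 4, 2, 8

# Each expert is a random linear map (DIM -> DIM); the gate maps the
# input to one score per expert. Real models learn these weights.
experts = [[[random.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]
           for _ in range(NUM_EXPERTS)]
gate = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def softmax(xs):
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x):
    scores = softmax(matvec(gate, x))
    # Route to the k highest-scoring experts only; the rest stay idle.
    top = sorted(range(NUM_EXPERTS), key=lambda i: scores[i], reverse=True)[:TOP_K]
    weight_sum = sum(scores[i] for i in top)
    out = [0.0] * DIM
    for i in top:
        y = matvec(experts[i], x)
        out = [o + (scores[i] / weight_sum) * yi for o, yi in zip(out, y)]
    return out, top

x = [random.gauss(0, 1) for _ in range(DIM)]
y, chosen = moe_forward(x)
print(f"routed to experts {chosen}; output dim {len(y)}")
```

Only two of the four experts do any work per input, so roughly half the parameters sit idle on any given token – the same principle, at a vastly larger scale, behind MoE's efficiency gains.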

This means Gemini can now process vast amounts of data, including hundreds of pages of text, an hour of video, or up to eleven hours of audio – a game-changing development. As well as being multi-modal, Gemini is multilingual, able to understand 36 languages. As it's a closed-source model, we don't have specific details about its size and architecture – but in the words of Google DeepMind CEO Demis Hassabis, this new development offers “dramatically enhanced performance.”

“Our latest innovations in model architecture allow Gemini 1.5 to learn complex tasks more quickly and maintain quality, while being more efficient to train and serve. These efficiencies are helping our teams iterate, train and deliver more advanced versions of Gemini faster than ever before, and we’re working on further optimizations.”

- Demis Hassabis


But what of the ethical concerns? There are a few, including one from a journalist whose request for Gemini Pro 1.5 to translate Georgian to English was incorrectly flagged as “unsafe content” (perhaps revealing a bug within the model), and data privacy and security concerns raised by a model that relies on so much data for training.

Google’s official statement on these concerns:

“In advance of releasing 1.5 Pro, we've taken the same approach to responsible deployment as we did for our Gemini 1.0 models, conducting extensive evaluations across areas including content safety and representational harms, and will continue to expand this testing. Beyond this, we’re developing further tests that account for the novel long-context capabilities of 1.5 Pro.”

Key differentiators:

Context window: The model's extensive context window lets users upload entire manuals as part of their prompt, so the model is well informed before providing decisions or answers.

Multilingual and multimodal: Gemini Pro 1.5 demonstrates high capability in working across different languages and formats, catering to diverse requirements.
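To make the context-window differentiator concrete, here's a rough budgeting sketch: will a pile of reference documents fit alongside your prompt in a 1,000,000-token window? The 4-characters-per-token figure is a common rule of thumb for English text, not an exact tokeniser, and the reserve for output is an arbitrary illustrative choice:

```python
CONTEXT_WINDOW = 1_000_000  # Gemini 1.5 Pro's production window
CHARS_PER_TOKEN = 4         # rough rule of thumb for English text

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // CHARS_PER_TOKEN)

def fits_in_context(documents, prompt, reserve_for_output=8_000):
    """Return (fits, tokens_used) for a prompt plus attached documents."""
    used = estimate_tokens(prompt) + sum(estimate_tokens(d) for d in documents)
    return used + reserve_for_output <= CONTEXT_WINDOW, used

manual = "lorem ipsum " * 50_000  # ~600k characters, roughly 150k tokens
ok, used = fits_in_context([manual], "Summarise the manual.")
print(ok, used)
```

A whole product manual fits with room to spare – which is exactly the workflow the large window enables.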


Potential limitations:

Flexibility: Being exclusively available via Google Cloud APIs may restrict users to Google's pricing and fine-tuning limits, potentially limiting flexibility in usage and cost management.

Text generation only: Despite its ability to handle multimodal inputs, the model currently only generates text as an output, limiting its versatility in producing other types of content such as images or audio.

Meta’s Llama 3

Unable to resist throwing their hat in the ring, Meta released their AI Assistant powered by Llama 3 in April 2024.

The open-source LLM offering is available in two sizes: a 70 billion parameter version and an 8 billion parameter version. While the former offers greater capability and performance, it also comes with a higher price tag.

Architecturally, Llama 3 follows a standard decoder-only Transformer architecture, with no major surprises in its release – although Meta has made significant improvements in how the model understands and processes text. These advancements include improved text representations via rotary positional embeddings (RoPE), as well as Grouped Query Attention (GQA).

GQA speeds up inference by grouping the model's query heads so that each group shares a single set of key and value heads. This shrinks the memory footprint of attention with little loss in quality, making the model more efficient at generating accurate responses.
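A back-of-the-envelope calculation shows why that sharing matters. During generation, a model keeps a "KV cache" of past keys and values; GQA shrinks it by the grouping factor. The figures below are illustrative (32 query heads sharing 8 KV heads, 32 layers), not Llama 3's exact configuration:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    # Two tensors (K and V) per layer, each of shape
    # [kv_heads, seq_len, head_dim], stored at 2 bytes (fp16).
    return 2 * layers * kv_heads * seq_len * head_dim * bytes_per_value

LAYERS, Q_HEADS, KV_HEADS, HEAD_DIM, SEQ = 32, 32, 8, 128, 8192

mha = kv_cache_bytes(LAYERS, Q_HEADS, HEAD_DIM, SEQ)   # one KV head per query head
gqa = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ)  # 8 KV heads shared by 32 query heads

print(f"MHA KV cache: {mha / 2**30:.2f} GiB, GQA: {gqa / 2**30:.2f} GiB "
      f"({mha // gqa}x smaller)")
```

Quartering the KV cache means longer sequences and bigger batches fit on the same hardware, which is where the speed-up comes from.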

Llama 3 also boasts a 100% increase in context window over its predecessor, doubling the default size from 4,096 to 8,192 tokens.

Key differentiator

Open source: Llama 3 is the only model of the three that is open source, making it highly customisable in terms of optimising and deploying. It also has a large developer community that shares their findings and optimisations.

Potential limitations

Not multilingual or multimodal: The model is currently limited to just text inputs, and performs poorly on languages other than English. Plans for these features are in Meta’s development roadmap for future releases.

Technical expertise: To get the most out of Llama 3, someone is required to optimise the model for a given use case – and this necessitates a certain level of technical expertise. However, the default model is available via multiple cloud APIs.

How do the models compare?

Getting down to brass tacks: Now that we know a little more about them, how do these three models measure up?

When it comes to capabilities, the models we’ve discussed in this blog are some of the largest and most capable currently available on the market.

All three of them can complete text generation and text-understanding tasks to a very high standard. They are also all able to summarise, understand, generate and classify text in real time, achieving high performance across common benchmarks like MMLU and GPQA.

Both GPT-4o and Gemini 1.5 Pro can perform tasks in multiple languages, while Llama is currently limited to English, with plans to expand its linguistic capabilities in the future. Unlike the other two models, Llama lacks multimodal capabilities and can only process text inputs. In contrast, Gemini and GPT-4o can handle audio, video, and images alongside text, with GPT-4o even being able to generate these media formats.

Despite Llama's limitations in multilingual and multimodal capabilities, its open-source nature and strong developer community make it a great candidate for certain use cases. The ability to fine-tune the model in great detail allows for optimal performance in specific tasks, such as generating text in a particular tone of voice or generating content for a specific industry or level of expertise.

While all the above models are capable of performing these tasks, their performance may vary. The flexibility of open-source models like Llama offers more opportunities for optimization and improved performance.

As with all things AI-related, our chief concern should always be the ethical implications. We’re particularly interested in this aspect of the conversation, and have explored it through our blogs The ethical implications of AI in the workplace and Is AI a threat or an opportunity?

At Thrive, we embrace AI’s capabilities with cautious optimism. While we want to be continuously innovating and evolving, we’re also cognisant of AI’s risks – and we want to make sure we have processes in place to mitigate those risks in order to be fully compliant.

Usability and support

Both OpenAI's and Google's APIs offer comprehensive documentation with ample examples, making them easy to use and understand. Llama, when accessed through services like AWS Bedrock or Google Cloud, also boasts thorough documentation. Additionally, Meta provides extensive guides on fine-tuning, model access and setting up the model locally, should you choose to self-host Llama rather than use an API.

Documentation and community

All three providers foster active developer communities. GPT and Gemini have dedicated forums where users can ask questions, share findings, and help each other. Llama benefits from similar channels, but its community also stands out by sharing fine-tuned model variants for more technical audiences to reuse.


Ease of use and access

All models offer playground environments for experimentation before committing to API usage. GPT and Gemini provide straightforward sign-in processes, allowing users to generate API keys and access the models. Billing information is required for usage beyond the free tier. Llama follows a similar approach, offering access through cloud providers of choice for the default model. However, deploying and integrating fine-tuned models for specific purposes requires more technical expertise.

Cost


(Token definition: Tokens are the fundamental units of data processed by LLMs. They can represent words, parts of words, or even characters, depending on the tokenization method used.)

Pricing Models:

Gemini 1.5 Pro:

Cost per million input tokens: $3.50

Cost per million output tokens: $10.50

Llama 3 70B (Amazon Web Services API example):

Cost per million input tokens: $2.65

Cost per million output tokens: $3.50

GPT-4o

Cost per million input tokens: $5.00

Cost per million output tokens: $15.00
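Plugging the prices above into a quick calculation shows how they play out for a concrete workload. The 10M-input / 2M-output monthly volume is an invented example, and the prices are as quoted at the time of writing – they change often:

```python
PRICES = {  # USD per 1M tokens: (input, output), as listed above
    "Gemini 1.5 Pro": (3.50, 10.50),
    "Llama 3 70B (AWS)": (2.65, 3.50),
    "GPT-4o": (5.00, 15.00),
}

def monthly_cost(model, input_millions, output_millions):
    input_price, output_price = PRICES[model]
    return input_millions * input_price + output_millions * output_price

# Hypothetical workload: 10M input tokens and 2M output tokens per month.
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 10, 2):,.2f}")
```

On these list prices, Llama 3 70B via AWS comes out cheapest and GPT-4o dearest, with output tokens dominating the gap.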

GPT and Gemini: Both models offer API access exclusively, charging per input and output token.

Llama: Provides greater flexibility. You can leverage cloud providers like AWS for API access or deploy it on your own compute resources, paying by the number of running hours.

Scalability

Each API has rate limits, generally allowing hundreds of requests and hundreds of thousands of tokens per minute. Gemini boasts the highest limits, with a capacity of 1,000 requests and 1 million tokens per minute.
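When you do hit those limits, the standard remedy is to retry with exponential backoff and jitter. A minimal sketch – `RateLimitError` here is a stand-in for whatever exception your client library raises on an HTTP 429, and the delays are shortened for illustration:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for a client library's 429 rate-limit exception."""

def with_backoff(call, max_retries=5, base_delay=0.01):
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            # Exponential delay with jitter to avoid synchronised retries.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

# Demo: a fake endpoint that fails twice, then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return "ok"

print(with_backoff(flaky))
```

The same wrapper works unchanged around any of the three vendors' client calls, since all of them signal throttling with a retryable error.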

Long-term viability and support

The APIs provided by major tech companies are highly available and offer help portals with active developer communities. Llama's support capability depends on the hosting vendor or API provider.
