• author: AI Explained

Orca: Microsoft's Open Source Model that Beats GPT-3.5

Less than two weeks ago, a paper concluded that open source models were only able to mimic the style, but not the factuality, of ChatGPT. However, just 48 hours ago, Microsoft released a 51-page report on Orca, a 13-billion-parameter model that beats GPT-3.5 and matches GPT-4 in several tests of reasoning. In this article, we will explore Orca in detail: its development, and how it performs better than other models such as LLaMA, Alpaca, and Vicuna.

Development of Orca

Orca is named after killer whales, which are frequent visitors to South American coastlines. All the research behind Orca was done by Microsoft. According to the abstract, Orca was developed because work on other models such as LLaMA, Alpaca, and Vicuna lacked rigorous evaluation. This led to the capability of these small models being overestimated, since they tend to learn to imitate the style, but not the reasoning process, of large foundation models (LFMs).

Orca is a 13-billion-parameter model that learns to imitate the reasoning process of larger models. It learned by looking at GPT-4's step-by-step thought process and was guided by teacher assistance from ChatGPT, which is GPT-3.5.

Comparison with Other Models

Orca outperforms conventional state-of-the-art models such as Vicuna by more than 100% on complex zero-shot reasoning benchmarks like Big-Bench Hard and by 42% on AGIEval. It reaches parity with ChatGPT on Big-Bench Hard and shows competitive performance in professional and academic examinations such as the SAT, LSAT, GRE, and GMAT.

Moreover, Orca surpasses Vicuna on the Vicuna evaluation set and matches text-davinci-003 on the SAT, LSAT, GRE, and GMAT. More intriguing still, these results were zero-shot, without chain-of-thought prompting or any other advanced methods.
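To make the relative figures in this section concrete: an improvement of more than 100% means the score more than doubled. The short sketch below shows the arithmetic. The function name and the two scores are hypothetical, chosen only for illustration; they are not results from the Orca report.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Return the improvement of `new` over `baseline` as a percentage."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline * 100.0


# Hypothetical benchmark scores, used only to illustrate the calculation.
baseline_score = 20.0
new_score = 45.0

print(f"{relative_improvement(baseline_score, new_score):.1f}%")  # prints 125.0%
```

A relative gain says nothing about the absolute level: doubling a low score can still leave a model well short of a stronger one.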

How Orca Learns

Orca leverages system instructions that ask models like GPT-4 and ChatGPT to think step by step, which leads to much richer explanations. This gives smaller models such as Orca access to detailed responses in which the teacher explains its reasoning process as it generates the answer.
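The idea of pairing each query with a system instruction and storing the teacher's explained answer can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the system messages, the `build_example` helper, and the `stub_teacher` function are all hypothetical, and a real setup would call an actual teacher model where the stub is used.

```python
from typing import Callable, Dict, List

# Hypothetical system instructions; the wording used for Orca may differ.
SYSTEM_MESSAGES: List[str] = [
    "You are a helpful assistant. Think step by step and justify your answer.",
    "Explain your reasoning as if you were teaching a beginner.",
]

# A teacher takes (system_message, query) and returns an explained answer.
Teacher = Callable[[str, str], str]


def build_example(system_message: str, query: str, teacher: Teacher) -> Dict[str, str]:
    """Bundle the instruction, the query, and the teacher's explanation."""
    response = teacher(system_message, query)
    return {"system": system_message, "user": query, "assistant": response}


def stub_teacher(system_message: str, query: str) -> str:
    """Stand-in for a large teacher model (assumption: no real API call)."""
    return f"Step 1: restate the question ({query}). Step 2: reason it through."


example = build_example(SYSTEM_MESSAGES[0], "Is 17 a prime number?", stub_teacher)
print(example["assistant"])
```

The student model would then be fine-tuned on many such records, so it sees the explanation and not only the final answer.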

The developers used the Flan collection, which was released by Google in February. It focuses on balancing the kinds of prompts and tasks on which the language model is fine-tuned, so that the tasks are as complex and diverse as possible.

Differences in Size

Orca's 13 billion parameters are about 7% of the size of GPT-3, which has 175 billion parameters, and possibly around 1-2% of GPT-4's size. This difference in size means it can be run on much smaller devices, like a desktop or possibly even a laptop.
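The size comparison is simple arithmetic, sketched below. The parameter counts come from the text above. The memory estimate is an added assumption: it supposes weights stored at 16-bit precision (2 bytes per parameter) and ignores activations and other runtime overhead, so real requirements would be higher.

```python
ORCA_PARAMS = 13e9    # 13 billion parameters
GPT3_PARAMS = 175e9   # 175 billion parameters
BYTES_PER_PARAM_FP16 = 2  # assumption: 16-bit weights


def size_ratio_percent(small: float, large: float) -> float:
    """Return the size of `small` as a percentage of `large`."""
    return small / large * 100.0


def weight_memory_gb(params: float, bytes_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return params * bytes_per_param / 1e9


print(f"{size_ratio_percent(ORCA_PARAMS, GPT3_PARAMS):.1f}%")              # prints 7.4%
print(f"{weight_memory_gb(ORCA_PARAMS, BYTES_PER_PARAM_FP16):.0f} GB")     # prints 26 GB
print(f"{weight_memory_gb(GPT3_PARAMS, BYTES_PER_PARAM_FP16):.0f} GB")     # prints 350 GB
```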


In conclusion, Orca is Microsoft's open source model that beats GPT-3.5 in several tests of reasoning and matches GPT-4 in some of them. It outperforms Vicuna on the Vicuna evaluation set and shows competitive performance in professional and academic examinations. Orca was developed by leveraging system messages that get GPT models to think step by step, leading to much richer explanations. Orca used the Flan collection to make tasks as complex and diverse as possible, and it can be run on smaller devices due to its smaller size. The potential of Orca and its possible impact on future generations of open source models is worth watching.

The Performance of Orca Language Model in Various Benchmarks

When it comes to evaluating the performance of language models like Orca, several benchmarks are used to assess their capabilities. Here, we will look at some of the results of Orca's performance in various benchmarks.

Open-Ended Generation

In terms of open-ended generation, Orca was found to retain 95% of ChatGPT's quality and 85% of GPT-4's quality, as assessed by GPT-4. However, using GPT-4 as an assessor may not always produce reliable results, because it has a positive bias towards the response of the first model in the comparison set. Despite this, Orca outperforms Vicuna by a large margin and is competitive with text-davinci-003.
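One common way to reduce this kind of position bias is to ask the judge twice, with the two answers in swapped order, and average the scores. The sketch below illustrates that idea only; it is not described in this article as the report's method. The `biased_stub_judge` function is a hypothetical stand-in for a GPT-4 call and simply adds a bonus to whichever answer comes first.

```python
from typing import Callable, Tuple

# A judge takes (prompt, first_answer, second_answer) and scores both answers.
Judge = Callable[[str, str, str], Tuple[float, float]]


def debiased_scores(judge: Judge, prompt: str, answer_a: str, answer_b: str) -> Tuple[float, float]:
    """Score both answers in both orders and average, to cancel position bias."""
    a_first, b_second = judge(prompt, answer_a, answer_b)
    b_first, a_second = judge(prompt, answer_b, answer_a)
    return (a_first + a_second) / 2, (b_first + b_second) / 2


def relative_quality(candidate: float, reference: float) -> float:
    """Candidate score as a percentage of the reference score."""
    return candidate / reference * 100.0


def biased_stub_judge(prompt: str, first: str, second: str) -> Tuple[float, float]:
    """Hypothetical judge: scores by length, plus a 1-point bonus for the first slot."""
    return len(first) + 1.0, float(len(second))


score_a, score_b = debiased_scores(biased_stub_judge, "question", "abcd", "abcd")
print(score_a, score_b)  # prints 4.5 4.5
print(f"{relative_quality(score_a, score_b):.0f}%")  # prints 100%
```

With a single pass, the stub judge would rate two identical answers 5.0 and 4.0; averaging over both orders removes that gap.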

Multiple Choice Questions

Orca was also evaluated on multiple choice questions that objectively test its reasoning capabilities. These tests are quite challenging, even for advanced language models, with only a few models achieving perfect scores. Orca outperforms Vicuna by a large margin and is very competitive with text-davinci-003. However, it is important to note that, overall, Orca lags behind GPT-4, although all of these results are zero-shot.

Big Bench Hard

Big-Bench Hard is a benchmark specifically designed for language models, with 23 challenging tasks where human raters still outperform language models. Orca was found to massively outperform the previous best open-source model, Vicuna, and to beat even ChatGPT on average. Although it still lags behind GPT-4, Orca outperforms it on a few tasks such as Web of Lies and Temporal Sequences.

Tool Augmentation

One way to further improve Orca's performance is through tool augmentation, where larger models create tools that smaller models can use more efficiently. When given tools created by GPT-4 or better language models, Orca's performance improves dramatically across a range of tasks.

These results are a baseline. The authors mention other ways that Orca could be improved, such as in-context learning, few-shot learning, and advanced prompting techniques like chain-of-thought prompting, none of which have been tested yet.


In conclusion, Orca has proven to be a highly capable language model with excellent performance in various benchmarks. While it may not outperform GPT-4 overall, it still outperforms other open-source models by a significant margin. Additionally, with the implementation of tool augmentation and testing in other contexts such as multi-turn conversations, Orca's capabilities could be further enhanced, making it an exciting development in the field of natural language processing.

Orca: Improving Language Models with Step-by-Step Explanations

Language models have reached impressive levels of performance, but they still have room for improvement. Recently, Microsoft introduced Orca, a method that shows how learning from step-by-step explanations can significantly enhance the quality of language models, regardless of their size. While the model is not yet publicly available, it offers promising insights on how to develop more robust evaluation methods and to make better use of powerful models like GPT-4 as teachers.

Why Did Microsoft Conduct This Research?

There is some speculation as to why Microsoft developed this method in the first place. Some researchers suggest that Microsoft wanted to test whether large language models can be imitated cheaply. If they can, that could affect future investments in GPT-5 or GPT-6. Still, the motivation behind this research remains unclear.

Insights and Suggestions for Future Research

The authors of the research paper suggest that there is still much to explore in the development of language models. They recommend considering Orca's insights when designing alignment and post-training techniques, as well as evaluation methods. They hope that improved evaluation methods will result in more effective use of language models like GPT-4 as teachers. Additionally, the authors suggest that ChatGPT could work as an intermediate teacher.

Open-source vs. Private Models

There are differing opinions on how the open-source versus private debate affects the future of language models. According to Ilya Sutskever of OpenAI, the divide between open-source and private models is growing. As the amount of effort required to produce a neural net keeps increasing, it will become harder for open-source efforts to reproduce models like GPT-4. In contrast, Sam Altman, also of OpenAI, thinks that even if open-source models catch up, OpenAI will always have a different kind of moat. He suggests that OpenAI's moat is not just about having a model that is hard to copy, but also about figuring out what comes next.


Orca has shown that learning from step-by-step explanations can improve the quality of language models. This research will inform the development of more effective evaluation methods and techniques for language models' alignment and post-training. The open-source vs. private models debate continues to play out within the artificial intelligence community. While there will always be a gap between open-source and private models, the discourse around the topic will continue to spark innovation within the development and application of language models.
