LLMs are very complex. Keeping up with this fast-changing field requires understanding how large language models work, along with terms such as tokens, parameters, weights, inference, and fine-tuning.
In this article, we will explore what large language models are, what makes them "large", and the difference between private and open-source LLMs. We will also examine the key terms needed to understand this disruptive technology. Lastly, we will discuss what to consider when deploying LLMs in your organization.
Machine learning includes many kinds of models, but with the growth of generative AI in recent years, large language models (LLMs) have become a hot topic. That is because LLMs are currently the only computer programs capable of understanding and generating human language with accuracy that comes close to our own.
The underlying foundation is a deep learning neural network with an encoder-decoder architecture. This enables LLMs to learn relationships between words and other data during self-supervised or semi-supervised training, and to use those relationships to process human language and generate human-like text or other content.
These models are also trained on extensive amounts of data, which is where the name "large" comes from: the more data the model learns from, the better its output usually is.
LLMs are a subset of deep learning, which is a subset of machine learning, which is a subset of AI. Let me explain:
Artificial intelligence (AI) – refers to computer systems, i.e. algorithms, that can perform tasks with human-like intelligence. AI has many sub-categories, such as machine learning, deep learning, natural language processing, expert systems, robotics, machine vision, and speech recognition, each covering a different kind of AI process.
Machine learning is a subset of AI: computer algorithms capable of performing specific tasks, such as making predictions and decisions, based on whatever data they have access to.
Deep learning is a subset of machine learning in which models process data in a way that mimics the human brain, using what is known as a neural network.
LLMs are advanced machine learning models that perform even more specific tasks: understanding human language and generating content such as text, images, audio, or video when a user requests it through a prompt.
LLMs can't understand human language the way we speak or write it. What they can work with are numbers, such as 0s and 1s. Therefore, each character, word, or sentence must first be converted into a computer-friendly numerical form.
This is done through embedding models. These smaller models turn words into numbers, capture the relationships between them, and place them in a vector space. This is what gives LLMs the power to create emails, design graphics, convert text to audio, and more.
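To make this concrete, here is a minimal sketch of embedding two sentences, assuming the sentence-transformers library and the small all-MiniLM-L6-v2 model purely as illustrative choices:

```python
# Minimal embedding sketch: sentences become vectors, and nearby vectors
# indicate related meaning. Model choice is illustrative, not prescriptive.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["Please draft a follow-up email", "Write a reply to the client"]

vectors = model.encode(sentences)      # each sentence -> fixed-length vector
print(vectors.shape)                   # e.g. (2, 384)
print(util.cos_sim(vectors[0], vectors[1]))  # similarity of the two meanings
```

The same idea scales up inside an LLM pipeline: text is mapped into a vector space first, and everything the model does afterwards operates on those numbers.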
Open-source LLMs are free to download from places such as Hugging Face, and can be used, changed, or redistributed by anyone.
While open source implies that everything should be completely open, including all the weights and training data, with LLMs it's not that simple. Different models come with different levels of openness. Examples are Llama 3.2 by Meta, Mistral 7B by Mistral AI, and StableLM by Stability AI. Some ship under permissive licenses such as Apache, while others carry restrictions based on company size or for certain industries.
Private models are commercial models available through an API for a monthly fee, for example OpenAI GPT-4, Anthropic Claude 2, and Google PaLM 2. While these models offer instant LLM power (you just connect the API to your application), there are many security and data privacy concerns, because you cannot access the model's back end. This is one of the reasons organizations seek open-source models they can run in their own data centers.
Tokens: the text we provide to the model, such as words, subwords, or characters, which the model breaks down into smaller units called tokens. This process is the building block of every LLM-powered application.
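As a rough illustration, here is how a widely used tokenizer splits a sentence into token IDs; the GPT-2 tokenizer is just an example choice:

```python
# Tokenization sketch: text -> numeric token IDs -> subword pieces.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Large language models are transforming business."
token_ids = tokenizer.encode(text)

print(token_ids)                                    # the numbers the model sees
print(tokenizer.convert_ids_to_tokens(token_ids))   # the subword pieces behind them
```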
Parameters: the variables, in effect the learned relationships between words and sentences, that a model acquires during training. They also act as settings that shape how applications generate content; for example, you can tune the quality, creativity, or diversity of the generated output to your specific needs. Larger models have more parameters and can understand more complex data, but they require more computing power, and vice versa.
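To see what a parameter count looks like in practice, here is a small sketch that counts the parameters of DistilBERT, chosen only because it is small and easy to load:

```python
# Count the learnable parameters of a small model.
from transformers import AutoModel

model = AutoModel.from_pretrained("distilbert-base-uncased")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")  # roughly 66M for DistilBERT
```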
Weights: a subset of parameters, these are the numerical values attached to the connections in a neural network, learned from the training data. The model adjusts its weights during training to optimize performance.
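A tiny illustration, assuming PyTorch: even a single linear layer carries learned weights on each input-output connection, plus bias terms.

```python
# Weights are the learned numbers on the connections between neurons.
import torch.nn as nn

layer = nn.Linear(4, 2)    # 4 inputs fully connected to 2 outputs
print(layer.weight.shape)  # torch.Size([2, 4]) -> 8 connection weights
print(layer.bias.shape)    # torch.Size([2])    -> plus 2 bias terms
```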
Inference: the process of using the trained model to make a prediction or generate output when a user requests it through a prompt (often involving unseen data). This means that the more text or images you generate, or the more workloads your organization processes, the more inference you consume, which correlates directly with the total cost of running LLMs.
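A minimal inference sketch, using the Hugging Face pipeline API with GPT-2 as a stand-in model; the prompt is arbitrary:

```python
# One inference call: the trained model continues the prompt.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Our quarterly report shows", max_new_tokens=30)
print(result[0]["generated_text"])
```

Every call like this consumes compute, which is why inference volume drives the running cost mentioned above.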
Fine-tuning: adjusting a model's parameters to improve its performance, for example by training it on more specialized data for more specific tasks. Doing so refines its ability to generate relevant responses for new contexts or use cases without increasing the model's size. In essence, fine-tuning makes the model more efficient.
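Below is a minimal fine-tuning sketch using the Hugging Face Trainer. The model choice, the hypothetical domain_corpus.txt file, and the hyperparameters are placeholders rather than a recommended recipe; in practice many teams also use parameter-efficient methods such as LoRA.

```python
# Minimal fine-tuning sketch: continue training a small causal LM on
# domain-specific text. All names and settings below are illustrative.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "gpt2"  # small model kept for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical domain-specific text file, one example per line.
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```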
The term "large" in large language model usually comes down to two aspects:
Number of parameters: LLMs are usually grouped by parameter count, ranging from millions to billions to trillions. For example, the name Mistral-7B indicates the model has 7 billion parameters.
Training data size: LLMs are trained on vast amounts of data, often many terabytes of text from different sources. The same Mistral-7B was trained on up to 8 trillion tokens. If we estimate that an average document contains about 1,000 tokens, the model was trained on roughly 8 billion documents.
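The back-of-the-envelope arithmetic behind that estimate, with the 1,000-tokens-per-document figure as the stated assumption:

```python
# 8 trillion tokens / ~1,000 tokens per document ~= 8 billion documents
total_tokens = 8_000_000_000_000
tokens_per_document = 1_000
print(total_tokens / tokens_per_document)  # 8,000,000,000
```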
While all of these models carry "large" in their name, we can still categorize each as small, medium, or large.
Small Models: Typically have tens to a few hundred million parameters (e.g., 125M to 500M). For example, answerai-colbert-small-v1 and DistilBERT have approximately 33 million and 66 million parameters respectively.
Medium Models: Range from about 1 billion to 10 billion parameters. For example, GPT-2 by OpenAI has 1.5 billion parameters, and GPT-Neo by EleutherAI comes in 1.3B and 2.7B versions.
Large Models: Generally refer to models with 10 billion parameters or more. Examples include GPT-3 (175 billion parameters) and GPT-4 (reportedly up to 1 trillion parameters).
Quantization is a technique for reducing model size by decreasing the precision of its parameters, such as the weights. As organizations seek to harness generative AI, quantization will become crucial for enterprise-wide adoption, for two main reasons:
1. Running full-precision models requires a lot of computing power, which is very difficult with current model sizes and power supply requirements.
2. Quantized models consume significantly fewer resources and are easier to deploy in various scenarios.
For example, we have reduced Llama 3.1 70B from 16-bit to 8-bit floating point numbers, cutting its size roughly in half. This enables us to run the model on a server with 48 GB of GPU memory, compared to the roughly 140 GB required by the unquantized model. The quantized model is also a lot faster and more cost-efficient.
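As an illustration of the same idea (not the exact floating-point conversion described above), a common way to load a model with 8-bit integer weights is the bitsandbytes integration in Transformers; the gated model ID is shown only as an example:

```python
# 8-bit loading sketch. Rough figures: 70B params * 2 bytes (16-bit) ~ 140 GB
# vs. 70B * 1 byte (8-bit) ~ 70 GB for the weights alone.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",   # gated; requires access approval
    quantization_config=quant_config,
    device_map="auto",                     # spread layers across available GPUs
)
```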
In the coming years, as organizations start to scale their generative AI workloads, quantization will be one solution that enables this. Another is running models on-premises, which we will cover next.
Running LLMs on-premises can be roughly 60-80% more cost-efficient and is the surest way to keep your data under your own control. But it is a time-consuming, complicated, and expensive process.
There are three main things to consider when deploying LLMs in your organization's workloads:
1. Hardware Requirements: Ensure you have sufficient GPU resources, as LLMs require significant computational power. NVIDIA GPUs (e.g., A100, V100) are commonly used for this; a rough memory estimate is sketched after this list. For most medium and large-size organizations it makes sense to buy your own hardware, unless you already have that capacity. The second option is to use a private cloud, which is still very secure but comes at a slightly higher cost.
2. Software Requirements: You have two options: build it yourself or buy software as a service.
3. Model Selection: There are many private and open-source models on the market, so which should you choose? The former lets you instantly add LLM power to your organization's applications via an API, but comes at a hefty cost; in the long term, the expense can even outweigh the benefits. The latter gives you more control over the technology you build, is secure, and is up to 80% more cost-efficient, especially when using quantized models on on-premises hardware. While this option requires a higher upfront investment, the break-even point can be reached in just a few months, and from that point on it generates clear value.
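Here is the rough memory estimate mentioned under hardware requirements: a back-of-the-envelope calculation of how much GPU memory the weights alone occupy at different precisions (activations and the KV cache add more on top). The numbers are purely illustrative.

```python
# GB needed just to store the weights: parameters (in billions) * bytes per
# parameter, since the factors of 1e9 cancel out.
def weight_memory_gb(params_in_billions: float, bytes_per_param: float) -> float:
    return params_in_billions * bytes_per_param

for label, nbytes in [("16-bit", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    print(f"70B model at {label}: ~{weight_memory_gb(70, nbytes):.0f} GB")
# 16-bit: ~140 GB, 8-bit: ~70 GB, 4-bit: ~35 GB
```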
Large language models make sense of human language and generate new content by combining training data, unseen data, and user prompts. For businesses, that means they can turn their data into lower operational costs and better, faster decisions. No other modern technology can do this.
Adoption of this new technology is growing fast. Many organizations are currently at the proof-of-concept (POC) stage, and the next year will be about scaling the most effective POCs. Innovators and early adopters will enjoy the first-mover advantage: reduced costs and additional revenue. Ultimately, this means they will have more resources to get ahead of their competitors. With that in mind, which side of the fence are you on: getting ahead or staying behind?