Large Language Models Demystified

Raido Linde | October 2, 2024

LLMs are complex, and keeping up with this fast-moving field requires understanding how large language models work, along with terms such as tokens, parameters, weights, inference, and fine-tuning.

Therefore, in this article, we will explore what large language models are, what makes an LLM "large", and the difference between private and open-source LLMs. We will also examine the key terms needed to understand this disruptive technology. Lastly, we will discuss what you need to consider when deploying LLMs in your organization.

What are Large Language Models (LLMs)?

In machine learning there are many kinds of models, but with the growth of generative AI in recent years, large language models (LLMs) have become a hot topic. That is because LLMs are currently the only computer programs capable of understanding and generating human language with accuracy that comes close to that of humans.

The underlying foundation is a deep learning neural network, typically a transformer built from encoder and/or decoder blocks. This enables LLMs to analyze relationships between the words in the data they see during self-supervised or semi-supervised training, and to use those learned relationships to process human language and generate human-like text or other content.

These models are also trained on extensive amounts of data, which is part of where the name "large" comes from: the more data the model is trained on, the better its output usually is.

AI, Machine Learning, Deep Learning and LLMs

LLMs are a subset of deep learning, which is a subset of machine learning, which is a subset of AI. Let me explain:

Artificial intelligence (AI) refers to computer systems, in other words algorithms, that can perform tasks with human-like intelligence. AI has many sub-categories, such as machine learning, deep learning, natural language processing, expert systems, robotics, machine vision, and speech recognition, each covering a different kind of AI capability.

Machine learning is a subset of AI: computer algorithms capable of performing specific tasks, such as making predictions and decisions, based on the data to which they have access.

Deep learning is a subset of machine learning in which models process data through layered structures that mimic the human brain, known as neural networks.

LLMs are advanced deep learning models that perform even more specific tasks: understanding human language and generating content such as text, images, audio, or video when a user requests it through a prompt.

How Do LLMs Understand Human Language?

LLMs cannot understand human language as we speak or write it; computers work with numbers such as 0 and 1. Each character, word, or sentence must therefore first be converted into a computer-friendly numerical form.

This is done through embedding models. These smaller models turn words into vectors of numbers, capture relationships between them, and place them in a vector space where related words sit close together. This is what gives LLMs the power to create emails, design graphics, convert text to audio, and more.
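To make this concrete, here is a minimal, hand-rolled sketch of the idea in Python. The three-dimensional vectors are invented purely for illustration; real embedding models learn vectors with hundreds or thousands of dimensions.

```python
import math

# Toy embeddings: each word maps to a made-up 3-D vector.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """How strongly two vectors point in the same direction (1.0 = identical)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norms

# Related words end up close together in the vector space.
print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # ~0.99, very close
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # ~0.30, far apart
```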

Open-Source vs Private Models

Open-source LLMs are free to download from hubs such as Hugging Face, and they can be used, changed, or redistributed by anyone.

While "open source" should mean completely open, including all the weights and training data, with LLMs it is not that simple: different models come with different levels of openness. Examples are Llama 3.2 by Meta, Mistral 7B by Mistral AI, and StableLM by Stability AI. Some of these carry permissive licenses such as Apache 2.0, while others impose restrictions based on company size or on use in certain industries.

Private models are commercial models available through an API, typically for a subscription or usage-based fee; examples include OpenAI's GPT-4, Anthropic's Claude 2, and Google's PaLM 2. While these models offer instant LLM power (you just connect the API to your application), they raise security and data privacy concerns because you cannot access the models' back end. This is one of the reasons organizations seek open-source models that they can run in their internal data centers.
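For context, connecting a private model usually takes only a few lines of code. This sketch uses OpenAI's Python client and assumes the openai package is installed and an OPENAI_API_KEY environment variable is set; the model name is illustrative.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain what an LLM is in one sentence."}],
)
print(response.choices[0].message.content)
# Note the trade-off discussed above: the prompt and the response both
# pass through the vendor's servers.
```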

Comparing Proprietary and Open-Source LLMs

| Criteria | Open-Source Models | Proprietary Models |
| --- | --- | --- |
| Cost | Typically free, but you need the hardware to run them. | Expensive, often requiring subscriptions or pay-per-use fees. |
| Control | Full or partial control over the weights and training data. | No control; restricted to the API settings provided by the vendor. |
| Deployment flexibility | Can be deployed anywhere, but requires your own hardware and software. | Very simple deployment options in cloud environments, but expensive. |
| Security & privacy | Total control over your data, models, and infrastructure. | No control over the technology, data, or models. |
| Optimization | Can be easily tailored to your specific workloads or application needs. | Limited ability to optimize or tailor to specific tasks or workloads. |
| Customization | Fully customizable to your preferences and needs, regardless of industry. | Limited customization options, restricted to the API settings provided by the vendor. |
| Ease of use | More complex, because you need to run them in your own environment. | Easier to use, as you don't need the infrastructure or hardware. |
| Transparency | High transparency into how the model works. | Black-box models whose internal workings are hidden. |
| Support options | Large, active community support with free resources available. | Paid support is available but limited, often with slower response times. |

What Do LLM Terms Like Tokens, Parameters, Weights, Inference, and Fine-Tuning Mean?

Tokens: the pieces of text, such as words, subwords, or characters, that a model works with. The text you provide is broken down into these smaller units, called tokens, and this process is the building block of every LLM-powered application.
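You can see tokenization in action with a few lines of Python. This sketch assumes the tiktoken package is installed; cl100k_base is the encoding used by several OpenAI models.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("Large language models are surprisingly token-hungry.")
print(tokens)                              # a list of integer token IDs
print(len(tokens))                         # how many tokens this sentence costs
print([enc.decode([t]) for t in tokens])   # the text piece behind each ID
```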

Parameters: the internal variables, in effect the learned relationships between words and sentences, that a model acquires during training. (These are distinct from the generation settings exposed by many applications, such as temperature, which adjust the quality, creativity, or diversity of the output to your specific needs.) Larger models have more parameters and can understand more complex data, but they require more computing power, and vice versa.
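To see what a parameter count means concretely, here is a sketch using PyTorch (assumed installed). The tiny network is illustrative; production LLMs have billions of such learned values.

```python
import torch.nn as nn

# A toy two-layer network, just to count its learned values.
model = nn.Sequential(
    nn.Linear(128, 256),  # 128*256 weights + 256 biases
    nn.ReLU(),
    nn.Linear(256, 64),   # 256*64 weights + 64 biases
)

total = sum(p.numel() for p in model.parameters())
print(f"{total:,} parameters")  # 49,472; a 7B model has roughly 140,000x more
```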

Weights: a subset of parameters representing the numerical values of the connections between neurons in the network, learned from the training data. During training, the model adjusts its weights to optimize performance.

Inference: the process of using a trained model to make a prediction or generate output when a user requests it through a prompt (often on previously unseen data). The more text or images you generate, or the more of your organization's workloads the model processes, the more inference you consume, and this correlates directly with the total cost of running LLMs.

Fine-tuning: adjusting a model's parameters to improve its performance, for example by training it on more specialized data for more specific tasks. Doing so refines its ability to generate relevant responses, especially for new contexts or use cases, without expanding the model's size. In essence, fine-tuning makes the model more efficient at your task.
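As a rough illustration, here is a minimal fine-tuning sketch using the Hugging Face transformers and datasets libraries (both assumed installed). The model name, the two-sentence corpus, and the hyperparameters are placeholders; a real fine-tune needs far more data and tuning.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # distilgpt2 has no pad token

# A tiny "specialized" corpus, purely for illustration.
texts = ["Quantized models use lower-precision weights.",
         "Inference cost scales with the tokens you generate."]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=64),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetune-demo",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # adjusts the model's weights on the new data
```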

What Makes a Large Language Model "Large"?

The "large" in large language model usually comes down to two aspects:

Number of parameters: LLMs are usually grouped by parameter count, ranging from millions to billions to trillions. For example, the "7B" in Mistral 7B means the model has 7 billion parameters.

Training data size: LLMs are trained on vast amounts of data, which can consist of many terabytes of text from different sources. The same Mistral 7B is reported to have been trained on up to 8 trillion tokens. If we estimate that an average document contains about 1,000 tokens, the model was trained on roughly 8 billion documents.
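Here is the back-of-envelope math from the paragraph above, written out in Python under the stated assumptions (8 trillion training tokens, about 1,000 tokens per document):

```python
training_tokens = 8_000_000_000_000  # 8 trillion tokens (stated figure)
tokens_per_document = 1_000          # rough average (stated assumption)

documents = training_tokens / tokens_per_document
print(f"{documents:,.0f}")  # 8,000,000,000 -> roughly 8 billion documents
```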

Small vs. Medium vs. Large Models

While all of these generative AI models are referred to as large language models, we can categorize each as small, medium, or large.

Small models: typically tens to hundreds of millions of parameters. For example, answerai-colbert-small-v1 and DistilBERT have approximately 33 million and 66 million parameters, respectively.

Medium models: range from about 1 billion to 10 billion parameters. For example, GPT-2 by OpenAI has 1.5 billion parameters, and GPT-Neo by EleutherAI comes in 1.3B and 2.7B variants.

Large models: generally models with 10 billion parameters or more. Examples include GPT-3 (175 billion parameters) and GPT-4 (reportedly up to around 1 trillion parameters).

What Is Quantization and Why Is It Important?

Quantization is a technique that reduces model size by lowering the numerical precision of its parameters, such as the weights. As organizations seek to harness generative AI, quantization will become crucial for enterprise-wide adoption for two main reasons:

  • The amount of data gets bigger each year
  • There are hundreds to thousands of workloads that could be combined with LLMs

Serving all of this requires a lot of computing power, which will be very difficult with current model sizes and power supply requirements.

Quantized models, however, consume significantly fewer resources and are easier to deploy in various scenarios.

For example, we have reduced Llama 3.1 70B from 16-bit to 8-bit floating-point numbers, reducing its size by more than half. This enables us to run the model on a server with 48 GB of GPU memory, compared to the roughly 140 GB the unquantized model requires. As a result, the quantized model is also considerably faster and more cost-efficient.
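Here is a minimal sketch of the core idea, symmetric 8-bit quantization of a single weight tensor, using PyTorch (assumed installed). Production schemes such as fp8, GPTQ, or AWQ are more sophisticated, but the space saving works the same way.

```python
import torch

weights = torch.randn(4096, 4096, dtype=torch.float16)  # one fp16 weight matrix

scale = weights.abs().max() / 127                        # map the value range to int8
q_weights = torch.clamp((weights / scale).round(), -127, 127).to(torch.int8)
dequantized = q_weights.to(torch.float16) * scale        # approximate reconstruction

mb = lambda t: t.element_size() * t.numel() / 1e6
print(f"{mb(weights):.1f} MB in fp16")      # ~33.6 MB
print(f"{mb(q_weights):.1f} MB in int8")    # ~16.8 MB, half the size
print((weights - dequantized).abs().max())  # small rounding error per weight
```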

In the coming years, as organizations start to scale their generative AI workloads, quantization will be one solution that enables it. Another is running models on-premises, which we will cover next.

Deploying LLMs in Your Organization

Running LLMs on-premises can be approximately 60-80% more cost-efficient and is the surest way to keep your data secure. But it is a time-consuming, complicated, and expensive process.

There are three main things you need to consider when deploying LLMs in your organization's workloads:

1. Hardware Requirements: Ensure you have sufficient GPU resources, as LLMs require significant computational power. NVIDIA GPUs (e.g., A100, V100) are commonly used for this. For most medium and large organizations it makes sense to buy your own hardware, unless you already have that capacity. The second option is to use a private cloud, which is still very secure but comes with a slightly higher cost. (For a rough way to estimate how much GPU memory a model needs, see the sketch after this list.)

2. Software Requirements: You have two options: build it yourself or buy software as a service.

  • Build Your Own Generative AI Stack: This approach requires substantial investment in time and resources and can take 1-2 years. For example, you need to master complex tooling such as Kubernetes and develop the software stack, data connectors, and authorization and authentication systems from scratch.
  • License a Solution: Alternatively, consider licensing a pre-built generative AI software infrastructure from companies like ConfidentialMind. This option provides access to advanced machine learning capabilities at a fraction of the cost and time compared to building it internally.

3. Model Selection: There are many private and open-source models on the market, so which should you choose? The former lets you instantly add LLM power to your organization's applications via an API, but at a hefty cost; over the long term, the expense can even outweigh the benefits. The latter gives you more control over the technology you build, keeps your data secure, and can be up to 80% more cost-efficient, especially when using quantized models on on-premises hardware. While this option requires a higher upfront investment, the break-even point can be reached in just a few months, and from that point on it generates clear value.
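As referenced in point 1 above, here is a rough way to estimate how much GPU memory a model's weights alone require. This is a rule of thumb only; real deployments also need memory for activations, the KV cache, and serving overhead.

```python
def estimate_vram_gb(num_parameters: float, bits_per_parameter: int) -> float:
    """Approximate memory needed just to hold the model weights."""
    return num_parameters * bits_per_parameter / 8 / 1e9

print(estimate_vram_gb(70e9, 16))  # 70B model at 16-bit: ~140 GB
print(estimate_vram_gb(70e9, 8))   # the same model quantized to 8-bit: ~70 GB
print(estimate_vram_gb(7e9, 16))   # a 7B model at 16-bit: ~14 GB
```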

Conclusion

Large language models make sense of human language and generate new content by combining what they learned in training with user prompts and new, unseen data. For businesses, that means software that can understand their data, helping reduce operational costs and make better or faster decisions. No other modern technology can do this.

Adoption of this new technology is growing fast. Many organizations are currently in the proof-of-concept (POC) stage, and the next year will be the time to scale the most effective POCs. Innovators and early adopters will enjoy the first-mover advantage of reduced costs and additional revenue, which ultimately means more resources to get ahead of their competitors. With that in mind, I would like to ask which side of the fence you are on: getting ahead or staying behind?
