Introducing Falcon 180B: The New Frontier in AI Technology
Falcon 180B is an impressive, high-performing LLM that holds the top spot on the Hugging Face Open LLM Leaderboard. However, when considering it for deployment, you must account for its resource requirements and associated costs. It is a scaled-up version of Falcon 40B and builds on that model's innovations, such as multi-query attention, for improved scalability.
The Technology Innovation Institute (TII) trained Falcon 180B on 3.5 trillion tokens using Amazon SageMaker, running on up to 4,096 GPUs simultaneously for a total of roughly 7,000,000 GPU hours. Falcon 180B is 2.5 times larger than Llama 2 and was trained with four times as much compute. In this post, you will learn about:
- Overview of Falcon 180B
- Model Architecture of Falcon 180B
- Exploring the Variants of Falcon 180B
- Falcon 180B vs Falcon 40B vs Falcon 7B
- How to Access and Utilize Falcon 180B
- Implementing Text Generation with Falcon 180B
- Applications of Falcon 180B
- Conclusion
Overview of Falcon 180B
Falcon 180B is a 180-billion-parameter Large Language Model (LLM) that has gained recognition for its performance in the field of artificial intelligence. Its results are comparable to Google's PaLM 2 (the model behind Bard) and not far behind GPT-4. It is often called a "Llama 2 killer" because, as a pretrained-only model, it outperforms Llama 2.
In terms of architecture, Falcon 180B is a scaled-up version of Falcon 40B and builds on its innovations for better scalability, such as multi-query attention. To learn more about the Falcon architecture, we suggest reading the initial blog article that introduced it. The Technology Innovation Institute (TII) team trained Falcon 180B on 3.5 trillion tokens using Amazon SageMaker across up to 4,096 GPUs simultaneously, totaling about 7,000,000 GPU hours. Falcon 180B is 2.5 times larger than Llama 2 and was trained with four times as much compute.
As of September 2023, Falcon 180B holds the top position on the Hugging Face Open LLM Leaderboard, making it the highest-performing pretrained LLM openly available. However, its size and resource requirements are substantial: inference in half precision (FP16/BF16) calls for about 640GB of GPU memory, which translates to eight A100 80GB GPUs.
Alternatively, the model can be quantized down to int4, which brings the requirement to eight A100 40GB GPUs (320GB of memory). The cost of keeping such a model online can be significant, potentially reaching $20,000 per month.
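To see where numbers of this magnitude come from, here is a minimal back-of-the-envelope sketch in Python that estimates the memory taken by the weights alone at different precisions. The 640GB and 320GB figures above refer to total GPU capacity across the cluster, which also has to hold activations, the KV cache, and runtime overhead, so they are larger than the weights-only estimates below.

# Rough weights-only memory estimate for a 180B-parameter model.
# Activations, KV cache, and framework overhead are ignored here, which is
# why real deployments budget more total GPU memory than these figures.
PARAMS = 180e9

def weight_memory_gb(bytes_per_param: float) -> float:
    """Approximate memory (in GiB) needed just to hold the model weights."""
    return PARAMS * bytes_per_param / 1024**3

print(f"FP16/BF16 (2 bytes/param): ~{weight_memory_gb(2.0):.0f} GB")   # ~335 GB
print(f"INT8 (1 byte/param):       ~{weight_memory_gb(1.0):.0f} GB")   # ~168 GB
print(f"INT4 (0.5 bytes/param):    ~{weight_memory_gb(0.5):.0f} GB")   # ~84 GB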
Despite the high cost, Falcon 180B’s license allows commercial usage and provides organizations with more control over their data, training, and model ownership compared to alternatives like OpenAI’s GPT-4.
The Technology Innovation Institute (TII) trained the earlier Falcon-40B on 1,000 billion tokens of RefinedWeb, a high-quality filtered and deduplicated web dataset, enhanced with curated corpora. Significant components of the curated corpora drew inspiration from The Pile (Gao et al., 2020).
"Falcon 180B is the highest scoring openly released pre-trained LLM, surpassing Meta's LLaMA 2 (67.35)," said the Hugging Face blog. The new model also handily outperforms its predecessor Falcon 40B's score of 60.4.
Model Architecture of Falcon 180B
Falcon 180B, an advanced version of Falcon 40B, features a multi-query attention mechanism for improved scalability. Unlike the traditional multi-head attention scheme, which assigns a unique query, key, and value projection to each head, the multi-query approach shares a single key and value projection across all heads, shrinking the key/value cache and making inference more efficient. The Technology Innovation Institute (TII) used up to 4,096 GPUs on Amazon SageMaker to train the model, investing approximately 7,000,000 GPU hours in the process, roughly four times the training compute of Llama 2 (Figure 1).
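The snippet below is a minimal PyTorch sketch of the multi-query idea, illustrative only and not Falcon's actual implementation: each query head keeps its own projection, while a single key/value projection is shared across all heads.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative multi-query attention (MQA): many query heads, one shared K/V head.
class MultiQueryAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)            # one projection per query head (packed)
        self.kv_proj = nn.Linear(d_model, 2 * self.d_head)   # a single shared key and value head
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)  # (b, h, t, d)
        k, v = self.kv_proj(x).split(self.d_head, dim=-1)                          # (b, t, d) each
        k = k.unsqueeze(1)  # broadcast the single K head across all query heads
        v = v.unsqueeze(1)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head**0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out)

x = torch.randn(2, 16, 512)
print(MultiQueryAttention(512, 8)(x).shape)  # torch.Size([2, 16, 512])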
Exploring the Variants of Falcon 180B
The Falcon-180B model is available in two versions:
- Falcon-180B (Base): The base version of Falcon-180B is a causal decoder-only model. It consists of 180 billion parameters and is primarily designed for further fine-tuning on specific datasets. This model provides a strong foundation for building custom language models tailored to specific tasks or domains.
- Falcon-180B-Chat: The chat version of Falcon-180B is also a 180-billion-parameter causal decoder-only model. On top of the base architecture, this variant has been fine-tuned on a mixture of instruction (chat) datasets, including Ultrachat, Platypus, and Airoboros. Fine-tuning on these chat datasets enables the model to better understand and generate conversational responses.
Both the base and chat versions of the Falcon-180B offer powerful language generation capabilities. The base version is suitable for users who want to fine-tune the model on their own data, allowing for customization and adaptation to specific requirements. The chat version, however, specifically targets generating chat-like responses and engaging in conversational interactions, thanks to its additional fine-tuning on instruction datasets.
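If you plan to use the chat variant, the prompt should follow the conversational structure the model was fine-tuned on. The sketch below builds a prompt using the simple "System:/User:/Falcon:" layout described in the Hugging Face announcement; the build_chat_prompt helper is our own illustration, so check the model card for the authoritative format.

# Minimal sketch of building a conversational prompt for Falcon-180B-Chat.
# The System:/User:/Falcon: structure follows the format described in the
# Hugging Face announcement; verify against the model card before relying on it.
def build_chat_prompt(system: str, turns: list, user_msg: str) -> str:
    lines = [f"System: {system}"]
    for user, assistant in turns:          # prior (user, model) exchanges
        lines.append(f"User: {user}")
        lines.append(f"Falcon: {assistant}")
    lines.append(f"User: {user_msg}")
    lines.append("Falcon:")                # the model continues from here
    return "\n".join(lines)

prompt = build_chat_prompt(
    "You are a helpful assistant.",
    [("What is Falcon 180B?", "A 180B-parameter open-access language model.")],
    "How much GPU memory does it need for inference?",
)
print(prompt)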
Falcon 180b vs Falcon 40b vs Falcon 7b
Falcon-180B, Falcon-40B, and Falcon-7B are different models within the Falcon family of language models. Here’s a comparison between these models:
Falcon-180B:
- Parameters: Falcon-180B is a colossal model with 180 billion parameters.
- Training Data: It has been trained on a massive 3,500 billion tokens of RefinedWeb, a dataset based on CommonCrawl.
- License: Falcon-180B is open-access and comes with a permissive license, allowing for commercial use.
- Notable Features: As the largest model in the Falcon family, Falcon-180B is expected to reshape the landscape of AI-driven language understanding.
Falcon-40B:
- Parameters: Falcon-40B has 40 billion parameters.
- Training Data: The team trained it on 1 trillion tokens, predominantly from RefinedWeb, supplemented with curated sources such as conversational data from Reddit.
- License: Falcon-40B is also open-access and allows for commercial use.
- Performance: Falcon-40B has achieved a leaderboard score of 60.4.
- Inference: Running Falcon-40B can be challenging due to its size, but it can be loaded in 8-bit mode to fit in memory (see the sketch after this comparison).
Falcon-7B:
- Parameters: Falcon-7B has 7 billion parameters.
- Training Data: It has been trained on 1.5 trillion tokens, with a focus on scaling and improving the quality of web data.
- License: Falcon-7B is open-access and allows for commercial use.
- Performance: Falcon-7B has achieved a leaderboard score of 48.8.
- Inference: Falcon-7B requires around 15GB of GPU memory, making it accessible even on consumer hardware.
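For the smaller family members, quantized loading makes the memory footprint much more manageable. Below is a minimal sketch, assuming the bitsandbytes package is installed, of loading Falcon-40B in 8-bit mode as mentioned in the comparison above; exact savings depend on your hardware and library versions.

# Minimal sketch: loading Falcon-40B with 8-bit weights via bitsandbytes.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-40b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # quantize weights to int8 at load time (requires bitsandbytes)
    device_map="auto",   # spread layers across the available GPUs
)

inputs = tokenizer("Falcon-40B is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=30)[0]))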
How to Access and Utilize Falcon 180B
Falcon 180B is supported in the Hugging Face ecosystem starting with Transformers version 4.33 and can be tried out in the official demo Space on the Hugging Face Hub.
Implementing Text Generation with Falcon 180B
To run inference at full bfloat16 precision, you need roughly eight A100 80GB GPUs or an equivalent setup.
from transformers import AutoTokenizer
import transformers
import torch

# Model identifier on the Hugging Face Hub
model = "tiiuae/falcon-180b"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model)

# Set up the text-generation pipeline in bfloat16 precision
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,  # halves memory use on compatible hardware
    trust_remote_code=True,      # executes code from the model repo; only enable for sources you trust
    device_map="auto",           # automatically shard the model across available devices
)

# Generate a completion for the prompt
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Girafatron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)

# Print the generated text
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
Applications of Falcon 180B
- Language Tasks: Falcon 180B can excel in various language tasks such as proficiency assessments, reasoning, coding, and knowledge testing.
- Virtual Assistants and Chatbots: Developers can use large language models like Falcon 180B to power virtual assistants and chatbots, enabling more natural and intelligent conversations.
- Machine Translation: Developers can utilize Falcon 180B in machine translation applications to improve the accuracy and fluency of translated text.
- Sentiment Analysis: Analysts can employ Falcon 180B in sentiment analysis tools to analyze and understand the sentiment expressed in text data.
- Collaborative Apps: Developers can integrate Falcon 180B into collaborative apps to enhance their language processing capabilities, enabling features like intelligent suggestions and auto-completion.
- Education and Training: Educators and developers can use Falcon 180B in educational settings for tasks such as generating instructional content and training AI bots.
- Healthcare and Finance: The competitive performance of Falcon 180B opens up opportunities for its application in healthcare and finance, where language processing plays a crucial role.
Conclusion
The Falcon 180B Large Language Model (LLM) has generated considerable excitement. It stands as the largest publicly accessible model on the Hugging Face model hub, leading the pack among open-access models and competing closely with renowned proprietary models like PaLM-2. However, our analysis suggests there is room for improvement in text extraction, summarization, authoring, and code generation.