Llama 4 Now Available on Novita AI: Unleashing Multimodal MoE Power
Novita AI



Publish Date: Apr 8

Meta has just unveiled its groundbreaking Llama 4 family of models, marking a significant leap in AI capabilities with native multimodality and mixture-of-experts (MoE) architecture.

Today, we’re excited to announce that Llama 4 Scout and Llama 4 Maverick are now available on Novita AI, enabling businesses and developers to harness these powerful models through simple API integration.

Novita AI is offering the first models of the Llama 4 herd at the following pricing:

Llama 4 Scout: $0.10 / M input tokens and $0.50 / M output tokens

Llama 4 Maverick: $0.20 / M input tokens and $0.85 / M output tokens
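At these rates, per-request cost is simple arithmetic. A minimal sketch using the figures from the table above (the short model keys here are illustrative, not Novita AI model IDs):

```python
# Per-million-token rates (USD), taken from the pricing table above.
PRICING = {
    "llama-4-scout":    {"input": 0.10, "output": 0.50},
    "llama-4-maverick": {"input": 0.20, "output": 0.85},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost for one request."""
    rates = PRICING[model]
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000

# Example: a 100K-token prompt with a 2K-token reply on Maverick.
print(f"${estimate_cost('llama-4-maverick', 100_000, 2_000):.4f}")  # → $0.0217
```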

Understanding the Llama 4 Herd

The Llama 4 release introduces three distinct models, each designed for different needs and computational constraints:

Llama 4 Scout features 16 experts and delivers state-of-the-art performance for its class. It supports an industry-leading 10M token context length, making it ideal for processing large amounts of data, including entire codebases or extensive documentation.

Llama 4 Maverick is Meta’s product workhorse, incorporating 128 experts to deliver superior performance across a wide range of tasks. It excels at precise image understanding and creative writing while supporting up to 1M tokens in context.

Llama 4 Behemoth serves as the teacher model for the Llama 4 family with 16 experts. While not yet publicly released as it’s still in training, Behemoth outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM-focused benchmarks.

Note: The context window for Llama 4 Scout on Novita AI is 131,072 tokens, while the context window for Llama 4 Maverick is 1,048,576 tokens.
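Because the windows above differ between models, it can be worth checking that a prompt plus the requested completion fits before sending a request. A minimal sketch (the Maverick model ID appears in the integration example later in this post; the Scout ID is an assumed placeholder, so confirm both in the Novita AI model library):

```python
# Context windows on Novita AI, from the note above (in tokens).
CONTEXT_WINDOW = {
    "meta-llama/llama-4-scout-17b-16e-instruct": 131_072,       # assumed model ID
    "meta-llama/llama-4-maverick-17b-128e-instruct-fp8": 1_048_576,
}

def fits_in_context(model: str, prompt_tokens: int, max_tokens: int) -> bool:
    """Check that the prompt plus the requested completion stays in the window."""
    return prompt_tokens + max_tokens <= CONTEXT_WINDOW[model]
```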

Key Features and Capabilities

Native Multimodality

Llama 4 models incorporate early fusion to seamlessly integrate text and vision tokens into a unified model backbone. This enables joint pre-training with large amounts of unlabeled text, image, and video data.

The enhanced vision encoder, based on MetaCLIP but further optimized for LLM integration, allows the models to process multiple images alongside text prompts without additional engineering.
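Since Novita AI's endpoint is OpenAI-compatible, an image can be sent alongside text using the standard `image_url` message format. A hypothetical sketch that only builds the message payload (the image URL is a placeholder, and whether a given deployment accepts image input should be confirmed in the Novita AI docs):

```python
def build_multimodal_message(text: str, image_url: str) -> dict:
    """Build one OpenAI-style user message combining text and an image."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

message = build_multimodal_message(
    "What is shown in this chart?",
    "https://example.com/chart.png",  # placeholder URL
)
```

The resulting dict can be passed in the `messages` list of a `chat.completions.create` call, exactly as in the text-only integration example later in this post.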

Extended Context Length

One of the most significant advancements in Llama 4 is its support for extraordinarily long contexts:

  • Llama 4 Scout: 10 million tokens

  • Llama 4 Maverick: 1 million tokens

This leap in context length enables applications that were previously impractical, such as:

  • Multi-document summarization and analysis

  • Reasoning over extensive codebases

  • Parsing vast amounts of user activity for personalized experiences

  • Processing entire research archives in a single prompt
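The use cases above mostly reduce to packing many documents into a single prompt. A minimal sketch, assuming a directory of plain-text files and a rough character budget as a stand-in for real token counting:

```python
from pathlib import Path

def build_corpus_prompt(doc_dir: str, question: str, max_chars: int = 400_000) -> str:
    """Concatenate every .txt file in a directory into one long prompt,
    stopping once a rough character budget would be exceeded."""
    parts = []
    total = 0
    for path in sorted(Path(doc_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        if total + len(text) > max_chars:
            break  # budget reached; remaining files are skipped
        parts.append(f"## {path.name}\n{text}")
        total += len(text)
    return "\n\n".join(parts) + f"\n\nQuestion: {question}"
```

In practice the budget should come from the model's tokenizer rather than character counts, but the shape of the workflow is the same.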

Multilingual and Reasoning Capabilities

Llama 4 models have been pre-trained on 200 languages — with dedicated fine-tuning support for 12, including Arabic, Spanish, German, and Hindi. Over 100 of these languages have more than 1 billion training tokens each — offering 10 times more multilingual coverage than Llama 3.

This extensive training enables superior performance across languages, making the models suitable for global applications.

The models also demonstrate enhanced reasoning abilities thanks to specialized training recipes. For Maverick, this included a continuous online RL strategy with adaptive data filtering, focusing on medium-to-hard difficulty prompts.

Performance Benchmarks and Use Cases

Benchmarks

According to Meta’s official benchmark data, Llama 4 models demonstrate exceptional performance across various tasks, as shown in the tables below:

Llama 4 Scout Benchmarks

Llama 4 Maverick Benchmarks

Llama 4 Scout is best suited for long-context applications, while Llama 4 Maverick excels at complex reasoning and creative tasks that involve multimodal understanding. Here are the ideal use cases for each model based on their strengths:

Llama 4 Scout:

  • Multi-document summarization for legal or financial analysis

  • Personalized task automation using extensive user data

  • Efficient image processing for lightweight multimodal applications

Explore Llama 4 Scout Demo Now

Llama 4 Maverick:

  • Multilingual customer support with visual context

  • Generating marketing content based on multimodal inputs

  • Advanced document intelligence combining text, diagrams, and tables

  • Creative writing and content generation with precise image understanding

Explore Llama 4 Maverick Demo Now

Both models excel in situations requiring multimodal understanding, reasoning over extensive context, and multilingual capabilities.

Getting Started with Llama 4 on Novita AI

Integrating Llama 4 models into your applications via Novita AI’s model library is straightforward, requiring just a few lines of code. Here’s how to get started:

Setting Up Your Environment

First, ensure you have an API key from Novita AI. If you don’t have one yet, sign up and create an API key through the Novita AI dashboard.

Integrating with Python

Novita AI provides OpenAI-compatible endpoints for seamless integration. Here’s a simple example using the Python client:

from openai import OpenAI

# Point the standard OpenAI client at Novita AI's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.novita.ai/v3/openai",
    api_key="<YOUR Novita AI API Key>",
)

model = "meta-llama/llama-4-maverick-17b-128e-instruct-fp8"
stream = True  # set to False for a single, non-streamed response
max_tokens = 2048
system_content = """Be a helpful assistant"""

# Sampling parameters.
temperature = 1
top_p = 1
min_p = 0
top_k = 50
presence_penalty = 0
frequency_penalty = 0
repetition_penalty = 1
response_format = {"type": "text"}

chat_completion_res = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "system", "content": system_content},
        {"role": "user", "content": "Hi there!"},
    ],
    stream=stream,
    max_tokens=max_tokens,
    temperature=temperature,
    top_p=top_p,
    presence_penalty=presence_penalty,
    frequency_penalty=frequency_penalty,
    response_format=response_format,
    # Parameters outside the OpenAI spec are passed via extra_body.
    extra_body={
        "top_k": top_k,
        "repetition_penalty": repetition_penalty,
        "min_p": min_p,
    },
)

if stream:
    # Streamed responses arrive as incremental chunks.
    for chunk in chat_completion_res:
        print(chunk.choices[0].delta.content or "", end="")
else:
    print(chat_completion_res.choices[0].message.content)

For more detailed examples and comprehensive integration guides, visit our LLM API documentation.

Conclusion

The arrival of Llama 4 on Novita AI represents a significant milestone in the democratization of advanced AI capabilities.

With native multimodality, extended context lengths, and an efficient MoE architecture, these models enable new classes of applications that were previously impractical or prohibitively expensive.

Whether you’re building applications for document processing, multilingual communication, or creative content generation, Llama 4 provides the tools you need to create intelligent, responsive experiences.

Get started today with Novita AI’s simple integration process and competitive pricing to bring the power of Llama 4 to your applications and users.

About Novita AI
Novita AI is an AI cloud platform that offers developers an easy way to deploy AI models using our simple API, while also providing an affordable and reliable GPU cloud for building and scaling.
