Llama 3 Paper - Review (Part 2)
This is part 2 of my review of Meta’s Llama 3 technical paper. In part 1, I covered the core language model architecture, training methodology, and overall performance. Here I want to look at the multimodal parts of the model: vision, video, and speech.
What strikes me about Meta’s approach is how compositional it is. Instead of training entirely new models from scratch, they extend the existing Llama 3 language models with specialized adapters. That lets them add new capabilities while preserving the text model they already have.
Key concepts
Compositional multimodal architecture
Meta uses a modular approach for the multimodal capabilities in Llama 3. Rather than training joint models from scratch, they combine pre-trained language models with modality-specific encoders connected through adapter layers. This lets the language and vision/audio components develop somewhat separately, avoids some of the pain of joint training, preserves text-only performance, and reduces inference overhead.
Vision encoder and adapter
The vision module uses a pre-trained ViT-H/14 image encoder, modified to include 850M parameters, plus cross-attention layers that connect visual representations to the language model. These cross-attention layers are large, adding about 100B parameters to the 405B model. To preserve fine-grained visual information, they extract features from multiple intermediate layers of the vision encoder rather than only using the final layer output. This helps with tasks that need detailed localization.
Video recognition architecture
The video module builds on the image module with two additions: a “temporal aggregator” that merges frames to capture temporal relationships, and dedicated video cross-attention layers. The aggregator uses a perceiver resampler architecture to compress multiple frames into a smaller representation. During pre-training, they start with 16 frames, aggregated to 1, and scale up to 64 frames during fine-tuning to handle longer videos more effectively.
Speech understanding approach
Unlike the vision module, the speech component does not use cross-attention layers. Instead, it generates embeddings that directly integrate with text tokens in the language model. The speech module consists of a 1B-parameter Conformer encoder followed by a smaller adapter. This direct integration lets the speech interface use the language model’s existing capabilities without modifying its parameters, which apparently works well at larger scales.
Speech generation
For text-to-speech, Meta takes a different approach. Rather than fine-tuning the language model for speech generation, they use a streaming text-to-speech system that uses Llama 3 embeddings for text normalization and prosody modeling. The language model is helping the speech system sound more context-aware, rather than directly becoming the speech generator.
Scaling and training challenges
Training these multimodal adapters introduces challenges beyond the core language model. The computation becomes heterogeneous because some tokens require more processing than others, token counts vary a lot across modalities, and combining different representations creates numerical instability issues. Meta handles these with pipeline design, sequence parallelization, and higher precision for gradient accumulation.
What I learned
Compositional design makes technical sense
Initially, I had concerns about the compositional approach. I wondered if mapping the higher-dimensional image modality into the latent space of a language model would cause a lot of information loss. While the dimensionality itself is an implementation detail, text does feel like it contains less raw information than images. But after seeing GPT-4o generate accurate images from text prompts and handle complex visual tasks, I became more convinced that language model latent spaces can encode visual concepts pretty well. The limitation I was worried about may not be as severe in practice.
Multi-layer feature extraction preserves fine-grained information
One detail I found especially interesting was how they handled the problem of CLIP-like models failing to preserve fine-grained localization information. Instead of relying only on the final layer output, they extract features from multiple intermediate layers of the vision encoder, specifically the 4th, 8th, 16th, 24th, and 31st layers. This makes sense because lower layers retain more spatial detail before everything gets abstracted away in higher layers. I hadn’t thought much about this limitation of contrastive learning before, but it explains why CLIP-like models can struggle with precise visual details.
Handling many-shot jailbreaking in long-context models
Something that caught my attention was the vulnerability of long-context models to many-shot jailbreaking attacks. The longer context window enables a new attack vector: they specifically mention that 256-shot attacks become possible. They mitigated this by fine-tuning models on datasets that include safe behavior even when unsafe behavior appears in context. The fact that they could do this without hurting false refusal rates or helpfulness metrics suggests the model can distinguish between demonstrations in context and actual instructions.
Safety becomes more granular
What struck me about Meta’s safety approach is how granular it is becoming. Instead of one generic “harmful content” classifier, they are developing specialized safety mechanisms for specific risks, including cybersecurity, coding, and spear-phishing. As models get more capable, the attack vectors multiply, so the safety work becomes more specific too.
Contextual safety debt
The 256-shot attack example shows that extending a model’s capabilities can introduce new safety problems that were not relevant before. Longer context is useful, but it also creates new ways to steer the model. That feels like a kind of safety debt: every new ability may bring new attack vectors that need their own mitigations.
Summary
Meta’s approach to multimodal Llama 3 is practical. Rather than developing entirely new architectures or doing joint pre-training from scratch, they extend the existing language model through specialized adapters and compositional design. This lets them reuse strong pre-trained components while adding new capabilities incrementally.
The paper also shows the tension between capability gains and safety. As models gain new abilities, like longer context, new vulnerabilities appear. Safety work is not a one-time cleanup pass; it has to move alongside the model.
I’m especially curious about how this compositional approach scales to more modalities. Does adding modalities create useful cross-modal transfer? Do some modalities complement each other in unexpected ways? The Llama 3 paper does not answer those questions directly, but it makes them feel worth asking.