# Long Text Embedding with Chunked Processing
This directory contains examples of using vLLM's chunked processing feature to embed long texts that exceed the model's maximum context length.
## 🚀 Quick Start

### Start the Server
Use the provided script to start a vLLM server with chunked processing enabled:
```bash
# Basic usage (supports very long texts, up to ~3M tokens)
./service.sh

# Custom configuration with a different model
MODEL_NAME="jinaai/jina-embeddings-v3" \
MAX_EMBED_LEN=1048576 \
./service.sh

# For extremely long documents
MODEL_NAME="intfloat/multilingual-e5-large" \
MAX_EMBED_LEN=3072000 \
./service.sh
```
### Test Long Text Embedding
Run the comprehensive test client:
```bash
python client.py
```
## 📁 Files
| File | Description |
|---|---|
| `service.sh` | Server startup script with chunked processing enabled |
| `client.py` | Comprehensive test client for long text embedding |
## ⚙️ Configuration

### Server Configuration
The key parameters for chunked processing are passed via `--pooler-config`:
```json
{
  "pooling_type": "auto",
  "normalize": true,
  "enable_chunked_processing": true,
  "max_embed_len": 3072000
}
```
!!! note
    `pooling_type` sets the model's own pooling strategy for processing within each chunk. Cross-chunk aggregation automatically uses the MEAN strategy whenever the input exceeds the model's native maximum length.
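With the server configured as above, long inputs go through the standard OpenAI-compatible embeddings endpoint. A minimal sketch (the port and API key match the defaults listed under Environment Variables below):

```python
from openai import OpenAI

# Defaults match service.sh: PORT=31090, API_KEY=EMPTY
client = OpenAI(base_url="http://localhost:31090/v1", api_key="EMPTY")

# A text far longer than the model's native context window
long_text = "Chunked processing handles long documents. " * 20_000

response = client.embeddings.create(
    model="intfloat/multilingual-e5-large",
    input=long_text,
)
print(len(response.data[0].embedding))  # same dimensionality as for short inputs
```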
### Chunked Processing Behavior
Chunked processing uses MEAN aggregation to combine chunk results when the input exceeds the model's native maximum length:
| Component | Behavior | Description |
|---|---|---|
| Within chunks | Model's native pooling | Uses the model's configured pooling strategy |
| Cross-chunk aggregation | Always MEAN | Weighted averaging based on chunk token counts |
| Coverage | Complete | All chunks are processed, preserving full semantic coverage |
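Conceptually, the cross-chunk step reduces to a token-count-weighted average. The sketch below illustrates this; the function name and shapes are hypothetical, not vLLM internals:

```python
import numpy as np

def aggregate_chunk_embeddings(chunk_embeddings, chunk_token_counts):
    """Combine per-chunk embeddings with a token-count-weighted MEAN."""
    embeddings = np.asarray(chunk_embeddings, dtype=np.float64)  # (num_chunks, dim)
    weights = np.asarray(chunk_token_counts, dtype=np.float64)   # (num_chunks,)
    pooled = (embeddings * weights[:, None]).sum(axis=0) / weights.sum()
    return pooled / np.linalg.norm(pooled)  # normalized, matching "normalize": true

# Example: two full chunks of 4096 tokens plus a final partial chunk
vectors = [np.random.rand(1024) for _ in range(3)]
print(aggregate_chunk_embeddings(vectors, [4096, 4096, 1808]).shape)  # (1024,)
```

Weighting by token count means a short trailing chunk cannot skew the result as much as a full-sized one.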
### Environment Variables
| Variable | Default | Description |
|---|---|---|
| `MODEL_NAME` | `intfloat/multilingual-e5-large` | Embedding model to use (supports multiple models) |
| `PORT` | `31090` | Server port |
| `GPU_COUNT` | `1` | Number of GPUs to use |
| `MAX_EMBED_LEN` | `3072000` | Maximum embedding input length (supports very long documents) |
| `POOLING_TYPE` | `auto` | Model's native pooling type: `auto`, `MEAN`, `CLS`, `LAST` (affects only within-chunk pooling, not cross-chunk aggregation) |
| `API_KEY` | `EMPTY` | API key for authentication |
## 🔧 How It Works
- **Enhanced Input Validation**: `max_embed_len` allows accepting inputs longer than `max_model_len` without environment variables
- **Smart Chunking**: Text is split based on `max_position_embeddings` to maintain semantic integrity (see the sketch below)
- **Unified Processing**: All chunks are processed separately through the model using its configured pooling strategy
- **MEAN Aggregation**: When the input exceeds the model's native length, results are combined using token-count-based weighted averaging across all chunks
- **Consistent Output**: Final embeddings maintain the same dimensionality as standard processing
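The splitting step from the list above can be pictured with a simple helper (hypothetical, not the actual vLLM implementation):

```python
def split_token_ids(token_ids: list[int], max_chunk_size: int) -> list[list[int]]:
    """Split a tokenized input into consecutive chunks of at most max_chunk_size tokens."""
    return [token_ids[i:i + max_chunk_size]
            for i in range(0, len(token_ids), max_chunk_size)]

chunks = split_token_ids(list(range(150_000)), max_chunk_size=4096)
print(len(chunks))  # 37, matching the debug log shown later in this document
```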
### Input Length Handling
- **Within `max_embed_len`**: Input is accepted and processed (up to 3M+ tokens)
- **Exceeds `max_position_embeddings`**: Chunked processing is triggered automatically (see the sketch below)
- **Exceeds `max_embed_len`**: Input is rejected with a clear error message
- **No environment variables required**: Works without `VLLM_ALLOW_LONG_MAX_MODEL_LEN`
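The decision logic above, sketched with hypothetical names (not vLLM internals):

```python
def check_input_length(num_tokens: int,
                       max_position_embeddings: int,
                       max_embed_len: int) -> bool:
    """Return True when chunked processing is needed; raise if the input is too long."""
    if num_tokens > max_embed_len:
        raise ValueError(
            f"This model's maximum embedding input length is {max_embed_len} tokens, "
            f"but the request contains {num_tokens} tokens.")
    # Accepted: chunking kicks in only beyond the native context window
    return num_tokens > max_position_embeddings

print(check_input_length(150_000, max_position_embeddings=4096,
                         max_embed_len=3_072_000))  # True -> chunked processing
```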
### Extreme Long Text Support

With `MAX_EMBED_LEN=3072000`, you can process:
- **Academic papers**: Full research papers with references
- **Legal documents**: Complete contracts and legal texts
- **Books**: Entire chapters or small books
- **Code repositories**: Large codebases and documentation
## 📊 Performance Characteristics

### Chunked Processing Performance
| Aspect | Behavior | Performance |
|---|---|---|
| Chunk Processing | All chunks processed with native pooling | Scales with input length |
| Cross-chunk Aggregation | MEAN weighted averaging | Minimal overhead |
| Memory Usage | Proportional to number of chunks | Moderate, scalable |
| Semantic Quality | Complete text coverage | Optimal for long documents |
## 🧪 Test Cases
The test client demonstrates:
- ✅ Short text: Normal processing (baseline)
- ✅ Medium text: Single chunk processing
- ✅ Long text: Multi-chunk processing with aggregation
- ✅ Very long text: Many chunks processing
- ✅ Extreme long text: Document-level processing (100K+ tokens)
- ✅ Batch processing: Mixed-length inputs in one request (see the sketch below)
- ✅ Consistency: Reproducible results across runs
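The batch-processing case maps to a single API request; a sketch using the default server settings:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:31090/v1", api_key="EMPTY")

batch = [
    "A short query.",                           # normal processing
    "A medium paragraph. " * 300,               # single chunk
    "A very long document section. " * 40_000,  # multi-chunk processing
]
response = client.embeddings.create(
    model="intfloat/multilingual-e5-large",
    input=batch,
)
print([len(item.embedding) for item in response.data])  # one vector per input
```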
## 🐛 Troubleshooting

### Common Issues
- **Chunked processing not enabled**:

    ```text
    ValueError: This model's maximum position embeddings length is 4096 tokens...
    ```

    Solution: Ensure `enable_chunked_processing: true` is set in the pooler config

- **Input exceeds `max_embed_len`** (see the example after this list):

    ```text
    ValueError: This model's maximum embedding input length is 3072000 tokens...
    ```

    Solution: Increase `max_embed_len` in the pooler config or reduce the input length

- **Memory errors**:

    ```text
    RuntimeError: CUDA out of memory
    ```

    Solution: Reduce the chunk size by lowering the model's `max_position_embeddings`, or spread the model across more GPUs

- **Slow processing**: Expected behavior; long texts require multiple inference calls, one per chunk
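When a limit is exceeded, the server's error reaches the client through the API; with the `openai` package it typically surfaces as a `BadRequestError`. A sketch:

```python
from openai import OpenAI, BadRequestError

client = OpenAI(base_url="http://localhost:31090/v1", api_key="EMPTY")
try:
    client.embeddings.create(
        model="intfloat/multilingual-e5-large",
        input="token " * 4_000_000,  # likely beyond max_embed_len=3072000
    )
except BadRequestError as err:
    print(err)  # carries the server's "maximum embedding input length" message
```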
### Debug Information
Server logs show chunked processing activity:
```text
INFO: Input length 150000 exceeds max_position_embeddings 4096, will use chunked processing
INFO: Split input of 150000 tokens into 37 chunks (max_chunk_size: 4096)
```
## 🤝 Contributing
To extend chunked processing support to other embedding models:
- Check model compatibility with the pooling architecture
- Test with various text lengths
- Validate embedding quality compared to single-chunk processing
- Submit a PR with test cases and documentation updates
## 🆕 Enhanced Features

### `max_embed_len` Parameter

The new `max_embed_len` parameter provides:
- **Simplified Configuration**: No need for the `VLLM_ALLOW_LONG_MAX_MODEL_LEN` environment variable
- **Flexible Input Validation**: Accept inputs longer than `max_model_len`, up to `max_embed_len`
- **Extreme Length Support**: Process documents with millions of tokens
- **Clear Error Messages**: Better feedback when inputs exceed limits
- **Backward Compatibility**: Existing configurations continue to work