OpenAI-Compatible Image Parsing: Fixing LangChain Streaming Limitations
Date: August 28, 2025
Issue: OpenAI-compatible APIs returning images in streaming responses not parsed by LangChain.js
Resolution Time: ~6 hours
The Problem
While ChatOllama supported image uploads from users, a critical gap existed in handling AI-generated images from multimodal models. When using OpenAI-compatible APIs (particularly OpenRouter with Gemini models) that return images as part of their responses, these images were completely ignored during streaming chat sessions.
The issue was particularly problematic for users leveraging advanced multimodal models that could generate charts, diagrams, or other visual content. Instead of seeing the generated images, users would only receive text responses, missing crucial visual information that models like Gemini Flash were producing.
This limitation significantly impacted the user experience, especially for:
- Data visualization requests (charts, graphs)
- Diagram generation tasks
- Creative image generation workflows
- Technical documentation with visual aids
Root Cause Investigation
After extensive debugging and API response analysis, we discovered that OpenAI-compatible providers use a different response structure for image content compared to the standard OpenAI format that LangChain.js expected.
The Hidden Response Structure
Most OpenAI-compatible APIs (like OpenRouter) return image content using an `images` field alongside the standard `content` field:
{
  "role": "assistant",
  "content": "Here's the chart you requested: ",
  "images": [
    {
      "type": "image_url",
      "image_url": {
        "url": "data:image/png;base64,iVBORw0KGgo...",
        "detail": "high"
      }
    }
  ]
}
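For contrast, a standard OpenAI chat completion message carries all of its payload in `content`, which is the only shape LangChain.js anticipated here:

{
  "role": "assistant",
  "content": "Here's the chart you requested: "
}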
However, LangChain.js streaming processors only handled these fields:
- ✅ `content` field (text content)
- ✅ `tool_calls` field (function calls)
- ✅ `function_call` field (legacy function calls)
- ✅ `audio` field (audio content)
- ❌ `images` field (completely ignored)
The core issue was in two critical functions within the LangChain OpenAI chat models:
- `_convertCompletionsDeltaToBaseMessageChunk()` - for streaming responses
- `_convertCompletionsMessageToBaseMessage()` - for non-streaming responses
Both functions simply discarded any `images` field data, causing visual content to vanish from the final message.
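For reference, the delta shape being parsed can be modeled roughly as follows. This is a TypeScript sketch inferred from captured OpenRouter responses, not an official SDK type:

// Illustrative types only - inferred from observed responses, not from an SDK
interface ImageUrlPart {
  type: "image_url"
  image_url: {
    url: string     // typically a data URL: "data:image/png;base64,..."
    detail?: string // e.g. "high"
  }
}

interface DeltaWithImages {
  role?: "assistant"
  content?: string | null
  images?: ImageUrlPart[] // the non-standard field LangChain.js discards
}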
The Fix Implementation
Step-by-Step Implementation Guide
To implement this fix in your own project, you'll need to make changes to three key areas:
- Custom LangChain OpenAI Chat Model - parse the `images` field from API responses
- Server Endpoint - extract and handle multimodal content
- Frontend Components - display parsed images
Step 1: Create Custom LangChain Implementation
Since this was a fundamental limitation in LangChain.js itself, we created a customized version of the OpenAI chat models at `server/models/openai/chat_models.ts`.
Required Changes:
1.1. Enhanced Streaming Delta Processing
Find the `_convertCompletionsDeltaToBaseMessageChunk()` method in your LangChain OpenAI chat model and modify it:
Before (Original LangChain):
const content = delta.content ?? ""
After (Fixed):
let content = delta.content ?? ""

// Handle images field that might contain image_url content
if (delta.images && Array.isArray(delta.images)) {
  // Convert content to array format if it's a string and there are images
  if (typeof content === "string") {
    const contentArray = []
    if (content) {
      contentArray.push({ type: "text", text: content })
    }
    // Add image content from the images field
    for (const image of delta.images) {
      if (image.type === "image_url" && image.image_url) {
        contentArray.push({
          type: "image_url",
          image_url: image.image_url,
        })
      }
    }
    content = contentArray
  }
}
1.2. Enhanced Non-Streaming Message Processing
Find the `_convertCompletionsMessageToBaseMessage()` method and modify it:
Before (Original LangChain):
return new AIMessage({
  content: message.content || "",
  // ... other fields
})
After (Fixed):
// Handle images field that might contain image_url content
let content = message.content || ""
if (message.images && Array.isArray(message.images)) {
  // Convert content to array format if it's a string and there are images
  if (typeof content === "string") {
    const contentArray = []
    if (content) {
      contentArray.push({ type: "text", text: content })
    }
    // Add image content from the images field
    for (const image of message.images) {
      if (image.type === "image_url" && image.image_url) {
        contentArray.push({
          type: "image_url",
          image_url: image.image_url,
        })
      }
    }
    content = contentArray
  }
}

return new AIMessage({
  content,
  // ... other fields
})
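Once the custom file is in place, point your model construction at it instead of `@langchain/openai`. A wiring sketch, assuming your copy keeps the upstream `ChatOpenAI` export (the import path, base URL, and model id below are illustrative):

// Sketch: using the patched chat model with an OpenAI-compatible provider.
// Adjust the import path, base URL, and model id to your setup.
import { ChatOpenAI } from '~/server/models/openai/chat_models'

const model = new ChatOpenAI({
  apiKey: process.env.OPENROUTER_API_KEY,
  model: 'google/gemini-flash-1.5',
  configuration: { baseURL: 'https://openrouter.ai/api/v1' },
  streaming: true,
})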
Step 2: Update Server Endpoint Content Processing
Modify your chat endpoint to extract and handle multimodal content from the enhanced LangChain implementation:
File: `server/api/models/chat/index.post.ts` (or your equivalent)
Add this new function:
import type { BaseMessageChunk } from '@langchain/core/messages'

const extractContentFromChunk = (chunk: BaseMessageChunk): { text: string; images: any[] } => {
  const content = chunk?.content
  let textContent = ''
  let images: any[] = []

  // Handle array content (multimodal)
  if (Array.isArray(content)) {
    // Extract text content
    textContent = content
      .filter(item => item.type === 'text_delta' || item.type === 'text')
      .map(item => ('text' in item ? item.text : ''))
      .join('')

    // Extract image content
    images = content
      .filter(item => item.type === 'image_url' && item.image_url?.url)
      .map(item => ({ type: 'image_url', image_url: item.image_url }))
  } else {
    // Handle string content
    textContent = content || ''
  }

  return { text: textContent, images }
}
Update your streaming logic:
// Call the new helper wherever you previously read chunk.content directly
const { text, images } = extractContentFromChunk(chunk)

// Handle both text and images in your response
if (accumulatedImages.length > 0) {
  const contentArray: MessageContent[] = []
  if (accumulatedTextContent) {
    contentArray.push({ type: 'text', text: accumulatedTextContent })
  }
  contentArray.push(...accumulatedImages)
  contentToStream = contentArray
} else {
  contentToStream = accumulatedTextContent
}
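The snippet above assumes accumulators that persist across chunks. A minimal sketch of the surrounding loop, reusing the names from these snippets (`model` and `messages` are assumed to exist in your handler):

// Accumulate text and images across streamed chunks (sketch)
let accumulatedTextContent = ''
const accumulatedImages: any[] = []

for await (const chunk of await model.stream(messages)) {
  const { text, images } = extractContentFromChunk(chunk)
  accumulatedTextContent += text
  if (images.length > 0) {
    accumulatedImages.push(...images) // preserve arrival order
  }
  // ... emit the incremental update to the client here
}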
Step 3: Frontend Image Display Implementation
Ensure your frontend components can extract and display images from the multimodal content:
File: `components/ChatMessageItem.vue` (or your equivalent)
Add image extraction logic:
const messageImages = computed(() => {
  const content = props.message.content
  if (!content || !Array.isArray(content)) return []

  return content
    .filter(item => item.type === 'image_url' && item.image_url?.url)
    .map(item => item.image_url!.url)
})
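The template below also renders a `messageContent` value for the text portion. If your component doesn't already define one, a minimal sketch, assuming the same multimodal content shape as `messageImages`:

// Companion computed: extracts the text portion for markdown rendering
const messageContent = computed(() => {
  const content = props.message.content
  if (typeof content === 'string') return content
  if (!Array.isArray(content)) return ''
  return content
    .filter(item => item.type === 'text')
    .map(item => item.text ?? '')
    .join('')
})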
Update your template to display images:
<template>
  <!-- Text content -->
  <div v-if="messageContent" v-html="markdown.render(messageContent)" />

  <!-- Image gallery -->
  <div v-if="messageImages.length > 0" class="image-gallery">
    <img v-for="(url, index) in messageImages"
         :key="index"
         :src="url"
         :alt="`Image ${index + 1}`"
         class="rounded-lg max-h-64 object-contain" />
  </div>
</template>
Add basic CSS for image display:
.image-gallery {
  display: grid;
  grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
  gap: 0.5rem;
  margin-top: 0.75rem;
}

.image-gallery img {
  width: 100%;
  height: auto;
  background: var(--color-gray-100);
  cursor: pointer;
}
Comprehensive Testing Strategy
We implemented extensive testing to ensure robustness across different scenarios:
Test Coverage:
- ✅ Text with single image - proper array conversion
- ✅ Multiple images - maintains correct order and structure
- ✅ Images only (empty content) - works without text content
- ✅ Backward compatibility - no breaking changes for standard responses
- ✅ Invalid image objects - graceful error handling
- ✅ Empty images array - handles edge cases properly
- ✅ Malformed data - robust error handling for invalid inputs
Validation Commands:
npx tsx server/models/openai/tests/validate-core-logic.ts
npx tsx server/models/openai/tests/validate-image-url-parsing.ts
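If you want to mirror those scripts without the repository layout, a self-contained sanity check of the Step 1.1 conversion might look like this (the file name and contents are illustrative, not the actual test files):

// validate-image-parsing.ts (hypothetical name) - run with: npx tsx validate-image-parsing.ts
import assert from 'node:assert'

const delta = {
  content: 'Chart: ',
  images: [{ type: 'image_url', image_url: { url: 'data:image/png;base64,AAAA' } }],
}

// Re-implements the Step 1.1 logic in isolation
let content: string | any[] = delta.content ?? ''
if (Array.isArray(delta.images) && typeof content === 'string') {
  const parts: any[] = content ? [{ type: 'text', text: content }] : []
  for (const image of delta.images) {
    if (image.type === 'image_url' && image.image_url) {
      parts.push({ type: 'image_url', image_url: image.image_url })
    }
  }
  content = parts
}

assert(Array.isArray(content), 'content should become a multimodal array')
assert.strictEqual(content.length, 2)
assert.strictEqual(content[0].text, 'Chart: ')
assert.strictEqual(content[1].type, 'image_url')
console.log('image parsing logic: OK')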
Content Format Transformation
The fix intelligently transforms API responses into LangChain-compatible multimodal content:
Input (OpenAI-Compatible API):
{
  "content": "Here are two visualizations: ",
  "images": [
    {
      "type": "image_url",
      "image_url": { "url": "data:image/png;base64,chart1..." }
    },
    {
      "type": "image_url",
      "image_url": { "url": "data:image/png;base64,chart2..." }
    }
  ]
}
Output (LangChain Message):
[
  { "type": "text", "text": "Here are two visualizations: " },
  { "type": "image_url", "image_url": { "url": "data:image/png;base64,chart1..." } },
  { "type": "image_url", "image_url": { "url": "data:image/png;base64,chart2..." } }
]
Lessons Learned
This implementation taught us several valuable lessons about working with evolving AI APIs:
API Standardization is Still Evolving: Different OpenAI-compatible providers use varying response formats for multimodal content. Being adaptable to these differences is crucial for maintaining broad compatibility.
Custom LangChain Implementations Have Value: While staying close to upstream LangChain is generally preferred, sometimes specific use cases require custom implementations to unlock functionality that standard libraries don't yet support.
Robust Testing Prevents Regressions: Comprehensive edge case testing was essential, especially when dealing with the variety of response formats from different API providers.
Backward Compatibility is Non-Negotiable: Any changes to core message processing must maintain 100% backward compatibility to avoid breaking existing workflows.
Impact and Results
The implementation delivers significant improvements to ChatOllama's multimodal capabilities:
Immediate Benefits:
- Full Multimodal Support: Users can now see AI-generated images from models like Gemini Flash
- Enhanced Visualizations: Data charts, diagrams, and creative images display properly
- API Provider Flexibility: Works seamlessly with OpenRouter, OpenAI, and other compatible providers
- Zero Breaking Changes: Existing text-only workflows remain completely unaffected
Technical Improvements:
- Streaming Performance: Images appear in real-time as they're generated
- Memory Efficiency: Optimized processing only activates when images are present
- Error Resilience: Graceful handling of malformed or incomplete image data
- Future-Proof Architecture: Ready for additional multimodal content types
Real-World Usage Examples
This fix enables powerful new workflows:
// User Request: "Create a bar chart showing Q4 sales data"
// API Response: Mixed text + generated image
{
  "role": "assistant",
  "content": "Here's your Q4 sales visualization:",
  "images": [{
    "type": "image_url",
    "image_url": {
      "url": "data:image/png;base64,<chart_data>",
      "detail": "high"
    }
  }]
}
// ChatOllama Now Displays: Text + Interactive Image
Quick Implementation Checklist
For developers implementing this fix:
Required Files to Modify:
1. `server/models/openai/chat_models.ts` (or copy from `@langchain/openai`)
   - Add image parsing to `_convertCompletionsDeltaToBaseMessageChunk()`
   - Add image parsing to `_convertCompletionsMessageToBaseMessage()`
2. `server/api/models/chat/index.post.ts` (your chat endpoint)
   - Update the `extractContentFromChunk()` function
   - Handle multimodal content in the streaming logic
3. `components/ChatMessageItem.vue` (your message component)
   - Add the `messageImages` computed property
   - Update the template with the image gallery
   - Add CSS for image display
Key Code Patterns to Look For:
Problem Indicators:
// ❌ Only handles text content
const content = delta.content ?? ""

// ❌ Ignores the images field completely
return new AIMessage({ content: message.content })

Solution Patterns:
// ✅ Handles both text and images
if (delta.images && Array.isArray(delta.images)) {
  // Convert to multimodal array format
}

// ✅ Extracts images from multimodal content
return content
  .filter(item => item.type === 'image_url' && item.image_url?.url)
  .map(item => item.image_url!.url)
Testing Your Implementation:
- Test with OpenRouter + Gemini Flash (known to return the `images` field)
- Verify both streaming and non-streaming responses
- Check multiple images in a single response
- Ensure backward compatibility with text-only responses
This fix enables full multimodal support for OpenAI-compatible APIs that use the `images` response field. By implementing these three key changes, you can unlock image generation capabilities in your LangChain.js-based chat applications.