OpenAI-Compatible Image Parsing: Fixing LangChain Streaming Limitations

Date: August 28, 2025
Issue: OpenAI-compatible APIs returning images in streaming responses not parsed by LangChain.js
Resolution Time: ~6 hours

🐛 The Problem

While ChatOllama supported image uploads from users, a critical gap existed in handling AI-generated images from multimodal models. When using OpenAI-compatible APIs (particularly OpenRouter with Gemini models) that return images as part of their responses, these images were completely ignored during streaming chat sessions.

The issue was particularly problematic for users leveraging advanced multimodal models that could generate charts, diagrams, or other visual content. Instead of seeing the generated images, users would only receive text responses, missing crucial visual information that models like Gemini Flash were producing.

This limitation significantly impacted the user experience, especially for:

Data visualization requests (charts, graphs)
Diagram generation tasks
Creative image generation workflows
Technical documentation with visual aids

🔍 Root Cause Investigation

After extensive debugging and API response analysis, we discovered that OpenAI-compatible providers use a different response structure for image content compared to the standard OpenAI format that LangChain.js expected.

The Hidden Response Structure

Most OpenAI-compatible APIs (like OpenRouter) return image content using an images field alongside the standard content field:

{
  "role": "assistant",
  "content": "Here's the chart you requested: ",
  "images": [
    {
      "type": "image_url",
      "image_url": {
        "url": "data:image/png;base64,iVBORw0KGgo...",
        "detail": "high"
      }
    }
  ]
}

However, LangChain.js streaming processors only handled these fields:

✅ content field (text content)
✅ tool_calls field (function calls)
✅ function_call field (legacy function calls)
✅ audio field (audio content)
❌ images field (completely ignored)

The core issue was in two critical functions within the LangChain OpenAI chat models:

_convertCompletionsDeltaToBaseMessageChunk() - For streaming responses
_convertCompletionsMessageToBaseMessage() - For non-streaming responses

Both functions simply discarded any images field data, causing visual content to vanish from the final message.

🔧 The Fix Implementation

Step-by-Step Implementation Guide

To implement this fix in your own project, you’ll need to make changes to three key areas:

Custom LangChain OpenAI Chat Model - Parse images field from API responses
Server Endpoint - Extract and handle multimodal content
Frontend Components - Display parsed images

Step 1: Create Custom LangChain Implementation

Since this was a fundamental limitation in LangChain.js itself, we created a customized version of the OpenAI chat models at server/models/openai/chat_models.ts.

Required Changes:

1.1. Enhanced Streaming Delta Processing

Find the _convertCompletionsDeltaToBaseMessageChunk() method in your LangChain OpenAI chat model and modify it:

Before (Original LangChain):

const content = delta.content ?? ""

After (Fixed):

let content = delta.content ?? ""

// Handle images field that might contain image_url content
if (delta.images && Array.isArray(delta.images)) {
  // Convert content to array format if it's a string and there are images
  if (typeof content === "string") {
    const contentArray = []
    if (content) {
      contentArray.push({ type: "text", text: content })
    }
    // Add image content from the images field
    for (const image of delta.images) {
      if (image.type === "image_url" && image.image_url) {
        contentArray.push({
          type: "image_url",
          image_url: image.image_url,
        })
      }
    }
    content = contentArray
  }
}

1.2. Enhanced Non-Streaming Message Processing

Find the _convertCompletionsMessageToBaseMessage() method and modify it:

Before (Original LangChain):

return new AIMessage({
  content: message.content || "",
  // ... other fields
})

After (Fixed):

// Handle images field that might contain image_url content
let content = message.content || ""
if (message.images && Array.isArray(message.images)) {
  // Convert content to array format if it's a string and there are images
  if (typeof content === "string") {
    const contentArray = []
    if (content) {
      contentArray.push({ type: "text", text: content })
    }
    // Add image content from the images field
    for (const image of message.images) {
      if (image.type === "image_url" && image.image_url) {
        contentArray.push({
          type: "image_url",
          image_url: image.image_url,
        })
      }
    }
    content = contentArray
  }
}

return new AIMessage({
  content,
  // ... other fields
})

Step 2: Update Server Endpoint Content Processing

Modify your chat endpoint to extract and handle multimodal content from the enhanced LangChain implementation:

File: server/api/models/chat/index.post.ts (or your equivalent)

Add this new function:

const extractContentFromChunk = (chunk: BaseMessageChunk): { text: string; images: any[] } => {
  let content = chunk?.content
  let textContent = ''
  let images: any[] = []

  // Handle array content (multimodal)
  if (Array.isArray(content)) {
    // Extract text content
    textContent = content
      .filter(item => item.type === 'text_delta' || item.type === 'text')
      .map(item => ('text' in item ? item.text : ''))
      .join('')

    // Extract image content
    images = content
      .filter(item => item.type === 'image_url' && item.image_url?.url)
      .map(item => ({ type: 'image_url', image_url: item.image_url }))
  } else {
    // Handle string content
    textContent = content || ''
  }

  return { text: textContent, images }
}

Update your streaming logic:

// Replace existing extractContentFromChunk calls
const { text, images } = extractContentFromChunk(chunk)

// Handle both text and images in your response
if (accumulatedImages.length > 0) {
  const contentArray: MessageContent[] = []
  if (accumulatedTextContent) {
    contentArray.push({ type: 'text', text: accumulatedTextContent })
  }
  contentArray.push(...accumulatedImages)
  contentToStream = contentArray
} else {
  contentToStream = accumulatedTextContent
}

Step 3: Frontend Image Display Implementation

Ensure your frontend components can extract and display images from the multimodal content:

File: components/ChatMessageItem.vue (or your equivalent)

Add image extraction logic:

const messageImages = computed(() => {
  const content = props.message.content
  if (!content || !Array.isArray(content)) return []

  return content
    .filter(item => item.type === 'image_url' && item.image_url?.url)
    .map(item => item.image_url!.url)
})

Update your template to display images:

<template>
  <!-- Text content -->
  <div v-if="messageContent" v-html="markdown.render(messageContent)" />
  
  <!-- Image gallery -->
  <div v-if="messageImages.length > 0" class="image-gallery">
    <img v-for="(url, index) in messageImages"
         :key="index"
         :src="url"
         :alt="`Image ${index + 1}`"
         class="rounded-lg max-h-64 object-contain" />
  </div>
</template>

Add basic CSS for image display:

.image-gallery {
  display: grid;
  grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
  gap: 0.5rem;
  margin-top: 0.75rem;
}

.image-gallery img {
  width: 100%;
  height: auto;
  background: var(--color-gray-100);
  cursor: pointer;
}

🧪 Comprehensive Testing Strategy

We implemented extensive testing to ensure robustness across different scenarios:

Test Coverage:

✅ Text with single image - Proper array conversion
✅ Multiple images - Maintains correct order and structure
✅ Images only (empty content) - Works without text content
✅ Backward compatibility - No breaking changes for standard responses
✅ Invalid image objects - Graceful error handling
✅ Empty images array - Handles edge cases properly
✅ Malformed data - Robust error handling for invalid inputs

Validation Commands:

npx tsx server/models/openai/tests/validate-core-logic.ts
npx tsx server/models/openai/tests/validate-image-url-parsing.ts

🎯 Content Format Transformation

The fix intelligently transforms API responses into LangChain-compatible multimodal content:

Input (OpenAI-Compatible API):

{
  "content": "Here are two visualizations: ",
  "images": [
    { 
      "type": "image_url", 
      "image_url": { "url": "data:image/png;base64,chart1..." } 
    },
    { 
      "type": "image_url", 
      "image_url": { "url": "data:image/png;base64,chart2..." } 
    }
  ]
}

Output (LangChain Message):

[
  { "type": "text", "text": "Here are two visualizations: " },
  { "type": "image_url", "image_url": { "url": "data:image/png;base64,chart1..." } },
  { "type": "image_url", "image_url": { "url": "data:image/png;base64,chart2..." } }
]

📚 Lessons Learned

This implementation taught us several valuable lessons about working with evolving AI APIs:

API Standardization is Still Evolving: Different OpenAI-compatible providers use varying response formats for multimodal content. Being adaptable to these differences is crucial for maintaining broad compatibility.

Custom LangChain Implementations Have Value: While staying close to upstream LangChain is generally preferred, sometimes specific use cases require custom implementations to unlock functionality that standard libraries don’t yet support.

Robust Testing Prevents Regressions: Comprehensive edge case testing was essential, especially when dealing with the variety of response formats from different API providers.

Backward Compatibility is Non-Negotiable: Any changes to core message processing must maintain 100% backward compatibility to avoid breaking existing workflows.

🚀 Impact and Results

The implementation delivers significant improvements to ChatOllama’s multimodal capabilities:

Immediate Benefits:

Full Multimodal Support: Users can now see AI-generated images from models like Gemini Flash
Enhanced Visualizations: Data charts, diagrams, and creative images display properly
API Provider Flexibility: Works seamlessly with OpenRouter, OpenAI, and other compatible providers
Zero Breaking Changes: Existing text-only workflows remain completely unaffected

Technical Improvements:

Streaming Performance: Images appear in real-time as they’re generated
Memory Efficiency: Optimized processing only activates when images are present
Error Resilience: Graceful handling of malformed or incomplete image data
Future-Proof Architecture: Ready for additional multimodal content types

💡 Real-World Usage Examples

This fix enables powerful new workflows:

// User Request: "Create a bar chart showing Q4 sales data"
// API Response: Mixed text + generated image
{
  "role": "assistant", 
  "content": "Here's your Q4 sales visualization:",
  "images": [{
    "type": "image_url",
    "image_url": {
      "url": "data:image/png;base64,<chart_data>",
      "detail": "high"
    }
  }]
}

// ChatOllama Now Displays: Text + Interactive Image

🚀 Quick Implementation Checklist

For developers implementing this fix:

✅ Required Files to Modify:

server/models/openai/chat_models.ts (or copy from @langchain/openai)
- ✅ Add image parsing to _convertCompletionsDeltaToBaseMessageChunk()
- ✅ Add image parsing to _convertCompletionsMessageToBaseMessage()
server/api/models/chat/index.post.ts (your chat endpoint)
- ✅ Update extractContentFromChunk() function
- ✅ Handle multimodal content in streaming logic
components/ChatMessageItem.vue (your message component)
- ✅ Add messageImages computed property
- ✅ Update template with image gallery
- ✅ Add CSS for image display

✅ Key Code Patterns to Look For:

Problem Indicators:

// ❌ Only handles text content
const content = delta.content ?? ""

// ❌ Ignores images field completely  
return new AIMessage({ content: message.content })

Solution Patterns:

// ✅ Handles both text and images
if (delta.images && Array.isArray(delta.images)) {
  // Convert to multimodal array format
}

// ✅ Extracts images from multimodal content
return content
  .filter(item => item.type === 'image_url' && item.image_url?.url)
  .map(item => item.image_url!.url)

✅ Testing Your Implementation:

Test with OpenRouter + Gemini Flash (known to return images field)
Verify both streaming and non-streaming responses
Check multiple images in single response
Ensure backward compatibility with text-only responses

This fix enables full multimodal support for OpenAI-compatible APIs that use the images response field. By implementing these three key changes, you can unlock image generation capabilities in your LangChain.js-based chat applications.