
👁️ AI Vision Analysis Made Simple

Right now, you have chat, image generation, audio, files, and speech working in your application. But what if your AI could also see and understand the images your users upload?

Vision analysis makes that possible. Instead of manually reviewing screenshots, documents, or charts, users can upload any image and get instant AI-powered insights and data extraction.

You’re about to learn exactly how to add intelligent vision processing to your existing application.


🧠 Step 1: Understanding AI Vision Analysis


Before we write any code, let’s understand what AI vision analysis actually means and why it’s useful for your applications.

AI vision analysis is like having a professional visual analyst inside your application. Users upload any image - screenshots, documents, charts, photos - and the AI reads, understands, and extracts meaningful insights automatically.

Real-world analogy: It’s like hiring a team of specialists who can instantly look at any visual content and give you detailed analysis, extract key data, and provide actionable insights. Instead of spending time manually reviewing images, you upload them and get professional analysis in seconds.

Think about all the times you or your users need to analyze visual content:

  • Business documents need OCR and data extraction
  • Charts and graphs need data interpretation and trend analysis
  • Screenshots need UI analysis and improvement suggestions
  • Photos need object recognition and content analysis
  • Dashboards need KPI extraction and performance insights

Without AI vision analysis, you’d need to:

  1. Manually examine every image (time-consuming)
  2. Extract data points by hand (error-prone)
  3. Miss important visual patterns (limiting)
  4. Process one image at a time (inefficient)

With AI vision analysis, you just upload any image and get intelligent insights instantly.

Your vision analyzer will support all major analysis modes:

📄 Document Analysis - OCR, text extraction, data processing

  • Best for: Reports, invoices, forms, contracts
  • AI extracts: Text content, key data points, structured information

📊 Chart Analysis - Data visualization interpretation

  • Best for: Graphs, charts, data visualizations
  • AI extracts: Numerical data, trends, insights, patterns

🎯 General Analysis - Comprehensive visual understanding

  • Best for: Screenshots, photos, general images
  • AI extracts: Objects, context, descriptions, recommendations

We’ll start with a unified approach that can handle any type of visual content intelligently.
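
To make this concrete before we build it, here’s a sketch of the request contract the rest of this guide implements; the field names match the endpoint code in Step 2 and the React component in Step 3:

// The vision endpoint accepts multipart/form-data with these fields:
//   image        - the uploaded file (JPEG, PNG, WebP, or GIF)
//   analysisType - "document" | "chart" | "general"
//   includeOCR   - whether to extract readable text
//   extractData  - whether to pull out numerical/structured data
const formData = new FormData();
formData.append("image", file); // `file` is a File from an <input type="file">
formData.append("analysisType", "chart");
formData.append("includeOCR", "true");
formData.append("extractData", "true");
// POST formData to /api/vision/analyze (the full fetch call appears in Step 3)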


🔧 Step 2: Adding Vision Analysis to Your Backend


Let’s add vision analysis to your existing backend using the same patterns you learned in previous modules. We’ll add a new route that handles image uploads and AI analysis.

Building on your foundation: You already have a working Node.js server with OpenAI integration. We’re simply adding vision capabilities to what you’ve built.

Step 2A: Understanding Vision Analysis State


Before writing code, let’s understand what data our vision analysis system needs to manage:

// 🧠 VISION ANALYSIS STATE CONCEPTS:
// 1. Image Upload - The uploaded image data and metadata
// 2. Analysis Type - Document, chart, or general analysis mode
// 3. Vision Settings - OCR, data extraction, detail level
// 4. AI Results - Processed insights and extracted information
// 5. Error States - Invalid images, processing failures, file size limits

Key vision analysis concepts:

  • Image Processing: Different analysis approaches for documents vs photos
  • GPT-4o Vision: Using OpenAI’s vision model for image understanding
  • Analysis Modes: OCR-focused, data extraction, or general analysis
  • Results Structure: Organized output that’s easy to display

Step 2B: Creating the Vision Analysis Endpoint

First, add the image processing dependency to your backend. In your backend folder, run:

npm install sharp

What this package does:

  • sharp: Optimizes images for better AI analysis and smaller file sizes
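
If you want to verify the install before wiring sharp into your server, you can run a tiny standalone check (a sketch; it assumes your backend uses ES modules and that a local image, here hypothetically named test-image.jpg, exists):

// sharp-check.js - confirm sharp loads and can read an image (run: node sharp-check.js)
import sharp from 'sharp';

const metadata = await sharp('test-image.jpg').metadata();
console.log(`Format: ${metadata.format}, ${metadata.width}x${metadata.height}`);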

Add this new endpoint to your existing index.js file, right after your text-to-speech routes:

import sharp from 'sharp'; // Place this import at the top of index.js with your other imports

// 👁️ VISION ANALYSIS ENDPOINT: Add this to your existing server
app.post("/api/vision/analyze", upload.single("image"), async (req, res) => {
  try {
    // 🛡️ VALIDATION: Check if image was uploaded
    const uploadedImage = req.file;
    const { analysisType = "general" } = req.body;
    // Multipart form fields arrive as strings, so parse the booleans explicitly
    const includeOCR = req.body.includeOCR !== "false";
    const extractData = req.body.extractData !== "false";

    if (!uploadedImage) {
      return res.status(400).json({
        error: "Image file is required",
        success: false
      });
    }

    console.log(`👁️ Analyzing: ${uploadedImage.originalname} (${uploadedImage.size} bytes)`);

    // 🖼️ IMAGE OPTIMIZATION: Prepare image for vision analysis
    const optimizedImage = await optimizeImageForVision(uploadedImage.buffer);
    const base64Image = optimizedImage.toString('base64');
    // The optimizer re-encodes to JPEG, so the data URL declares image/jpeg
    // (on optimization failure it falls back to the original buffer; JPEG is the common case)
    const imageUrl = `data:image/jpeg;base64,${base64Image}`;

    // 🔍 ANALYSIS PROMPT: Generate appropriate prompt based on type
    const analysisPrompt = generateVisionPrompt(analysisType, includeOCR, extractData);

    // 🤖 AI VISION ANALYSIS: Process with GPT-4o via the Responses API
    const response = await openai.responses.create({
      model: "gpt-4o",
      input: [
        {
          role: "system",
          content: analysisPrompt.systemPrompt
        },
        {
          role: "user",
          content: [
            {
              type: "input_text",
              text: analysisPrompt.userPrompt
            },
            {
              type: "input_image",
              image_url: imageUrl,
              detail: "high"
            }
          ]
        }
      ]
    });

    // 📤 SUCCESS RESPONSE: Send analysis results
    res.json({
      success: true,
      file_info: {
        name: uploadedImage.originalname,
        size: uploadedImage.size,
        type: uploadedImage.mimetype
      },
      analysis: {
        type: analysisType,
        include_ocr: includeOCR,
        extract_data: extractData,
        result: response.output_text,
        model: "gpt-4o"
      },
      timestamp: new Date().toISOString()
    });
  } catch (error) {
    // 🚨 ERROR HANDLING: Handle analysis failures
    console.error("Vision analysis error:", error);
    res.status(500).json({
      error: "Failed to analyze image",
      details: error.message,
      success: false
    });
  }
});

// 🔧 HELPER FUNCTIONS: Vision analysis utilities

// Optimize image for better vision analysis
const optimizeImageForVision = async (imageBuffer) => {
  try {
    // Resize large images for better processing
    const optimized = await sharp(imageBuffer)
      .resize(2048, 2048, {
        fit: 'inside',
        withoutEnlargement: true
      })
      .jpeg({ quality: 85 })
      .toBuffer();
    return optimized;
  } catch (error) {
    console.error('Image optimization error:', error);
    return imageBuffer; // Return original if optimization fails
  }
};

// Generate analysis prompts based on type
const generateVisionPrompt = (analysisType, includeOCR, extractData) => {
  const baseSystem = "You are a professional visual analyst with expertise in document analysis, data extraction, and image understanding.";

  switch (analysisType) {
    case 'document':
      return {
        systemPrompt: `${baseSystem} You specialize in document analysis, OCR, and text extraction.`,
        userPrompt: `Analyze this document image with focus on:
1. TEXT EXTRACTION: ${includeOCR ? 'Extract all readable text content using OCR' : 'Summarize visible text content'}
2. DOCUMENT STRUCTURE: Identify document type, layout, and organization
3. KEY DATA: Extract important numbers, dates, names, and values
4. INSIGHTS: Provide analysis of the document's purpose and key information
Provide clear, structured analysis that's easy to understand.`
      };
    case 'chart':
      return {
        systemPrompt: `${baseSystem} You specialize in chart analysis, data visualization interpretation, and trend analysis.`,
        userPrompt: `Analyze this chart/graph with focus on:
1. CHART TYPE: Identify the type of visualization (bar, line, pie, etc.)
2. DATA EXTRACTION: ${extractData ? 'Extract specific numerical values and data points' : 'Summarize key trends and patterns'}
3. TRENDS: Identify patterns, trends, and significant changes
4. INSIGHTS: Provide business intelligence and actionable insights
Focus on accuracy and clear interpretation of the visual data.`
      };
    default: // general
      return {
        systemPrompt: `${baseSystem} You provide comprehensive visual analysis for any type of image.`,
        userPrompt: `Analyze this image comprehensively:
1. CONTENT DESCRIPTION: What do you see in this image?
2. KEY ELEMENTS: Important objects, text, or data visible
3. CONTEXT ANALYSIS: Purpose, setting, or business context
4. ACTIONABLE INSIGHTS: Useful observations or recommendations
${includeOCR ? 'Include any readable text content.' : ''}
${extractData ? 'Extract any numerical or structured data visible.' : ''}
Provide practical, useful analysis that helps users understand the image better.`
      };
  }
};

Function breakdown:

  1. Validation - Ensure we have an image to analyze
  2. Image optimization - Prepare image for better AI analysis
  3. Prompt generation - Create appropriate analysis prompts
  4. Vision analysis - Process with GPT-4o vision capabilities
  5. Response formatting - Return structured results with metadata
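
To see the prompt switching in action, you can call the generator directly; for example, chart mode with data extraction enabled:

// Quick check of the generateVisionPrompt helper defined above
const { systemPrompt, userPrompt } = generateVisionPrompt("chart", true, true);
console.log(systemPrompt); // "...You specialize in chart analysis..."
console.log(userPrompt);   // includes "Extract specific numerical values and data points"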

Step 2C: Updating File Upload Configuration

Update your existing multer configuration to handle images:

// Update your existing multer setup to handle images
const upload = multer({
  storage: multer.memoryStorage(),
  limits: {
    fileSize: 25 * 1024 * 1024 // 25MB limit
  },
  fileFilter: (req, file, cb) => {
    // Accept all previous file types PLUS images
    const allowedTypes = [
      'application/pdf',
      'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
      'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
      'text/plain',
      'text/csv',
      'application/json',
      'text/javascript',
      'text/x-python',
      'audio/wav',
      'audio/mp3',
      'audio/mpeg',
      'audio/mp4',
      'audio/webm',
      'image/jpeg', // Add image support
      'image/png',  // Add image support
      'image/webp', // Add image support
      'image/gif'   // Add image support
    ];
    const extension = path.extname(file.originalname).toLowerCase();
    const allowedExtensions = ['.pdf', '.docx', '.xlsx', '.csv', '.txt', '.md', '.json', '.js', '.py', '.wav', '.mp3', '.jpeg', '.jpg', '.png', '.webp', '.gif'];

    if (allowedTypes.includes(file.mimetype) || allowedExtensions.includes(extension)) {
      cb(null, true);
    } else {
      cb(new Error('Unsupported file type'), false);
    }
  }
});

Your backend now supports:

  • Text chat (existing functionality)
  • Streaming chat (existing functionality)
  • Image generation (existing functionality)
  • Audio transcription (existing functionality)
  • File analysis (existing functionality)
  • Text-to-speech (existing functionality)
  • Vision analysis (new functionality)

🔧 Step 3: Building the React Vision Component


Now let’s create a React component for vision analysis using the same patterns from your existing components.

Step 3A: Creating the Vision Analysis Component


Create a new file src/VisionAnalysis.jsx:

import { useState, useRef } from "react";
import { Upload, Eye, FileText, BarChart3, Download, Camera } from "lucide-react";

function VisionAnalysis() {
  // 🧠 STATE: Vision analysis data management
  const [selectedImage, setSelectedImage] = useState(null);     // Uploaded image
  const [analysisType, setAnalysisType] = useState("general");  // Analysis mode
  const [isAnalyzing, setIsAnalyzing] = useState(false);        // Processing status
  const [analysisResult, setAnalysisResult] = useState(null);   // Analysis results
  const [error, setError] = useState(null);                     // Error messages
  const [previewUrl, setPreviewUrl] = useState(null);           // Image preview
  const [options, setOptions] = useState({                      // Analysis options
    includeOCR: true,
    extractData: true
  });
  const fileInputRef = useRef(null);

  // 🔧 FUNCTIONS: Vision analysis logic engine

  // Handle image selection
  const handleImageSelect = (event) => {
    const file = event.target.files[0];
    if (file) {
      // Validate file size (25MB limit)
      if (file.size > 25 * 1024 * 1024) {
        setError('Image too large. Maximum size is 25MB.');
        return;
      }
      // Validate file type
      const allowedTypes = ['image/jpeg', 'image/png', 'image/webp', 'image/gif'];
      if (!allowedTypes.includes(file.type)) {
        setError('Unsupported image type. Please upload JPEG, PNG, WebP, or GIF files.');
        return;
      }
      setSelectedImage(file);
      setAnalysisResult(null);
      setError(null);
      // Create preview URL
      const url = URL.createObjectURL(file);
      setPreviewUrl(url);
    }
  };

  // Clear selected image
  const clearImage = () => {
    setSelectedImage(null);
    setAnalysisResult(null);
    setError(null);
    if (previewUrl) {
      URL.revokeObjectURL(previewUrl);
      setPreviewUrl(null);
    }
    if (fileInputRef.current) {
      fileInputRef.current.value = '';
    }
  };

  // Main vision analysis function
  const analyzeImage = async () => {
    // 🛡️ GUARDS: Prevent invalid analysis
    if (!selectedImage || isAnalyzing) return;

    // 🔄 SETUP: Prepare for analysis
    setIsAnalyzing(true);
    setError(null);
    setAnalysisResult(null);

    try {
      // 📤 FORM DATA: Prepare multipart form data
      const formData = new FormData();
      formData.append('image', selectedImage);
      formData.append('analysisType', analysisType);
      formData.append('includeOCR', options.includeOCR);
      formData.append('extractData', options.extractData);

      // 📡 API CALL: Send to your backend
      const response = await fetch("http://localhost:8000/api/vision/analyze", {
        method: "POST",
        body: formData
      });
      const data = await response.json();

      if (!response.ok) {
        throw new Error(data.error || 'Failed to analyze image');
      }

      // ✅ SUCCESS: Store analysis results
      setAnalysisResult(data);
    } catch (error) {
      // 🚨 ERROR HANDLING: Show user-friendly message
      console.error('Vision analysis failed:', error);
      setError(error.message || 'Something went wrong while analyzing the image');
    } finally {
      // 🧹 CLEANUP: Reset processing state
      setIsAnalyzing(false);
    }
  };

  // Download analysis results
  const downloadAnalysis = () => {
    if (!analysisResult) return;
    const element = document.createElement('a');
    const file = new Blob([JSON.stringify(analysisResult, null, 2)], { type: 'application/json' });
    element.href = URL.createObjectURL(file);
    element.download = `vision-analysis-${selectedImage.name}-${Date.now()}.json`;
    document.body.appendChild(element);
    element.click();
    document.body.removeChild(element);
  };

  // Analysis type options
  const analysisTypes = [
    { value: "general", label: "General Analysis", desc: "Comprehensive visual understanding", icon: Eye },
    { value: "document", label: "Document Analysis", desc: "OCR and text extraction focus", icon: FileText },
    { value: "chart", label: "Chart Analysis", desc: "Data visualization interpretation", icon: BarChart3 }
  ];

  // Format file size
  const formatFileSize = (bytes) => {
    if (bytes === 0) return '0 Bytes';
    const k = 1024;
    const sizes = ['Bytes', 'KB', 'MB'];
    const i = Math.floor(Math.log(bytes) / Math.log(k));
    return parseFloat((bytes / Math.pow(k, i)).toFixed(2)) + ' ' + sizes[i];
  };

  // 🎨 UI: Interface components
  return (
    <div className="min-h-screen bg-gradient-to-br from-indigo-50 to-purple-50 flex items-center justify-center p-4">
      <div className="bg-white rounded-2xl shadow-2xl w-full max-w-6xl flex flex-col overflow-hidden">
        {/* Header */}
        <div className="bg-gradient-to-r from-indigo-600 to-purple-600 text-white p-6">
          <div className="flex items-center space-x-3">
            <div className="w-10 h-10 bg-white bg-opacity-20 rounded-full flex items-center justify-center">
              <Eye className="w-5 h-5" />
            </div>
            <div>
              <h1 className="text-xl font-bold">👁️ AI Vision Analysis</h1>
              <p className="text-indigo-100 text-sm">Analyze any image with AI intelligence!</p>
            </div>
          </div>
        </div>

        {/* Analysis Type Selection */}
        <div className="p-6 border-b border-gray-200">
          <h3 className="font-semibold text-gray-900 mb-4 flex items-center">
            <Camera className="w-5 h-5 mr-2 text-indigo-600" />
            Analysis Type
          </h3>
          <div className="grid grid-cols-1 md:grid-cols-3 gap-4">
            {analysisTypes.map((type) => {
              const IconComponent = type.icon;
              return (
                <button
                  key={type.value}
                  onClick={() => setAnalysisType(type.value)}
                  className={`p-4 rounded-lg border-2 text-left transition-all duration-200 ${
                    analysisType === type.value
                      ? 'border-indigo-500 bg-indigo-50 shadow-md'
                      : 'border-gray-200 hover:border-indigo-300 hover:bg-indigo-50'
                  }`}
                >
                  <div className="flex items-center mb-2">
                    <IconComponent className="w-5 h-5 mr-2 text-indigo-600" />
                    <h4 className="font-medium text-gray-900">{type.label}</h4>
                  </div>
                  <p className="text-sm text-gray-600">{type.desc}</p>
                </button>
              );
            })}
          </div>
        </div>

        {/* Analysis Options */}
        <div className="p-6 border-b border-gray-200">
          <h3 className="font-semibold text-gray-900 mb-4">Analysis Options</h3>
          <div className="grid grid-cols-1 md:grid-cols-2 gap-4">
            <label className="flex items-center space-x-3 p-3 rounded-lg border border-gray-200 hover:bg-gray-50 cursor-pointer">
              <input
                type="checkbox"
                checked={options.includeOCR}
                onChange={(e) => setOptions(prev => ({ ...prev, includeOCR: e.target.checked }))}
                className="w-4 h-4 text-indigo-600 rounded focus:ring-indigo-500"
              />
              <div>
                <span className="font-medium text-gray-900">Include OCR</span>
                <p className="text-sm text-gray-600">Extract text content from images</p>
              </div>
            </label>
            <label className="flex items-center space-x-3 p-3 rounded-lg border border-gray-200 hover:bg-gray-50 cursor-pointer">
              <input
                type="checkbox"
                checked={options.extractData}
                onChange={(e) => setOptions(prev => ({ ...prev, extractData: e.target.checked }))}
                className="w-4 h-4 text-indigo-600 rounded focus:ring-indigo-500"
              />
              <div>
                <span className="font-medium text-gray-900">Extract Data</span>
                <p className="text-sm text-gray-600">Find numerical data and structured information</p>
              </div>
            </label>
          </div>
        </div>

        {/* Image Upload Section */}
        <div className="p-6 border-b border-gray-200">
          <h3 className="font-semibold text-gray-900 mb-4 flex items-center">
            <Upload className="w-5 h-5 mr-2 text-indigo-600" />
            Upload Image for Analysis
          </h3>
          {!selectedImage ? (
            <div
              onClick={() => fileInputRef.current?.click()}
              className="border-2 border-dashed border-gray-300 rounded-xl p-8 text-center cursor-pointer hover:border-indigo-400 hover:bg-indigo-50 transition-colors duration-200"
            >
              <Upload className="w-12 h-12 text-gray-400 mx-auto mb-4" />
              <h4 className="text-lg font-semibold text-gray-700 mb-2">Upload Image</h4>
              <p className="text-gray-600 mb-4">
                Support for JPEG, PNG, WebP, and GIF files up to 25MB
              </p>
              <button className="px-6 py-3 bg-gradient-to-r from-indigo-600 to-purple-600 text-white rounded-xl hover:from-indigo-700 hover:to-purple-700 transition-all duration-200 inline-flex items-center space-x-2 shadow-lg">
                <Upload className="w-4 h-4" />
                <span>Choose Image</span>
              </button>
            </div>
          ) : (
            <div className="bg-gray-50 rounded-lg p-4 border border-gray-200">
              <div className="grid grid-cols-1 md:grid-cols-2 gap-4">
                {/* Image Preview */}
                <div>
                  <h4 className="font-medium text-gray-900 mb-2">Preview:</h4>
                  <img
                    src={previewUrl}
                    alt={selectedImage.name}
                    className="w-full h-48 object-cover rounded-lg border border-gray-200"
                  />
                </div>
                {/* Image Info */}
                <div>
                  <div className="flex items-center justify-between mb-4">
                    <div>
                      <h4 className="font-medium text-gray-900">{selectedImage.name}</h4>
                      <p className="text-sm text-gray-600">{formatFileSize(selectedImage.size)}</p>
                    </div>
                    <button
                      onClick={clearImage}
                      className="p-2 text-gray-400 hover:text-red-600 transition-colors duration-200"
                    >
                      ×
                    </button>
                  </div>
                  <button
                    onClick={analyzeImage}
                    disabled={isAnalyzing}
                    className="w-full bg-gradient-to-r from-indigo-600 to-purple-600 hover:from-indigo-700 hover:to-purple-700 disabled:from-gray-300 disabled:to-gray-300 text-white px-6 py-3 rounded-lg transition-all duration-200 flex items-center justify-center space-x-2 shadow-lg disabled:shadow-none"
                  >
                    {isAnalyzing ? (
                      <>
                        <div className="w-4 h-4 border-2 border-white border-t-transparent rounded-full animate-spin"></div>
                        <span>Analyzing...</span>
                      </>
                    ) : (
                      <>
                        <Eye className="w-4 h-4" />
                        <span>Analyze Image</span>
                      </>
                    )}
                  </button>
                </div>
              </div>
            </div>
          )}
          <input
            ref={fileInputRef}
            type="file"
            accept="image/jpeg,image/png,image/webp,image/gif"
            onChange={handleImageSelect}
            className="hidden"
          />
        </div>

        {/* Results Section */}
        <div className="flex-1 p-6">
          {/* Error Display */}
          {error && (
            <div className="bg-red-50 border border-red-200 rounded-lg p-4 mb-4">
              <p className="text-red-700">
                <strong>Error:</strong> {error}
              </p>
            </div>
          )}
          {/* Analysis Results */}
          {analysisResult ? (
            <div className="bg-gray-50 rounded-lg p-4">
              <div className="flex items-center justify-between mb-4">
                <h4 className="font-semibold text-gray-900">Vision Analysis Results</h4>
                <button
                  onClick={downloadAnalysis}
                  className="bg-gradient-to-r from-blue-500 to-blue-600 hover:from-blue-600 hover:to-blue-700 text-white px-4 py-2 rounded-lg transition-all duration-200 flex items-center space-x-2"
                >
                  <Download className="w-4 h-4" />
                  <span>Download</span>
                </button>
              </div>
              <div className="space-y-4">
                {/* File Information */}
                <div className="bg-white rounded-lg p-4">
                  <h5 className="font-medium text-gray-700 mb-2">Image Information:</h5>
                  <div className="grid grid-cols-2 md:grid-cols-4 gap-4 text-sm">
                    <div>
                      <span className="text-gray-600">Name:</span>
                      <p className="font-medium">{analysisResult.file_info.name}</p>
                    </div>
                    <div>
                      <span className="text-gray-600">Size:</span>
                      <p className="font-medium">{formatFileSize(analysisResult.file_info.size)}</p>
                    </div>
                    <div>
                      <span className="text-gray-600">Type:</span>
                      <p className="font-medium">{analysisResult.file_info.type}</p>
                    </div>
                    <div>
                      <span className="text-gray-600">Analysis:</span>
                      <p className="font-medium capitalize">{analysisResult.analysis.type}</p>
                    </div>
                  </div>
                </div>
                {/* Analysis Content */}
                <div className="bg-white rounded-lg p-4">
                  <h5 className="font-medium text-gray-700 mb-2">AI Vision Analysis:</h5>
                  <div className="text-gray-900 leading-relaxed whitespace-pre-wrap max-h-96 overflow-y-auto">
                    {analysisResult.analysis.result}
                  </div>
                </div>
              </div>
            </div>
          ) : !isAnalyzing && !error && (
            // Welcome State
            <div className="text-center py-12">
              <div className="w-16 h-16 bg-indigo-100 rounded-2xl flex items-center justify-center mx-auto mb-4">
                <Eye className="w-8 h-8 text-indigo-600" />
              </div>
              <h3 className="text-lg font-semibold text-gray-700 mb-2">
                Ready to Analyze!
              </h3>
              <p className="text-gray-600 max-w-md mx-auto">
                Upload any image to get AI-powered visual analysis, text extraction, and intelligent insights.
              </p>
            </div>
          )}
        </div>
      </div>
    </div>
  );
}

export default VisionAnalysis;

Step 3B: Adding Vision Analysis to Navigation


Update your src/App.jsx to include the new vision analysis component:

import { useState } from "react";
import StreamingChat from "./StreamingChat";
import ImageGenerator from "./ImageGenerator";
import AudioTranscription from "./AudioTranscription";
import FileAnalysis from "./FileAnalysis";
import TextToSpeech from "./TextToSpeech";
import VisionAnalysis from "./VisionAnalysis";
import { MessageSquare, Image, Mic, Folder, Volume2, Eye } from "lucide-react";

function App() {
  // 🧠 STATE: Navigation management
  const [currentView, setCurrentView] = useState("chat"); // 'chat', 'images', 'audio', 'files', 'speech', or 'vision'

  // 🎨 UI: Main app with navigation
  return (
    <div className="min-h-screen bg-gray-100">
      {/* Navigation Header */}
      <nav className="bg-white shadow-sm border-b border-gray-200">
        <div className="max-w-6xl mx-auto px-4">
          <div className="flex items-center justify-between h-16">
            {/* Logo */}
            <div className="flex items-center space-x-3">
              <div className="w-8 h-8 bg-gradient-to-r from-blue-500 to-purple-600 rounded-lg flex items-center justify-center">
                <span className="text-white font-bold text-sm">AI</span>
              </div>
              <h1 className="text-xl font-bold text-gray-900">OpenAI Mastery</h1>
            </div>
            {/* Navigation Buttons */}
            <div className="flex space-x-2">
              <button
                onClick={() => setCurrentView("chat")}
                className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
                  currentView === "chat"
                    ? "bg-blue-100 text-blue-700 shadow-sm"
                    : "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
                }`}
              >
                <MessageSquare className="w-4 h-4" />
                <span>Chat</span>
              </button>
              <button
                onClick={() => setCurrentView("images")}
                className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
                  currentView === "images"
                    ? "bg-purple-100 text-purple-700 shadow-sm"
                    : "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
                }`}
              >
                <Image className="w-4 h-4" />
                <span>Images</span>
              </button>
              <button
                onClick={() => setCurrentView("audio")}
                className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
                  currentView === "audio"
                    ? "bg-blue-100 text-blue-700 shadow-sm"
                    : "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
                }`}
              >
                <Mic className="w-4 h-4" />
                <span>Audio</span>
              </button>
              <button
                onClick={() => setCurrentView("files")}
                className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
                  currentView === "files"
                    ? "bg-green-100 text-green-700 shadow-sm"
                    : "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
                }`}
              >
                <Folder className="w-4 h-4" />
                <span>Files</span>
              </button>
              <button
                onClick={() => setCurrentView("speech")}
                className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
                  currentView === "speech"
                    ? "bg-orange-100 text-orange-700 shadow-sm"
                    : "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
                }`}
              >
                <Volume2 className="w-4 h-4" />
                <span>Speech</span>
              </button>
              <button
                onClick={() => setCurrentView("vision")}
                className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
                  currentView === "vision"
                    ? "bg-indigo-100 text-indigo-700 shadow-sm"
                    : "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
                }`}
              >
                <Eye className="w-4 h-4" />
                <span>Vision</span>
              </button>
            </div>
          </div>
        </div>
      </nav>
      {/* Main Content */}
      <main className="h-[calc(100vh-4rem)]">
        {currentView === "chat" && <StreamingChat />}
        {currentView === "images" && <ImageGenerator />}
        {currentView === "audio" && <AudioTranscription />}
        {currentView === "files" && <FileAnalysis />}
        {currentView === "speech" && <TextToSpeech />}
        {currentView === "vision" && <VisionAnalysis />}
      </main>
    </div>
  );
}

export default App;

🧪 Step 4: Testing Your Vision Analysis

Let’s test your vision analysis feature step by step to make sure everything works correctly.

First, verify your backend route works by testing it directly:

Test with a simple image:

# Test the endpoint with an image file
curl -X POST http://localhost:8000/api/vision/analyze \
  -F "image=@test-image.jpg" \
  -F "analysisType=general" \
  -F "includeOCR=true" \
  -F "extractData=true"

Expected response:

{
  "success": true,
  "file_info": {
    "name": "test-image.jpg",
    "size": 245678,
    "type": "image/jpeg"
  },
  "analysis": {
    "type": "general",
    "include_ocr": true,
    "extract_data": true,
    "result": "This image shows...",
    "model": "gpt-4o"
  },
  "timestamp": "2024-01-15T10:30:00.000Z"
}
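
If you only want the analysis text in your terminal, you can pipe the same request through jq (assuming jq is installed):

curl -s -X POST http://localhost:8000/api/vision/analyze \
  -F "image=@test-image.jpg" \
  -F "analysisType=general" | jq -r '.analysis.result'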

Start both servers:

Backend (in your backend folder):

npm run dev

Frontend (in your frontend folder):

npm run dev

Test the complete flow:

  1. Navigate to Vision → Click the “Vision” tab in navigation
  2. Select analysis type → Choose “General”, “Document”, or “Chart” analysis
  3. Configure options → Enable OCR or data extraction as needed
  4. Upload an image → Try a screenshot, document, or chart
  5. Analyze → Click “Analyze Image” and see loading state
  6. View results → See AI analysis with image information
  7. Download → Test downloading analysis as JSON file
  8. Switch images → Try different image types and analysis modes

Test error scenarios:

❌ Large image: Upload image larger than 25MB
❌ Wrong type: Upload unsupported file (like .txt or .mp4)
❌ Empty upload: Try to analyze without selecting an image
❌ Corrupt image: Upload damaged image file

Expected behavior:

  • Clear error messages displayed
  • No application crashes
  • User can try again with different image
  • Image upload resets properly after errors
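
One gap worth closing while you test: when an upload exceeds the 25MB limit, multer rejects the request before your route’s try/catch ever runs, so the error falls through to Express. A minimal error-handling middleware (a sketch; add it after your routes in index.js) turns those failures into clean JSON messages:

// 🚨 UPLOAD ERROR HANDLER: catches errors thrown by the multer middleware,
// which never reach the route handler's try/catch
app.use((err, req, res, next) => {
  if (err instanceof multer.MulterError && err.code === 'LIMIT_FILE_SIZE') {
    return res.status(413).json({ error: 'Image too large. Maximum size is 25MB.', success: false });
  }
  if (err.message === 'Unsupported file type') {
    return res.status(400).json({ error: err.message, success: false });
  }
  next(err); // let Express handle anything unexpected
});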

Congratulations! You’ve extended your existing application with complete AI vision analysis:

  • Extended your backend with vision processing and GPT-4o integration
  • Added React vision component following the same patterns as your other features
  • Implemented intelligent image analysis for documents, charts, and general content
  • Created flexible analysis modes with OCR and data extraction options
  • Added download functionality for analysis results
  • Maintained consistent design with your existing application

Your application now has:

  • Text chat with streaming responses
  • Image generation with DALL-E 3 and GPT-Image-1
  • Audio transcription with Whisper voice recognition
  • File analysis with intelligent document processing
  • Text-to-speech with natural voice synthesis
  • Vision analysis with GPT-4o visual intelligence
  • Unified navigation between all features
  • Professional UI with consistent TailwindCSS styling

Complete OpenAI mastery achieved! You now have a comprehensive application that leverages all major OpenAI capabilities in a unified, professional interface. 👁️
