
🔊 AI Text-to-Speech Made Simple

Right now, you have chat, images, audio transcription, and file analysis working in your application. But what if your AI could also speak back to users?

Text-to-speech adds voice capabilities. Instead of just showing text responses, your AI can speak them aloud with natural-sounding voices, creating more engaging and accessible experiences for your users.

You’re about to learn exactly how to add voice synthesis to your existing application.


🧠 Step 1: Understanding AI Text-to-Speech


Before we write any code, let’s understand what AI text-to-speech actually means and why it’s useful for your applications.

AI text-to-speech is like having professional voice actors inside your application. Users can have any text read aloud with natural-sounding voices that have different personalities and speaking styles.

Real-world analogy: It’s like hiring a team of voice actors who can instantly read any text in their unique style. Instead of users reading everything themselves, they can listen while multitasking, or get an audio version for accessibility.

Think about all the times you or your users would benefit from audio:

  • Accessibility for users with visual impairments or reading difficulties
  • Multitasking - users can listen while doing other activities
  • Learning styles - some people learn better by hearing information
  • Content consumption - turn articles into podcasts instantly
  • Hands-free interaction - perfect for mobile or automotive use

Without AI text-to-speech, you’d need to:

  1. Record everything manually (time-consuming and expensive)
  2. Use robotic computer voices (poor user experience)
  3. Miss accessibility opportunities (limiting your audience)
  4. Provide only visual content (excluding audio learners)

With AI text-to-speech, you just send any text and get natural speech instantly.
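In code, that really is a single call. As a rough sketch (the initialized `openai` client from Module 1 is assumed; here we only build and inspect the request object, since the live call needs an API key):

```javascript
// Sketch: the request object passed to OpenAI's speech endpoint,
// mirroring the parameters used later in this module's backend code.
const buildSpeechRequest = (text, voice = "alloy") => ({
  model: "tts-1",         // fast, cost-effective model
  voice,                  // one of the six voices described below
  input: text,            // the text to speak
  response_format: "mp3", // audio container format
});

// With a real client (requires an API key):
// const response = await openai.audio.speech.create(buildSpeechRequest("Hello!"));
// const audioBuffer = Buffer.from(await response.arrayBuffer());

console.log(buildSpeechRequest("Hello, world!").voice); // → "alloy"
```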

OpenAI provides six distinct AI voices, each with its own personality:

🎙️ Alloy - The Professional

  • Best for: Business content, presentations, formal communication
  • Personality: Neutral, clear, and professional
  • Think of it as: Your corporate spokesperson

🌊 Echo - The Calming Voice

  • Best for: Meditation, relaxation content, soothing narration
  • Personality: Calm, gentle, and peaceful
  • Think of it as: Your meditation instructor

📚 Fable - The Storyteller

  • Best for: Stories, creative content, engaging narratives
  • Personality: Expressive, dynamic, and captivating
  • Think of it as: Your favorite audiobook narrator

🎯 Onyx - The Authority

  • Best for: News, announcements, important information
  • Personality: Deep, confident, and commanding
  • Think of it as: Your news anchor

☀️ Nova - The Friendly Guide

  • Best for: Tutorials, customer service, welcoming content
  • Personality: Warm, approachable, and helpful
  • Think of it as: Your friendly assistant

✨ Shimmer - The Energetic Motivator

  • Best for: Marketing, motivational content, upbeat messages
  • Personality: Bright, enthusiastic, and energetic
  • Think of it as: Your marketing spokesperson

We’ll start by building basic text-to-speech functionality, and you can explore different voices to find the perfect match for your content.


🔧 Step 2: Adding Text-to-Speech to Your Backend


Let’s add text-to-speech to your existing backend using the same patterns you learned in Module 1. We’ll add new routes to handle text input and voice generation.

Building on your foundation: You already have a working Node.js server with OpenAI integration. We’re simply adding voice synthesis capabilities to what you’ve built.

Step 2A: Understanding Text-to-Speech State


Before writing code, let’s understand what data our text-to-speech system needs to manage:

// 🧠 TEXT-TO-SPEECH STATE CONCEPTS:
// 1. Text Input - The text content to convert to speech
// 2. Voice Selection - Which AI voice personality to use
// 3. Audio Settings - Speed, format, and quality preferences
// 4. Generated Audio - The resulting audio file and metadata
// 5. Error States - Invalid text, processing failures, file limits

Key text-to-speech concepts:

  • Voice Models: TTS-1 (fast) vs TTS-1-HD (high quality)
  • Voice Personalities: Six different AI voices with unique characteristics
  • Audio Formats: MP3, Opus, AAC, and FLAC options
  • Speed Control: Adjust speaking rate from 0.25x to 4x normal speed
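These limits are easy to enforce before you ever hit the API. Here is a small normalization helper, sketched to mirror the constraints above (the fallback defaults are assumptions matching the backend we build next):

```javascript
// Sketch: clamp and validate TTS options against the documented limits.
const VOICES = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"];
const FORMATS = ["mp3", "opus", "aac", "flac"];

const normalizeTTSOptions = ({ voice = "alloy", format = "mp3", speed = 1.0 } = {}) => ({
  voice: VOICES.includes(voice) ? voice : "alloy",   // fall back to a safe default
  format: FORMATS.includes(format) ? format : "mp3",
  speed: Math.max(0.25, Math.min(4.0, speed)),       // clamp to the 0.25x-4x range
});

console.log(normalizeTTSOptions({ speed: 9 }).speed);       // → 4
console.log(normalizeTTSOptions({ voice: "robot" }).voice); // → "alloy"
```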

Add these new routes and helpers to your existing index.js file, right after your file analysis routes:

// (place these imports at the top of index.js)
import fs from "fs";
import path from "path";

// 🔊 VOICE PROFILES: Available AI voices with personalities
const VOICE_PROFILES = {
  alloy: {
    name: "Alloy",
    description: "Professional and versatile",
    bestFor: "Business content, presentations"
  },
  echo: {
    name: "Echo",
    description: "Calm and soothing",
    bestFor: "Meditation, relaxation content"
  },
  fable: {
    name: "Fable",
    description: "Expressive storyteller",
    bestFor: "Stories, creative content"
  },
  onyx: {
    name: "Onyx",
    description: "Deep and authoritative",
    bestFor: "News, formal announcements"
  },
  nova: {
    name: "Nova",
    description: "Warm and friendly",
    bestFor: "Customer service, tutorials"
  },
  shimmer: {
    name: "Shimmer",
    description: "Bright and energetic",
    bestFor: "Marketing, upbeat content"
  }
};

// 🔧 HELPER FUNCTIONS: Audio processing utilities
const saveAudioToTemp = async (audioBuffer, format = "mp3") => {
  const tempDir = path.join(process.cwd(), "temp");

  // Create temp directory if it doesn't exist
  if (!fs.existsSync(tempDir)) {
    fs.mkdirSync(tempDir, { recursive: true });
  }

  // Create unique filename
  const filename = `tts-${Date.now()}.${format}`;
  const filepath = path.join(tempDir, filename);

  // Write audio file
  fs.writeFileSync(filepath, audioBuffer);

  // Auto-cleanup after 1 hour
  setTimeout(() => {
    try {
      if (fs.existsSync(filepath)) {
        fs.unlinkSync(filepath);
        console.log(`🧹 Cleaned up: ${filename}`);
      }
    } catch (error) {
      console.error("Error cleaning up audio file:", error);
    }
  }, 3600000); // 1 hour

  return { filepath, filename };
};

// 🔊 AI Text-to-Speech endpoint - add this to your existing server
app.post("/api/tts/generate", async (req, res) => {
  try {
    // 🛡️ VALIDATION: Check required inputs
    const {
      text,
      voice = "alloy",
      model = "tts-1",
      speed = 1.0,
      format = "mp3"
    } = req.body;

    if (!text || text.trim() === "") {
      return res.status(400).json({
        error: "Text is required",
        success: false
      });
    }

    if (text.length > 4096) {
      return res.status(400).json({
        error: "Text too long. Maximum 4096 characters allowed.",
        current_length: text.length,
        success: false
      });
    }

    console.log(`🔊 Generating speech: ${text.substring(0, 50)}... (${voice})`);

    // 🎙️ AI SPEECH GENERATION: Convert text to speech
    const response = await openai.audio.speech.create({
      model: model, // tts-1 (fast) or tts-1-hd (high quality)
      voice: voice, // AI voice personality
      input: text.trim(), // Text to convert
      response_format: format, // Audio format (mp3, opus, aac, flac)
      speed: Math.max(0.25, Math.min(4.0, speed)) // Speaking speed (0.25x to 4x)
    });

    // 💾 AUDIO PROCESSING: Save audio file
    const audioBuffer = Buffer.from(await response.arrayBuffer());
    const { filepath, filename } = await saveAudioToTemp(audioBuffer, format);

    // 📤 SUCCESS RESPONSE: Send audio info and download link
    res.json({
      success: true,
      audio: {
        filename: filename,
        format: format,
        size: audioBuffer.length,
        duration_estimate: Math.ceil(text.length / 14), // ~14 characters per second
        download_url: `/api/tts/download/${filename}`
      },
      generation: {
        voice: voice,
        voice_info: VOICE_PROFILES[voice],
        model: model,
        speed: speed,
        text_length: text.length
      },
      timestamp: new Date().toISOString()
    });
  } catch (error) {
    // 🚨 ERROR HANDLING: Handle TTS failures
    console.error("Text-to-speech error:", error);
    res.status(500).json({
      error: "Failed to generate speech",
      details: error.message,
      success: false
    });
  }
});

// 📥 Audio Download endpoint - serve generated audio files
app.get("/api/tts/download/:filename", (req, res) => {
  try {
    const { filename } = req.params;
    const filepath = path.join(process.cwd(), "temp", filename);

    // Security check - ensure filename is safe
    if (!filename.match(/^tts-\d+\.(mp3|opus|aac|flac)$/)) {
      return res.status(400).json({ error: "Invalid filename" });
    }

    // Check if file exists
    if (!fs.existsSync(filepath)) {
      return res.status(404).json({ error: "Audio file not found or expired" });
    }

    // Serve audio file
    const extension = path.extname(filename).substring(1);
    res.setHeader("Content-Type", `audio/${extension}`);
    res.setHeader("Content-Disposition", `attachment; filename="${filename}"`);

    const audioBuffer = fs.readFileSync(filepath);
    res.send(audioBuffer);
  } catch (error) {
    console.error("Audio download error:", error);
    res.status(500).json({
      error: "Failed to download audio",
      message: error.message
    });
  }
});

// 🎙️ Voice Information endpoint - get available voices
app.get("/api/tts/voices", (req, res) => {
  res.json({
    success: true,
    voices: VOICE_PROFILES,
    models: [
      {
        id: "tts-1",
        name: "TTS-1",
        description: "Fast, cost-effective synthesis",
        quality: "standard"
      },
      {
        id: "tts-1-hd",
        name: "TTS-1 HD",
        description: "High-definition audio quality",
        quality: "premium"
      }
    ],
    formats: ["mp3", "opus", "aac", "flac"],
    speed_range: { min: 0.25, max: 4.0, default: 1.0 },
    text_limit: 4096
  });
});

Function breakdown:

  1. Text validation - Ensure text exists and is within length limits
  2. Voice configuration - Set up AI voice, model, and audio settings
  3. Speech generation - Call OpenAI’s TTS API to create audio
  4. Audio storage - Save audio file temporarily for download
  5. Response formatting - Return audio info and download link
  6. File serving - Provide secure download endpoint for audio files
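Step 6's security check deserves a closer look: the download route only serves filenames matching the exact pattern the generator produces, which blocks path traversal. Extracted as a standalone sketch of the same regex used in the route above:

```javascript
// Sketch: the filename guard from the download endpoint.
// Only names shaped like the generator's output (tts-<timestamp>.<format>)
// pass, so requests like "../../etc/passwd" are rejected.
const isSafeTTSFilename = (name) => /^tts-\d+\.(mp3|opus|aac|flac)$/.test(name);

console.log(isSafeTTSFilename("tts-1712345678.mp3")); // → true
console.log(isSafeTTSFilename("../../etc/passwd"));   // → false
console.log(isSafeTTSFilename("tts-123.wav"));        // → false
```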

Add this middleware to handle text-to-speech specific errors:

// 🚨 TTS ERROR HANDLING: Handle text-to-speech errors
app.use((error, req, res, next) => {
  if (error.message && error.message.includes("Invalid voice")) {
    return res.status(400).json({
      error: "Invalid voice selected. Please choose from: alloy, echo, fable, onyx, nova, shimmer",
      success: false
    });
  }

  if (error.message && error.message.includes("text too long")) {
    return res.status(400).json({
      error: "Text exceeds maximum length of 4096 characters",
      success: false
    });
  }

  next(error);
});

Your backend now supports:

  • Text chat (existing functionality)
  • Streaming chat (existing functionality)
  • Image generation (existing functionality)
  • Audio transcription (existing functionality)
  • File analysis (existing functionality)
  • Text-to-speech (new functionality)


🔧 Step 3: Building the React Text-to-Speech Component

Now let's create a React component for text-to-speech using the same patterns from your existing components.

Step 3A: Creating the Text-to-Speech Component

Create a new file src/TextToSpeech.jsx:

import { useState, useRef, useEffect } from "react";
import { Volume2, Play, Pause, Download, Settings } from "lucide-react";

function TextToSpeech() {
  // 🧠 STATE: Text-to-speech data management
  const [text, setText] = useState(""); // Text to convert
  const [selectedVoice, setSelectedVoice] = useState("alloy"); // AI voice selection
  const [audioSettings, setAudioSettings] = useState({ // TTS settings
    model: "tts-1",
    speed: 1.0,
    format: "mp3"
  });
  const [isGenerating, setIsGenerating] = useState(false); // Processing status
  const [generatedAudio, setGeneratedAudio] = useState([]); // Generated audio list
  const [currentlyPlaying, setCurrentlyPlaying] = useState(null); // Audio playback state
  const [voices, setVoices] = useState({}); // Available voices
  const [error, setError] = useState(null); // Error messages
  const audioRef = useRef(null);

  // Load available voices on component mount
  useEffect(() => {
    fetchVoices();
  }, []);

  const fetchVoices = async () => {
    try {
      const response = await fetch("http://localhost:8000/api/tts/voices");
      const data = await response.json();
      if (data.success) {
        setVoices(data.voices);
      }
    } catch (error) {
      console.error("Failed to fetch voices:", error);
    }
  };

  // 🔧 FUNCTIONS: Text-to-speech logic engine
  // Main speech generation function
  const generateSpeech = async () => {
    // 🛡️ GUARDS: Prevent invalid generation
    if (!text.trim() || isGenerating) return;

    // 🔄 SETUP: Prepare for generation
    setIsGenerating(true);
    setError(null);

    try {
      // 📤 API CALL: Send to your backend
      const response = await fetch("http://localhost:8000/api/tts/generate", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          text: text.trim(),
          voice: selectedVoice,
          ...audioSettings
        })
      });

      const data = await response.json();

      if (!response.ok) {
        throw new Error(data.error || "Failed to generate speech");
      }

      // ✅ SUCCESS: Store generated audio
      const newAudio = {
        id: Date.now(),
        text: text.trim(),
        voice: selectedVoice,
        settings: audioSettings,
        audio: data.audio,
        generation: data.generation,
        timestamp: new Date().toISOString()
      };

      setGeneratedAudio(prev => [newAudio, ...prev]);
      setText(""); // Clear input after successful generation
    } catch (error) {
      // 🚨 ERROR HANDLING: Show user-friendly message
      console.error("Speech generation failed:", error);
      setError(error.message || "Something went wrong while generating speech");
    } finally {
      // 🧹 CLEANUP: Reset generation state
      setIsGenerating(false);
    }
  };

  // Audio playback function
  const playAudio = async (audioItem) => {
    try {
      if (currentlyPlaying?.id === audioItem.id) {
        // Pause current audio
        if (audioRef.current) {
          audioRef.current.pause();
          setCurrentlyPlaying(null);
        }
        return;
      }

      // Stop any currently playing audio
      if (audioRef.current) {
        audioRef.current.pause();
      }

      // Create new audio element
      const audio = new Audio(`http://localhost:8000${audioItem.audio.download_url}`);
      audioRef.current = audio;

      audio.onloadstart = () => setCurrentlyPlaying({ ...audioItem, status: "loading" });
      audio.oncanplay = () => setCurrentlyPlaying({ ...audioItem, status: "ready" });
      audio.onplay = () => setCurrentlyPlaying({ ...audioItem, status: "playing" });
      audio.onpause = () => setCurrentlyPlaying({ ...audioItem, status: "paused" });
      audio.onended = () => setCurrentlyPlaying(null);
      audio.onerror = () => {
        setCurrentlyPlaying(null);
        setError("Failed to play audio");
      };

      await audio.play();
    } catch (error) {
      console.error("Audio playback error:", error);
      setCurrentlyPlaying(null);
      setError("Failed to play audio");
    }
  };

  // Download audio function
  const downloadAudio = (audioItem) => {
    try {
      const link = document.createElement("a");
      link.href = `http://localhost:8000${audioItem.audio.download_url}`;
      link.download = `speech-${audioItem.id}.${audioItem.audio.format}`;
      document.body.appendChild(link);
      link.click();
      document.body.removeChild(link);
    } catch (error) {
      console.error("Download error:", error);
      setError("Failed to download audio");
    }
  };

  // Sample texts for quick testing
  const sampleTexts = [
    "Welcome to our application! I'm excited to help you with AI-powered text-to-speech.",
    "Once upon a time, in the world of artificial intelligence, voices came alive with just a few lines of code.",
    "This is a test of the emergency broadcast system. This is only a test.",
    "Take a deep breath and relax as you listen to this calming AI-generated voice.",
    "Breaking news: AI technology continues to amaze us with natural-sounding speech synthesis."
  ];

  // Utility functions
  const formatFileSize = (bytes) => {
    if (bytes === 0) return "0 Bytes";
    const k = 1024;
    const sizes = ["Bytes", "KB", "MB"];
    const i = Math.floor(Math.log(bytes) / Math.log(k));
    return parseFloat((bytes / Math.pow(k, i)).toFixed(2)) + " " + sizes[i];
  };

  const formatDuration = (seconds) => {
    const mins = Math.floor(seconds / 60);
    const secs = Math.floor(seconds % 60);
    return `${mins}:${secs.toString().padStart(2, "0")}`;
  };

  // 🎨 UI: Interface components
  return (
    <div className="min-h-screen bg-gradient-to-br from-orange-50 to-red-50 flex items-center justify-center p-4">
      <div className="bg-white rounded-2xl shadow-2xl w-full max-w-4xl flex flex-col overflow-hidden">
        {/* Header */}
        <div className="bg-gradient-to-r from-orange-600 to-red-600 text-white p-6">
          <div className="flex items-center space-x-3">
            <div className="w-10 h-10 bg-white bg-opacity-20 rounded-full flex items-center justify-center">
              <Volume2 className="w-5 h-5" />
            </div>
            <div>
              <h1 className="text-xl font-bold">🔊 AI Text-to-Speech</h1>
              <p className="text-orange-100 text-sm">Convert any text to natural speech!</p>
            </div>
          </div>
        </div>

        {/* Voice Settings Section */}
        <div className="p-6 border-b border-gray-200">
          <h3 className="font-semibold text-gray-900 mb-4 flex items-center">
            <Settings className="w-5 h-5 mr-2 text-orange-600" />
            Voice Settings
          </h3>
          <div className="grid grid-cols-1 md:grid-cols-4 gap-4">
            {/* Voice Selection */}
            <div>
              <label className="block text-sm font-medium text-gray-700 mb-2">Voice</label>
              <select
                value={selectedVoice}
                onChange={(e) => setSelectedVoice(e.target.value)}
                disabled={isGenerating}
                className="w-full px-3 py-2 border border-gray-300 rounded-lg focus:outline-none focus:ring-2 focus:ring-orange-500 disabled:bg-gray-100"
              >
                {Object.entries(voices).map(([key, voice]) => (
                  <option key={key} value={key}>
                    {voice.name} - {voice.description}
                  </option>
                ))}
              </select>
            </div>
            {/* Model Selection */}
            <div>
              <label className="block text-sm font-medium text-gray-700 mb-2">Quality</label>
              <select
                value={audioSettings.model}
                onChange={(e) => setAudioSettings(prev => ({ ...prev, model: e.target.value }))}
                disabled={isGenerating}
                className="w-full px-3 py-2 border border-gray-300 rounded-lg focus:outline-none focus:ring-2 focus:ring-orange-500 disabled:bg-gray-100"
              >
                <option value="tts-1">Standard (Fast)</option>
                <option value="tts-1-hd">HD (High Quality)</option>
              </select>
            </div>
            {/* Speed Control */}
            <div>
              <label className="block text-sm font-medium text-gray-700 mb-2">
                Speed ({audioSettings.speed}x)
              </label>
              <input
                type="range"
                min="0.25"
                max="4"
                step="0.05"
                value={audioSettings.speed}
                onChange={(e) => setAudioSettings(prev => ({ ...prev, speed: parseFloat(e.target.value) }))}
                disabled={isGenerating}
                className="w-full h-2 bg-gray-200 rounded-lg appearance-none cursor-pointer disabled:cursor-not-allowed"
              />
            </div>
            {/* Format Selection */}
            <div>
              <label className="block text-sm font-medium text-gray-700 mb-2">Format</label>
              <select
                value={audioSettings.format}
                onChange={(e) => setAudioSettings(prev => ({ ...prev, format: e.target.value }))}
                disabled={isGenerating}
                className="w-full px-3 py-2 border border-gray-300 rounded-lg focus:outline-none focus:ring-2 focus:ring-orange-500 disabled:bg-gray-100"
              >
                <option value="mp3">MP3</option>
                <option value="opus">Opus</option>
                <option value="aac">AAC</option>
                <option value="flac">FLAC</option>
              </select>
            </div>
          </div>
        </div>

        {/* Text Input Section */}
        <div className="p-6 border-b border-gray-200">
          <div className="mb-4">
            <div className="flex justify-between items-center mb-2">
              <label className="block text-sm font-medium text-gray-700">Text to Convert</label>
              <span className="text-sm text-gray-500">{text.length}/4096 characters</span>
            </div>
            <textarea
              value={text}
              onChange={(e) => setText(e.target.value)}
              placeholder="Enter the text you want to convert to speech..."
              className="w-full px-4 py-3 border border-gray-300 rounded-xl focus:outline-none focus:ring-2 focus:ring-orange-500 focus:border-transparent transition-all duration-200 resize-none"
              rows={4}
              maxLength={4096}
              disabled={isGenerating}
            />
          </div>
          {/* Sample Texts */}
          <div className="mb-4">
            <p className="text-sm text-gray-600 mb-2">Quick samples:</p>
            <div className="flex flex-wrap gap-2">
              {sampleTexts.map((sample, index) => (
                <button
                  key={index}
                  onClick={() => setText(sample)}
                  disabled={isGenerating}
                  className="px-3 py-1 text-sm bg-gray-100 hover:bg-orange-100 text-gray-700 hover:text-orange-700 rounded-full transition-colors duration-200 disabled:opacity-50 disabled:cursor-not-allowed"
                >
                  {sample.substring(0, 30)}...
                </button>
              ))}
            </div>
          </div>
          {/* Generate Button */}
          <div className="flex justify-center">
            <button
              onClick={generateSpeech}
              disabled={isGenerating || !text.trim()}
              className="px-8 py-3 bg-gradient-to-r from-orange-600 to-red-600 hover:from-orange-700 hover:to-red-700 disabled:from-gray-300 disabled:to-gray-300 text-white rounded-xl transition-all duration-200 flex items-center space-x-2 shadow-lg disabled:shadow-none"
            >
              {isGenerating ? (
                <>
                  <div className="w-4 h-4 border-2 border-white border-t-transparent rounded-full animate-spin"></div>
                  <span>Generating...</span>
                </>
              ) : (
                <>
                  <Volume2 className="w-4 h-4" />
                  <span>Generate Speech</span>
                </>
              )}
            </button>
          </div>
        </div>

        {/* Results Section */}
        <div className="flex-1 p-6">
          {/* Error Display */}
          {error && (
            <div className="bg-red-50 border border-red-200 rounded-lg p-4 mb-4">
              <p className="text-red-700">
                <strong>Error:</strong> {error}
              </p>
            </div>
          )}
          {/* Generated Audio List */}
          {generatedAudio.length === 0 ? (
            <div className="text-center py-12">
              <div className="w-16 h-16 bg-orange-100 rounded-2xl flex items-center justify-center mx-auto mb-4">
                <Volume2 className="w-8 h-8 text-orange-600" />
              </div>
              <h3 className="text-lg font-semibold text-gray-700 mb-2">
                No Audio Generated Yet
              </h3>
              <p className="text-gray-600 max-w-md mx-auto">
                Enter some text above and click "Generate Speech" to create your first AI voice.
              </p>
            </div>
          ) : (
            <div className="space-y-4">
              <h4 className="font-semibold text-gray-900 mb-4">
                Generated Audio ({generatedAudio.length})
              </h4>
              {generatedAudio.map((audioItem) => (
                <div key={audioItem.id} className="bg-gray-50 rounded-lg p-4 border border-gray-200">
                  <div className="flex items-start justify-between mb-3">
                    <div className="flex-1">
                      <div className="flex items-center space-x-2 mb-2">
                        <div className="p-1 bg-orange-100 rounded">
                          <Volume2 className="w-4 h-4 text-orange-600" />
                        </div>
                        <span className="font-medium text-gray-900 text-sm">
                          {voices[audioItem.voice]?.name || audioItem.voice}
                        </span>
                        <span className="text-xs text-gray-500">
                          {new Date(audioItem.timestamp).toLocaleTimeString()}
                        </span>
                      </div>
                      <p className="text-sm text-gray-700 mb-2 line-clamp-2">
                        {audioItem.text}
                      </p>
                      <div className="flex flex-wrap gap-1 text-xs">
                        <span className="px-2 py-1 bg-orange-100 text-orange-800 rounded-full">
                          {audioItem.settings.model}
                        </span>
                        <span className="px-2 py-1 bg-blue-100 text-blue-800 rounded-full">
                          {audioItem.settings.speed}x speed
                        </span>
                        <span className="px-2 py-1 bg-green-100 text-green-800 rounded-full">
                          {formatFileSize(audioItem.audio.size)}
                        </span>
                        <span className="px-2 py-1 bg-gray-100 text-gray-800 rounded-full">
                          ~{formatDuration(audioItem.audio.duration_estimate)}
                        </span>
                      </div>
                    </div>
                    <div className="flex items-center space-x-2">
                      <button
                        onClick={() => playAudio(audioItem)}
                        className="p-2 bg-orange-500 hover:bg-orange-600 text-white rounded-lg transition-colors duration-200"
                        title={currentlyPlaying?.id === audioItem.id ? "Pause" : "Play"}
                      >
                        {currentlyPlaying?.id === audioItem.id && currentlyPlaying?.status === "playing" ? (
                          <Pause className="w-4 h-4" />
                        ) : (
                          <Play className="w-4 h-4" />
                        )}
                      </button>
                      <button
                        onClick={() => downloadAudio(audioItem)}
                        className="p-2 bg-green-500 hover:bg-green-600 text-white rounded-lg transition-colors duration-200"
                        title="Download audio"
                      >
                        <Download className="w-4 h-4" />
                      </button>
                    </div>
                  </div>
                </div>
              ))}
            </div>
          )}
        </div>
      </div>
    </div>
  );
}

export default TextToSpeech;

Step 3B: Adding Text-to-Speech to Navigation


Update your src/App.jsx to include the new text-to-speech component:

import { useState } from "react";
import StreamingChat from "./StreamingChat";
import ImageGenerator from "./ImageGenerator";
import AudioTranscription from "./AudioTranscription";
import FileAnalysis from "./FileAnalysis";
import TextToSpeech from "./TextToSpeech";
import { MessageSquare, Image, Mic, Folder, Volume2 } from "lucide-react";

function App() {
  // 🧠 STATE: Navigation management
  const [currentView, setCurrentView] = useState("chat"); // 'chat', 'images', 'audio', 'files', or 'speech'

  // 🎨 UI: Main app with navigation
  return (
    <div className="min-h-screen bg-gray-100">
      {/* Navigation Header */}
      <nav className="bg-white shadow-sm border-b border-gray-200">
        <div className="max-w-6xl mx-auto px-4">
          <div className="flex items-center justify-between h-16">
            {/* Logo */}
            <div className="flex items-center space-x-3">
              <div className="w-8 h-8 bg-gradient-to-r from-blue-500 to-purple-600 rounded-lg flex items-center justify-center">
                <span className="text-white font-bold text-sm">AI</span>
              </div>
              <h1 className="text-xl font-bold text-gray-900">OpenAI Mastery</h1>
            </div>
            {/* Navigation Buttons */}
            <div className="flex space-x-2">
              <button
                onClick={() => setCurrentView("chat")}
                className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
                  currentView === "chat"
                    ? "bg-blue-100 text-blue-700 shadow-sm"
                    : "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
                }`}
              >
                <MessageSquare className="w-4 h-4" />
                <span>Chat</span>
              </button>
              <button
                onClick={() => setCurrentView("images")}
                className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
                  currentView === "images"
                    ? "bg-purple-100 text-purple-700 shadow-sm"
                    : "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
                }`}
              >
                <Image className="w-4 h-4" />
                <span>Images</span>
              </button>
              <button
                onClick={() => setCurrentView("audio")}
                className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
                  currentView === "audio"
                    ? "bg-blue-100 text-blue-700 shadow-sm"
                    : "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
                }`}
              >
                <Mic className="w-4 h-4" />
                <span>Audio</span>
              </button>
              <button
                onClick={() => setCurrentView("files")}
                className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
                  currentView === "files"
                    ? "bg-green-100 text-green-700 shadow-sm"
                    : "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
                }`}
              >
                <Folder className="w-4 h-4" />
                <span>Files</span>
              </button>
              <button
                onClick={() => setCurrentView("speech")}
                className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
                  currentView === "speech"
                    ? "bg-orange-100 text-orange-700 shadow-sm"
                    : "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
                }`}
              >
                <Volume2 className="w-4 h-4" />
                <span>Speech</span>
              </button>
            </div>
          </div>
        </div>
      </nav>
      {/* Main Content */}
      <main className="h-[calc(100vh-4rem)]">
        {currentView === "chat" && <StreamingChat />}
        {currentView === "images" && <ImageGenerator />}
        {currentView === "audio" && <AudioTranscription />}
        {currentView === "files" && <FileAnalysis />}
        {currentView === "speech" && <TextToSpeech />}
      </main>
    </div>
  );
}

export default App;

Let’s test your text-to-speech feature step by step to make sure everything works correctly.

First, verify your backend route works by testing it directly:

Test with a simple text:

curl -X POST http://localhost:8000/api/tts/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, this is a test of AI voice synthesis.", "voice": "alloy", "model": "tts-1"}'

Expected response:

{
  "success": true,
  "audio": {
    "filename": "tts-1234567890.mp3",
    "format": "mp3",
    "size": 15420,
    "duration_estimate": 3,
    "download_url": "/api/tts/download/tts-1234567890.mp3"
  },
  "generation": {
    "voice": "alloy",
    "voice_info": {
      "name": "Alloy",
      "description": "Professional and versatile"
    },
    "model": "tts-1",
    "speed": 1.0,
    "text_length": 44
  }
}

Start both servers:

Backend (in your backend folder):

npm run dev

Frontend (in your frontend folder):

npm run dev

Test the complete flow:

  1. Navigate to Speech → Click the “Speech” tab in navigation
  2. Select voice settings → Choose voice, quality, speed, and format
  3. Enter text → Type or select a sample text
  4. Generate speech → Click “Generate Speech” and see loading state
  5. Listen to audio → Click play button to hear the generated voice
  6. Download audio → Test downloading the speech file
  7. Try different voices → Test all six AI voices with the same text

Test all six voices with the same text to hear their personalities:

🎙️ Alloy: Professional and neutral
🌊 Echo: Calm and soothing
📚 Fable: Expressive storyteller
🎯 Onyx: Deep and authoritative
☀️ Nova: Warm and friendly
✨ Shimmer: Bright and energetic

Expected behavior:

  • Each voice has distinct personality and tone
  • Audio quality is clear and natural
  • Playback controls work smoothly
  • Download generates proper audio files

Congratulations! You’ve completed your comprehensive OpenAI mastery application with text-to-speech:

  • Extended your backend with voice synthesis and audio file management
  • Added React speech component following the same patterns as your other features
  • Implemented six AI voices with distinct personalities and use cases
  • Created flexible audio settings for quality, speed, and format control
  • Added playback functionality with play/pause controls
  • Maintained consistent design with your existing application

Your complete application now has:

  • Text chat with streaming responses
  • Image generation with DALL-E 3 and GPT-Image-1
  • Audio transcription with Whisper voice recognition
  • File analysis with intelligent document processing
  • Text-to-speech with six AI voice personalities
  • Unified navigation between all features
  • Professional UI with consistent TailwindCSS styling

🎉 You’ve built a complete OpenAI mastery application! Your users can now chat with AI, generate images, transcribe audio, analyze files, and hear AI responses spoken aloud - all in one seamless experience.

Your application demonstrates mastery of OpenAI’s entire ecosystem and provides a solid foundation for building even more advanced AI-powered applications. 🔊