# 🔊 AI Text-to-Speech Made Simple
Right now, you have chat, images, audio transcription, and file analysis working in your application. But what if your AI could also speak back to users?
Text-to-speech adds voice capabilities. Instead of just showing text responses, your AI can speak them aloud with natural-sounding voices, creating more engaging and accessible experiences for your users.
You’re about to learn exactly how to add voice synthesis to your existing application.
## 🧠 Step 1: Understanding AI Text-to-Speech

Before we write any code, let’s understand what AI text-to-speech actually means and why it’s useful for your applications.
### **What AI Text-to-Speech Actually Means**

AI text-to-speech is like having professional voice actors inside your application. Users can have any text read aloud with natural-sounding voices that have different personalities and speaking styles.
Real-world analogy: It’s like hiring a team of voice actors who can instantly read any text in their unique style. Instead of users reading everything themselves, they can listen while multitasking, or get an audio version for accessibility.
### **Why You Need This in Your Applications**

Think about all the times you or your users would benefit from audio:
- Accessibility for users with visual impairments or reading difficulties
- Multitasking - users can listen while doing other activities
- Learning styles - some people learn better by hearing information
- Content consumption - turn articles into podcasts instantly
- Hands-free interaction - perfect for mobile or automotive use
Without AI text-to-speech, you’d need to:
- Record everything manually (time-consuming and expensive)
- Use robotic computer voices (poor user experience)
- Miss accessibility opportunities (limiting your audience)
- Provide only visual content (excluding audio learners)
With AI text-to-speech, you just send any text and get natural speech instantly.
### **OpenAI’s Voice Options**

OpenAI provides six distinct AI voices, each with its own personality:
🎙️ Alloy - The Professional
- Best for: Business content, presentations, formal communication
- Personality: Neutral, clear, and professional
- Think of it as: Your corporate spokesperson
🌊 Echo - The Calming Voice
- Best for: Meditation, relaxation content, soothing narration
- Personality: Calm, gentle, and peaceful
- Think of it as: Your meditation instructor
📚 Fable - The Storyteller
- Best for: Stories, creative content, engaging narratives
- Personality: Expressive, dynamic, and captivating
- Think of it as: Your favorite audiobook narrator
🎯 Onyx - The Authority
- Best for: News, announcements, important information
- Personality: Deep, confident, and commanding
- Think of it as: Your news anchor
☀️ Nova - The Friendly Guide
- Best for: Tutorials, customer service, welcoming content
- Personality: Warm, approachable, and helpful
- Think of it as: Your friendly assistant
✨ Shimmer - The Energetic Motivator
- Best for: Marketing, motivational content, upbeat messages
- Personality: Bright, enthusiastic, and energetic
- Think of it as: Your marketing spokesperson
We’ll start by building basic text-to-speech functionality, and you can explore different voices to find the perfect match for your content.
## 🔧 Step 2: Adding Text-to-Speech to Your Backend

Let’s add text-to-speech to your existing backend using the same patterns you learned in Module 1. We’ll add new routes to handle text input and voice generation.
Building on your foundation: You already have a working Node.js server with OpenAI integration. We’re simply adding voice synthesis capabilities to what you’ve built.
### **Step 2A: Understanding Text-to-Speech State**

Before writing code, let’s understand what data our text-to-speech system needs to manage:

```js
// 🧠 TEXT-TO-SPEECH STATE CONCEPTS:
// 1. Text Input - The text content to convert to speech
// 2. Voice Selection - Which AI voice personality to use
// 3. Audio Settings - Speed, format, and quality preferences
// 4. Generated Audio - The resulting audio file and metadata
// 5. Error States - Invalid text, processing failures, file limits
```
Key text-to-speech concepts:
- Voice Models: TTS-1 (fast) vs TTS-1-HD (high quality)
- Voice Personalities: Six different AI voices with unique characteristics
- Audio Formats: MP3, Opus, AAC, and FLAC options
- Speed Control: Adjust speaking rate from 0.25x to 4x normal speed
### **Step 2B: Adding the Text-to-Speech Route**

Add this new endpoint to your existing `index.js` file, right after your file analysis routes:
```js
import fs from 'fs';
import path from 'path';

// 🔊 VOICE PROFILES: Available AI voices with personalities
const VOICE_PROFILES = {
  alloy: { name: "Alloy", description: "Professional and versatile", bestFor: "Business content, presentations" },
  echo: { name: "Echo", description: "Calm and soothing", bestFor: "Meditation, relaxation content" },
  fable: { name: "Fable", description: "Expressive storyteller", bestFor: "Stories, creative content" },
  onyx: { name: "Onyx", description: "Deep and authoritative", bestFor: "News, formal announcements" },
  nova: { name: "Nova", description: "Warm and friendly", bestFor: "Customer service, tutorials" },
  shimmer: { name: "Shimmer", description: "Bright and energetic", bestFor: "Marketing, upbeat content" }
};

// 🔧 HELPER FUNCTIONS: Audio processing utilities
const saveAudioToTemp = async (audioBuffer, format = 'mp3') => {
  const tempDir = path.join(process.cwd(), "temp");

  // Create temp directory if it doesn't exist
  if (!fs.existsSync(tempDir)) {
    fs.mkdirSync(tempDir, { recursive: true });
  }

  // Create unique filename
  const filename = `tts-${Date.now()}.${format}`;
  const filepath = path.join(tempDir, filename);

  // Write audio file
  fs.writeFileSync(filepath, audioBuffer);

  // Auto-cleanup after 1 hour
  setTimeout(() => {
    try {
      if (fs.existsSync(filepath)) {
        fs.unlinkSync(filepath);
        console.log(`🧹 Cleaned up: ${filename}`);
      }
    } catch (error) {
      console.error("Error cleaning up audio file:", error);
    }
  }, 3600000); // 1 hour

  return { filepath, filename };
};

// 🔊 AI Text-to-Speech endpoint - add this to your existing server
app.post("/api/tts/generate", async (req, res) => {
  try {
    // 🛡️ VALIDATION: Check required inputs
    const { text, voice = "alloy", model = "tts-1", speed = 1.0, format = "mp3" } = req.body;

    if (!text || text.trim() === "") {
      return res.status(400).json({ error: "Text is required", success: false });
    }

    if (text.length > 4096) {
      return res.status(400).json({
        error: "Text too long. Maximum 4096 characters allowed.",
        current_length: text.length,
        success: false
      });
    }

    console.log(`🔊 Generating speech: ${text.substring(0, 50)}... (${voice})`);

    // 🎙️ AI SPEECH GENERATION: Convert text to speech
    const response = await openai.audio.speech.create({
      model: model,                               // tts-1 (fast) or tts-1-hd (high quality)
      voice: voice,                               // AI voice personality
      input: text.trim(),                         // Text to convert
      response_format: format,                    // Audio format (mp3, opus, aac, flac)
      speed: Math.max(0.25, Math.min(4.0, speed)) // Speaking speed (0.25x to 4x)
    });

    // 💾 AUDIO PROCESSING: Save audio file
    const audioBuffer = Buffer.from(await response.arrayBuffer());
    const { filepath, filename } = await saveAudioToTemp(audioBuffer, format);

    // 📤 SUCCESS RESPONSE: Send audio info and download link
    res.json({
      success: true,
      audio: {
        filename: filename,
        format: format,
        size: audioBuffer.length,
        duration_estimate: Math.ceil(text.length / 14), // ~14 characters per second
        download_url: `/api/tts/download/${filename}`
      },
      generation: {
        voice: voice,
        voice_info: VOICE_PROFILES[voice],
        model: model,
        speed: speed,
        text_length: text.length
      },
      timestamp: new Date().toISOString()
    });

  } catch (error) {
    // 🚨 ERROR HANDLING: Handle TTS failures
    console.error("Text-to-speech error:", error);
    res.status(500).json({
      error: "Failed to generate speech",
      details: error.message,
      success: false
    });
  }
});

// 📥 Audio Download endpoint - serve generated audio files
app.get("/api/tts/download/:filename", (req, res) => {
  try {
    const { filename } = req.params;

    // Security check - only accept filenames this server generated
    if (!filename.match(/^tts-\d+\.(mp3|opus|aac|flac)$/)) {
      return res.status(400).json({ error: "Invalid filename" });
    }

    const filepath = path.join(process.cwd(), "temp", filename);

    // Check if file exists
    if (!fs.existsSync(filepath)) {
      return res.status(404).json({ error: "Audio file not found or expired" });
    }

    // Serve audio file
    const extension = path.extname(filename).substring(1);
    res.setHeader('Content-Type', `audio/${extension}`);
    res.setHeader('Content-Disposition', `attachment; filename="${filename}"`);

    const audioBuffer = fs.readFileSync(filepath);
    res.send(audioBuffer);

  } catch (error) {
    console.error("Audio download error:", error);
    res.status(500).json({ error: "Failed to download audio", message: error.message });
  }
});

// 🎙️ Voice Information endpoint - get available voices
app.get("/api/tts/voices", (req, res) => {
  res.json({
    success: true,
    voices: VOICE_PROFILES,
    models: [
      { id: "tts-1", name: "TTS-1", description: "Fast, cost-effective synthesis", quality: "standard" },
      { id: "tts-1-hd", name: "TTS-1 HD", description: "High-definition audio quality", quality: "premium" }
    ],
    formats: ["mp3", "opus", "aac", "flac"],
    speed_range: { min: 0.25, max: 4.0, default: 1.0 },
    text_limit: 4096
  });
});
```
Function breakdown:
- Text validation - Ensure text exists and is within length limits
- Voice configuration - Set up AI voice, model, and audio settings
- Speech generation - Call OpenAI’s TTS API to create audio
- Audio storage - Save audio file temporarily for download
- Response formatting - Return audio info and download link
- File serving - Provide secure download endpoint for audio files
### **Step 2C: Adding Error Handling for TTS**

Add this middleware to handle text-to-speech specific errors:

```js
// 🚨 TTS ERROR HANDLING: Handle text-to-speech errors
app.use((error, req, res, next) => {
  if (error.message && error.message.includes('Invalid voice')) {
    return res.status(400).json({
      error: "Invalid voice selected. Please choose from: alloy, echo, fable, onyx, nova, shimmer",
      success: false
    });
  }

  if (error.message && error.message.includes('text too long')) {
    return res.status(400).json({
      error: "Text exceeds maximum length of 4096 characters",
      success: false
    });
  }

  next(error);
});
```
Your backend now supports:
- Text chat (existing functionality)
- Streaming chat (existing functionality)
- Image generation (existing functionality)
- Audio transcription (existing functionality)
- File analysis (existing functionality)
- Text-to-speech (new functionality)
---
## 🔧 Step 3: Building the React Text-to-Speech Component
Now let's create a React component for text-to-speech using the same patterns from your existing components.
### **Step 3A: Creating the Text-to-Speech Component**
Create a new file `src/TextToSpeech.jsx`:
```jsx
import { useState, useRef, useEffect } from "react";
import { Volume2, Play, Pause, Download, Settings } from "lucide-react";

function TextToSpeech() {
  // 🧠 STATE: Text-to-speech data management
  const [text, setText] = useState("");                           // Text to convert
  const [selectedVoice, setSelectedVoice] = useState("alloy");    // AI voice selection
  const [audioSettings, setAudioSettings] = useState({            // TTS settings
    model: "tts-1",
    speed: 1.0,
    format: "mp3"
  });
  const [isGenerating, setIsGenerating] = useState(false);        // Processing status
  const [generatedAudio, setGeneratedAudio] = useState([]);       // Generated audio list
  const [currentlyPlaying, setCurrentlyPlaying] = useState(null); // Audio playback state
  const [voices, setVoices] = useState({});                       // Available voices
  const [error, setError] = useState(null);                       // Error messages

  const audioRef = useRef(null);

  // Load available voices on component mount
  useEffect(() => {
    fetchVoices();
  }, []);

  const fetchVoices = async () => {
    try {
      const response = await fetch("http://localhost:8000/api/tts/voices");
      const data = await response.json();
      if (data.success) {
        setVoices(data.voices);
      }
    } catch (error) {
      console.error('Failed to fetch voices:', error);
    }
  };

  // 🔧 FUNCTIONS: Text-to-speech logic engine

  // Main speech generation function
  const generateSpeech = async () => {
    // 🛡️ GUARDS: Prevent invalid generation
    if (!text.trim() || isGenerating) return;

    // 🔄 SETUP: Prepare for generation
    setIsGenerating(true);
    setError(null);

    try {
      // 📤 API CALL: Send to your backend
      const response = await fetch("http://localhost:8000/api/tts/generate", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          text: text.trim(),
          voice: selectedVoice,
          ...audioSettings
        })
      });

      const data = await response.json();

      if (!response.ok) {
        throw new Error(data.error || 'Failed to generate speech');
      }

      // ✅ SUCCESS: Store generated audio
      const newAudio = {
        id: Date.now(),
        text: text.trim(),
        voice: selectedVoice,
        settings: audioSettings,
        audio: data.audio,
        generation: data.generation,
        timestamp: new Date().toISOString()
      };

      setGeneratedAudio(prev => [newAudio, ...prev]);
      setText(""); // Clear input after successful generation

    } catch (error) {
      // 🚨 ERROR HANDLING: Show user-friendly message
      console.error('Speech generation failed:', error);
      setError(error.message || 'Something went wrong while generating speech');
    } finally {
      // 🧹 CLEANUP: Reset generation state
      setIsGenerating(false);
    }
  };

  // Audio playback function
  const playAudio = async (audioItem) => {
    try {
      if (currentlyPlaying?.id === audioItem.id) {
        // Pause current audio
        if (audioRef.current) {
          audioRef.current.pause();
          setCurrentlyPlaying(null);
        }
        return;
      }

      // Stop any currently playing audio
      if (audioRef.current) {
        audioRef.current.pause();
      }

      // Create new audio element
      const audio = new Audio(`http://localhost:8000${audioItem.audio.download_url}`);
      audioRef.current = audio;

      audio.onloadstart = () => setCurrentlyPlaying({ ...audioItem, status: 'loading' });
      audio.oncanplay = () => setCurrentlyPlaying({ ...audioItem, status: 'ready' });
      audio.onplay = () => setCurrentlyPlaying({ ...audioItem, status: 'playing' });
      audio.onpause = () => setCurrentlyPlaying({ ...audioItem, status: 'paused' });
      audio.onended = () => setCurrentlyPlaying(null);
      audio.onerror = () => {
        setCurrentlyPlaying(null);
        setError('Failed to play audio');
      };

      await audio.play();
    } catch (error) {
      console.error('Audio playback error:', error);
      setCurrentlyPlaying(null);
      setError('Failed to play audio');
    }
  };

  // Download audio function
  const downloadAudio = (audioItem) => {
    try {
      const link = document.createElement('a');
      link.href = `http://localhost:8000${audioItem.audio.download_url}`;
      link.download = `speech-${audioItem.id}.${audioItem.audio.format}`;
      document.body.appendChild(link);
      link.click();
      document.body.removeChild(link);
    } catch (error) {
      console.error('Download error:', error);
      setError('Failed to download audio');
    }
  };

  // Sample texts for quick testing
  const sampleTexts = [
    "Welcome to our application! I'm excited to help you with AI-powered text-to-speech.",
    "Once upon a time, in the world of artificial intelligence, voices came alive with just a few lines of code.",
    "This is a test of the emergency broadcast system. This is only a test.",
    "Take a deep breath and relax as you listen to this calming AI-generated voice.",
    "Breaking news: AI technology continues to amaze us with natural-sounding speech synthesis."
  ];

  // Utility functions
  const formatFileSize = (bytes) => {
    if (bytes === 0) return '0 Bytes';
    const k = 1024;
    const sizes = ['Bytes', 'KB', 'MB'];
    const i = Math.floor(Math.log(bytes) / Math.log(k));
    return parseFloat((bytes / Math.pow(k, i)).toFixed(2)) + ' ' + sizes[i];
  };

  const formatDuration = (seconds) => {
    const mins = Math.floor(seconds / 60);
    const secs = Math.floor(seconds % 60);
    return `${mins}:${secs.toString().padStart(2, '0')}`;
  };

  // 🎨 UI: Interface components
  return (
    <div className="min-h-screen bg-gradient-to-br from-orange-50 to-red-50 flex items-center justify-center p-4">
      <div className="bg-white rounded-2xl shadow-2xl w-full max-w-4xl flex flex-col overflow-hidden">

        {/* Header */}
        <div className="bg-gradient-to-r from-orange-600 to-red-600 text-white p-6">
          <div className="flex items-center space-x-3">
            <div className="w-10 h-10 bg-white bg-opacity-20 rounded-full flex items-center justify-center">
              <Volume2 className="w-5 h-5" />
            </div>
            <div>
              <h1 className="text-xl font-bold">🔊 AI Text-to-Speech</h1>
              <p className="text-orange-100 text-sm">Convert any text to natural speech!</p>
            </div>
          </div>
        </div>

        {/* Voice Settings Section */}
        <div className="p-6 border-b border-gray-200">
          <h3 className="font-semibold text-gray-900 mb-4 flex items-center">
            <Settings className="w-5 h-5 mr-2 text-orange-600" />
            Voice Settings
          </h3>

          <div className="grid grid-cols-1 md:grid-cols-4 gap-4">
            {/* Voice Selection */}
            <div>
              <label className="block text-sm font-medium text-gray-700 mb-2">Voice</label>
              <select
                value={selectedVoice}
                onChange={(e) => setSelectedVoice(e.target.value)}
                disabled={isGenerating}
                className="w-full px-3 py-2 border border-gray-300 rounded-lg focus:outline-none focus:ring-2 focus:ring-orange-500 disabled:bg-gray-100"
              >
                {Object.entries(voices).map(([key, voice]) => (
                  <option key={key} value={key}>
                    {voice.name} - {voice.description}
                  </option>
                ))}
              </select>
            </div>

            {/* Model Selection */}
            <div>
              <label className="block text-sm font-medium text-gray-700 mb-2">Quality</label>
              <select
                value={audioSettings.model}
                onChange={(e) => setAudioSettings(prev => ({ ...prev, model: e.target.value }))}
                disabled={isGenerating}
                className="w-full px-3 py-2 border border-gray-300 rounded-lg focus:outline-none focus:ring-2 focus:ring-orange-500 disabled:bg-gray-100"
              >
                <option value="tts-1">Standard (Fast)</option>
                <option value="tts-1-hd">HD (High Quality)</option>
              </select>
            </div>

            {/* Speed Control */}
            <div>
              <label className="block text-sm font-medium text-gray-700 mb-2">
                Speed ({audioSettings.speed}x)
              </label>
              <input
                type="range"
                min="0.25"
                max="4"
                step="0.05"
                value={audioSettings.speed}
                onChange={(e) => setAudioSettings(prev => ({ ...prev, speed: parseFloat(e.target.value) }))}
                disabled={isGenerating}
                className="w-full h-2 bg-gray-200 rounded-lg appearance-none cursor-pointer disabled:cursor-not-allowed"
              />
            </div>

            {/* Format Selection */}
            <div>
              <label className="block text-sm font-medium text-gray-700 mb-2">Format</label>
              <select
                value={audioSettings.format}
                onChange={(e) => setAudioSettings(prev => ({ ...prev, format: e.target.value }))}
                disabled={isGenerating}
                className="w-full px-3 py-2 border border-gray-300 rounded-lg focus:outline-none focus:ring-2 focus:ring-orange-500 disabled:bg-gray-100"
              >
                <option value="mp3">MP3</option>
                <option value="opus">Opus</option>
                <option value="aac">AAC</option>
                <option value="flac">FLAC</option>
              </select>
            </div>
          </div>
        </div>

        {/* Text Input Section */}
        <div className="p-6 border-b border-gray-200">
          <div className="mb-4">
            <div className="flex justify-between items-center mb-2">
              <label className="block text-sm font-medium text-gray-700">Text to Convert</label>
              <span className="text-sm text-gray-500">{text.length}/4096 characters</span>
            </div>
            <textarea
              value={text}
              onChange={(e) => setText(e.target.value)}
              placeholder="Enter the text you want to convert to speech..."
              className="w-full px-4 py-3 border border-gray-300 rounded-xl focus:outline-none focus:ring-2 focus:ring-orange-500 focus:border-transparent transition-all duration-200 resize-none"
              rows={4}
              maxLength={4096}
              disabled={isGenerating}
            />
          </div>

          {/* Sample Texts */}
          <div className="mb-4">
            <p className="text-sm text-gray-600 mb-2">Quick samples:</p>
            <div className="flex flex-wrap gap-2">
              {sampleTexts.map((sample, index) => (
                <button
                  key={index}
                  onClick={() => setText(sample)}
                  disabled={isGenerating}
                  className="px-3 py-1 text-sm bg-gray-100 hover:bg-orange-100 text-gray-700 hover:text-orange-700 rounded-full transition-colors duration-200 disabled:opacity-50 disabled:cursor-not-allowed"
                >
                  {sample.substring(0, 30)}...
                </button>
              ))}
            </div>
          </div>

          {/* Generate Button */}
          <div className="flex justify-center">
            <button
              onClick={generateSpeech}
              disabled={isGenerating || !text.trim()}
              className="px-8 py-3 bg-gradient-to-r from-orange-600 to-red-600 hover:from-orange-700 hover:to-red-700 disabled:from-gray-300 disabled:to-gray-300 text-white rounded-xl transition-all duration-200 flex items-center space-x-2 shadow-lg disabled:shadow-none"
            >
              {isGenerating ? (
                <>
                  <div className="w-4 h-4 border-2 border-white border-t-transparent rounded-full animate-spin"></div>
                  <span>Generating...</span>
                </>
              ) : (
                <>
                  <Volume2 className="w-4 h-4" />
                  <span>Generate Speech</span>
                </>
              )}
            </button>
          </div>
        </div>

        {/* Results Section */}
        <div className="flex-1 p-6">
          {/* Error Display */}
          {error && (
            <div className="bg-red-50 border border-red-200 rounded-lg p-4 mb-4">
              <p className="text-red-700">
                <strong>Error:</strong> {error}
              </p>
            </div>
          )}

          {/* Generated Audio List */}
          {generatedAudio.length === 0 ? (
            <div className="text-center py-12">
              <div className="w-16 h-16 bg-orange-100 rounded-2xl flex items-center justify-center mx-auto mb-4">
                <Volume2 className="w-8 h-8 text-orange-600" />
              </div>
              <h3 className="text-lg font-semibold text-gray-700 mb-2">
                No Audio Generated Yet
              </h3>
              <p className="text-gray-600 max-w-md mx-auto">
                Enter some text above and click "Generate Speech" to create your first AI voice.
              </p>
            </div>
          ) : (
            <div className="space-y-4">
              <h4 className="font-semibold text-gray-900 mb-4">
                Generated Audio ({generatedAudio.length})
              </h4>

              {generatedAudio.map((audioItem) => (
                <div key={audioItem.id} className="bg-gray-50 rounded-lg p-4 border border-gray-200">
                  <div className="flex items-start justify-between mb-3">
                    <div className="flex-1">
                      <div className="flex items-center space-x-2 mb-2">
                        <div className="p-1 bg-orange-100 rounded">
                          <Volume2 className="w-4 h-4 text-orange-600" />
                        </div>
                        <span className="font-medium text-gray-900 text-sm">
                          {voices[audioItem.voice]?.name || audioItem.voice}
                        </span>
                        <span className="text-xs text-gray-500">
                          {new Date(audioItem.timestamp).toLocaleTimeString()}
                        </span>
                      </div>

                      <p className="text-sm text-gray-700 mb-2 line-clamp-2">
                        {audioItem.text}
                      </p>

                      <div className="flex flex-wrap gap-1 text-xs">
                        <span className="px-2 py-1 bg-orange-100 text-orange-800 rounded-full">
                          {audioItem.settings.model}
                        </span>
                        <span className="px-2 py-1 bg-blue-100 text-blue-800 rounded-full">
                          {audioItem.settings.speed}x speed
                        </span>
                        <span className="px-2 py-1 bg-green-100 text-green-800 rounded-full">
                          {formatFileSize(audioItem.audio.size)}
                        </span>
                        <span className="px-2 py-1 bg-gray-100 text-gray-800 rounded-full">
                          ~{formatDuration(audioItem.audio.duration_estimate)}
                        </span>
                      </div>
                    </div>

                    <div className="flex items-center space-x-2">
                      <button
                        onClick={() => playAudio(audioItem)}
                        className="p-2 bg-orange-500 hover:bg-orange-600 text-white rounded-lg transition-colors duration-200"
                        title={currentlyPlaying?.id === audioItem.id ? "Pause" : "Play"}
                      >
                        {currentlyPlaying?.id === audioItem.id && currentlyPlaying?.status === 'playing' ? (
                          <Pause className="w-4 h-4" />
                        ) : (
                          <Play className="w-4 h-4" />
                        )}
                      </button>

                      <button
                        onClick={() => downloadAudio(audioItem)}
                        className="p-2 bg-green-500 hover:bg-green-600 text-white rounded-lg transition-colors duration-200"
                        title="Download audio"
                      >
                        <Download className="w-4 h-4" />
                      </button>
                    </div>
                  </div>
                </div>
              ))}
            </div>
          )}
        </div>
      </div>
    </div>
  );
}

export default TextToSpeech;
```
### **Step 3B: Adding Text-to-Speech to Navigation**

Update your `src/App.jsx` to include the new text-to-speech component:
```jsx
import { useState } from "react";
import StreamingChat from "./StreamingChat";
import ImageGenerator from "./ImageGenerator";
import AudioTranscription from "./AudioTranscription";
import FileAnalysis from "./FileAnalysis";
import TextToSpeech from "./TextToSpeech";
import { MessageSquare, Image, Mic, Folder, Volume2 } from "lucide-react";

function App() {
  // 🧠 STATE: Navigation management
  const [currentView, setCurrentView] = useState("chat"); // 'chat', 'images', 'audio', 'files', or 'speech'

  // 🎨 UI: Main app with navigation
  return (
    <div className="min-h-screen bg-gray-100">
      {/* Navigation Header */}
      <nav className="bg-white shadow-sm border-b border-gray-200">
        <div className="max-w-6xl mx-auto px-4">
          <div className="flex items-center justify-between h-16">
            {/* Logo */}
            <div className="flex items-center space-x-3">
              <div className="w-8 h-8 bg-gradient-to-r from-blue-500 to-purple-600 rounded-lg flex items-center justify-center">
                <span className="text-white font-bold text-sm">AI</span>
              </div>
              <h1 className="text-xl font-bold text-gray-900">OpenAI Mastery</h1>
            </div>

            {/* Navigation Buttons */}
            <div className="flex space-x-2">
              <button
                onClick={() => setCurrentView("chat")}
                className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
                  currentView === "chat"
                    ? "bg-blue-100 text-blue-700 shadow-sm"
                    : "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
                }`}
              >
                <MessageSquare className="w-4 h-4" />
                <span>Chat</span>
              </button>

              <button
                onClick={() => setCurrentView("images")}
                className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
                  currentView === "images"
                    ? "bg-purple-100 text-purple-700 shadow-sm"
                    : "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
                }`}
              >
                <Image className="w-4 h-4" />
                <span>Images</span>
              </button>

              <button
                onClick={() => setCurrentView("audio")}
                className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
                  currentView === "audio"
                    ? "bg-blue-100 text-blue-700 shadow-sm"
                    : "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
                }`}
              >
                <Mic className="w-4 h-4" />
                <span>Audio</span>
              </button>

              <button
                onClick={() => setCurrentView("files")}
                className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
                  currentView === "files"
                    ? "bg-green-100 text-green-700 shadow-sm"
                    : "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
                }`}
              >
                <Folder className="w-4 h-4" />
                <span>Files</span>
              </button>

              <button
                onClick={() => setCurrentView("speech")}
                className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
                  currentView === "speech"
                    ? "bg-orange-100 text-orange-700 shadow-sm"
                    : "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
                }`}
              >
                <Volume2 className="w-4 h-4" />
                <span>Speech</span>
              </button>
            </div>
          </div>
        </div>
      </nav>

      {/* Main Content */}
      <main className="h-[calc(100vh-4rem)]">
        {currentView === "chat" && <StreamingChat />}
        {currentView === "images" && <ImageGenerator />}
        {currentView === "audio" && <AudioTranscription />}
        {currentView === "files" && <FileAnalysis />}
        {currentView === "speech" && <TextToSpeech />}
      </main>
    </div>
  );
}

export default App;
```
## 🧪 Testing Your Text-to-Speech

Let’s test your text-to-speech feature step by step to make sure everything works correctly.
### **Step 1: Backend Route Test**

First, verify your backend route works by testing it directly.

Test with a simple text:

```bash
curl -X POST http://localhost:8000/api/tts/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, this is a test of AI voice synthesis.", "voice": "alloy", "model": "tts-1"}'
```
Expected response:

```json
{
  "success": true,
  "audio": {
    "filename": "tts-1234567890.mp3",
    "format": "mp3",
    "size": 15420,
    "duration_estimate": 4,
    "download_url": "/api/tts/download/tts-1234567890.mp3"
  },
  "generation": {
    "voice": "alloy",
    "voice_info": {
      "name": "Alloy",
      "description": "Professional and versatile",
      "bestFor": "Business content, presentations"
    },
    "model": "tts-1",
    "speed": 1.0,
    "text_length": 44
  }
}
```

(The test text is 44 characters, so the duration estimate is `Math.ceil(44 / 14) = 4` seconds.)
### **Step 2: Full Application Test**

Start both servers:

**Backend** (in your backend folder):

```bash
npm run dev
```

**Frontend** (in your frontend folder):

```bash
npm run dev
```
Test the complete flow:
- Navigate to Speech → Click the “Speech” tab in navigation
- Select voice settings → Choose voice, quality, speed, and format
- Enter text → Type or select a sample text
- Generate speech → Click “Generate Speech” and see loading state
- Listen to audio → Click play button to hear the generated voice
- Download audio → Test downloading the speech file
- Try different voices → Test all six AI voices with the same text
### **Step 3: Voice Comparison Test**

Test all six voices with the same text to hear their personalities:

- 🎙️ **Alloy**: Professional and neutral
- 🌊 **Echo**: Calm and soothing
- 📚 **Fable**: Expressive storyteller
- 🎯 **Onyx**: Deep and authoritative
- ☀️ **Nova**: Warm and friendly
- ✨ **Shimmer**: Bright and energetic
Expected behavior:
- Each voice has distinct personality and tone
- Audio quality is clear and natural
- Playback controls work smoothly
- Download generates proper audio files
## ✅ What You Built

Congratulations! You’ve completed your comprehensive OpenAI mastery application with text-to-speech:
- ✅ Extended your backend with voice synthesis and audio file management
- ✅ Added React speech component following the same patterns as your other features
- ✅ Implemented six AI voices with distinct personalities and use cases
- ✅ Created flexible audio settings for quality, speed, and format control
- ✅ Added playback functionality with play/pause controls
- ✅ Maintained consistent design with your existing application
Your complete application now has:
- Text chat with streaming responses
- Image generation with DALL-E 3 and GPT-Image-1
- Audio transcription with Whisper voice recognition
- File analysis with intelligent document processing
- Text-to-speech with six AI voice personalities
- Unified navigation between all features
- Professional UI with consistent TailwindCSS styling
🎉 You’ve built a complete OpenAI mastery application! Your users can now chat with AI, generate images, transcribe audio, analyze files, and hear AI responses spoken aloud - all in one seamless experience.
Your application demonstrates mastery of OpenAI’s entire ecosystem and provides a solid foundation for building even more advanced AI-powered applications. 🔊