
🎤 AI Audio Transcription Made Simple

Right now, you know how to build chat applications with text and generate images with AI. But what if your AI could also understand and process audio?

Audio transcription opens up voice capabilities. Instead of just typing messages, users can speak to your AI, record voice notes, transcribe meetings, and create voice-powered applications.

You’re about to learn exactly how to add voice processing to your existing application.


🧠 Step 1: Understanding AI Audio Transcription


Before we write any code, let’s understand what AI audio transcription actually means and why it’s useful for your applications.

What AI Audio Transcription Actually Means


AI audio transcription is like having a professional transcriptionist inside your application. Users upload audio files or record their voice, and the AI converts speech to text with incredible accuracy in seconds.

Real-world analogy: It’s like hiring a stenographer who works instantly. Instead of manually typing out recordings or paying for transcription services, you upload an audio file and get accurate text immediately.

Think about all the times you or your users need to convert audio to text:

  • Meeting recordings need to be converted to searchable notes
  • Voice messages need to be transcribed for accessibility
  • Podcast content needs text versions for SEO and accessibility
  • Voice commands need to be processed by your application
  • Language learners need pronunciation feedback and practice

Without AI audio transcription, you’d need to:

  1. Manually type out recordings (time-consuming)
  2. Pay expensive transcription services (costly)
  3. Use basic speech recognition (inaccurate)
  4. Miss accessibility opportunities (limiting)

With AI audio transcription, you just upload audio and get accurate text instantly.

OpenAI provides one incredibly powerful audio model:

🎤 Whisper-1 - The Speech Recognition Expert

  • Best for: Converting any speech to text with high accuracy
  • Strengths: Multi-language support, noise handling, natural conversation understanding
  • Supports: 50+ languages, various audio formats (MP3, WAV, M4A, etc.)
  • Think of it as: Your professional transcriptionist who never gets tired

Whisper is beginner-friendly: you upload an audio file and it returns accurate text, with timestamps available when you request the verbose JSON format.
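Under the hood, a Whisper request is just an options object handed to the SDK. As a rough sketch of the parameters involved (`buildTranscriptionOptions` is our own illustrative helper, not part of the OpenAI SDK), the shape looks like this:

```javascript
// Hypothetical helper: assemble the options object for a Whisper call.
// Only "whisper-1" is fixed; language and format come from user input.
function buildTranscriptionOptions(file, { language = null, responseFormat = "text" } = {}) {
  return {
    file,                             // a readable stream of the audio file
    model: "whisper-1",
    response_format: responseFormat,  // "text" or "verbose_json"
    ...(language && { language })     // omit the key entirely for auto-detect
  };
}

// The object would then be passed straight to the SDK, e.g.:
// const result = await openai.audio.transcriptions.create(
//   buildTranscriptionOptions(stream, { language: "en" })
// );

console.log(buildTranscriptionOptions("stream", { language: "es" }));
```

Leaving `language` out entirely (rather than sending an empty value) is what tells Whisper to auto-detect the spoken language.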


🔧 Step 2: Adding Audio Transcription to Your Backend


Let’s add audio transcription to your existing backend using the same patterns you learned in Module 1. We’ll add new routes to handle audio file uploads and processing.

Building on your foundation: You already have a working Node.js server with OpenAI integration. We’re simply adding audio capabilities to what you’ve built.

Step 2A: Understanding Audio Processing State


Before writing code, let’s understand what data our audio transcription system needs to manage:

// 🧠 AUDIO TRANSCRIPTION STATE CONCEPTS:
// 1. Audio File - The uploaded or recorded audio data
// 2. File Metadata - Original filename, size, format information
// 3. Transcription Settings - Language, response format, temperature
// 4. Processing Results - Text, timestamps, confidence scores
// 5. Error States - Invalid files, processing failures, file size limits

Key audio transcription concepts:

  • File Handling: Temporary storage and cleanup of uploaded audio files
  • Format Support: MP3, WAV, M4A, and other common audio formats
  • Response Formats: Simple text or detailed JSON with timestamps
  • Language Detection: Automatic or manual language specification
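
The file-handling rules above boil down to two mechanical checks. Here is a small sketch of that validation as a pure function (`validateAudioUpload` is a hypothetical helper name; the same rules are enforced by the multer configuration you add below):

```javascript
// OpenAI's Whisper endpoint accepts files up to 25MB
const MAX_AUDIO_BYTES = 25 * 1024 * 1024;

// Hypothetical helper mirroring the server-side rules: the file must have
// an audio/* MIME type and stay under the size ceiling.
function validateAudioUpload({ mimetype, size }) {
  if (!mimetype || !mimetype.startsWith("audio/")) {
    return { ok: false, error: "Only audio files are allowed" };
  }
  if (size > MAX_AUDIO_BYTES) {
    return { ok: false, error: "File too large. Maximum size is 25MB." };
  }
  return { ok: true, error: null };
}

console.log(validateAudioUpload({ mimetype: "audio/mpeg", size: 1024 }));
// → { ok: true, error: null }
```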

Step 2B: Installing the File Upload Dependency

First, add the file upload dependency to your backend. In your backend folder, run:

npm install multer

What multer does: Handles file uploads in Express applications, allowing users to upload audio files to your server.

Step 2C: Adding the Audio Transcription Route


Add this code to your existing index.js file: the imports go at the top of the file, and the new endpoint goes right after your image generation routes:

import multer from 'multer';
import fs from 'fs';
import path from 'path';

// 🎤 MULTER SETUP: Configure file upload handling
const upload = multer({
  storage: multer.memoryStorage(),
  limits: {
    fileSize: 25 * 1024 * 1024 // 25MB limit (OpenAI's max)
  },
  fileFilter: (req, file, cb) => {
    // Accept only audio files
    if (file.mimetype.startsWith('audio/')) {
      cb(null, true);
    } else {
      cb(new Error('Only audio files are allowed'), false);
    }
  }
});

// 🔧 HELPER FUNCTIONS: File management utilities
const createTempFile = async (file) => {
  const tempDir = path.join(process.cwd(), "temp");

  // Create temp directory if it doesn't exist
  if (!fs.existsSync(tempDir)) {
    fs.mkdirSync(tempDir, { recursive: true });
  }

  // Create unique filename
  const fileExtension = path.extname(file.originalname) || '.wav';
  const tempFilePath = path.join(tempDir, `audio-${Date.now()}${fileExtension}`);

  // Write file to disk
  fs.writeFileSync(tempFilePath, file.buffer);
  return tempFilePath;
};

const cleanupTempFile = (filePath) => {
  try {
    if (fs.existsSync(filePath)) {
      fs.unlinkSync(filePath);
      console.log(`🧹 Cleaned up: ${path.basename(filePath)}`);
    }
  } catch (error) {
    console.error("Error cleaning up file:", error);
  }
};

// 🎤 AI Audio Transcription endpoint - add this to your existing server
app.post("/api/audio/transcribe", upload.single("audio"), async (req, res) => {
  let tempFilePath = null;

  try {
    // 🛡️ VALIDATION: Check if audio file was uploaded
    const audioFile = req.file;
    const {
      language = null,          // Optional: specify language (e.g., "en", "es")
      response_format = "text"  // "text" or "verbose_json"
    } = req.body;

    if (!audioFile) {
      return res.status(400).json({
        error: "No audio file uploaded",
        success: false
      });
    }

    console.log(`🎤 Processing: ${audioFile.originalname} (${audioFile.size} bytes)`);

    // 💾 TEMP FILE: Create temporary file for OpenAI processing
    tempFilePath = await createTempFile(audioFile);

    // 🤖 AI TRANSCRIPTION: Process with Whisper
    const transcription = await openai.audio.transcriptions.create({
      file: fs.createReadStream(tempFilePath),
      model: "whisper-1",
      response_format: response_format,
      temperature: 0.0,             // Lower temperature for more consistent results
      ...(language && { language }) // Add language if specified
    });

    // 🧹 CLEANUP: Remove temporary file immediately
    cleanupTempFile(tempFilePath);
    tempFilePath = null;

    // 📤 SUCCESS RESPONSE: Send results based on format
    if (response_format === "verbose_json") {
      res.json({
        success: true,
        transcription: {
          text: transcription.text,
          language: transcription.language,
          duration: transcription.duration,
          segments: (transcription.segments || []).map(segment => ({
            start: segment.start,
            end: segment.end,
            text: segment.text
          }))
        },
        metadata: {
          filename: audioFile.originalname,
          size: audioFile.size,
          model: "whisper-1",
          timestamp: new Date().toISOString()
        }
      });
    } else {
      // With response_format "text", the SDK returns a plain string
      res.json({
        success: true,
        transcription: {
          text: transcription
        },
        metadata: {
          filename: audioFile.originalname,
          size: audioFile.size,
          model: "whisper-1",
          timestamp: new Date().toISOString()
        }
      });
    }
  } catch (error) {
    // 🚨 ERROR HANDLING: Clean up and return error
    console.error("Audio transcription error:", error);
    if (tempFilePath) {
      cleanupTempFile(tempFilePath);
    }
    res.status(500).json({
      error: "Failed to transcribe audio",
      details: error.message,
      success: false
    });
  }
});

Function breakdown:

  1. File validation - Ensure audio file is uploaded and within size limits
  2. Temporary storage - Save uploaded file temporarily for OpenAI processing
  3. Transcription - Call OpenAI’s Whisper model to convert speech to text
  4. Response formatting - Return either simple text or detailed JSON with timestamps
  5. Cleanup - Remove temporary files to prevent storage buildup
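
The timestamped segments from the verbose format are easy to post-process. As one illustrative example (`segmentsToSrt` and `toTimestamp` are our own helper names, not part of any library), here is how the `{ start, end, text }` segments returned by this endpoint could be turned into SRT subtitle text:

```javascript
// Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm
function toTimestamp(seconds) {
  const ms = Math.round(seconds * 1000);
  const h = String(Math.floor(ms / 3600000)).padStart(2, "0");
  const m = String(Math.floor((ms % 3600000) / 60000)).padStart(2, "0");
  const s = String(Math.floor((ms % 60000) / 1000)).padStart(2, "0");
  const frac = String(ms % 1000).padStart(3, "0");
  return `${h}:${m}:${s},${frac}`;
}

// Convert an array of { start, end, text } segments into SRT subtitle text
function segmentsToSrt(segments) {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${toTimestamp(seg.start)} --> ${toTimestamp(seg.end)}\n${seg.text.trim()}`)
    .join("\n\n");
}

console.log(segmentsToSrt([
  { start: 0, end: 2.5, text: "Hello there." },
  { start: 2.5, end: 5, text: "Welcome to the demo." }
]));
```

This is one reason to offer verbose_json in the UI: the same response that powers the text display can also drive subtitle or caption export.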

Step 2D: Adding Error Handling for File Uploads


Add this middleware to your index.js after your route definitions (Express only invokes error-handling middleware registered after the routes it should cover):

// 🚨 MULTER ERROR HANDLING: Handle file upload errors
app.use((error, req, res, next) => {
  if (error instanceof multer.MulterError) {
    if (error.code === 'LIMIT_FILE_SIZE') {
      return res.status(400).json({
        error: "File too large. Maximum size is 25MB.",
        success: false
      });
    }
    return res.status(400).json({
      error: error.message,
      success: false
    });
  }

  if (error.message === 'Only audio files are allowed') {
    return res.status(400).json({
      error: "Please upload an audio file (MP3, WAV, M4A, etc.)",
      success: false
    });
  }

  next(error);
});

Your backend now supports:

  • Text chat (existing functionality)
  • Streaming chat (existing functionality)
  • Image generation (existing functionality)
  • Audio transcription (new functionality)

🔧 Step 3: Building the React Audio Component


Now let’s create a React component for audio transcription using the same patterns from your existing components.

Step 3A: Creating the Audio Transcription Component


Create a new file src/AudioTranscription.jsx:

import { useState, useRef } from "react";
import { Upload, Mic, FileAudio, Download, MessageSquare } from "lucide-react";

function AudioTranscription() {
  // 🧠 STATE: Audio transcription data management
  const [audioFile, setAudioFile] = useState(null);             // Uploaded audio file
  const [isRecording, setIsRecording] = useState(false);        // Recording status
  const [recordedBlob, setRecordedBlob] = useState(null);       // Recorded audio data
  const [isTranscribing, setIsTranscribing] = useState(false);  // Processing status
  const [transcription, setTranscription] = useState(null);     // Transcription results
  const [error, setError] = useState(null);                     // Error messages
  const [responseFormat, setResponseFormat] = useState("text"); // Response format
  const [language, setLanguage] = useState("");                 // Language selection

  // 🎤 RECORDING: Media recorder and audio playback refs
  const mediaRecorderRef = useRef(null);
  const audioPlayerRef = useRef(null);
  const fileInputRef = useRef(null);

  // 🔧 FUNCTIONS: Audio processing logic

  // Start voice recording
  const startRecording = async () => {
    try {
      const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
      const mediaRecorder = new MediaRecorder(stream);
      const audioChunks = [];

      mediaRecorder.ondataavailable = (event) => {
        audioChunks.push(event.data);
      };

      mediaRecorder.onstop = () => {
        // Use the recorder's actual MIME type (usually audio/webm in Chrome)
        // so the blob isn't mislabeled as WAV
        const audioBlob = new Blob(audioChunks, { type: mediaRecorder.mimeType || 'audio/webm' });
        setRecordedBlob(audioBlob);
        // Stop all tracks to release the microphone
        stream.getTracks().forEach(track => track.stop());
      };

      mediaRecorder.start();
      mediaRecorderRef.current = mediaRecorder;
      setIsRecording(true);
      setError(null);
    } catch (error) {
      console.error('Recording error:', error);
      setError('Could not access microphone. Please check permissions.');
    }
  };

  // Stop voice recording
  const stopRecording = () => {
    if (mediaRecorderRef.current && isRecording) {
      mediaRecorderRef.current.stop();
      setIsRecording(false);
      mediaRecorderRef.current = null;
    }
  };

  // Handle file upload
  const handleFileUpload = (event) => {
    const file = event.target.files[0];
    if (file) {
      // Validate file type
      if (!file.type.startsWith('audio/')) {
        setError('Please select an audio file (MP3, WAV, M4A, etc.)');
        return;
      }
      // Validate file size (25MB limit)
      if (file.size > 25 * 1024 * 1024) {
        setError('File too large. Maximum size is 25MB.');
        return;
      }
      setAudioFile(file);
      setRecordedBlob(null);
      setTranscription(null);
      setError(null);
    }
  };

  // Main transcription function
  const transcribeAudio = async () => {
    const fileToProcess = audioFile || recordedBlob;

    // 🛡️ GUARDS: Prevent invalid transcription
    if (!fileToProcess || isTranscribing) return;

    // 🔄 SETUP: Prepare for transcription
    setIsTranscribing(true);
    setError(null);
    setTranscription(null);

    try {
      // 📤 FORM DATA: Prepare multipart form data
      const formData = new FormData();
      formData.append('audio', fileToProcess, audioFile?.name || 'recorded_audio.webm');
      formData.append('response_format', responseFormat);
      if (language) {
        formData.append('language', language);
      }

      // 📡 API CALL: Send to your backend
      const response = await fetch("http://localhost:8000/api/audio/transcribe", {
        method: "POST",
        body: formData
      });
      const data = await response.json();

      if (!response.ok) {
        throw new Error(data.error || 'Failed to transcribe audio');
      }

      // ✅ SUCCESS: Store transcription results
      setTranscription(data);
    } catch (error) {
      // 🚨 ERROR HANDLING: Show user-friendly message
      console.error('Transcription failed:', error);
      setError(error.message || 'Something went wrong while transcribing the audio');
    } finally {
      // 🧹 CLEANUP: Reset processing state
      setIsTranscribing(false);
    }
  };

  // Clear all audio data
  const clearAudio = () => {
    setAudioFile(null);
    setRecordedBlob(null);
    setTranscription(null);
    setError(null);
    if (fileInputRef.current) {
      fileInputRef.current.value = '';
    }
  };

  // Download transcription as a text file
  const downloadTranscription = () => {
    if (!transcription?.transcription?.text) return;
    const element = document.createElement('a');
    const file = new Blob([transcription.transcription.text], { type: 'text/plain' });
    element.href = URL.createObjectURL(file);
    element.download = `transcription-${Date.now()}.txt`;
    document.body.appendChild(element);
    element.click();
    document.body.removeChild(element);
    URL.revokeObjectURL(element.href); // Free the object URL once the download starts
  };

  // Language options for transcription
  const languages = [
    { value: "", label: "Auto-detect" },
    { value: "en", label: "English" },
    { value: "es", label: "Spanish" },
    { value: "fr", label: "French" },
    { value: "de", label: "German" },
    { value: "it", label: "Italian" },
    { value: "pt", label: "Portuguese" },
    { value: "ja", label: "Japanese" },
    { value: "ko", label: "Korean" },
    { value: "zh", label: "Chinese" }
  ];
  // 🎨 UI: Interface components
  return (
    <div className="min-h-screen bg-gradient-to-br from-blue-50 to-indigo-50 flex items-center justify-center p-4">
      <div className="bg-white rounded-2xl shadow-2xl w-full max-w-4xl flex flex-col overflow-hidden">

        {/* Header */}
        <div className="bg-gradient-to-r from-blue-600 to-indigo-600 text-white p-6">
          <div className="flex items-center space-x-3">
            <div className="w-10 h-10 bg-white bg-opacity-20 rounded-full flex items-center justify-center">
              <Mic className="w-5 h-5" />
            </div>
            <div>
              <h1 className="text-xl font-bold">🎤 AI Audio Transcription</h1>
              <p className="text-blue-100 text-sm">Convert speech to text with AI!</p>
            </div>
          </div>
        </div>

        {/* Audio Input Section */}
        <div className="p-6 border-b border-gray-200">
          <div className="grid grid-cols-1 md:grid-cols-2 gap-6 mb-6">

            {/* File Upload */}
            <div>
              <h3 className="font-semibold text-gray-900 mb-3 flex items-center">
                <Upload className="w-5 h-5 mr-2 text-blue-600" />
                Upload Audio File
              </h3>
              <div
                onClick={() => fileInputRef.current?.click()}
                className="border-2 border-dashed border-gray-300 rounded-xl p-6 text-center cursor-pointer hover:border-blue-400 hover:bg-blue-50 transition-colors duration-200"
              >
                <Upload className="w-8 h-8 text-gray-400 mx-auto mb-2" />
                <p className="text-gray-600">
                  {audioFile ? audioFile.name : 'Click to upload audio file'}
                </p>
                <p className="text-sm text-gray-500 mt-1">
                  MP3, WAV, M4A • Max 25MB
                </p>
              </div>
              <input
                ref={fileInputRef}
                type="file"
                accept="audio/*"
                onChange={handleFileUpload}
                className="hidden"
              />
            </div>

            {/* Voice Recording */}
            <div>
              <h3 className="font-semibold text-gray-900 mb-3 flex items-center">
                <Mic className="w-5 h-5 mr-2 text-blue-600" />
                Record Audio
              </h3>
              <div className="border-2 border-gray-300 rounded-xl p-6 text-center">
                <div className="flex flex-col items-center space-y-4">
                  <button
                    onClick={isRecording ? stopRecording : startRecording}
                    className={`w-16 h-16 rounded-full flex items-center justify-center transition-all duration-200 ${
                      isRecording
                        ? 'bg-red-500 hover:bg-red-600 animate-pulse'
                        : 'bg-blue-500 hover:bg-blue-600'
                    }`}
                  >
                    <Mic className="w-8 h-8 text-white" />
                  </button>
                  <p className="text-gray-600">
                    {isRecording
                      ? 'Recording... Click to stop'
                      : recordedBlob
                        ? 'Recording ready'
                        : 'Click to start recording'
                    }
                  </p>
                </div>
              </div>
            </div>
          </div>

          {/* Settings Row */}
          <div className="grid grid-cols-1 md:grid-cols-3 gap-4 mb-4">

            {/* Language Selection */}
            <div>
              <label className="block text-sm font-semibold text-gray-700 mb-2">
                Language
              </label>
              <select
                value={language}
                onChange={(e) => setLanguage(e.target.value)}
                disabled={isTranscribing}
                className="w-full px-3 py-2 border border-gray-300 rounded-lg focus:outline-none focus:ring-2 focus:ring-blue-500 disabled:bg-gray-100"
              >
                {languages.map(lang => (
                  <option key={lang.value} value={lang.value}>
                    {lang.label}
                  </option>
                ))}
              </select>
            </div>

            {/* Response Format */}
            <div>
              <label className="block text-sm font-semibold text-gray-700 mb-2">
                Detail Level
              </label>
              <select
                value={responseFormat}
                onChange={(e) => setResponseFormat(e.target.value)}
                disabled={isTranscribing}
                className="w-full px-3 py-2 border border-gray-300 rounded-lg focus:outline-none focus:ring-2 focus:ring-blue-500 disabled:bg-gray-100"
              >
                <option value="text">Simple Text</option>
                <option value="verbose_json">Detailed with Timestamps</option>
              </select>
            </div>

            {/* Action Buttons */}
            <div className="flex space-x-2">
              <button
                onClick={transcribeAudio}
                disabled={isTranscribing || (!audioFile && !recordedBlob)}
                className="flex-1 bg-gradient-to-r from-blue-600 to-indigo-600 hover:from-blue-700 hover:to-indigo-700 disabled:from-gray-300 disabled:to-gray-300 text-white px-4 py-2 rounded-lg transition-all duration-200 flex items-center justify-center space-x-2 shadow-lg disabled:shadow-none"
              >
                {isTranscribing ? (
                  <>
                    <div className="w-4 h-4 border-2 border-white border-t-transparent rounded-full animate-spin"></div>
                    <span>Processing...</span>
                  </>
                ) : (
                  <>
                    <MessageSquare className="w-4 h-4" />
                    <span>Transcribe</span>
                  </>
                )}
              </button>
              {(audioFile || recordedBlob) && (
                <button
                  onClick={clearAudio}
                  disabled={isTranscribing}
                  className="px-4 py-2 border border-gray-300 text-gray-700 rounded-lg hover:bg-gray-50 transition-colors duration-200"
                >
                  Clear
                </button>
              )}
            </div>
          </div>
        </div>

        {/* Results Section */}
        <div className="flex-1 p-6">

          {/* Error Display */}
          {error && (
            <div className="bg-red-50 border border-red-200 rounded-lg p-4 mb-4">
              <p className="text-red-700">
                <strong>Error:</strong> {error}
              </p>
            </div>
          )}

          {/* Audio Preview */}
          {(audioFile || recordedBlob) && (
            <div className="bg-gray-50 rounded-lg p-4 mb-4">
              <h4 className="font-semibold text-gray-900 mb-2 flex items-center">
                <FileAudio className="w-4 h-4 mr-2" />
                Audio Preview
              </h4>
              <audio
                ref={audioPlayerRef}
                controls
                src={audioFile ? URL.createObjectURL(audioFile) : recordedBlob ? URL.createObjectURL(recordedBlob) : ''}
                className="w-full"
              />
              <p className="text-sm text-gray-600 mt-2">
                {audioFile ? `File: ${audioFile.name}` : 'Recorded Audio'}
              </p>
            </div>
          )}

          {/* Transcription Results */}
          {transcription ? (
            <div className="bg-gray-50 rounded-lg p-4">
              <div className="flex items-center justify-between mb-4">
                <h4 className="font-semibold text-gray-900">Transcription Result</h4>
                <button
                  onClick={downloadTranscription}
                  className="bg-gradient-to-r from-green-500 to-green-600 hover:from-green-600 hover:to-green-700 text-white px-4 py-2 rounded-lg transition-all duration-200 flex items-center space-x-2"
                >
                  <Download className="w-4 h-4" />
                  <span>Download</span>
                </button>
              </div>
              <div className="space-y-4">

                {/* Transcribed Text */}
                <div className="bg-white rounded-lg p-4">
                  <h5 className="font-medium text-gray-700 mb-2">Transcribed Text:</h5>
                  <p className="text-gray-900 leading-relaxed whitespace-pre-wrap">
                    {transcription.transcription.text}
                  </p>
                </div>

                {/* Metadata */}
                <div className="grid grid-cols-1 md:grid-cols-3 gap-4">
                  <div className="bg-white rounded-lg p-3 text-center">
                    <p className="text-sm text-gray-600">File</p>
                    <p className="font-semibold text-gray-900 text-sm">
                      {transcription.metadata.filename}
                    </p>
                  </div>
                  <div className="bg-white rounded-lg p-3 text-center">
                    <p className="text-sm text-gray-600">Size</p>
                    <p className="font-semibold text-gray-900">
                      {(transcription.metadata.size / 1024 / 1024).toFixed(1)} MB
                    </p>
                  </div>
                  <div className="bg-white rounded-lg p-3 text-center">
                    <p className="text-sm text-gray-600">Model</p>
                    <p className="font-semibold text-gray-900">
                      {transcription.metadata.model}
                    </p>
                  </div>
                </div>

                {/* Detailed Information (if verbose_json) */}
                {transcription.transcription.duration && (
                  <div className="grid grid-cols-1 md:grid-cols-2 gap-4">
                    <div className="bg-white rounded-lg p-3 text-center">
                      <p className="text-sm text-gray-600">Duration</p>
                      <p className="font-semibold text-gray-900">
                        {transcription.transcription.duration.toFixed(1)}s
                      </p>
                    </div>
                    <div className="bg-white rounded-lg p-3 text-center">
                      <p className="text-sm text-gray-600">Language</p>
                      <p className="font-semibold text-gray-900">
                        {transcription.transcription.language || 'Auto-detected'}
                      </p>
                    </div>
                  </div>
                )}
              </div>
            </div>
          ) : !isTranscribing && !error && (
            // Welcome State
            <div className="text-center py-12">
              <div className="w-16 h-16 bg-blue-100 rounded-2xl flex items-center justify-center mx-auto mb-4">
                <Mic className="w-8 h-8 text-blue-600" />
              </div>
              <h3 className="text-lg font-semibold text-gray-700 mb-2">
                Ready to Transcribe!
              </h3>
              <p className="text-gray-600 max-w-md mx-auto">
                Upload an audio file or record your voice, then click "Transcribe" to convert speech to text with AI.
              </p>
            </div>
          )}
        </div>
      </div>
    </div>
  );
}

export default AudioTranscription;

Update your src/App.jsx to include the new audio transcription component:

import { useState } from "react";
import StreamingChat from "./StreamingChat";
import ImageGenerator from "./ImageGenerator";
import AudioTranscription from "./AudioTranscription";
import { MessageSquare, Image, Mic } from "lucide-react";

function App() {
  // 🧠 STATE: Navigation management
  const [currentView, setCurrentView] = useState("chat"); // 'chat', 'images', or 'audio'

  // 🎨 UI: Main app with navigation
  return (
    <div className="min-h-screen bg-gray-100">
      {/* Navigation Header */}
      <nav className="bg-white shadow-sm border-b border-gray-200">
        <div className="max-w-6xl mx-auto px-4">
          <div className="flex items-center justify-between h-16">
            {/* Logo */}
            <div className="flex items-center space-x-3">
              <div className="w-8 h-8 bg-gradient-to-r from-blue-500 to-purple-600 rounded-lg flex items-center justify-center">
                <span className="text-white font-bold text-sm">AI</span>
              </div>
              <h1 className="text-xl font-bold text-gray-900">OpenAI Mastery</h1>
            </div>
            {/* Navigation Buttons */}
            <div className="flex space-x-2">
              <button
                onClick={() => setCurrentView("chat")}
                className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
                  currentView === "chat"
                    ? "bg-blue-100 text-blue-700 shadow-sm"
                    : "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
                }`}
              >
                <MessageSquare className="w-4 h-4" />
                <span>Chat</span>
              </button>
              <button
                onClick={() => setCurrentView("images")}
                className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
                  currentView === "images"
                    ? "bg-purple-100 text-purple-700 shadow-sm"
                    : "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
                }`}
              >
                <Image className="w-4 h-4" />
                <span>Images</span>
              </button>
              <button
                onClick={() => setCurrentView("audio")}
                className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
                  currentView === "audio"
                    ? "bg-blue-100 text-blue-700 shadow-sm"
                    : "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
                }`}
              >
                <Mic className="w-4 h-4" />
                <span>Audio</span>
              </button>
            </div>
          </div>
        </div>
      </nav>

      {/* Main Content */}
      <main className="h-[calc(100vh-4rem)]">
        {currentView === "chat" && <StreamingChat />}
        {currentView === "images" && <ImageGenerator />}
        {currentView === "audio" && <AudioTranscription />}
      </main>
    </div>
  );
}

export default App;

Let’s test your audio transcription feature step by step to make sure everything works correctly.

First, verify your backend route works by testing it directly:

Test with a small audio file:

# Create a test audio file or use an existing one
curl -X POST http://localhost:8000/api/audio/transcribe \
  -F "audio=@test_audio.mp3" \
  -F "response_format=text"

Expected response:

{
  "success": true,
  "transcription": {
    "text": "This is a test of the audio transcription feature..."
  },
  "metadata": {
    "filename": "test_audio.mp3",
    "size": 45612,
    "model": "whisper-1",
    "timestamp": "2024-01-15T10:30:00.000Z"
  }
}
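
For comparison, sending response_format=verbose_json returns the extra detail your endpoint exposes. The values below are illustrative, but the shape follows the verbose_json branch of the route you added in Step 2C:

```json
{
  "success": true,
  "transcription": {
    "text": "This is a test of the audio transcription feature...",
    "language": "english",
    "duration": 4.2,
    "segments": [
      { "start": 0.0, "end": 4.2, "text": "This is a test of the audio transcription feature..." }
    ]
  },
  "metadata": {
    "filename": "test_audio.mp3",
    "size": 45612,
    "model": "whisper-1",
    "timestamp": "2024-01-15T10:30:00.000Z"
  }
}
```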

Start both servers:

Backend (in your backend folder):

npm run dev

Frontend (in your frontend folder):

npm run dev

Test the complete flow:

  1. Navigate to Audio → Click the “Audio” tab in navigation
  2. Test file upload → Upload an MP3 or WAV file
  3. Test recording → Click microphone to record voice
  4. Test transcription → Click “Transcribe” and see loading state
  5. View results → See transcribed text with metadata
  6. Test download → Download transcription as text file
  7. Test settings → Try different languages and detail levels

Test browser microphone access:

  1. Click record button → Browser should ask for microphone permission
  2. Allow access → Recording should start with red pulsing button
  3. Record voice → Speak clearly for 5-10 seconds
  4. Stop recording → Click button again to stop
  5. Transcribe → Process the recorded audio

Expected behavior:

  • Smooth recording start/stop
  • Clear audio playback preview
  • Accurate transcription of recorded speech

Test error scenarios:

❌ No audio file: Click transcribe without uploading/recording
❌ Wrong file type: Upload a PDF or image file
❌ Large file: Upload audio file larger than 25MB
❌ Microphone denied: Deny microphone permissions

Expected behavior:

  • Clear error messages displayed
  • No application crashes
  • User can try again with different input

Congratulations! You’ve extended your existing application with complete AI audio transcription:

  • Extended your backend with audio file upload and processing
  • Added React audio component following the same patterns as chat and images
  • Implemented voice recording with browser microphone access
  • Created audio file upload with drag-and-drop interface
  • Added transcription settings for language and detail level
  • Included download functionality for transcribed text
  • Maintained consistent design with your existing application

Your application now has:

  • Text chat with streaming responses
  • Image generation with DALL-E 3 and GPT-Image-1
  • Audio transcription with Whisper voice recognition
  • Unified navigation between all features
  • Professional UI with consistent TailwindCSS styling

Next up: You’ll learn about text-to-speech synthesis, where you can convert text back into natural-sounding speech using OpenAI’s voice models.

Your OpenAI mastery application is becoming incredibly versatile! 🎤