
🎤 AI Audio Transcription Made Simple

Right now, you know how to build chat applications with text and generate images with AI. But what if your AI could also understand and process audio?

Audio transcription opens up voice capabilities. Instead of just typing messages, users can speak to your AI, record voice notes, transcribe meetings, and create voice-powered applications.

You’re about to learn exactly how to add voice processing to your existing application.


🧠 Step 1: Understanding AI Audio Transcription


Before we write any code, let’s understand what AI audio transcription actually means and why it’s useful for your applications.

What AI Audio Transcription Actually Means


AI audio transcription is like having a professional transcriptionist inside your application. Users upload audio files or record their voice, and the AI converts speech to text with incredible accuracy in seconds.

Real-world analogy: It’s like hiring a stenographer who works instantly. Instead of manually typing out recordings or paying for transcription services, you upload an audio file and get accurate text immediately.

Think about all the times you or your users need to convert audio to text:

  • Meeting recordings need to be converted to searchable notes
  • Voice messages need to be transcribed for accessibility
  • Podcast content needs text versions for SEO and accessibility
  • Voice commands need to be processed by your application
  • Language learners need pronunciation feedback and practice

Without AI audio transcription, you’d need to:

  1. Manually type out recordings (time-consuming)
  2. Pay expensive transcription services (costly)
  3. Use basic speech recognition (inaccurate)
  4. Miss accessibility opportunities (limiting)

With AI audio transcription, you just upload audio and get accurate text instantly.

OpenAI provides one incredibly powerful audio model:

🎤 Whisper-1 - The Speech Recognition Expert

  • Best for: Converting any speech to text with high accuracy
  • Strengths: Multi-language support, noise handling, natural conversation understanding
  • Supports: 50+ languages, various audio formats (MP3, WAV, M4A, etc.)
  • Think of it as: Your professional transcriptionist who never gets tired

Whisper is beginner-friendly: you upload an audio file and it returns accurate text, with timestamps available when you request the verbose JSON format.
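Under the hood, a Whisper request is just an options object handed to the SDK. As a rough sketch of the parameters involved (`buildTranscriptionOptions` is our own illustrative helper, not part of the OpenAI SDK), the shape looks like this:

```javascript
// Hypothetical helper: assemble the options object for a Whisper call.
// Only "whisper-1" is fixed; language and format come from user input.
function buildTranscriptionOptions(file, { language = null, responseFormat = "text" } = {}) {
  return {
    file,                             // a readable stream of the audio file
    model: "whisper-1",
    response_format: responseFormat,  // "text" or "verbose_json"
    ...(language && { language })     // omit the key entirely for auto-detect
  };
}

// The object would then be passed straight to the SDK, e.g.:
// const result = await openai.audio.transcriptions.create(
//   buildTranscriptionOptions(stream, { language: "en" })
// );

console.log(buildTranscriptionOptions("stream", { language: "es" }));
```

Leaving `language` out entirely (rather than sending an empty value) is what tells Whisper to auto-detect the spoken language.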


🔧 Step 2: Adding Audio Transcription to Your Backend


Let’s add audio transcription to your existing backend using the same patterns you learned in Module 1. We’ll add new routes to handle audio file uploads and processing.

Building on your foundation: You already have a working Node.js server with OpenAI integration. We’re simply adding audio capabilities to what you’ve built.

Step 2A: Understanding Audio Processing State


Before writing code, let’s understand what data our audio transcription system needs to manage:

// 🧠 AUDIO TRANSCRIPTION STATE CONCEPTS:
// 1. Audio File - The uploaded or recorded audio data
// 2. File Metadata - Original filename, size, format information
// 3. Transcription Settings - Language, response format, temperature
// 4. Processing Results - Text, timestamps, confidence scores
// 5. Error States - Invalid files, processing failures, file size limits

Key audio transcription concepts:

  • File Handling: Temporary storage and cleanup of uploaded audio files
  • Format Support: MP3, WAV, M4A, and other common audio formats
  • Response Formats: Simple text or detailed JSON with timestamps
  • Language Detection: Automatic or manual language specification
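
The file-handling rules above boil down to two mechanical checks. Here is a small sketch of that validation as a pure function (`validateAudioUpload` is a hypothetical helper name; the same rules are enforced by the multer configuration you add below):

```javascript
// OpenAI's Whisper endpoint accepts files up to 25MB
const MAX_AUDIO_BYTES = 25 * 1024 * 1024;

// Hypothetical helper mirroring the server-side rules: the file must have
// an audio/* MIME type and stay under the size ceiling.
function validateAudioUpload({ mimetype, size }) {
  if (!mimetype || !mimetype.startsWith("audio/")) {
    return { ok: false, error: "Only audio files are allowed" };
  }
  if (size > MAX_AUDIO_BYTES) {
    return { ok: false, error: "File too large. Maximum size is 25MB." };
  }
  return { ok: true, error: null };
}

console.log(validateAudioUpload({ mimetype: "audio/mpeg", size: 1024 }));
// → { ok: true, error: null }
```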

Step 2B: Installing the File Upload Dependency

First, add the file upload dependency to your backend. In your backend folder, run:

npm install multer

What multer does: Handles file uploads in Express applications, allowing users to upload audio files to your server.

Step 2C: Adding the Audio Transcription Route


Add this code to your existing index.js file: the imports go at the top of the file, and the new endpoint goes right after your image generation routes:

import multer from 'multer';
import fs from 'fs';
import path from 'path';

// 🎤 MULTER SETUP: Configure file upload handling
const upload = multer({
  storage: multer.memoryStorage(),
  limits: {
    fileSize: 25 * 1024 * 1024 // 25MB limit (OpenAI's max)
  },
  fileFilter: (req, file, cb) => {
    // Accept only audio files
    if (file.mimetype.startsWith('audio/')) {
      cb(null, true);
    } else {
      cb(new Error('Only audio files are allowed'), false);
    }
  }
});

// 🔧 HELPER FUNCTIONS: File management utilities
const createTempFile = async (file) => {
  const tempDir = path.join(process.cwd(), "temp");

  // Create temp directory if it doesn't exist
  if (!fs.existsSync(tempDir)) {
    fs.mkdirSync(tempDir, { recursive: true });
  }

  // Create unique filename
  const fileExtension = path.extname(file.originalname) || '.wav';
  const tempFilePath = path.join(tempDir, `audio-${Date.now()}${fileExtension}`);

  // Write file to disk
  fs.writeFileSync(tempFilePath, file.buffer);
  return tempFilePath;
};

const cleanupTempFile = (filePath) => {
  try {
    if (fs.existsSync(filePath)) {
      fs.unlinkSync(filePath);
      console.log(`🧹 Cleaned up: ${path.basename(filePath)}`);
    }
  } catch (error) {
    console.error("Error cleaning up file:", error);
  }
};

// 🎤 AI Audio Transcription endpoint - add this to your existing server
app.post("/api/audio/transcribe", upload.single("audio"), async (req, res) => {
  let tempFilePath = null;

  try {
    // 🛡️ VALIDATION: Check if audio file was uploaded
    const audioFile = req.file;
    const {
      language = null,          // Optional: specify language (e.g., "en", "es")
      response_format = "text"  // "text" or "verbose_json"
    } = req.body;

    if (!audioFile) {
      return res.status(400).json({
        error: "No audio file uploaded",
        success: false
      });
    }

    console.log(`🎤 Processing: ${audioFile.originalname} (${audioFile.size} bytes)`);

    // 💾 TEMP FILE: Create temporary file for OpenAI processing
    tempFilePath = await createTempFile(audioFile);

    // 🤖 AI TRANSCRIPTION: Process with Whisper
    const transcription = await openai.audio.transcriptions.create({
      file: fs.createReadStream(tempFilePath),
      model: "whisper-1",
      response_format: response_format,
      temperature: 0.0,             // Lower temperature for more consistent results
      ...(language && { language }) // Add language if specified
    });

    // 🧹 CLEANUP: Remove temporary file immediately
    cleanupTempFile(tempFilePath);
    tempFilePath = null;

    // 📤 SUCCESS RESPONSE: Send results based on format
    if (response_format === "verbose_json") {
      res.json({
        success: true,
        transcription: {
          text: transcription.text,
          language: transcription.language,
          duration: transcription.duration,
          segments: (transcription.segments || []).map(segment => ({
            start: segment.start,
            end: segment.end,
            text: segment.text
          }))
        },
        metadata: {
          filename: audioFile.originalname,
          size: audioFile.size,
          model: "whisper-1",
          timestamp: new Date().toISOString()
        }
      });
    } else {
      // With response_format "text", the SDK returns a plain string
      res.json({
        success: true,
        transcription: {
          text: transcription
        },
        metadata: {
          filename: audioFile.originalname,
          size: audioFile.size,
          model: "whisper-1",
          timestamp: new Date().toISOString()
        }
      });
    }
  } catch (error) {
    // 🚨 ERROR HANDLING: Clean up and return error
    console.error("Audio transcription error:", error);
    if (tempFilePath) {
      cleanupTempFile(tempFilePath);
    }
    res.status(500).json({
      error: "Failed to transcribe audio",
      details: error.message,
      success: false
    });
  }
});

Function breakdown:

  1. File validation - Ensure audio file is uploaded and within size limits
  2. Temporary storage - Save uploaded file temporarily for OpenAI processing
  3. Transcription - Call OpenAI’s Whisper model to convert speech to text
  4. Response formatting - Return either simple text or detailed JSON with timestamps
  5. Cleanup - Remove temporary files to prevent storage buildup
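
The timestamped segments from the verbose format are easy to post-process. As one illustrative example (`segmentsToSrt` and `toTimestamp` are our own helper names, not part of any library), here is how the `{ start, end, text }` segments returned by this endpoint could be turned into SRT subtitle text:

```javascript
// Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm
function toTimestamp(seconds) {
  const ms = Math.round(seconds * 1000);
  const h = String(Math.floor(ms / 3600000)).padStart(2, "0");
  const m = String(Math.floor((ms % 3600000) / 60000)).padStart(2, "0");
  const s = String(Math.floor((ms % 60000) / 1000)).padStart(2, "0");
  const frac = String(ms % 1000).padStart(3, "0");
  return `${h}:${m}:${s},${frac}`;
}

// Convert an array of { start, end, text } segments into SRT subtitle text
function segmentsToSrt(segments) {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${toTimestamp(seg.start)} --> ${toTimestamp(seg.end)}\n${seg.text.trim()}`)
    .join("\n\n");
}

console.log(segmentsToSrt([
  { start: 0, end: 2.5, text: "Hello there." },
  { start: 2.5, end: 5, text: "Welcome to the demo." }
]));
```

This is one reason to offer verbose_json in the UI: the same response that powers the text display can also drive subtitle or caption export.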

Step 2D: Adding Error Handling for File Uploads


Add this middleware to your index.js after your route definitions (Express only invokes error-handling middleware registered after the routes it should cover):

// 🚨 MULTER ERROR HANDLING: Handle file upload errors
app.use((error, req, res, next) => {
  if (error instanceof multer.MulterError) {
    if (error.code === 'LIMIT_FILE_SIZE') {
      return res.status(400).json({
        error: "File too large. Maximum size is 25MB.",
        success: false
      });
    }
    return res.status(400).json({
      error: error.message,
      success: false
    });
  }

  if (error.message === 'Only audio files are allowed') {
    return res.status(400).json({
      error: "Please upload an audio file (MP3, WAV, M4A, etc.)",
      success: false
    });
  }

  next(error);
});

Your backend now supports:

  • Text chat (existing functionality)
  • Streaming chat (existing functionality)
  • Image generation (existing functionality)
  • Audio transcription (new functionality)

🔧 Step 3: Building the React Audio Component


Now let’s create a React component for audio transcription using the same patterns from your existing components.

Step 3A: Creating the Audio Transcription Component


Create a new file src/AudioTranscription.jsx:

import { useState, useRef } from "react";
import { Upload, Mic, FileAudio, Download, MessageSquare } from "lucide-react";

function AudioTranscription() {
  // 🧠 STATE: Audio transcription data management
  const [audioFile, setAudioFile] = useState(null);             // Uploaded audio file
  const [isRecording, setIsRecording] = useState(false);        // Recording status
  const [recordedBlob, setRecordedBlob] = useState(null);       // Recorded audio data
  const [isTranscribing, setIsTranscribing] = useState(false);  // Processing status
  const [transcription, setTranscription] = useState(null);     // Transcription results
  const [error, setError] = useState(null);                     // Error messages
  const [responseFormat, setResponseFormat] = useState("text"); // Response format
  const [language, setLanguage] = useState("");                 // Language selection

  // 🎤 RECORDING: Media recorder and audio playback refs
  const mediaRecorderRef = useRef(null);
  const audioPlayerRef = useRef(null);
  const fileInputRef = useRef(null);

  // 🔧 FUNCTIONS: Audio processing logic

  // Start voice recording
  const startRecording = async () => {
    try {
      const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
      const mediaRecorder = new MediaRecorder(stream);
      const audioChunks = [];

      mediaRecorder.ondataavailable = (event) => {
        audioChunks.push(event.data);
      };

      mediaRecorder.onstop = () => {
        // Use the recorder's actual MIME type (usually audio/webm in Chrome)
        // so the blob isn't mislabeled as WAV
        const audioBlob = new Blob(audioChunks, { type: mediaRecorder.mimeType || 'audio/webm' });
        setRecordedBlob(audioBlob);
        // Stop all tracks to release the microphone
        stream.getTracks().forEach(track => track.stop());
      };

      mediaRecorder.start();
      mediaRecorderRef.current = mediaRecorder;
      setIsRecording(true);
      setError(null);
    } catch (error) {
      console.error('Recording error:', error);
      setError('Could not access microphone. Please check permissions.');
    }
  };

  // Stop voice recording
  const stopRecording = () => {
    if (mediaRecorderRef.current && isRecording) {
      mediaRecorderRef.current.stop();
      setIsRecording(false);
      mediaRecorderRef.current = null;
    }
  };

  // Handle file upload
  const handleFileUpload = (event) => {
    const file = event.target.files[0];
    if (file) {
      // Validate file type
      if (!file.type.startsWith('audio/')) {
        setError('Please select an audio file (MP3, WAV, M4A, etc.)');
        return;
      }
      // Validate file size (25MB limit)
      if (file.size > 25 * 1024 * 1024) {
        setError('File too large. Maximum size is 25MB.');
        return;
      }
      setAudioFile(file);
      setRecordedBlob(null);
      setTranscription(null);
      setError(null);
    }
  };

  // Main transcription function
  const transcribeAudio = async () => {
    const fileToProcess = audioFile || recordedBlob;

    // 🛡️ GUARDS: Prevent invalid transcription
    if (!fileToProcess || isTranscribing) return;

    // 🔄 SETUP: Prepare for transcription
    setIsTranscribing(true);
    setError(null);
    setTranscription(null);

    try {
      // 📤 FORM DATA: Prepare multipart form data
      const formData = new FormData();
      formData.append('audio', fileToProcess, audioFile?.name || 'recorded_audio.webm');
      formData.append('response_format', responseFormat);
      if (language) {
        formData.append('language', language);
      }

      // 📡 API CALL: Send to your backend
      const response = await fetch("http://localhost:8000/api/audio/transcribe", {
        method: "POST",
        body: formData
      });
      const data = await response.json();

      if (!response.ok) {
        throw new Error(data.error || 'Failed to transcribe audio');
      }

      // ✅ SUCCESS: Store transcription results
      setTranscription(data);
    } catch (error) {
      // 🚨 ERROR HANDLING: Show user-friendly message
      console.error('Transcription failed:', error);
      setError(error.message || 'Something went wrong while transcribing the audio');
    } finally {
      // 🧹 CLEANUP: Reset processing state
      setIsTranscribing(false);
    }
  };

  // Clear all audio data
  const clearAudio = () => {
    setAudioFile(null);
    setRecordedBlob(null);
    setTranscription(null);
    setError(null);
    if (fileInputRef.current) {
      fileInputRef.current.value = '';
    }
  };

  // Download transcription as a text file
  const downloadTranscription = () => {
    if (!transcription?.transcription?.text) return;
    const element = document.createElement('a');
    const file = new Blob([transcription.transcription.text], { type: 'text/plain' });
    element.href = URL.createObjectURL(file);
    element.download = `transcription-${Date.now()}.txt`;
    document.body.appendChild(element);
    element.click();
    document.body.removeChild(element);
    URL.revokeObjectURL(element.href); // Free the object URL once the download starts
  };

  // Language options for transcription
  const languages = [
    { value: "", label: "Auto-detect" },
    { value: "en", label: "English" },
    { value: "es", label: "Spanish" },
    { value: "fr", label: "French" },
    { value: "de", label: "German" },
    { value: "it", label: "Italian" },
    { value: "pt", label: "Portuguese" },
    { value: "ja", label: "Japanese" },
    { value: "ko", label: "Korean" },
    { value: "zh", label: "Chinese" }
  ];
  // 🎨 UI: Interface components
  return (
    <div className="min-h-screen bg-gradient-to-br from-blue-50 to-indigo-50 flex items-center justify-center p-4">
      <div className="bg-white rounded-2xl shadow-2xl w-full max-w-4xl flex flex-col overflow-hidden">

        {/* Header */}
        <div className="bg-gradient-to-r from-blue-600 to-indigo-600 text-white p-6">
          <div className="flex items-center space-x-3">
            <div className="w-10 h-10 bg-white bg-opacity-20 rounded-full flex items-center justify-center">
              <Mic className="w-5 h-5" />
            </div>
            <div>
              <h1 className="text-xl font-bold">🎤 AI Audio Transcription</h1>
              <p className="text-blue-100 text-sm">Convert speech to text with AI!</p>
            </div>
          </div>
        </div>

        {/* Audio Input Section */}
        <div className="p-6 border-b border-gray-200">
          <div className="grid grid-cols-1 md:grid-cols-2 gap-6 mb-6">

            {/* File Upload */}
            <div>
              <h3 className="font-semibold text-gray-900 mb-3 flex items-center">
                <Upload className="w-5 h-5 mr-2 text-blue-600" />
                Upload Audio File
              </h3>
              <div
                onClick={() => fileInputRef.current?.click()}
                className="border-2 border-dashed border-gray-300 rounded-xl p-6 text-center cursor-pointer hover:border-blue-400 hover:bg-blue-50 transition-colors duration-200"
              >
                <Upload className="w-8 h-8 text-gray-400 mx-auto mb-2" />
                <p className="text-gray-600">
                  {audioFile ? audioFile.name : 'Click to upload audio file'}
                </p>
                <p className="text-sm text-gray-500 mt-1">
                  MP3, WAV, M4A • Max 25MB
                </p>
              </div>
              <input
                ref={fileInputRef}
                type="file"
                accept="audio/*"
                onChange={handleFileUpload}
                className="hidden"
              />
            </div>

            {/* Voice Recording */}
            <div>
              <h3 className="font-semibold text-gray-900 mb-3 flex items-center">
                <Mic className="w-5 h-5 mr-2 text-blue-600" />
                Record Audio
              </h3>
              <div className="border-2 border-gray-300 rounded-xl p-6 text-center">
                <div className="flex flex-col items-center space-y-4">
                  <button
                    onClick={isRecording ? stopRecording : startRecording}
                    className={`w-16 h-16 rounded-full flex items-center justify-center transition-all duration-200 ${
                      isRecording
                        ? 'bg-red-500 hover:bg-red-600 animate-pulse'
                        : 'bg-blue-500 hover:bg-blue-600'
                    }`}
                  >
                    <Mic className="w-8 h-8 text-white" />
                  </button>
                  <p className="text-gray-600">
                    {isRecording
                      ? 'Recording... Click to stop'
                      : recordedBlob
                        ? 'Recording ready'
                        : 'Click to start recording'
                    }
                  </p>
                </div>
              </div>
            </div>
          </div>

          {/* Settings Row */}
          <div className="grid grid-cols-1 md:grid-cols-3 gap-4 mb-4">

            {/* Language Selection */}
            <div>
              <label className="block text-sm font-semibold text-gray-700 mb-2">
                Language
              </label>
              <select
                value={language}
                onChange={(e) => setLanguage(e.target.value)}
                disabled={isTranscribing}
                className="w-full px-3 py-2 border border-gray-300 rounded-lg focus:outline-none focus:ring-2 focus:ring-blue-500 disabled:bg-gray-100"
              >
                {languages.map(lang => (
                  <option key={lang.value} value={lang.value}>
                    {lang.label}
                  </option>
                ))}
              </select>
            </div>

            {/* Response Format */}
            <div>
              <label className="block text-sm font-semibold text-gray-700 mb-2">
                Detail Level
              </label>
              <select
                value={responseFormat}
                onChange={(e) => setResponseFormat(e.target.value)}
                disabled={isTranscribing}
                className="w-full px-3 py-2 border border-gray-300 rounded-lg focus:outline-none focus:ring-2 focus:ring-blue-500 disabled:bg-gray-100"
              >
                <option value="text">Simple Text</option>
                <option value="verbose_json">Detailed with Timestamps</option>
              </select>
            </div>

            {/* Action Buttons */}
            <div className="flex space-x-2">
              <button
                onClick={transcribeAudio}
                disabled={isTranscribing || (!audioFile && !recordedBlob)}
                className="flex-1 bg-gradient-to-r from-blue-600 to-indigo-600 hover:from-blue-700 hover:to-indigo-700 disabled:from-gray-300 disabled:to-gray-300 text-white px-4 py-2 rounded-lg transition-all duration-200 flex items-center justify-center space-x-2 shadow-lg disabled:shadow-none"
              >
                {isTranscribing ? (
                  <>
                    <div className="w-4 h-4 border-2 border-white border-t-transparent rounded-full animate-spin"></div>
                    <span>Processing...</span>
                  </>
                ) : (
                  <>
                    <MessageSquare className="w-4 h-4" />
                    <span>Transcribe</span>
                  </>
                )}
              </button>
              {(audioFile || recordedBlob) && (
                <button
                  onClick={clearAudio}
                  disabled={isTranscribing}
                  className="px-4 py-2 border border-gray-300 text-gray-700 rounded-lg hover:bg-gray-50 transition-colors duration-200"
                >
                  Clear
                </button>
              )}
            </div>
          </div>
        </div>

        {/* Results Section */}
        <div className="flex-1 p-6">

          {/* Error Display */}
          {error && (
            <div className="bg-red-50 border border-red-200 rounded-lg p-4 mb-4">
              <p className="text-red-700">
                <strong>Error:</strong> {error}
              </p>
            </div>
          )}

          {/* Audio Preview */}
          {(audioFile || recordedBlob) && (
            <div className="bg-gray-50 rounded-lg p-4 mb-4">
              <h4 className="font-semibold text-gray-900 mb-2 flex items-center">
                <FileAudio className="w-4 h-4 mr-2" />
                Audio Preview
              </h4>
              <audio
                ref={audioPlayerRef}
                controls
                src={audioFile ? URL.createObjectURL(audioFile) : recordedBlob ? URL.createObjectURL(recordedBlob) : ''}
                className="w-full"
              />
              <p className="text-sm text-gray-600 mt-2">
                {audioFile ? `File: ${audioFile.name}` : 'Recorded Audio'}
              </p>
            </div>
          )}

          {/* Transcription Results */}
          {transcription ? (
            <div className="bg-gray-50 rounded-lg p-4">
              <div className="flex items-center justify-between mb-4">
                <h4 className="font-semibold text-gray-900">Transcription Result</h4>
                <button
                  onClick={downloadTranscription}
                  className="bg-gradient-to-r from-green-500 to-green-600 hover:from-green-600 hover:to-green-700 text-white px-4 py-2 rounded-lg transition-all duration-200 flex items-center space-x-2"
                >
                  <Download className="w-4 h-4" />
                  <span>Download</span>
                </button>
              </div>
              <div className="space-y-4">

                {/* Transcribed Text */}
                <div className="bg-white rounded-lg p-4">
                  <h5 className="font-medium text-gray-700 mb-2">Transcribed Text:</h5>
                  <p className="text-gray-900 leading-relaxed whitespace-pre-wrap">
                    {transcription.transcription.text}
                  </p>
                </div>

                {/* Metadata */}
                <div className="grid grid-cols-1 md:grid-cols-3 gap-4">
                  <div className="bg-white rounded-lg p-3 text-center">
                    <p className="text-sm text-gray-600">File</p>
                    <p className="font-semibold text-gray-900 text-sm">
                      {transcription.metadata.filename}
                    </p>
                  </div>
                  <div className="bg-white rounded-lg p-3 text-center">
                    <p className="text-sm text-gray-600">Size</p>
                    <p className="font-semibold text-gray-900">
                      {(transcription.metadata.size / 1024 / 1024).toFixed(1)} MB
                    </p>
                  </div>
                  <div className="bg-white rounded-lg p-3 text-center">
                    <p className="text-sm text-gray-600">Model</p>
                    <p className="font-semibold text-gray-900">
                      {transcription.metadata.model}
                    </p>
                  </div>
                </div>

                {/* Detailed Information (if verbose_json) */}
                {transcription.transcription.duration && (
                  <div className="grid grid-cols-1 md:grid-cols-2 gap-4">
                    <div className="bg-white rounded-lg p-3 text-center">
                      <p className="text-sm text-gray-600">Duration</p>
                      <p className="font-semibold text-gray-900">
                        {transcription.transcription.duration.toFixed(1)}s
                      </p>
                    </div>
                    <div className="bg-white rounded-lg p-3 text-center">
                      <p className="text-sm text-gray-600">Language</p>
                      <p className="font-semibold text-gray-900">
                        {transcription.transcription.language || 'Auto-detected'}
                      </p>
                    </div>
                  </div>
                )}
              </div>
            </div>
          ) : !isTranscribing && !error && (
            // Welcome State
            <div className="text-center py-12">
              <div className="w-16 h-16 bg-blue-100 rounded-2xl flex items-center justify-center mx-auto mb-4">
                <Mic className="w-8 h-8 text-blue-600" />
              </div>
              <h3 className="text-lg font-semibold text-gray-700 mb-2">
                Ready to Transcribe!
              </h3>
              <p className="text-gray-600 max-w-md mx-auto">
                Upload an audio file or record your voice, then click "Transcribe" to convert speech to text with AI.
              </p>
            </div>
          )}
        </div>
      </div>
    </div>
  );
}

export default AudioTranscription;

Update your src/App.jsx to include the new audio transcription component:

import { useState } from "react";
import StreamingChat from "./StreamingChat";
import ImageGenerator from "./ImageGenerator";
import AudioTranscription from "./AudioTranscription";
import { MessageSquare, Image, Mic } from "lucide-react";

function App() {
  // 🧠 STATE: Navigation management
  const [currentView, setCurrentView] = useState("chat"); // 'chat', 'images', or 'audio'

  // 🎨 UI: Main app with navigation
  return (
    <div className="min-h-screen bg-gray-100">
      {/* Navigation Header */}
      <nav className="bg-white shadow-sm border-b border-gray-200">
        <div className="max-w-6xl mx-auto px-4">
          <div className="flex items-center justify-between h-16">
            {/* Logo */}
            <div className="flex items-center space-x-3">
              <div className="w-8 h-8 bg-gradient-to-r from-blue-500 to-purple-600 rounded-lg flex items-center justify-center">
                <span className="text-white font-bold text-sm">AI</span>
              </div>
              <h1 className="text-xl font-bold text-gray-900">OpenAI Mastery</h1>
            </div>
            {/* Navigation Buttons */}
            <div className="flex space-x-2">
              <button
                onClick={() => setCurrentView("chat")}
                className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
                  currentView === "chat"
                    ? "bg-blue-100 text-blue-700 shadow-sm"
                    : "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
                }`}
              >
                <MessageSquare className="w-4 h-4" />
                <span>Chat</span>
              </button>
              <button
                onClick={() => setCurrentView("images")}
                className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
                  currentView === "images"
                    ? "bg-purple-100 text-purple-700 shadow-sm"
                    : "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
                }`}
              >
                <Image className="w-4 h-4" />
                <span>Images</span>
              </button>
              <button
                onClick={() => setCurrentView("audio")}
                className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
                  currentView === "audio"
                    ? "bg-blue-100 text-blue-700 shadow-sm"
                    : "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
                }`}
              >
                <Mic className="w-4 h-4" />
                <span>Audio</span>
              </button>
            </div>
          </div>
        </div>
      </nav>

      {/* Main Content */}
      <main className="h-[calc(100vh-4rem)]">
        {currentView === "chat" && <StreamingChat />}
        {currentView === "images" && <ImageGenerator />}
        {currentView === "audio" && <AudioTranscription />}
      </main>
    </div>
  );
}

export default App;

Let’s test your audio transcription feature step by step to make sure everything works correctly.

First, verify your backend route works by testing it directly:

Test with a small audio file:

# Create a test audio file or use an existing one
curl -X POST http://localhost:8000/api/audio/transcribe \
  -F "audio=@test_audio.mp3" \
  -F "response_format=text"

Expected response:

{
  "success": true,
  "transcription": {
    "text": "This is a test of the audio transcription feature..."
  },
  "metadata": {
    "filename": "test_audio.mp3",
    "size": 45612,
    "model": "whisper-1",
    "timestamp": "2024-01-15T10:30:00.000Z"
  }
}
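
For comparison, sending response_format=verbose_json returns the extra detail your endpoint exposes. The values below are illustrative, but the shape follows the verbose_json branch of the route you added in Step 2C:

```json
{
  "success": true,
  "transcription": {
    "text": "This is a test of the audio transcription feature...",
    "language": "english",
    "duration": 4.2,
    "segments": [
      { "start": 0.0, "end": 4.2, "text": "This is a test of the audio transcription feature..." }
    ]
  },
  "metadata": {
    "filename": "test_audio.mp3",
    "size": 45612,
    "model": "whisper-1",
    "timestamp": "2024-01-15T10:30:00.000Z"
  }
}
```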

Start both servers:

Backend (in your backend folder):

npm run dev

Frontend (in your frontend folder):

npm run dev

Test the complete flow:

  1. Navigate to Audio → Click the “Audio” tab in navigation
  2. Test file upload → Upload an MP3 or WAV file
  3. Test recording → Click microphone to record voice
  4. Test transcription → Click “Transcribe” and see loading state
  5. View results → See transcribed text with metadata
  6. Test download → Download transcription as text file
  7. Test settings → Try different languages and detail levels

Test browser microphone access:

  1. Click record button → Browser should ask for microphone permission
  2. Allow access → Recording should start with red pulsing button
  3. Record voice → Speak clearly for 5-10 seconds
  4. Stop recording → Click button again to stop
  5. Transcribe → Process the recorded audio

Expected behavior:

  • Smooth recording start/stop
  • Clear audio playback preview
  • Accurate transcription of recorded speech

Test error scenarios:

❌ No audio file: Click transcribe without uploading/recording
❌ Wrong file type: Upload a PDF or image file
❌ Large file: Upload audio file larger than 25MB
❌ Microphone denied: Deny microphone permissions

Expected behavior:

  • Clear error messages displayed
  • No application crashes
  • User can try again with different input

Congratulations! You’ve extended your existing application with complete AI audio transcription:

  • Extended your backend with audio file upload and processing
  • Added React audio component following the same patterns as chat and images
  • Implemented voice recording with browser microphone access
  • Created audio file upload with drag-and-drop interface
  • Added transcription settings for language and detail level
  • Included download functionality for transcribed text
  • Maintained consistent design with your existing application

Your application now has:

  • Text chat with streaming responses
  • Image generation with DALL-E 3 and GPT-Image-1
  • Audio transcription with Whisper voice recognition
  • Unified navigation between all features
  • Professional UI with consistent TailwindCSS styling

Next up: You’ll learn about text-to-speech synthesis, where you can convert text back into natural-sounding speech using OpenAI’s voice models.

Your OpenAI mastery application is becoming incredibly versatile! 🎤