Utilizing Whisper and ChatGPT to Add Subtitles to Japanese-Language Videos

Objective

The primary aim of this project was to enhance the accessibility of Japanese-language instructional videos for non-Japanese speakers. To achieve this, I sought to develop a method for adding English subtitles to video files containing Japanese audio.

Background

In a previous article, I introduced a Python script capable of generating either Japanese or English subtitles for video files. This script utilized Whisper's speech recognition capabilities. However, when employing Whisper for translation, I encountered issues with omissions and quality inconsistencies.

Methodology

To address these limitations, I have augmented my approach by incorporating OpenAI's API for translation. The updated method involves two primary steps (a minimal sketch follows the list):

  1. Utilizing Whisper for transcription of the source language (Japanese)

  2. Employing OpenAI's API to translate the transcribed text into the target language (English)
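
The sketch below illustrates this two-step flow. It is a simplified, hypothetical example rather than the full script presented later: it assumes the openai-whisper and openai packages are installed, that OPENAI_API_KEY is set in the environment, and that input.mp4 stands in for your video file.

import whisper
from openai import OpenAI

# Step 1: transcribe the Japanese audio with Whisper
model = whisper.load_model("large-v3")
result = model.transcribe("input.mp4", task="transcribe", language="Japanese")

# Step 2: translate each transcribed segment into English with the OpenAI API
client = OpenAI()  # reads OPENAI_API_KEY from the environment
for segment in result["segments"]:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Translate the following Japanese text into English."},
            {"role": "user", "content": segment["text"]},
        ],
        temperature=0.3,
    )
    print(segment["start"], segment["end"], response.choices[0].message.content)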

Script Overview

The Python script I have developed serves as a comprehensive tool for adding subtitles to video files. It automates the entire process, from subtitle generation to video integration, by combining speech recognition and translation functionalities. The script offers three main features:

  1. Subtitle Generation via Speech Recognition

  2. Subtitle Integration Using Existing SRT Files

  3. Translation of Subtitle Files

Key Functionalities

1. Subtitle Generation (SRTGenerator Class)

  • Employs the Whisper AI model for speech recognition and subtitle creation

  • Splits the transcription into token-limited chunks and saves the generated subtitles in SRT format

  • Offers the option to translate subtitles into English using Whisper's translation feature

2. Subtitle Integration (SubtitleAdder Class)

  • Imports existing SRT files and incorporates subtitles into videos

  • Allows customization of subtitle attributes such as font size, color, and positioning

3. Subtitle Translation (SRTTranslator Class)

  • Utilizes the OpenAI API to translate existing SRT files into specified target languages

  • Enables adjustment of the translation's temperature parameter to control output diversity

Technical Specifications

  • Utilizes the moviepy library for video editing and subtitle integration

  • Implements the whisper library to achieve high-accuracy speech recognition

  • Incorporates OpenAI API for automated translation capabilities

  • Features a user-friendly Command Line Interface (CLI) for ease of operation

Environment Setup

To utilize this script, users must first install the Docker engine. Detailed installation instructions can be found in the official Docker documentation. It is advisable to remove any outdated Docker installations prior to proceeding with the setup.

for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do sudo apt remove $pkg; done

Installing Docker Engine

sudo apt update
sudo apt install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo docker run hello-world
docker -v
wsl --shutdown

Configuring Docker for Non-Root Usage

sudo groupadd docker
sudo usermod -aG docker rui
wsl --shutdown

Starting Docker Automatically

To configure Docker to start automatically on system boot:

sudo visudo

Add the following line at the end (the %docker prefix grants passwordless "service docker start" to members of the docker group):

%docker ALL=(ALL) NOPASSWD: /usr/sbin/service docker start
sudo nano ~/.bashrc

Add the following lines at the end so that Docker starts automatically whenever a shell is opened:

if [[ $(service docker status | awk '{print $4}') = "not" ]]; then
  sudo service docker start > /dev/null
fi
source ~/.bashrc

NVIDIA Docker Installation

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Creating a Docker container

On Ubuntu (WSL), run the following from your home folder. The base image is the CUDA container from the NVIDIA NGC catalog (CUDA | NVIDIA NGC):

docker pull nvcr.io/nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
docker run -it --gpus all nvcr.io/nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
apt update && apt full-upgrade -y
apt install git wget nano ffmpeg fonts-dejavu fonts-noto-cjk -y  # fonts-dejavu / fonts-noto-cjk provide the fonts the script expects by default

Miniconda Installation

cd ~
mkdir tmp
cd tmp

Copy the Linux 64-bit installer link from https://docs.anaconda.com/miniconda/#miniconda-latest-installer-links and download it:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

# Answer yes (license), press Enter (install location), then yes (conda init)
# Remove the temporary folder:
cd ..
rm -rf tmp

exit
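# Back on the Ubuntu (WSL) shell: find the container ID, then restart and re-attach to the container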

docker container ls -a

docker start <container id>

docker exec -it <container id> /bin/bash

Building the conda environment

mkdir subtitle_generator
cd subtitle_generator
nano subtitle-generator.yml
name: subtitle-generator
channels:
  - conda-forge
  - pytorch
  - nvidia
  - defaults
dependencies:
  - python=3.9
  - pip
  - cudatoolkit=11.8
  - tiktoken
  - pillow
  - tqdm
  - srt
  - moviepy
  - python-dotenv
  - pip:
    - openai-whisper
    - openai
    - torch
    - torchvision
    - torchaudio
conda env create -f subtitle-generator.yml
conda activate subtitle-generator

Script

nano subtitle_generator.py
import os
import logging
from typing import List, Dict, Any
import torch
from moviepy.editor import VideoFileClip, CompositeVideoClip, ColorClip, ImageClip
import whisper
import srt
from datetime import timedelta
from PIL import Image, ImageDraw, ImageFont
import numpy as np
from tqdm import tqdm
import tiktoken
import textwrap
import argparse
from openai import OpenAI
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

class Config:
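    """Central configuration: file paths, subtitle/encoding defaults, token limits, and OpenAI settings (most can be overridden via environment variables)."""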
    # Paths
    FONT_PATH = os.getenv('FONT_PATH', "/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf")
    JAPANESE_FONT_PATH = os.getenv('JAPANESE_FONT_PATH', "/usr/share/fonts/opentype/noto/NotoSansCJK-Bold.ttc")
    TEMP_AUDIO_FILE = os.getenv('TEMP_AUDIO_FILE', "temp_audio.wav")

    # Video processing
    DEFAULT_SUBTITLE_HEIGHT = int(os.getenv('DEFAULT_SUBTITLE_HEIGHT', 200))
    DEFAULT_FONT_SIZE = int(os.getenv('DEFAULT_FONT_SIZE', 32))
    MAX_SUBTITLE_LINES = int(os.getenv('MAX_SUBTITLE_LINES', 3))

    # Video encoding
    VIDEO_CODEC = os.getenv('VIDEO_CODEC', 'libx264')
    AUDIO_CODEC = os.getenv('AUDIO_CODEC', 'aac')
    VIDEO_PRESET = os.getenv('VIDEO_PRESET', 'medium')
    CRF = os.getenv('CRF', '23')
    PIXEL_FORMAT = os.getenv('PIXEL_FORMAT', 'yuv420p')

    # Tiktoken related settings
    TIKTOKEN_MODEL = "cl100k_base"
    MAX_TOKENS_PER_CHUNK = 4000

    # OpenAI settings
    OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
    DEFAULT_GPT_MODEL = "gpt-4o"
    GPT_MAX_TOKENS = 4000

# Logging setup
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

class SubtitleProcessor:
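    """Base class that stores the video and SRT paths and cleans up any temporary files it creates."""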
    def __init__(self, video_path: str, srt_path: str):
        self.video_path = video_path
        self.srt_path = srt_path
        self.temp_files = []

    def cleanup_temp_files(self):
        logger.info("Cleaning up temporary files...")
        for file_path in self.temp_files:
            try:
                if os.path.exists(file_path):
                    os.remove(file_path)
                    logger.info(f"Removed temporary file: {file_path}")
            except Exception as e:
                logger.error(f"Error removing {file_path}: {e}")

class SRTTranslator:
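    """Translates an existing SRT file subtitle by subtitle using the OpenAI Chat Completions API."""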
    def __init__(self, model: str = Config.DEFAULT_GPT_MODEL, temperature: float = 0.3):
        api_key = Config.OPENAI_API_KEY
        if not api_key:
            raise ValueError("OpenAI API key is required. Set it in the environment variable 'OPENAI_API_KEY'.")
        self.client = OpenAI(api_key=api_key)
        self.model = model
        self.temperature = temperature

    def translate_srt(self, input_srt: str, output_srt: str, source_lang: str, target_lang: str):
        with open(input_srt, 'r', encoding='utf-8') as f:
            subtitle_generator = srt.parse(f.read())
            subtitles = list(subtitle_generator)

        translated_subtitles = []
        for subtitle in tqdm(subtitles, desc="Translating subtitles"):
            translated_content = self.translate_text(subtitle.content, source_lang, target_lang)
            translated_subtitle = srt.Subtitle(
                index=subtitle.index,
                start=subtitle.start,
                end=subtitle.end,
                content=translated_content
            )
            translated_subtitles.append(translated_subtitle)

        with open(output_srt, 'w', encoding='utf-8') as f:
            f.write(srt.compose(translated_subtitles))

    def translate_text(self, text: str, source_lang: str, target_lang: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": f"You are a professional translator. Translate the following text from {source_lang} to {target_lang}. Maintain the original meaning and nuance as much as possible. Do not modify any formatting or line breaks."},
                {"role": "user", "content": text}
            ],
            temperature=self.temperature,
            max_tokens=Config.GPT_MAX_TOKENS
        )
        return response.choices[0].message.content.strip()

class SRTGenerator(SubtitleProcessor):
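    """Extracts the audio track, transcribes (or translates) it with Whisper, and writes the result as an SRT file."""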
    def __init__(self, video_path: str, output_srt: str, model_name: str, language: str = "japanese", translate: bool = False, api_key: str = None):
        super().__init__(video_path, output_srt)
        self.model_name = model_name
        self.translate = translate
        self.tokenizer = tiktoken.get_encoding(Config.TIKTOKEN_MODEL)
        self.language = language
        self.api_key = api_key

    def run(self):
        try:
            self.extract_audio()
            transcription = self.transcribe_audio()
            chunks = self.split_into_chunks(transcription)
            results = self.process_chunks(chunks)
            self.create_srt(results)

            logger.info(f"SRT file has been generated: {self.srt_path}")
        finally:
            self.cleanup_temp_files()

    def extract_audio(self):
        logger.info("Extracting audio from video...")
        video = VideoFileClip(self.video_path)
        video.audio.write_audiofile(Config.TEMP_AUDIO_FILE)
        self.temp_files.append(Config.TEMP_AUDIO_FILE)

    def transcribe_audio(self) -> Dict[str, Any]:
        logger.info("Transcribing audio with Whisper...")
        device = "cuda" if torch.cuda.is_available() else "cpu"
        logger.info(f"Using device: {device}")
        logger.info(f"Loading Whisper model: {self.model_name}")
        model = whisper.load_model(self.model_name).to(device)

        logger.info(f"Performing task: transcribe with language: {self.language}")
        result = model.transcribe(Config.TEMP_AUDIO_FILE, task="transcribe", language=self.language)
        return result

    def split_into_chunks(self, transcription: Dict[str, Any]) -> List[Dict[str, Any]]:
        logger.info("Splitting transcription into chunks...")
        chunks = []
        current_chunk = {"text": "", "segments": []}
        current_tokens = 0

        for segment in transcription['segments']:
            segment_tokens = self.tokenizer.encode(segment['text'])
            if current_tokens + len(segment_tokens) > Config.MAX_TOKENS_PER_CHUNK:
                chunks.append(current_chunk)
                current_chunk = {"text": "", "segments": []}
                current_tokens = 0
            
            current_chunk['text'] += segment['text'] + " "
            current_chunk['segments'].append(segment)
            current_tokens += len(segment_tokens)

        if current_chunk['segments']:
            chunks.append(current_chunk)

        return chunks

    def process_chunks(self, chunks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        logger.info("Processing chunks...")
        results = []
        for chunk in tqdm(chunks, desc="Processing chunks"):
            results.extend(chunk['segments'])
        return results

    def create_srt(self, results: List[Dict[str, Any]]):
        logger.info("Creating SRT file...")
        subs = []
        for i, segment in enumerate(results, start=1):
            start = timedelta(seconds=segment['start'])
            end = timedelta(seconds=segment['end'])
            text = segment['text']
            sub = srt.Subtitle(index=i, start=start, end=end, content=text)
            subs.append(sub)
        
        with open(self.srt_path, 'w', encoding='utf-8') as f:
            f.write(srt.compose(subs))

class SubtitleAdder(SubtitleProcessor):
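    """Renders subtitles from an SRT file onto a black band added below the original video frame."""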
    def __init__(self, video_path: str, output_video: str, input_srt: str, subtitle_height: int = Config.DEFAULT_SUBTITLE_HEIGHT):
        super().__init__(video_path, input_srt)
        self.output_video = output_video
        self.subtitle_height = subtitle_height

    def run(self):
        try:
            subs = self.load_srt(self.srt_path)
            self.add_subtitles_to_video(subs)
            logger.info(f"Video with subtitles has been generated: {self.output_video}")
        finally:
            self.cleanup_temp_files()

    @staticmethod
    def load_srt(srt_path: str) -> List[srt.Subtitle]:
        logger.info(f"Loading SRT file: {srt_path}")
        with open(srt_path, 'r', encoding='utf-8') as f:
            return list(srt.parse(f.read()))

    def add_subtitles_to_video(self, subs: List[srt.Subtitle]):
        logger.info(f"Adding subtitles to video with subtitle space height of {self.subtitle_height} pixels...")
        video = VideoFileClip(self.video_path)
        
        original_width, original_height = video.w, video.h
        new_height = original_height + self.subtitle_height
        
        background = ColorClip(size=(original_width, new_height), color=(0,0,0), duration=video.duration)
        video_clip = video.set_position((0, 0))
        
        subtitle_clips = [
            self.create_subtitle_clip(sub.content, original_width)
                .set_start(sub.start.total_seconds())
                .set_end(sub.end.total_seconds())
                .set_position((0, original_height))
            for sub in subs
        ]
        
        final_video = CompositeVideoClip([background, video_clip] + subtitle_clips, size=(original_width, new_height))
        final_video = final_video.set_duration(video.duration)
        
        final_video.write_videofile(
            self.output_video, 
            codec=Config.VIDEO_CODEC, 
            audio_codec=Config.AUDIO_CODEC,
            preset=Config.VIDEO_PRESET,
            ffmpeg_params=['-crf', Config.CRF, '-pix_fmt', Config.PIXEL_FORMAT],
            verbose=False,
            logger=None
        )

    @staticmethod
    def create_subtitle_clip(txt: str, video_width: int, font_size: int = Config.DEFAULT_FONT_SIZE, max_lines: int = Config.MAX_SUBTITLE_LINES) -> ImageClip:
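        # Use the Japanese (CJK) font when the text contains any non-ASCII characters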
        if any(ord(char) > 127 for char in txt):
            font_path = Config.JAPANESE_FONT_PATH
        else:
            font_path = Config.FONT_PATH

        try:
            font = ImageFont.truetype(font_path, font_size)
        except IOError:
            logger.warning(f"Failed to load font from {font_path}. Falling back to default font.")
            font = ImageFont.load_default()
        
        max_char_count = int(video_width / (font_size * 0.6))
        wrapped_text = textwrap.fill(txt, width=max_char_count)
        lines = wrapped_text.split('\n')[:max_lines]
        
        dummy_img = Image.new('RGB', (video_width, font_size * len(lines)))
        dummy_draw = ImageDraw.Draw(dummy_img)
        max_line_width = max(dummy_draw.textbbox((0, 0), line, font=font)[2] for line in lines)
        total_height = sum(dummy_draw.textbbox((0, 0), line, font=font)[3] for line in lines)
        
        img_width, img_height = video_width, total_height + 20
        img = Image.new('RGBA', (img_width, img_height), (0, 0, 0, 0))
        draw = ImageDraw.Draw(img)
        
        y_text = 10
        for line in lines:
            bbox = draw.textbbox((0, 0), line, font=font)
            x_text = (img_width - bbox[2]) // 2
            
            for adj in range(-2, 3):
                for adj2 in range(-2, 3):
                    draw.text((x_text+adj, y_text+adj2), line, font=font, fill=(0, 0, 0, 255))
            
            draw.text((x_text, y_text), line, font=font, fill=(255, 255, 255, 255))
            y_text += bbox[3]
        
        return ImageClip(np.array(img))

def main():
    parser = argparse.ArgumentParser(description="Subtitle Generator and Adder", formatter_class=argparse.RawTextHelpFormatter)
    subparsers = parser.add_subparsers(dest="action", required=True)

    # Common arguments for generate and add commands
    common_parser = argparse.ArgumentParser(add_help=False)
    common_parser.add_argument("--input", required=True, help="Input video file path")

    # Generate subparser for creating subtitles
    generate_parser = subparsers.add_parser("generate", parents=[common_parser])
    generate_parser.add_argument("--output_srt", required=True, help="Output SRT file path")
    generate_parser.add_argument("--model", default="large-v3", help="Whisper model name (default: large-v3)")
    generate_parser.add_argument("--language", default="Japanese", help="Language of the audio (default: Japanese)")
    generate_parser.add_argument("--translate", action="store_true", help="Translate the audio to English")

    # Add subparser for adding subtitles to a video
    add_parser = subparsers.add_parser("add", parents=[common_parser])
    add_parser.add_argument("--output_video", required=True, help="Output video file path")
    add_parser.add_argument("--input_srt", required=True, help="Input SRT file path")

    # Translate subparser for translating an SRT file
    translate_parser = subparsers.add_parser("translate")
    translate_parser.add_argument("--input_srt", required=True, help="Input SRT file path")
    translate_parser.add_argument("--output_srt", required=True, help="Output SRT file path")
    translate_parser.add_argument("--source_lang", default="Japanese", help="Source language of the SRT file (default: Japanese)")
    translate_parser.add_argument("--target_lang", default="English", help="Target language for translation (default: English)")
    translate_parser.add_argument("--temperature", type=float, default=0.3, help="Temperature setting for OpenAI (default: 0.3)")

    args = parser.parse_args()

    if args.action == "generate":
        generator = SRTGenerator(args.input, args.output_srt, args.model, args.language, args.translate)
        generator.run()
    elif args.action == "add":
        adder = SubtitleAdder(args.input, args.output_video, args.input_srt)
        adder.run()
    elif args.action == "translate":
        translator = SRTTranslator(temperature=args.temperature)
        translator.translate_srt(args.input_srt, args.output_srt, args.source_lang, args.target_lang)
        logger.info(f"Translation completed. Output saved to {args.output_srt}")

if __name__ == "__main__":
    main()

.env

Create a .env file with your OpenAI API key:

# .env
OPENAI_API_KEY=sk-your_key

Usage

python subtitle_generator.py --help
usage: subtitle_generator.py [-h] {generate,add,translate} ...

Subtitle Generator and Adder

positional arguments:
  {generate,add,translate}

optional arguments:
  -h, --help            show this help message and exit
  


python subtitle_generator.py add --help
usage: subtitle_generator.py add [-h] --input INPUT --output_video OUTPUT_VIDEO --input_srt INPUT_SRT

optional arguments:
  -h, --help            show this help message and exit
  --input INPUT         Input video file path
  --output_video OUTPUT_VIDEO
                        Output video file path
  --input_srt INPUT_SRT
                        Input SRT file path



python subtitle_generator.py generate --help
usage: subtitle_generator.py generate [-h] --input INPUT --output_srt OUTPUT_SRT [--model MODEL] [--language LANGUAGE] [--translate]

optional arguments:
  -h, --help            show this help message and exit
  --input INPUT         Input video file path
  --output_srt OUTPUT_SRT
                        Output SRT file path
  --model MODEL         Whisper model name (default: large-v3)
  --language LANGUAGE   Language of the audio (default: Japanese)
  --translate           Translate the audio to English




python subtitle_generator.py translate --help
usage: subtitle_generator.py translate [-h] --input_srt INPUT_SRT --output_srt OUTPUT_SRT [--source_lang SOURCE_LANG]
                                     [--target_lang TARGET_LANG] [--temperature TEMPERATURE]

optional arguments:
  -h, --help            show this help message and exit
  --input_srt INPUT_SRT
                        Input SRT file path
  --output_srt OUTPUT_SRT
                        Output SRT file path
  --source_lang SOURCE_LANG
                        Source language of the SRT file (default: Japanese)
  --target_lang TARGET_LANG
                        Target language for translation (default: English)
  --temperature TEMPERATURE
                        Temperature setting for OpenAI (default: 0.3)

Copying the File from Windows to the Container

# On the Ubuntu command line
# Find the container name with: docker container ls -a
docker cp "/mnt/c/Windows_path/VVVV.mp4" docker_container_name:root/speech_to_text

Usage Examples

1. Generation of Japanese SRT Files
The Whisper model is utilized to generate a Japanese SRT file directly from the audio content of the video.
python subtitle_generator.py generate --input test.mp4 --output_srt japa.srt --model large-v3

2. Integration of Japanese Subtitles into Video
The generated Japanese SRT file is used to incorporate Japanese subtitles into the video.
python subtitle_generator.py add --input test.mp4 --output_video japa.mp4 --input_srt japa.srt

3. Translation and Generation of English SRT Files
The Whisper model is employed to translate the video's Japanese audio content into English and generate an English SRT file.
python subtitle_generator.py generate --input test.mp4 --output_srt eng.srt --model large-v3 --translate

4. Integration of English Subtitles into Video
The generated English SRT file is utilized to incorporate English subtitles into the video.
python subtitle_generator.py add --input test.mp4 --output_video eng.mp4 --input_srt eng.srt

5. Translation of Japanese SRT Files to English
Existing Japanese SRT files are translated into English using the OpenAI API to produce English SRT files. The --temperature option allows for adjustment of the variability in the API's output.
python subtitle_generator.py translate --input_srt japa.srt --output_srt eng_translated.srt --temperature 0.7

Explanation:
The generate command employs the Whisper model for speech recognition and SRT file creation. Adding the --translate option enables automatic translation from Japanese to English using Whisper.
The add command integrates the generated SRT file as subtitles into the video.
The translate command utilizes the OpenAI API to translate existing SRT files.

Copying Files from Container to Windows

docker cp docker_container_name:root/path/xxx.mp4 "/mnt/c/Windows path/"

Summary

This Python script combines Whisper and the OpenAI API to generate accurate subtitles, translate them, and add them to videos automatically. It is a practical time-saver for educational, research, and content production work.

I encourage you to experiment with this tool if it piques your interest!

I hope it lets you devote more time to your educational and research activities, and makes that work more productive and satisfying.

The advancements in AI technology are truly remarkable. I strongly encourage you to leverage such AI technologies to create more efficient work environments for yourselves!

Should you have any questions or suggestions for improvement, please don't hesitate to share them in the comments section. Your feedback is invaluable to me.
