Adding Subtitles to Japanese Videos Using Whisper

The original Japanese version of this article is here:

日本語を喋っている動画にWhisperを使って字幕を付ける|だっち (note.com)

Goal

I wanted to find a way to effectively provide training to people who don't understand Japanese. So, I decided to add English subtitles to video files of lectures given in Japanese.

In this post, I'll introduce a Python script that can add either Japanese or English subtitles to video files.

This code uses Whisper for speech recognition.
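
For reference, this is the core Whisper call in isolation (a minimal sketch: the model name and file name are placeholders, and Whisper relies on ffmpeg to read the audio):

import whisper

model = whisper.load_model("small")  # any Whisper model name works here
result = model.transcribe("lecture.mp4", language="japanese")
print(result["text"])

The full script below wraps this call with chunk management, SRT generation, and video rendering.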

Script Overview

This Python script is a comprehensive tool for adding subtitles to video files. It provides two main functions:

  1. Subtitle generation using speech recognition

  2. Adding subtitles to videos using existing SRT files

Key Features

1. Subtitle Generation (SRTGenerator class)

  • Uses the Whisper AI model for speech recognition and subtitle generation

  • Processes long audio by splitting it into appropriate chunks

  • Saves generated subtitles in SRT format

  • Optional feature to translate audio into English

2. Subtitle Addition (SubtitleAdder class)

  • Loads subtitles from existing SRT files

  • Adds subtitles to videos and generates new video files

  • Allows customization of subtitle font size, color, position, etc. (see the example below)
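
In the script below, the font paths, font size, and subtitle band height are read from environment variables via a Config class, so they can be overridden per run without editing the code (colors and exact positioning are set inside the create_subtitle_clip method). For example, using the variable names defined in the script's Config class:

DEFAULT_FONT_SIZE=40 DEFAULT_SUBTITLE_HEIGHT=240 python subtitle-generator.py add --input xxx.mp4 --output_video out.mp4 --input_srt japa.srt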

Special Features

  • Uses moviepy for video editing

  • Achieves high-accuracy speech recognition using the whisper library

  • Manages token count efficiently using tiktoken (see the sketch after this list)

  • Supports multiple languages (including Japanese font support)

  • Provides a Command Line Interface (CLI) for ease of use
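
The tiktoken usage boils down to counting tokens so that long transcriptions can be split into bounded chunks. In isolation (a minimal sketch using the same cl100k_base encoding as the script):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("こんにちは、今日は字幕の話をします。")
print(len(tokens))  # this count is what the script compares against its per-chunk token budget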

Setting Up the Environment

Install Docker Engine

https://docs.docker.com/engine/install/ubuntu/

Remove any old versions of Docker that might be installed:

for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do sudo apt remove $pkg; done

Install Docker Engine and verify the installation:

sudo apt update
sudo apt install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo docker run hello-world
docker -v
# Restart WSL (run from PowerShell or Command Prompt on Windows)
wsl --shutdown

Making Docker Commands Usable Without Sudo

sudo groupadd docker
sudo usermod -aG docker $USER  # add your own user to the docker group
wsl --shutdown

Setting Docker to Start Automatically

It can be annoying to start Docker manually after every system restart. To make Docker start automatically:

sudo visudo

Add the following to the end of the file:

%docker ALL=(ALL) NOPASSWD: /usr/sbin/service docker start

Then open your shell startup file:

nano ~/.bashrc

Add the following to the end of the file:

if [[ $(service docker status | awk '{print $4}') = "not" ]]; then
  sudo service docker start > /dev/null
fi

Then reload it:

source ~/.bashrc

Installing NVIDIA Docker

sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Creating a Docker Container

Run the following in your home folder on Ubuntu on WSL (Windows Subsystem for Linux). The base image is the CUDA image from the NVIDIA NGC catalog (CUDA | NVIDIA NGC):

docker pull nvcr.io/nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
docker run -it --gpus all nvcr.io/nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
apt update && apt full-upgrade -y
apt install git wget nano ffmpeg -y

Installing Miniconda

cd ~
mkdir tmp
cd tmp

Copy the link for the Linux 64-bit installer from the Miniconda download page:
https://docs.anaconda.com/miniconda/#miniconda-latest-installer-links

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

# Accept the license ("yes"), confirm the install location (Enter), and let the installer initialize conda ("yes")
# Remove the tmp folder
cd ..
rm -rf tmp

Exit and re-enter the container so the conda setup added to your shell takes effect:

exit

# Find the container ID
docker container ls -a

docker start <container id>

docker exec -it <container id> /bin/bash

Setting up the Conda Environment

mkdir subtitle_generator
cd subtitle_generator
nano subtitle-generator.yml

Paste the following into the file:
name: subtitle-generator
channels:
  - conda-forge
  - pytorch
  - nvidia
  - defaults
dependencies:
  - python=3.9
  - pip
  - cudatoolkit=11.8
  - tiktoken
  - pillow
  - tqdm
  - srt
  - moviepy
  - pip:
    - openai-whisper
    - torch
    - torchvision
    - torchaudio
Create and activate the environment:

conda env create -f subtitle-generator.yml
conda activate subtitle-generator
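
At this point it's worth a quick sanity check that PyTorch can see the GPU and that Whisper imports cleanly (the tiny model is used here only because it downloads fast):

python -c "import torch; print(torch.cuda.is_available())"
python -c "import whisper; whisper.load_model('tiny')"

If the first command prints False, the container was likely started without --gpus all; transcription will still run on the CPU, just much more slowly.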

Python script

nano subtitle-generator.py

Paste the following script into the file:

import os
import logging
from typing import List, Dict, Any
import torch
from moviepy.editor import VideoFileClip, ImageClip, CompositeVideoClip, ColorClip
import whisper
import srt
from datetime import timedelta
from PIL import Image, ImageDraw, ImageFont
import numpy as np
from tqdm import tqdm
import tiktoken
import textwrap
import argparse

class Config:
    # Paths
    FONT_PATH = os.getenv('FONT_PATH', "/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf")
    JAPANESE_FONT_PATH = os.getenv('JAPANESE_FONT_PATH', "/usr/share/fonts/opentype/noto/NotoSansCJK-Bold.ttc")
    TEMP_AUDIO_FILE = os.getenv('TEMP_AUDIO_FILE', "temp_audio.wav")

    # Video processing
    DEFAULT_SUBTITLE_HEIGHT = int(os.getenv('DEFAULT_SUBTITLE_HEIGHT', 200))
    DEFAULT_FONT_SIZE = int(os.getenv('DEFAULT_FONT_SIZE', 32))
    MAX_SUBTITLE_LINES = int(os.getenv('MAX_SUBTITLE_LINES', 3))

    # Video encoding
    VIDEO_CODEC = os.getenv('VIDEO_CODEC', 'libx264')
    AUDIO_CODEC = os.getenv('AUDIO_CODEC', 'aac')
    VIDEO_PRESET = os.getenv('VIDEO_PRESET', 'medium')
    CRF = os.getenv('CRF', '23')
    PIXEL_FORMAT = os.getenv('PIXEL_FORMAT', 'yuv420p')

    # Tiktoken related settings
    TIKTOKEN_MODEL = "cl100k_base"
    MAX_TOKENS_PER_CHUNK = 4000

# Logging setup
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

class SubtitleProcessor:
    """Base class that holds the video/SRT paths and cleans up temporary files."""
    def __init__(self, video_path: str, srt_path: str):
        self.video_path = video_path
        self.srt_path = srt_path
        self.temp_files = []

    def cleanup_temp_files(self):
        logger.info("Cleaning up temporary files...")
        for file_path in self.temp_files:
            try:
                if os.path.exists(file_path):
                    os.remove(file_path)
                    logger.info(f"Removed temporary file: {file_path}")
            except Exception as e:
                logger.error(f"Error removing {file_path}: {e}")

class SRTGenerator(SubtitleProcessor):
    """Transcribes (or translates) a video's audio with Whisper and writes an SRT file."""
    def __init__(self, video_path: str, output_srt: str, model_name: str, language: str = "japanese", translate: bool = False):
        super().__init__(video_path, output_srt)
        self.model_name = model_name
        self.translate = translate
        self.tokenizer = tiktoken.get_encoding(Config.TIKTOKEN_MODEL)
        self.language = language

    def run(self):
        try:
            self.extract_audio()
            transcription = self.transcribe_audio()
            chunks = self.split_into_chunks(transcription)
            results = self.process_chunks(chunks)
            self.create_srt(results)
            logger.info(f"SRT file has been generated: {self.srt_path}")
        finally:
            self.cleanup_temp_files()

    def extract_audio(self):
        logger.info("Extracting audio from video...")
        video = VideoFileClip(self.video_path)
        video.audio.write_audiofile(Config.TEMP_AUDIO_FILE)
        self.temp_files.append(Config.TEMP_AUDIO_FILE)

    def transcribe_audio(self) -> Dict[str, Any]:
        logger.info("Transcribing audio with Whisper...")
        device = "cuda" if torch.cuda.is_available() else "cpu"
        logger.info(f"Using device: {device}")
        logger.info(f"Loading Whisper model: {self.model_name}")
        model = whisper.load_model(self.model_name).to(device)

        task = "translate" if self.translate else "transcribe"
        logger.info(f"Performing task: {task} with language: {self.language}")
        result = model.transcribe(Config.TEMP_AUDIO_FILE, task=task, language=self.language)
        return result

    def split_into_chunks(self, transcription: Dict[str, Any]) -> List[Dict[str, Any]]:
        logger.info("Splitting transcription into chunks...")
        chunks = []
        current_chunk = {"text": "", "segments": []}
        current_tokens = 0

        for segment in transcription['segments']:
            segment_tokens = self.tokenizer.encode(segment['text'])
            if current_tokens + len(segment_tokens) > Config.MAX_TOKENS_PER_CHUNK:
                chunks.append(current_chunk)
                current_chunk = {"text": "", "segments": []}
                current_tokens = 0
            
            current_chunk['text'] += segment['text'] + " "
            current_chunk['segments'].append(segment)
            current_tokens += len(segment_tokens)

        if current_chunk['segments']:
            chunks.append(current_chunk)

        return chunks

    def process_chunks(self, chunks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        logger.info("Processing chunks...")
        results = []
        for chunk in tqdm(chunks, desc="Processing chunks"):
            results.extend(chunk['segments'])
        return results

    def create_srt(self, results: List[Dict[str, Any]]):
        logger.info("Creating SRT file...")
        subs = []
        for i, segment in enumerate(results, start=1):
            start = timedelta(seconds=segment['start'])
            end = timedelta(seconds=segment['end'])
            text = segment['text']
            sub = srt.Subtitle(index=i, start=start, end=end, content=text)
            subs.append(sub)
        
        with open(self.srt_path, 'w', encoding='utf-8') as f:
            f.write(srt.compose(subs))

class SubtitleAdder(SubtitleProcessor):
    """Renders SRT subtitles onto a black band added below the original video frame."""
    def __init__(self, video_path: str, output_video: str, input_srt: str, subtitle_height: int = Config.DEFAULT_SUBTITLE_HEIGHT):
        super().__init__(video_path, input_srt)
        self.output_video = output_video
        self.subtitle_height = subtitle_height

    def run(self):
        try:
            subs = self.load_srt(self.srt_path)
            self.add_subtitles_to_video(subs)
            logger.info(f"Video with subtitles has been generated: {self.output_video}")
        finally:
            self.cleanup_temp_files()

    @staticmethod
    def load_srt(srt_path: str) -> List[srt.Subtitle]:
        logger.info(f"Loading SRT file: {srt_path}")
        with open(srt_path, 'r', encoding='utf-8') as f:
            return list(srt.parse(f.read()))

    def add_subtitles_to_video(self, subs: List[srt.Subtitle]):
        logger.info(f"Adding subtitles to video with subtitle space height of {self.subtitle_height} pixels...")
        video = VideoFileClip(self.video_path)
        
        original_width, original_height = video.w, video.h
        new_height = original_height + self.subtitle_height
        
        background = ColorClip(size=(original_width, new_height), color=(0,0,0), duration=video.duration)
        video_clip = video.set_position((0, 0))
        
        subtitle_clips = [
            self.create_subtitle_clip(sub.content, original_width)
                .set_start(sub.start.total_seconds())
                .set_end(sub.end.total_seconds())
                .set_position((0, original_height))
            for sub in subs
        ]
        
        final_video = CompositeVideoClip([background, video_clip] + subtitle_clips, size=(original_width, new_height))
        final_video = final_video.set_duration(video.duration)
        
        final_video.write_videofile(
            self.output_video, 
            codec=Config.VIDEO_CODEC, 
            audio_codec=Config.AUDIO_CODEC,
            preset=Config.VIDEO_PRESET,
            ffmpeg_params=['-crf', Config.CRF, '-pix_fmt', Config.PIXEL_FORMAT],
            verbose=False,
            logger=None
        )

    @staticmethod
    def create_subtitle_clip(txt: str, video_width: int, font_size: int = Config.DEFAULT_FONT_SIZE, max_lines: int = Config.MAX_SUBTITLE_LINES) -> ImageClip:
        # Use the CJK-capable font when the text contains non-ASCII characters
        if any(ord(char) > 127 for char in txt):
            font_path = Config.JAPANESE_FONT_PATH
        else:
            font_path = Config.FONT_PATH

        try:
            font = ImageFont.truetype(font_path, font_size)
        except IOError:
            logger.warning(f"Failed to load font from {font_path}. Falling back to default font.")
            font = ImageFont.load_default()
        
        max_char_count = int(video_width / (font_size * 0.6))
        wrapped_text = textwrap.fill(txt, width=max_char_count)
        lines = wrapped_text.split('\n')[:max_lines]
        
        dummy_img = Image.new('RGB', (video_width, font_size * len(lines)))
        dummy_draw = ImageDraw.Draw(dummy_img)
        max_line_width = max(dummy_draw.textbbox((0, 0), line, font=font)[2] for line in lines)
        total_height = sum(dummy_draw.textbbox((0, 0), line, font=font)[3] for line in lines)
        
        img_width, img_height = video_width, total_height + 20
        img = Image.new('RGBA', (img_width, img_height), (0, 0, 0, 0))
        draw = ImageDraw.Draw(img)
        
        y_text = 10
        for line in lines:
            bbox = draw.textbbox((0, 0), line, font=font)
            x_text = (img_width - bbox[2]) // 2
            
            # Draw a black outline by stamping the text at small pixel offsets
            for adj in range(-2, 3):
                for adj2 in range(-2, 3):
                    draw.text((x_text+adj, y_text+adj2), line, font=font, fill=(0, 0, 0, 255))
            
            draw.text((x_text, y_text), line, font=font, fill=(255, 255, 255, 255))
            y_text += bbox[3]
        
        return ImageClip(np.array(img))

def main():
    parser = argparse.ArgumentParser(description="Subtitle Generator and Adder", formatter_class=argparse.RawTextHelpFormatter)
    subparsers = parser.add_subparsers(dest="action", required=True)

    # Common arguments
    common_parser = argparse.ArgumentParser(add_help=False)
    common_parser.add_argument("--input", required=True, help="Input video file path")

    # Generate subparser
    generate_parser = subparsers.add_parser("generate", parents=[common_parser])
    generate_parser.add_argument("--output_srt", required=True, help="Output SRT file path")
    generate_parser.add_argument("--model", default="large-v3", help="""Whisper model name (default: large-v3)
    Available models:
    - tiny: Smallest and fastest, lowest accuracy (39M parameters)
    - base: Improved accuracy over tiny, still fast (74M parameters)
    - small: Good balance of accuracy and speed (244M parameters)
    - medium: Higher accuracy, slower than small (769M parameters)
    - large: Most accurate, slowest (1550M parameters)
    - large-v1: Original large model
    - large-v2: Updated large model with improved performance
    - large-v3: Latest large model with further improvements

    English-only models (faster for English content):
    - tiny.en, base.en, small.en, medium.en

    Note: Larger models are more accurate but require more resources and time.""")
    generate_parser.add_argument("--language", default="japanese", help="Language of the audio (default: japanese)")
    generate_parser.add_argument("--translate", action="store_true", help="Translate the audio to English")

    # Add subparser
    add_parser = subparsers.add_parser("add", parents=[common_parser])
    add_parser.add_argument("--output_video", required=True, help="Output video file path")
    add_parser.add_argument("--input_srt", required=True, help="Input SRT file path")

    args = parser.parse_args()

    if args.action == "generate":
        generator = SRTGenerator(args.input, args.output_srt, args.model, args.language, args.translate)
        generator.run()
    elif args.action == "add":
        adder = SubtitleAdder(args.input, args.output_video, args.input_srt)
        adder.run()

if __name__ == "__main__":
    main()

How to Use

Help

python subtitle-generator.py generate --help
usage: subtitle-generator.py generate [-h] --input INPUT --output_srt OUTPUT_SRT [--model MODEL] [--language LANGUAGE]
                             [--translate]

optional arguments:
  -h, --help            show this help message and exit
  --input INPUT         Input video file path
  --output_srt OUTPUT_SRT
                        Output SRT file path
  --model MODEL         Whisper model name (default: large-v3) Available models: - tiny: Smallest and fastest, lowest
                        accuracy (39M parameters) - base: Improved accuracy over tiny, still fast (74M parameters) -
                        small: Good balance of accuracy and speed (244M parameters) - medium: Higher accuracy, slower
                        than small (769M parameters) - large: Most accurate, slowest (1550M parameters) - large-v1:
                        Original large model - large-v2: Updated large model with improved performance - large-v3:
                        Latest large model with further improvements English-only models (faster for English content):
                        - tiny.en, base.en, small.en, medium.en Note: Larger models are more accurate but require more
                        resources and time.
  --language LANGUAGE   Language of the audio (default: japanese)
  --translate           Translate the audio to English
python subtitle-generator.py add --help
usage: subtitle-generator.py add [-h] --input INPUT --output_video OUTPUT_VIDEO --input_srt INPUT_SRT

optional arguments:
  -h, --help            show this help message and exit
  --input INPUT         Input video file path
  --output_video OUTPUT_VIDEO
                        Output video file path
  --input_srt INPUT_SRT
                        Input SRT file path

Copying Files from Windows to the Container

# on ubuntu command line
# Use 'docker container ls -a' to find the container name
docker cp "/mnt/c/Windows_path/VVVV.mp4" docker_container_name:root/speech_to_text

Usage Example

# Generate an SRT file in Japanese
python subtitle-generator.py generate --input xxx.mp4 --output_srt japa.srt --model large-v3

# Generate an SRT file translated into English
python subtitle-generator.py generate --input xxx.mp4 --output_srt eng.srt --model large-v3 --translate

# Add Japanese subtitles to the video
python subtitle-generator.py add --input xxx.mp4 --output_video japa.mp4 --input_srt japa.srt

# Add English subtitles to the video
python subtitle-generator.py add --input xxx.mp4 --output_video eng.mp4 --input_srt eng.srt

Copying Files from the Container to Windows

docker cp docker_container_name:/root/path/xxx.mp4 "/mnt/c/Windows_path/"

Summary

This Python script uses Whisper's speech recognition to automatically create high-quality subtitles from video files, making it possible to add English subtitles to Japanese videos with almost no manual work. You can also edit the AI-generated SRT files by hand if needed, as in the example below.
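
For example, if every subtitle appears slightly too early, the same srt library the script depends on can shift all timestamps in a few lines (a minimal sketch; the file names are placeholders):

import srt
from datetime import timedelta

with open("japa.srt", encoding="utf-8") as f:
    subs = list(srt.parse(f.read()))

# Shift every subtitle half a second later
for sub in subs:
    sub.start += timedelta(seconds=0.5)
    sub.end += timedelta(seconds=0.5)

with open("japa_shifted.srt", "w", encoding="utf-8") as f:
    f.write(srt.compose(subs))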

I hope this tool saves you time, allowing you to focus more on your educational and research activities.

The progress of AI is truly remarkable. I encourage you to use AI technologies like this to create a more efficient work environment for yourself!

If you have any questions or suggestions for improvement, please let me know in the comments.
