Adding Subtitles to Japanese Videos Using Whisper
The Japanese version of this post is available here:
日本語を喋っている動画にWhisperを使って字幕を付ける|だっち (note.com)
Goal
I wanted to find a way to effectively provide training to people who don't understand Japanese. So, I decided to add English subtitles to video files of lectures given in Japanese.
In this post, I'll introduce a Python script that can add either Japanese or English subtitles to video files.
The script uses Whisper for speech recognition.
Script Overview
This Python script is a comprehensive tool for adding subtitles to video files. It mainly provides two functions:
Subtitle generation using speech recognition
Adding subtitles to videos using existing SRT files
Key Features
1. Subtitle Generation (SRTGenerator class)
Uses the Whisper AI model for speech recognition and subtitle generation
Processes long audio by splitting it into appropriate chunks
Saves generated subtitles in SRT format (a minimal example of the format is shown after these lists)
Optional feature to translate audio into English
2. Subtitle Addition (SubtitleAdder class)
Loads subtitles from existing SRT files
Adds subtitles to videos and generates new video files
Allows customization of subtitle font size, color, position, etc.
Special Features
Uses moviepy for video editing
Achieves high-accuracy speech recognition using the whisper library
Manages token count efficiently using tiktoken
Supports multiple languages (including Japanese font support)
Provides a Command Line Interface (CLI) for ease of use
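For reference, an SRT file is just plain text: numbered entries, each with a start and end timestamp and one or more lines of subtitle text. A minimal illustrative example of the format the script reads and writes:
1
00:00:01,000 --> 00:00:04,500
こんにちは。今日の講義を始めます。

2
00:00:05,000 --> 00:00:08,000
まず、全体の流れを説明します。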
Setting Up the Environment
Install Docker Engine
https://docs.docker.com/engine/install/ubuntu/
Remove any old versions of Docker that might be installed:
for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do sudo apt remove $pkg; done
Set up Docker's apt repository and install Docker Engine:
sudo apt update
sudo apt install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo docker run hello-world
docker -v
wsl --shutdown
Making Docker Commands Usable Without Sudo
sudo groupadd docker
sudo usermod -aG docker rui  # replace "rui" with your own username
wsl --shutdown
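After restarting WSL and reopening the terminal, you can verify that Docker commands work without sudo (start the service first with sudo service docker start if it is not yet running):
docker run hello-world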
Setting Docker to Start Automatically
It can be annoying to start Docker manually after every system restart. To make Docker start automatically:
sudo visudo
Add the following to the end of the file. This lets members of the docker group start the Docker service without being asked for a password:
%docker ALL=(ALL) NOPASSWD: /usr/sbin/service docker start
sudo nano ~/.bashrc
Add the following to the end of the file:
if [[ $(service docker status | awk '{print $4}') = "not" ]]; then
    sudo service docker start > /dev/null
fi
source ~/.bashrc
Installing the NVIDIA Container Toolkit
sudo apt update
sudo apt install -y nvidia-container-toolkit
If apt cannot find the nvidia-container-toolkit package, you may first need to add NVIDIA's repository, as described in the official NVIDIA Container Toolkit installation guide.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# If systemd is not enabled in your WSL distribution, use "sudo service docker restart" instead
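To confirm that containers can actually see the GPU, you can optionally run nvidia-smi inside a CUDA image (this uses the same image that is pulled in the next section and should print your GPU's details):
docker run --rm --gpus all nvcr.io/nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04 nvidia-smi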
Creating a Docker Container
Do this in your home folder on Ubuntu on WSL (Windows Subsystem for Linux):
CUDA | NVIDIA NGC
docker pull nvcr.io/nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
docker run -it --gpus all nvcr.io/nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
apt update && apt full-upgrade -y
apt install git wget nano ffmpeg -y
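Note: the subtitle script below looks for DejaVu and Noto CJK fonts at their standard Ubuntu paths. If they are not already present in the container, they can most likely be installed with the following (package names assumed for Ubuntu 22.04):
apt install fonts-dejavu fonts-noto-cjk -y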
Installing Miniconda
cd ~
mkdir tmp
cd tmp
https://docs.anaconda.com/miniconda/#miniconda-latest-installer-links
Copy the link for the Linux 64-bit version:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# yes, enter, yes
# Remove the tmp folder
cd ..
rm -rf tmp
exit
docker container ls -a
docker start <container id>
docker exec -it <container id> /bin/bash
Setting up the Conda Environment
mkdir subtitle_generator
cd subtitle_generator
nano subtitle-generator.yml
name: subtitle-generator
channels:
  - conda-forge
  - pytorch
  - nvidia
  - defaults
dependencies:
  - python=3.9
  - pip
  - cudatoolkit=11.8
  - tiktoken
  - pillow
  - tqdm
  - srt
  - moviepy
  - pip:
      - openai-whisper
      - torch
      - torchvision
      - torchaudio
conda env create -f subtitle-generator.yml
conda activate subtitle-generator
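Before moving on, it can be worth a quick sanity check that PyTorch can see the GPU and that Whisper loads inside this environment. The helper below (check_env.py) is not part of the article's script, just a minimal optional check; it downloads the small "tiny" model on first run.
nano check_env.py
import torch
import whisper

print("CUDA available:", torch.cuda.is_available())  # expect True inside the GPU container

model = whisper.load_model("tiny")  # the tiny model keeps the download and load fast
print("Whisper model loaded on:", next(model.parameters()).device)
python check_env.py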
Python script
nano subtitle-generator.py
import os
import logging
from typing import List, Dict, Any
import torch
from moviepy.editor import VideoFileClip, ImageClip, CompositeVideoClip, ColorClip
import whisper
import srt
from datetime import timedelta
from PIL import Image, ImageDraw, ImageFont
import numpy as np
from tqdm import tqdm
import tiktoken
import textwrap
import argparse
class Config:
    # Paths
    FONT_PATH = os.getenv('FONT_PATH', "/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf")
    JAPANESE_FONT_PATH = os.getenv('JAPANESE_FONT_PATH', "/usr/share/fonts/opentype/noto/NotoSansCJK-Bold.ttc")
    TEMP_AUDIO_FILE = os.getenv('TEMP_AUDIO_FILE', "temp_audio.wav")

    # Video processing
    DEFAULT_SUBTITLE_HEIGHT = int(os.getenv('DEFAULT_SUBTITLE_HEIGHT', 200))
    DEFAULT_FONT_SIZE = int(os.getenv('DEFAULT_FONT_SIZE', 32))
    MAX_SUBTITLE_LINES = int(os.getenv('MAX_SUBTITLE_LINES', 3))

    # Video encoding
    VIDEO_CODEC = os.getenv('VIDEO_CODEC', 'libx264')
    AUDIO_CODEC = os.getenv('AUDIO_CODEC', 'aac')
    VIDEO_PRESET = os.getenv('VIDEO_PRESET', 'medium')
    CRF = os.getenv('CRF', '23')
    PIXEL_FORMAT = os.getenv('PIXEL_FORMAT', 'yuv420p')

    # Tiktoken related settings
    TIKTOKEN_MODEL = "cl100k_base"
    MAX_TOKENS_PER_CHUNK = 4000

# Logging setup
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
class SubtitleProcessor:
    def __init__(self, video_path: str, srt_path: str):
        self.video_path = video_path
        self.srt_path = srt_path
        self.temp_files = []

    def cleanup_temp_files(self):
        logger.info("Cleaning up temporary files...")
        for file_path in self.temp_files:
            try:
                if os.path.exists(file_path):
                    os.remove(file_path)
                    logger.info(f"Removed temporary file: {file_path}")
            except Exception as e:
                logger.error(f"Error removing {file_path}: {e}")

class SRTGenerator(SubtitleProcessor):
    def __init__(self, video_path: str, output_srt: str, model_name: str, language: str = "japanese", translate: bool = False):
        super().__init__(video_path, output_srt)
        self.model_name = model_name
        self.translate = translate
        self.tokenizer = tiktoken.get_encoding(Config.TIKTOKEN_MODEL)
        self.language = language

    def run(self):
        try:
            self.extract_audio()
            transcription = self.transcribe_audio()
            chunks = self.split_into_chunks(transcription)
            results = self.process_chunks(chunks)
            self.create_srt(results)
            logger.info(f"SRT file has been generated: {self.srt_path}")
        finally:
            self.cleanup_temp_files()

    def extract_audio(self):
        logger.info("Extracting audio from video...")
        video = VideoFileClip(self.video_path)
        video.audio.write_audiofile(Config.TEMP_AUDIO_FILE)
        self.temp_files.append(Config.TEMP_AUDIO_FILE)

    def transcribe_audio(self) -> Dict[str, Any]:
        logger.info("Transcribing audio with Whisper...")
        device = "cuda" if torch.cuda.is_available() else "cpu"
        logger.info(f"Using device: {device}")
        logger.info(f"Loading Whisper model: {self.model_name}")
        model = whisper.load_model(self.model_name).to(device)
        task = "translate" if self.translate else "transcribe"
        logger.info(f"Performing task: {task} with language: {self.language}")
        result = model.transcribe(Config.TEMP_AUDIO_FILE, task=task, language=self.language)
        return result

    def split_into_chunks(self, transcription: Dict[str, Any]) -> List[Dict[str, Any]]:
        logger.info("Splitting transcription into chunks...")
        chunks = []
        current_chunk = {"text": "", "segments": []}
        current_tokens = 0
        for segment in transcription['segments']:
            segment_tokens = self.tokenizer.encode(segment['text'])
            if current_tokens + len(segment_tokens) > Config.MAX_TOKENS_PER_CHUNK:
                chunks.append(current_chunk)
                current_chunk = {"text": "", "segments": []}
                current_tokens = 0
            current_chunk['text'] += segment['text'] + " "
            current_chunk['segments'].append(segment)
            current_tokens += len(segment_tokens)
        if current_chunk['segments']:
            chunks.append(current_chunk)
        return chunks

    def process_chunks(self, chunks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        logger.info("Processing chunks...")
        results = []
        for chunk in tqdm(chunks, desc="Processing chunks"):
            results.extend(chunk['segments'])
        return results

    def create_srt(self, results: List[Dict[str, Any]]):
        logger.info("Creating SRT file...")
        subs = []
        for i, segment in enumerate(results, start=1):
            start = timedelta(seconds=segment['start'])
            end = timedelta(seconds=segment['end'])
            text = segment['text']
            sub = srt.Subtitle(index=i, start=start, end=end, content=text)
            subs.append(sub)
        with open(self.srt_path, 'w', encoding='utf-8') as f:
            f.write(srt.compose(subs))

class SubtitleAdder(SubtitleProcessor):
    def __init__(self, video_path: str, output_video: str, input_srt: str, subtitle_height: int = Config.DEFAULT_SUBTITLE_HEIGHT):
        super().__init__(video_path, input_srt)
        self.output_video = output_video
        self.subtitle_height = subtitle_height

    def run(self):
        try:
            subs = self.load_srt(self.srt_path)
            self.add_subtitles_to_video(subs)
            logger.info(f"Video with subtitles has been generated: {self.output_video}")
        finally:
            self.cleanup_temp_files()

    @staticmethod
    def load_srt(srt_path: str) -> List[srt.Subtitle]:
        logger.info(f"Loading SRT file: {srt_path}")
        with open(srt_path, 'r', encoding='utf-8') as f:
            return list(srt.parse(f.read()))

    def add_subtitles_to_video(self, subs: List[srt.Subtitle]):
        logger.info(f"Adding subtitles to video with subtitle space height of {self.subtitle_height} pixels...")
        video = VideoFileClip(self.video_path)
        original_width, original_height = video.w, video.h
        new_height = original_height + self.subtitle_height
        background = ColorClip(size=(original_width, new_height), color=(0,0,0), duration=video.duration)
        video_clip = video.set_position((0, 0))
        subtitle_clips = [
            self.create_subtitle_clip(sub.content, original_width)
            .set_start(sub.start.total_seconds())
            .set_end(sub.end.total_seconds())
            .set_position((0, original_height))
            for sub in subs
        ]
        final_video = CompositeVideoClip([background, video_clip] + subtitle_clips, size=(original_width, new_height))
        final_video = final_video.set_duration(video.duration)
        final_video.write_videofile(
            self.output_video,
            codec=Config.VIDEO_CODEC,
            audio_codec=Config.AUDIO_CODEC,
            preset=Config.VIDEO_PRESET,
            ffmpeg_params=['-crf', Config.CRF, '-pix_fmt', Config.PIXEL_FORMAT],
            verbose=False,
            logger=None
        )

    @staticmethod
    def create_subtitle_clip(txt: str, video_width: int, font_size: int = Config.DEFAULT_FONT_SIZE, max_lines: int = Config.MAX_SUBTITLE_LINES) -> ImageClip:
        if any(ord(char) > 127 for char in txt):
            font_path = Config.JAPANESE_FONT_PATH
        else:
            font_path = Config.FONT_PATH
        try:
            font = ImageFont.truetype(font_path, font_size)
        except IOError:
            logger.warning(f"Failed to load font from {font_path}. Falling back to default font.")
            font = ImageFont.load_default()
        max_char_count = int(video_width / (font_size * 0.6))
        wrapped_text = textwrap.fill(txt, width=max_char_count)
        lines = wrapped_text.split('\n')[:max_lines]
        dummy_img = Image.new('RGB', (video_width, font_size * len(lines)))
        dummy_draw = ImageDraw.Draw(dummy_img)
        max_line_width = max(dummy_draw.textbbox((0, 0), line, font=font)[2] for line in lines)
        total_height = sum(dummy_draw.textbbox((0, 0), line, font=font)[3] for line in lines)
        img_width, img_height = video_width, total_height + 20
        img = Image.new('RGBA', (img_width, img_height), (0, 0, 0, 0))
        draw = ImageDraw.Draw(img)
        y_text = 10
        for line in lines:
            bbox = draw.textbbox((0, 0), line, font=font)
            x_text = (img_width - bbox[2]) // 2
            for adj in range(-2, 3):
                for adj2 in range(-2, 3):
                    draw.text((x_text+adj, y_text+adj2), line, font=font, fill=(0, 0, 0, 255))
            draw.text((x_text, y_text), line, font=font, fill=(255, 255, 255, 255))
            y_text += bbox[3]
        return ImageClip(np.array(img))

def main():
    parser = argparse.ArgumentParser(description="Subtitle Generator and Adder", formatter_class=argparse.RawTextHelpFormatter)
    subparsers = parser.add_subparsers(dest="action", required=True)

    # Common arguments
    common_parser = argparse.ArgumentParser(add_help=False)
    common_parser.add_argument("--input", required=True, help="Input video file path")

    # Generate subparser
    generate_parser = subparsers.add_parser("generate", parents=[common_parser])
    generate_parser.add_argument("--output_srt", required=True, help="Output SRT file path")
    generate_parser.add_argument("--model", default="large-v3", help="""Whisper model name (default: large-v3)
Available models:
- tiny: Smallest and fastest, lowest accuracy (39M parameters)
- base: Improved accuracy over tiny, still fast (74M parameters)
- small: Good balance of accuracy and speed (244M parameters)
- medium: Higher accuracy, slower than small (769M parameters)
- large: Most accurate, slowest (1550M parameters)
- large-v1: Original large model
- large-v2: Updated large model with improved performance
- large-v3: Latest large model with further improvements
English-only models (faster for English content):
- tiny.en, base.en, small.en, medium.en
Note: Larger models are more accurate but require more resources and time.""")
    generate_parser.add_argument("--language", default="japanese", help="Language of the audio (default: japanese)")
    generate_parser.add_argument("--translate", action="store_true", help="Translate the audio to English")

    # Add subparser
    add_parser = subparsers.add_parser("add", parents=[common_parser])
    add_parser.add_argument("--output_video", required=True, help="Output video file path")
    add_parser.add_argument("--input_srt", required=True, help="Input SRT file path")

    args = parser.parse_args()

    if args.action == "generate":
        generator = SRTGenerator(args.input, args.output_srt, args.model, args.language, args.translate)
        generator.run()
    elif args.action == "add":
        adder = SubtitleAdder(args.input, args.output_video, args.input_srt)
        adder.run()


if __name__ == "__main__":
    main()
How to Use
Help
python subtitle-generator.py generate --help
usage: subtitle-generator.py generate [-h] --input INPUT --output_srt OUTPUT_SRT [--model MODEL] [--language LANGUAGE]
                                      [--translate]

optional arguments:
  -h, --help            show this help message and exit
  --input INPUT         Input video file path
  --output_srt OUTPUT_SRT
                        Output SRT file path
  --model MODEL         Whisper model name (default: large-v3) Available models: - tiny: Smallest and fastest, lowest
                        accuracy (39M parameters) - base: Improved accuracy over tiny, still fast (74M parameters) -
                        small: Good balance of accuracy and speed (244M parameters) - medium: Higher accuracy, slower
                        than small (769M parameters) - large: Most accurate, slowest (1550M parameters) - large-v1:
                        Original large model - large-v2: Updated large model with improved performance - large-v3:
                        Latest large model with further improvements English-only models (faster for English content):
                        - tiny.en, base.en, small.en, medium.en Note: Larger models are more accurate but require more
                        resources and time.
  --language LANGUAGE   Language of the audio (default: japanese)
  --translate           Translate the audio to English
python subtitle-generator.py add --help
usage: subtitle-generator.py add [-h] --input INPUT --output_video OUTPUT_VIDEO --input_srt INPUT_SRT

optional arguments:
  -h, --help            show this help message and exit
  --input INPUT         Input video file path
  --output_video OUTPUT_VIDEO
                        Output video file path
  --input_srt INPUT_SRT
                        Input SRT file path
Copying Files from Windows to the Container
# on ubuntu command line
# Use 'docker container ls -a' to find the container name
docker cp "/mnt/c/Windows_path/VVVV.mp4" docker_container_name:root/speech_to_text
Usage Example
# Generate an SRT file in Japanese
python subtitle-generator.py generate --input xxx.mp4 --output_srt japa.srt --model large-v3
# Generate an English SRT file (--translate translates the Japanese audio into English)
python subtitle-generator.py generate --input xxx.mp4 --output_srt eng.srt --model large-v3 --translate
# Add Japanese subtitles to the video
python subtitle-generator.py add --input xxx.mp4 --output_video japa.mp4 --input_srt japa.srt
# Add English subtitles to the video
python subtitle-generator.py add --input xxx.mp4 --output_video eng.mp4 --input_srt eng.srt
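Because the Config class reads most of its settings from environment variables, you can also tweak things like the font size or the height of the subtitle band for a single run without editing the script, for example:
# Use a larger font and a taller subtitle area (these variables are read by the Config class)
DEFAULT_FONT_SIZE=40 DEFAULT_SUBTITLE_HEIGHT=250 python subtitle-generator.py add --input xxx.mp4 --output_video japa.mp4 --input_srt japa.srt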
Copying Files from the Container to Windows
docker cp docker_container_name:root/path/xxx.mp4 "/mnt/c/Windows path/"
Summary
This Python script uses cutting-edge AI speech recognition to automatically create high-quality subtitles from video files, and you can manually edit the generated SRT files if needed. This makes it possible to add English subtitles to Japanese videos with very little effort.
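For example, if the subtitles turn out to be consistently offset from the audio, the srt package the script already depends on can shift every entry by a fixed amount. A small, hypothetical helper (not part of the script above) might look like this:
from datetime import timedelta
import srt

# Shift every subtitle in japa.srt 0.5 seconds later and save the result as japa_shifted.srt
with open("japa.srt", encoding="utf-8") as f:
    subs = list(srt.parse(f.read()))

offset = timedelta(seconds=0.5)
for sub in subs:
    sub.start += offset
    sub.end += offset

with open("japa_shifted.srt", "w", encoding="utf-8") as f:
    f.write(srt.compose(subs))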
We hope this tool will save you time, allowing you to focus more on your educational and research activities.
The progress of AI is truly remarkable. We encourage you to use AI technologies like this to create a more efficient work environment for yourself!
If you have any questions or suggestions for improvement, please let us know in the comments.