Utilizing Whisper and ChatGPT to Add Subtitles to Japanese-Language Videos
Objective
The primary aim of this project was to enhance the accessibility of Japanese-language instructional videos for non-Japanese speakers. To achieve this, I sought to develop a method for adding English subtitles to video files containing Japanese audio.
Background
In a previous article, I introduced a Python script capable of generating either Japanese or English subtitles for video files. This script utilized Whisper's speech recognition capabilities. However, when employing Whisper for translation, I encountered issues with omissions and quality inconsistencies.
Methodology
To address these limitations, I have augmented my approach by incorporating OpenAI's API for translation. The updated method, sketched in code below, involves two primary steps:
Utilizing Whisper for transcription of the source language (Japanese)
Employing OpenAI's API to translate the transcribed text into the target language (English)
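Conceptually, the pipeline looks like the following minimal sketch (independent of the full script presented later). It assumes the openai-whisper and openai packages are installed and that OPENAI_API_KEY is set in the environment; the file name audio.wav and the model choices are placeholders.
import whisper
from openai import OpenAI

# Step 1: transcribe the Japanese audio with Whisper
model = whisper.load_model("large-v3")
result = model.transcribe("audio.wav", task="transcribe", language="japanese")
japanese_text = result["text"]

# Step 2: translate the transcription into English with the OpenAI API
client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Translate the following Japanese text into English."},
        {"role": "user", "content": japanese_text},
    ],
    temperature=0.3,
)
print(response.choices[0].message.content)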
Script Overview
The Python script I have developed serves as a comprehensive tool for adding subtitles to video files. It automates the entire process, from subtitle generation to video integration, by combining speech recognition and translation functionalities. The script offers three main features:
Subtitle Generation via Speech Recognition
Subtitle Integration Using Existing SRT Files
Translation of Subtitle Files
Key Functionalities
1. Subtitle Generation (SRTGenerator Class)
Employs the Whisper AI model for speech recognition and subtitle creation
Splits the transcription into token-limited chunks and saves the generated subtitles in SRT format
Offers the option to translate subtitles into English using Whisper's translation feature
2. Subtitle Integration (SubtitleAdder Class)
Imports existing SRT files and incorporates subtitles into videos
Allows customization of subtitle attributes such as font size, color, and positioning
3. Subtitle Translation (SRTTranslator Class)
Utilizes the OpenAI API to translate existing SRT files into specified target languages
Enables adjustment of the translation's temperature parameter to control output diversity
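Beyond the CLI shown later, these three classes can also be driven directly from Python. The snippet below is a brief sketch assuming the script that follows is saved as subtitle_generator.py; the file names lecture.mp4, lecture_ja.srt, lecture_en.srt, and lecture_en.mp4 are placeholders.
from subtitle_generator import SRTGenerator, SRTTranslator, SubtitleAdder

# Generate a Japanese SRT from the video's audio track
SRTGenerator("lecture.mp4", "lecture_ja.srt", model_name="large-v3", language="japanese").run()

# Translate the SRT into English via the OpenAI API (lower temperature = more literal output)
SRTTranslator(temperature=0.3).translate_srt("lecture_ja.srt", "lecture_en.srt", "Japanese", "English")

# Burn the English subtitles into a new video
SubtitleAdder("lecture.mp4", "lecture_en.mp4", "lecture_en.srt").run()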
Technical Specifications
Utilizes the moviepy library for video editing and subtitle integration
Uses the whisper library for high-accuracy speech recognition
Incorporates OpenAI API for automated translation capabilities
Features a user-friendly Command Line Interface (CLI) for ease of operation
Environment Setup
To utilize this script, users must first install the Docker Engine (the steps below assume Ubuntu running on WSL 2). Detailed installation instructions can be found in the official Docker documentation. It is advisable to remove any outdated Docker installations before proceeding:
for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do sudo apt remove $pkg; done
Installing Docker Engine
sudo apt update
sudo apt install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo docker run hello-world
docker -v
wsl --shutdown
Configuring Docker for Non-Root Usage
sudo groupadd docker
sudo usermod -aG docker $USER
wsl --shutdown
Configuring Docker to Start Automatically
To have the Docker service start automatically when a WSL shell is opened:
sudo visudo
Add the following line at the end of the file:
%docker ALL=(ALL) NOPASSWD: /usr/sbin/service docker start
sudo nano ~/.bashrc
Add the following at the end of the file:
if [[ $(service docker status | awk '{print $4}') = "not" ]]; then
    sudo service docker start > /dev/null
fi
source ~/.bashrc
NVIDIA Docker Installation
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Creating a Docker container
Run the following from the home directory of Ubuntu on WSL. The CUDA base image is available from the NVIDIA NGC catalog (CUDA | NVIDIA NGC).
docker pull nvcr.io/nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
docker run -it --gpus all nvcr.io/nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
apt update && apt full-upgrade -y
apt install git wget nano ffmpeg -y
Miniconda Installation
cd ~
mkdir tmp
cd tmp
# Copy the Linux 64-bit installer link from https://docs.anaconda.com/miniconda/#miniconda-latest-installer-links
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# Accept the license (yes), confirm the install location (Enter), and initialize conda (yes)
# Remove the temporary folder:
cd ..
rm -rf tmp
exit
docker container ls -a
docker start <container id>
docker exec -it <container id> /bin/bash
Building the conda environment
mkdir subtitle_generator
cd subtitle_generator
nano subtitle-generator.yml
name: subtitle-generator
channels:
  - conda-forge
  - pytorch
  - nvidia
  - defaults
dependencies:
  - python=3.9
  - pip
  - cudatoolkit=11.8
  - tiktoken
  - pillow
  - tqdm
  - srt
  - moviepy
  - python-dotenv
  - pip:
      - openai-whisper
      - openai
      - torch
      - torchvision
      - torchaudio
conda env create -f subtitle-generator.yml
conda activate subtitle-generator
Script
nano subtitle_generator.py
import os
import logging
from typing import List, Dict, Any
import torch
from moviepy.editor import VideoFileClip, CompositeVideoClip, ColorClip, ImageClip
import whisper
import srt
from datetime import timedelta
from PIL import Image, ImageDraw, ImageFont
import numpy as np
from tqdm import tqdm
import tiktoken
import textwrap
import argparse
from openai import OpenAI
from dotenv import load_dotenv
# Load environment variables from .env file
load_dotenv()
class Config:
    # Paths
    FONT_PATH = os.getenv('FONT_PATH', "/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf")
    JAPANESE_FONT_PATH = os.getenv('JAPANESE_FONT_PATH', "/usr/share/fonts/opentype/noto/NotoSansCJK-Bold.ttc")
    TEMP_AUDIO_FILE = os.getenv('TEMP_AUDIO_FILE', "temp_audio.wav")

    # Video processing
    DEFAULT_SUBTITLE_HEIGHT = int(os.getenv('DEFAULT_SUBTITLE_HEIGHT', 200))
    DEFAULT_FONT_SIZE = int(os.getenv('DEFAULT_FONT_SIZE', 32))
    MAX_SUBTITLE_LINES = int(os.getenv('MAX_SUBTITLE_LINES', 3))

    # Video encoding
    VIDEO_CODEC = os.getenv('VIDEO_CODEC', 'libx264')
    AUDIO_CODEC = os.getenv('AUDIO_CODEC', 'aac')
    VIDEO_PRESET = os.getenv('VIDEO_PRESET', 'medium')
    CRF = os.getenv('CRF', '23')
    PIXEL_FORMAT = os.getenv('PIXEL_FORMAT', 'yuv420p')

    # Tiktoken related settings
    TIKTOKEN_MODEL = "cl100k_base"
    MAX_TOKENS_PER_CHUNK = 4000

    # OpenAI settings
    OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
    DEFAULT_GPT_MODEL = "gpt-4o"
    GPT_MAX_TOKENS = 4000
# Logging setup
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
class SubtitleProcessor:
    def __init__(self, video_path: str, srt_path: str):
        self.video_path = video_path
        self.srt_path = srt_path
        self.temp_files = []

    def cleanup_temp_files(self):
        logger.info("Cleaning up temporary files...")
        for file_path in self.temp_files:
            try:
                if os.path.exists(file_path):
                    os.remove(file_path)
                    logger.info(f"Removed temporary file: {file_path}")
            except Exception as e:
                logger.error(f"Error removing {file_path}: {e}")
class SRTTranslator:
    def __init__(self, model: str = Config.DEFAULT_GPT_MODEL, temperature: float = 0.3):
        api_key = Config.OPENAI_API_KEY
        if not api_key:
            raise ValueError("OpenAI API key is required. Set it in the environment variable 'OPENAI_API_KEY'.")
        self.client = OpenAI(api_key=api_key)
        self.model = model
        self.temperature = temperature

    def translate_srt(self, input_srt: str, output_srt: str, source_lang: str, target_lang: str):
        with open(input_srt, 'r', encoding='utf-8') as f:
            subtitle_generator = srt.parse(f.read())
            subtitles = list(subtitle_generator)
        translated_subtitles = []
        for subtitle in tqdm(subtitles, desc="Translating subtitles"):
            translated_content = self.translate_text(subtitle.content, source_lang, target_lang)
            translated_subtitle = srt.Subtitle(
                index=subtitle.index,
                start=subtitle.start,
                end=subtitle.end,
                content=translated_content
            )
            translated_subtitles.append(translated_subtitle)
        with open(output_srt, 'w', encoding='utf-8') as f:
            f.write(srt.compose(translated_subtitles))

    def translate_text(self, text: str, source_lang: str, target_lang: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": f"You are a professional translator. Translate the following text from {source_lang} to {target_lang}. Maintain the original meaning and nuance as much as possible. Do not modify any formatting or line breaks."},
                {"role": "user", "content": text}
            ],
            temperature=self.temperature,
            max_tokens=Config.GPT_MAX_TOKENS
        )
        return response.choices[0].message.content.strip()
class SRTGenerator(SubtitleProcessor):
    def __init__(self, video_path: str, output_srt: str, model_name: str, language: str = "japanese", translate: bool = False, api_key: str = None):
        super().__init__(video_path, output_srt)
        self.model_name = model_name
        self.translate = translate
        self.tokenizer = tiktoken.get_encoding(Config.TIKTOKEN_MODEL)
        self.language = language
        self.api_key = api_key

    def run(self):
        try:
            self.extract_audio()
            transcription = self.transcribe_audio()
            chunks = self.split_into_chunks(transcription)
            results = self.process_chunks(chunks)
            self.create_srt(results)
            logger.info(f"SRT file has been generated: {self.srt_path}")
        finally:
            self.cleanup_temp_files()

    def extract_audio(self):
        logger.info("Extracting audio from video...")
        video = VideoFileClip(self.video_path)
        video.audio.write_audiofile(Config.TEMP_AUDIO_FILE)
        self.temp_files.append(Config.TEMP_AUDIO_FILE)

    def transcribe_audio(self) -> Dict[str, Any]:
        logger.info("Transcribing audio with Whisper...")
        device = "cuda" if torch.cuda.is_available() else "cpu"
        logger.info(f"Using device: {device}")
        logger.info(f"Loading Whisper model: {self.model_name}")
        model = whisper.load_model(self.model_name).to(device)
        task = "translate" if self.translate else "transcribe"
        logger.info(f"Performing task: {task} with language: {self.language}")
        result = model.transcribe(Config.TEMP_AUDIO_FILE, task=task, language=self.language)
        return result

    def split_into_chunks(self, transcription: Dict[str, Any]) -> List[Dict[str, Any]]:
        logger.info("Splitting transcription into chunks...")
        chunks = []
        current_chunk = {"text": "", "segments": []}
        current_tokens = 0
        for segment in transcription['segments']:
            segment_tokens = self.tokenizer.encode(segment['text'])
            if current_tokens + len(segment_tokens) > Config.MAX_TOKENS_PER_CHUNK:
                chunks.append(current_chunk)
                current_chunk = {"text": "", "segments": []}
                current_tokens = 0
            current_chunk['text'] += segment['text'] + " "
            current_chunk['segments'].append(segment)
            current_tokens += len(segment_tokens)
        if current_chunk['segments']:
            chunks.append(current_chunk)
        return chunks

    def process_chunks(self, chunks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        logger.info("Processing chunks...")
        results = []
        for chunk in tqdm(chunks, desc="Processing chunks"):
            results.extend(chunk['segments'])
        return results

    def create_srt(self, results: List[Dict[str, Any]]):
        logger.info("Creating SRT file...")
        subs = []
        for i, segment in enumerate(results, start=1):
            start = timedelta(seconds=segment['start'])
            end = timedelta(seconds=segment['end'])
            text = segment['text']
            sub = srt.Subtitle(index=i, start=start, end=end, content=text)
            subs.append(sub)
        with open(self.srt_path, 'w', encoding='utf-8') as f:
            f.write(srt.compose(subs))
class SubtitleAdder(SubtitleProcessor):
    def __init__(self, video_path: str, output_video: str, input_srt: str, subtitle_height: int = Config.DEFAULT_SUBTITLE_HEIGHT):
        super().__init__(video_path, input_srt)
        self.output_video = output_video
        self.subtitle_height = subtitle_height

    def run(self):
        try:
            subs = self.load_srt(self.srt_path)
            self.add_subtitles_to_video(subs)
            logger.info(f"Video with subtitles has been generated: {self.output_video}")
        finally:
            self.cleanup_temp_files()

    @staticmethod
    def load_srt(srt_path: str) -> List[srt.Subtitle]:
        logger.info(f"Loading SRT file: {srt_path}")
        with open(srt_path, 'r', encoding='utf-8') as f:
            return list(srt.parse(f.read()))

    def add_subtitles_to_video(self, subs: List[srt.Subtitle]):
        logger.info(f"Adding subtitles to video with subtitle space height of {self.subtitle_height} pixels...")
        video = VideoFileClip(self.video_path)
        original_width, original_height = video.w, video.h
        new_height = original_height + self.subtitle_height
        background = ColorClip(size=(original_width, new_height), color=(0,0,0), duration=video.duration)
        video_clip = video.set_position((0, 0))
        subtitle_clips = [
            self.create_subtitle_clip(sub.content, original_width)
            .set_start(sub.start.total_seconds())
            .set_end(sub.end.total_seconds())
            .set_position((0, original_height))
            for sub in subs
        ]
        final_video = CompositeVideoClip([background, video_clip] + subtitle_clips, size=(original_width, new_height))
        final_video = final_video.set_duration(video.duration)
        final_video.write_videofile(
            self.output_video,
            codec=Config.VIDEO_CODEC,
            audio_codec=Config.AUDIO_CODEC,
            preset=Config.VIDEO_PRESET,
            ffmpeg_params=['-crf', Config.CRF, '-pix_fmt', Config.PIXEL_FORMAT],
            verbose=False,
            logger=None
        )

    @staticmethod
    def create_subtitle_clip(txt: str, video_width: int, font_size: int = Config.DEFAULT_FONT_SIZE, max_lines: int = Config.MAX_SUBTITLE_LINES) -> ImageClip:
        if any(ord(char) > 127 for char in txt):
            font_path = Config.JAPANESE_FONT_PATH
        else:
            font_path = Config.FONT_PATH
        try:
            font = ImageFont.truetype(font_path, font_size)
        except IOError:
            logger.warning(f"Failed to load font from {font_path}. Falling back to default font.")
            font = ImageFont.load_default()
        max_char_count = int(video_width / (font_size * 0.6))
        wrapped_text = textwrap.fill(txt, width=max_char_count)
        lines = wrapped_text.split('\n')[:max_lines]
        dummy_img = Image.new('RGB', (video_width, font_size * len(lines)))
        dummy_draw = ImageDraw.Draw(dummy_img)
        max_line_width = max(dummy_draw.textbbox((0, 0), line, font=font)[2] for line in lines)
        total_height = sum(dummy_draw.textbbox((0, 0), line, font=font)[3] for line in lines)
        img_width, img_height = video_width, total_height + 20
        img = Image.new('RGBA', (img_width, img_height), (0, 0, 0, 0))
        draw = ImageDraw.Draw(img)
        y_text = 10
        for line in lines:
            bbox = draw.textbbox((0, 0), line, font=font)
            x_text = (img_width - bbox[2]) // 2
            for adj in range(-2, 3):
                for adj2 in range(-2, 3):
                    draw.text((x_text+adj, y_text+adj2), line, font=font, fill=(0, 0, 0, 255))
            draw.text((x_text, y_text), line, font=font, fill=(255, 255, 255, 255))
            y_text += bbox[3]
        return ImageClip(np.array(img))
def main():
    parser = argparse.ArgumentParser(description="Subtitle Generator and Adder", formatter_class=argparse.RawTextHelpFormatter)
    subparsers = parser.add_subparsers(dest="action", required=True)

    # Common arguments for generate and add commands
    common_parser = argparse.ArgumentParser(add_help=False)
    common_parser.add_argument("--input", required=True, help="Input video file path")

    # Generate subparser for creating subtitles
    generate_parser = subparsers.add_parser("generate", parents=[common_parser])
    generate_parser.add_argument("--output_srt", required=True, help="Output SRT file path")
    generate_parser.add_argument("--model", default="large-v3", help="Whisper model name (default: large-v3)")
    generate_parser.add_argument("--language", default="Japanese", help="Language of the audio (default: Japanese)")
    generate_parser.add_argument("--translate", action="store_true", help="Translate the audio to English")

    # Add subparser for adding subtitles to a video
    add_parser = subparsers.add_parser("add", parents=[common_parser])
    add_parser.add_argument("--output_video", required=True, help="Output video file path")
    add_parser.add_argument("--input_srt", required=True, help="Input SRT file path")

    # Translate subparser for translating an SRT file
    translate_parser = subparsers.add_parser("translate")
    translate_parser.add_argument("--input_srt", required=True, help="Input SRT file path")
    translate_parser.add_argument("--output_srt", required=True, help="Output SRT file path")
    translate_parser.add_argument("--source_lang", default="Japanese", help="Source language of the SRT file (default: Japanese)")
    translate_parser.add_argument("--target_lang", default="English", help="Target language for translation (default: English)")
    translate_parser.add_argument("--temperature", type=float, default=0.3, help="Temperature setting for OpenAI (default: 0.3)")

    args = parser.parse_args()

    if args.action == "generate":
        generator = SRTGenerator(args.input, args.output_srt, args.model, args.language, args.translate)
        generator.run()
    elif args.action == "add":
        adder = SubtitleAdder(args.input, args.output_video, args.input_srt)
        adder.run()
    elif args.action == "translate":
        translator = SRTTranslator(temperature=args.temperature)
        translator.translate_srt(args.input_srt, args.output_srt, args.source_lang, args.target_lang)
        logger.info(f"Translation completed. Output saved to {args.output_srt}")

if __name__ == "__main__":
    main()
.env
Create a .env file with your OpenAI API key:
# .env
OPENAI_API_KEY=sk-your_key
Usage
python subtitle_generator.py --help
usage: subtitle_generator.py [-h] {generate,add,translate} ...
Subtitle Generator and Adder
positional arguments:
  {generate,add,translate}

optional arguments:
  -h, --help            show this help message and exit
python subtitle_generator.py add --help
usage: subtitle_generator.py add [-h] --input INPUT --output_video OUTPUT_VIDEO --input_srt INPUT_SRT
optional arguments:
  -h, --help            show this help message and exit
  --input INPUT         Input video file path
  --output_video OUTPUT_VIDEO
                        Output video file path
  --input_srt INPUT_SRT
                        Input SRT file path
python subtitle_generator.py generate --help
usage: subtitle_generator.py generate [-h] --input INPUT --output_srt OUTPUT_SRT [--model MODEL] [--language LANGUAGE] [--translate]
optional arguments:
  -h, --help            show this help message and exit
  --input INPUT         Input video file path
  --output_srt OUTPUT_SRT
                        Output SRT file path
  --model MODEL         Whisper model name (default: large-v3)
  --language LANGUAGE   Language of the audio (default: Japanese)
  --translate           Translate the audio to English
python subtitle_generator.py translate --help
usage: subtitle_generator.py translate [-h] --input_srt INPUT_SRT --output_srt OUTPUT_SRT [--source_lang SOURCE_LANG]
[--target_lang TARGET_LANG] [--temperature TEMPERATURE]
optional arguments:
  -h, --help            show this help message and exit
  --input_srt INPUT_SRT
                        Input SRT file path
  --output_srt OUTPUT_SRT
                        Output SRT file path
  --source_lang SOURCE_LANG
                        Source language of the SRT file (default: Japanese)
  --target_lang TARGET_LANG
                        Target language for translation (default: English)
  --temperature TEMPERATURE
                        Temperature setting for OpenAI (default: 0.3)
Copying Files from Windows to the Container
# on the Ubuntu command line
# find the container name with: docker container ls -a
docker cp "/mnt/c/Windows_path/VVVV.mp4" docker_container_name:root/subtitle_generator
Examples of Use
1. Generation of Japanese SRT Files
The Whisper model is utilized to generate a Japanese SRT file directly from the audio content of the video.
python subtitle_generator.py generate --input test.mp4 --output_srt japa.srt --model large-v3
2. Integration of Japanese Subtitles into Video
The generated Japanese SRT file is used to incorporate Japanese subtitles into the video.
python subtitle_generator.py add --input test.mp4 --output_video japa.mp4 --input_srt japa.srt
3. Translation and Generation of English SRT Files
The Whisper model is employed to translate the video's Japanese audio content into English and generate an English SRT file.
python subtitle_generator.py generate --input test.mp4 --output_srt eng.srt --model large-v3 --translate
4. Integration of English Subtitles into Video
The generated English SRT file is utilized to incorporate English subtitles into the video.
python subtitle_generator.py add --input test.mp4 --output_video eng.mp4 --input_srt eng.srt
5. Translation of Japanese SRT Files to English
Existing Japanese SRT files are translated into English using the OpenAI API to produce English SRT files. The --temperature option allows for adjustment of the variability in the API's output.
python subtitle_generator.py translate --input_srt japa.srt --output_srt eng_translated.srt --temperature 0.7
Explanation:
The generate command employs the Whisper model for speech recognition and SRT file creation. Adding the --translate option enables automatic translation from Japanese to English using Whisper.
The add command integrates the generated SRT file as subtitles into the video.
The translate command utilizes the OpenAI API to translate existing SRT files.
Copying Files from Container to Windows
docker cp docker_container_name:root/path/xxx.mp4 "/mnt/c/Windows path/"
Summary
This Python script combines Whisper and the OpenAI API to generate high-accuracy subtitles, translate them, and burn them into videos automatically. It can save considerable time in educational, research, and content-production contexts.
I encourage you to experiment with this tool if it piques your interest!
I hope this tool allows you to allocate more time to educational and research activities, thereby enhancing your overall productivity and satisfaction.
The advancements in AI technology are truly remarkable. I strongly encourage you to leverage such AI technologies to create more efficient work environments for yourselves!
Should you have any questions or suggestions for improvement, please don't hesitate to share them in the comments section. Your feedback is invaluable to me.