Translating Japanese Word Documents into English Using Claude's API While Preserving Formatting
Accomplishment
During a recent meeting, the topic of translating Japanese instruction manuals into English arose. As the atmosphere seemed to lean towards a manual translation effort, I decided to introduce a solution I developed using Python. This script retrieves the format of the original Word file, translates Japanese text to English using Claude's API, and generates a new Word file while preserving the original formatting.
Previously, I introduced a method using OpenAI's API for similar purposes.
https://note.com/preview/n7aaa3c7a0549
Script Overview
Word Document Translation: The script reads an input Word document, translates its content, and saves it as a new Word document.
Format Preservation: It maintains the original document's font styles, colors, sizes, and alignments.
Chunk Processing: Long documents are divided into appropriately sized chunks to circumvent API limitations.
Customizable: Different Claude models and temperature settings can be specified via command line.
Progress Display: The tqdm library is utilized to display real-time translation progress.
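The chunking idea can be illustrated on its own. The sketch below is not the function used in the script. It is a hypothetical variant (the names `chunk_japanese` and `max_chars` are invented here) that splits on Japanese sentence-ending punctuation and measures size in characters, which is only a rough stand-in for tokens.

```python
import re
from typing import List


def chunk_japanese(text: str, max_chars: int) -> List[str]:
    """Group sentences into chunks of at most max_chars characters.

    Hypothetical helper: character count is only a rough proxy for tokens.
    """
    # Split after sentence-ending punctuation, keeping the punctuation attached.
    sentences = [s for s in re.split(r"(?<=[。！？])", text) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split by characters.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if len(current) + len(sentence) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        chunks.append(current)
    return chunks
```

Splitting at sentence boundaries keeps each request self-contained, which may help the model translate each chunk coherently; joining the chunks back together reproduces the original text exactly.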
Environment Setup
Obtaining a Claude API Key
Create an account on the Anthropic website and acquire an API key. For detailed instructions, please refer to this article:
Conda Environment Setup
# environment_word_translator_claude.yml
name: word-translator-claude
channels:
- conda-forge
- defaults
dependencies:
- python=3.12
- python-docx
- python-dotenv
- anthropic
- tqdm
conda env create -f environment_word_translator_claude.yml
Script
import os
import argparse
from dotenv import load_dotenv
from typing import List, Tuple, Optional
from docx import Document
from docx.shared import RGBColor, Inches
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.text.paragraph import Paragraph
import anthropic
from tqdm import tqdm
# Load environment variables from .env file
load_dotenv()
# Anthropic API key configuration
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
MAX_TOKENS = 4000 # Adjusted for Claude's capacity
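def split_text(text: str, max_tokens: int) -> List[str]:
    """Split text into chunks of at most max_tokens characters.

    Character count is used as a rough stand-in for tokens. Japanese text
    rarely contains spaces, so a "word" longer than the limit is hard-split.
    """
    chunks = []
    current_chunk = []
    for word in text.split():
        # Break any single word that alone exceeds the limit
        pieces = [word[i:i + max_tokens] for i in range(0, len(word), max_tokens)]
        for piece in pieces:
            if current_chunk and len(" ".join(current_chunk + [piece])) > max_tokens:
                chunks.append(" ".join(current_chunk))
                current_chunk = [piece]
            else:
                current_chunk.append(piece)
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks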
def translate_text(client: anthropic.Anthropic, text: str, model: str, temperature: float, max_tokens: int) -> str:
    """Translate text from Japanese to B2 level academic English using Anthropic's Claude model."""
    chunks = split_text(text, max_tokens)
    translated_chunks = []
    system_prompt = """You are a professional translator specializing in Japanese to English translation. Please translate the following Japanese text into B2 level (upper intermediate) academic English. Follow these guidelines:
1. Maintain the meaning and style of the original text.
2. Use appropriate academic vocabulary and structures, but avoid overly complex or obscure terms.
3. Diversify word choice and sentence structure. Avoid repetition of specific words or phrases.
4. Avoid overly formal or archaic language. Aim for clarity and readability.
5. Retain the tone and intent of the original text, whether persuasive, informative, or analytical.
6. Pay attention to context and field-specific terminology, translating them accurately.
7. Avoid clichéd phrases and expressions commonly used in AI-generated text.
8. If the original text contains colloquial expressions or idioms, translate them into appropriate English equivalents that maintain the intended meaning and tone.
9. Pay attention to the nuances and implications of the original text and strive to reflect them in the translation.
10. If there is ambiguity in the original text, maintain that ambiguity in the translation without making assumptions.
Keep these guidelines in mind as you translate the given Japanese text into English. Please output only the translated text, without including the original Japanese or any explanations."""
    for chunk in tqdm(chunks, desc="Translating chunks", unit="chunk"):
        try:
            message = client.messages.create(
                model=model,
                max_tokens=max_tokens,
                temperature=temperature,
                system=system_prompt,
                messages=[
                    {"role": "user", "content": f"Please translate the following Japanese text into English:\n\n{chunk}"}
                ]
            )
            translated_text = message.content[0].text.strip()
            # Remove excess explanations or prefixes
            translated_text = translated_text.replace("Here's the translation:", "").strip()
            translated_text = translated_text.replace("Translated text:", "").strip()
            translated_chunks.append(translated_text)
        except Exception as e:
            print(f"Translation error: {str(e)}")
            print(f"Error details: {type(e).__name__}")
            print(f"Failed to translate chunk: {chunk}")
            # Fall back to the untranslated chunk so no content is lost
            translated_chunks.append(chunk)
    return " ".join(translated_chunks)
def process_paragraph(client: anthropic.Anthropic, paragraph: Paragraph, model: str, temperature: float, max_tokens: int) -> Tuple[str, List[Tuple]]:
    """Process a paragraph and its runs, returning the translated text and formatting information."""
    translated_text = translate_text(client, paragraph.text, model, temperature, max_tokens)
    formatting = []
    for run in paragraph.runs:
        if not run.text:
            continue  # Skip empty runs; they carry no visible text
        formatting.append((
            run.bold, run.italic, run.underline,
            run.font.name, run.font.size,
            run.font.color.rgb if run.font.color.rgb else RGBColor(0, 0, 0),
            len(run.text),  # Original run length, used to decide when to switch formats
        ))
    return translated_text, formatting

def apply_formatting(paragraph: Paragraph, text: str, formatting: List[Tuple]) -> None:
    """Apply formatting to a paragraph based on the original formatting."""
    paragraph.text = ""
    words = text.split()
    format_index = 0
    chars_in_format = 0  # Characters written so far under the current format
    current_run = paragraph.add_run()
    for word in words:
        if format_index < len(formatting):
            bold, italic, underline, font_name, font_size, color, original_length = formatting[format_index]
            current_run = paragraph.add_run(word + " ")
            current_run.bold = bold
            current_run.italic = italic
            current_run.underline = underline
            if font_name:
                # Leave the font unset so the style's font applies when none was specified
                current_run.font.name = font_name
            if font_size:
                current_run.font.size = font_size
            current_run.font.color.rgb = color
            # Move to the next format once the text written is as long as the original run
            chars_in_format += len(current_run.text)
            if chars_in_format >= original_length:
                format_index += 1
                chars_in_format = 0
        else:
            # All formats are used up; keep appending to the last run
            current_run.add_text(word + " ")
def get_document_margins(doc: Document) -> Tuple[float, float, float, float]:
    """Get the margins of the document in inches."""
    section = doc.sections[0]
    return (
        section.top_margin.inches,
        section.bottom_margin.inches,
        section.left_margin.inches,
        section.right_margin.inches
    )

def set_document_margins(doc: Document, margins: Tuple[float, float, float, float]) -> None:
    """Set the margins of the document in inches."""
    section = doc.sections[0]
    section.top_margin = Inches(margins[0])
    section.bottom_margin = Inches(margins[1])
    section.left_margin = Inches(margins[2])
    section.right_margin = Inches(margins[3])
def translate_word_document(input_path: str, output_path: str, api_key: str, model: str, temperature: float) -> None:
    """Translate Word document to B2 level academic English and preserve formatting."""
    client = anthropic.Anthropic(api_key=api_key)
    doc = Document(input_path)
    translated_doc = Document()
    # Get and set margins
    original_margins = get_document_margins(doc)
    set_document_margins(translated_doc, original_margins)
    for paragraph in tqdm(doc.paragraphs, desc="Processing paragraphs", unit="paragraph"):
        translated_text, formatting = process_paragraph(client, paragraph, model, temperature, max_tokens=MAX_TOKENS)
        translated_paragraph = translated_doc.add_paragraph()
        apply_formatting(translated_paragraph, translated_text, formatting)
        translated_paragraph.alignment = paragraph.alignment
        translated_paragraph.style = paragraph.style
    translated_doc.save(output_path)
    print(f"Translated Word document saved to {output_path}")
    print(f"Original margins (top, bottom, left, right): {original_margins}")
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Translate Word documents to English.")
    parser.add_argument("input_file", help="Path to the input file to be translated")
    parser.add_argument("--model", default="claude-3-5-sonnet-20240620", help="Anthropic model to use (default: claude-3-5-sonnet-20240620)")
    parser.add_argument("--temperature", type=float, default=0.5, help="Temperature for generation (default: 0.5)")
    args = parser.parse_args()
    api_key = ANTHROPIC_API_KEY
    if not api_key:
        raise ValueError("ANTHROPIC_API_KEY is not set.")
    input_file = args.input_file
    # Save the result next to the input file with a "translated_" prefix
    output_dir = os.path.dirname(input_file)
    output_file = os.path.join(output_dir, f"translated_{os.path.basename(input_file)}")
    translate_word_document(input_file, output_file, api_key, args.model, args.temperature)
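A translated paragraph rarely has the same length as the original, so there is no exact mapping from the original runs to the translated words. One possible heuristic, sketched below in plain Python with no Word dependency, gives each original run a share of the translated words proportional to its share of the original characters. The function name `allocate_words` and its inputs are assumptions for illustration and are not part of the script above.

```python
from typing import List


def allocate_words(run_lengths: List[int], words: List[str]) -> List[List[str]]:
    """Distribute words across runs in proportion to each run's character count.

    Hypothetical helper: returns one (possibly empty) word list per run.
    If all runs are empty there is nothing to distribute over, so every
    group is returned empty.
    """
    total_chars = sum(run_lengths)
    if total_chars == 0 or not words:
        return [[] for _ in run_lengths]
    groups: List[List[str]] = []
    start = 0
    consumed_chars = 0
    for length in run_lengths:
        consumed_chars += length
        # Rounding the cumulative share keeps the grand total equal to len(words).
        end = round(len(words) * consumed_chars / total_chars)
        groups.append(words[start:end])
        start = end
    return groups
```

Each group could then be joined with spaces and written to a new run that carries the formatting of the corresponding original run.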
.env
# .env
ANTHROPIC_API_KEY=sk-your-key
Usage
Help
python Japanease-word-file-translate-to-English-Claude.py --help
usage: Japanease-word-file-translate-to-English-Claude.py [-h] [--model MODEL] [--temperature TEMPERATURE] input_file

Translate Word documents to English.

positional arguments:
  input_file            Path to the input file to be translated

options:
  -h, --help            show this help message and exit
  --model MODEL         Anthropic model to use (default: claude-3-5-sonnet-20240620)
  --temperature TEMPERATURE
                        Temperature for generation (default: 0.5)
Example Usage:
# Note: Enclose file paths or names containing spaces in quotation marks.
python Japanease-word-file-translate-to-English-Claude.py "Path/to/your/file/sss sss.docx" --model claude-3-5-sonnet-20240620 --temperature 0.2
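The script takes no output path. It writes the result next to the input file with a `translated_` prefix. The snippet below repeats that path logic from the `__main__` block as a standalone helper; the function name `derive_output_path` is introduced here only for illustration.

```python
import os


def derive_output_path(input_file: str) -> str:
    """Return the path the script writes to: same directory, 'translated_' prefix."""
    output_dir = os.path.dirname(input_file)
    return os.path.join(output_dir, f"translated_{os.path.basename(input_file)}")


# Show where the translation of the example file above would be saved.
print(derive_output_path("Path/to/your/file/sss sss.docx"))
```

An existing file with the same `translated_` name in that directory would be overwritten, so it may be worth moving earlier results aside before rerunning.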
Conclusion
This Python script translates Word files from Japanese to English at high quality within moments, even for lengthy documents, at a cost of approximately 50 yen in my own testing.
My aim is to free up more time for education and research, which I hope will lead to greater productivity and satisfaction.
The progress of AI is remarkable. I encourage readers to use technologies like this to build more efficient work environments.
Questions and suggestions for improvement are welcome in the comments section.