Translating Japanese Word Documents into English Using Claude's API While Preserving Formatting

October 5, 2024, 11:14

Accomplishment

In a recent meeting, the question of translating Japanese instruction manuals into English came up. The mood was leaning toward translating them by hand, so I introduced a Python script I had written. It reads the formatting of the original Word file, translates the Japanese text into English with Claude's API, and writes a new Word file that keeps the original formatting.

Previously, I introduced a method using OpenAI's API for similar purposes.

https://note.com/preview/n7aaa3c7a0549

Script Overview

Word Document Translation: The script reads an input Word document, translates its content, and saves it as a new Word document.
Format Preservation: It maintains the original document's font styles, colors, sizes, and alignments.
Chunk Processing: Long text is split into chunks so that each request stays within the API's limits.
Customizable: The Claude model and the temperature can be specified on the command line.
Progress Display: The tqdm library is used to display translation progress in real time.
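
One detail matters for the chunking step: Japanese is normally written without spaces between words, so whitespace is not a usable split point, while sentence-ending punctuation is. The following is a minimal sketch of sentence-based chunking under that assumption. The names split_sentences and pack_sentences are hypothetical and used only for illustration, and the limit is counted in characters, not tokens.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Split after Japanese sentence-ending punctuation, keeping the punctuation."""
    return [s for s in re.split(r"(?<=[。！？])", text) if s]


def pack_sentences(sentences: List[str], limit: int) -> List[str]:
    """Greedily pack sentences into chunks of at most `limit` characters.

    A single sentence longer than `limit` is kept whole as its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Start a new chunk when this sentence would push the current one over the limit
        if current and len(current) + len(sentence) > limit:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks


sample = "電源を入れます。設定を確認します。終了します。"

# Whitespace splitting finds nothing to split on
print(sample.split())  # ['電源を入れます。設定を確認します。終了します。']

# Sentence-based packing with a 16-character limit
print(pack_sentences(split_sentences(sample), limit=16))
# ['電源を入れます。', '設定を確認します。終了します。']
```

Cutting at sentence boundaries also keeps each request self-contained, which tends to matter for translation because a sentence split in the middle loses its context.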

Environment Setup

Obtaining a Claude API Key

Create an account on the Anthropic website and acquire an API key. For detailed instructions, please refer to this article:

Conda Environment Setup

# environment_word_translator_claude.yml
name: word-translator-claude
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.12
  - python-docx
  - python-dotenv
  - anthropic
  - tqdm

conda env create -f environment_word_translator_claude.yml
conda activate word-translator-claude

Script

import os
import argparse
from dotenv import load_dotenv
from typing import List, Tuple, Optional
from docx import Document
from docx.shared import RGBColor, Inches
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.text.paragraph import Paragraph
import anthropic
from tqdm import tqdm

# Load environment variables from .env file
load_dotenv()

# Anthropic API key configuration
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
MAX_TOKENS = 4000  # Adjusted for Claude's capacity

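def split_text(text: str, max_tokens: int) -> List[str]:
    """Split text into chunks of at most max_tokens characters.

    The limit counts characters as a rough stand-in for tokens. Japanese has
    no spaces between words, so chunks are cut after sentence endings instead.
    """
    sentences = []
    start = 0
    for i, char in enumerate(text):
        if char in "。！？\n":
            sentences.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        sentences.append(text[start:])

    chunks = []
    current_chunk = ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the limit
        if current_chunk and len(current_chunk) + len(sentence) > max_tokens:
            chunks.append(current_chunk)
            current_chunk = ""
        # Hard-split a single sentence that is longer than the limit
        while len(sentence) > max_tokens:
            chunks.append(sentence[:max_tokens])
            sentence = sentence[max_tokens:]
        current_chunk += sentence

    if current_chunk:
        chunks.append(current_chunk)

    return [chunk for chunk in chunks if chunk.strip()]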

def translate_text(client: anthropic.Anthropic, text: str, model: str, temperature: float, max_tokens: int) -> str:
    """Translate text from Japanese to B2 level academic English using Anthropic's Claude model."""
    chunks = split_text(text, max_tokens)
    translated_chunks = []

    system_prompt = """You are a professional translator specializing in Japanese to English translation. Please translate the following Japanese text into B2 level (upper intermediate) academic English. Follow these guidelines:

1. Maintain the meaning and style of the original text.
2. Use appropriate academic vocabulary and structures, but avoid overly complex or obscure terms.
3. Diversify word choice and sentence structure. Avoid repetition of specific words or phrases.
4. Avoid overly formal or archaic language. Aim for clarity and readability.
5. Retain the tone and intent of the original text, whether persuasive, informative, or analytical.
6. Pay attention to context and field-specific terminology, translating them accurately.
7. Avoid clichéd phrases and expressions commonly used in AI-generated text.
8. If the original text contains colloquial expressions or idioms, translate them into appropriate English equivalents that maintain the intended meaning and tone.
9. Pay attention to the nuances and implications of the original text and strive to reflect them in the translation.
10. If there is ambiguity in the original text, maintain that ambiguity in the translation without making assumptions.

Keep these guidelines in mind as you translate the given Japanese text into English. Please output only the translated text, without including the original Japanese or any explanations."""

    for chunk in tqdm(chunks, desc="Translating chunks", unit="chunk"):
        try:
            message = client.messages.create(
                model=model,
                max_tokens=max_tokens,
                temperature=temperature,
                system=system_prompt,
                messages=[
                    {"role": "user", "content": f"Please translate the following Japanese text into English:\n\n{chunk}"}
                ]
            )
            translated_text = message.content[0].text.strip()
            # Remove excess explanations or prefixes
            translated_text = translated_text.replace("Here's the translation:", "").strip()
            translated_text = translated_text.replace("Translated text:", "").strip()
            translated_chunks.append(translated_text)
        except Exception as e:
            print(f"Translation error: {str(e)}")
            print(f"Error details: {type(e).__name__}")
            print(f"Failed to translate chunk: {chunk}")
            translated_chunks.append(chunk)

    return " ".join(translated_chunks)

def process_paragraph(client: anthropic.Anthropic, paragraph: Paragraph, model: str, temperature: float, max_tokens: int) -> Tuple[str, List[Tuple]]:
    """Translate a paragraph and record the formatting and text length of each of its runs."""
    translated_text = translate_text(client, paragraph.text, model, temperature, max_tokens)

    formatting = []
    for run in paragraph.runs:
        formatting.append((
            run.bold, run.italic, run.underline,
            # None means "inherit from the style", so keep it rather than inventing a value
            run.font.name, run.font.size, run.font.color.rgb,
            len(run.text)
        ))

    return translated_text, formatting

def apply_formatting(paragraph: Paragraph, text: str, formatting: List[Tuple]) -> None:
    """Rebuild the paragraph with one run per original run, each carrying that run's formatting.

    Translated words are shared out in proportion to the length of each original
    run. English word order differs from Japanese, so this is an approximation.
    """
    paragraph.clear()
    words = text.split()
    if not words:
        return
    if not formatting:
        # No run information to mirror (for example, an empty original paragraph)
        paragraph.add_run(text)
        return

    total_length = sum(item[6] for item in formatting) or 1
    consumed = 0
    start = 0
    for index, (bold, italic, underline, font_name, font_size, color, length) in enumerate(formatting):
        consumed += length
        # The last run takes whatever words remain
        if index == len(formatting) - 1:
            end = len(words)
        else:
            end = round(len(words) * consumed / total_length)
        if end <= start:
            continue

        run = paragraph.add_run(" ".join(words[start:end]) + (" " if end < len(words) else ""))
        run.bold = bold
        run.italic = italic
        run.underline = underline
        # Only set values that were explicit on the original run
        if font_name:
            run.font.name = font_name
        if font_size:
            run.font.size = font_size
        if color is not None:
            run.font.color.rgb = color
        start = end

def get_document_margins(doc: Document) -> Tuple[float, float, float, float]:
    """Get the margins of the document in inches."""
    section = doc.sections[0]
    return (
        section.top_margin.inches,
        section.bottom_margin.inches,
        section.left_margin.inches,
        section.right_margin.inches
    )

def set_document_margins(doc: Document, margins: Tuple[float, float, float, float]) -> None:
    """Set the margins of the document in inches."""
    section = doc.sections[0]
    section.top_margin = Inches(margins[0])
    section.bottom_margin = Inches(margins[1])
    section.left_margin = Inches(margins[2])
    section.right_margin = Inches(margins[3])

def translate_word_document(input_path: str, output_path: str, api_key: str, model: str, temperature: float) -> None:
    """Translate Word document to B2 level academic English and preserve formatting."""
    client = anthropic.Anthropic(api_key=api_key)
    
    doc = Document(input_path)
    translated_doc = Document()

    # Get and set margins
    original_margins = get_document_margins(doc)
    set_document_margins(translated_doc, original_margins)

    for paragraph in tqdm(doc.paragraphs, desc="Processing paragraphs", unit="paragraph"):
        translated_text, formatting = process_paragraph(client, paragraph, model, temperature, max_tokens=MAX_TOKENS)
        translated_paragraph = translated_doc.add_paragraph()
        apply_formatting(translated_paragraph, translated_text, formatting)
        translated_paragraph.alignment = paragraph.alignment
        translated_paragraph.style = paragraph.style

    translated_doc.save(output_path)
    print(f"Translated Word document saved to {output_path}")
    print(f"Original margins (top, bottom, left, right): {original_margins}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Translate Word documents to English.")
    parser.add_argument("input_file", help="Path to the input file to be translated")
    parser.add_argument("--model", default="claude-3-5-sonnet-20240620", help="Anthropic model to use (default: claude-3-5-sonnet-20240620)")
    parser.add_argument("--temperature", type=float, default=0.5, help="Temperature for generation (default: 0.5)")
    args = parser.parse_args()

    api_key = ANTHROPIC_API_KEY
    if not api_key:
        raise ValueError("ANTHROPIC_API_KEY is not set.")

    input_file = args.input_file
    output_dir = os.path.dirname(input_file)
    output_file = os.path.join(output_dir, f"translated_{os.path.basename(input_file)}")

    translate_word_document(input_file, output_file, api_key, args.model, args.temperature)
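
When a request fails, the script above prints the error and keeps the untranslated Japanese chunk. If temporary failures such as timeouts turn out to be common, one option is to retry a few times before falling back. The following is a minimal sketch of retrying with exponential backoff. The helper with_retries and its parameters are hypothetical names for illustration; they are not part of the script or of the anthropic library.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(); after a failure wait base_delay * 2**n seconds, then try again.

    The exception from the final attempt is re-raised so the caller can still
    fall back to the untranslated text.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


# Demonstration with a stand-in function that fails twice and then succeeds.
# The delays are recorded in a list instead of actually sleeping.
recorded_delays: List[float] = []
call_count = {"n": 0}


def flaky_translate() -> str:
    call_count["n"] += 1
    if call_count["n"] < 3:
        raise ConnectionError("temporary failure")
    return "translated text"


result = with_retries(flaky_translate, attempts=3, base_delay=1.0, sleep=recorded_delays.append)
print(result)           # translated text
print(recorded_delays)  # [1.0, 2.0]
```

In the script, the call to client.messages.create could be wrapped in a small function and passed to a helper like this one, leaving the existing except block as the last resort.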

.env

# .env
ANTHROPIC_API_KEY=sk-your-key

Usage

Help

python Japanease-word-file-translate-to-English-Claude.py --help
usage: Japanease-word-file-translate-to-English-Claude.py [-h] [--model MODEL] [--temperature TEMPERATURE] input_file

Translate Word documents to English.

positional arguments:
  input_file            Path to the input file to be translated

options:
  -h, --help            show this help message and exit
  --model MODEL         Anthropic model to use (default: claude-3-5-sonnet-20240620)
  --temperature TEMPERATURE
                        Temperature for generation (default: 0.5)

Example Usage:

# Note: Enclose file paths or names containing spaces in quotation marks.
python Japanease-word-file-translate-to-English-Claude.py "Path/to/your/file/sss sss.docx" --model claude-3-5-sonnet-20240620 --temperature 0.2
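
The translated file is written to the same directory as the input, with translated_ added to the front of the file name. The sketch below isolates that naming rule from the script's main block. The function name output_path_for is hypothetical and used only for illustration.

```python
import os


def output_path_for(input_file: str) -> str:
    """Mirror the script's naming rule: same directory, 'translated_' prefix."""
    output_dir = os.path.dirname(input_file)
    return os.path.join(output_dir, f"translated_{os.path.basename(input_file)}")


# A path with directories keeps its directory part
print(output_path_for("Path/to/your/file/sss sss.docx"))

# A bare file name stays in the current directory
print(output_path_for("manual.docx"))  # translated_manual.docx
```

Because the name is derived from the input, running the script twice on the same file overwrites the earlier translation.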

Conclusion

With this Python script, even a lengthy Word file can be translated from Japanese into English quickly and at high quality. In my own test, the cost was approximately 50 yen.

My aim is to free up more time for teaching and research, which may in turn improve both productivity and satisfaction.

The progress of AI is remarkable. I encourage you to use AI tools like this one to make your own work more efficient.

Questions and suggestions for improvement are welcome in the comments.