Translating Japanese Word documents into English with the Claude API while preserving styles

だっち

October 2, 2024, 20:05

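What I built

It came up in a meeting: a manual written in Japanese needed to be translated into English. The mood was drifting toward "let's just grind through it by hand," so I built a way to do it with the almighty Python instead, and this post introduces it.

The script reads the formatting of the original Word file, translates the Japanese into English with the Claude API, and generates a new Word file that keeps that formatting.

I previously introduced a way to do the same thing with the OpenAI API.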
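To make "keeping the formatting" more concrete: Word stores a paragraph as a sequence of runs, each with its own bold, italic, font, and color settings, and the translated English has to be spread back over those runs somehow. The sketch below shows one possible heuristic, assuming words are shared among runs in proportion to each run's original length. It works on plain lists rather than python-docx objects, and the name `allocate_words` and the sample data are hypothetical, not part of the script.

```python
def allocate_words(words: list[str], run_lengths: list[int]) -> list[list[str]]:
    """Share words among runs in proportion to each run's original length."""
    total = sum(run_lengths)
    if total == 0:
        # No usable run information: keep everything in a single group.
        return [words] if words else []
    groups = []
    start = 0
    consumed = 0
    for i, length in enumerate(run_lengths):
        consumed += length
        # The last run takes every remaining word so that nothing is dropped.
        if i == len(run_lengths) - 1:
            end = len(words)
        else:
            end = round(len(words) * consumed / total)
        groups.append(words[start:end])
        start = end
    return groups


# Hypothetical paragraph: a 4-character bold run followed by a 12-character plain run.
translated = "Warning: switch off the power before opening the cover".split()
for group in allocate_words(translated, [4, 12]):
    print(" ".join(group))
```

Because Japanese and English word order differ, a proportional split like this only approximates where bold or colored text should fall.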

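Script overview

Word document translation: reads the input Word document, translates its contents, and saves the result as a new Word document.
Formatting preservation: keeps the original document's formatting, such as font style, color, size, and alignment.
Chunk processing: splits long text into chunks of a suitable size so that each request stays within the API's limits.
Customizable: a different Claude model or temperature can be specified from the command line.
Progress display: uses the tqdm library to show translation progress in real time.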
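The chunking step can be illustrated in isolation. The sketch below is a minimal standalone example rather than the script's exact code: it assumes a character budget as a rough stand-in for the API's token limit, and it cuts after Japanese sentence punctuation because Japanese text has no spaces to split on. The name `chunk_japanese` and the budget of 20 characters are made up for the demonstration.

```python
def chunk_japanese(text: str, max_chars: int) -> list[str]:
    """Cut text into pieces of at most max_chars, preferring sentence ends."""
    chunks = []
    remaining = text
    while len(remaining) > max_chars:
        window = remaining[:max_chars]
        # Position just after the last sentence terminator inside the window.
        cut = max(window.rfind(ch) for ch in "。！？\n") + 1
        if cut <= 0:
            cut = max_chars  # No terminator in the window: fall back to a hard cut.
        chunks.append(remaining[:cut])
        remaining = remaining[cut:]
    if remaining:
        chunks.append(remaining)
    return chunks


sample = "これは説明書です。電源を入れてください。次にボタンを押します。"
for piece in chunk_japanese(sample, 20):
    print(len(piece), piece)
```

Cutting at sentence boundaries keeps each request self-contained, which matters more for translation quality than hitting the budget exactly.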

Environment setup

Getting a Claude API key

Create an account on the Anthropic website and obtain an API key. For detailed steps, see this article.

Setting up the conda environment

# environment_word_translator_claude.yml
name: word-translator-claude
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.12
  - python-docx
  - python-dotenv
  - anthropic
  - tqdm

conda env create -f environment_word_translator_claude.yml

Script

import os
import argparse
from dotenv import load_dotenv
from typing import List, Tuple, Optional
from docx import Document
from docx.shared import RGBColor, Inches
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.text.paragraph import Paragraph
import anthropic
from tqdm import tqdm

# Load environment variables from .env file
load_dotenv()

# Anthropic API key configuration
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
MAX_TOKENS = 4000  # Adjusted for Claude's capacity

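def split_text(text: str, max_chars: int) -> List[str]:
    """Split text into chunks of at most max_chars characters.

    Japanese is written without spaces, so chunks are cut after sentence-ending
    punctuation instead of on whitespace. The character count is only a rough
    proxy for the token count.
    """
    chunks = []
    remaining = text.strip()

    while len(remaining) > max_chars:
        window = remaining[:max_chars]
        # Prefer to cut just after the last sentence terminator in the window.
        cut = max(window.rfind(ch) for ch in "。！？!?\n") + 1
        if cut <= 0:
            cut = max_chars  # No terminator found: fall back to a hard cut.
        chunks.append(remaining[:cut])
        remaining = remaining[cut:]

    if remaining:
        chunks.append(remaining)

    return chunks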

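def translate_text(client: anthropic.Anthropic, text: str, model: str, temperature: float, max_tokens: int) -> str:
    """Translate Japanese text into B2-level academic English using an Anthropic Claude model."""
    chunks = split_text(text, max_tokens)
    translated_chunks = []

    system_prompt = """You are a professional translator specializing in Japanese-to-English translation. Translate the following Japanese text into academic English at B2 (upper-intermediate) level. Follow these guidelines:

1. Maintain the meaning and style of the original.
2. Use appropriate academic vocabulary and structures, but avoid overly complex or obscure terms.
3. Vary word choice and sentence structure. Avoid repeating particular words or phrases.
4. Avoid overly formal or archaic wording. Aim for clarity and readability.
5. Preserve the tone and intent of the original, whether persuasive, informative, or analytical.
6. Pay attention to context and field-specific terminology, and translate it accurately.
7. Avoid cliched phrases and expressions that are common in AI-generated text.
8. If the original contains colloquialisms or idioms, translate them into suitable English equivalents that keep the intended meaning and tone.
9. Pay attention to the nuances and implications of the original and try to reflect them in the translation.
10. If the original is ambiguous, keep that ambiguity in the translation rather than guessing.

With these guidelines in mind, translate the given Japanese text into English. Output only the translation; do not include the original Japanese or any explanation."""

    for chunk in tqdm(chunks, desc="Translating chunks", unit="chunk"):
        try:
            message = client.messages.create(
                model=model,
                max_tokens=max_tokens,
                temperature=temperature,
                system=system_prompt,
                messages=[
                    {"role": "user", "content": f"Translate the following Japanese text into English:\n\n{chunk}"}
                ]
            )
            translated_text = message.content[0].text.strip()
            # Remove stray explanations or prefixes the model may add
            translated_text = translated_text.replace("Here's the translation:", "").strip()
            translated_text = translated_text.replace("Translated text:", "").strip()
            translated_chunks.append(translated_text)
        except Exception as e:
            print(f"Translation error: {e}")
            print(f"Error type: {type(e).__name__}")
            print(f"Chunk that failed to translate: {chunk}")
            # Keep the untranslated chunk so that no content is lost
            translated_chunks.append(chunk)

    return " ".join(translated_chunks)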

def process_paragraph(client: anthropic.Anthropic, paragraph: Paragraph, model: str, temperature: float, max_tokens: int) -> Tuple[str, List[Tuple]]:
    """Translate a paragraph and collect the formatting of each of its runs."""
    translated_text = translate_text(client, paragraph.text, model, temperature, max_tokens)

    formatting = []
    for run in paragraph.runs:
        color = run.font.color.rgb
        formatting.append((
            run.bold, run.italic, run.underline,
            run.font.name, run.font.size,
            # None means "inherit from the style"; theme/auto colors are not copied.
            color if isinstance(color, RGBColor) else None,
            len(run.text),  # Used to share the translated words among the runs
        ))

    return translated_text, formatting

def apply_formatting(paragraph: Paragraph, text: str, formatting: List[Tuple]) -> None:
    """Rebuild the paragraph from runs that reuse the original run formatting.

    The translated words are shared among the original runs in proportion to each
    run's original text length. This is only a heuristic: word order differs
    between Japanese and English, so emphasis may land on different words.
    """
    paragraph.clear()
    words = text.split()
    runs = [f for f in formatting if f[6] > 0]  # Skip runs that had no text
    if not words:
        return
    if not runs:
        paragraph.add_run(text)
        return

    total_length = sum(f[6] for f in runs)
    start = 0
    consumed = 0
    for i, (bold, italic, underline, font_name, font_size, color, length) in enumerate(runs):
        consumed += length
        # The last run takes all remaining words so that nothing is dropped.
        end = len(words) if i == len(runs) - 1 else round(len(words) * consumed / total_length)
        if end <= start:
            continue
        separator = " " if end < len(words) else ""
        run = paragraph.add_run(" ".join(words[start:end]) + separator)
        run.bold = bold
        run.italic = italic
        run.underline = underline
        if font_name is not None:
            run.font.name = font_name
        if font_size is not None:
            run.font.size = font_size
        if color is not None:
            run.font.color.rgb = color
        start = end

def get_document_margins(doc: Document) -> Tuple[float, float, float, float]:
    """Get the margins of the document in inches."""
    section = doc.sections[0]
    return (
        section.top_margin.inches,
        section.bottom_margin.inches,
        section.left_margin.inches,
        section.right_margin.inches
    )

def set_document_margins(doc: Document, margins: Tuple[float, float, float, float]) -> None:
    """Set the margins of the document in inches."""
    section = doc.sections[0]
    section.top_margin = Inches(margins[0])
    section.bottom_margin = Inches(margins[1])
    section.left_margin = Inches(margins[2])
    section.right_margin = Inches(margins[3])

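def translate_word_document(input_path: str, output_path: str, api_key: str, model: str, temperature: float) -> None:
    """Translate a Word document into B2-level academic English while preserving formatting."""
    client = anthropic.Anthropic(api_key=api_key)

    doc = Document(input_path)
    translated_doc = Document()

    # Read the margins of the original and apply them to the new document
    original_margins = get_document_margins(doc)
    set_document_margins(translated_doc, original_margins)

    # Note: doc.paragraphs covers body paragraphs only; text in tables,
    # headers, and footers is not translated.
    for paragraph in tqdm(doc.paragraphs, desc="Processing paragraphs", unit="paragraph"):
        translated_text, formatting = process_paragraph(client, paragraph, model, temperature, max_tokens=MAX_TOKENS)
        translated_paragraph = translated_doc.add_paragraph()
        apply_formatting(translated_paragraph, translated_text, formatting)
        translated_paragraph.alignment = paragraph.alignment
        try:
            # Look the style up by name: a style object belongs to the source document.
            translated_paragraph.style = paragraph.style.name
        except KeyError:
            pass  # The style does not exist in the new document; keep the default.

    translated_doc.save(output_path)
    print(f"Translated Word document saved to {output_path}")
    print(f"Original margins (top, bottom, left, right): {original_margins}")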

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Translate a Word document into English.")
    parser.add_argument("input_file", help="Path to the input file to translate")
    parser.add_argument("--model", default="claude-3-5-sonnet-20240620", help="Anthropic model to use (default: claude-3-5-sonnet-20240620)")
    parser.add_argument("--temperature", type=float, default=0.5, help="Sampling temperature (default: 0.5)")
    args = parser.parse_args()

    api_key = ANTHROPIC_API_KEY
    if not api_key:
        raise ValueError("ANTHROPIC_API_KEY is not set.")

    input_file = args.input_file
    output_dir = os.path.dirname(input_file)
    output_file = os.path.join(output_dir, f"translated_{os.path.basename(input_file)}")

    translate_word_document(input_file, output_file, api_key, args.model, args.temperature)

.env

# .env
ANTHROPIC_API_KEY=sk-your-key

Usage

Help

python Japanease-word-file-translate-to-English-Claude.py --help
usage: Japanease-word-file-translate-to-English-Claude.py [-h] [--model MODEL] [--temperature TEMPERATURE] input_file

Translate a Word document into English.

positional arguments:
  input_file            Path to the input file to translate

options:
  -h, --help            show this help message and exit
  --model MODEL         Anthropic model to use (default: claude-3-5-sonnet-20240620)
  --temperature TEMPERATURE
                        Sampling temperature (default: 0.5)

Example

# If the file name or path contains spaces, wrap it in quotes.
python Japanease-word-file-translate-to-English-Claude.py "Path/to/your/file/sss sss.docx" --model claude-3-5-sonnet-20240620 --temperature 0.2

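Summary

With this Python script, even a long document can be translated into English as a high-quality Word file almost instantly, for about 50 yen or so. (I paid for it out of my own pocket 💦)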
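As a rough check on the "about 50 yen" figure, the cost can be estimated from token counts. The sketch below is a back-of-the-envelope calculation under stated assumptions: the per-million-token prices (3 USD for input, 15 USD for output), the exchange rate of 150 yen per dollar, and the token counts are all illustrative values, not measurements from an actual run. Check Anthropic's current pricing before relying on them. The function name `estimate_cost_yen` is hypothetical.

```python
def estimate_cost_yen(input_tokens: int, output_tokens: int,
                      usd_per_mtok_in: float = 3.0,
                      usd_per_mtok_out: float = 15.0,
                      yen_per_usd: float = 150.0) -> float:
    """Estimate the API cost in yen from token counts and per-million-token prices."""
    usd = (input_tokens * usd_per_mtok_in + output_tokens * usd_per_mtok_out) / 1_000_000
    return usd * yen_per_usd


# Hypothetical document: 20,000 input tokens (text plus the system prompt,
# which is resent with every request) and 15,000 output tokens of English.
print(f"{estimate_cost_yen(20_000, 15_000):.1f} yen")
```

Since the system prompt is sent again for every paragraph, documents with many short paragraphs spend a larger share of the cost on input tokens.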

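I hope this frees up more time for teaching and research and makes everyone a little happier.

AI is advancing at a remarkable pace. I encourage you to use AI tools like this one to build a more efficient working environment of your own!

If you have questions or suggestions for improvement, please let me know in the comments.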