Automating Excel Updates from Gmail Quotations (Gmail Version)

2024年10月14日 07:01

Introduction

Previously, I wrote an article about an Outlook version of this process. However, due to increasingly stringent authentication security measures, establishing IMAP connections with Outlook has become challenging. Consequently, I have transitioned to using Gmail and would like to share this updated approach.

Objective

In a previous article, I introduced a Python script that sends orders and quotation requests via email, utilizing an order form stored on the laboratory's network drive.

This time, I will present a Python script that automatically updates an Excel file with pricing information when a supplier responds to a quotation request email.

Script Overview

The script performs the following key functions:

Email Retrieval and Analysis:
- Connects to the specified mailbox using IMAP.
- Searches for emails with the subject "Quotation Request" received within the last week.
- Extracts the body of matching emails and stores it along with sender information.
Data Extraction:
- Utilizes the OpenAI GPT-4o-mini API to extract crucial information such as product name, company name, product number, list price, and quoted price from the email body.
- Structures the extracted data for further processing.
Supplier Information Matching:
- Employs a predefined mapping of suppliers and email addresses to identify the corresponding supplier for each email.
Excel File Output:
- Converts the extracted data into a pandas DataFrame.
- Appends new data to an existing Excel file, writing each item to specified columns.

Environment Setup

To set up the necessary environment, install the required packages using the following command:

conda install -c conda-forge pandas openpyxl openai python-dotenv

Script Details

The script is composed of several key components:

Import statements and environment variable loading
Email retrieval function (get_mail_data)
Data extraction function using GPT (extract_data_with_gpt)
Excel writing function (write_to_excel)
Main execution block

import imaplib
import email
from email.header import decode_header
import os
import datetime
import pandas as pd
from openai import OpenAI
from openai import BadRequestError
from email.utils import parseaddr
from pathlib import Path
import re
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Retrieve sensitive information from environment variables
username = os.getenv('GMAIL_USERNAME')
password = os.getenv('GMAIL_PASSWORD')
imap_url = 'imap.gmail.com'  # Gmail IMAP server

# Check the operating system
if os.name == 'nt':  # Windows
    excel_file_path = Path(os.getenv('EXCEL_FILE_PATH_KAKAKU_WINDOWS'))
elif os.name == 'posix':  # macOS
    excel_file_path = Path(os.getenv('EXCEL_FILE_PATH_KAKAKU_MAC'))
else:
    raise EnvironmentError("Unsupported Operating System")

# OpenAI API configuration
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

# Mapping of agencies and their email addresses
agency_mapping = {
    'Agency A': {'email1@example.com', 'email2@example.com'},
    'Agency B': {'email3@example.com'},
    'Agency C': {'email4@example.com'},
    'Agency D': {'email5@example.com'},
    'Agency E': {'email6@example.com', 'email7@example.com'},
}

def decode_mime_words(s):
    return u''.join(
        word.decode(encoding or 'utf-8') if isinstance(word, bytes) else word
        for word, encoding in decode_header(s))

# Function to retrieve data from emails
def get_mail_data(username, password, imap_url, retries=200):
    attempt = 0
    while attempt < retries:
        try:
            mail = imaplib.IMAP4_SSL(imap_url)
            mail.login(username, password)
            mail.select("INBOX")

            # Calculate the date for one week ago
            one_week_ago = (datetime.datetime.now() - datetime.timedelta(days=7)).strftime("%d-%b-%Y")

            # Search and retrieve emails from the last week
            status, messages = mail.search(None, f'SINCE "{one_week_ago}"')
            messages = messages[0].split(b' ')
            extracted_data = []

            if not messages or messages == [b'']:
                print("No matching emails found.")
                mail.logout()
                return extracted_data

            for mail_id in messages:
                res, msg = mail.fetch(mail_id, '(RFC822)')
                for response in msg:
                    if isinstance(response, tuple):
                        msg = email.message_from_bytes(response[1])
                        subject = decode_mime_words(msg.get("Subject"))

                        if 'Quotation Request' in subject:
                            from_address = decode_mime_words(msg.get("From"))
                            # Extract only the email address part
                            from_address = parseaddr(from_address)[1]
                            # For debugging: Output the retrieved email address
                            print("From address:", from_address)
                            body = ""

                            if msg.is_multipart():
                                for part in msg.walk():
                                    if part.get_content_type() == "text/plain":
                                        body = part.get_payload(decode=True)
                                        charset = part.get_content_charset()
                                        body = body.decode(charset, errors='replace') if charset else body.decode('utf-8', errors='replace')

                            else:
                                body = msg.get_payload(decode=True)
                                charset = msg.get_content_charset()
                                body = body.decode(charset, errors='replace') if charset else body.decode('utf-8', errors='replace')

                            body = body.replace("\r\n", " ")
                            agency = next((key for key, val in agency_mapping.items() if from_address in val), "Unknown")

                            extracted_data.append((body, agency))

            mail.logout()
            return extracted_data

        except imaplib.IMAP4.error as e:
            print(f"Login attempt {attempt + 1} failed: {e}")
            attempt += 1

    print("All login attempts failed.")
    return []

# Function to extract data using OpenAI API
def extract_data_with_gpt(text_data, agency_mapping):
    parsed_data = []
    print("Input Data:", text_data)  # For debugging: Input data

    for text, agency in text_data:
        # Text preprocessing
        text = str(text)
        text = text.encode('utf-8', errors='ignore').decode('utf-8')
        text = re.sub(r'[^\x20-\x7Eぁ-んァ-ン一-龥ー]', ' ', text)

        if len(text) > 5000:
            print("Text too long, truncating")
            text = text[:5000]

        print("Sending text to API:", text[:500])  # For debugging: Text sent to API
        try:
            chat_completion = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": "Please extract the product name, company, product number, list price, and quoted price from the following text. For list price and quoted price, please provide only the numerical price information. Finally, format the obtained information in a semicolon-separated format suitable for a Pandas DataFrame. The DataFrame structure should be 'Product Name;Company Name;Product Number;List Price;Quoted Price', with this as the header in row 0. Include only the semicolon-separated data in your response."},
                    {"role": "user", "content": text}
                ]
            )
            
            print("API Response:", chat_completion)  # For debugging: API response
            extracted_info = chat_completion.choices[0].message.content
            print("Extracted info:", extracted_info)  # For debugging: Extracted information

            if ';' in extracted_info:
                # For semicolon-separated cases
                data_rows = [item.split(';') for item in extracted_info.split("\n") if item.strip() and "Product Name" not in item]
            else:
                # For line break-separated cases
                product_name = [line.split(': ')[1] for line in extracted_info.split("\n") if line.startswith('Product Name')]
                company_name = [line.split(': ')[1] for line in extracted_info.split("\n") if line.startswith('Company Name')]
                product_number = [line.split(': ')[1] for line in extracted_info.split("\n") if line.startswith('Product Number')]
                list_price = [line.split(': ')[1] for line in extracted_info.split("\n") if line.startswith('List Price')]
                quoted_price = [line.split(': ')[1] for line in extracted_info.split("\n") if line.startswith('Quoted Price')]
                data_rows = list(zip(product_name, company_name, product_number, list_price, quoted_price))

            # For debugging: Output the content of data_rows
            print("Data rows before adding agency:", data_rows)
        
            # Add agency information to each row
            for row in data_rows:
                row.append(agency)

            # For debugging: Output the content of data_rows
            print("Data rows after adding agency:", data_rows)

            # Include 'Agency' in the DataFrame columns
            df = pd.DataFrame(data_rows, columns=["Product Name", "Company Name", "Product Number", "List Price", "Quoted Price", "Agency"])
            parsed_data.append(df)

        except BadRequestError as e:
            print("Error with API request:", e)

    return pd.concat(parsed_data, ignore_index=True) if parsed_data else pd.DataFrame()

def write_to_excel(df, excel_file_path):
    with pd.ExcelWriter(excel_file_path, engine='openpyxl', mode='a', if_sheet_exists='overlay') as writer:
        if 'Sheet1' in writer.book.sheetnames:
            start_row = writer.book['Sheet1'].max_row
        else:
            start_row = 0  # If the sheet is empty

        # Write each column to a specific position
        column_positions = {"Product Name": 0, "Company Name": 1, "Product Number": 2, "List Price": 4, "Quoted Price": 5, "Agency": 8}  # Excel column positions
        for column in df.columns:
            col_idx = column_positions.get(column, 0)
            df[column].to_excel(writer, sheet_name='Sheet1', startcol=col_idx, startrow=start_row, index=False, header=False)

# Main process
if __name__ == "__main__":
    mail_data = get_mail_data(username, password, imap_url)
    df = extract_data_with_gpt(mail_data, agency_mapping)
    print(df)
    write_to_excel(df, excel_file_path)

Configuration

The script utilizes a .env file to securely store sensitive information:

# Email server settings
GMAIL_USERNAME=your_email@example.com
GMAIL_PASSWORD=your_email_password

# OpenAI API configuration
OPENAI_API_KEY=your_openai_api_key

# Excel file paths
EXCEL_FILE_PATH_WINDOWS=C:\path\to\your\excel\file.xlsx
EXCEL_FILE_PATH_MAC=/path/to/your/excel/file.xlsx

Conclusion

This automated system enables efficient databasing of quoted prices from various suppliers. It's worth noting that after processing, emails are moved from the INBOX to a folder named "DONE" for organizational purposes.

The goal of this automation is to allocate more time for educational and research activities, ultimately enhancing productivity and satisfaction in academic pursuits.

I have implemented various other automated processes, which I plan to introduce in future articles.

Your questions and suggestions for improvements are welcome and appreciated. Please feel free to leave comments below.