Training a custom model for named entity recognition (NER)

January 25, 2024, 19:31

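Example: information retrieval

Scenario: when searching for documents related to a particular keyword or topic, use NER to identify the keywords inside each document.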

import spacy

nlp = spacy.load("en_core_web_sm")

query = "Find articles about climate change."

documents = [
    "Climate change is a pressing global issue.",
    "The Paris Agreement addresses climate change.",
    "Emissions from cars contribute to climate change."
]

for text in documents:
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ == "MISC":
            print("Related document:", doc.text)

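Running this code prints nothing.
The reason is as follows.

The code runs the loaded model over each text to extract named entities (proper nouns and the like), looks for entities whose label is "MISC" (miscellaneous), and tries to print the containing text as a related document.

However, the en_core_web_sm model does not include "MISC" among its entity labels, so nothing is printed. In addition, "climate change" is a common noun phrase rather than a proper name, so the model is unlikely to tag it as an entity under any label.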

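This model uses labels such as the following:

"PERSON" (people's names)
"ORG" (organizations)
"GPE" (countries, cities, and states)
"DATE" (dates)
"TIME" (times)
and others, such as "LOC" and "PRODUCT"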

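Therefore, to extract a label that the model does not have (for example "MISC"), you need to train a custom model or extend an existing one.

For example, to train a new model that includes a custom entity label "MISC", you need to collect text containing examples of that category and train the model on it.

This process takes time and data, and it is typically done to support specialized tasks.

So when the code prints nothing, the cause is that the model does not recognize the requested label; it is not a failure of the code as such. To extract the category reliably, a suitable model has to be trained or customized.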
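Whether a pipeline's NER component knows a given label can be checked by inspecting the component's `labels` attribute. The sketch below uses a blank English pipeline so that it runs without a downloaded model; `has_ner_label` is a hypothetical helper name, not part of spaCy. Run against `en_core_web_sm`, the same check should report that "MISC" is absent.

```python
import spacy


def has_ner_label(nlp, label):
    """Return True if the pipeline has an "ner" component that knows `label`."""
    if "ner" not in nlp.pipe_names:
        return False
    return label in nlp.get_pipe("ner").labels


# A blank pipeline starts with no components at all.
nlp = spacy.blank("en")
before = has_ner_label(nlp, "MISC")

# Add an untrained NER component and register the label on it.
ner = nlp.add_pipe("ner")
ner.add_label("MISC")
after = has_ner_label(nlp, "MISC")

print("before:", before, "after:", after)
```

Checking the label list first avoids filtering on a label that can never match.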

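How to train a new model that includes the custom entity label "MISC"

Text data related to the category has to be collected and the model trained on it.
What are the concrete steps?

A custom named entity recognition (NER) model can be trained by following the steps below.
This walkthrough uses spaCy.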

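Step 1: Collect labeled data

First, collect text related to the custom category "MISC" and annotate it. Each annotation records the span that belongs to the category as character offsets (start and end, with the end exclusive) together with the label. For example:

Text: "Climate change is a pressing global issue."
Category "MISC": "Climate change"
Label data: (0, 14, "MISC")

In this way, record each relevant span of text with its start and end offsets. The annotations collected in this step are used as the training data.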
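Counting character offsets by hand is error-prone, so it can help to compute them. The following is a minimal sketch in plain Python; `find_span` is a hypothetical helper, and it assumes ASCII text and only the first occurrence of the phrase.

```python
def find_span(text, phrase):
    """Return (start, end) character offsets of the first case-insensitive
    match of `phrase` in `text`, or None if the phrase does not occur."""
    start = text.lower().find(phrase.lower())
    if start == -1:
        return None
    return (start, start + len(phrase))


texts = [
    "Climate change is a pressing global issue.",
    "The Paris Agreement addresses climate change.",
    "Emissions from cars contribute to climate change.",
]

for text in texts:
    span = find_span(text, "climate change")
    # Slice the text with the offsets to confirm they cover the phrase.
    print(span, repr(text[span[0]:span[1]]))
```

Slicing the text with the computed offsets is a quick sanity check that each annotation covers exactly the intended phrase.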

Step 2: Set up a custom NER model in spaCy

Next, set up the custom NER model in spaCy.
This includes creating a model that contains the new label "MISC". The following is the basic script that was suggested:

import spacy
from spacy.training.example import Example

# Load an existing model
nlp = spacy.load("en_core_web_sm")

# Add the new label "MISC"
nlp.add_pipe("ner", config={"labels": ["MISC"]})

# Get the NER component
ner = nlp.get_pipe("ner")

However, this raised the following error.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[12], line 8
      5 nlp = spacy.load("en_core_web_sm")
      7 # Add the new label "MISC"
----> 8 nlp.add_pipe("ner", config={"labels": ["MISC"]})
     10 # Get the NER component
     11 ner = nlp.get_pipe("ner")

File ~/anaconda3/lib/python3.11/site-packages/spacy/language.py:810, in Language.add_pipe(self, factory_name, name, before, after, first, last, source, config, raw_config, validate)
    808 name = name if name is not None else factory_name
    809 if name in self.component_names:
--> 810     raise ValueError(Errors.E007.format(name=name, opts=self.component_names))
    811 # Overriding pipe name in the config is not supported and will be ignored.
    812 if "name" in config:

ValueError: [E007] 'ner' already exists in pipeline. Existing names: ['tok2vec', 'tagger', 'parser', 'senter', 'attribute_ruler', 'lemmatizer', 'ner']

This error occurs because the loaded spaCy pipeline already contains a component named "ner" (named entity recognition).
Adding a second component under the same name is not allowed, hence the error.
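One way to avoid the duplicate-name error is to check `nlp.pipe_names` before adding the component. The sketch below uses a blank English pipeline so that it runs without a downloaded model; `get_or_add_ner` is a hypothetical helper name.

```python
import spacy


def get_or_add_ner(nlp):
    """Return the existing "ner" component, or add a new one if none exists."""
    if "ner" in nlp.pipe_names:
        return nlp.get_pipe("ner")
    return nlp.add_pipe("ner")


nlp = spacy.blank("en")
first = get_or_add_ner(nlp)   # adds the component
second = get_or_add_ner(nlp)  # reuses the component added above

# Adding the same name unconditionally reproduces the E007 error.
try:
    nlp.add_pipe("ner")
    duplicate_error = None
except ValueError as err:
    duplicate_error = str(err)

print(nlp.pipe_names)
print(duplicate_error)
```

Reusing the existing component keeps whatever it has already learned, whereas removing and re-adding it starts from an untrained component.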

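One proposed fix is to remove the existing "ner" component first and then add a new "ner" component that includes the label "MISC". Note that removing the component discards the pretrained entity recognizer.
With this change, the existing "ner" component should be removed and a new one carrying "MISC" added.

The revised code is as follows: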

import spacy

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Remove the existing "ner" component
if "ner" in nlp.pipe_names:
    nlp.remove_pipe("ner")

# Add the new label "MISC"
nlp.add_pipe("ner", config={"labels": ["MISC"]})

# Get the NER component
ner = nlp.get_pipe("ner")

However, this again raised an error:

---------------------------------------------------------------------------
ConfigValidationError                     Traceback (most recent call last)
Cell In[14], line 11
      8     nlp.remove_pipe("ner")
     10 # Add the new label "MISC"
---> 11 nlp.add_pipe("ner", config={"labels": ["MISC"]})
     13 # Get the NER component
     14 ner = nlp.get_pipe("ner")

File ~/anaconda3/lib/python3.11/site-packages/spacy/language.py:821, in Language.add_pipe(self, factory_name, name, before, after, first, last, source, config, raw_config, validate)
    817     pipe_component, factory_name = self.create_pipe_from_source(
    818         factory_name, source, name=name
    819     )
    820 else:
--> 821     pipe_component = self.create_pipe(
    822         factory_name,
    823         name=name,
    824         config=config,
    825         raw_config=raw_config,
    826         validate=validate,
    827     )
    828 pipe_index = self._get_pipe_index(before, after, first, last)
    829 self._pipe_meta[name] = self.get_factory_meta(factory_name)

File ~/anaconda3/lib/python3.11/site-packages/spacy/language.py:709, in Language.create_pipe(self, factory_name, name, config, raw_config, validate)
    706 cfg = {factory_name: config}
    707 # We're calling the internal _fill here to avoid constructing the
    708 # registered functions twice
--> 709 resolved = registry.resolve(cfg, validate=validate)
    710 filled = registry.fill({"cfg": cfg[factory_name]}, validate=validate)["cfg"]
    711 filled = Config(filled)

File ~/anaconda3/lib/python3.11/site-packages/confection/__init__.py:759, in registry.resolve(cls, config, schema, overrides, validate)
    750 @classmethod
    751 def resolve(
    752     cls,
   (...)
    757     validate: bool = True,
    758 ) -> Dict[str, Any]:
--> 759     resolved, _ = cls._make(
    760         config, schema=schema, overrides=overrides, validate=validate, resolve=True
    761     )
    762     return resolved

File ~/anaconda3/lib/python3.11/site-packages/confection/__init__.py:808, in registry._make(cls, config, schema, overrides, resolve, validate)
    806 if not is_interpolated:
    807     config = Config(orig_config).interpolate()
--> 808 filled, _, resolved = cls._fill(
    809     config, schema, validate=validate, overrides=overrides, resolve=resolve
    810 )
    811 filled = Config(filled, section_order=section_order)
    812 # Check that overrides didn't include invalid properties not in config

File ~/anaconda3/lib/python3.11/site-packages/confection/__init__.py:863, in registry._fill(cls, config, schema, validate, resolve, parent, overrides)
    861     schema.__fields__[key] = copy_model_field(field, Any)
    862 promise_schema = cls.make_promise_schema(value, resolve=resolve)
--> 863 filled[key], validation[v_key], final[key] = cls._fill(
    864     value,
    865     promise_schema,
    866     validate=validate,
    867     resolve=resolve,
    868     parent=key_parent,
    869     overrides=overrides,
    870 )
    871 reg_name, func_name = cls.get_constructor(final[key])
    872 args, kwargs = cls.parse_args(final[key])

File ~/anaconda3/lib/python3.11/site-packages/confection/__init__.py:929, in registry._fill(cls, config, schema, validate, resolve, parent, overrides)
    927         result = schema.parse_obj(validation)
    928     except ValidationError as e:
--> 929         raise ConfigValidationError(
    930             config=config, errors=e.errors(), parent=parent
    931         ) from None
    932 else:
    933     # Same as parse_obj, but without validation
    934     result = schema.construct(**validation)

ConfigValidationError: 

Config validation error
ner -> labels	extra fields not permitted
{'nlp': <spacy.lang.en.English object at 0x14cfeae10>, 'name': 'ner', 'incorrect_spans_key': None, 'labels': ['MISC'], 'model': {'@architectures': 'spacy.TransitionBasedParser.v2', 'state_type': 'ner', 'extra_state_tokens': False, 'hidden_width': 64, 'maxout_pieces': 2, 'use_upper': True, 'tok2vec': {'@architectures': 'spacy.HashEmbedCNN.v2', 'pretrained_vectors': None, 'width': 96, 'depth': 4, 'embed_size': 2000, 'window_size': 1, 'maxout_pieces': 3, 'subword_features': True}}, 'moves': None, 'scorer': {'@scorers': 'spacy.ner_scorer.v1'}, 'update_with_oracle_cut_size': 100, '@factories': 'ner'}

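The likely reason:
the attempt to pass the new label "MISC" to spaCy's NER component through its config does not match what the config accepts. The message "ner -> labels extra fields not permitted" says that the "ner" factory has no setting named "labels".
The NER component's config is validated against a strict schema, so unknown keys are rejected.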
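The settings that the "ner" factory accepts can be listed from its factory metadata, which makes it easy to see that "labels" is not among them. The sketch below uses a blank English pipeline; the exact set of keys depends on the installed spaCy version, so no particular output is assumed.

```python
import spacy

nlp = spacy.blank("en")

# Default settings of the built-in "ner" factory.
default_config = nlp.get_factory_meta("ner").default_config
print(sorted(default_config))

accepts_labels = "labels" in default_config

# Passing an unknown key is rejected during config validation,
# and the pipeline is left unchanged.
try:
    nlp.add_pipe("ner", config={"labels": ["MISC"]})
    error_name = None
except Exception as err:
    error_name = type(err).__name__

print(accepts_labels, error_name, nlp.pipe_names)
```

Since the label list is not a config setting, labels are registered on the component after it has been created, for example with `add_label`.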

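The next suggestion was that a more detailed config is needed to create an NER model containing the custom label "MISC".

According to that suggestion, the example below configures the NER component correctly:
it specifies the category "MISC" in a "labels" parameter
and then adds the NER component.

Configured this way, an NER model with the custom label "MISC" should be created correctly, or so it seemed...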

import spacy
from spacy.pipeline.textcat import Config

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Remove the existing "ner" component
if "ner" in nlp.pipe_names:
    nlp.remove_pipe("ner")

# Configure the NER component
config = Config().from_str('''
[ner]
labels = ["MISC"]
''')

# Add the NER component
nlp.add_pipe("ner", config=config)

# Get the NER component
ner = nlp.get_pipe("ner")

Another error occurred.

---------------------------------------------------------------------------
ConfigValidationError                     Traceback (most recent call last)
Cell In[15], line 18
     12 config = Config().from_str('''
     13 [ner]
     14 labels = ["MISC"]
     15 ''')
     17 # Add the NER component
---> 18 nlp.add_pipe("ner", config=config)
     20 # Get the NER component
     21 ner = nlp.get_pipe("ner")

File ~/anaconda3/lib/python3.11/site-packages/spacy/language.py:821, in Language.add_pipe(self, factory_name, name, before, after, first, last, source, config, raw_config, validate)
    817     pipe_component, factory_name = self.create_pipe_from_source(
    818         factory_name, source, name=name
    819     )
    820 else:
--> 821     pipe_component = self.create_pipe(
    822         factory_name,
    823         name=name,
    824         config=config,
    825         raw_config=raw_config,
    826         validate=validate,
    827     )
    828 pipe_index = self._get_pipe_index(before, after, first, last)
    829 self._pipe_meta[name] = self.get_factory_meta(factory_name)

File ~/anaconda3/lib/python3.11/site-packages/spacy/language.py:709, in Language.create_pipe(self, factory_name, name, config, raw_config, validate)
    706 cfg = {factory_name: config}
    707 # We're calling the internal _fill here to avoid constructing the
    708 # registered functions twice
--> 709 resolved = registry.resolve(cfg, validate=validate)
    710 filled = registry.fill({"cfg": cfg[factory_name]}, validate=validate)["cfg"]
    711 filled = Config(filled)

File ~/anaconda3/lib/python3.11/site-packages/confection/__init__.py:759, in registry.resolve(cls, config, schema, overrides, validate)
    750 @classmethod
    751 def resolve(
    752     cls,
   (...)
    757     validate: bool = True,
    758 ) -> Dict[str, Any]:
--> 759     resolved, _ = cls._make(
    760         config, schema=schema, overrides=overrides, validate=validate, resolve=True
    761     )
    762     return resolved

File ~/anaconda3/lib/python3.11/site-packages/confection/__init__.py:808, in registry._make(cls, config, schema, overrides, resolve, validate)
    806 if not is_interpolated:
    807     config = Config(orig_config).interpolate()
--> 808 filled, _, resolved = cls._fill(
    809     config, schema, validate=validate, overrides=overrides, resolve=resolve
    810 )
    811 filled = Config(filled, section_order=section_order)
    812 # Check that overrides didn't include invalid properties not in config

File ~/anaconda3/lib/python3.11/site-packages/confection/__init__.py:863, in registry._fill(cls, config, schema, validate, resolve, parent, overrides)
    861     schema.__fields__[key] = copy_model_field(field, Any)
    862 promise_schema = cls.make_promise_schema(value, resolve=resolve)
--> 863 filled[key], validation[v_key], final[key] = cls._fill(
    864     value,
    865     promise_schema,
    866     validate=validate,
    867     resolve=resolve,
    868     parent=key_parent,
    869     overrides=overrides,
    870 )
    871 reg_name, func_name = cls.get_constructor(final[key])
    872 args, kwargs = cls.parse_args(final[key])

File ~/anaconda3/lib/python3.11/site-packages/confection/__init__.py:929, in registry._fill(cls, config, schema, validate, resolve, parent, overrides)
    927         result = schema.parse_obj(validation)
    928     except ValidationError as e:
--> 929         raise ConfigValidationError(
    930             config=config, errors=e.errors(), parent=parent
    931         ) from None
    932 else:
    933     # Same as parse_obj, but without validation
    934     result = schema.construct(**validation)

ConfigValidationError: 

Config validation error
ner -> ner	extra fields not permitted
{'nlp': <spacy.lang.en.English object at 0x15ab065d0>, 'name': 'ner', 'incorrect_spans_key': None, 'model': {'@architectures': 'spacy.TransitionBasedParser.v2', 'state_type': 'ner', 'extra_state_tokens': False, 'hidden_width': 64, 'maxout_pieces': 2, 'use_upper': True, 'tok2vec': {'@architectures': 'spacy.HashEmbedCNN.v2', 'pretrained_vectors': None, 'width': 96, 'depth': 4, 'embed_size': 2000, 'window_size': 1, 'maxout_pieces': 3, 'subword_features': True}}, 'moves': None, 'ner': {'labels': ['MISC']}, 'scorer': {'@scorers': 'spacy.ner_scorer.v1'}, 'update_with_oracle_cut_size': 100, '@factories': 'ner'}

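The problem with running the code remained unresolved.
The suggestion at this point was that the spaCy version or the environment setup might be at fault,
on the view that the code itself looked correct. The error message points elsewhere, though: "ner -> ner extra fields not permitted" shows that wrapping the settings in an [ner] section made "ner" itself an unknown key in the component config.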

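The following actions were suggested:

Check the spaCy version: confirm that the installed spaCy is up to date. Older versions may not support some features or settings.
Tidy the Python environment: confirm that the interpreter and package environment are set up correctly and that the required dependencies are installed.
Re-evaluate the code: re-read the official documentation and tutorials on customizing NER models with spaCy and follow the steps one at a time.
Reinstall spaCy: if necessary, uninstall spaCy and install the latest version again.

If the problem still persists, the advice was that a version or environment issue is likely, and that asking for help on the official spaCy forum or community is recommended.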

Since the error could not be resolved, an alternative was tried.

Create a new blank spaCy pipeline and add the custom NER label to it.

import spacy
from spacy.pipeline.ner import Config

# Create a blank English pipeline
nlp = spacy.blank("en")

# Configure the NER component
config = Config().from_str('''
[ner]
labels = ["MISC"]
''')

# Add the NER component
ner = nlp.add_pipe("ner", config=config)

However, the error was not resolved.

---------------------------------------------------------------------------
ConfigValidationError                     Traceback (most recent call last)
Cell In[16], line 14
      8 config = Config().from_str('''
      9 [ner]
     10 labels = ["MISC"]
     11 ''')
     13 # Add the NER component
---> 14 ner = nlp.add_pipe("ner", config=config)

File ~/anaconda3/lib/python3.11/site-packages/spacy/language.py:821, in Language.add_pipe(self, factory_name, name, before, after, first, last, source, config, raw_config, validate)
    817     pipe_component, factory_name = self.create_pipe_from_source(
    818         factory_name, source, name=name
    819     )
    820 else:
--> 821     pipe_component = self.create_pipe(
    822         factory_name,
    823         name=name,
    824         config=config,
    825         raw_config=raw_config,
    826         validate=validate,
    827     )
    828 pipe_index = self._get_pipe_index(before, after, first, last)
    829 self._pipe_meta[name] = self.get_factory_meta(factory_name)

File ~/anaconda3/lib/python3.11/site-packages/spacy/language.py:709, in Language.create_pipe(self, factory_name, name, config, raw_config, validate)
    706 cfg = {factory_name: config}
    707 # We're calling the internal _fill here to avoid constructing the
    708 # registered functions twice
--> 709 resolved = registry.resolve(cfg, validate=validate)
    710 filled = registry.fill({"cfg": cfg[factory_name]}, validate=validate)["cfg"]
    711 filled = Config(filled)

File ~/anaconda3/lib/python3.11/site-packages/confection/__init__.py:759, in registry.resolve(cls, config, schema, overrides, validate)
    750 @classmethod
    751 def resolve(
    752     cls,
   (...)
    757     validate: bool = True,
    758 ) -> Dict[str, Any]:
--> 759     resolved, _ = cls._make(
    760         config, schema=schema, overrides=overrides, validate=validate, resolve=True
    761     )
    762     return resolved

File ~/anaconda3/lib/python3.11/site-packages/confection/__init__.py:808, in registry._make(cls, config, schema, overrides, resolve, validate)
    806 if not is_interpolated:
    807     config = Config(orig_config).interpolate()
--> 808 filled, _, resolved = cls._fill(
    809     config, schema, validate=validate, overrides=overrides, resolve=resolve
    810 )
    811 filled = Config(filled, section_order=section_order)
    812 # Check that overrides didn't include invalid properties not in config

File ~/anaconda3/lib/python3.11/site-packages/confection/__init__.py:863, in registry._fill(cls, config, schema, validate, resolve, parent, overrides)
    861     schema.__fields__[key] = copy_model_field(field, Any)
    862 promise_schema = cls.make_promise_schema(value, resolve=resolve)
--> 863 filled[key], validation[v_key], final[key] = cls._fill(
    864     value,
    865     promise_schema,
    866     validate=validate,
    867     resolve=resolve,
    868     parent=key_parent,
    869     overrides=overrides,
    870 )
    871 reg_name, func_name = cls.get_constructor(final[key])
    872 args, kwargs = cls.parse_args(final[key])

File ~/anaconda3/lib/python3.11/site-packages/confection/__init__.py:929, in registry._fill(cls, config, schema, validate, resolve, parent, overrides)
    927         result = schema.parse_obj(validation)
    928     except ValidationError as e:
--> 929         raise ConfigValidationError(
    930             config=config, errors=e.errors(), parent=parent
    931         ) from None
    932 else:
    933     # Same as parse_obj, but without validation
    934     result = schema.construct(**validation)

ConfigValidationError: 

Config validation error
ner -> ner	extra fields not permitted
{'nlp': <spacy.lang.en.English object at 0x15cb7dad0>, 'name': 'ner', 'incorrect_spans_key': None, 'model': {'@architectures': 'spacy.TransitionBasedParser.v2', 'state_type': 'ner', 'extra_state_tokens': False, 'hidden_width': 64, 'maxout_pieces': 2, 'use_upper': True, 'tok2vec': {'@architectures': 'spacy.HashEmbedCNN.v2', 'pretrained_vectors': None, 'width': 96, 'depth': 4, 'embed_size': 2000, 'window_size': 1, 'maxout_pieces': 3, 'subword_features': True}}, 'moves': None, 'ner': {'labels': ['MISC']}, 'scorer': {'@scorers': 'spacy.ner_scorer.v1'}, 'update_with_oracle_cut_size': 100, '@factories': 'ner'}

For some reason the custom NER configuration still does not work, so yet another approach was tried.

The code below uses EntityRecognizer to set the custom NER label "MISC" and trains the NER model on training data. It then runs NER on test data to identify the custom label.

import spacy
from spacy.pipeline.ner import EntityRecognizer
from spacy.training.example import Example

# Create a blank English pipeline
nlp = spacy.blank("en")

# Create the NER component
ner = EntityRecognizer(nlp.vocab, entity_types=["MISC"])

# Add the NER component to the pipeline
nlp.add_pipe(ner)

# Prepare the training data (character offsets of "climate change")
train_data = [
    ("Climate change is a pressing global issue.", {"entities": [(0, 14, "MISC")]}),
    ("The Paris Agreement addresses climate change.", {"entities": [(30, 44, "MISC")]}),
    ("Emissions from cars contribute to climate change.", {"entities": [(34, 48, "MISC")]}),
]

# Train the NER model
for text, annotations in train_data:
    doc = nlp.make_doc(text)
    example = Example.from_dict(doc, annotations)
    nlp.update([example])

# Run NER on test data
test_text = "Climate change and environmental sustainability are important topics."
doc = nlp(test_text)
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")

The error was not resolved.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[17], line 9
      6 nlp = spacy.blank("en")
      8 # Create the NER component
----> 9 ner = EntityRecognizer(nlp.vocab, entity_types=["MISC"])
     11 # Add the NER component to the pipeline
     12 nlp.add_pipe(ner)

File ~/anaconda3/lib/python3.11/site-packages/spacy/pipeline/ner.pyx:198, in spacy.pipeline.ner.EntityRecognizer.__init__()

TypeError: __init__() takes at least 2 positional arguments (1 given)

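The previous code was said to be wrong because an argument required to instantiate EntityRecognizer was missing. The following code was then offered as the correct way to set up a custom NER model.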

import spacy
from spacy.pipeline.ner import EntityRecognizer
from spacy.training.example import Example

# Create a blank English pipeline
nlp = spacy.blank("en")

# Create the NER component
ner = EntityRecognizer(nlp.vocab)

# Add the custom label "MISC"
ner.add_label("MISC")

# Add the NER component to the pipeline
nlp.add_pipe(ner)

# Prepare the training data (character offsets of "climate change")
train_data = [
    ("Climate change is a pressing global issue.", {"entities": [(0, 14, "MISC")]}),
    ("The Paris Agreement addresses climate change.", {"entities": [(30, 44, "MISC")]}),
    ("Emissions from cars contribute to climate change.", {"entities": [(34, 48, "MISC")]}),
]

# Train the NER model
for text, annotations in train_data:
    doc = nlp.make_doc(text)
    example = Example.from_dict(doc, annotations)
    nlp.update([example])

# Run NER on test data
test_text = "Climate change and environmental sustainability are important topics."
doc = nlp(test_text)
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")

However, an error occurred.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[18], line 9
      6 nlp = spacy.blank("en")
      8 # Create the NER component
----> 9 ner = EntityRecognizer(nlp.vocab)
     11 # Add the custom label "MISC"
     12 ner.add_label("MISC")

File ~/anaconda3/lib/python3.11/site-packages/spacy/pipeline/ner.pyx:198, in spacy.pipeline.ner.EntityRecognizer.__init__()

TypeError: __init__() takes at least 2 positional arguments (1 given)

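Fixes based on the error message have not removed the error. The next suggestion was that the spaCy version might be the problem, since the code may depend on a particular version, and that the following steps be tried.

Upgrade spaCy: upgrade to the latest version of spaCy by running the following command in a terminal or command prompt.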

pip install --upgrade spacy

Download the spaCy model: update the spaCy model to the latest release. The following command downloads the English model.

python -m spacy download en_core_web_sm

To run the command above inside a Jupyter Notebook cell, it needs a leading "!".

The command python -m spacy download en_core_web_sm cannot be entered as-is in a Jupyter Notebook cell.
It is a shell command, whereas notebook cells expect Python code.

It can be run in either of the following ways.

To run a shell command from a Jupyter Notebook code cell, prefix it with "!":

!python -m spacy download en_core_web_sm

This runs the shell command and downloads the English model.

Alternatively, open a command prompt or terminal outside Jupyter Notebook and run python -m spacy download en_core_web_sm there.
The downloaded model can then be loaded when running scripts inside Jupyter Notebook.
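When a model is installed from a terminal, it can end up in a different Python environment from the one the notebook kernel uses. The sketch below, run inside the notebook, prints which interpreter the kernel uses and whether the model package is visible to it. `model_installed` is a hypothetical helper name wrapping spaCy's `is_package` utility.

```python
import sys

import spacy
from spacy.util import is_package


def model_installed(name):
    """Return True if a pip-installed pipeline package `name` is visible."""
    return is_package(name)


# The interpreter that this kernel (or script) is running on.
print("Interpreter:", sys.executable)
print("spaCy version:", spacy.__version__)
print("en_core_web_sm installed:", model_installed("en_core_web_sm"))
```

If the last line prints False even though the download succeeded elsewhere, the model was most likely installed into another environment.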

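Once this is done, restart Jupyter Notebook and run the code again in the fresh session.

After these steps, check whether the custom NER setup works.
If the problem persists, the Python environment or the spaCy installation may be at fault, and cleaning up the environment or reinstalling is worth considering.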

However, the error was not resolved.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[22], line 9
      6 nlp = spacy.blank("en")
      8 # Create the NER component
----> 9 ner = EntityRecognizer(nlp.vocab)
     11 # Add the custom label "MISC"
     12 ner.add_label("MISC")

File ~/anaconda3/lib/python3.11/site-packages/spacy/pipeline/ner.pyx:198, in spacy.pipeline.ner.EntityRecognizer.__init__()

TypeError: __init__() takes at least 2 positional arguments (1 given)

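The explanation given for this error was that an old version of spaCy was in use. The traceback suggests the reverse: the code constructs EntityRecognizer directly from the vocab in the older style, whereas the installed version also requires a model argument (hence "takes at least 2 positional arguments").
In newer versions of spaCy, the way the NER component is created has changed: it is added through nlp.add_pipe("ner") rather than by instantiating EntityRecognizer directly.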
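Because the component-creation API differs between major versions, it can help to branch on the installed version. The following is a minimal sketch; the `else` branch reflects the spaCy v2 style (`create_pipe` followed by `add_pipe` with the component object) and is included as an assumption for readers on older installations.

```python
import spacy

# Major version of the installed spaCy, e.g. 3 for "3.7.2".
major = int(spacy.__version__.split(".")[0])

nlp = spacy.blank("en")

if major >= 3:
    # v3 style: add the component by its factory name.
    ner = nlp.add_pipe("ner")
else:
    # v2 style (assumption for older installations): create, then add.
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner)

# Labels are registered on the component, not passed through the config.
ner.add_label("MISC")

print(spacy.__version__, nlp.pipe_names, ner.labels)
```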

The following approach, which matches the newer spaCy API, was tried next.

import spacy
from spacy.training.example import Example

# Create a blank English pipeline
nlp = spacy.blank("en")

# Add the NER component
ner = nlp.add_pipe("ner")

# Add the custom label "MISC"
ner.add_label("MISC")

# Prepare the training data (character offsets of "climate change")
train_data = [
    ("Climate change is a pressing global issue.", {"entities": [(0, 14, "MISC")]}),
    ("The Paris Agreement addresses climate change.", {"entities": [(30, 44, "MISC")]}),
    ("Emissions from cars contribute to climate change.", {"entities": [(34, 48, "MISC")]}),
]

# Train the NER model
for text, annotations in train_data:
    doc = nlp.make_doc(text)
    example = Example.from_dict(doc, annotations)
    nlp.update([example])

# Run NER on test data
test_text = "Climate change and environmental sustainability are important topics."
doc = nlp(test_text)
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")

However, an error occurred.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[23], line 24
     22     doc = nlp.make_doc(text)
     23     example = Example.from_dict(doc, annotations)
---> 24     nlp.update([example])
     26 # Run NER on test data
     27 test_text = "Climate change and environmental sustainability are important topics."

File ~/anaconda3/lib/python3.11/site-packages/spacy/language.py:1193, in Language.update(self, examples, _, drop, sgd, losses, component_cfg, exclude, annotates)
   1190 for name, proc in self.pipeline:
   1191     # ignore statements are used here because mypy ignores hasattr
   1192     if name not in exclude and hasattr(proc, "update"):
-> 1193         proc.update(examples, sgd=None, losses=losses, **component_cfg[name])  # type: ignore
   1194     if sgd not in (None, False):
   1195         if (
   1196             name not in exclude
   1197             and isinstance(proc, ty.TrainableComponent)
   1198             and proc.is_trainable
   1199             and proc.model not in (True, False, None)
   1200         ):

File ~/anaconda3/lib/python3.11/site-packages/spacy/pipeline/transition_parser.pyx:411, in spacy.pipeline.transition_parser.Parser.update()

File ~/anaconda3/lib/python3.11/site-packages/spacy/pipeline/transition_parser.pyx:671, in spacy.pipeline.transition_parser.Parser._init_gold_batch()

File ~/anaconda3/lib/python3.11/site-packages/spacy/pipeline/_parser_internals/ner.pyx:297, in spacy.pipeline._parser_internals.ner.BiluoPushDown.init_gold()

File ~/anaconda3/lib/python3.11/site-packages/spacy/pipeline/_parser_internals/ner.pyx:61, in spacy.pipeline._parser_internals.ner.BiluoGold.__init__()

File ~/anaconda3/lib/python3.11/site-packages/spacy/pipeline/_parser_internals/ner.pyx:89, in spacy.pipeline._parser_internals.ner.create_gold_state()

File ~/anaconda3/lib/python3.11/site-packages/spacy/pipeline/_parser_internals/ner.pyx:201, in spacy.pipeline._parser_internals.ner.BiluoPushDown.lookup_transition()

KeyError: "[E022] Could not find a transition with the name 'O' in the NER model."

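The explanation offered for this error was that the training data or annotations might be at fault: in particular, that the entity labels were not set correctly or that the label "O" was missing.

The suggested checks were:

Confirm that the training annotations are correct. They must be a list of (start, end, label) tuples. The advice added that text without entities should be given the label "O"; in spaCy, however, "O" is the internal tag for tokens outside any entity, and text without entities is normally annotated with an empty entity list.
In the custom NER setup, use ner.add_label("O") instead of ner.add_label("MISC"), on the grounds that "O" is the default label when no entity is present.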
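Independently of the "O" question, annotation offsets can be validated before training. The sketch below uses spaCy's `offsets_to_biluo_tags` to compare a misaligned span with an aligned one. The pair (4, 18) is used as an illustration of a bad offset: it stops one character short of the end of "Paris Agreement". Tokens that overlap a misaligned span are tagged "-", and spaCy also emits a warning for them, which is silenced here.

```python
import warnings

import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")
text = "The Paris Agreement addresses climate change."
doc = nlp.make_doc(text)

# Misaligned: the span ends in the middle of the token "Agreement".
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    bad_tags = offsets_to_biluo_tags(doc, [(4, 18, "MISC")])

# Aligned: the span covers exactly the tokens "climate" and "change".
good_tags = offsets_to_biluo_tags(doc, [(30, 44, "MISC")])

print(bad_tags)
print(good_tags)
```

Any "-" in the result means the annotation does not line up with token boundaries and will be ignored during training.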

The revised code that was suggested is as follows:

import spacy
from spacy.training.example import Example

# Create a blank English pipeline
nlp = spacy.blank("en")

# Add the NER component
ner = nlp.add_pipe("ner")

# Add the default "O" label
ner.add_label("O")

# Prepare the training data (character offsets of "climate change")
train_data = [
    ("Climate change is a pressing global issue.", {"entities": [(0, 14, "MISC")]}),
    ("The Paris Agreement addresses climate change.", {"entities": [(30, 44, "MISC")]}),
    ("Emissions from cars contribute to climate change.", {"entities": [(34, 48, "MISC")]}),
]

# Train the NER model
for text, annotations in train_data:
    doc = nlp.make_doc(text)
    example = Example.from_dict(doc, annotations)
    nlp.update([example])

# Run NER on test data
test_text = "Climate change and environmental sustainability are important topics."
doc = nlp(test_text)
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")

This did not resolve the error either.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[24], line 24
     22     doc = nlp.make_doc(text)
     23     example = Example.from_dict(doc, annotations)
---> 24     nlp.update([example])
     26 # Run NER on test data
     27 test_text = "Climate change and environmental sustainability are important topics."

File ~/anaconda3/lib/python3.11/site-packages/spacy/language.py:1193, in Language.update(self, examples, _, drop, sgd, losses, component_cfg, exclude, annotates)
   1190 for name, proc in self.pipeline:
   1191     # ignore statements are used here because mypy ignores hasattr
   1192     if name not in exclude and hasattr(proc, "update"):
-> 1193         proc.update(examples, sgd=None, losses=losses, **component_cfg[name])  # type: ignore
   1194     if sgd not in (None, False):
   1195         if (
   1196             name not in exclude
   1197             and isinstance(proc, ty.TrainableComponent)
   1198             and proc.is_trainable
   1199             and proc.model not in (True, False, None)
   1200         ):

File ~/anaconda3/lib/python3.11/site-packages/spacy/pipeline/transition_parser.pyx:411, in spacy.pipeline.transition_parser.Parser.update()

File ~/anaconda3/lib/python3.11/site-packages/spacy/pipeline/transition_parser.pyx:671, in spacy.pipeline.transition_parser.Parser._init_gold_batch()

File ~/anaconda3/lib/python3.11/site-packages/spacy/pipeline/_parser_internals/ner.pyx:297, in spacy.pipeline._parser_internals.ner.BiluoPushDown.init_gold()

File ~/anaconda3/lib/python3.11/site-packages/spacy/pipeline/_parser_internals/ner.pyx:61, in spacy.pipeline._parser_internals.ner.BiluoGold.__init__()

File ~/anaconda3/lib/python3.11/site-packages/spacy/pipeline/_parser_internals/ner.pyx:89, in spacy.pipeline._parser_internals.ner.create_gold_state()

File ~/anaconda3/lib/python3.11/site-packages/spacy/pipeline/_parser_internals/ner.pyx:201, in spacy.pipeline._parser_internals.ner.BiluoPushDown.lookup_transition()

KeyError: "[E022] Could not find a transition with the name 'O' in the NER model."

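This error indicates that a transition named "O" cannot be found inside the NER model. "O" is normally the tag for tokens that are not part of any entity, and the explanation offered was that the error might stem from a problem with the model's internal transitions.

The suggestion was that the error could be avoided, as a temporary workaround, by adding the label "O". It came with the caveat that this is only a stopgap, because "O" is normally set automatically when no entity is present, and that if the problem persists the spaCy version and the installation should be checked.

The suggested code, which is the same as in the previous attempt, is as follows.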

import spacy
from spacy.training.example import Example

# Create a blank English pipeline
nlp = spacy.blank("en")

# Add the NER component
ner = nlp.add_pipe("ner")

# Add the default "O" label
ner.add_label("O")

# Prepare the training data (character offsets of "climate change")
train_data = [
    ("Climate change is a pressing global issue.", {"entities": [(0, 14, "MISC")]}),
    ("The Paris Agreement addresses climate change.", {"entities": [(30, 44, "MISC")]}),
    ("Emissions from cars contribute to climate change.", {"entities": [(34, 48, "MISC")]}),
]

# Train the NER model
for text, annotations in train_data:
    doc = nlp.make_doc(text)
    example = Example.from_dict(doc, annotations)
    nlp.update([example])

# Run NER on test data
test_text = "Climate change and environmental sustainability are important topics."
doc = nlp(test_text)
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")

The error was not avoided.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[25], line 24
     22     doc = nlp.make_doc(text)
     23     example = Example.from_dict(doc, annotations)
---> 24     nlp.update([example])
     26 # Run NER on test data
     27 test_text = "Climate change and environmental sustainability are important topics."

File ~/anaconda3/lib/python3.11/site-packages/spacy/language.py:1193, in Language.update(self, examples, _, drop, sgd, losses, component_cfg, exclude, annotates)
   1190 for name, proc in self.pipeline:
   1191     # ignore statements are used here because mypy ignores hasattr
   1192     if name not in exclude and hasattr(proc, "update"):
-> 1193         proc.update(examples, sgd=None, losses=losses, **component_cfg[name])  # type: ignore
   1194     if sgd not in (None, False):
   1195         if (
   1196             name not in exclude
   1197             and isinstance(proc, ty.TrainableComponent)
   1198             and proc.is_trainable
   1199             and proc.model not in (True, False, None)
   1200         ):

File ~/anaconda3/lib/python3.11/site-packages/spacy/pipeline/transition_parser.pyx:411, in spacy.pipeline.transition_parser.Parser.update()

File ~/anaconda3/lib/python3.11/site-packages/spacy/pipeline/transition_parser.pyx:671, in spacy.pipeline.transition_parser.Parser._init_gold_batch()

File ~/anaconda3/lib/python3.11/site-packages/spacy/pipeline/_parser_internals/ner.pyx:297, in spacy.pipeline._parser_internals.ner.BiluoPushDown.init_gold()

File ~/anaconda3/lib/python3.11/site-packages/spacy/pipeline/_parser_internals/ner.pyx:61, in spacy.pipeline._parser_internals.ner.BiluoGold.__init__()

File ~/anaconda3/lib/python3.11/site-packages/spacy/pipeline/_parser_internals/ner.pyx:89, in spacy.pipeline._parser_internals.ner.create_gold_state()

File ~/anaconda3/lib/python3.11/site-packages/spacy/pipeline/_parser_internals/ner.pyx:201, in spacy.pipeline._parser_internals.ner.BiluoPushDown.lookup_transition()

KeyError: "[E022] Could not find a transition with the name 'O' in the NER model."

The same error has occurred again. It indicates that no transition named "O" could be found in the NER model. "O" (outside) is the tag assigned to tokens that are not part of any entity.
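One commonly reported cause of E022 is an NER component that was added to the pipeline but never initialized, so its transition table (which includes "O") is still empty when nlp.update is called. Whether that is what happened in the run above is an assumption. The sketch below uses a blank English pipeline instead of en_core_web_sm, also an assumption, and only the one training sentence whose offsets are known to align.

```python
import spacy
from spacy.training import Example

# Assumption: start from a blank English pipeline, not en_core_web_sm.
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("MISC")

train_data = [
    ("Climate change is a pressing global issue.", {"entities": [(0, 14, "MISC")]}),
]
examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in train_data]

# initialize() builds the NER transition system from the examples
# and returns an optimizer for the following updates.
optimizer = nlp.initialize(lambda: examples)

losses = {}
nlp.update(examples, sgd=optimizer, losses=losses)
print(ner.labels, losses)
```

With a single example this only demonstrates that the update step runs without E022; it is far too little data to train a useful model.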

This error may be related to the spaCy version or configuration. Try the following to see whether it resolves the problem.

Check the spaCy version and update to the latest release.
The following command updates spaCy to the latest version.

pip install -U spacy

Before creating a custom NER model, try loading spaCy's default NER model.
The following command downloads the default model.

!python -m spacy download en_core_web_sm

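After that, try the code further below.

If the problem persists after these steps, it may be related to the spaCy configuration or the environment.
In that case, consider consulting the official spaCy documentation or the community forums to identify and resolve the issue.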

import spacy
from spacy.training.example import Example

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Prepare the training data
train_data = [
    ("Climate change is a pressing global issue.", {"entities": [(0, 14, "MISC")]}),
    ("The Paris Agreement addresses climate change.", {"entities": [(4, 18, "MISC")]}),
    ("Emissions from cars contribute to climate change.", {"entities": [(37, 51, "MISC")]}),
]

# Train the NER model
for text, annotations in train_data:
    doc = nlp.make_doc(text)
    example = Example.from_dict(doc, annotations)
    nlp.update([example])

# Run NER on test data
test_text = "Climate change and environmental sustainability are important topics."
doc = nlp(test_text)
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")

This time the output is a warning rather than an error, so this may be a small step forward.

/Users/tomo/anaconda3/lib/python3.11/site-packages/spacy/training/iob_utils.py:149: UserWarning: [W030] Some entities could not be aligned in the text "The Paris Agreement addresses climate change." with entities "[(4, 18, 'MISC')]". Use `spacy.training.offsets_to_biluo_tags(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.
  warnings.warn(
/Users/tomo/anaconda3/lib/python3.11/site-packages/spacy/training/iob_utils.py:149: UserWarning: [W030] Some entities could not be aligned in the text "Emissions from cars contribute to climate change." with entities "[(37, 51, 'MISC')]". Use `spacy.training.offsets_to_biluo_tags(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.
  warnings.warn(

This warning indicates that some entities in the training data are not aligned with the text. spaCy emits it when an entity's character offsets do not coincide with the token boundaries produced by its tokenizer. Entities flagged this way are ignored during training.

To resolve this, try the following steps.

Check the entity annotations in the training data and confirm that each entity's start and end offsets are correct. Fix any that are wrong.
To check how the entities align with the tokenized text, use the spacy.training.offsets_to_biluo_tags function, as in the code below.

import spacy

nlp = spacy.blank("en")
train_data = [
    ("Climate change is a pressing global issue.", {"entities": [(0, 14, "MISC")]}),
    ("The Paris Agreement addresses climate change.", {"entities": [(4, 18, "MISC")]}),
    ("Emissions from cars contribute to climate change.", {"entities": [(37, 51, "MISC")]}),
]

for text, annotations in train_data:
    doc = nlp.make_doc(text)
    entities = annotations["entities"]
    biluo_tags = spacy.training.offsets_to_biluo_tags(doc, entities)
    print(f"Text: {text}")
    print(f"BILUO Tags: {biluo_tags}")

# Once the alignment between text and entities has been confirmed, try training the model again.

The output is as follows:

Text: Climate change is a pressing global issue.
BILUO Tags: ['B-MISC', 'L-MISC', 'O', 'O', 'O', 'O', 'O', 'O']
Text: The Paris Agreement addresses climate change.
BILUO Tags: ['O', '-', '-', 'O', 'O', 'O', 'O']
Text: Emissions from cars contribute to climate change.
BILUO Tags: ['O', 'O', 'O', 'O', 'O', '-', '-', '-']

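Evaluating this result:

The BILUO tags in the output above are evaluated below.

BILUO tags label each token in a text to show whether it belongs to an entity. The labels mean the following.

B-MISC: first token of an entity
I-MISC: token inside an entity
L-MISC: last token of an entity
U-MISC: single-token entity
O: not part of an entity
-: token that could not be aligned with the annotation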

Evaluation:

The BILUO tags for "Climate change is a pressing global issue." are ["B-MISC", "L-MISC", "O", "O", "O", "O", "O", "O"]. "Climate change" is recognized as an entity, so the annotation is correct.
The BILUO tags for "The Paris Agreement addresses climate change." are ["O", "-", "-", "O", "O", "O", "O"]. The "-" tags fall on "Paris" and "Agreement", not on "climate change": the span (4, 18) starts at "Paris" and ends one character short of the end of "Agreement", so it does not match the token boundaries. "climate" and "change" are tagged "O". The offsets need to be corrected.
The BILUO tags for "Emissions from cars contribute to climate change." are ["O", "O", "O", "O", "O", "-", "-", "-"]. The last three tokens, "climate", "change", and ".", are tagged "-" because the span (37, 51) starts in the middle of "climate" and runs past the end of the 49-character text.
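A quick way to see what each annotated span actually covers is to slice the text with the same character offsets. The sketch below uses plain Python string slicing only, with the three texts and offsets taken from the training data above.

```python
# Slice each training text with its annotated (start, end) offsets.
train_data = [
    ("Climate change is a pressing global issue.", (0, 14)),
    ("The Paris Agreement addresses climate change.", (4, 18)),
    ("Emissions from cars contribute to climate change.", (37, 51)),
]

covered = [text[start:end] for text, (start, end) in train_data]

for (text, (start, end)), piece in zip(train_data, covered):
    print(f"({start}, {end}) -> {piece!r}")

# (0, 14) -> 'Climate change'
# (4, 18) -> 'Paris Agreemen'
# (37, 51) -> 'mate change.'
```

Only the first span covers whole words, which matches the single sentence that received B-MISC and L-MISC tags.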

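In conclusion, the entity annotations in the training data have alignment problems.
It is important to fix the alignment so that the entity annotations are accurate.
Retraining after the fix should let the model actually learn from these annotations instead of ignoring them.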
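One way to avoid hand-counted offsets is to derive them from the phrase itself. The sketch below assumes the intended entity in every sentence is the phrase "climate change", and find_span is a hypothetical helper introduced here for illustration, not a spaCy function. It relies on a case-insensitive str.find, which is safe for these ASCII texts.

```python
def find_span(text, phrase, label):
    """Return (start, end, label) for the first case-insensitive match of phrase."""
    start = text.lower().find(phrase.lower())
    if start == -1:
        raise ValueError(f"{phrase!r} not found in {text!r}")
    return (start, start + len(phrase), label)


texts = [
    "Climate change is a pressing global issue.",
    "The Paris Agreement addresses climate change.",
    "Emissions from cars contribute to climate change.",
]

# Build training data with computed offsets instead of hand-written ones.
train_data = [
    (text, {"entities": [find_span(text, "climate change", "MISC")]})
    for text in texts
]

for text, annotations in train_data:
    print(annotations["entities"])

# [(0, 14, 'MISC')]
# [(30, 44, 'MISC')]
# [(34, 48, 'MISC')]
```

The resulting tuples can be passed back through spacy.training.offsets_to_biluo_tags to confirm that no "-" tags remain before training again.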

It is not finished yet, but progress has been made.

The rest will be covered in a follow-up post.

Because the error could not be resolved, the follow-up tries a different approach.
