Trying Out Vector Search with Chroma


Introducing Chroma

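This article introduces text-based and image-based search with Chroma.

About a year ago I wrote an article on Chroma as a vector search tool.

Compared with a year ago, there do not seem to have been any major updates, but the text-based and image-based search methods are now presented as Google Colab notebooks, which caught my interest.

Although it is not covered in this article, Chroma appears to have spent the past year focusing on integration with tools such as LangChain and LlamaIndex.

Chroma can also apparently be used in client/server mode, which I would like to try when I get the chance.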


Running Chroma

This section walks through code for text-based and image-based retrieval with Chroma, the search step that RAG builds on.

Text-based search

This part uses the example published on the Chroma site.

The code runs on Google Colab with a CPU runtime.

%pip install -Uq chromadb numpy datasets

# Get the SciQ dataset from HuggingFace
from datasets import load_dataset

dataset = load_dataset("sciq", split="train")

# Filter the dataset to only include questions with a support
dataset = dataset.filter(lambda x: x["support"] != "")

print("Number of questions with support: ", len(dataset))

First, take a look at what the dataset contains.

print(dataset)

Dataset({ features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'], num_rows: 10481 })

Output
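To make the filtering step concrete, here is a minimal sketch that applies the same `support != ""` predicate to plain dictionaries instead of a Hugging Face dataset. The two records are hypothetical stand-ins that only mimic some of the SciQ column names shown above; they are not taken from the real data.

```python
# Hypothetical stand-in records that mimic some of the SciQ columns shown above.
records = [
    {
        "question": "What is the stored food in a seed called?",
        "correct_answer": "endosperm",
        "support": "The stored food in a seed is called endosperm.",
    },
    {
        "question": "A made-up question that has no supporting passage?",
        "correct_answer": "n/a",
        "support": "",
    },
]


def has_support(record: dict) -> bool:
    """Same predicate as the lambda passed to dataset.filter in the article."""
    return record["support"] != ""


# Keep only the records that come with a non-empty supporting passage.
with_support = [r for r in records if has_support(r)]
print("Number of questions with support:", len(with_support))
```

Rows without a supporting passage would add nothing to a collection of evidence texts, which is presumably why the example drops them before loading anything into Chroma.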

Next, load the downloaded dataset into Chroma.

# Import Chroma and instantiate a client. The default Chroma client is ephemeral, meaning it will not save to disk.
import chromadb

client = chromadb.Client()

# Create a new Chroma collection to store the supporting evidence. We don't need to specify an embedding function, and the default will be used.
collection = client.create_collection("sciq_supports")

# Embed and store the first 100 supports for this demo
collection.add(
    ids=[str(i) for i in range(0, 100)],  # IDs are just strings
    documents=dataset["support"][:100],
    metadatas=[{"type": "support"} for _ in range(0, 100)],
)
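The demo stores only the first 100 supports. If you wanted to load more of the dataset, one possible approach is to build the `ids`, `documents`, and `metadatas` arguments in batches. The sketch below is an assumption of mine rather than part of the Chroma example: `iter_batches` is a hypothetical helper, the batch size of 100 is arbitrary, and the demo documents are placeholder strings.

```python
from typing import Iterator


def iter_batches(documents: list[str], batch_size: int = 100) -> Iterator[dict]:
    """Yield keyword arguments for collection.add, one batch at a time."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(documents), batch_size):
        chunk = documents[start:start + batch_size]
        yield {
            # IDs continue across batches so they stay unique in the collection.
            "ids": [str(start + offset) for offset in range(len(chunk))],
            "documents": chunk,
            "metadatas": [{"type": "support"} for _ in chunk],
        }


# Placeholder data; with the real dataset this would be dataset["support"].
demo_documents = [f"support text {i}" for i in range(250)]
batches = list(iter_batches(demo_documents, batch_size=100))
print([len(b["ids"]) for b in batches])  # [100, 100, 50]

# With a real collection (not run here), each batch could be passed as:
#     collection.add(**batch)
```

Keeping the three lists in one dictionary per batch makes it harder for their lengths to drift apart, which `collection.add` would otherwise reject.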

Next, query the data.

results = collection.query(
    query_texts=dataset["question"][:10],
    n_results=1)

# Print the question and the corresponding support
for i, q in enumerate(dataset['question'][:10]):
    print(f"Question: {q}")
    print(f"Retrieved support: {results['documents'][i][0]}")
    print()
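Conceptually, a query like this embeds each question as a vector and returns the stored documents whose vectors are closest to it. The sketch below illustrates that idea in plain Python with hand-written three-dimensional toy vectors and cosine similarity. It is only an illustration: the vectors, the `nearest` helper, and the choice of cosine similarity are my assumptions, and Chroma's actual index and distance metric may differ.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query_vec: list[float], stored: dict, n_results: int = 1) -> list[str]:
    """Return the ids of the n_results stored vectors most similar to the query."""
    ranked = sorted(
        stored.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:n_results]]


# Toy "embeddings" written by hand; a real embedding model produces these.
toy_vectors = {
    "volcano": [0.9, 0.1, 0.0],
    "seed": [0.1, 0.9, 0.1],
    "decay": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]

print(nearest(toy_query, toy_vectors))               # ['volcano']
print(nearest(toy_query, toy_vectors, n_results=2))  # ['volcano', 'seed']
```

This brute-force scan compares the query with every stored vector, which is fine for a toy example but is exactly the cost that a vector database tries to avoid at scale.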

Running the Chroma query above produces the following output.

Question: What type of organism is commonly used in preparation of foods such as cheese and yogurt?
Retrieved support: Agents of Decomposition The fungus-like protist saprobes are specialized to absorb nutrients from nonliving organic matter, such as dead organisms or their wastes. For instance, many types of oomycetes grow on dead animals or algae. Saprobic protists have the essential function of returning inorganic nutrients to the soil and water. This process allows for new plant growth, which in turn generates sustenance for other organisms along the food chain. Indeed, without saprobe species, such as protists, fungi, and bacteria, life would cease to exist as all organic carbon became “tied up” in dead organisms.

Question: What phenomenon makes global winds blow northeast to southwest or the reverse in the northern hemisphere and northwest to southeast or the reverse in the southern hemisphere?
Retrieved support: Without Coriolis Effect the global winds would blow north to south or south to north. But Coriolis makes them blow northeast to southwest or the reverse in the Northern Hemisphere. The winds blow northwest to southeast or the reverse in the southern hemisphere.

Question: Changes from a less-ordered state to a more-ordered state (such as a liquid to a solid) are always what?
Retrieved support: Summary Changes of state are examples of phase changes, or phase transitions. All phase changes are accompanied by changes in the energy of a system. Changes from a more-ordered state to a less-ordered state (such as a liquid to a gas) areendothermic. Changes from a less-ordered state to a more-ordered state (such as a liquid to a solid) are always exothermic. The conversion of a solid to a liquid is called fusion (or melting). The energy required to melt 1 mol of a substance is its enthalpy of fusion (ΔHfus). The energy change required to vaporize 1 mol of a substance is the enthalpy of vaporization (ΔHvap). The direct conversion of a solid to a gas is sublimation. The amount of energy needed to sublime 1 mol of a substance is its enthalpy of sublimation (ΔHsub) and is the sum of the enthalpies of fusion and vaporization. Plots of the temperature of a substance versus heat added or versus heating time at a constant rate of heating are calledheating curves. Heating curves relate temperature changes to phase transitions. A superheated liquid, a liquid at a temperature and pressure at which it should be a gas, is not stable. A cooling curve is not exactly the reverse of the heating curve because many liquids do not freeze at the expected temperature. Instead, they form a supercooled liquid, a metastable liquid phase that exists below the normal melting point. Supercooled liquids usually crystallize on standing, or adding a seed crystal of the same or another substance can induce crystallization.

Question: What is the least dangerous radioactive decay?
Retrieved support: All radioactive decay is dangerous to living things, but alpha decay is the least dangerous.

Question: Kilauea in hawaii is the world’s most continuously active volcano. very active volcanoes characteristically eject red-hot rocks and lava rather than this?
Retrieved support: Example 3.5 Calculating Projectile Motion: Hot Rock Projectile Kilauea in Hawaii is the world’s most continuously active volcano. Very active volcanoes characteristically eject red-hot rocks and lava rather than smoke and ash. Suppose a large rock is ejected from the volcano with a speed of 25.0 m/s and at an angle 35.0º above the horizontal, as shown in Figure 3.40. The rock strikes the side of the volcano at an altitude 20.0 m lower than its starting point. (a) Calculate the time it takes the rock to follow this path. (b) What are the magnitude and direction of the rock’s velocity at impact?.

Question: When a meteoroid reaches earth, what is the remaining object called?
Retrieved support: Meteoroids are smaller than asteroids, ranging from the size of boulders to the size of sand grains. When meteoroids enter Earth’s atmosphere, they vaporize, creating a trail of glowing gas called a meteor. If any of the meteoroid reaches Earth, the remaining object is called a meteorite.

Question: What kind of a reaction occurs when a substance reacts quickly with oxygen?
Retrieved support: A combustion reaction occurs when a substance reacts quickly with oxygen (O 2 ). For example, in the Figure below , charcoal is combining with oxygen. Combustion is commonly called burning, and the substance that burns is usually referred to as fuel. The products of a complete combustion reaction include carbon dioxide (CO 2 ) and water vapor (H 2 O). The reaction typically gives off heat and light as well. The general equation for a complete combustion reaction is:.

Question: Organisms categorized by what species descriptor demonstrate a version of allopatric speciation and have limited regions of overlap with one another, but where they overlap they interbreed successfully?.
Retrieved support: Ring species Ring species demonstrate a version of allopatric speciation. Imagine populations of the species A. Over the geographic range of A there exist a number of subpopulations. These subpopulations (A1 to A5) and (Aa to Ae) have limited regions of overlap with one another but where they overlap they interbreed successfully. But populations A5 and Ae no longer interbreed successfully – are these populations separate species?  In this case, there is no clear-cut answer, but it is likely that in the link between the various populations will be broken and one or more species may form in the future. Consider the black bear Ursus americanus. Originally distributed across all of North America, its distribution is now much more fragmented. Isolated populations are free to adapt to their own particular environments and migration between populations is limited. Clearly the environment in Florida is different from that in Mexico, Alaska, or Newfoundland. Different environments will favor different adaptations. If, over time, these populations were to come back into contact with one another, they might or might not be able to interbreed successfully - reproductive isolation may occur and one species may become many.

Question: Alpha emission is a type of what?
Retrieved support: One type of radioactivity is alpha emission. What is an alpha particle? What happens to an alpha particle after it is emitted from an unstable nucleus?.

Question: What is the stored food in a seed called?
Retrieved support: The stored food in a seed is called endosperm . It nourishes the embryo until it can start making food on its own.
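In the output above, the first question (about cheese and yogurt) retrieves a passage on saprobic protists that mentions neither, while most of the other questions retrieve a passage that restates the question. Because the supports were stored with ids `str(i)` and the i-th question belongs to the i-th row, one way to quantify this is to check whether the top hit for question i has id `str(i)`. The sketch below assumes the query result exposes the ranked ids as one inner list per query; the ids in `mock_ids` are invented for illustration and are not the real results.

```python
def hit_rate_at_1(retrieved_ids: list[list[str]], expected_ids: list[str]) -> float:
    """Fraction of queries whose top-ranked id equals the expected id."""
    if len(retrieved_ids) != len(expected_ids):
        raise ValueError("need exactly one expected id per query")
    if not expected_ids:
        return 0.0
    hits = sum(
        1
        for got, want in zip(retrieved_ids, expected_ids)
        if got and got[0] == want
    )
    return hits / len(expected_ids)


# Invented example: the first query misses, the other three hit.
mock_ids = [["57"], ["1"], ["2"], ["3"]]
expected = ["0", "1", "2", "3"]
print(hit_rate_at_1(mock_ids, expected))  # 0.75

# With the real results (not run here), this might look like:
#     hit_rate_at_1(results["ids"], [str(i) for i in range(10)])
```

A single number like this is only a rough check on ten questions, but it makes it easier to compare embedding functions or values of `n_results` than reading the passages by eye.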



Image-based search

This part also runs on Google Colab with a CPU runtime.
The code below is slightly modified from the original.

First, prepare the images that make up the dataset.

!pip install datasets
import os
from datasets import load_dataset
from matplotlib import pyplot as plt

dataset = load_dataset(path="detection-datasets/coco", split="train", streaming=True)

IMAGE_FOLDER = "images"
N_IMAGES = 20

# For plotting
plot_cols = 5
plot_rows = N_IMAGES // plot_cols
# figsize is (width, height), so the column count sets the width
fig, axes = plt.subplots(plot_rows, plot_cols, figsize=(plot_cols*2, plot_rows*2))
axes = axes.flatten()

# Write the images to a folder
dataset_iter = iter(dataset)
os.makedirs(IMAGE_FOLDER, exist_ok=True)
for i in range(N_IMAGES):
    image = next(dataset_iter)['image']
    axes[i].imshow(image)
    axes[i].axis("off")

    image.save(f"{IMAGE_FOLDER}/{i}.jpg")

plt.tight_layout()
plt.show()


The 20 images used from detection-datasets/coco
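The grid arithmetic in the plotting code uses floor division, which works here because 20 divides evenly by 5. If you change the number of images, a count such as 22 would leave the last two without a row. The following sketch shows one way to compute the grid with a ceiling instead; `grid_shape` and `figure_size` are hypothetical helpers, and the 2-inch cell size is simply carried over from the code above.

```python
import math


def grid_shape(n_images: int, n_cols: int) -> tuple[int, int]:
    """Return (rows, cols) with enough rows to hold every image."""
    if n_images <= 0 or n_cols <= 0:
        raise ValueError("n_images and n_cols must be positive")
    return math.ceil(n_images / n_cols), n_cols


def figure_size(n_rows: int, n_cols: int, cell_inches: float = 2.0) -> tuple[float, float]:
    """Matplotlib's figsize is (width, height), so columns come first."""
    return n_cols * cell_inches, n_rows * cell_inches


print(grid_shape(20, 5))   # (4, 5)
print(grid_shape(22, 5))   # (5, 5)
print(figure_size(4, 5))   # (10.0, 8.0)
```

With a ceiling there can be unused axes at the end of the grid, so you would probably also want to call `axis("off")` on every axis rather than only on those that receive an image.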

Next, embed the images as vectors.

!pip install chromadb
import chromadb
client = chromadb.Client()

!pip install open-clip-torch
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
from chromadb.utils.data_loaders import ImageLoader

embedding_function = OpenCLIPEmbeddingFunction()
image_loader = ImageLoader()

collection = client.create_collection(
    name='multimodal_collection', 
    embedding_function=embedding_function, 
    data_loader=image_loader)

# Get the uris to the images
image_uris = sorted([os.path.join(IMAGE_FOLDER, image_name) for image_name in os.listdir(IMAGE_FOLDER)])
ids = [str(i) for i in range(len(image_uris))]

collection.add(ids=ids, uris=image_uris)
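One detail worth knowing: `sorted()` orders file names as strings, so `images/10.jpg` comes right after `images/1.jpg`, and the id `"2"` ends up pointing at `images/10.jpg` rather than `images/2.jpg`. The searches still work, but the ids no longer match the file numbers. If you want them to line up, a numeric sort key is one option. The sketch below assumes file names of the form `<number>.jpg`, as written by the code above; `numeric_key` is a hypothetical helper.

```python
import os


def numeric_key(path: str) -> int:
    """Sort key that reads the file stem as an integer, e.g. 'images/10.jpg' -> 10."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return int(stem)


# Stand-in for os.listdir(IMAGE_FOLDER) with twelve numbered files.
file_names = [f"{i}.jpg" for i in range(12)]

lexicographic = sorted(file_names)
natural = sorted(file_names, key=numeric_key)

print(lexicographic[:4])  # ['0.jpg', '1.jpg', '10.jpg', '11.jpg']
print(natural[:4])        # ['0.jpg', '1.jpg', '2.jpg', '3.jpg']
```

An alternative would be to derive each id from the file name itself instead of from its position in the list, which avoids depending on the sort order at all.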


Next, query the embedded data. This example searches for images that match "animals" and displays them.

# Querying for "Animals"

retrieved = collection.query(query_texts=["animals"], include=['data'], n_results=3)
for img in retrieved['data'][0]:
    plt.imshow(img)
    plt.axis("off")
    plt.show()


Results of searching for "animals"

Next is the code for searching with "street scene". The results show images of street scenes.

# Querying for "Street Scenes"

retrieved = collection.query(query_texts=["street scene"], include=['data'], n_results=3)
for img in retrieved['data'][0]:
    plt.imshow(img)
    plt.axis("off")
    plt.show()


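Results of searching for "street scene"


Next is an example of finding similar images from a specific image. Here a giraffe image is used as the query to find animals that resemble a giraffe in a broader sense.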

from PIL import Image
import numpy as np

query_image = np.array(Image.open(f"{IMAGE_FOLDER}/1.jpg"))
print("Query Image")
plt.imshow(query_image)
plt.axis('off')
plt.show()

print("Results")
retrieved = collection.query(query_images=[query_image], include=['data'], n_results=5)
for img in retrieved['data'][0][1:]:
    plt.imshow(img)
    plt.axis("off")
    plt.show()


Results of searching with the giraffe image as the query

Next is a search based on a URI. It retrieves and displays images similar to the one at the given URI.
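The `[1:]` slice in the code above skips the first result on the assumption that the query image, which is itself stored in the collection, comes back as the top hit. That is usually what happens, but it is not guaranteed. A slightly more defensive variant filters by id instead of by position. The sketch below uses invented ids and placeholder strings in place of image arrays, and `drop_self_match` is a hypothetical helper.

```python
def drop_self_match(ids: list[str], items: list, query_id: str) -> list:
    """Return the items whose id differs from the query item's own id."""
    return [item for rid, item in zip(ids, items) if rid != query_id]


# Invented ranked results for one query; strings stand in for image arrays.
mock_ids = ["1", "12", "7", "3", "15"]
mock_images = ["img-1", "img-12", "img-7", "img-3", "img-15"]
print(drop_self_match(mock_ids, mock_images, "1"))
# ['img-12', 'img-7', 'img-3', 'img-15']

# If the query image is not ranked first, position-based slicing would drop
# the wrong item, while id-based filtering still removes the right one.
print(drop_self_match(["12", "1", "7"], ["img-12", "img-1", "img-7"], "1"))
# ['img-12', 'img-7']
```

With real results this would pair the ranked ids with the returned image data for the first query, and the query id would be whichever id was assigned to the query image when it was added.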

query_uri = image_uris[1]

query_result = collection.query(query_uris=query_uri, include=['data'], n_results=5)
for img in query_result['data'][0][1:]:
    plt.imshow(img)
    plt.axis("off")
    plt.show()


Results of searching by URI