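[LangChain] Reading an English Explainer in Japanese [April 23, 2023 | @StarMorph AI]

April 24, 2023, 18:18

An explainer on LangChain.
Published: April 23, 2023
Note: we recommend playing the video before reading the transcript.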

Hey everyone, welcome to StarMorph, where we talk about artificial intelligence and web development.

Today we are going to go into building your own knowledge base, so you can bring your own data into LangChain and use it with GPT-4.

So how do you bring in some of your specific data to interact with GPT-4?

As a lot of you probably saw in my last video, we showed what this can do: giving more specific answers based on your content.

And in this presentation, I'm going to go over some of the different pieces that are at play when you're building a knowledge base like this.

So this is an outline of the presentation: we're going to go over embeddings, vector storage, and LangChain documents.

So it's just an overview of the whole process of building a knowledge base.

So let's get started.

Alright, so first off, what data do you want to bring into this app?

What data can you bring into this app?

These are some of the file loaders that are supported by LangChain.

And there are also web loaders.

So there's an ability to bring in Amazon S3 files, you can bring in a GitHub repository, and you can do web scraping with things like Puppeteer or Cheerio.

And then you can also use these file loaders.

And so, just a little bit about these: I typically use the PDF file loader, or I use text files or Markdown documents.

And I've had those working really well.

I've had some issues with .docx files.

And I believe the CSV file loader takes some configuration: you have to identify which headers, rows, and columns you want to use.

So the LLM does well with raw text, especially structured text, if possible.

So that's what I've been doing so far.

But the way that I do it is I make a directory, since there's also a directory loader, and then I can load any PDF or text files that I put in that directory into the embedding and into the next steps we're about to go to.

So once you identify what your file type is and what your data is going to look like, we can load our data into our LangChain application using a document loader, creating a LangChain document.

So you can see here, we're importing the text loader from LangChain, and this is JavaScript; this whole presentation will be on JavaScript LangChain.

And then we are loading this text file here, using the text loader.

So this is important, because this concept of loading in a document is how we start to get access to our data in the LangChain application.
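As a rough illustration of what a loader hands back, here is a minimal sketch in plain JavaScript of a document-shaped object: the text itself plus metadata about where it came from. The pageContent and metadata field names follow LangChain's Document; the makeDocument helper and the example.txt file name are hypothetical, and a real loader would read the text from disk instead of taking a string.

```javascript
// Hypothetical helper: wrap raw text in a LangChain-style document object.
// A real loader (such as the text loader) would read the file for you.
function makeDocument(text, source) {
  return {
    pageContent: text,     // the raw text that later gets split and embedded
    metadata: { source },  // where the text came from
  };
}

const doc = makeDocument(
  "LangChain helps you connect your own data to an LLM.",
  "example.txt"
);
console.log(doc.metadata.source, doc.pageContent.length);
```

Keeping the source in the metadata is what later lets a search result point back to the file it came from.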

And another thing to consider here is that we have some tools to help us parse the text.

So we can do things like splitting and chunking the text when we're loading it into the documents and into the embeddings.

And that's another tool that LangChain provides.
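To make the idea of chunking concrete, here is a minimal sketch of a fixed-size splitter with overlap, written in plain JavaScript. It is not LangChain's splitter, which is more careful about where it cuts; the splitText name and its parameters are assumptions made up for this example.

```javascript
// Split text into chunks of at most `chunkSize` characters, where each
// chunk repeats the last `overlap` characters of the previous one.
// Hypothetical helper for illustration; not LangChain's own splitter.
function splitText(text, chunkSize, overlap) {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("need chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const chunks = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this chunk reached the end
  }
  return chunks;
}

const chunks = splitText("abcdefghij", 4, 1);
console.log(chunks); // [ 'abcd', 'defg', 'ghij' ]
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk more often than it would with a hard cut.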

Okay, so once we have the document, the next piece of the puzzle is creating an embedding.

So let's start out with what an embedding is.

An embedding is used, as it says here, to create a numerical representation of textual data.

So it's a vector of numbers, lots of vectors of numbers, and the closer one vector is to another, the more similar they are; the distance between them signifies how similar different pieces of text are.
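One common way to measure how close two embedding vectors are is cosine similarity. The sketch below is plain JavaScript with made-up three-dimensional vectors; real embeddings have far more dimensions, and the talk does not say which distance measure any particular vector store uses, so treat the choice of cosine similarity as an assumption.

```javascript
// Cosine similarity: 1 means same direction, 0 means unrelated (orthogonal).
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Made-up "embeddings" for illustration only.
const cat = [0.9, 0.1, 0.0];
const kitten = [0.8, 0.2, 0.1];
const invoice = [0.0, 0.1, 0.9];

console.log(cosineSimilarity(cat, kitten) > cosineSimilarity(cat, invoice)); // true
```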

So let's just zoom out here a little bit.

And this is actually a visualization of what we're going to be doing with these embeddings.

So we're going to create the embedding, these vectors of numbers representing our document.

And then what we're going to do is perform a similarity search over the embedding.

And we're also going to talk more about vector storage and creating a database for these embeddings, which we'll also use in this process of doing similarity search.

So this is a cool visualization of what we're actually trying to do here.

We're loading in our source document, whether it's a text file or a PDF file, then we're creating this vector structure to organize the information with numbers.

And then we're able to perform this kind of similarity search now that it's in this embedding format.

And this is also a format that we're able to send over to OpenAI to interact with GPT-4 with the embedding.

Okay, so that's a little bit about what embeddings are.

It's just a list of numbers that tells you about the context of your source document.

There are a lot of great websites that talk about this.

One of the pages I usually go to is just the OpenAI page about embeddings.

And so this is a great resource to learn more about what they are, measuring the relatedness of text strings, and then we can do search as well as some other features here.

Okay, so let's talk about creating an embedding in LangChain.

How do you use them?

So we can import the OpenAI embeddings section of the library.

And this is also available in the OpenAI API directly, which this page will go into.

And there are two kinds of embeddings that we can do with LangChain.

The first is embedding the query.

So you can think of when you're using a chatbot: the actual prompt that the user is typing into the bot, we can embed that search.

And then we can also embed, on the other side, the actual document, like we were just talking about with bringing in the document loader and embedding the document.

And then you can, I believe, use the query embedding in reference to the document embedding.

And yeah, so there are two forms of embeddings that we can do here, both the query and the document.

And that's from the LangChain documentation.
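The two entry points can be sketched with a toy stand-in for a real embedding model. The method names embedQuery and embedDocuments mirror LangChain's embeddings interface, but everything else here is an assumption for illustration: toyEmbed just counts words into 16 buckets, and it runs synchronously, whereas the real methods call an API and return promises.

```javascript
const DIMENSIONS = 16;

// Toy embedding (not a real model): hash each word into one of 16 buckets
// and count how many words land in each bucket.
function toyEmbed(text) {
  const vector = new Array(DIMENSIONS).fill(0);
  const words = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  for (const word of words) {
    let hash = 0;
    for (const ch of word) hash = (hash * 31 + ch.charCodeAt(0)) % DIMENSIONS;
    vector[hash] += 1;
  }
  return vector;
}

// Same shape as the two kinds of embedding described above:
// one vector for a query, one vector per document text.
const toyEmbeddings = {
  embedQuery: (text) => toyEmbed(text),
  embedDocuments: (texts) => texts.map(toyEmbed),
};

const queryVector = toyEmbeddings.embedQuery("How do I load a PDF?");
const documentVectors = toyEmbeddings.embedDocuments([
  "Use a PDF loader to load a PDF file.",
  "Vector stores hold embeddings.",
]);
console.log(queryVector.length, documentVectors.length); // 16 2
```

Because the query and the documents go through the same function, their vectors live in the same space and can be compared directly.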

Okay, so now we have an embedding of our document, and we are able to create an embedding of our query.

Let's talk about managing these embeddings.

And that's where we get into vector storage.

And this is a very exciting area right now.

There's some amazing technology coming out of these companies.

Chroma just raised a round, Weaviate just raised their Series B, LangChain just raised a round, and Supabase just built a vector database plugin, which I'll talk more about.

So lots of really exciting stuff happening here.

Congratulations to these companies.

And I'm really excited for all of the tools that I imagine are going to come out of this.

And I think that it's a great time to jump into this stuff and start learning.

While these tools are early on, and before they get more advanced, we might as well learn the fundamentals now, so we can understand how to continue using them as they develop.

So alright, this is a lot here.

Vector storage, it sounds kind of crazy.

If you want to just get started with it, where's a good place to start?

So my thoughts on this are that there are two main things to consider.

The first is what environment you are going to be coding in.

Because some of these are going to work better with Docker, some will work in a Node.js app, and some can work in the browser or serverless.

So consider what environment you're going to be coding in.

And then the second thing is at what scale you are building a vector storage app.

So if you want a production-level SaaS, where you want a hosted service at scale, then something like Pinecone could be a great option.

If you want to have a small file in your GitHub repo, I think a great way to get started is using this HNSWLib tool that comes with LangChain.

And it will create your embeddings and then store them in a file that will be local in your GitHub repo; you don't need to sign up for a cloud service.

And you don't have to manage having a remote and a local vector store; it's just a super simple file.

So I think this is a great way to get started.
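The "it's just a file" idea can be sketched without any library: turn the stored texts and vectors into a string, and read them back later. This is only an illustration of the concept, not the on-disk format HNSWLib uses; the serializeStore and deserializeStore names, the version field, and the sample vectors are all made up for this example, and actually writing the string to disk (for instance with Node's fs module) is left out.

```javascript
// Each entry pairs a text chunk with its (made-up) embedding vector.
const entries = [
  { text: "Loaders turn files into documents.", vector: [0.1, 0.7, 0.2] },
  { text: "Embeddings turn text into numbers.", vector: [0.6, 0.1, 0.3] },
];

// Hypothetical helper: produce the string you would save to a local file.
function serializeStore(storeEntries) {
  return JSON.stringify({ version: 1, entries: storeEntries });
}

// Hypothetical helper: rebuild the entries from the saved string.
function deserializeStore(json) {
  const parsed = JSON.parse(json);
  if (parsed.version !== 1 || !Array.isArray(parsed.entries)) {
    throw new Error("unrecognized store file");
  }
  return parsed.entries;
}

const fileContents = serializeStore(entries);
const restored = deserializeStore(fileContents);
console.log(restored.length); // 2
```

Saving the vectors this way means the embeddings only have to be computed once, which matters when each one costs an API call.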

There's a page in the LangChain documentation about the differences between each of these.

So if we go to vector stores here, LangChain recommends what to pick.

And similar to what I was just saying, they talk about the different environments.

So for a Node application, they recommend HNSWLib; for a browser-like environment, so maybe serverless on Vercel, you can use the LangChain memory vector store, which I want to play with.

I haven't used that one yet.

There are also a lot of Python options in the Python documentation.

If you are running it locally with Docker, or you're a back-end developer, then you can use Chroma. And then Supabase just launched a package manager for database packages.

So like npm, but for database packages, and there's one now that is a vector store, I believe integrated with LangChain.

So I'm definitely looking forward to trying that out.

Supabase just had an awesome launch week, coming out with a ton of new cool stuff.

And I really love using Supabase.

So that's a brief overview of some of the differences between these.

It's great trying out different ones and seeing the different functionalities for getting started.

Again, I would recommend starting with this; it has been a good experience for me.

And then Pinecone at a larger scale.

But I definitely want to play with all these tools more.

So I hope that's a helpful little overview of some of the options here.

And moving into how we start to use all this together.

And so this code snippet here (I apologize, it's a little blurry) is coming from the LangChain documentation.

And so we're doing a few of the steps here.

First, we load in our text document and create a LangChain document from it; then we load that document into the vector store, creating a new embedding with OpenAI and loading that into the vector store as well.

And then we can do a vectorStore.similaritySearch.

So this is where all the pieces are starting to come together.

We're loading our data in, creating the document, creating the embedding, and storing that in the vector storage.
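Since the snippet on the slide is hard to read, here is a dependency-free sketch of the same flow: documents go into a store together with their embeddings, and a similarity search embeds the query and returns the closest documents. This is not the LangChain snippet itself. ToyVectorStore, toyEmbed, and the sample documents are made up for illustration, a sparse word-count map stands in for a real OpenAI embedding, and only the similaritySearch(query, k) call shape is modeled on LangChain's vector stores.

```javascript
// Toy embedding (an assumption for this demo): a sparse word-count vector.
function toyEmbed(text) {
  const counts = new Map();
  for (const word of text.toLowerCase().match(/[a-z0-9]+/g) || []) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return counts;
}

// Cosine similarity between two sparse vectors stored as Maps.
function cosineSimilarity(a, b) {
  let dot = 0;
  for (const [word, count] of a) dot += count * (b.get(word) || 0);
  const norm = (m) => Math.sqrt([...m.values()].reduce((sum, c) => sum + c * c, 0));
  const denominator = norm(a) * norm(b);
  return denominator === 0 ? 0 : dot / denominator;
}

class ToyVectorStore {
  constructor(embed) {
    this.embed = embed;
    this.entries = []; // each entry keeps the document next to its vector
  }

  addDocuments(docs) {
    for (const doc of docs) {
      this.entries.push({ doc, vector: this.embed(doc.pageContent) });
    }
  }

  // Embed the query, score every stored vector, return the top k documents.
  similaritySearch(query, k = 1) {
    const queryVector = this.embed(query);
    return this.entries
      .map(({ doc, vector }) => ({ doc, score: cosineSimilarity(queryVector, vector) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, k)
      .map(({ doc }) => doc);
  }
}

const store = new ToyVectorStore(toyEmbed);
store.addDocuments([
  { pageContent: "pinecone is a hosted vector database", metadata: { id: 1 } },
  { pageContent: "cheerio is used for web scraping", metadata: { id: 2 } },
]);

const results = store.similaritySearch("hosted vector database", 1);
console.log(results[0].metadata.id); // 1
```

This version scores every stored vector on each search, which is fine for a handful of documents; dedicated vector stores exist largely to avoid that full scan at scale.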

And so that brings us to the second-to-last slide.

Which is kind of looking at this from a macro view.

So there are definitely other pieces of this as well, where the LLM comes in.

But just for what we've talked about so far, this is a basic overview: you can bring your data in, whether it's from a file or from the web; you can create a LangChain document with their loader; then you can use OpenAI's API to create an embedding.

And then when the user searches, they'll create an embedding of the search.

And then we can store these embeddings as well as the document in the vector database.

So that's kind of the architecture; this is my current best understanding of what's happening behind the scenes.

This is all changing very quickly.

So definitely check the LangChain documentation, which is going to be more up to date.

And I will continue to learn and get better at this and share better information as I learn what's happening here.

But this is my current best understanding of how some of these knowledge bases are working.

So I hope this is a helpful overview.

And in terms of learning more about this stuff, these are some of the YouTube channels and communities that I've been learning from.

And I'm sure you guys have seen a lot of these channels already; they're putting out great content on LangChain, on building knowledge bases, and on working with large language models.

Honestly, just sitting down and reading the docs is super valuable.

And these guys are updating it every single day and every single week; on Twitter, it's just new stuff.

So it's a very exciting space.

And I know a lot of businesses are interested in doing this, because it's pretty amazing to be able to bring these tools to your custom data.

There are a lot of different applications for it, whether it's a question-and-answer bot, or producing assets for your business, or reading large documents, where you don't have time to read a 400-page document but you want to ask it a few questions.

So I hope this is a helpful overview of some of how this stuff is working with LangChain.

And I imagine you guys want to see some coding on this too.

We have some snippets here from the docs.

But I would love to make a future video, which I'm still working on, where we're going to build out this code and put the pieces together in an app.

Because I know you guys want to do that as well.

So this is a step towards that.

And I hope this video was helpful.

Thank you very much for watching.

Please be sure to like and subscribe if you found this video helpful.

And I'll see you guys in the next video.
