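[QLoRA] Reading the English Explainer in Japanese [June 3, 2023 | @WorldofAI]

June 3, 2023 00:18

QLoRA focuses on reducing memory usage during fine-tuning. The researchers applied QLoRA to the Guanaco models and obtained the impressive result of reaching 99.3% of ChatGPT's level. They fine-tuned a 65 billion parameter model on a single GPU, and a 33 billion parameter model has also been fine-tuned successfully on Google Colab.
Published: June 3, 2023
Note: Playing the video before reading is recommended.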

Hey, what is up, guys?

Welcome back to another YouTube video at the World of AI.

In today's video, we're going to be focusing on a research paper, QLoRA.

Basically, it's a paper on efficient fine-tuning of quantized models that works to reduce memory usage during the fine-tuning process.

Now, what's amazing, and why I'm covering this project, is Guanaco, a 65 billion parameter model.

This is something they've actually been able to build with the QLoRA process.

Now, for the people who do not know, QLoRA stands for quantized low-rank adapters, and it's an approach for efficient fine-tuning of quantized language models.

Now, the researchers behind this project have been able to work with Guanaco, which is another type of model.

What they've done is reduce memory usage during the fine-tuning process while still keeping up high performance standards.

And what they've actually been able to accomplish, with the QLoRA framework, is to fine-tune Guanaco so that it reaches an impressive 99.3 percent of ChatGPT's performance level, which is quite astonishing.

And basically, they've been able to do this while still maintaining high performance.

And by employing QLoRA, they have successfully fine-tuned a massive 65 billion parameter model on a single 48-gigabyte GPU.

And this is all while preserving full 16-bit fine-tuning task performance.

And this is why this paper is so remarkable, and why we're going to be covering it in today's video.

We've also seen cases where people on Twitter have been able to fine-tune their own 33 billion parameter LLM on Google Colab in just a few hours, which is quite remarkable.

And this is all because of QLoRA.

They've been able to use the same framework, as well as the same code, that is used to efficiently fine-tune different types of quantized LLMs.

And the code will be linked in the description below, along with all the links that we're going to be checking out in today's video.

So before we actually get into the gist of today's video, guys, it'd mean the whole world to me if you could go give World of AI a follow on Twitter.

I'm going to be posting the latest AI news as well as the latest content over there, so you can get it right in your Twitter feed.

And guys, it would mean the whole world to me if you could subscribe to this channel.

If you haven't already, turn on the notification bell.

Like this video, and if you haven't seen any of my previous videos, definitely do so.

There's a lot of content that you will definitely benefit from.

So with that thought, guys, let's get right into the video.

So guys, the key concept behind QLoRA is to backpropagate gradients through a frozen 4-bit quantized pre-trained language model into low-rank adapters, and this is what helps it reduce memory usage.

And this enables efficient computation and memory utilization during the fine-tuning of different LLMs.

Now, by leveraging a low-rank factorization technique, QLoRA keeps the parameter count of the adapter layers small.

These adapter layers are what's actually added to the pre-trained model, without sacrificing performance during the fine-tuning process.
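To make the low-rank factorization point concrete, here is a minimal sketch of the parameter arithmetic. The hidden size and rank are hypothetical values chosen for illustration and do not come from the video: instead of training a full `d_out x d_in` update, a low-rank adapter trains two thin matrices `A` (`rank x d_in`) and `B` (`d_out x rank`).

```python
# Illustrative LoRA parameter arithmetic.
# The layer size and rank below are assumptions, not values from the video.

def full_update_params(d_in: int, d_out: int) -> int:
    """Trainable parameters if the whole d_out x d_in weight matrix is updated."""
    return d_in * d_out


def lora_update_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for a low-rank update B @ A,
    where A has shape (rank, d_in) and B has shape (d_out, rank)."""
    return rank * d_in + d_out * rank


d_in = d_out = 8192  # hypothetical hidden size
rank = 64            # hypothetical adapter rank

full = full_update_params(d_in, d_out)
lora = lora_update_params(d_in, d_out, rank)

print(f"full update: {full:,} trainable parameters")
print(f"LoRA update: {lora:,} trainable parameters ({lora / full:.2%} of full)")
```

Under these assumed sizes the adapter trains only a small fraction of the parameters of a full update, while the frozen base weights can stay in 4-bit form.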

Now, the Guanaco models represent their best fine-tuning results in comparison to ChatGPT, based on the QLoRA approach.

And this approach has been able to achieve remarkable results with different types of LLMs, such as Vicuna, LLaMA, and Guanaco.

Basically, they've reached an impressive 99.3 percent of ChatGPT's performance level, which we talked about at the start.

And for the people who do not know, this is quite impressive, as not many other fine-tuned models have reached such levels in comparison to ChatGPT.

And what's also amazing is that it only required 24 hours on a single GPU to do this.

So they've actually been able to fine-tune this model within a 24-hour window, and to do it efficiently.

In terms of its approach, it didn't require tens of millions of dollars, the way GPT-4 and GPT-3.5 did, to fine-tune the model.

In this case, it only required a single GPU for the whole process, which makes it so much more efficient for fine-tuning different types of LLMs.

Now, let's actually take a look at this table over here.

Table 1 presents Elo ratings for a competition between the different models, with the results averaged over 10,000 random initial orderings.

Now, in this competition, the models were evaluated on their performance in generating responses for the Vicuna benchmark.

The winner of each match is determined by GPT-4, which judges which response is better for a given prompt.

Now, we can see there are different models, such as Bard, Guanaco, ChatGPT, and Vicuna 13B, as well as GPT-4.

It also states the size as well as the Elo rating of each model.

For the people who do not know, the Elo rating indicates the relative strength of each model compared to the others.

A higher Elo rating basically suggests a better performance level.

Now, the table also gives a 95% confidence interval, which is a range expressing the uncertainty around each Elo rating.
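The rating procedure described above can be sketched in a few lines. This is a simplified illustration, not the paper's evaluation code: the match list is made up, the model names are placeholders, and the starting rating of 1000 and K-factor of 32 are assumed conventions. It uses fewer orderings than the 10,000 mentioned in the video so that it runs quickly.

```python
import random
from statistics import mean


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def run_tournament(matches, start=1000.0, k=32.0):
    """Play the matches in the given order and return the final ratings.
    Each match is (model_a, model_b, score_a), with score_a 1.0 for a win by A
    and 0.0 for a loss."""
    ratings = {}
    for a, b, score_a in matches:
        r_a = ratings.setdefault(a, start)
        r_b = ratings.setdefault(b, start)
        delta = k * (score_a - expected_score(r_a, r_b))
        ratings[a] = r_a + delta  # the winner gains ...
        ratings[b] = r_b - delta  # ... exactly what the loser gives up
    return ratings


def averaged_elo(matches, n_orderings=1000, seed=0):
    """Average final ratings over many random orderings of the same matches,
    because a single Elo pass depends on the order in which matches are played."""
    rng = random.Random(seed)
    history = {}
    for _ in range(n_orderings):
        shuffled = list(matches)
        rng.shuffle(shuffled)
        for model, rating in run_tournament(shuffled).items():
            history.setdefault(model, []).append(rating)
    return {model: mean(values) for model, values in history.items()}


# Made-up judgments for three placeholder models (not data from the paper).
matches = (
    [("model_x", "model_y", 1.0)] * 3
    + [("model_x", "model_z", 1.0)] * 3
    + [("model_y", "model_z", 1.0)] * 3
    + [("model_z", "model_x", 1.0)]
)

avg = averaged_elo(matches)
for model, rating in sorted(avg.items(), key=lambda item: -item[1]):
    print(f"{model}: {rating:.1f}")
```

In the setup described in the video, the `score_a` value of each match would come from GPT-4 judging which of two responses to a Vicuna benchmark prompt is better.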

According to this table, we can see that after GPT-4, the next-best models are Guanaco with 33 billion parameters and Guanaco with 65 billion parameters.

And, GPT-4 aside, those two are the models that won the most matches.

And it indicates that their performance is much stronger than the other models' in terms of Elo rating, which is quite remarkable, guys.

Because of the fine-tuning process that QLoRA has been able to implement for these models, they're able to surpass things such as Bard, which is backed by an amazing company, as well as ChatGPT.

They have reached a higher score than these two models, which is quite remarkable.

And basically, these ratings provide a quantitative assessment of each model's performance in this competition, which supports the comparison and offers some insight.

It basically spotlights the strengths that put these models ahead of one another.

Now, this figure illustrates different fine-tuning methods and compares their memory requirements, showing the improvements from one method to the next.

Now, the first technique is quantization, and this is basically where the Transformer model is quantized to 4-bit precision.

And quantization is a process where you reduce the precision of the numerical values in the model.

And basically, it helps reduce memory usage.

Now, by using a 4-bit precision model, QLoRA can achieve a significant reduction in memory requirements compared to LoRA, the earlier method it builds on, which uses a 16-bit Transformer.
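A rough back-of-the-envelope sketch of why 4-bit weights matter follows. It counts only the storage for the weights themselves (no activations, optimizer state, or quantization constants). The small quantizer is a plain uniform absmax scheme used purely for illustration; it is an assumption of this sketch and is not the NormalFloat data type that QLoRA uses.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed just to store the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 65e9  # the 65 billion parameter model from the video
print(f"16-bit weights: {weight_memory_gb(n_params, 16)} GB")  # 130.0 GB
print(f" 4-bit weights: {weight_memory_gb(n_params, 4)} GB")   # 32.5 GB


def quantize_absmax_4bit(block):
    """Toy uniform quantizer: map each value to an integer code in [-7, 7],
    scaled by the largest absolute value in the block."""
    scale = max(abs(x) for x in block) / 7 or 1.0
    codes = [round(x / scale) for x in block]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate values from the integer codes."""
    return [c * scale for c in codes]


block = [0.12, -0.53, 0.98, -0.07]  # made-up weights
codes, scale = quantize_absmax_4bit(block)
restored = dequantize(codes, scale)
print("codes:   ", codes)
print("restored:", [round(x, 3) for x in restored])
```

The arithmetic shows why a 65 billion parameter model cannot fit on a 48-gigabyte GPU with 16-bit weights but can with 4-bit weights, at the cost of a small rounding error per value.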

Now, the second technique involves the use of a paged optimizer to handle memory spikes.

Now, during the fine-tuning process, certain operations or calculations can cause temporary spikes in memory usage.

Paged optimizers are designed to handle such spikes efficiently by dividing the memory into smaller pages or chunks, in the way you can see over here.

And it basically allows for more efficient memory management.

Now, by implementing paged optimizers, QLoRA further optimizes memory usage, which contributes to its improvement over LoRA.

Now, overall, we're able to see how this visual demonstrates that QLoRA outperforms LoRA in terms of memory requirements by leveraging the 4-bit Transformer.

And it also uses the paged optimizers, which help it save even more memory.

To perform fine-tuning with QLoRA, you've got to follow a six-step process.

Firstly, you have to select a pre-trained language model.

You begin by selecting a pre-trained language model such as GPT-3, which has been trained on a large corpus of text data.

Obviously, you can select another model of your own choice.

Secondly, you move on to the next step, which is quantization.

And this is where you apply 4-bit NormalFloat quantization to the pre-trained language model.

And this step reduces the precision of the numerical values in the model while still ensuring high fidelity.

And basically, this is going to help you significantly reduce memory usage during the fine-tuning process.

The third step is adding the adapter layers.

You add these layers to the pre-trained model.

These layers are small modules inserted into the model's architecture, which allow for task-specific fine-tuning while limiting the impact on the original parameters.

And these will basically be used for the specific tasks you want to fine-tune the model for.

Fourthly, you double down on the quantization process by applying double quantization to the adapter layers.

And basically, this involves taking an additional quantization step designed specifically for the adapter layers.

And it makes the fine-tuning process even more memory efficient.
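The memory saving from the extra quantization step can be estimated with simple arithmetic. As far as I understand the QLoRA paper, the second pass is applied to the per-block quantization constants of the 4-bit base weights; the block sizes (64 and 256) and bit widths (32 and 8) below are recalled from that paper and should be treated as assumptions of this sketch rather than as something stated in the video.

```python
# Overhead of storing quantization constants, in bits per model parameter.
# Block sizes and bit widths are assumptions recalled from the QLoRA paper.

WEIGHT_BLOCK = 64      # weights that share one quantization constant
CONSTANT_BLOCK = 256   # first-level constants that share one second-level constant

# Single quantization: one 32-bit constant per block of 64 weights.
single = 32 / WEIGHT_BLOCK  # 0.5 bits per parameter

# Double quantization: first-level constants stored in 8 bits,
# plus one 32-bit second-level constant per 256 first-level constants.
double = 8 / WEIGHT_BLOCK + 32 / (WEIGHT_BLOCK * CONSTANT_BLOCK)

saved_bits = single - double
saved_gb_65b = saved_bits * 65e9 / 8 / 1e9  # for a 65 billion parameter model

print(f"single quantization overhead: {single:.3f} bits/param")
print(f"double quantization overhead: {double:.3f} bits/param")
print(f"saving: {saved_bits:.3f} bits/param, about {saved_gb_65b:.1f} GB at 65B parameters")
```

The per-parameter saving looks tiny, but multiplied across tens of billions of parameters it adds up to a few gigabytes, which matters when the target is a single GPU.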

In the fifth step, you work with paged optimizers.

And in this step, you employ paged optimizers to handle memory spikes during gradient checkpointing.

And what this does is divide the memory into smaller pages or chunks, which allows for more efficient memory management and prevents a lot of out-of-memory errors during the fine-tuning process.

In the sixth step, you run the fine-tuning itself, where you fine-tune the language model for your own specific task.

Now, the model is provided with task-specific data, and the adapter layers are updated to learn task-specific patterns from the inputs that you give it.

In a way, you're basically feeding it prior knowledge to make it smarter.

And this is what makes QLoRA's approach to efficient fine-tuning so much better.
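The six steps above map fairly directly onto the Hugging Face `transformers`, `peft`, and `bitsandbytes` libraries. The following is a hedged sketch, not the notebook shown in the video: the model name is a placeholder, the rank, alpha, dropout, and target module names are hypothetical choices that depend on the model, and exact library behavior varies by version. The heavy imports live inside the function so the settings can be inspected without a GPU.

```python
# Illustrative settings for each step; every value here is an assumption.
QLORA_SETTINGS = {
    "base_model": "your-org/your-base-model",  # step 1: placeholder model id
    "quant_type": "nf4",                       # step 2: 4-bit NormalFloat
    "lora_rank": 16,                           # step 3: hypothetical adapter rank
    "lora_alpha": 32,                          # step 3: hypothetical scaling
    "lora_dropout": 0.05,                      # step 3: hypothetical dropout
    "target_modules": ["q_proj", "v_proj"],    # step 3: names depend on the model
    "double_quant": True,                      # step 4: double quantization
    "optimizer": "paged_adamw_32bit",          # step 5: paged optimizer name
}


def build_qlora_model(settings=QLORA_SETTINGS):
    """Load a 4-bit base model and attach trainable low-rank adapters.
    Requires a CUDA GPU plus torch, transformers, peft, and bitsandbytes."""
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    # Steps 1, 2, and 4: load the frozen base model with 4-bit weights.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type=settings["quant_type"],
        bnb_4bit_use_double_quant=settings["double_quant"],
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        settings["base_model"],
        quantization_config=bnb_config,
        device_map="auto",
    )
    model = prepare_model_for_kbit_training(model)

    # Step 3: add the adapter layers; only these are trained.
    lora_config = LoraConfig(
        r=settings["lora_rank"],
        lora_alpha=settings["lora_alpha"],
        lora_dropout=settings["lora_dropout"],
        target_modules=settings["target_modules"],
        bias="none",
        task_type="CAUSAL_LM",
    )
    return get_peft_model(model, lora_config)


# Steps 5 and 6 would pass settings["optimizer"] as the `optim` argument of
# transformers.TrainingArguments and then train on task-specific data.
print(QLORA_SETTINGS)
```

Calling `build_qlora_model()` downloads and loads the base model, so it is left uncalled here; in a Colab session with a GPU runtime it would be the starting point for the training loop.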

Now, let's get to the next step, where we can actually start fine-tuning our own model.

I forgot how to pronounce this name properly, but Itamar has created this Google Colab notebook in which you can actually fine-tune your own model.

Now, if you want Colabs for other purposes, all the links will be in the description below, and I'll leave the tweet for this exact notebook there as well.

But in this case, you can spend a couple of hours to fine-tune your own language model.

First things first, you need to connect a GPU to this Google Colab.

Secondly, you need to go to the Runtime menu.

What you want to do is change the runtime type to GPU so that you're using the GPU hardware.

Lastly, what you want to do is save a copy to your Drive so that you're working in your own Drive space.

And what you can now do is start installing the requirements.

And as you scroll down, you can start entering the model as well as the other inputs.

It's basically as easy as that, guys.

You just have to click through these run buttons to start the actual fine-tuning process with the QLoRA code on Google Colab.

So you can start working towards fine-tuning your own model.

I'm not going to be showing you guys how to actually fine-tune your own model, because that is going to take a lot of hours.

But in this case, we're just going to show you a little comparison of how the Guanaco 65 billion parameter model has been able to achieve such an amazing level of performance in comparison to ChatGPT-4.

And basically, this is by going through a Google Colab that was created by a user on Twitter.

And basically, he created this to test whether you can tell ChatGPT apart from the Guanaco 65 billion parameter model.

And you can do so with different questions as well as different prompts.

Basically, he's been able to compare these two models by giving the same prompt to each of them.

As you go down, you can see that there are two types of questions as well as two generated answers for each.

And over here, we're able to see that you get relatively similar answers when you give the same input to both models.

And this is one of the quite remarkable things you can do with QLoRA, guys.

In summary, it's an amazing tool that you can use to fine-tune different types of models quite efficiently without high memory usage.

And with that thought, guys, I hope you got some sort of value out of this video.

I hope this makes it easier for you to fine-tune your own models, so that you don't have to use a lot of resources or computation power to do it.

So with that thought, guys, make sure you give this Twitter account a follow if you haven't already, definitely subscribe if you haven't already, turn on the notification bell, like this video, and comment anything you want to see in future uploads.

And if you haven't seen any of my previous videos, it'd mean the whole world to me, guys, if you could do so.

And with that thought, I'll see you guys next time.
