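7 Examples of Gemini's Multimodal Capabilities
I found the following article interesting, so here is a brief summary.
1. Introduction
Gemini's native multimodal and long-context capabilities power applications such as NotebookLM and Google Lens, and open up a wide range of novel applications for developers.
The article highlights the image and video input use cases for the Gemini API that it is most excited about, and shows what is possible with the model's vision capabilities. The examples use Gemini 1.5 Pro, but depending on the complexity of the task, Gemini 1.5 Flash and Gemini 1.5 Flash-8B are also worth trying.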
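2. Detailed Image Description
Gemini can describe images, answer questions about them, and reason over them. The model adapts the length, tone, and format of its descriptions to the prompt, so its behavior can be tailored to the use case.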
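3. Understanding Long PDFs with Native Vision
Gemini can understand and process PDFs of more than 1,000 pages. With its native vision capabilities, Gemini can accurately transcribe tables, interpret complex multi-column layouts, understand charts, sketches, diagrams, and handwritten text within documents, and use this textual and visual information to carry out end-to-end tasks.
The following example shows Gemini using information from a set of PDFs to generate structured output and code.
Fifteen Alphabet earnings releases through the third quarter of 2024 (152 pages in total) were provided as input. Gemini was asked to extract Google's revenue figures across these documents, build aggregated tables, and write matplotlib code to visualize revenue and quarter-on-quarter growth.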
・Prompt
These are earnings reports over the last 15 quarters.
1. Generate a table in markdown of total revenue for each quarter, from "Quarter Ended March 31, 2021" to "Quarter Ended September 30 2024". There should be 15 in total.
2. Generate a table of total revenues for each Google product/service -- "Google Search", "Youtube ads", "Google Network", "Google subscriptions, platforms and devices", "Google Cloud" and "Other Bets" -- for each of the quarters (15 in total). Show this in multiple tables (4 quarters for each table) in your output. Note that Google subscriptions, platforms and devices was called 'Google other' third quarter 2023 and before.
・Response
Here are the tables you requested:
**1. Total Revenue by Quarter (in millions USD)**
| Quarter Ended | Year | Total Revenue |
|-------------------|------|--------------|
| March 31 | 2021 | 55,314 |
| June 30 | 2021 | 61,880 |
| September 30 | 2021 | 65,118 |
| December 31 | 2021 | 75,325 |
| March 31 | 2022 | 68,011 |
| June 30 | 2022 | 69,685 |
| September 30 | 2022 | 69,092 |
| December 31 | 2022 | 76,048 |
| March 31 | 2023 | 69,787 |
| June 30 | 2023 | 74,604 |
| September 30 | 2023 | 76,693 |
| December 31 | 2023 | 86,310 |
| March 31 | 2024 | 80,539 |
| June 30 | 2024 | 84,742 |
| September 30 | 2024 | 88,268 |
**2. Revenue by Google Product/Service (in millions USD)**
**2021 Q1 - Q4**
| Quarter Ended | Google Search | YouTube Ads | Google Network | Google Other | Google Cloud | Other Bets |
|----------------|--------------|------------|---------------|-------------|-------------|-----------|
| March 31, 2021 | 31,879 | 6,005 | 6,800 | 6,494 | 4,047 | 198 |
| June 30, 2021 | 35,845 | 7,002 | 7,597 | 6,623 | 4,628 | 192 |
| Sept 30, 2021 | 37,926 | 7,205 | 7,999 | 6,754 | 4,990 | 182 |
| Dec 31, 2021 | 43,301 | 8,633 | 9,305 | 8,161 | 5,541 | 181 |
**2022 Q1 - Q4**
| Quarter Ended | Google Search | YouTube Ads | Google Network | Google Other | Google Cloud | Other Bets |
|----------------|--------------|------------|---------------|-------------|-------------|-----------|
| March 31, 2022 | 39,618 | 6,869 | 8,174 | 6,811 | 5,821 | 440 |
| June 30, 2022 | 40,689 | 7,340 | 8,259 | 6,553 | 6,276 | 193 |
| Sept 30, 2022 | 39,539 | 7,071 | 7,872 | 6,895 | 6,868 | 209 |
| Dec 31, 2022 | 42,604 | 7,963 | 8,475 | 8,796 | 7,315 | 226 |
**2023 Q1 - Q4**
| Quarter Ended | Google Search | YouTube Ads | Google Network | Google Other | Google Cloud | Other Bets |
|----------------|--------------|------------|---------------|-------------|-------------|-----------|
| March 31, 2023 | 40,359 | 6,693 | 7,496 | 7,413 | 7,454 | 288 |
| June 30, 2023 | 42,628 | 7,665 | 7,850 | 8,142 | 8,031 | 285 |
| Sept 30, 2023 | 44,026 | 7,952 | 7,669 | 8,339 | 8,411 | 297 |
| Dec 31, 2023 | 48,020 | 9,200 | 8,297 | 10,794 | 9,192 | 657 |
**2024 Q1 - Q3**
| Quarter Ended | Google Search | YouTube Ads | Google Network | Subscriptions, Platforms, & Devices | Google Cloud | Other Bets |
|----------------|--------------|------------|---------------|-----------------------------------|-------------|-----------|
| March 31, 2024 | 46,156 | 8,090 | 7,413 | 8,739 | 9,574 | 495 |
| June 30, 2024 | 48,509 | 8,663 | 7,444 | 9,312 | 10,347 | 365 |
| Sept 30, 2024 | 49,385 | 8,921 | 7,548 | 10,656 | 11,353 | 388 |
・Prompt
Using the tables generated, write matplotlib code to visualize these revenues, in bar chart format. On the x-axis, plot time (each of the 15 quarters in chronological order). On the y-axis, plot revenues -- total revenue and for each product and service for each time horizon. Do not leave code for the user to fill in. Leave out Other Bets, and ensure that the bars have equal spacing.
・Prompt
Awesome! Now, can you generate a plot showing quarter on quarter revenue growth for each of the product areas? The x axis should be time (each of the 15 quarters) and the y axis should be the revenue percentage growth, and there should be 6 lines on the chart. Include variable values in the code even if they were defined previously. Exclude other bets.
Quarter on Quarter revenue growth is defined as ((revenue for current quarter)-(revenue for previous quarter))/(revenue for previous quarter).
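The code that Gemini generated for these prompts is not reproduced in this summary. As a minimal sketch of the growth definition given in the prompt above, the following Python computes quarter-on-quarter growth for the total revenue column of the first table. The function name `qoq_growth` and the variable names are illustrative assumptions, not taken from the model's output.

```python
# Total revenue per quarter (millions USD), copied from the first table above,
# from the quarter ended March 31, 2021 to the quarter ended September 30, 2024.
total_revenue = [
    55314, 61880, 65118, 75325,
    68011, 69685, 69092, 76048,
    69787, 74604, 76693, 86310,
    80539, 84742, 88268,
]


def qoq_growth(revenues):
    """Return quarter-on-quarter growth as fractions.

    Each value is (current - previous) / previous, so the result has one
    fewer element than the input (the first quarter has no predecessor).
    """
    return [
        (current - previous) / previous
        for previous, current in zip(revenues, revenues[1:])
    ]


growth = qoq_growth(total_revenue)
for rate in growth:
    print(f"{rate * 100:+.1f}%")
```

Because the first quarter has no previous quarter to compare against, a growth chart built this way has 14 points per series rather than 15.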
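4. Real-World Document Reasoning
Gemini can understand and extract information from a wide variety of real-world documents, including receipts, labels, signs, notes, whiteboard sketches, and personal records.
The following example extracts the values of user-defined fields from an image of a receipt and returns them as JSON.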
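5. Web Page Data Extraction
Gemini can extract data from screen captures of web pages and return it in structured output formats such as JSON. This gives access to live page content, including images and videos on the page, in the same way a human sees it, and enables applications such as web data APIs and browsing agents.
The following example shows Gemini converting information from a Google Play web page into a structured output format.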
・Prompt
Extract each of the books from this webpage, and return a list of JSON objects, where each contains
name of the book
author
number of stars
price
Only output information that you see in the webpage
・Response
[
{
"name": "Warriors: A Starless Clan #6",
"author": "Erin Hunter",
"stars": null,
"price": "$14.95"
},
{
"name": "In Too Deep: A Reacher Novel",
"author": "Lee Child",
"stars": 4.7,
"price": "$16.95"
},
{
"name": "The Waiting: A Ballard and Bosch Novel",
"author": "Michael Connelly",
"stars": 4.8,
"price": "$16.95"
},
{
"name": "The Fury of the Gods",
"author": "John Gwynne",
"stars": 5.0,
"price": "$29.95"
},
{
"name": "Raised by Wolves: Patterson's Greatest Small-Town Thriller",
"author": "James Patterson",
"stars": null,
"price": "$14.95"
},
{
"name": "Iron Flame (1 of 2)",
"author": null,
"stars": 4.6,
"price": "$27.95"
},
{
"name": "The Message",
"author": "Ta-Nehisi Coates",
"stars": 4.9,
"price": "$14.95"
},
{
"name": "Revenge of the Tipping Point: Overstories, Superspreaders",
"author": "Malcolm Gladwell",
"stars": 4.6,
"price": "$14.95"
},
{
"name": "Counting Miracles: A Novel",
"author": "Nicholas Sparks",
"stars": 4.3,
"price": "$16.95"
},
{
"name": "Nexus: A Brief History of Information Networks from...",
"author": "Yuval Noah Harari",
"stars": 4.2,
"price": "$19.95"
},
{
"name": "Wind and Truth: Book Five of...",
"author": "Brandon Sanderson",
"stars": null,
"price": "$38.45"
},
{
"name": "Passions in Death: An Eve...",
"author": "J.D. Robb",
"stars": 4.5,
"price": "$14.95"
},
{
"name": "Sunrise on the Reaping (A Hunger Games Novel)",
"author": "Suzanne Collins",
"stars": null,
"price": "$19.95"
},
{
"name": "Michael Vey 10: The Colony",
"author": "Richard Paul Evans",
"stars": null,
"price": "$14.95"
},
{
"name": "The Echo Machine: How Right-Wing Extremism...",
"author": "David Pakman",
"stars": null,
"price": "$24.95"
},
{
"name": "The Perfect Crime (A Jessie...",
"author": null,
"stars": null,
"price": "$14.95"
},
{
"name": "Peppermint Peril (A Penny...",
"author": null,
"stars": null,
"price": "$4.99"
},
{
"name": "Lethal Prey",
"author": "John Sandford",
"stars": null,
"price": "$14.95"
},
{
"name": "The Song of Achilles: A Novel",
"author": "Madeline Miller",
"stars": 4.7,
"price": "$3.99"
},
{
"name": "A People's History of the United States",
"author": "Howard Zinn",
"stars": 4.1,
"price": "$4.99"
},
{
"name": "Termination Shock: A Novel",
"author": "Neal Stephenson",
"stars": 4.3,
"price": "$4.99"
},
{
"name": "Mere Christianity",
"author": "C.S. Lewis",
"stars": 4.8,
"price": "$4.99"
},
{
"name": "Local Woman Missing",
"author": "Mary Kubica",
"stars": 4.6,
"price": "$6.99"
},
{
"name": "Murder on the Orient Express: A Hercule Poirot Mystery",
"author": "Agatha Christie",
"stars": 4.4,
"price": "$4.99"
},
{
"name": "The Mountains Sing",
"author": "Nguyen Phan Que Mai",
"stars": 1.5,
"price": "$32.30"
},
{
"name": "The Bands of Mourning",
"author": "Brandon Sanderson",
"stars": 4.9,
"price": "$19.95"
},
{
"name": "The Institute: A Novel",
"author": "Stephen King",
"stars": 4.7,
"price": "$16.95"
},
{
"name": "Tom Lake: A Novel",
"author": "Ann Patchett",
"stars": 4.4,
"price": "$16.95"
},
{
"name": "All the Sinners Bleed: A Novel",
"author": "S.A. Cosby",
"stars": 4.7,
"price": "$14.95"
},
{
"name": "The Black Prism",
"author": "Brent Weeks",
"stars": 4.9,
"price": "$35.99"
},
{
"name": "The Hundred Years' War on Palestine: A History of Settle...",
"author": "Rashid Khalidi",
"stars": 4.6,
"price": "$14.95"
},
{
"name": "War",
"author": "Bob Woodward",
"stars": 3.7,
"price": "$14.95"
},
{
"name": "A People's History of the United States",
"author": "Howard Zinn",
"stars": 4.1,
"price": "$4.99"
},
{
"name": "Confronting the Presidents: No Spin Assessments from...",
"author": "Bill O'Reilly & Martin Dugard",
"stars": 4.8,
"price": "$19.95"
},
{
"name": "Chaos: Charles Manson, the CIA, and the Secret History o...",
"author": null,
"stars": 4.8,
"price": "$29.95"
},
{
"name": "The Demon of Unrest: A Saga of Hubris, Heartbreak, and...",
"author": "Erik Larson",
"stars": 3.1,
"price": "$19.95"
},
{
"name": "A Hidden Secret: A Kate Burkholder Short Story",
"author": "Linda Castillo",
"stars": 4.8,
"price": "$1.99"
},
{
"name": "MrBallen Presents: Strange, Dark & Mysterious: The...",
"author": null,
"stars": 5.0,
"price": "$13.95"
},
{
"name": "Angle of Investigation: Three Harry Bosch Stories",
"author": "Michael Connelly",
"stars": 4.4,
"price": "$12.95"
},
{
"name": "Cross Kill: An Alex Cross Story",
"author": "James Patterson",
"stars": 4.7,
"price": "$14.95"
},
{
"name": "Detective Cross",
"author": "James Patterson",
"stars": 4.6,
"price": "$14.95"
},
{
"name": "Blood Brothers: A Short Story Exclusive",
"author": "James Rollins",
"stars": 5.0,
"price": "$9.95"
}
]
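Structured output like this usually needs light validation before it is used downstream. The following is a minimal sketch, not part of the original article, that parses a short excerpt of the response above, checks that each object has the four requested keys, skips repeated entries (for example, "A People's History of the United States" appears twice in the response), and converts the price string to a number. The helper name `clean_books` and the `price_usd` field are hypothetical.

```python
import json

# A short excerpt of the response above; in practice this would be the full
# text returned by the model.
response_text = """
[
  {"name": "In Too Deep: A Reacher Novel", "author": "Lee Child", "stars": 4.7, "price": "$16.95"},
  {"name": "Iron Flame (1 of 2)", "author": null, "stars": 4.6, "price": "$27.95"},
  {"name": "A People's History of the United States", "author": "Howard Zinn", "stars": 4.1, "price": "$4.99"},
  {"name": "A People's History of the United States", "author": "Howard Zinn", "stars": 4.1, "price": "$4.99"}
]
"""

REQUIRED_KEYS = {"name", "author", "stars", "price"}


def clean_books(text):
    """Parse the JSON list, check its keys, skip repeats, and parse prices."""
    books = []
    seen = set()
    for item in json.loads(text):
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            raise ValueError(f"missing keys: {sorted(missing)}")
        key = (item["name"], item["author"])
        if key in seen:
            continue  # skip an entry that repeats an earlier name/author pair
        seen.add(key)
        price = item["price"]
        books.append({
            **item,
            # Assumes prices look like "$16.95"; anything else becomes None.
            "price_usd": float(price[1:]) if price and price.startswith("$") else None,
        })
    return books


books = clean_books(response_text)
print(len(books))
```

Fields the model could not see on the page arrive as `null` and are kept as `None`, in line with the prompt's instruction to output only information visible in the web page.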
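6. Object Detection
Gemini can detect objects in an image and generate bounding box coordinates for them. This capability is particularly useful for visually grounding the model's responses in the image, and it offers more added value than specialized models when objects must be reasoned about and located based on user-defined criteria.
The following is an example of object detection with Gemini 1.5 Pro.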
Gemini detects each object in the image and outputs the bounding box coordinates in its response. The image below shows the bounding boxes (Colab notebook).
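This summary does not show the coordinate format. The following is a minimal sketch of turning a returned box into pixel coordinates for drawing, assuming each box arrives as `[ymin, xmin, ymax, xmax]` normalized to a 0 to 1000 range. That is the convention I understand the Gemini API documentation to describe for Gemini 1.5 models, but it should be verified against the current documentation. The function name and the sample values are hypothetical.

```python
def to_pixel_box(box, image_width, image_height, scale=1000):
    """Convert a normalized [ymin, xmin, ymax, xmax] box to pixels.

    Returns (left, top, right, bottom), the order most drawing
    libraries expect for rectangles.
    """
    ymin, xmin, ymax, xmax = box
    left = round(xmin * image_width / scale)
    top = round(ymin * image_height / scale)
    right = round(xmax * image_width / scale)
    bottom = round(ymax * image_height / scale)
    return left, top, right, bottom


# Hypothetical box for a 640x480 image, not taken from the article.
pixel_box = to_pixel_box([250, 100, 750, 500], image_width=640, image_height=480)
print(pixel_box)  # (64, 120, 320, 360)
```

Keeping the conversion in one function makes it easy to change `scale` or the coordinate order if the model is prompted to return boxes in a different format.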
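7. Video Summarization and Transcription
Gemini can process videos of up to 90 minutes, including both the visual frames and the audio, which enables tasks such as video summarization and transcription.
The following example uses Gemini to generate a summary of a video.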
・Prompt
Generate technical lecture notes for this video in a manner that would be understandable for a high school student. Make sure to follow these instructions:
1. Use both the information shown in each slide image and the audio to create the notes. Make it clear when you have used information present in a slide, and pay attention and describe diagrams provided on the slides.
2. Include clear chapters for sections of the lecture
3. Cover content in the entire content of the video, from start to end uniformly.
・Response
Here are the technical lecture notes you requested:
# **Achieving Rapid Response Times in Large Online Services**
Jeff Dean, Google
## **Introduction**
Rapid response times of web applications are important for making them more interactive, fluid, and easy to use for the user. It’s often challenging to keep web apps performing quickly when there’s a large fanout, or the number of servers that need to be contacted in order to fulfill the user’s request. This can be seen in Google Search, where the results page often requires information from thousands of servers to create. This is made even more challenging when Google services are run in a shared environment, or on a large cluster of servers where many different services can be performed. This allows for an array of network issues like traffic congestion, background activity, and spikes in foreground activity for other services being carried out on the same server cluster.
When the latency of applications under these conditions is inspected, they exhibit what's referred to as long-tail latencies. Jeff used an amusing picture of himself on an African safari getting his shirt pulled by a cheetah to illustrate this concept. As he explained, long tail latency means that when you measure the latency of your application and find that it performs quickly on average, the 99th percentile latency could be very long. For example, if your server has a 10 ms average response time and a 99th percentile latency of 1 second, then if you have to get data from only one of those servers, 1% of requests will take more than a second. But when you have to get a response from 100 servers, 63% of your requests will take over a second, since at least one of the servers is likely to be experiencing this high latency.
## **Basic Latency Reduction Techniques**
Some common ways of dealing with these issues are:
* Differentiated service classes: Prioritizing interactive requests and their network traffic much higher than background requests, as this is less likely to affect the user experience if these requests lag.
* Reduce head-of-line blocking: Dividing large requests into many smaller requests to prevent lag for higher-priority requests waiting behind them.
* Manage expensive background activities: Rate-limiting background activities or delaying them until there’s less traffic on the servers, as these are usually not directly associated with a user request.
## **Fault Tolerance vs. Tolerating Variability**
The speaker drew an analogy to fault tolerance, a common hardware technique where unreliable parts of the system, like hard drives or a computer's power supply, are used to create a whole reliable system. By analogy, he wants to use unpredictable components that vary greatly in performance to create a predictable and high-performing system. Jeff pointed out that while both fault tolerance and tolerating variability use extra resources, the difference between the two is in the timescale of their variability. The issues that fault tolerance measures are on a scale of tens or hundreds of events per day, while latency tolerance measures thousands of events per second.
## **Latency Tolerating Techniques**
Here are two techniques that Jeff describes for minimizing variability in latency:
### Cross Request Adaptation
* Collect statistics on the system. This could include latency rates, performance of backends, etc.
* Take action to improve the latency of future requests, for example, by load balancing.
* Timescale for these kinds of actions are generally on the order of tens of seconds to minutes.
### Within-Request Adaptation
* Within a single high-level request, cope with slow subsystems.
* Timescale for these kinds of actions are generally immediate, while the user is waiting for a request to be fulfilled.
## **Fine-Grained Dynamic Partitioning**
One cross request adaption technique that Jeff discussed was fine-grained dynamic partitioning. Normally, if you have ‘n’ servers, you could simply divide the workload into ‘n’ equal pieces, and each server can deal with one piece each, assuming that there’s no shared environment where other things can happen. But once you have a shared environment, the load becomes unpredictable and can result in a server getting overloaded. In the case of a shared environment, it’s recommended to have a server dealing with 10–100 different pieces of work. This allows for very fine-grain load balancing, because if one server is overloaded, one of those pieces of work can be assigned to another server. Another reason for doing this is that it speeds up failure recovery, because when a server dies, whatever it was responsible for is distributed to other machines, and if the workload has been divided into ‘n’ smaller tasks, this recovery process can happen in ‘n’ separate ways simultaneously.
## **Selective Replication**
Another technique often used by Google is called selective replication, where heavily-used pieces of information in the system are copied to other server clusters. This can be static, where the number of copies is fixed, or dynamic, where the number of copies of a piece of information is increased or decreased depending on the amount of traffic there is in requests associated with that information.
## **Latency-Induced Probation**
A third technique that Jeff described for dealing with unpredictable latency and interference effects from shared services was what he calls latency-induced probation, or the concept of removing capacity under load to improve latency. The steps for this are:
* Recognize that a server is slow to respond, even if it is a high priority server.
* Make a copy of the data in question on another server.
* Send a “shadow stream” of requests to the slow server. These requests are similar to “canary requests” in that they serve as a check to make sure the server is functioning.
* Once the latency of the slow server has gone down and the “canary” checks show it working, return it to service.
## **Backup Requests**
Another technique for minimizing latency variability is the use of backup requests, where a client sends a copy of the same request to two or more server clusters in order to improve latency. If one of the servers selected returns the data faster, the client sends a cancellation request for the duplicate request in the other server queue, if it’s possible to maintain information about where the original request was sent. However, the disadvantage of this is that it can double the processing required if two servers begin processing the request at about the same time. In the latter case, the client needs to check if the issue of simultaneous processing occurred, and if so, to send only one copy of the requested data.
The speaker then measured the improvement in latency using two different systems. The first was a loaded server cluster where data was replicated in two in-memory servers, and 1000 requests were spread across 100 tablets. The speaker measured the time it took for all 1000 keys to be retrieved. The second measurement used an almost completely idle system, but with the same parameters: data was replicated in two in-memory servers, 1000 requests were sent across 100 tablets, and the total retrieval time for all 1000 keys was measured. In both cases, backup requests reduced latency dramatically. The results for both loaded and idle servers, respectively, are shown in the tables below:
### Loaded cluster results:
| Policy | Avg | Std Dev | 95%ile | 99%ile | 99.9%ile |
| ------------- |:--------:|:-------:|:------:|:------:|:--------:|
| No backups | 33 ms | 1524 ms| 24 ms | 52 ms | 994 ms |
| Backup after 10 ms | 14 ms | 4 ms | 20 ms | 23 ms | 50 ms |
| Backup after 50 ms | 16 ms | 12 ms | 57 ms | 63 ms | 68 ms |
### Idle cluster results:
| Policy | 50%ile | 90%ile | 99%ile | 99.9%ile |
| ------------- |:--------:|:-------:|:------:|:--------:|
| No backups | 19 ms | 38 ms | 67 ms | 98 ms |
| Backup after 2 ms | 16 ms | 28 ms | 38 ms | 51 ms |
## Conclusion
These techniques can make online services more responsive and can dramatically cut down on processing time and costs.
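As a check on the long-tail latency figures in the lecture notes above: if each server independently has a 1% chance of a slow response, the chance that at least one of n servers is slow is 1 - 0.99^n. The sketch below reproduces the 1% and 63% figures. Independence between servers is an assumption made for this illustration, and the function name is hypothetical.

```python
def fraction_slow(per_server_slow_probability, fanout):
    """Probability that at least one of `fanout` independent servers is slow."""
    return 1 - (1 - per_server_slow_probability) ** fanout


print(f"{fraction_slow(0.01, 1):.0%}")    # 1%
print(f"{fraction_slow(0.01, 100):.0%}")  # 63%
```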
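8. Video Extraction
Gemini can extract information from videos and output it in structured formats such as lists, tables, and JSON. This is particularly useful for tasks such as detecting entities in domains like cataloging, retail, traffic, and home security, and for extracting unstructured data from screen recordings.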