[PHP] Trying Out Whisper.php

Laravel News featured Whisper.php, an automatic speech recognition and transcription tool, and it looked interesting, so I tried it out.

GitHub - CodeWithKyrian/whisper.php: Local Speech to Text in PHP made easy thanks to Whisper.cpp and OpenAI

Prerequisites
Installing Whisper.php
Sample Code
Running It
Japanese Support
Models, Disk Size, and Memory Usage
File Output
Log Output

Prerequisites

Working on Ubuntu 24.04
PHP 8.1 or later installed
FFI extension enabled
Composer installed
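
The PHP-side prerequisites can be checked from a script before installing anything. The following is a minimal sketch; the function name checkPrerequisites is hypothetical and is not part of Whisper.php. It only tests the PHP version and whether the FFI extension is loaded.

```php
<?php

declare(strict_types=1);

/**
 * Returns the unmet prerequisites as human-readable messages.
 * An empty array means both checks passed.
 * (Hypothetical helper for illustration; not part of Whisper.php.)
 */
function checkPrerequisites(string $phpVersion, bool $ffiLoaded): array
{
    $problems = [];

    // The article assumes PHP 8.1 or later.
    if (version_compare($phpVersion, '8.1.0', '<')) {
        $problems[] = "PHP 8.1 or later is required (found {$phpVersion})";
    }

    // The article assumes the FFI extension is enabled.
    if (!$ffiLoaded) {
        $problems[] = 'The FFI extension is not loaded';
    }

    return $problems;
}

$problems = checkPrerequisites(PHP_VERSION, extension_loaded('ffi'));
echo $problems === [] ? "OK\n" : implode("\n", $problems) . "\n";
```

Passing the version and the extension state as arguments keeps the check easy to try with other values.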

Installing Whisper.php

Create a project folder named "whisperphp-app" and move into it.

mkdir whisperphp-app
cd whisperphp-app

Install Whisper.php with Composer.

composer require codewithkyrian/whisper.php

Sample Code

This sample reads a local audio file (mp3) and prints the transcription as text.

▼ "src/UseWhisper.php"

<?php

require_once __DIR__ . '/../vendor/autoload.php';

use Codewithkyrian\Whisper\Whisper;
use function Codewithkyrian\Whisper\readAudio;
use function Codewithkyrian\Whisper\toTimestamp;
 
// Transcribe Audio
$whisper = Whisper::fromPretrained('tiny.en', baseDir: __DIR__.'/models');
$audio = readAudio(__DIR__ . '/../audio/abm00001504.mp3');
$segments = $whisper->transcribe($audio, 4);
 
// Output transcribed segment data
foreach ($segments as $segment) {
    echo toTimestamp($segment->startTimestamp) . ': ' . $segment->text . "\n";
}

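▼ Audio file

I downloaded the volume-adjustment audio file (1:38) for the English listening section of Japan's Common Test for University Admissions and saved it as:

audio/abm00001504.mp3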

Running It

Run the script above.

php -f src/UseWhisper.php

Note: to change the memory limit, add the -d option:

php -d memory_limit=512M -f src/UseWhisper.php

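Because tiny.en is an English-only model, the opening Japanese part comes out as (speaking in foreign language), but the English part from 00:36 to 01:09 is transcribed accurately.

For the part from 01:33 to 01:36, the Japanese sentence

では、イヤホンを耳から外し、静かに机の上に置いてください。

("Now remove the earphones from your ears and place them quietly on the desk.") was transcribed as

She’s a good girl, and she’s a good girl.

but let's call that part of its charm.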

Japanese Support

Changing "tiny.en" to "base" in the code above gives reasonable, though not perfect, support for Japanese.

▼ "src/UseWhisper.php"

<?php

require_once __DIR__ . '/../vendor/autoload.php';

use Codewithkyrian\Whisper\Whisper;
use function Codewithkyrian\Whisper\readAudio;
use function Codewithkyrian\Whisper\toTimestamp;
 
// Transcribe Audio
$whisper = Whisper::fromPretrained('base', baseDir: __DIR__.'/models');
$audio = readAudio(__DIR__ . '/../audio/abm00001504.mp3');
$segments = $whisper->transcribe($audio, 4);
 
// Output transcribed segment data
foreach ($segments as $segment) {
    echo toTimestamp($segment->startTimestamp) . ': ' . $segment->text . "\n";
}

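▼ Output

There are a number of wrong characters, but let's call that part of its charm.

In the end, the result still needs to be checked by human eyes and judgment.

Line 1: × 「長します」 → ○ 「流します」

Line 3: × 「支持」 → ○ 「指示」

Line 19: × 「砂糖中」 → ○ 「作動中」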
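
Once recurring misrecognitions are known for a recording, they can be patched mechanically before the human review. The sketch below is only an illustration: the applyCorrections helper and the CORRECTIONS constant are hypothetical names, and the map contains just the three errors listed above. Since 「支持」 is also a valid word in its own right, a blind replacement like this is only safe for text where the intended word is known.

```php
<?php

declare(strict_types=1);

// Misrecognized text => intended text, taken from the three errors above.
const CORRECTIONS = [
    '長します' => '流します',
    '支持'     => '指示',
    '砂糖中'   => '作動中',
];

/**
 * Applies a fixed correction map to a transcribed string.
 * (Hypothetical helper for illustration; not part of Whisper.php.)
 */
function applyCorrections(string $text, array $corrections = CORRECTIONS): string
{
    // strtr() with an array replaces all keys in one pass,
    // trying the longest keys first.
    return strtr($text, $corrections);
}

echo applyCorrections('英語の音声を約30秒間長します'), "\n";
```

This does not replace proofreading; it only removes errors that have already been identified once.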

Models, Disk Size, and Memory Usage

For the relationship between each model, its disk size, and its memory usage, see the following page.

GitHub - ggml-org/whisper.cpp: Port of OpenAI's Whisper model in C/C++

For convenience, the table is reproduced here.

Model	Disk size	Memory usage
tiny	75 MiB	~273 MB
base	142 MiB	~388 MB
small	466 MiB	~852 MB
medium	1.5 GiB	~2.1 GB
large	2.9 GiB	~3.9 GB
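
The table can be turned into a small lookup for choosing a model that fits a given memory budget. This is a rough sketch: the function name largestModelFor is hypothetical, the figures are the approximate values from the table (with 2.1 GB and 3.9 GB written as 2100 MB and 3900 MB), and actual usage will vary.

```php
<?php

declare(strict_types=1);

// Approximate memory usage in MB, copied from the table above
// (ordered from the smallest model to the largest).
const MODEL_MEMORY_MB = [
    'tiny'   => 273,
    'base'   => 388,
    'small'  => 852,
    'medium' => 2100,
    'large'  => 3900,
];

/**
 * Returns the largest model whose approximate memory usage fits
 * within the budget, or null if even "tiny" does not fit.
 * (Hypothetical helper for illustration; not part of Whisper.php.)
 */
function largestModelFor(int $budgetMb): ?string
{
    $choice = null;
    foreach (MODEL_MEMORY_MB as $model => $memoryMb) {
        if ($memoryMb <= $budgetMb) {
            $choice = $model; // keep the last (largest) model that fits
        }
    }
    return $choice;
}

echo largestModelFor(1024) ?? 'none', "\n"; // prints "small"
```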

The available model names are as follows.

tiny.en
tiny
base.en
base
small.en
small
medium.en
medium
large-v1
large-v2
large-v3
large-v3-turbo
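
A typo in the model name is easy to make, so it can help to validate the name before using it. The sketch below assumes the list above is complete; isKnownModel and isEnglishOnly are hypothetical helpers, not functions provided by Whisper.php. The ".en" suffix marks the English-only models.

```php
<?php

declare(strict_types=1);

// Model names exactly as listed above.
const MODEL_NAMES = [
    'tiny.en', 'tiny', 'base.en', 'base', 'small.en', 'small',
    'medium.en', 'medium', 'large-v1', 'large-v2', 'large-v3', 'large-v3-turbo',
];

/** True if the name appears in the list above (hypothetical helper). */
function isKnownModel(string $name): bool
{
    return in_array($name, MODEL_NAMES, true);
}

/** True for English-only models, which carry the ".en" suffix (hypothetical helper). */
function isEnglishOnly(string $name): bool
{
    return str_ends_with($name, '.en');
}

$model = 'base';
if (!isKnownModel($model)) {
    fwrite(STDERR, "Unknown model: {$model}\n");
    exit(1);
}
echo $model, isEnglishOnly($model) ? " (English only)\n" : " (multilingual)\n";
```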

File Output

Four output formats are supported (.txt, .csv, .vtt, and .srt), but only .txt and .csv are covered here.

▼ "src/UseWhisper.php"

<?php

require_once __DIR__ . '/../vendor/autoload.php';

use Codewithkyrian\Whisper\Whisper;
use function Codewithkyrian\Whisper\outputTxt;
use function Codewithkyrian\Whisper\outputCsv;
use function Codewithkyrian\Whisper\readAudio;
use function Codewithkyrian\Whisper\toTimestamp;

// Transcribe Audio
$whisper = Whisper::fromPretrained('base', baseDir: __DIR__.'/models');
$audio = readAudio(__DIR__ . '/../audio/abm00001504.mp3');
$segments = $whisper->transcribe($audio, 4);
 
// Output transcribed segment data
foreach ($segments as $segment) {
    echo toTimestamp($segment->startTimestamp) . ': ' . $segment->text . "\n";
}

// Write the segments to files (relative paths are resolved against the current working directory)
outputTxt($segments, 'output/abm00001504.txt');
outputCsv($segments, 'output/abm00001504.csv');

▼ .txt output

これから音量を調節します 英語の音声を約30秒間長します
その間にあなたが聞きやすい音量に調節してください この英語は問題そのものではありませんので内容を把握する必要はありません
音声の最後でイヤフォンを外すよう支持します 支持があったらすぐに外し机の上に置いてください
それでは音量の調節を始めます
 Let's talk about the newsletter
OK! Let's check what we've got so far
We've decided to have one main story and one short story, right?
 Right! And what about pictures?
 Should we have one for each story?
 I'm not so sure about that.
 Maybe it would be too much.
 How about just for the main story?
 That sounds good.
 Now what will our stories be?
 We could do one about the students who visited from Hawaii.
 Maybe we could use one of the photos they sent us
これで音量の調節は終わりです
この後、監督者の指示で試験を始めますが 音量は試験の最中一でも調節できます
なお、次の再生ボタンも 砂糖中ランプが光るまで長く押し続けるボタンですから注意してください
では、イヤフォンを耳から外し 静かに机の上に置いてください
[音楽]

▼ .csv output

0,7360,"これから音量を調節します 英語の音声を約30秒間長します"
7360,19840,"その間にあなたが聞きやすい音量に調節してください この英語は問題そのものではありませんので内容を把握する必要はありません"
19840,29880,"音声の最後でイヤフォンを外すよう支持します 支持があったらすぐに外し机の上に置いてください"
29880,36160,それでは音量の調節を始めます
36160,38160," Let's talk about the newsletter"
38160,41320,"OK! Let's check what we've got so far"
41320,46160,"We've decided to have one main story and one short story, right?"
46160,49060," Right! And what about pictures?"
49060,51360," Should we have one for each story?"
51360,53560," I'm not so sure about that."
53560,55560," Maybe it would be too much."
55560,58240," How about just for the main story?"
58240,59800," That sounds good."
59800,62280," Now what will our stories be?"
62280,65880," We could do one about the students who visited from Hawaii."
65880,69600," Maybe we could use one of the photos they sent us"
69600,72880,これで音量の調節は終わりです
72880,81880,"この後、監督者の指示で試験を始めますが 音量は試験の最中一でも調節できます"
81880,91080,"なお、次の再生ボタンも 砂糖中ランプが光るまで長く押し続けるボタンですから注意してください"
91080,96280,"では、イヤフォンを耳から外し 静かに机の上に置いてください"
96280,98280,[音楽]
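
Each .csv row holds a start value, an end value, and the text. Judging from the values (the last row ends at 98280 for a 1:38 recording), the first two columns appear to be milliseconds; that is an inference from this output, not something confirmed by the library documentation. Under that assumption, the rows can be read back and formatted with plain PHP. The helpers msToSrtTime and parseCsvRow below are hypothetical and only illustrate the row structure.

```php
<?php

declare(strict_types=1);

/** Formats milliseconds as HH:MM:SS,mmm (the timestamp layout used in SRT files). */
function msToSrtTime(int $ms): string
{
    $hours   = intdiv($ms, 3_600_000);
    $minutes = intdiv($ms % 3_600_000, 60_000);
    $seconds = intdiv($ms % 60_000, 1_000);

    return sprintf('%02d:%02d:%02d,%03d', $hours, $minutes, $seconds, $ms % 1_000);
}

/** Parses one CSV line into [startMs, endMs, text]. */
function parseCsvRow(string $line): array
{
    // Delimiter, enclosure, and (empty) escape are passed explicitly
    // instead of relying on the defaults.
    [$start, $end, $text] = str_getcsv($line, ',', '"', '');

    // Some rows start with a space inside the quotes, so trim the text.
    return [(int) $start, (int) $end, trim($text)];
}

// Two rows copied from the .csv output above.
$lines = [
    '36160,38160," Let\'s talk about the newsletter"',
    '41320,46160,"We\'ve decided to have one main story and one short story, right?"',
];

foreach ($lines as $i => $line) {
    [$start, $end, $text] = parseCsvRow($line);
    echo $i + 1, "\n", msToSrtTime($start), ' --> ', msToSrtTime($end), "\n", $text, "\n\n";
}
```

Since the library can write .srt files itself, this is only useful for post-processing an existing .csv file.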

Log Output

▼ "src/UseWhisper.php"

<?php

require_once __DIR__ . '/../vendor/autoload.php';

use Codewithkyrian\Whisper\Whisper;
use Codewithkyrian\Whisper\WhisperLogger;
use function Codewithkyrian\Whisper\readAudio;
use function Codewithkyrian\Whisper\toTimestamp;

// Set Logger
Whisper::setLogger(new WhisperLogger(__DIR__ . '/../output/whisper.log'));

// Transcribe Audio
$whisper = Whisper::fromPretrained('base', baseDir: __DIR__.'/models');
$audio = readAudio(__DIR__ . '/../audio/abm00001504.mp3');
$segments = $whisper->transcribe($audio, 4);
 
// Output transcribed segment data
foreach ($segments as $segment) {
    echo toTimestamp($segment->startTimestamp) . ': ' . $segment->text . "\n";
}

▼ "output/whisper.log"

[2024-12-15 05:08:32] whisper.info: whisper_init_from_file_with_params_no_state: loading model from '/home/macocci7/work/whisperphp-app/src/models/ggml-base.bin' []
[2024-12-15 05:08:32] whisper.info: whisper_init_with_params_no_state: use gpu    = 0 []
[2024-12-15 05:08:32] whisper.info: whisper_init_with_params_no_state: flash attn = 0 []
[2024-12-15 05:08:32] whisper.info: whisper_init_with_params_no_state: gpu_device = 0 []
[2024-12-15 05:08:32] whisper.info: whisper_init_with_params_no_state: dtw        = 0 []
[2024-12-15 05:08:32] whisper.info: whisper_init_with_params_no_state: backends   = 1 []
[2024-12-15 05:08:32] whisper.info: whisper_model_load: loading model []
[2024-12-15 05:08:32] whisper.info: whisper_model_load: n_vocab       = 51865 []
[2024-12-15 05:08:32] whisper.info: whisper_model_load: n_audio_ctx   = 1500 []
[2024-12-15 05:08:32] whisper.info: whisper_model_load: n_audio_state = 512 []
[2024-12-15 05:08:32] whisper.info: whisper_model_load: n_audio_head  = 8 []
[2024-12-15 05:08:32] whisper.info: whisper_model_load: n_audio_layer = 6 []
[2024-12-15 05:08:32] whisper.info: whisper_model_load: n_text_ctx    = 448 []
[2024-12-15 05:08:32] whisper.info: whisper_model_load: n_text_state  = 512 []
[2024-12-15 05:08:32] whisper.info: whisper_model_load: n_text_head   = 8 []
[2024-12-15 05:08:32] whisper.info: whisper_model_load: n_text_layer  = 6 []
[2024-12-15 05:08:32] whisper.info: whisper_model_load: n_mels        = 80 []
[2024-12-15 05:08:32] whisper.info: whisper_model_load: ftype         = 1 []
[2024-12-15 05:08:32] whisper.info: whisper_model_load: qntvr         = 0 []
[2024-12-15 05:08:32] whisper.info: whisper_model_load: type          = 2 (base) []
[2024-12-15 05:08:32] whisper.info: whisper_model_load: adding 1608 extra tokens []
[2024-12-15 05:08:32] whisper.info: whisper_model_load: n_langs       = 99 []
[2024-12-15 05:08:32] whisper.info: whisper_model_load:      CPU total size =   147.37 MB []
[2024-12-15 05:08:32] whisper.info: whisper_model_load: model size    =  147.37 MB []
[2024-12-15 05:08:32] whisper.info: whisper_init_state: kv self size  =    6.29 MB []
[2024-12-15 05:08:32] whisper.info: whisper_init_state: kv cross size =   18.87 MB []
[2024-12-15 05:08:32] whisper.info: whisper_init_state: kv pad  size  =    3.15 MB []
[2024-12-15 05:08:32] whisper.info: whisper_init_state: compute buffer (conv)   =   16.26 MB []
[2024-12-15 05:08:32] whisper.info: whisper_init_state: compute buffer (encode) =   85.86 MB []
[2024-12-15 05:08:32] whisper.info: whisper_init_state: compute buffer (cross)  =    4.65 MB []
[2024-12-15 05:08:32] whisper.info: whisper_init_state: compute buffer (decode) =   96.35 MB []
[2024-12-15 05:08:32] whisper.info: whisper_init_state: kv self size  =    6.29 MB []
[2024-12-15 05:08:32] whisper.info: whisper_init_state: kv cross size =   18.87 MB []
[2024-12-15 05:08:32] whisper.info: whisper_init_state: kv pad  size  =    3.15 MB []
[2024-12-15 05:08:32] whisper.info: whisper_init_state: compute buffer (conv)   =   16.26 MB []
[2024-12-15 05:08:32] whisper.info: whisper_init_state: compute buffer (encode) =   85.86 MB []
[2024-12-15 05:08:32] whisper.info: whisper_init_state: compute buffer (cross)  =    4.65 MB []
[2024-12-15 05:08:32] whisper.info: whisper_init_state: compute buffer (decode) =   96.35 MB []
[2024-12-15 05:08:33] whisper.info: whisper_full_with_state: auto-detected language: ja (p = 0.998259) []
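
The last line of the log records the automatically detected language and its probability. As a minimal sketch, that value can be pulled out of the log with a regular expression. The function name parseDetectedLanguage is hypothetical, and the pattern assumes the message format shown in the log above, which may differ in other versions.

```php
<?php

declare(strict_types=1);

/**
 * Extracts the auto-detected language and its probability from a log line.
 * Returns null when the line does not contain that message.
 * (Hypothetical helper for illustration; not part of Whisper.php.)
 */
function parseDetectedLanguage(string $line): ?array
{
    $pattern = '/auto-detected language: (\w+) \(p = ([0-9.]+)\)/';
    if (preg_match($pattern, $line, $matches) !== 1) {
        return null;
    }

    return ['language' => $matches[1], 'probability' => (float) $matches[2]];
}

// Last line of the log shown above.
$line = "[2024-12-15 05:08:33] whisper.info: whisper_full_with_state: "
      . "auto-detected language: ja (p = 0.998259) []";

print_r(parseDetectedLanguage($line));
```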

That's all.