transcribe()
v4.0.131
Transcribes a media file using Whisper.cpp.
You should first install Whisper.cpp, for example through installWhisperCpp().
This function only works with Whisper.cpp 1.5.5 or later, unless tokenLevelTimestamps is set to false.
transcribe.mjs
```tsx
import path from 'path';
import {transcribe} from '@remotion/install-whisper-cpp';

const {transcription} = await transcribe({
  inputPath: '/path/to/audio.wav',
  whisperPath: path.join(process.cwd(), 'whisper.cpp'),
  model: 'medium.en',
  tokenLevelTimestamps: true,
});

for (const token of transcription) {
  console.log(token.timestamps.from, token.timestamps.to, token.text);
}
```
Options
inputPath
The path to the file you want to extract text from.
The file has to be a 16 kHz WAV file. You can extract one from a video or audio file, for example with FFmpeg, using the following command:
```bash
ffmpeg -i input.mp4 -ar 16000 output.wav -y
```
If you don't want to install FFmpeg, you can also use the smaller FFmpeg binary provided by Remotion.
```bash
npx remotion ffmpeg -i input.mp4 -ar 16000 output.wav -y
```
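If the conversion should happen from a Node.js script rather than the shell, it could be wrapped roughly as follows. This is a minimal sketch, not part of the package: buildFfmpegArgs and convertTo16kHzWav are hypothetical helper names, and the sketch assumes an ffmpeg binary is available on the PATH.

```typescript
import {execFileSync} from 'node:child_process';

// Builds the same argument list as the FFmpeg command above:
// ffmpeg -i <input> -ar 16000 <output> -y
function buildFfmpegArgs(inputPath: string, outputPath: string): string[] {
  return ['-i', inputPath, '-ar', '16000', outputPath, '-y'];
}

// Hypothetical helper: runs FFmpeg synchronously and throws if it exits
// with a non-zero code. Assumes `ffmpeg` can be found on the PATH.
function convertTo16kHzWav(inputPath: string, outputPath: string): void {
  execFileSync('ffmpeg', buildFfmpegArgs(inputPath, outputPath), {
    stdio: 'inherit',
  });
}
```

Calling convertTo16kHzWav('input.mp4', 'output.wav') would then produce a file whose path can be passed as inputPath.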
whisperPath
The path to your whisper.cpp folder.
If you haven't installed Whisper.cpp, you can do so for example through installWhisperCpp() and use the same folder.
tokenLevelTimestamps
v4.0.131
Passes the --dtw flag to Whisper.cpp to generate more accurate timestamps, which are returned under the t_dtw field.
Recommended for accurate timings, but only available from Whisper.cpp version 1.5.5 onwards.
Set to false if you use an older version of Whisper.cpp.
model?
default: base.en
The Whisper model to use for the transcription.
Possible values: tiny, tiny.en, base, base.en, small, small.en, medium, medium.en, large-v1, large-v2, large-v3, large-v3-turbo.
Make sure the model you want to use exists in your whisper.cpp/models folder. You can ensure a specific model is available locally by using the downloadWhisperModel() API.
Note: large-v3-turbo only works properly with Whisper.cpp versions built from November 2024 or later and Remotion v4.0.229 or later.
modelFolder?
default: whisperPath/models
If you saved Whisper models to a specific folder, pass its path here.
Defaults to the whisper.cpp/models folder at the location defined through whisperPath.
translateToEnglish?
default: false
Set this boolean flag to true if you want to get an English translation of the transcription of the provided file.
Make sure not to use a *.en model, as these models cannot translate a foreign language to English.
We recommend using at least the medium model to get satisfactory results when translating.
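The restriction on *.en models can be checked before starting a transcription. The following is a minimal sketch: isEnglishOnlyModel and assertCanTranslate are hypothetical helpers, not part of the package, and the WhisperModel type simply mirrors the model names listed on this page.

```typescript
// Model names as listed under the `model` option.
type WhisperModel =
  | 'tiny'
  | 'tiny.en'
  | 'base'
  | 'base.en'
  | 'small'
  | 'small.en'
  | 'medium'
  | 'medium.en'
  | 'large-v1'
  | 'large-v2'
  | 'large-v3'
  | 'large-v3-turbo';

// English-only variants carry the ".en" suffix.
function isEnglishOnlyModel(model: WhisperModel): boolean {
  return model.endsWith('.en');
}

// Hypothetical pre-flight check: throws if translation is requested
// together with an English-only model.
function assertCanTranslate(
  model: WhisperModel,
  translateToEnglish: boolean,
): void {
  if (translateToEnglish && isEnglishOnlyModel(model)) {
    throw new Error(
      `Model "${model}" is English-only and cannot translate. Use a multilingual model such as "medium".`,
    );
  }
}
```

Running assertCanTranslate(model, translateToEnglish) before calling transcribe() surfaces the misconfiguration early instead of yielding an untranslated result.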
printOutput?
v4.0.132
Whether to print the output of the transcription process to the console. Defaults to true.
tokensPerItem?
v4.0.141
default: 1
The maximum number of tokens included in each transcription item.
Set this to null to use whisper.cpp's default token grouping (useful for generating a movie-style transcription).
tokensPerItem can only be set when tokenLevelTimestamps is set to false.
splitOnWord?
v4.0.208
Adds the --split-on-word flag to Whisper.cpp for cleaner word-for-word output.
language?
v4.0.142
default: null
Passes the -l flag to Whisper.cpp to specify the spoken language of the audio file.
Possible values, by language name: Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Bashkir, Basque, Belarusian, Bengali, Bosnian, Breton, Bulgarian, Burmese, Castilian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Faroese, Finnish, Flemish, French, Galician, Georgian, German, Greek, Gujarati, Haitian, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Lao, Latin, Latvian, Letzeburgesch, Lingala, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Moldavian, Moldovan, Mongolian, Myanmar, Nepali, Norwegian, Nynorsk, Occitan, Panjabi, Pashto, Persian, Polish, Portuguese, Punjabi, Pushto, Romanian, Russian, Sanskrit, Serbian, Shona, Sindhi, Sinhala, Sinhalese, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Tibetan, Turkish, Turkmen, Ukrainian, Urdu, Uzbek, Valencian, Vietnamese, Welsh, Yiddish, Yoruba, Zulu.
Possible values, by language code: af, am, ar, as, az, ba, be, bg, bn, bo, br, bs, ca, cs, cy, da, de, el, en, es, et, eu, fa, fi, fo, fr, gl, gu, ha, haw, he, hi, hr, ht, hu, hy, id, is, it, ja, jw, ka, kk, km, kn, ko, la, lb, ln, lo, lt, lv, mg, mi, mk, ml, mn, mr, ms, mt, my, ne, nl, nn, no, oc, pa, pl, ps, pt, ro, ru, sa, sd, si, sk, sl, sn, so, sq, sr, su, sv, sw, ta, te, tg, th, tk, tl, tr, tt, uk, ur, uz, vi, yi, yo, zh or auto.
signal?
v4.0.156
A signal from an AbortController to cancel the transcription process.
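One way to obtain such a signal, including an optional timeout, is sketched below. createCancellation is a hypothetical helper and not part of the package; AbortController itself is the standard API available as a global in current Node.js versions.

```typescript
type Cancellation = {
  signal: AbortSignal;
  cancel: () => void;
};

// Hypothetical helper: creates an AbortController and, if `timeoutMs` is
// given, aborts automatically once that many milliseconds have passed.
function createCancellation(timeoutMs?: number): Cancellation {
  const controller = new AbortController();
  const timer =
    timeoutMs === undefined
      ? null
      : setTimeout(() => controller.abort(), timeoutMs);

  const cancel = () => {
    // Clear the pending timer so it does not keep the process alive.
    if (timer !== null) {
      clearTimeout(timer);
    }
    controller.abort();
  };

  return {signal: controller.signal, cancel};
}
```

The returned signal would be passed as the signal option of transcribe(), and cancel() called from wherever the user requests cancellation, for example a SIGINT handler.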
onProgress?
v4.0.156
Listen for progress updates from the transcription process.
The progress is a number between 0 and 1.
```tsx
import type {TranscribeOnProgress} from '@remotion/install-whisper-cpp';

const onProgress: TranscribeOnProgress = (progress) => {
  console.log(`Transcription progress: ${progress * 100}%`);
};
```
Return value
TranscriptionJson
An object containing all the metadata and transcriptions resulting from the transcription process.
```ts
type Timestamps = {
  from: string;
  to: string;
};

type Offsets = {
  from: number;
  to: number;
};

type WordLevelToken = {
  t_dtw: number;
  text: string;
  timestamps: Timestamps;
  offsets: Offsets;
  id: number;
  p: number;
};

type TranscriptionItem = {
  timestamps: Timestamps;
  offsets: Offsets;
  text: string;
};

type TranscriptionItemWithTimestamp = TranscriptionItem & {
  tokens: WordLevelToken[];
};

type Model = {
  type: string;
  multilingual: boolean;
  vocab: number;
  audio: {
    ctx: number;
    state: number;
    head: number;
    layer: number;
  };
  text: {
    ctx: number;
    state: number;
    head: number;
    layer: number;
  };
  mels: number;
  ftype: number;
};

type Params = {
  model: string;
  language: string;
  translate: boolean;
};

type Result = {
  language: string;
};

export type TranscriptionJson<WithTokenLevelTimestamp extends boolean> = {
  systeminfo: string;
  model: Model;
  params: Params;
  result: Result;
  transcription: true extends WithTokenLevelTimestamp
    ? TranscriptionItemWithTimestamp[]
    : TranscriptionItem[];
};
```
For accurate timestamps, prefer the t_dtw value over offsets.
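As an illustration of reading t_dtw, the sketch below flattens the tokens of all transcription items into one list. The types are local copies of the shapes documented above, collectTimedTokens is a hypothetical helper, and skipping negative t_dtw values rests on the assumption that Whisper.cpp uses a negative value when no DTW timestamp was computed for a token. The raw t_dtw value is passed through without unit conversion.

```typescript
// Local copies of the documented shapes, so the sketch is self-contained.
type Timestamps = {from: string; to: string};
type Offsets = {from: number; to: number};
type WordLevelToken = {
  t_dtw: number;
  text: string;
  timestamps: Timestamps;
  offsets: Offsets;
  id: number;
  p: number;
};
type TranscriptionItemWithTimestamp = {
  timestamps: Timestamps;
  offsets: Offsets;
  text: string;
  tokens: WordLevelToken[];
};

type TimedToken = {text: string; tDtw: number};

// Hypothetical helper: collects every token together with its raw t_dtw value.
function collectTimedTokens(
  items: TranscriptionItemWithTimestamp[],
): TimedToken[] {
  const result: TimedToken[] = [];
  for (const item of items) {
    for (const token of item.tokens) {
      // Assumption: a negative t_dtw means no DTW timestamp is available.
      if (token.t_dtw < 0) {
        continue;
      }
      result.push({text: token.text, tDtw: token.t_dtw});
    }
  }
  return result;
}
```

This only applies when tokenLevelTimestamps is set to true, since the tokens array is not part of the result otherwise.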
Use convertToCaptions() to apply our opinionated suggestion for postprocessing the captions.