vllm.config.speech_to_text ¶
SpeechToTextConfig ¶
Configuration for speech-to-text models.
Source code in vllm/config/speech_to_text.py
max_audio_clip_s class-attribute instance-attribute ¶
max_audio_clip_s: int | None = 30
Maximum duration in seconds for a single audio clip without chunking. Audio longer than this will be split into smaller chunks if allow_audio_chunking evaluates to True; otherwise it will be rejected. If None, audio duration is unlimited and no chunking is done.
min_energy_split_window_size class-attribute instance-attribute ¶
min_energy_split_window_size: int | None = 1600
Window size in samples for finding low-energy (quiet) regions to split audio chunks. The algorithm looks for the quietest moment within this window to minimize cutting through speech. Default 1600 samples ≈ 100ms at 16kHz. If None, no chunking will be done.
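The "quietest moment" search can be illustrated with a minimal sketch. This is not vLLM's code; the function name and the use of squared amplitude as the energy measure are assumptions.

```python
def quietest_split_point(samples: list[float], window_start: int,
                         window_size: int) -> int:
    """Return the index of the minimum-energy sample inside the window,
    so a chunk boundary can avoid cutting through speech."""
    window = samples[window_start:window_start + window_size]
    # Treat a sample's energy as its squared amplitude; the argmin is
    # the quietest spot within the search window.
    energies = [s * s for s in window]
    return window_start + energies.index(min(energies))
```

At 16 kHz, a window of 1600 samples corresponds to the documented ~100 ms search region around each candidate split.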
SpeechToTextParams dataclass ¶
All parameters consumed by get_generation_prompt().
TranscriptionRequest.build_stt_params() constructs this object, mapping API-level fields to typed attributes. Models receive only this object, so new parameters can be added here without changing the get_generation_prompt() signature.
Source code in vllm/config/speech_to_text.py
hotwords class-attribute instance-attribute ¶
hotwords: str | None = None
Important words or phrases that the model should pay extra attention to during transcription.
language class-attribute instance-attribute ¶
language: str | None = None
ISO 639-1 language code (validated, or auto-detected when not provided).
request_prompt class-attribute instance-attribute ¶
request_prompt: str = ''
Optional text prompt to guide the model.
stt_config instance-attribute ¶
stt_config: SpeechToTextConfig
Server-level speech-to-text configuration.