Multimedia Chunking

Up until now, our examples have focused on applying chunking and batching techniques to simple text inputs. However, these strategies can also be extended to other media types, such as audio and video. In this section, we focus on how these techniques can be adapted for other input types, and also introduce other methods that are unique to specific media types.

Duration Chunking

In the Fixed Chunking section of this guide, we introduced straightforward techniques for splitting large blocks of text into smaller chunks based on character count. For media inputs (such as video and audio), we can instead split the content at fixed time intervals (e.g. every 30 seconds) to produce shorter clips, which contain fewer tokens and can therefore fit within the model’s context window. The code sample below demonstrates this, using a sliding window so that consecutive clips overlap.

In this code sample, the input video file is split into multiple 100-second clips, with each clip overlapping the previous one by 10 seconds (a sliding window).

import math

import ffmpeg # The ffmpeg-python package, which wraps the FFmpeg command-line tool.

input_file_path = "path/to/media/file"

chunk_duration = 100 # The duration (in seconds) of each chunk.
window_duration = 10 # The duration (in seconds) by which consecutive chunks overlap.
step = chunk_duration - window_duration # The gap (in seconds) between the start times of consecutive chunks.

# Using the FFmpeg tool to retrieve the duration of the video in seconds.
video_duration = float(ffmpeg.probe(input_file_path)['format']['duration'])
# The final chunk only needs to start early enough to reach the end of the video.
number_of_chunks = max(1, math.ceil((video_duration - window_duration) / step))

chunked_files = [] # This will hold the file names of all the chunks so they can be easily accessed.

for i in range(number_of_chunks):
    chunk_start_time = i * step
    # Using the FFmpeg tool to trim the video into chunks, saving each as a file named 'chunk_i.mp4' where 'i' is the chunk number.
    # 'ss' seeks to the start time, 't' limits the length of the clip, and c='copy' avoids re-encoding.
    ffmpeg.input(input_file_path, ss=chunk_start_time).output(f'chunk_{i}.mp4', t=chunk_duration, c='copy').run(overwrite_output=True, capture_stdout=True, capture_stderr=True)
    chunked_files.append(f'chunk_{i}.mp4')
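
The window arithmetic can be checked on its own, without FFmpeg or a media file. The following is a minimal sketch: the function name `sliding_window_bounds` and its parameters are hypothetical and not part of this guide's library. It returns the start and end time of every clip and stops as soon as a clip reaches the end of the file.

```python
import math


def sliding_window_bounds(total_duration, chunk_duration=100.0, overlap=10.0):
    """Return (start, end) times in seconds for overlapping fixed-length chunks."""
    if overlap >= chunk_duration:
        raise ValueError("overlap must be shorter than chunk_duration")
    step = chunk_duration - overlap  # Distance between consecutive start times.
    # The last chunk only has to start early enough to reach the end of the file.
    count = max(1, math.ceil((total_duration - overlap) / step))
    return [
        (i * step, min(i * step + chunk_duration, total_duration))
        for i in range(count)
    ]


print(sliding_window_bounds(250.0))
# [(0.0, 100.0), (90.0, 190.0), (180.0, 250.0)]
```

Each start time can be passed to FFmpeg as the seek position, and the difference between end and start as the clip length.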

These chunks can then be used to query the Gemini API as follows:

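import logging
import time

from google import genai
from google.genai import types

client = genai.Client() # Skip this line if a client has already been created; the API key is read from the environment.

# Note: system_prompt and question are assumed to have been defined already.
uploaded_file = client.files.upload(file=chunked_files[0]) # Uploading the 1st chunk of the video to the Gemini API.

# It can take time for the file to be processed, so we poll its state until it is available.
while uploaded_file.state.name == "PROCESSING" or uploaded_file.state.name == "PENDING":
    logging.info(f'Waiting for file {chunked_files[0]} to upload, current state is {uploaded_file.state.name}')
    time.sleep(5)
    uploaded_file = client.files.get(name=uploaded_file.name) # Refresh the state; without this the loop would never exit.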

# The uploaded file can now be included in the prompt as part of the contents.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    config=types.GenerateContentConfig(
        system_instruction=system_prompt,
        thinking_config=types.ThinkingConfig(thinking_budget=0)
    ),
    contents=[f'\nQuestion:\n{question}', f'Contents (attached as file named {chunked_files[0]})', uploaded_file]
)

Transcript-based Chunking

Several models are designed to generate a transcript from an audio or video file. Two of the most prominent options are Google’s Speech-to-Text AI and OpenAI’s Whisper, both of which are dedicated to this task. However, the Gemini family of models also performs well at transcription: simply upload the file and ask Gemini to create a transcript, as follows:

input_file_path = "path/to/media/file.mp3"
uploaded_file = client.files.upload(file=input_file_path) # Uploading the file to Gemini

# It can take time for the file to be processed, so we poll its state until it is available.
while uploaded_file.state.name == "PROCESSING" or uploaded_file.state.name == "PENDING":
    logging.info(f'Waiting for file {input_file_path} to upload, current state is {uploaded_file.state.name}')
    time.sleep(5)
    uploaded_file = client.files.get(name=uploaded_file.name) # Refresh the state; without this the loop would never exit.

# We also get the start time of each sentence to allow us to chunk sentences later on.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=["Create a transcript of the provided media, in the format of: start time of sentence in seconds, caption.", uploaded_file]
)

Using a Gemini model for transcription allows us to use the same model family throughout the entire library. We also recommend extracting the audio from video files (such as .mp4) and submitting only this audio to the model rather than the entire file. This significantly reduces the file size, which in turn reduces token usage and processing time.
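
As a rough illustration of the audio extraction step, the sketch below builds an FFmpeg command that drops the video stream (`-vn`) and writes the audio to a separate file. The helper names and the file names are hypothetical, and the command is run through `subprocess` rather than the FFmpeg wrapper used earlier so that the individual flags are visible. It assumes the `ffmpeg` executable is available on the PATH.

```python
import subprocess


def build_audio_extract_command(video_path, audio_path):
    """Build an FFmpeg command that discards the video stream and keeps the audio."""
    # -y overwrites an existing output file; -vn disables video in the output.
    # FFmpeg picks an audio codec based on the output file's extension.
    return ["ffmpeg", "-y", "-i", video_path, "-vn", audio_path]


def extract_audio(video_path, audio_path):
    """Run the command above and return the path of the extracted audio file."""
    subprocess.run(build_audio_extract_command(video_path, audio_path), check=True)
    return audio_path


# Inspect the command without running it (hypothetical file names).
print(build_audio_extract_command("lecture.mp4", "lecture.mp3"))
```

The resulting audio file can then be uploaded in place of the original video.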

Once a transcript has been produced, the various text-based chunking methods, such as fixed and semantic chunking, can be applied to it. Once the chunks have been determined, the start and end timestamps of each chunk can be looked up in the transcript, and the media file can be trimmed into the corresponding clips, which are then uploaded to the model as shown in the Duration Chunking section.
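
The step from transcript to clip timestamps might look like the sketch below. It assumes the model returns one sentence per line in the form "start time in seconds, caption", which is what the prompt asks for but is not guaranteed, so unparsable lines are skipped. The function names, the character limit and the sample transcript are all hypothetical.

```python
def parse_transcript(text):
    """Parse lines of the form '<start seconds>, <caption>' into (start, caption) pairs."""
    sentences = []
    for line in text.splitlines():
        start, sep, caption = line.partition(",")
        if not sep:
            continue  # Skip blank lines or lines without a comma.
        try:
            sentences.append((float(start), caption.strip()))
        except ValueError:
            continue  # Skip lines whose first field is not a number (e.g. a header).
    return sentences


def chunk_by_characters(sentences, max_chars, media_duration):
    """Group whole sentences into chunks of at most max_chars; return (start, end, text)."""
    groups, current, length = [], [], 0
    for start, caption in sentences:
        if current and length + len(caption) > max_chars:
            groups.append(current)
            current, length = [], 0
        current.append((start, caption))
        length += len(caption)
    if current:
        groups.append(current)
    chunks = []
    for i, group in enumerate(groups):
        # A chunk ends where the first sentence of the next chunk starts.
        end = groups[i + 1][0][0] if i + 1 < len(groups) else media_duration
        chunks.append((group[0][0], end, " ".join(c for _, c in group)))
    return chunks


transcript = (
    "0.0, Welcome to the talk.\n"
    "4.2, Today we cover chunking.\n"
    "9.8, First, fixed chunks.\n"
    "15.1, Then, semantic chunks."
)
for chunk in chunk_by_characters(parse_transcript(transcript), max_chars=50, media_duration=20.0):
    print(chunk)
```

Each (start, end) pair can then be used to trim the media file in the same way as the fixed-duration clips.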

Other Chunking Methods

So far in this section, we have demonstrated how a transcript can be generated from a video or audio file, which then allows text-based chunking and batching techniques to be applied. However, video and audio inputs have additional features which can also be used to create chunks, either by themselves or in combination with the text-based methods.

Audio Methods

Speaker diarization is the process of identifying and separating individual speakers in an audio recording. It can also be used to determine natural breaks in speech. Both speaker changes and these breaks act as good chunking points. One useful library for this is pyannote.audio, which provides pretrained models for speaker diarization and voice activity detection.
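
To show how diarization output could be turned into chunk boundaries, the sketch below merges consecutive turns by the same speaker and starts a new chunk at every speaker change. It deliberately works on plain (start, end, speaker) tuples so it runs without any model. With pyannote.audio, such tuples could be collected by iterating the diarization result with `itertracks(yield_label=True)`, although that step is an assumption and is not shown here. The function name and the sample turns are hypothetical.

```python
def speaker_change_chunks(turns):
    """Merge consecutive turns by the same speaker into (start, end, speaker) chunks."""
    chunks = []
    for start, end, speaker in sorted(turns):
        if chunks and chunks[-1][2] == speaker:
            # Same speaker as the previous turn: extend the current chunk.
            chunks[-1] = (chunks[-1][0], max(chunks[-1][1], end), speaker)
        else:
            # Speaker change: start a new chunk at this turn.
            chunks.append((start, end, speaker))
    return chunks


turns = [
    (0.0, 4.0, "SPEAKER_00"),
    (4.5, 9.0, "SPEAKER_00"),
    (9.2, 15.0, "SPEAKER_01"),
    (15.5, 18.0, "SPEAKER_00"),
]
for chunk in speaker_change_chunks(turns):
    print(chunk)
```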

Video Methods

In the same way that we can extract the transcript of a video or audio file for text processing, we can also extract the audio track of a video and apply audio-specific techniques, such as the speaker diarization described above.

Finally, video content also provides visual cues for chunking: a scene change or a camera cut can each provide a good chunking position. One useful tool for this is PySceneDetect, a Python package that can automatically detect shot changes in videos and split them into separate clips.
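
A minimal sketch of scene-based chunking is shown below. It assumes PySceneDetect 0.6 or later, whose `detect` function and `ContentDetector` class return a list of (start, end) timecodes for each scene. Because individual shots can be very short, the second helper merges adjacent scenes until each chunk reaches a minimum length. Both helper names and the 10-second threshold are hypothetical choices, not part of PySceneDetect.

```python
def detect_scene_bounds(video_path):
    """Return (start, end) times in seconds for each scene found by PySceneDetect."""
    # Imported lazily so merge_short_scenes can be used without PySceneDetect installed.
    from scenedetect import ContentDetector, detect

    scene_list = detect(video_path, ContentDetector())
    return [(start.get_seconds(), end.get_seconds()) for start, end in scene_list]


def merge_short_scenes(scenes, min_duration):
    """Merge adjacent scenes until each chunk lasts at least min_duration seconds."""
    chunks = []
    for start, end in scenes:
        if chunks and chunks[-1][1] - chunks[-1][0] < min_duration:
            chunks[-1] = (chunks[-1][0], end)  # Previous chunk is too short: extend it.
        else:
            chunks.append((start, end))
    # A short trailing scene is folded into the chunk before it.
    if len(chunks) > 1 and chunks[-1][1] - chunks[-1][0] < min_duration:
        last = chunks.pop()
        chunks[-1] = (chunks[-1][0], last[1])
    return chunks


# Scene boundaries as they might be returned by detect_scene_bounds (made-up values).
scenes = [(0.0, 3.0), (3.0, 12.0), (12.0, 14.0), (14.0, 30.0), (30.0, 32.0)]
print(merge_short_scenes(scenes, min_duration=10.0))
# [(0.0, 12.0), (12.0, 32.0)]
```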