Announcement

Marqo Now Supports Video and Audio Search

We are thrilled to announce a significant enhancement to Marqo's search capabilities: Marqo now supports video and audio search! With the integration of LanguageBind models, you can index, embed, and search video and audio files seamlessly. This advancement extends Marqo's functionality beyond text and images, opening up a new world of possibilities for multimodal search.

Multimodal Search Capabilities

Index and Embed Video and Audio Files

Marqo's new release allows you to:

  • Index: Add video and audio files to your search index with ease.
  • Embed: Generate rich embeddings for multimedia content using advanced LanguageBind models.
  • Search: Perform comprehensive searches across text, images, video, and audio files.

Note: this is available in open-source Marqo and is coming to Marqo Cloud very soon!

Why This Matters

In an increasingly multimedia-centric world, the ability to search across various content types is crucial. Whether you're managing a vast library of educational videos, a repository of podcasts, or a collection of marketing materials, Marqo's new capabilities ensure that you can find exactly what you need, when you need it.

Use Cases

There are endless possibilities with audio and video search, but below are some of the most common use cases.

  • Music and Video Discovery: Search for songs or videos using audio clips, lyrics, or visual content, aiding in content discovery and identification.
  • Meeting and Lecture Search: Search within recorded meetings, lectures, or podcasts to find specific topics, discussions, or key moments.
  • Surveillance and Security: Identify events or objects in video surveillance footage or detect specific sounds in audio recordings.
  • Customer Support and Training: Search across support call recordings or training videos to find relevant sections for faster resolution or learning.
  • Ecommerce and Social Media: Search within product reviews or user-generated content for specific products, features, or emotional tones.
Getting Started on Marqo Cloud

    This section will show you how to get set up on Marqo Cloud in six simple steps.

    1. Set Up

    First, we need to install Marqo with pip:

    
    pip install marqo
    

    2. Initialize Marqo Client

    Next, we will need to initialize the Marqo Client. This will allow us to create and add documents to an index. To obtain your API Key, see this article.

    
    import marqo
    
    api_key = "put_your_api_key_here"
    
    mq = marqo.Client(url='https://api.marqo.ai', api_key=api_key)
    

    3. Create Marqo Index

    Let’s set up the configuration for our index. Here, we specify the LanguageBind/Video_V1.5_FT_Audio_FT_Image model, which allows us to create an index that can handle video, audio, image, and text files. We also specify marqo.GPU as the inference type and configure other basic settings.

    
    # Define settings for the index
    settings = {
        "type": "unstructured",  # Unstructured data allows flexible input types
        "vectorNumericType": "float",  # Use floating-point numbers for vector embeddings
        "model": "LanguageBind/Video_V1.5_FT_Audio_FT_Image",  # Model to handle text, audio, video, and images
        "normalizeEmbeddings": True,  # Normalize embeddings to ensure comparability
        "treatUrlsAndPointersAsMedia": True,  # Treat URLs as media files
        "treatUrlsAndPointersAsImages": True,  # Specifically treat certain URLs as images
        "audioPreprocessing": {"splitLength": 10, "splitOverlap": 5},  # Split audio into 10-second chunks with 5-second overlap
        "videoPreprocessing": {"splitLength": 20, "splitOverlap": 5},  # Split video into 20-second chunks with 5-second overlap
        "inferenceType": "marqo.GPU",  # Specify inference type
    }
    
    # Create a new index with the specified settings
    mq.create_index("audio-and-video-search", settings_dict=settings)
    

    4. Add Documents to Your Index

    We now add our audio, video, and image documents to our index. We will add one of each type. The video file is a video of our co-founder, Jesse Clark, giving a presentation at Google HQ. The audio file is blues music. The image file is our fashion-CLIP model logo: a hippo with a hat. These URLs are public, so feel free to inspect them for yourself.

    
    mq.index("audio-and-video-search").add_documents(
        documents=[
            # Add an audio file (blues music)
            {"audio_field": "https://marqo-tutorial-public.s3.us-west-2.amazonaws.com/example-audio.mp3", "_id": "id1"},
            # Add a video file (public speaking)
            {"video_field": "https://marqo-tutorial-public.s3.us-west-2.amazonaws.com/example-video.mp4", "_id": "id2"},
            # Add an image (Marqo logo which is a hippo)
            {"image_field": "https://marqo-tutorial-public.s3.us-west-2.amazonaws.com/example-image.png", "_id": "id3"},
            # Add more documents here if needed
        ],
        tensor_fields=['audio_field', 'video_field', 'image_field']  # Specify which fields should be embedded
    )
    

    5. Search with Marqo

    Let's search over this index. We will use the query 'public speaking', as this description matches our video file well.

    
    # Search the index for a query related to public speaking
    res = mq.index("audio-and-video-search").search("public speaking")
    print(res['hits'][0])  # Print the top hit (should relate to the video of public speaking)
    

    After performing our search, we obtain the output:

    
    {'_id': 'id2', 'video_field': 'https://marqo-tutorial-public.s3.us-west-2.amazonaws.com/example-video.mp4', '_highlights': [{'video_field': '[0.8858670000000011, 20.885867]'}], '_score': 0.5409741804365457}
    

    From this, we can see that the video file is our top result. The _highlights field reports the start and end timestamps (in seconds) of the best-matching chunk within the video, and _score indicates how closely it matches the query.
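
    A minimal sketch for pulling those timestamps out of the response (this assumes, as in the output above, that the highlight value is a JSON-encoded [start, end] pair):

    import json

    top_hit = res['hits'][0]
    # e.g. '[0.8858670000000011, 20.885867]'
    highlight = top_hit['_highlights'][0]['video_field']
    start, end = json.loads(highlight)
    print(f"Best-matching chunk: {start:.1f}s to {end:.1f}s")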

    We encourage you to add more video, audio, image, and text documents to your index and experiment with different queries.
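
    Because the index treats URLs and pointers as media, a query can itself be a URL. As a hedged sketch, reusing the image URL we indexed above (so the top hit should be id3):

    # Query with a media URL instead of text; Marqo embeds the file
    # behind the URL and returns the nearest documents
    res = mq.index("audio-and-video-search").search(
        "https://marqo-tutorial-public.s3.us-west-2.amazonaws.com/example-image.png"
    )
    print(res['hits'][0]['_id'])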

    6. Clean Up

    If you follow the steps in this guide, you will create an index with GPU inference and a basic storage shard. When you are done with the index, you can delete it with the following code:

    
    mq.delete_index("audio-and-video-search")
    

    If you do not delete your index you will continue to be charged for it.

    Getting Started Locally

    You've seen how to get set up on Marqo Cloud. This section will explain how to run Marqo locally through Docker.

    1. Start Marqo

    Marqo requires Docker. If you haven’t already, install Docker. Then, open a terminal and run the following to start Marqo:

    
    docker rm -f marqo
    docker pull marqoai/marqo:latest
    docker run --name marqo -it -p 8882:8882 \
    -e MARQO_MODELS_TO_PRELOAD="[]" \
    -e MARQO_MAX_CUDA_MODEL_MEMORY=16 \
    -e MARQO_MAX_CPU_MODEL_MEMORY=16 marqoai/marqo:latest
    

    Note that we configure a few environment variables here. MARQO_MODELS_TO_PRELOAD is set to [] so that no models are loaded automatically; we will load our audio/video model later, when we create the index. In addition, MARQO_MAX_CUDA_MODEL_MEMORY and MARQO_MAX_CPU_MODEL_MEMORY default to 4 GB, so we increase them to 16 GB to make room for the larger LanguageBind model.

    Now, we install marqo, the Python client for interacting with the Marqo server running in Docker:

    
    pip install marqo
    

    That’s it! Now we’re ready to create an index and begin searching over audio and video files.
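
    Before creating an index, you can optionally confirm that the server is reachable. A minimal sketch (this assumes the default endpoint http://localhost:8882 and that the requests package is installed):

    import requests

    # The root endpoint responds once the Marqo container is ready
    resp = requests.get("http://localhost:8882")
    print(resp.status_code)  # expect 200 when the server is up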

    2. Indexing Your Data with Marqo

    First, create a new Python script and input the following. Here, we import marqo and set up the client:

    
    import marqo
    
    # Set up the marqo client 
    mq = marqo.Client("http://localhost:8882")
    

    Next, we specify our settings for this index.

    
    settings = {
        "type": "unstructured",    # Type of index
        "vectorNumericType": "float",    # Numeric type for vector encoding
        "model": "LanguageBind/Video_V1.5_FT_Audio_FT_Image",    # The model to use to vectorise doc content
        "normalizeEmbeddings": True,    # Normalize the embeddings to have unit length
        "treatUrlsAndPointersAsMedia": True,    # Fetch images, videos and audio from pointers
        "treatUrlsAndPointersAsImages": True,   # Fetch image from pointers
        "audioPreprocessing": {"splitLength": 10, "splitOverlap": 3},   # The audio preprocessing object
        "videoPreprocessing": {"splitLength": 10, "splitOverlap": 3}    # The video preprocessing object
    }
    

    Here, we are using the LanguageBind/Video_V1.5_FT_Audio_FT_Image model. For more information on this model, see the model card here. For more information on the additional inputs featured here, visit our documentation.

    Now, we create our index and add documents to it. Below we’ve included an audio example, but you can also specify video or image files/URLs here too.

    
    # Create your marqo index
    resp = mq.create_index("my-index", settings_dict=settings)
    
    # Add documents to your index
    res = mq.index("my-index").add_documents(
        documents = [
          # Add an audio file of music
          {"audio_field": "https://dn720302.ca.archive.org/0/items/cocktail-jazz-coffee/01.%20Relaxing%20Jazz%20Coffee.mp3", "_id": "id1"},
          # Or add a video file of a movie
          # {"video_field": "https://ia800103.us.archive.org/27/items/electricsheep-flock-248-22500-1/00248%3D22801%3D20924%3D20930.mp4", "_id": "id2"},
          # Or add an image 
          # {"image_field": "https://raw.githubusercontent.com/marqo-ai/marqo-api-tests/mainline/assets/ai_hippo_realistic.png", "_id": "id3"},
            
          # Add more documents here
        ],
        tensor_fields=['audio_field']
    )
    

    3. Search!

    Now we can perform a text search over this data.

    
    # Search for jazz music
    res = mq.index("my-index").search("jazz music")
    
    # Print the top hit
    print(res['hits'][0])
    

    This returns:

    
    {'audio_field': 'https://dn720302.ca.archive.org/0/items/cocktail-jazz-coffee/01.%20Relaxing%20Jazz%20Coffee.mp3', '_id': 'id1', '_highlights': [{'audio_field': '[156.034286, 166.034286]'}], '_score': 0.5701236134891678}
    

    We see that the top hit for our query returns the jazz music audio file we added to our index.

    Audio and video search in Marqo really is as easy as that!

    Usage Recommendations

    We recommend the following when using audio and video search with Marqo:

    • Don't mix modalities within a document or between documents unless you are using a multimodal combination field.
    • Keep preprocessing split lengths below 20 seconds for both video and audio; the model was trained on 10-second audio chunks and 20-second video chunks (see the sketch below).
    • Search can be done with URLs for media of different modalities.
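
    As an illustration, here is a hedged settings fragment that follows these recommendations (the overlap values are only examples):

    # Split lengths aligned with the model's training chunk sizes
    recommended_preprocessing = {
        # The audio encoder was trained on 10-second chunks
        "audioPreprocessing": {"splitLength": 10, "splitOverlap": 3},
        # The video encoder was trained on 20-second chunks
        "videoPreprocessing": {"splitLength": 20, "splitOverlap": 5},
    }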

    Join the Marqo Community

    This major milestone wouldn’t have been possible without our incredible community offering suggestions, feedback, and ideas. Join our growing community to share your experiences, ask questions, and collaborate.

    Ellie Sleightholm
    Head of Software Developer Relations