Announcement

Marqo Now Supports Video and Audio Search

September 18, 2024
5 mins read
We are thrilled to announce a significant enhancement to Marqo's search capabilities: Marqo now supports video and audio search! With the integration of LanguageBind models, you can index, embed, and search video and audio files seamlessly. This advancement extends Marqo's functionality beyond text and images, opening up a new world of possibilities for multimodal search.

Multimodal Search Capabilities

Index and Embed Video and Audio Files

Marqo's new release allows you to:

  • Index: Add video and audio files to your search index with ease.
  • Embed: Generate rich embeddings for multimedia content using advanced LanguageBind models.
  • Search: Perform comprehensive searches across text, images, video, and audio files.

Note that this is available in open-source Marqo and will be coming to Marqo Cloud very soon!

Why This Matters

In an increasingly multimedia-centric world, the ability to search across many content types is crucial. Whether you're managing a vast library of educational videos, a repository of podcasts, or a collection of marketing materials, Marqo's new capabilities ensure that you can find exactly what you need, when you need it.

Use Cases

There are endless possibilities with audio and video search, but below are some of the most common use cases.

  • Music and Video Discovery: Search for songs or videos using audio clips, lyrics, or visual content, aiding in content discovery and identification.
  • Meeting and Lecture Search: Search within recorded meetings, lectures, or podcasts to find specific topics, discussions, or key moments.
  • Surveillance and Security: Identify events or objects in video surveillance footage or detect specific sounds in audio recordings.
  • Customer Support and Training: Search across support call recordings or training videos to find relevant sections for faster resolution or learning.
  • Ecommerce and Social Media: Search within product reviews or user-generated content for specific products, features, or emotional tones.
Getting Started

    1. Start Marqo

    Marqo requires Docker. If you haven’t already, install Docker. Then open your terminal and run the following to start Marqo:

    
    # Remove any existing Marqo container and pull the latest image
    docker rm -f marqo
    docker pull marqoai/marqo:latest
    # Run Marqo on port 8882 with no preloaded models and raised model memory limits
    docker run --name marqo -it -p 8882:8882 \
    -e MARQO_MODELS_TO_PRELOAD="[]" \
    -e MARQO_MAX_CUDA_MODEL_MEMORY=16 \
    -e MARQO_MAX_CPU_MODEL_MEMORY=16 marqoai/marqo:latest
    

    Note that we configure a few environment variables here. MARQO_MODELS_TO_PRELOAD is set to [] so that no models are loaded automatically at startup; we will load our audio/video model when we create the index. In addition, MARQO_MAX_CUDA_MODEL_MEMORY and MARQO_MAX_CPU_MODEL_MEMORY default to 4, so we increase them to 16 to allow for larger models such as LanguageBind.

    Now, we install marqo, the Python client for interacting with the Marqo server running in Docker:

    
    pip install marqo
    

    That’s it! Now we’re ready to create an index and begin searching over audio and video files.
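
    If you’d like to sanity-check the setup first, you can ask the server which indexes it currently has. Here is a minimal, optional check using the marqo client’s get_indexes() method; on a fresh server the list should be empty:

    import marqo
    # Connect to the Marqo server running in Docker
    mq = marqo.Client("http://localhost:8882")
    # List existing indexes; a fresh server should report none
    print(mq.get_indexes())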

    2. Indexing Your Data with Marqo

    First, create a new Python script and input the following. Here, we import marqo and set up the client:

    
    import marqo
    
    # Set up the marqo client 
    mq = marqo.Client("http://localhost:8882")
    

    Next, we specify our settings for this index.

    
    settings = {
        "type": "unstructured",    # Type of index
        "vectorNumericType": "float",    # Numeric type for vector encoding
        "model": "LanguageBind/Video_V1.5_FT_Audio_FT_Image",    # The model to use to vectorise doc content
        "normalizeEmbeddings": True,    # Normalize the embeddings to have unit length
        "treatUrlsAndPointersAsMedia": True,    # Fetch images, videos and audio from pointers
        "treatUrlsAndPointersAsImages": True,   # Fetch image from pointers
        "audioPreprocessing": {"splitLength": 10, "splitOverlap": 3},   # The audio preprocessing object
        "videoPreprocessing": {"splitLength": 10, "splitOverlap": 3}    # The video preprocessing object
    }
    

    Here, we are using the LanguageBind/Video_V1.5_FT_Audio_FT_Image model. For more information on this model, see its model card; for more information on the other index settings shown here, visit our documentation.
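
    To see what the audio and video preprocessing settings do, here is a small illustrative sketch (plain Python, not Marqo internals) of how 10-second chunks with a 3-second overlap tile a 30-second clip, assuming each chunk starts splitLength - splitOverlap seconds after the previous one:

    # Illustrative only: tiling a 30-second clip with splitLength=10, splitOverlap=3
    # Each chunk starts splitLength - splitOverlap = 7 seconds after the previous one
    split_length, split_overlap, clip_length = 10, 3, 30
    start, chunks = 0, []
    while start < clip_length:
        chunks.append((start, min(start + split_length, clip_length)))
        start += split_length - split_overlap
    print(chunks)
    # [(0, 10), (7, 17), (14, 24), (21, 30), (28, 30)]

    Each chunk is embedded separately, which is why the search highlights we’ll see shortly come back as short time spans rather than whole files.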

    Now, we create our index and add documents to it. Below we’ve included an audio example, but you can also add video or image files/URLs, as shown in the commented-out lines.

    
    # Create your marqo index
    resp = mq.create_index("my-index", settings_dict=settings)
    
    # Add documents to your index
    res = mq.index("my-index").add_documents(
        documents = [
          # Add an audio file of music
          {"audio_field": "https://dn720302.ca.archive.org/0/items/cocktail-jazz-coffee/01.%20Relaxing%20Jazz%20Coffee.mp3", "_id": "id1"},
          # Or add a video file of a movie
          # {"video_field": "https://ia800103.us.archive.org/27/items/electricsheep-flock-248-22500-1/00248%3D22801%3D20924%3D20930.mp4", "_id": "id2"},
          # Or add an image 
          # {"image_field": "https://raw.githubusercontent.com/marqo-ai/marqo-api-tests/mainline/assets/ai_hippo_realistic.png", "_id": "id3"},
            
          # Add more documents here
        ],
        tensor_fields=['audio_field']
    )
    
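    Once add_documents returns, you can optionally check what was indexed. For instance, the client’s get_stats() method reports document and vector counts; because each file is split into overlapping chunks, a single audio document will typically produce several vectors:

    # One document can yield several vectors, since each
    # audio/video chunk is embedded separately
    print(mq.index("my-index").get_stats())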

    3. Search!

    Now we can perform a text search over this data.

    
    # Search for jazz music
    res = mq.index("my-index").search("jazz music")
    
    # Print the top hit
    print(res['hits'][0])
    

    This returns:

    
    {'audio_field': 'https://dn720302.ca.archive.org/0/items/cocktail-jazz-coffee/01.%20Relaxing%20Jazz%20Coffee.mp3', '_id': 'id1', '_highlights': [{'audio_field': '[156.034286, 166.034286]'}], '_score': 0.5701236134891678}
    

    We see that the top hit for our query is the jazz music audio file we added to our index. The _highlights field shows the time span, in seconds, of the 10-second chunk that best matched the query (here, roughly 156s to 166s).

    Audio and video search in Marqo really is as easy as that!
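
    If you were just experimenting, you can tidy up by deleting the index afterwards; delete() is part of the marqo client:

    # Remove the example index when you are done experimenting
    mq.index("my-index").delete()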

    Usage Recommendations

    We recommend the following when using audio and video search with Marqo:

    • Don't mix modalities within a single document or across documents unless you are using a multimodal combination field.
    • Keep splitLength below 20 seconds for both video and audio; the underlying models were trained on 10-second audio clips and 20-second video clips.
    • Searches can be performed with URLs pointing to media of different modalities (see the sketch below).
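
    As an example of the last point, here is a minimal sketch of querying the index with a media URL instead of text, reusing the audio URL we indexed earlier (any reachable audio or video URL should work, subject to your index settings):

    # Query with a media URL; Marqo downloads and embeds it with the
    # same LanguageBind model used at indexing time
    res = mq.index("my-index").search(
        "https://dn720302.ca.archive.org/0/items/cocktail-jazz-coffee/01.%20Relaxing%20Jazz%20Coffee.mp3"
    )
    print(res['hits'][0]['_id'])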

    Join the Marqo Community

    This major milestone wouldn’t have been possible without our incredible community offering suggestions, feedback and ideas. Join our growing community to share your experiences, ask questions, and collaborate.

    Ellie Sleightholm
    Head of Software Developer Relations