Using Marqo and GPT3 for topical news summarisation

May 29, 2024
I wanted to build a fun search application within minutes to show the ease and power of Marqo. I decided to build a news summarisation application, i.e. one that answers questions like “What is happening in business today?” by synthesising an example news corpus (link).

The plan is to use Marqo’s search to provide useful context for a generation algorithm; here we use OpenAI’s GPT3 API (link). This is more formally called “retrieval-augmented generation”, and it helps with generation tasks that require specific knowledge the model has not seen during training, for example company-specific documents or news published after the model was trained. In outline, Marqo retrieves the relevant news and GPT3 writes the answer from it.
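The retrieve-then-generate flow can be sketched as one small function. This is only an illustration of the pattern, not the article's code: `answer_with_context` and its three callables are hypothetical names, standing in for a Marqo search, a prompt template, and a GPT3 call respectively.

```python
from typing import Callable, List


def answer_with_context(
    question: str,
    retrieve: Callable[[str], List[str]],
    build_prompt: Callable[[str, List[str]], str],
    generate: Callable[[str], str],
) -> str:
    """Retrieval-augmented generation in three steps."""
    # 1. Retrieve passages relevant to the question (Marqo's role here).
    passages = retrieve(question)
    # 2. Put the question and the retrieved passages into a single prompt.
    prompt = build_prompt(question, passages)
    # 3. Let the language model answer from that prompt (GPT3's role here).
    return generate(prompt)
```

Passing the three steps in as callables keeps the sketch independent of any particular search or generation library.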

We can see the problem when we ask GPT3 on its own, “What is happening in business today?” It does not know, and so it generates a generic response:

In fact, anyone following the financial markets knows that claims such as the “economy is slowly recovering” and “businesses are starting to invest again” are completely wrong!

To solve this, we first need to start our Marqo Docker container, which exposes the API that we’ll interact with from Python during this demo.

Next, let’s look at our example news corpus, which contains BBC and Reuters news content from the 8th and 9th of November 2022. Each document has an “_id” used as the Marqo document identifier, a “date” giving the day the article was written, a “website” indicating the web domain, a “Title” for the headline, and a “Description” for the article body:
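A minimal sketch of that document shape is shown here. The field names come from the description above; the ids, domains, headlines and bodies are placeholders made up for illustration and are not rows from the actual corpus. The `missing_fields` helper is likewise hypothetical.

```python
# Placeholder documents: only the field names are taken from the article.
documents = [
    {
        "_id": "article_1",
        "date": "2022-11-08",
        "website": "www.bbc.co.uk",
        "Title": "Placeholder headline one",
        "Description": "Placeholder article body one.",
    },
    {
        "_id": "article_2",
        "date": "2022-11-09",
        "website": "www.reuters.com",
        "Title": "Placeholder headline two",
        "Description": "Placeholder article body two.",
    },
]

REQUIRED_FIELDS = ("_id", "date", "website", "Title", "Description")


def missing_fields(doc):
    """Return the required field names that are absent from a document."""
    return [field for field in REQUIRED_FIELDS if field not in doc]
```

Checking documents with a helper like this before indexing makes a malformed record easy to spot.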

We then index our news documents; Marqo manages both the lexical index and the neural embeddings for us. By default, Marqo uses SBERT for neural text embedding and natively provides complete OpenSearch lexical and metadata functionality.
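Indexing might look roughly like the sketch below. It assumes the `marqo` Python client and a Marqo instance on the default local port; the index name `news-index` and the helper `index_news` are hypothetical. The `tensor_fields` argument is version dependent: recent clients expect it, while older releases may not accept it, so check it against the client version you have installed.

```python
def index_news(client, index_name, documents, text_fields=("Title", "Description")):
    """Create an index and add documents, embedding only the text fields."""
    # Assumption: the index does not exist yet, so creating it succeeds.
    client.create_index(index_name)
    # Fields listed in tensor_fields get neural embeddings; the remaining
    # fields (date, website) are stored as metadata that can be filtered on.
    return client.index(index_name).add_documents(
        documents, tensor_fields=list(text_fields)
    )


# Usage, with a running Marqo container and `pip install marqo`:
#   import marqo
#   mq = marqo.Client(url="http://localhost:8882")
#   index_news(mq, "news-index", documents)
```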

Now that we have indexed our news documents, we can simply use Marqo’s Python search API to return relevant context for our GPT3 generation. For the query “q”, we use the question itself, and we match news context on the “Title” and “Description” text. We also filter our documents for “today”, which was ‘2022-11-09’.
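As a hedged sketch, the search arguments described above can be assembled like this. `build_search_request`, the limit of 5 and the index name `news-index` are assumptions for illustration; `q`, `searchable_attributes`, `filter_string` and `limit` are parameters of the Marqo client's `search` method, though support for some of them can vary between Marqo versions and index types.

```python
def build_search_request(question, date, limit=5):
    """Collect the keyword arguments for a date-filtered Marqo search."""
    return {
        "q": question,                                      # the user's question
        "searchable_attributes": ["Title", "Description"],  # fields to match on
        "filter_string": f"date:{date}",                    # keep only that day's articles
        "limit": limit,                                     # number of hits to return
    }


request = build_search_request("What is happening in business today?", "2022-11-09")

# Usage, with a running Marqo container and `pip install marqo`:
#   import marqo
#   mq = marqo.Client(url="http://localhost:8882")
#   results = mq.index("news-index").search(**request)
```

Keeping the request as a plain dictionary makes it easy to log or tweak before sending it to Marqo.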

Next, we insert Marqo’s search results into the GPT3 prompt as context, and we try generating an answer again:
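One way to turn search hits into a prompt is sketched below. The prompt wording, the `build_prompt` helper and the `max_chars` truncation are assumptions, not the article's exact template; the hits here are placeholders shaped like the entries of the `"hits"` list in a Marqo search response.

```python
def build_prompt(question, hits, max_chars=500):
    """Format search hits as a context block followed by the question."""
    lines = []
    for hit in hits:
        title = hit.get("Title", "").strip()
        # Truncate long article bodies so the prompt stays within model limits.
        body = hit.get("Description", "").strip()[:max_chars]
        lines.append(f"- {title}: {body}")
    context = "\n".join(lines)
    return (
        "Answer the question using only the news context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Placeholder hits for illustration; in practice use results["hits"].
hits = [
    {"Title": "Headline A", "Description": "Body A."},
    {"Title": "Headline B", "Description": "Body B."},
]
prompt = build_prompt("What is happening in business today?", hits)
```

The resulting string is what gets sent to the GPT3 completion endpoint in place of the bare question.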

Success! You’ll notice that using Marqo to add relevant and temporally correct context means we can build a news summarisation application with ease. Instead of wrong and vague answers, we get factually grounded summaries based on retrieved facts such as:

  1. Marks and Spencer has warned of a “gathering storm” of higher costs for retailers
  2. Facebook-owner Meta is cutting 11,000 staff
  3. Tesla stock has hit a 2-year low after CEO Elon Musk sold $4 billion worth of shares

Full code: here (you’ll need a GPT3 API token)

Visit Marqo on Github:

Iain Mackie
Investor @ Creator Fund