Summarizing large documents with current-generation AI models can be challenging, since every model is limited in how much text, measured in tokens, it can process in a single request.
An AI bot will happily summarize any content that fits in its context window, but how do we get a cohesive summary of large documents that are many times the size of the bot’s context window?
In this blog post, I’ll introduce a document summarizer I wrote that uses Azure OpenAI and a sliding content window algorithm to generate rich summaries of large documents. The script is written in Python and supports the summarization of plain text files, PDFs, Word documents, and even websites.
Why use this tool?
I originally developed the document summarization tool to condense transcripts from recorded Microsoft Teams meetings. It has been invaluable for facilitating conversations with partner teams, since the summarized transcripts double as meeting notes.
I’ve used these transcripts and summaries to generate blog content, design documents, and user stories for various products.
When I realized the tool’s potential, I adapted it to handle not only meeting transcripts and plain text files but also PDFs, Word documents, and websites.
As a product manager with a software development background, I’m convinced that tools like this will become increasingly common as AI large-language models (LLMs) gain wider adoption.
Get the code
The full code for the document summarizer is available on my GitHub repo. Clone it locally using the following command:
git clone https://github.com/mario-guerra/AzureOpenAIDocSummarizer.git
Refer to the project README for installation and usage instructions. You’ll need access to an OpenAI GPT model, either from Azure or directly from OpenAI.
This project is currently set up to use Azure OpenAI models.
Sliding Content Window
The sliding content window is a method that enables the document summarizer to iteratively summarize large documents without exceeding the AI model’s token limits, while maintaining sufficient context between summarization steps to create a cohesive summary.
I derived this method from my background in embedded design and signal processing, where it’s common to analyze a signal’s time-varying spectral properties by breaking it down into smaller overlapping segments and applying the Fourier Transform to each segment.
I realized that a similar strategy can be applied to large documents, since the meaning of what you’re currently reading only depends on the limited context of the sentences that come immediately before and after.
When applied to document summarization, the sliding window approach processes the input document in text chunks sized according to a predefined summary level (verbose, concise, or terse). The sliding content window is made up of the most recent paragraphs from the previous summary, plus the current input text chunk.
The model is asked to summarize the new chunk and update the previous summary with the new information. This method ensures that the script can generate a cohesive summary of the entire document while keeping the context intact.
# Process the input text in chunks and generate the summary
with open(output_path, "a") as out_f:
    processed_chars = 0
    while True:
        print("Summarizing...")
        # Read a chunk of text from the input_text
        chunk = input_text[processed_chars:processed_chars + chunk_size]
        processed_chars += len(chunk)

        # Break the loop if there's no more text to process
        if not chunk:
            break

        # Combine previous summary paragraphs and the current chunk
        input_text_chunk = ("[PREVIOUS_SUMMARY]\n\n" +
                            "\n\n".join(previous_summary_paragraphs) +
                            "\n\n" + "[CURRENT_CHUNK]\n\n" + chunk)

        # Process the text chunk and generate a summary
        summary_ctx = await process_text(input_text_chunk, summary_level)
        summary = str(summary_ctx)

        # Update the previous summary paragraphs based on the new summary.
        # If the summary has more than max_context_paragraphs, remove the first
        # paragraph until the summary is within the limit. As paragraphs are
        # removed, they are written to the output file.
        if summary:
            summary_paragraphs = extract_summary_paragraphs(summary)
            while len(summary_paragraphs) > max_context_paragraphs:
                out_f.write(summary_paragraphs.pop(0) + "\n\n")
                out_f.flush()
            previous_summary_paragraphs = summary_paragraphs
            print("\nSummary window: \n", summary)
        else:
            print("No summary generated for the current chunk.")

        # Calculate and display the progress of the summarization
        progress = (processed_chars / total_chars) * 100
        print(f"\nProgress: {processed_chars}/{total_chars} ({progress:.2f}%)")
If you read through the code, you may have noticed that I’m saving summaries as paragraphs in a list. As the list grows, I pop the oldest paragraphs off the front until the number of paragraphs fits within my predefined limit of `max_context_paragraphs`.
This is how I keep the sliding content window constrained to only what’s current as the script iterates through a document. When there’s no more content to summarize, the script is done.
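The `extract_summary_paragraphs` helper referenced in the loop above isn’t shown in the snippet. A minimal version, assuming the model separates paragraphs with blank lines, could look like this sketch (the repo’s actual helper may normalize whitespace differently):

def extract_summary_paragraphs(summary):
    # Split the summary on blank lines and drop empty entries.
    return [p.strip() for p in summary.split("\n\n") if p.strip()]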
Communicating with Azure OpenAI
The document summarizer uses Microsoft’s Semantic Kernel project to communicate with Azure OpenAI services.
Semantic Kernel is an open-source SDK that facilitates the integration of AI services, such as OpenAI, Azure OpenAI, and Hugging Face, with conventional programming languages like Python and C#, with more language support on the way.
In the document summarizer script, the Semantic Kernel library’s `AzureChatCompletion` connector handles communication with Azure OpenAI, letting me send summarization requests to a GPT-4 deployment.
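For illustration, wiring up the connector looks roughly like the sketch below. The Semantic Kernel Python API has changed across releases, so treat this as an approximation of the 2023-era SDK rather than the repo’s exact code:

import semantic_kernel as sk
from semantic_kernel.connectors.ai.open_ai import AzureChatCompletion

# Build a kernel and register an Azure OpenAI chat service with it.
# The deployment name, API key, and endpoint come from the .env file
# described in the usage instructions below.
kernel = sk.Kernel()
deployment, api_key, endpoint = sk.azure_openai_settings_from_dot_env()
kernel.add_chat_service(
    "chat_completion",
    AzureChatCompletion(deployment, endpoint, api_key),
)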
Supported File Types and Input Sources
The document summarizer is designed to handle various file types and input sources, including:
- Text files
- PDFs
- Word documents
- Websites
The script automatically detects the input file type and extracts the text accordingly using the following functions:
def extract_text_from_pdf(pdf_path):
    # ...

def extract_text_from_word(doc_path):
    # ...

def extract_text_from_url(url):
    # ...
These methods rely on functionality provided by various Python libraries, namely PyPDF2, python-docx, and BeautifulSoup.
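As an example, here’s a minimal sketch of what the URL extractor can look like with requests and BeautifulSoup; the repo’s implementation may differ in its details:

import requests
from bs4 import BeautifulSoup

def extract_text_from_url(url):
    # Fetch the page and fail loudly on HTTP errors.
    response = requests.get(url)
    response.raise_for_status()
    # Parse the HTML and collapse it to visible text, one line per element.
    soup = BeautifulSoup(response.text, "html.parser")
    return soup.get_text(separator="\n", strip=True)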
Customizable Summary Levels
The document summarizer offers three summary levels to cater to different needs:
- Verbose: Detailed summary with a focus on key details and new information
- Concise: Balanced summary highlighting key details and technical content (default)
- Terse: Brief summary for executive action, focusing on key details and technical content
These summary levels can be easily adjusted using the optional `--summary-level` command-line argument.
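For illustration, a minimal argument parser supporting this flag might look like the following; the option names mirror the usage shown later in this post, but the repo’s actual parser may differ:

import argparse

parser = argparse.ArgumentParser(description="Summarize large documents.")
parser.add_argument("input_path", help="Path to a text/PDF/Word file, or a URL")
parser.add_argument("output_path", help="Path to write the summary to")
parser.add_argument(
    "--summary-level",
    choices=["verbose", "concise", "terse"],
    default="concise",
    help="Controls chunk size, response token budget, and prompt",
)
args = parser.parse_args()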
Summary levels are managed through a combination of prompting, input chunk size, and token limits. The script defines three dictionaries that help manage the verbosity of the chat model output for the document summarizer. These dictionaries are:
- `summary_levels`: Defines the input text chunk sizes. The chunk size determines how many characters to process from the input text during a single request to the chat model. It has three keys, “verbose”, “concise”, and “terse”, each mapped to a different chunk size.
- `request_token_sizes`: Defines the request token sizes, which cap the number of tokens the model may use for its response. Like the previous dictionary, it has the keys “verbose”, “concise”, and “terse”, each mapped to a different token size.
- `summary_prompts`: Contains the summary prompt for each level. These prompts guide the chat model toward the desired verbosity. The keys are the same as in the previous dictionaries.
Input chunk size and request token size share an inverse relationship. The request token size establishes an upper limit on the number of tokens the model can use to generate a response.
By decreasing the input chunk size and increasing the request token size, we give the model the flexibility to be more verbose while summarizing a smaller portion of text.
This approach enables the model to incorporate additional detail in the summary, while still maintaining a reasonable summary length.
Conversely, by increasing the input chunk size and reducing the request token size, we constrain the model to summarize a larger amount of text within a smaller response window, resulting in a more concise summary.
# Dictionary defining chunk sizes, which influence the verbosity of the chat model
# output. The smaller the chunk size, the more verbose the output, since the model
# summarizes less text per request. The chunk size is used to determine the number
# of characters to process from the input text during a single request to the
# chat model.
summary_levels = {
    "verbose": 5000,
    "concise": 10000,
    "terse": 20000,
}

# Dictionary defining request token sizes, which influence the verbosity of the chat
# model output. The larger the request token size, the more verbose the output. The
# request token size is used to determine the number of tokens to request from the
# chat model during a single request.
request_token_sizes = {
    "verbose": 3000,
    "concise": 2000,
    "terse": 1000,
}

summary_prompts = {
    "verbose": """Summarize verbosely, emphasizing key details and incorporating new information from [CURRENT_CHUNK] into [PREVIOUS_SUMMARY]. Retain the first two paragraphs of [PREVIOUS_SUMMARY]. Remove labels, maintain paragraph breaks for readability, and avoid phrases like 'in conclusion' or 'in summary'.""",
    "concise": """Summarize concisely, highlighting key details, and update with new info. Ignore irrelevant content, include all technical content. Use [PREVIOUS_SUMMARY] and [CURRENT_CHUNK]. Keep first two paragraphs in [PREVIOUS_SUMMARY] as-is. Exclude these labels from summary. Ensure readability using paragraph breaks, and avoid phrases like 'in conclusion' or 'in summary'.""",
    "terse": """Summarize tersely for executive action using [PREVIOUS_SUMMARY] and [CURRENT_CHUNK], focusing on key details and technical content. Retain the first two paragraphs of [PREVIOUS_SUMMARY], remove labels, and maintain paragraph breaks for readability. Avoid phrases like 'in conclusion' or 'in summary'.""",
}
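Given a chosen summary level, the script can then pull the matching settings out of these dictionaries, roughly like this (variable names here are illustrative, not necessarily the repo’s):

# Look up the settings for the chosen level.
summary_level = "concise"
chunk_size = summary_levels[summary_level]
request_token_size = request_token_sizes[summary_level]
summary_prompt = summary_prompts[summary_level]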
Usage Instructions
To use the document summarizer, follow these steps:
- Clone the repository to your local machine.
- Install the required dependencies using pip.
- Rename ‘.env.example’ to ‘.env’ and add your Azure OpenAI deployment name, API key, and endpoint.
- Run the script with the following command:
python summarizer.py <input_path> <output_path> [--summary-level <summary_level>]
Replace <input_path> with the path to the input file or URL, <output_path> with the path to the output summary file, and <summary_level> (optional) with one of the following options: “verbose”, “concise”, or “terse”.
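For example, a hypothetical run that produces a terse summary of a PDF would look like this:

python summarizer.py report.pdf summary.txt --summary-level terse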
Conclusion
The document summarizer with Azure OpenAI demonstrates the practical application of the sliding content window approach to generate cohesive summaries of large documents.
With support for multiple file types and input sources, customizable summary levels, and integration with Azure OpenAI, this tool serves as a valuable resource for summarizing large documents.
As AI large-language models continue to evolve, we can expect ongoing developments and improvements in this area, making document summarization increasingly efficient and accessible for a wide range of use cases.