What documents does this notebook analyze?

The notebook analyzes 10-K PDF filings — lengthy annual reports that public companies file with the SEC. The example uses 2021 10-K documents from Uber and Lyft, each over 100 pages long.

What are the two main query patterns shown in this notebook?

Simple QA uses a QueryEngine with VectorStoreIndex for straightforward questions against a single document, while the SubQuestionQueryEngine breaks complex compare-and-contrast queries into simpler sub-questions routed to separate company indexes, then synthesizes the answers together.

What do I need to set up before using this notebook?

You need to install the `llama-index` library, configure an OpenAI API key, and set `gpt-3.5-turbo-instruct` as your LLM via a global ServiceContext. The OpenAI API is used for both embedding computation and query answering.

When should I use the SubQuestionQueryEngine instead of the simple QueryEngine?

Use SubQuestionQueryEngine when your query requires synthesizing information across multiple documents, such as comparing financial metrics between two companies, since it automatically decomposes compound questions into single-document sub-questions.

Prompt Chain

Analyze Financial Documents with LlamaIndex

Name: Financial Document Analysis with LlamaIndex
Availability: OnlineOnly
Author: OpenAI Cookbook

Analyze and compare 10-K financial filings across companies using LlamaIndex's VectorStoreIndex and SubQuestionQueryEngine.

Works with openai, llamaindex

Updated last month

Version 1.0.0

Models

gpt-3.5, gpt-4o, llama-3

Why it matters

Automate the extraction and synthesis of insights from lengthy financial documents like 10-K forms, enabling faster and more informed financial analysis.

Outcomes

What it gets done

Load and index financial documents (e.g., 10-K forms) using LlamaIndex.

Perform simple question-answering over indexed financial data.

Conduct advanced compare-and-contrast analysis across multiple financial documents.

Leverage RAG systems for efficient information retrieval and insight generation.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/oai-financialdocumentanalysiswithllamaindex | bash

Steps

Steps in the chain

Setup - Install and Import

Configure LLM Provider

Data Loading and Indexing

Build VectorStoreIndex

Configure QueryEngine

Simple QA Queries

Advanced QA - Compare and Contrast

Overview

Financial Document Analysis with LlamaIndex

Indexes and analyzes 10-K financial filings with LlamaIndex, supporting single-document QA and cross-document compare-and-contrast queries. Use when analysts need quick QA over long financial filings or need to compare metrics across multiple companies' documents.

What it does

This notebook shows how to perform financial analysis over 10-K documents - the lengthy, jargon-heavy annual reports companies file with the SEC - using LlamaIndex, a data framework for building RAG systems in just a few lines of code. After installing llama-index and configuring gpt-3.5-turbo-instruct as the LLM via a global ServiceContext, two 2021 10-K PDFs (Uber and Lyft) are loaded and parsed into per-page Document objects; each filing is over 100 pages, so loading takes a while. The documents are then indexed into an in-memory VectorStoreIndex, which calls the OpenAI API to compute embeddings over document chunks. Simple QA runs through a QueryEngine whose similarity_top_k setting controls how many retrieved chunks (Node objects) are used as context per answer. For more complex, cross-document analysis, such as comparing and contrasting Uber's and Lyft's financials, a SubQuestionQueryEngine breaks a compound query into simpler sub-questions, routes each to the respective company's index, and synthesizes the sub-answers into a combined response.

The motivating problem is that extracting information and synthesizing insight from long financial documents is a core part of a financial analyst's job, and a 10-K - the annual report the SEC requires public companies to file, giving a comprehensive summary of financial performance - typically runs hundreds of pages with domain-specific terminology that's hard for a layperson to digest quickly. LlamaIndex is introduced as a general data framework for LLM applications: a RAG system can be stood up in a few lines of code, and for more advanced use it also offers a broader toolkit for data ingestion and indexing, retrieval and re-ranking modules, and composable components for building custom query engines beyond what this specific notebook demonstrates. The notebook itself follows a fixed outline - Introduction, Setup, Data Loading & Indexing, Simple QA, and Advanced QA (Compare and Contrast) - walking from installing the library and configuring the LLM through to the two query patterns.

When to use - and when NOT to

Use the simple QueryEngine + VectorStoreIndex pattern for straightforward Q&A against a single long financial document. Use the SubQuestionQueryEngine specifically when a query requires synthesizing information across multiple documents (e.g. comparing two companies' financial metrics), since it decomposes the compound question into single-document sub-questions automatically rather than requiring you to manually split the query yourself.

Inputs and outputs

Input: 10-K PDF filings (or similar long financial documents). Output: direct answers to single-document questions via simple QA, or synthesized compare-and-contrast answers across multiple companies' filings via the sub-question query engine.

Integrations

Requires the llama-index library and an OpenAI API key/model (gpt-3.5-turbo-instruct in this example) for both embedding computation and query answering.

Who it's for

Financial analysts and developers who need to quickly extract information and synthesize insights from long, jargon-dense financial filings - including across multiple companies - without building a custom RAG pipeline from scratch.

Source README

Financial Document Analysis with LlamaIndex

In this example notebook, we showcase how to perform financial analysis over 10-K documents using the LlamaIndex framework in just a few lines of code.

Introduction

LlamaIndex

LlamaIndex is a data framework for LLM applications.
You can get started with just a few lines of code and build a retrieval-augmented generation (RAG) system in minutes.
For more advanced users, LlamaIndex offers a rich toolkit for ingesting and indexing your data, modules for retrieval and re-ranking, and composable components for building custom query engines.

See full documentation for more details.

Financial Analysis over 10-K documents

A key part of a financial analyst's job is to extract information and synthesize insight from long financial documents.
A great example is the 10-K form, an annual report required by the U.S. Securities and Exchange Commission (SEC) that gives a comprehensive summary of a company's financial performance.
These documents typically run hundreds of pages in length and contain domain-specific terminology that makes them challenging for a layperson to digest quickly.

We showcase how LlamaIndex can support a financial analyst in quickly extracting information and synthesizing insights across multiple documents with very little coding.

Setup

To begin, we need to install the llama-index library.

Now, we import all modules used in this tutorial.

Before we start, we can configure the LLM provider and model that will power our RAG system.
Here, we pick gpt-3.5-turbo-instruct from OpenAI.

We construct a ServiceContext and set it as the global default, so all subsequent operations that depend on LLM calls will use the model we configured here.
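The configuration step described above might look roughly like the sketch below. It assumes a legacy (pre-0.10) llama-index release, where ServiceContext and set_global_service_context are importable from the top-level package, and it uses the llama_index.llms OpenAI wrapper as an assumption; the original notebook may construct its LLM object differently. The helper name configure_global_llm is hypothetical.

```python
MODEL_NAME = "gpt-3.5-turbo-instruct"  # the model named in this notebook


def configure_global_llm(model_name: str = MODEL_NAME):
    """Create a ServiceContext for the given model and make it the global default."""
    # Imports are deferred so this file loads even without the library installed.
    # Assumption: legacy (pre-0.10) llama-index import paths.
    from llama_index import ServiceContext, set_global_service_context
    from llama_index.llms import OpenAI

    # The OpenAI client reads the API key from the OPENAI_API_KEY environment variable.
    llm = OpenAI(model=model_name, temperature=0)
    service_context = ServiceContext.from_defaults(llm=llm)

    # After this call, indexes and query engines built later pick up this LLM by default.
    set_global_service_context(service_context)
    return service_context
```

Calling configure_global_llm() once, before loading or indexing anything, keeps the model choice in one place instead of passing it to every later call.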

Data Loading and Indexing

Now, we load and parse two PDFs (one for the Uber 10-K in 2021 and another for the Lyft 10-K in 2021).
Under the hood, the PDFs are converted to plain-text Document objects, separated by page.

Note: this operation might take a while to run, since each document is more than 100 pages.
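As a minimal sketch of the loading step, assuming a legacy (pre-0.10) llama-index release with pypdf installed for PDF parsing: the file paths below are hypothetical placeholders, and load_filing is an illustrative helper name rather than anything defined by the notebook.

```python
from pathlib import Path

# Hypothetical locations; point these at wherever the two 10-K PDFs are saved.
PDF_PATHS = {
    "lyft": Path("data/10k/lyft_2021.pdf"),
    "uber": Path("data/10k/uber_2021.pdf"),
}


def load_filing(pdf_path):
    """Parse one PDF into a list of Document objects, one per page."""
    # Deferred import; assumption: legacy (pre-0.10) llama-index import path.
    from llama_index import SimpleDirectoryReader

    reader = SimpleDirectoryReader(input_files=[str(pdf_path)])
    return reader.load_data()
```

Keeping the two filings in separate document lists matters later, because each company gets its own index.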

Now, we can build an (in-memory) VectorStoreIndex over the documents that we've loaded.

Note: this operation might take a while to run, since it calls the OpenAI API to compute vector embeddings over document chunks.
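One way to sketch the indexing step is a small helper that builds one index per company. The helper name build_indices and its index_factory parameter are hypothetical, added only so the logic can be exercised without calling the OpenAI API; by default it falls back to VectorStoreIndex.from_documents, assuming a legacy (pre-0.10) llama-index import path.

```python
def build_indices(docs_by_company, index_factory=None):
    """Build one index per company and return them keyed by company name."""
    if index_factory is None:
        # Deferred import; assumption: legacy (pre-0.10) llama-index import path.
        from llama_index import VectorStoreIndex

        # from_documents chunks the pages and embeds each chunk via the OpenAI API.
        index_factory = VectorStoreIndex.from_documents
    return {name: index_factory(docs) for name, docs in docs_by_company.items()}
```

For example, build_indices({"lyft": lyft_docs, "uber": uber_docs}) would return a dictionary with a separate in-memory index for each filing, where lyft_docs and uber_docs stand for the loaded page lists.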

Simple QA

Now we are ready to run some queries against our indices!
To do so, we first configure a QueryEngine, which just captures a set of configurations for how we want to query the underlying index.

For a VectorStoreIndex, the most common configuration to adjust is similarity_top_k which controls how many document chunks (which we call Node objects) are retrieved to use as context for answering our question.
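To make the effect of similarity_top_k concrete, here is a toy illustration in plain Python. It is not LlamaIndex's implementation: the two-dimensional vectors and chunk names are made up, and real embeddings come from the OpenAI API and have far more dimensions. It only shows the idea of ranking chunks by similarity to the query and keeping the top k.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunk_vecs, k):
    """Return the ids of the k chunks most similar to the query, best first."""
    ranked = sorted(
        chunk_vecs,
        key=lambda chunk_id: cosine_similarity(query_vec, chunk_vecs[chunk_id]),
        reverse=True,
    )
    return ranked[:k]


# Made-up 2-D "embeddings" standing in for chunks of a filing.
chunks = {
    "revenue": [1.0, 0.1],
    "risk_factors": [0.1, 1.0],
    "revenue_growth": [0.9, 0.3],
}
query = [1.0, 0.0]

print(top_k_chunks(query, chunks, k=2))  # ['revenue', 'revenue_growth']
```

In LlamaIndex itself, the setting is passed when creating the engine, for example index.as_query_engine(similarity_top_k=3). A larger value gives the LLM more context per answer at the cost of longer prompts.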

Let's see some queries in action!

Advanced QA - Compare and Contrast

For more complex financial analysis, one often needs to reference multiple documents.

As an example, let's take a look at how to do compare-and-contrast queries over both Lyft and Uber financials.
For this, we build a SubQuestionQueryEngine, which breaks down a complex compare-and-contrast query into simpler sub-questions and executes each on the respective sub-query engine backed by an individual index.
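A sketch of how such an engine might be assembled, assuming a legacy (pre-0.10) llama-index release and two per-company query engines that already exist. The tool names, descriptions, and the helper name build_compare_engine are illustrative choices, not taken from the text above.

```python
# Illustrative names and descriptions; the descriptions are what the LLM reads
# when deciding which filing a generated sub-question should be sent to.
TOOL_DESCRIPTIONS = {
    "lyft_10k": "Provides information about Lyft financials for year 2021",
    "uber_10k": "Provides information about Uber financials for year 2021",
}


def build_compare_engine(lyft_engine, uber_engine):
    """Wrap two per-company query engines in a single SubQuestionQueryEngine."""
    # Deferred imports; assumption: legacy (pre-0.10) llama-index import paths.
    from llama_index.query_engine import SubQuestionQueryEngine
    from llama_index.tools import QueryEngineTool, ToolMetadata

    engines = {"lyft_10k": lyft_engine, "uber_10k": uber_engine}
    tools = [
        QueryEngineTool(
            query_engine=engine,
            metadata=ToolMetadata(name=name, description=TOOL_DESCRIPTIONS[name]),
        )
        for name, engine in engines.items()
    ]
    return SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
```

The returned engine exposes the same query method as a simple query engine, so a compound question about both companies can be passed as a single string and the decomposition happens internally.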

Let's see these queries in action!
