ZeroxPDFLoader
This notebook provides a quick overview for getting started with the ZeroxPDF document loader. For detailed documentation of all DocumentLoader features and configurations, head to the API reference.
Overview
ZeroxPDFLoader is a document loader that leverages the Zerox library. Zerox converts PDF documents into images, processes them using a vision-capable language model, and generates a structured Markdown representation. This loader allows for asynchronous operations and provides page-level document extraction.
Integration details
Class | Package | Local | Serializable | JS support |
---|---|---|---|---|
ZeroxPDFLoader | langchain_community | ❌ | ❌ | ❌ |
Loader features
Source | Document Lazy Loading | Native Async Support | Extract Images | Extract Tables |
---|---|---|---|---|
ZeroxPDFLoader | ✅ | ❌ | ✅ | ✅ |
Setup
Credentials
Appropriate credentials need to be set up as environment variables. The loader supports a number of different models and model providers. See the usage examples below for a few of them, or the Zerox documentation for a full list of supported models.
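For example, when using an OpenAI model (the default), the API key is read from the OPENAI_API_KEY environment variable; other providers use their own variables (see the Zerox and LiteLLM provider documentation). A minimal setup, assuming OpenAI:
import os
from getpass import getpass

# Prompt for the key only if it is not already set in the environment.
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")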
Installation
To use ZeroxPDFLoader, you need to install the py-zerox package. Also make sure to have langchain-community installed.
%pip install -qU langchain_community py-zerox
Initialization
ZeroxPDFLoader enables PDF text extraction using vision-capable language models by converting each page into an image and processing it asynchronously. To use this loader, you need to specify a model and configure any necessary environment variables for Zerox, such as API keys.
If you're working in an environment like Jupyter Notebook, you may need to handle asynchronous code by using nest_asyncio. You can set this up as follows:
import nest_asyncio
nest_asyncio.apply()
import os
from getpass import getpass
import nest_asyncio
from langchain_community.document_loaders import ZeroxPDFLoader
nest_asyncio.apply()
if "OPENAI_API_KEY" not in os.environ:
os.environ["OPENAI_API_KEY"] = getpass("Enter your API key: ")
file_path = "./example_data/layout-parser-paper.pdf"
loader = ZeroxPDFLoader(file_path)
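The loader defaults to "gpt-4o-mini". To use another vision-capable model, pass it in <provider>/<model> format; the Azure OpenAI example below is only a sketch and assumes the corresponding provider credentials are configured:
# Optional: use a different vision-capable model (Azure OpenAI shown as an example).
loader = ZeroxPDFLoader(
    file_path,
    model="azure/gpt-4o-mini",
)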
Load
docs = loader.load()
docs[0]
import pprint
pprint.pp(docs[0].metadata)
Lazy Load
pages = []
for doc in loader.lazy_load():
    pages.append(doc)
    if len(pages) >= 10:
        # do some paged operation, e.g.
        # index.upsert(page)
        pages = []

len(pages)
The metadata attribute now contains several pieces of information about the file, in addition to the total number of pages.
Why is this important? If we want to reference a document, we need to determine whether the reference is relevant. A reference is useful if it helps the user quickly locate the fragment within the document (using the page number and/or a chunk excerpt). But if the URL points to a PDF file without a page number (for whatever reason) and the file has a large number of pages, the reference doesn't help the user and should be removed. There's no point in referencing a 100-page document! The total_pages metadata can be used for this.
pprint.pp(pages[0].page_content[:100])
pprint.pp(pages[0].metadata)
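As noted above, total_pages can help decide whether a reference will actually help the user. A minimal sketch, assuming 0-indexed page metadata, a #page URL anchor, and an arbitrary 10-page threshold:
def reference_for(doc, max_pages_without_anchor=10):
    # Build a reference for a chunk; drop it when it would not help the user.
    source = doc.metadata["source"]
    page = doc.metadata.get("page")
    if page is not None:
        # Assumes 0-indexed "page" metadata.
        return f"{source}#page={page + 1}"
    if doc.metadata.get("total_pages", 0) <= max_pages_without_anchor:
        return source
    return None  # a long PDF without a page anchor is not a useful reference


print(reference_for(pages[0]))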
Extract the PDF by page. Each page is extracted as a langchain Document object:
loader = ZeroxPDFLoader(
"./example_data/layout-parser-paper.pdf",
mode="page",
)
docs = loader.load()
print(len(docs))
Extract the whole PDF as a single langchain Document object:
loader = ZeroxPDFLoader(
"./example_data/layout-parser-paper.pdf",
mode="single",
)
docs = loader.load()
print(len(docs))
Add a custom pages_delimitor to identify the end of each page in single mode:
loader = ZeroxPDFLoader(
"./example_data/layout-parser-paper.pdf",
mode="single",
pages_delimitor="\n-------THIS IS A CUSTOM END OF PAGE-------\n",
)
docs = loader.load()
print(docs[0].page_content[:5780])
This could simply be \n, or \f to clearly indicate a page change, or <!-- PAGE BREAK --> for seamless injection in a Markdown viewer without a visual effect.
Why is it important to identify page breaks when retrieving the full document flow? Because we generally want to provide a URL with the chunk’s location when the LLM answers. While it’s possible to reference the entire PDF, this isn’t practical if it’s more than two pages long. It’s better to indicate the specific page to display in the URL. Therefore, assistance is needed so that chunking algorithms can add the page metadata to each chunk. The choice of delimiter helps the algorithm prioritize this parameter.
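For example, with mode="single" and a "\f" delimiter, a simple post-processing step can re-attach page numbers to the per-page blocks. A minimal sketch (it assumes the delimiter never occurs inside the page text itself):
from langchain_core.documents import Document

loader = ZeroxPDFLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="single",
    pages_delimitor="\f",
)
doc = loader.load()[0]

# Each "\f"-separated block corresponds to one page; re-attach a page number
# so downstream chunks can keep it in their metadata.
page_docs = [
    Document(page_content=text, metadata={**doc.metadata, "page": i})
    for i, text in enumerate(doc.page_content.split("\f"))
]
print(len(page_docs))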
Extract images from the PDF
In LangChain, the OCR process used by the parsers involves extracting the text of a page first, then retrieving the page's images and applying OCR to them.
In the previous implementation, the text extracted from images was appended to the end of the page text. In a RAG context, this means that if a paragraph of the original document is spread across two pages, it would be cut in half by the image text inserted in between, worsening the RAG model's performance.
To avoid this, we modified the strategy for injecting OCR results from images. Now, the result is inserted between the last and the second-to-last paragraphs of text (\n\n or \n) of the page.
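A simplified sketch of that strategy (not the actual parser code): the image-derived text is inserted just before the last paragraph of the page instead of being appended at the end:
def insert_image_text(page_text: str, image_text: str) -> str:
    # Insert the image-derived text between the second-to-last and last
    # paragraphs, so a paragraph continuing on the next page is not split.
    parts = page_text.rsplit("\n\n", 1)
    if len(parts) == 2:
        return f"{parts[0]}\n\n{image_text}\n\n{parts[1]}"
    return f"{page_text}\n\n{image_text}"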
Extract images from the PDF and describe them with a vision model, here in html format (markdown and text are also supported):
from langchain_community.document_loaders.parsers.pdf import (
convert_images_to_description,
)
loader = ZeroxPDFLoader(
"./example_data/layout-parser-paper.pdf",
mode="page",
extract_images=True,
images_to_text=convert_images_to_description(model=None, format="html"),
)
docs = loader.load()
print(docs[5].page_content)
API reference
ZeroxPDFLoader
This loader class initializes with a file path and model type, and supports custom configurations via zerox_kwargs for handling Zerox-specific parameters.
Arguments:
- file_path (Union[str, Path]): Path to the PDF file.
- model (str): Vision-capable model to use for processing, in <provider>/<model> format. Some examples of valid values are:
  - model = "gpt-4o-mini" ## openai model
  - model = "azure/gpt-4o-mini"
  - model = "gemini/gpt-4o-mini"
  - model = "claude-3-opus-20240229"
  - model = "vertex_ai/gemini-1.5-flash-001"
  - See more details in the Zerox documentation.
  - Defaults to "gpt-4o-mini".
- **zerox_kwargs (dict): Additional Zerox-specific parameters such as API key, endpoint, etc.
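For example, a possible instantiation (a sketch: the api_key and api_base keyword names are assumptions about what Zerox/LiteLLM accepts for an Azure OpenAI setup; check the Zerox documentation for your provider):
loader = ZeroxPDFLoader(
    "./example_data/layout-parser-paper.pdf",
    model="azure/gpt-4o-mini",
    # Extra keyword arguments are forwarded to Zerox; the names below are
    # assumptions for an Azure OpenAI setup -- adapt them to your provider.
    api_key=os.environ.get("AZURE_API_KEY"),
    api_base="https://<your-resource>.openai.azure.com/",
)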
Methods:
- lazy_load: Generates an iterator of Document instances, each representing a page of the PDF, along with metadata including page number and source.
See the full API documentation here
Notes
- Model Compatibility: Zerox supports a range of vision-capable models. Refer to Zerox's GitHub documentation for a list of supported models and configuration details.
- Environment Variables: Make sure to set required environment variables, such as API_KEY or endpoint details, as specified in the Zerox documentation.
- Asynchronous Processing: If you encounter errors related to event loops in Jupyter Notebooks, you may need to apply nest_asyncio as shown in the setup section.
Troubleshooting
- RuntimeError: This event loop is already running: Use nest_asyncio.apply() to prevent asynchronous loop conflicts in environments like Jupyter.
- Configuration Errors: Verify that the zerox_kwargs match the expected arguments for your chosen model and that all necessary environment variables are set.
Additional Resources
- Zerox Documentation: Zerox GitHub Repository
- LangChain Document Loaders: LangChain Documentation
Related
- Document loader conceptual guide
- Document loader how-to guides