ZeroxPDFLoader
This notebook provides a quick overview for getting started with the ZeroxPDF document loader. For detailed documentation of all DocumentLoader features and configurations, head to the API reference.
Overview
ZeroxPDFLoader is a document loader that leverages the Zerox library. Zerox converts PDF documents into images, processes them using a vision-capable language model, and generates a structured Markdown representation. This loader allows for asynchronous operations and provides page-level document extraction.
Integration details
Class | Package | Local | Serializable | JS support |
---|---|---|---|---|
ZeroxPDFLoader | langchain_community | ❌ | ❌ | ❌ |
Loader features
Source | Document Lazy Loading | Native Async Support | Extract Images | Extract Tables |
---|---|---|---|---|
ZeroxPDFLoader | ✅ | ❌ | ✅ | ✅ |
Setup
Credentials
Appropriate credentials need to be set up as environment variables. The loader supports a number of different models and model providers. See the usage examples below for a few of them, or the Zerox documentation for a full list of supported models.
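For example, when using an OpenAI model (the default), the API key is read from the OPENAI_API_KEY environment variable; other providers use their own variables (see the Zerox and LiteLLM provider documentation). A minimal setup, assuming OpenAI:
import os
from getpass import getpass

# Prompt for the key only if it is not already set in the environment.
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")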
Installation
To use ZeroxPDFLoader, you need to install the py-zerox package. Also make sure to have langchain-community installed.
%pip install -qU langchain_community py-zerox
Initialization
ZeroxPDFLoader enables PDF text extraction using vision-capable language models by converting each page into an image and processing it asynchronously. To use this loader, you need to specify a model and configure any necessary environment variables for Zerox, such as API keys.
If you're working in an environment like Jupyter Notebook, you may need to handle asynchronous code by using nest_asyncio. You can set this up as follows:
import nest_asyncio
nest_asyncio.apply()
import os
from getpass import getpass
import nest_asyncio
from langchain_community.document_loaders import ZeroxPDFLoader
nest_asyncio.apply()
if "OPENAI_API_KEY" not in os.environ:
os.environ["OPENAI_API_KEY"] = getpass("Enter your API key: ")
file_path = "./example_data/layout-parser-paper.pdf"
loader = ZeroxPDFLoader(file_path)
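The loader defaults to "gpt-4o-mini". To use another vision-capable model, pass it in <provider>/<model> format; the Azure OpenAI example below is only a sketch and assumes the corresponding provider credentials are configured:
# Optional: use a different vision-capable model (Azure OpenAI shown as an example).
loader = ZeroxPDFLoader(
    file_path,
    model="azure/gpt-4o-mini",
)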
Load
docs = loader.load()
docs[0]
import pprint
pprint.pp(docs[0].metadata)
Lazy Load
pages = []
for doc in loader.lazy_load():
    pages.append(doc)
    if len(pages) >= 10:
        # do some paged operation, e.g.
        # index.upsert(page)
        pages = []

len(pages)
The metadata attribute now contains several pieces of information about the file, in addition to the total number of pages.
Why is this important? If we want to reference a document, we need to determine whether the reference is relevant. A reference is useful if it helps the user quickly locate the fragment within the document (using the page number and/or a chunk excerpt). But if the URL points to a PDF file without a page number (for whatever reason) and the file has a large number of pages, the reference doesn't help the user and should be removed. There's no point in referencing a 100-page document! The total_pages metadata can be used for this.
pprint.pp(pages[0].page_content[:100])
pprint.pp(pages[0].metadata)
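As noted above, total_pages can help decide whether a reference will actually help the user. A minimal sketch, assuming 0-indexed page metadata, a #page URL anchor, and an arbitrary 10-page threshold:
def reference_for(doc, max_pages_without_anchor=10):
    # Build a reference for a chunk; drop it when it would not help the user.
    source = doc.metadata["source"]
    page = doc.metadata.get("page")
    if page is not None:
        # Assumes 0-indexed "page" metadata.
        return f"{source}#page={page + 1}"
    if doc.metadata.get("total_pages", 0) <= max_pages_without_anchor:
        return source
    return None  # a long PDF without a page anchor is not a useful reference


print(reference_for(pages[0]))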
Extract the PDF by page. Each page is extracted as a langchain Document object:
loader = ZeroxPDFLoader(
"./example_data/layout-parser-paper.pdf",
mode="page",
)
docs = loader.load()
print(len(docs))
Extract the whole PDF as a single langchain Document object:
loader = ZeroxPDFLoader(
"./example_data/layout-parser-paper.pdf",
mode="single",
)
docs = loader.load()
print(len(docs))
Add a custom pages_delimitor to identify the end of each page in single mode:
loader = ZeroxPDFLoader(
"./example_data/layout-parser-paper.pdf",
mode="single",
pages_delimitor="\n-------THIS IS A CUSTOM END OF PAGE-------\n",
)
docs = loader.load()
print(docs[0].page_content[:5780])
This could simply be \n, or \f to clearly indicate a page change, or <!-- PAGE BREAK --> for seamless injection in a Markdown viewer without a visual effect.
Why is it important to identify page breaks when retrieving the full document flow? Because we generally want to provide a URL with the chunk’s location when the LLM answers. While it’s possible to reference the entire PDF, this isn’t practical if it’s more than two pages long. It’s better to indicate the specific page to display in the URL. Therefore, assistance is needed so that chunking algorithms can add the page metadata to each chunk. The choice of delimiter helps the algorithm prioritize this parameter.
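For example, with mode="single" and a "\f" delimiter, a simple post-processing step can re-attach page numbers to the per-page blocks. A minimal sketch (it assumes the delimiter never occurs inside the page text itself):
from langchain_core.documents import Document

loader = ZeroxPDFLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="single",
    pages_delimitor="\f",
)
doc = loader.load()[0]

# Each "\f"-separated block corresponds to one page; re-attach a page number
# so downstream chunks can keep it in their metadata.
page_docs = [
    Document(page_content=text, metadata={**doc.metadata, "page": i})
    for i, text in enumerate(doc.page_content.split("\f"))
]
print(len(page_docs))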
Extract images from the PDF
In LangChain, the OCR process used by the parsers involves extracting the text of a page first, then retrieving the page's images and applying OCR to them.
In the previous implementation, the text extracted from images was appended to the end of the page text. In a RAG context, this means that if a paragraph of the original document is spread across two pages, it would be cut in half by the image text inserted in between, worsening the RAG model's performance.
To avoid this, we modified the strategy for injecting OCR results from images. Now, the result is inserted between the last and the second-to-last paragraphs of text (\n\n or \n) of the page.
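A simplified sketch of that strategy (not the actual parser code): the image-derived text is inserted just before the last paragraph of the page instead of being appended at the end:
def insert_image_text(page_text: str, image_text: str) -> str:
    # Insert the image-derived text between the second-to-last and last
    # paragraphs, so a paragraph continuing on the next page is not split.
    parts = page_text.rsplit("\n\n", 1)
    if len(parts) == 2:
        return f"{parts[0]}\n\n{image_text}\n\n{parts[1]}"
    return f"{page_text}\n\n{image_text}"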
Extract images from the PDF and describe them with a vision model, here in html format (markdown and text are also supported):
from langchain_community.document_loaders.parsers.pdf import (
convert_images_to_description,
)
loader = ZeroxPDFLoader(
"./example_data/layout-parser-paper.pdf",
mode="page",
extract_images=True,
images_to_text=convert_images_to_description(model=None, format="html"),
)
docs = loader.load()
print(docs[5].page_content)
API reference
ZeroxPDFLoader
This loader class initializes with a file path and model type, and supports custom configurations via zerox_kwargs for handling Zerox-specific parameters.
Arguments:
- file_path (Union[str, Path]): Path to the PDF file.
- model (str): Vision-capable model to use for processing, in <provider>/<model> format. Some examples of valid values are:
  - model = "gpt-4o-mini" ## openai model
  - model = "azure/gpt-4o-mini"
  - model = "gemini/gpt-4o-mini"
  - model = "claude-3-opus-20240229"
  - model = "vertex_ai/gemini-1.5-flash-001"
  - See more details in the Zerox documentation.
  - Defaults to "gpt-4o-mini".
- **zerox_kwargs (dict): Additional Zerox-specific parameters such as API key, endpoint, etc.
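For example, a possible instantiation (a sketch: the api_key and api_base keyword names are assumptions about what Zerox/LiteLLM accepts for an Azure OpenAI setup; check the Zerox documentation for your provider):
loader = ZeroxPDFLoader(
    "./example_data/layout-parser-paper.pdf",
    model="azure/gpt-4o-mini",
    # Extra keyword arguments are forwarded to Zerox; the names below are
    # assumptions for an Azure OpenAI setup -- adapt them to your provider.
    api_key=os.environ.get("AZURE_API_KEY"),
    api_base="https://<your-resource>.openai.azure.com/",
)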
Methods:
- lazy_load: Generates an iterator of Document instances, each representing a page of the PDF, along with metadata including page number and source.
See the full API documentation here
Notes
- Model Compatibility: Zerox supports a range of vision-capable models. Refer to Zerox's GitHub documentation for a list of supported models and configuration details.
- Environment Variables: Make sure to set required environment variables, such as API_KEY or endpoint details, as specified in the Zerox documentation.
- Asynchronous Processing: If you encounter errors related to event loops in Jupyter Notebooks, you may need to apply nest_asyncio as shown in the setup section.
Troubleshooting
- RuntimeError: This event loop is already running: Use nest_asyncio.apply() to prevent asynchronous loop conflicts in environments like Jupyter.
- Configuration Errors: Verify that the zerox_kwargs match the expected arguments for your chosen model and that all necessary environment variables are set.
Additional Resources
- Zerox Documentation: Zerox GitHub Repository
- LangChain Document Loaders: LangChain Documentation
Related
- Document loader conceptual guide
- Document loader how-to guides