My Next Works

Published: January 07, 2024

Interesting papers

Data-Centric Debugging: mitigating model failures via targeted image retrieval
Bi-directional Training for Composed Image Retrieval via Text Prompt Learning
DocLLM: A layout-aware generative language model for multimodal document understanding
1. The paper addresses visually rich documents such as forms, invoices, receipts, and contracts, whose meaning depends on both the text and its spatial arrangement on the page.
2. Existing large language models (LLMs) mostly accept text-only input and assume simple layouts, which makes them a poor fit for such documents.
3. The proposed DocLLM model is a lightweight extension to traditional LLMs that incorporates both textual semantics and spatial layout information. It avoids expensive image encoders and instead uses bounding box information to capture the spatial layout structure.
4. The pre-training objective is changed so that the model learns to infill text segments, which helps with the irregular layouts and mixed data types found in visual documents.
5. DocLLM is evaluated on several document intelligence tasks. The authors report that it outperforms state-of-the-art LLMs on 14 of 16 datasets and generalizes well to 4 of 5 previously unseen datasets.
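
To make point 3 concrete, here is a minimal NumPy sketch of how bounding boxes could enter attention without an image encoder. It is my reading of the idea, not the authors' code: the text and box streams are projected separately and their cross terms are added to the usual text-to-text scores. The function name, the weight names, the linear box projection, and the lambda defaults are all assumptions made for illustration.

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over the last axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def layout_aware_attention(text_h, boxes, w, lambdas=(1.0, 1.0, 1.0)):
    """Single-head causal attention mixing text and layout scores.

    text_h: (n, d) token hidden states.
    boxes:  (n, 4) boxes as (x0, y0, x1, y1), normalized to [0, 1].
    w:      dict of weight matrices (hypothetical names, see below).
    """
    n, d = text_h.shape
    box_h = boxes @ w["box_proj"]  # assumed: a plain linear box embedding

    q_t, k_t, v_t = text_h @ w["q_text"], text_h @ w["k_text"], text_h @ w["v_text"]
    q_s, k_s = box_h @ w["q_box"], box_h @ w["k_box"]

    lam_ts, lam_st, lam_ss = lambdas
    scores = (
        q_t @ k_t.T                 # text attends to text
        + lam_ts * (q_t @ k_s.T)    # text attends to layout
        + lam_st * (q_s @ k_t.T)    # layout attends to text
        + lam_ss * (q_s @ k_s.T)    # layout attends to layout
    ) / np.sqrt(d)

    # Causal mask: token i may only look at tokens 0..i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Values come from the text stream only (an assumption of this sketch).
    return softmax(scores) @ v_t

# Toy input: three tokens on an invoice-like page.
rng = np.random.default_rng(0)
d = 8
w = {name: rng.normal(scale=0.1, size=(d, d))
     for name in ("q_text", "k_text", "v_text", "q_box", "k_box")}
w["box_proj"] = rng.normal(scale=0.1, size=(4, d))

text_h = rng.normal(size=(3, d))
boxes = np.array([
    [0.10, 0.10, 0.30, 0.15],   # e.g. "Total"
    [0.70, 0.10, 0.85, 0.15],   # e.g. the amount on the same row
    [0.10, 0.80, 0.40, 0.85],   # e.g. a footer line
])
out = layout_aware_attention(text_h, boxes, w)
print(out.shape)
```

Setting all three lambdas to zero reduces this to ordinary text-only causal attention, which is why the approach reads as a lightweight extension rather than a new architecture.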
A Mathematical Framework for Transformer Circuits