Implement Vision Grid Transformer for Document Layout Analysis
AlibabaResearch recently published VGT, a new model for Document Layout Analysis (DLA) that the authors report achieves state-of-the-art results on DLA benchmarks.
Introduction - To fully leverage multi-modal information and exploit pre-training techniques to learn better representation for DLA, in this paper, we present VGT, a two-stream Vision Grid Transformer, in which Grid Transformer (GiT) is proposed and pre-trained for 2D token-level and segment-level semantic understanding
https://arxiv.org/abs/2308.14978
Effect on LLM usage - VGT can segment a page into labeled layout regions (headers, subheaders, titles, etc.). Each region can then be OCRed separately and passed to an LLM as a labeled chunk for RAG.
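A rough sketch of how the region-to-chunk step could look is below. It does not call VGT or any real OCR library: `Region`, `regions_to_chunks`, and the `ocr` callable are hypothetical names introduced only for illustration, and the layout detector's output format (label, pixel bounding box, confidence score) is an assumption. The top-to-bottom, left-to-right ordering is a simplification that only holds for single-column pages.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# (x0, y0, x1, y1) in page pixels; assumed output format of the layout model.
BBox = Tuple[int, int, int, int]


@dataclass
class Region:
    """One detected layout region (hypothetical structure, not VGT's API)."""
    label: str    # e.g. "title", "header", "text"
    bbox: BBox
    score: float  # detector confidence in [0, 1]


def regions_to_chunks(
    regions: List[Region],
    ocr: Callable[[BBox], str],
    min_score: float = 0.5,
) -> List[Dict]:
    """Turn layout regions into labeled text chunks for retrieval.

    `ocr` is any function that returns the text inside a bounding box;
    plug in whichever OCR engine the project already uses.
    """
    # Drop low-confidence detections.
    kept = [r for r in regions if r.score >= min_score]
    # Naive reading order: sort by top edge, then left edge.
    kept.sort(key=lambda r: (r.bbox[1], r.bbox[0]))

    chunks = []
    for region in kept:
        text = ocr(region.bbox).strip()
        if text:  # skip regions where OCR found nothing
            chunks.append(
                {"label": region.label, "bbox": region.bbox, "text": text}
            )
    return chunks
```

Keeping the label alongside each chunk lets the retrieval step filter or weight by region type, for example preferring titles and headers when building an index of section names.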