Parsing: OCR metrics #3552

Open
jacopo-chevallard opened this issue Jan 22, 2025 — with Linear · 1 comment

jacopo-chevallard commented Jan 22, 2025

From the experiment tracker, retrieve the JSON containing the ground-truth layout. For each PDF page, it looks like:

{
  "extra": {
    "relation": [
      {
        "relation_type": "parent_son",
        "source_anno_id": 2,
        "target_anno_id": 3
      },
      {
        "relation_type": "parent_son",
        "source_anno_id": 5,
        "target_anno_id": 8
      }
    ]
  },
  "layout_dets": [
    {
      "anno_id": 6,
      "attribute": {
        "text_background": "white",
        "text_language": "text_simplified_chinese",
        "text_rotate": "normal"
      },
      "category_type": "title",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            109.3333333333331,
            121.73651418039208,
            722.1022134807848,
            121.73651418039208,
            722.1022134807848,
            195.75809149176507,
            109.3333333333331,
            195.75809149176507
          ],
          "text": "国资背景基金情况"
        }
      ],
      "order": 1,
      "poly": [
        102.5999912116609,
        120.87255879760278,
        719.3118659856144,
        120.87255879760278,
        719.3118659856144,
        194.14083813380114,
        102.5999912116609,
        194.14083813380114
      ],
      "text": "国资背景基金情况"
    },
    {
      "anno_id": 4,
      "attribute": {
        "text_background": "white",
        "text_language": "text_simplified_chinese",
        "text_rotate": "normal"
      },
      "category_type": "text_block",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            99.66504579139392,
            227.6650457913944,
            1269.333333333333,
            227.6650457913944,
            1269.333333333333,
            271.3365750838786,
            99.66504579139392,
            271.3365750838786
          ],
          "text": "2022年备案基金规模小幅回升,但仍未恢复至资管新规出台前的水平"
        }
      ],
      "order": 2,
      "poly": [
        97.71487020898245,
        226.92028692633914,
        1271.9932332148471,
        226.92028692633914,
        1271.9932332148471,
        264.88925750697814,
        97.71487020898245,
        264.88925750697814
      ],
      "text": "2022年备案基金规模小幅回升,但仍未恢复至资管新规出台前的水平"
    },
    {
      "anno_id": 3,
      "attribute": {
        "text_background": "white",
        "text_language": "text_simplified_chinese",
        "text_rotate": "normal"
      },
      "category_type": "figure_caption",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            253.94664201855937,
            321.21295194692755,
            1076.1203813864063,
            321.21295194692755,
            1076.1203813864063,
            364.93470762745034,
            253.94664201855937,
            364.93470762745034
          ],
          "text": "2014年-2023Q3国资背景基金的备案数量及规模"
        }
      ],
      "order": 3,
      "poly": [
        246.96994018554688,
        318.7444152832031,
        1088.26025390625,
        318.7444152832031,
        1088.26025390625,
        369.0964660644531,
        246.96994018554688,
        369.0964660644531
      ],
      "text": "2014年-2023Q3国资背景基金的备案数量及规模"
    },
    {
      "anno_id": 2,
      "category_type": "figure",
      "ignore": false,
      "order": 4,
      "poly": [
        118.08102792118407,
        379.29373168945347,
        1299.4279383691976,
        379.29373168945347,
        1299.4279383691976,
        1028.2773128579047,
        118.08102792118407,
        1028.2773128579047
      ]
    },
    {
      "anno_id": 8,
      "attribute": {
        "text_background": "white",
        "text_language": "text_simplified_chinese",
        "text_rotate": "normal"
      },
      "category_type": "figure_caption",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            1509.6758069519938,
            324.34247361866034,
            2292.4771492866826,
            324.34247361866034,
            2292.4771492866826,
            364.8196229053426,
            1509.6758069519938,
            364.8196229053426
          ],
          "text": "2014年-2023Q3国资背景基金数量TOP10地区"
        }
      ],
      "order": 5,
      "poly": [
        1497.726318359375,
        318.7418518066406,
        2301.80224609375,
        318.7418518066406,
        2301.80224609375,
        367.1272888183594,
        1497.726318359375,
        367.1272888183594
      ],
      "text": "2014年-2023Q3国资背景基金数量TOP10地区"
    },
    {
      "anno_id": 5,
      "category_type": "figure",
      "ignore": false,
      "order": 6,
      "poly": [
        1370.0374839590943,
        424.35013794251097,
        2552.3561471143494,
        424.35013794251097,
        2552.3561471143494,
        1026.8955618700252,
        1370.0374839590943,
        1026.8955618700252
      ]
    },
    {
      "anno_id": 9,
      "attribute": {
        "text_background": "white",
        "text_language": "text_simplified_chinese",
        "text_rotate": "normal"
      },
      "category_type": "title",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            169.67751098302242,
            1071.225836994341,
            328.08580770628134,
            1071.225836994341,
            328.08580770628134,
            1111.655822350311,
            169.67751098302242,
            1111.655822350311
          ],
          "text": "核心发现"
        }
      ],
      "order": 7,
      "poly": [
        170.92340081387997,
        1069.7956822171332,
        326.21460986860313,
        1069.7956822171332,
        326.21460986860313,
        1111.7494049722532,
        170.92340081387997,
        1111.7494049722532
      ],
      "text": "核心发现"
    },
    {
      "anno_id": 7,
      "attribute": {
        "text_background": "white",
        "text_language": "text_simplified_chinese",
        "text_rotate": "normal"
      },
      "category_type": "text_block",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            165.603649650326,
            1150.009124125815,
            2509.333333333333,
            1150.009124125815,
            2509.333333333333,
            1198.666666666666,
            165.603649650326,
            1198.666666666666
          ],
          "text": "- 2018年4月资管新规出台后,国资背景基金备案数量增速放缓且规模骤减,受新冠疫情影响,2021年新增基金规模再次下降,虽然"
        },
        {
          "category_type": "text_span",
          "poly": [
            219.22996126565647,
            1201.1457902508969,
            2250.770752144285,
            1201.1457902508969,
            2250.770752144285,
            1243.9433217869077,
            219.22996126565647,
            1243.9433217869077
          ],
          "text": "2022年基金规模回升至1.25万亿元,但仍未恢复至资管新规出台前的水平,2023前三季度新增规模略低于2022年同期。"
        }
      ],
      "order": 8,
      "poly": [
        172.66793877059249,
        1155.2640660519091,
        2514.2408071863138,
        1155.2640660519091,
        2514.2408071863138,
        1241.6284871157177,
        172.66793877059249,
        1241.6284871157177
      ],
      "text": "- 2018年4月资管新规出台后,国资背景基金备案数量增速放缓且规模骤减,受新冠疫情影响,2021年新增基金规模再次下降,虽然 2022年基金规模回升至1.25万亿元,但仍未恢复至资管新规出台前的水平,2023前三季度新增规模略低于2022年同期。"
    },
    {
      "anno_id": 1,
      "attribute": {
        "text_background": "white",
        "text_language": "text_simplified_chinese",
        "text_rotate": "normal"
      },
      "category_type": "text_block",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            161.7899369148969,
            1278.308761376868,
            2508,
            1278.308761376868,
            2508,
            1317.333333333333,
            161.7899369148969,
            1317.333333333333
          ],
          "text": "- 截至2023Q3全国国资背景基金备案数量累计9196只,基金规模累计8.91万亿元。基金注册区域集中于广东省、浙江省和江苏省,广东"
        },
        {
          "category_type": "text_span",
          "poly": [
            222.66666666666688,
            1325.3333333333335,
            1623.8331583485456,
            1325.3333333333335,
            1623.8331583485456,
            1365.333333333333,
            222.66666666666688,
            1365.333333333333
          ],
          "text": "省国资背景基金总规模遥遥领先。备案基金数量前10的省份基金数量占全国总量的"
        },
        {
          "category_type": "equation_ignore",
          "poly": [
            1624.4165959289367,
            1327.0154193159506,
            1703.7259660435407,
            1327.0154193159506,
            1703.7259660435407,
            1363.1237504250385,
            1624.4165959289367,
            1363.1237504250385
          ],
          "text": "73%"
        },
        {
          "category_type": "text_span",
          "poly": [
            1704.6905743174548,
            1322.6134268787764,
            2053.985160092844,
            1322.6134268787764,
            2053.985160092844,
            1370.6736155849724,
            1704.6905743174548,
            1370.6736155849724
          ],
          "text": ",规模占全国总量的"
        },
        {
          "category_type": "equation_ignore",
          "poly": [
            2055.1374027302004,
            1326.3706276890023,
            2149.276980264608,
            1326.3706276890023,
            2149.276980264608,
            1365.7029169328305,
            2055.1374027302004,
            1365.7029169328305
          ],
          "text": "68%。"
        }
      ],
      "order": 9,
      "poly": [
        171.69999831539863,
        1278.820932742719,
        2512.084408886781,
        1278.820932742719,
        2512.084408886781,
        1365.690053585406,
        171.69999831539863,
        1365.690053585406
      ],
      "text": "- 截至2023Q3全国国资背景基金备案数量累计9196只,基金规模累计8.91万亿元。基金注册区域集中于广东省、浙江省和江苏省,广东省国资背景基金总规模遥遥领先。备案基金数量前10的省份基金数量占全国总量的 73% ,规模占全国总量的 68%。"
    },
    {
      "anno_id": 10,
      "category_type": "abandon",
      "ignore": false,
      "order": null,
      "poly": [
        114.12910090860571,
        1403.1676953230935,
        175.21358196554792,
        1403.1676953230935,
        175.21358196554792,
        1462.6586681785502,
        114.12910090860571,
        1462.6586681785502
      ]
    },
    {
      "anno_id": 0,
      "attribute": {
        "text_background": "white",
        "text_language": "text_en_ch_mixed",
        "text_rotate": "normal"
      },
      "category_type": "footer",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            178.18192276049803,
            1409.8767302579377,
            288.0868232114207,
            1409.8767302579377,
            288.0868232114207,
            1467.2607048296584,
            178.18192276049803,
            1467.2607048296584
          ],
          "text": "CVINFO 投中信息"
        }
      ],
      "order": null,
      "poly": [
        180.18207532211585,
        1404.2778174322868,
        289.9793827860912,
        1404.2778174322868,
        289.9793827860912,
        1462.652231000048,
        180.18207532211585,
        1462.652231000048
      ],
      "text": "CVINFO 投中信息"
    }
  ],
  "page_info": {
    "height": 1500,
    "image_path": "eastmoney_59cde7e939acc3124df9d3f2c85b5a0ec41b9da1157d5be38e098672022b47cb.pdf_11.jpg",
    "page_attribute": {
      "data_source": "PPT2PDF",
      "language": "simplified_chinese",
      "layout": "1andmore_column",
      "special_issue": [
        "watermark"
      ]
    },
    "page_no": 11,
    "width": 2667
  }
}

We iterate over the layout_dets array and check each element for the text key. If the key is present, we extract the text together with its poly, which encodes the position as a flat list of eight numbers: the (x, y) coordinates of the top-left, top-right, bottom-right, and bottom-left corners of the bounding box. Elements without a text key (for example figure and abandon in the sample above) are skipped.
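The extraction step above can be sketched as follows. This is a minimal sketch, not the project's implementation: the function name extract_text_blocks and the sample_page values are hypothetical, and the only assumption about the data is the layout shown in the JSON above.

```python
from typing import Any


def extract_text_blocks(page: dict[str, Any]) -> list[dict[str, Any]]:
    """Collect text and corner points for every layout element with a "text" key.

    Hypothetical helper: assumes the ground-truth page layout shown above.
    """
    blocks = []
    for det in page.get("layout_dets", []):
        if "text" not in det:
            continue  # e.g. "figure" or "abandon" elements carry no text
        poly = det["poly"]
        # poly is flat: x, y for top-left, top-right, bottom-right, bottom-left
        corners = list(zip(poly[0::2], poly[1::2]))
        blocks.append(
            {
                "category_type": det["category_type"],
                "text": det["text"],
                "corners": corners,
            }
        )
    return blocks


# Illustrative page with rounded coordinates (not real ground truth).
sample_page = {
    "layout_dets": [
        {
            "anno_id": 9,
            "category_type": "title",
            "poly": [170.0, 1070.0, 326.0, 1070.0, 326.0, 1112.0, 170.0, 1112.0],
            "text": "核心发现",
        },
        {
            "anno_id": 2,
            "category_type": "figure",
            "poly": [118.0, 379.0, 1299.0, 379.0, 1299.0, 1028.0, 118.0, 1028.0],
        },
    ]
}

blocks = extract_text_blocks(sample_page)
```

Here blocks contains only the title element, since the figure has no text key. Whether elements flagged with ignore should also be skipped is not stated in this issue, so the sketch leaves them in.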

We extract the same information from the Megaparse output, i.e. for each page we extract the text and the bounding box. We then group the pages by block category, document type, language, and layout type, and:

We can also compute the metrics above across all document types.
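The grouping step could look roughly like the sketch below. It is an assumption that "document type" corresponds to page_attribute.data_source, "language" to page_attribute.language, and "layout type" to page_attribute.layout in the ground-truth JSON; the function name group_text_blocks and the sample data are hypothetical. The sketch only builds the groups and does not compute any metric.

```python
from collections import defaultdict
from typing import Any

# (block category, document type, language, layout type)
GroupKey = tuple[str, str, str, str]


def group_text_blocks(pages: list[dict[str, Any]]) -> dict[GroupKey, list[str]]:
    """Group ground-truth texts by block category and page attributes.

    Hypothetical helper: the mapping of "document type" to data_source
    is an assumption, not something the issue confirms.
    """
    groups: dict[GroupKey, list[str]] = defaultdict(list)
    for page in pages:
        attrs = page["page_info"]["page_attribute"]
        for det in page["layout_dets"]:
            if "text" not in det:
                continue  # only elements with text are compared
            key = (
                det["category_type"],
                attrs["data_source"],
                attrs["language"],
                attrs["layout"],
            )
            groups[key].append(det["text"])
    return dict(groups)


# Illustrative input reusing attribute values from the sample page above.
sample_pages = [
    {
        "layout_dets": [
            {"category_type": "title", "poly": [0, 0, 1, 0, 1, 1, 0, 1], "text": "核心发现"},
            {"category_type": "figure", "poly": [0, 0, 1, 0, 1, 1, 0, 1]},
            {"category_type": "footer", "poly": [0, 0, 1, 0, 1, 1, 0, 1], "text": "CVINFO 投中信息"},
        ],
        "page_info": {
            "page_attribute": {
                "data_source": "PPT2PDF",
                "language": "simplified_chinese",
                "layout": "1andmore_column",
            }
        },
    }
]

groups = group_text_blocks(sample_pages)
```

Aggregating across all document types would then amount to dropping the data_source component from the key before grouping.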

jacopo-chevallard self-assigned this Jan 22, 2025
linear bot commented Jan 22, 2025
