Parsing: OCR metrics #3552

jacopo-chevallard · 2025-01-22T10:43:19Z

From the experiment tracker, retrieve the JSON containing the ground-truth layout. For each PDF page, it looks like this:

{
  "extra": {
    "relation": [
      {
        "relation_type": "parent_son",
        "source_anno_id": 2,
        "target_anno_id": 3
      },
      {
        "relation_type": "parent_son",
        "source_anno_id": 5,
        "target_anno_id": 8
      }
    ]
  },
  "layout_dets": [
    {
      "anno_id": 6,
      "attribute": {
        "text_background": "white",
        "text_language": "text_simplified_chinese",
        "text_rotate": "normal"
      },
      "category_type": "title",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            109.3333333333331,
            121.73651418039208,
            722.1022134807848,
            121.73651418039208,
            722.1022134807848,
            195.75809149176507,
            109.3333333333331,
            195.75809149176507
          ],
          "text": "国资背景基金情况"
        }
      ],
      "order": 1,
      "poly": [
        102.5999912116609,
        120.87255879760278,
        719.3118659856144,
        120.87255879760278,
        719.3118659856144,
        194.14083813380114,
        102.5999912116609,
        194.14083813380114
      ],
      "text": "国资背景基金情况"
    },
    {
      "anno_id": 4,
      "attribute": {
        "text_background": "white",
        "text_language": "text_simplified_chinese",
        "text_rotate": "normal"
      },
      "category_type": "text_block",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            99.66504579139392,
            227.6650457913944,
            1269.333333333333,
            227.6650457913944,
            1269.333333333333,
            271.3365750838786,
            99.66504579139392,
            271.3365750838786
          ],
          "text": "2022年备案基金规模小幅回升，但仍未恢复至资管新规出台前的水平"
        }
      ],
      "order": 2,
      "poly": [
        97.71487020898245,
        226.92028692633914,
        1271.9932332148471,
        226.92028692633914,
        1271.9932332148471,
        264.88925750697814,
        97.71487020898245,
        264.88925750697814
      ],
      "text": "2022年备案基金规模小幅回升，但仍未恢复至资管新规出台前的水平"
    },
    {
      "anno_id": 3,
      "attribute": {
        "text_background": "white",
        "text_language": "text_simplified_chinese",
        "text_rotate": "normal"
      },
      "category_type": "figure_caption",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            253.94664201855937,
            321.21295194692755,
            1076.1203813864063,
            321.21295194692755,
            1076.1203813864063,
            364.93470762745034,
            253.94664201855937,
            364.93470762745034
          ],
          "text": "2014年-2023Q3国资背景基金的备案数量及规模"
        }
      ],
      "order": 3,
      "poly": [
        246.96994018554688,
        318.7444152832031,
        1088.26025390625,
        318.7444152832031,
        1088.26025390625,
        369.0964660644531,
        246.96994018554688,
        369.0964660644531
      ],
      "text": "2014年-2023Q3国资背景基金的备案数量及规模"
    },
    {
      "anno_id": 2,
      "category_type": "figure",
      "ignore": false,
      "order": 4,
      "poly": [
        118.08102792118407,
        379.29373168945347,
        1299.4279383691976,
        379.29373168945347,
        1299.4279383691976,
        1028.2773128579047,
        118.08102792118407,
        1028.2773128579047
      ]
    },
    {
      "anno_id": 8,
      "attribute": {
        "text_background": "white",
        "text_language": "text_simplified_chinese",
        "text_rotate": "normal"
      },
      "category_type": "figure_caption",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            1509.6758069519938,
            324.34247361866034,
            2292.4771492866826,
            324.34247361866034,
            2292.4771492866826,
            364.8196229053426,
            1509.6758069519938,
            364.8196229053426
          ],
          "text": "2014年-2023Q3国资背景基金数量TOP10地区"
        }
      ],
      "order": 5,
      "poly": [
        1497.726318359375,
        318.7418518066406,
        2301.80224609375,
        318.7418518066406,
        2301.80224609375,
        367.1272888183594,
        1497.726318359375,
        367.1272888183594
      ],
      "text": "2014年-2023Q3国资背景基金数量TOP10地区"
    },
    {
      "anno_id": 5,
      "category_type": "figure",
      "ignore": false,
      "order": 6,
      "poly": [
        1370.0374839590943,
        424.35013794251097,
        2552.3561471143494,
        424.35013794251097,
        2552.3561471143494,
        1026.8955618700252,
        1370.0374839590943,
        1026.8955618700252
      ]
    },
    {
      "anno_id": 9,
      "attribute": {
        "text_background": "white",
        "text_language": "text_simplified_chinese",
        "text_rotate": "normal"
      },
      "category_type": "title",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            169.67751098302242,
            1071.225836994341,
            328.08580770628134,
            1071.225836994341,
            328.08580770628134,
            1111.655822350311,
            169.67751098302242,
            1111.655822350311
          ],
          "text": "核心发现"
        }
      ],
      "order": 7,
      "poly": [
        170.92340081387997,
        1069.7956822171332,
        326.21460986860313,
        1069.7956822171332,
        326.21460986860313,
        1111.7494049722532,
        170.92340081387997,
        1111.7494049722532
      ],
      "text": "核心发现"
    },
    {
      "anno_id": 7,
      "attribute": {
        "text_background": "white",
        "text_language": "text_simplified_chinese",
        "text_rotate": "normal"
      },
      "category_type": "text_block",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            165.603649650326,
            1150.009124125815,
            2509.333333333333,
            1150.009124125815,
            2509.333333333333,
            1198.666666666666,
            165.603649650326,
            1198.666666666666
          ],
          "text": "- 2018年4月资管新规出台后，国资背景基金备案数量增速放缓且规模骤减，受新冠疫情影响，2021年新增基金规模再次下降，虽然"
        },
        {
          "category_type": "text_span",
          "poly": [
            219.22996126565647,
            1201.1457902508969,
            2250.770752144285,
            1201.1457902508969,
            2250.770752144285,
            1243.9433217869077,
            219.22996126565647,
            1243.9433217869077
          ],
          "text": "2022年基金规模回升至1.25万亿元，但仍未恢复至资管新规出台前的水平，2023前三季度新增规模略低于2022年同期。"
        }
      ],
      "order": 8,
      "poly": [
        172.66793877059249,
        1155.2640660519091,
        2514.2408071863138,
        1155.2640660519091,
        2514.2408071863138,
        1241.6284871157177,
        172.66793877059249,
        1241.6284871157177
      ],
      "text": "- 2018年4月资管新规出台后，国资背景基金备案数量增速放缓且规模骤减，受新冠疫情影响，2021年新增基金规模再次下降，虽然 2022年基金规模回升至1.25万亿元，但仍未恢复至资管新规出台前的水平，2023前三季度新增规模略低于2022年同期。"
    },
    {
      "anno_id": 1,
      "attribute": {
        "text_background": "white",
        "text_language": "text_simplified_chinese",
        "text_rotate": "normal"
      },
      "category_type": "text_block",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            161.7899369148969,
            1278.308761376868,
            2508,
            1278.308761376868,
            2508,
            1317.333333333333,
            161.7899369148969,
            1317.333333333333
          ],
          "text": "- 截至2023Q3全国国资背景基金备案数量累计9196只，基金规模累计8.91万亿元。基金注册区域集中于广东省、浙江省和江苏省，广东"
        },
        {
          "category_type": "text_span",
          "poly": [
            222.66666666666688,
            1325.3333333333335,
            1623.8331583485456,
            1325.3333333333335,
            1623.8331583485456,
            1365.333333333333,
            222.66666666666688,
            1365.333333333333
          ],
          "text": "省国资背景基金总规模遥遥领先。备案基金数量前10的省份基金数量占全国总量的"
        },
        {
          "category_type": "equation_ignore",
          "poly": [
            1624.4165959289367,
            1327.0154193159506,
            1703.7259660435407,
            1327.0154193159506,
            1703.7259660435407,
            1363.1237504250385,
            1624.4165959289367,
            1363.1237504250385
          ],
          "text": "73%"
        },
        {
          "category_type": "text_span",
          "poly": [
            1704.6905743174548,
            1322.6134268787764,
            2053.985160092844,
            1322.6134268787764,
            2053.985160092844,
            1370.6736155849724,
            1704.6905743174548,
            1370.6736155849724
          ],
          "text": "，规模占全国总量的"
        },
        {
          "category_type": "equation_ignore",
          "poly": [
            2055.1374027302004,
            1326.3706276890023,
            2149.276980264608,
            1326.3706276890023,
            2149.276980264608,
            1365.7029169328305,
            2055.1374027302004,
            1365.7029169328305
          ],
          "text": "68%。"
        }
      ],
      "order": 9,
      "poly": [
        171.69999831539863,
        1278.820932742719,
        2512.084408886781,
        1278.820932742719,
        2512.084408886781,
        1365.690053585406,
        171.69999831539863,
        1365.690053585406
      ],
      "text": "- 截至2023Q3全国国资背景基金备案数量累计9196只，基金规模累计8.91万亿元。基金注册区域集中于广东省、浙江省和江苏省，广东省国资背景基金总规模遥遥领先。备案基金数量前10的省份基金数量占全国总量的 73% ，规模占全国总量的 68%。"
    },
    {
      "anno_id": 10,
      "category_type": "abandon",
      "ignore": false,
      "order": null,
      "poly": [
        114.12910090860571,
        1403.1676953230935,
        175.21358196554792,
        1403.1676953230935,
        175.21358196554792,
        1462.6586681785502,
        114.12910090860571,
        1462.6586681785502
      ]
    },
    {
      "anno_id": 0,
      "attribute": {
        "text_background": "white",
        "text_language": "text_en_ch_mixed",
        "text_rotate": "normal"
      },
      "category_type": "footer",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            178.18192276049803,
            1409.8767302579377,
            288.0868232114207,
            1409.8767302579377,
            288.0868232114207,
            1467.2607048296584,
            178.18192276049803,
            1467.2607048296584
          ],
          "text": "CVINFO 投中信息"
        }
      ],
      "order": null,
      "poly": [
        180.18207532211585,
        1404.2778174322868,
        289.9793827860912,
        1404.2778174322868,
        289.9793827860912,
        1462.652231000048,
        180.18207532211585,
        1462.652231000048
      ],
      "text": "CVINFO 投中信息"
    }
  ],
  "page_info": {
    "height": 1500,
    "image_path": "eastmoney_59cde7e939acc3124df9d3f2c85b5a0ec41b9da1157d5be38e098672022b47cb.pdf_11.jpg",
    "page_attribute": {
      "data_source": "PPT2PDF",
      "language": "simplified_chinese",
      "layout": "1andmore_column",
      "special_issue": [
        "watermark"
      ]
    },
    "page_no": 11,
    "width": 2667
  }
}

We take the layout_dets array and, for each element, check whether the text key is present. If it is, we extract the text together with its poly, which encodes the position of the bounding box as eight values: the (x, y) coordinates of the top-left, top-right, bottom-right, and bottom-left corners, in that order.
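The extraction step described above could be sketched as follows. This is a minimal illustration, not the project's implementation: the helper names `poly_to_bbox` and `extract_text_blocks` are hypothetical, and the sample page is a trimmed copy of the JSON above with coordinates rounded for readability.

```python
from typing import Any


def poly_to_bbox(poly: list[float]) -> tuple[float, float, float, float]:
    """Convert an 8-value polygon (x, y pairs for the four corners)
    into an axis-aligned box (x_min, y_min, x_max, y_max)."""
    xs, ys = poly[0::2], poly[1::2]
    return (min(xs), min(ys), max(xs), max(ys))


def extract_text_blocks(page: dict[str, Any]) -> list[dict[str, Any]]:
    """Keep only the layout_dets elements that carry a "text" key."""
    blocks = []
    for det in page.get("layout_dets", []):
        if "text" not in det:
            continue  # e.g. figures and abandoned regions have no text
        blocks.append(
            {
                "anno_id": det.get("anno_id"),
                "category_type": det.get("category_type"),
                "text": det["text"],
                "bbox": poly_to_bbox(det["poly"]),
            }
        )
    return blocks


# Trimmed sample: one title block with text, one figure without text.
page = {
    "layout_dets": [
        {
            "anno_id": 6,
            "category_type": "title",
            "poly": [102.6, 120.9, 719.3, 120.9, 719.3, 194.1, 102.6, 194.1],
            "text": "国资背景基金情况",
        },
        {
            "anno_id": 2,
            "category_type": "figure",
            "poly": [118.1, 379.3, 1299.4, 379.3, 1299.4, 1028.3, 118.1, 1028.3],
        },
    ]
}

blocks = extract_text_blocks(page)
```

Reducing the polygon to an axis-aligned box is a simplification that fits the sample, where every poly is a rectangle; rotated text would need the full polygon.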

We extract the same information from the Megaparse output, i.e. for each page we extract the text and the bounding box. We then group the pages by block category, document type, language, and layout type, and:

in each page, find the matching bounding boxes (within some tolerance). We only compare texts for matching bounding boxes; if no box matches, we can consider that the problem lies in the layout detection step.
compute the Levenshtein (edit) distance
compute the Character Error Rate
compute the Word Error Rate
compute the Match Error Rate
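One possible way to find matching bounding boxes is a greedy one-to-one assignment by intersection over union (IoU). The issue does not specify the matching criterion, so both the IoU measure and the 0.5 threshold below are assumptions, and the function names are hypothetical. Boxes are expected as (x_min, y_min, x_max, y_max).

```python
Box = tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_boxes(
    gt_boxes: list[Box], pred_boxes: list[Box], threshold: float = 0.5
) -> list[tuple[int, int, float]]:
    """Greedily pair ground-truth and predicted boxes, best IoU first.

    Returns (gt_index, pred_index, iou) triples. Each box is used at most
    once; pairs below the threshold are left unmatched.
    """
    candidates = sorted(
        (
            (iou(g, p), i, j)
            for i, g in enumerate(gt_boxes)
            for j, p in enumerate(pred_boxes)
        ),
        reverse=True,
    )
    used_gt: set[int] = set()
    used_pred: set[int] = set()
    matches = []
    for score, i, j in candidates:
        if score < threshold:
            break  # remaining candidates are all below the threshold
        if i in used_gt or j in used_pred:
            continue
        used_gt.add(i)
        used_pred.add(j)
        matches.append((i, j, score))
    return matches


gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred = [(21, 21, 30, 30), (0, 0, 10, 9), (100, 100, 110, 110)]
matches = match_boxes(gt, pred)
```

Ground-truth boxes left without a match would then be counted as layout detection errors rather than OCR errors. A greedy assignment is simple but not guaranteed to be globally optimal; an optimal assignment (e.g. the Hungarian algorithm) is an alternative if boxes overlap heavily.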

We can also compute the metrics above across all document types.
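The four metrics listed above can all be derived from the same edit-distance alignment. The sketch below uses the common definitions, with H hits, S substitutions, D deletions, I insertions, and N = H + S + D reference tokens: CER and WER are (S + D + I) / N at character and word level, and MER is (S + D + I) / (H + S + D + I). The function names are hypothetical, and splitting words on whitespace is an assumption that does not suit unsegmented Chinese text such as the sample page.

```python
from typing import Sequence


def edit_counts(ref: Sequence, hyp: Sequence) -> tuple[int, int, int, int]:
    """Return (hits, substitutions, deletions, insertions) for one
    optimal Levenshtein alignment of hyp against ref."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # hit or substitution
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
            )
    # Walk back from the bottom-right corner, counting operations.
    hits = subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
            hits, subs = hits + same, subs + (not same)
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels, i = dels + 1, i - 1
        else:
            ins, j = ins + 1, j - 1
    return hits, subs, dels, ins


def levenshtein(ref: Sequence, hyp: Sequence) -> int:
    _, s, d, i = edit_counts(ref, hyp)
    return s + d + i


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """(S + D + I) / N. For an empty reference, returns 0.0 if the
    hypothesis is also empty and 1.0 otherwise (a convention, not a standard)."""
    h, s, d, i = edit_counts(ref, hyp)
    n = h + s + d
    return (s + d + i) / n if n else (1.0 if i else 0.0)


def cer(ref: str, hyp: str) -> float:
    return error_rate(ref, hyp)


def wer(ref: str, hyp: str) -> float:
    return error_rate(ref.split(), hyp.split())


def mer(ref: str, hyp: str) -> float:
    h, s, d, i = edit_counts(ref.split(), hyp.split())
    total = h + s + d + i
    return (s + d + i) / total if total else 0.0


reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
scores = {
    "levenshtein": levenshtein(reference, hypothesis),
    "cer": cer(reference, hypothesis),
    "wer": wer(reference, hypothesis),
    "mer": mer(reference, hypothesis),
}
```

When aggregating across blocks, pages, or document types, a common choice is to sum the edit counts and reference lengths over the group before dividing, so that long blocks are not underweighted compared with averaging per-block rates.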

linear · 2025-01-22T10:43:21Z

CORE-333 Parsing: OCR metrics

jacopo-chevallard self-assigned this Jan 22, 2025
