HTML parsing issue in partition_html #3856

Open
meetFarmanUllah opened this issue Jan 2, 2025 · 0 comments
Labels
bug Something isn't working


Describe the bug
I am trying to parse Markdown files for chunking. I first tried partition_md, but because of several open issues with it I could not parse my file directly, so I rendered the Markdown to HTML with markdown-it and then used partition_html. The problem is that a strong tag inside a paragraph tag is classified as a Title by partition_html, which breaks chunking with chunk_by_title.
To Reproduce
```python
import json

from unstructured.partition.html import partition_html

# A <strong> tag inside a <p> tag, as described above
text = "<p><strong>Example:</strong></p>"

elements = partition_html(text=text)
element_dict = [el.to_dict() for el in elements]
print(json.dumps(element_dict, indent=2))
```

Expected behavior
The element should not be parsed as a Title; it should be parsed as NarrativeText.
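
Until the classification changes, one possible stopgap is to relabel the affected entries after partitioning. The sketch below works on the plain dicts produced by `el.to_dict()` and relies only on their `type` and `text` keys. The helper name `demote_label_titles` and the "ends with a colon" rule are assumptions made for illustration, not part of unstructured, and the rule will also catch genuine headings that end in a colon.

```python
from typing import Any


def demote_label_titles(element_dicts: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return copies of the dicts, relabeling Title entries that end in ':'.

    Hypothetical heuristic: text such as "Example:" is treated as an inline
    label rather than a section heading.
    """
    fixed = []
    for el in element_dicts:
        el = dict(el)  # shallow copy so the input list is left unchanged
        text = str(el.get("text", "")).strip()
        if el.get("type") == "Title" and text.endswith(":"):
            el["type"] = "NarrativeText"
        fixed.append(el)
    return fixed


# Stand-in data shaped like the output of el.to_dict(); real entries also
# carry "element_id" and "metadata", which are copied through untouched.
sample = [
    {"type": "Title", "text": "Example:"},
    {"type": "Title", "text": "Installation"},
]
result = demote_label_titles(sample)
```

This only changes the serialized dicts. To chunk afterwards, they would need to be converted back into element objects first, which is not shown here.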

Screenshots
[image: code]
[image: output]

Environment Info
Name: unstructured
Version: 0.16.11
Python 3.11.9
