refactor: Refactor markdown_to_tups method to better handle multi-lev… #17508

Merged: 5 commits, Jan 23, 2025

Conversation

minglu7
Contributor

@minglu7 minglu7 commented Jan 14, 2025

…el headers and code blocks

Description

  • Introduced a dictionary (headers) to track multi-level headers and ensure proper hierarchical structure.
  • Improved handling of code blocks by skipping header parsing inside code blocks.
  • Added logic to append previous content before switching to a new header level, preserving section context.
  • Cleaned up unnecessary headers when a new header of a higher level is encountered.
  • Post-processed content to remove HTML tags and strip whitespace from headers.
  • Ensured that empty sections are handled gracefully, with a fallback placeholder for missing content.
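
The bullets above can be illustrated with a minimal sketch. This is not the merged implementation: the function name `markdown_to_tups_sketch`, the `" > "` path separator, the `(empty section)` placeholder, and the exact rule for discarding stale headers (drop every tracked header at the same or a deeper level than the new one) are assumptions made for illustration only.

```python
import re
from typing import Dict, List, Optional, Tuple

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")  # ATX-style headers: "# Title"
TAG_RE = re.compile(r"<[^>]+>")               # naive HTML tag matcher


def markdown_to_tups_sketch(text: str) -> List[Tuple[Optional[str], str]]:
    """Split markdown into (header_path, content) tuples (hypothetical sketch)."""
    tups: List[Tuple[Optional[str], str]] = []
    headers: Dict[int, str] = {}  # header level -> header text currently in scope
    buffer: List[str] = []        # content lines collected for the current section
    in_code_block = False

    def flush() -> None:
        # Post-process: remove HTML tags and surrounding whitespace.
        content = TAG_RE.sub("", "\n".join(buffer)).strip()
        if headers or content:
            path = " > ".join(headers[k] for k in sorted(headers)) or None
            # Assumed placeholder for a header that has no content of its own.
            tups.append((path, content or "(empty section)"))
        buffer.clear()

    for line in text.split("\n"):
        # A fence line toggles code-block state; it is kept as content.
        if line.lstrip().startswith("```"):
            in_code_block = not in_code_block
        match = None if in_code_block else HEADER_RE.match(line)
        if match is None:
            buffer.append(line)
            continue
        # New header: store the previous section before switching context.
        flush()
        level = len(match.group(1))
        # Assumed cleanup rule: drop headers at the same or a deeper level.
        for stale in [k for k in headers if k >= level]:
            del headers[stale]
        headers[level] = match.group(2).strip()

    flush()
    return tups
```

In this sketch, a `#` line inside a fenced code block stays part of the section content, and a `##` section nested under `# A` is reported with the path `A > B` rather than just `B`.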

Fixes # (issue)

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • Yes
  • No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No

Type of Change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Your pull-request will likely not be merged unless it is covered by some form of impactful unit testing.

  • I added new unit tests to cover this change
  • I believe this change is already covered by existing unit tests

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran make format; make lint to appease the lint gods

@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Jan 14, 2025
@minglu7
Contributor Author

minglu7 commented Jan 16, 2025

@logan-markewich Hello. Based on my own project experience, I optimized this part of the code, and I believe this refactor is very important. If possible, please review it and let me know if there are any issues. As for why the automated tests failed, I think it might be because they didn’t use my modified source code; in my local tests, all the test cases passed.

@logan-markewich
Collaborator

logan-markewich commented Jan 17, 2025

@minglu7 the tests are failing on a test that you did not modify. It's definitely using your modified code. Did you maybe forget to commit a change?

Specifically, it's failing on def test_parse_markdown_with_no_headers()

Also, linting is failing. Can you run make lint and add/fix as needed?

@minglu7
Contributor Author

minglu7 commented Jan 17, 2025

> @minglu7 the tests are failing on a test that you did not modify. It's definitely using your modified code. Did you maybe forget to commit a change?
>
> Specifically, it's failing on def test_parse_markdown_with_no_headers()
>
> Also, linting is failing. Can you run make lint and add/fix as needed?

I did as you advised. So why does the test still fail for Python versions 3.11 and 3.12? Does this matter?

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Jan 23, 2025
@logan-markewich logan-markewich enabled auto-merge (squash) January 23, 2025 21:33
@logan-markewich logan-markewich merged commit 91d2107 into run-llama:main Jan 23, 2025
10 of 11 checks passed
Labels
lgtm This PR has been approved by a maintainer size:L This PR changes 100-499 lines, ignoring generated files.
2 participants