Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add a tests file to document and demonstrate a few conversions? #1

Open
DivineDominion opened this issue Jun 15, 2020 · 5 comments
Open
Assignees

Comments

@DivineDominion
Copy link

This looks great! Glancing at the code, I saw that you parse the input document line-by-line based on Regex at some point, and the script is fairly long (longer than a simple global String replacement! :)), so I guess it contains a couple of syntax checks. That'd be great.

My suggestion is to add a tests file, or maybe even embed the tests if the script is launched without any arguments or a --run-tests flag, or something. I first saw this in http://alanwsmith.com/embedding-a-test-suite-in-a-single-file-ruby-app-part-1 and found that to be a neat pattern for simple CLI scripts where I don't want to create a directory with the script and the tests separately, but still want to document how it behaves.

Anyway, I'd be mostly interested in seeing how

foo `[[bar]` baz 

is handled, and GitHub Flavored Markdown fenced code blocks :)

@balaji-dutt balaji-dutt self-assigned this Jun 15, 2020
@balaji-dutt
Copy link
Owner

balaji-dutt commented Jun 15, 2020

Hello @DivineDominion! Thank you for the suggestion on the ---run-tests option! I actually do have some test files that are not part of the repo but I used when developing the script - mostly by visually comparing the input & output. As someone who is not a full-time programmer I will need to first get my head around how to do write unit tests but will look to include them.

I saw that you parse the input document line-by-line based on Regex at some point, and the script is fairly long

The script is fairly long since a good 46% of the code is to parse a config file and set various parameters 😆

Anyway, I'd be mostly interested in seeing how

foo [[bar]] baz

I assume you meant [[bar]] above (since your comment had a missing closing square bracket)? Is your expectation that the script will process such references or that that the script should skip such references. Currently it will process such references but I have a quick regex update that would make it skip links inside code blocks.

@DivineDominion
Copy link
Author

Oops, that was sloppy of me, the missing bracket was not intentional -- I meant a test for wiki links inside inline code in general, sorry! Similarly, wiki links inside code blocks should be skipped.

Unit tests shouldn't be more complicated than the scripts themselves. It boils down to writing a sample string with Markdown, then a string with the expected output, then pass the Markdown to your converter function and compare results using assertEqual.

A testing primer: https://docs.python.org/2/library/unittest.html

Example from there:

import unittest

class TestStringMethods(unittest.TestCase):

    def test_upper(self):
        self.assertEqual('foo'.upper(), 'FOO')

    def test_isupper(self):
        self.assertTrue('FOO'.isupper())
        self.assertFalse('Foo'.isupper())

    def test_split(self):
        s = 'hello world'
        self.assertEqual(s.split(), ['hello', 'world'])
        # check that s.split fails when the separator is not a string
        with self.assertRaises(TypeError):
            s.split(2)

if __name__ == '__main__':
    unittest.main()

You would run unittest.main() if the --runtests flag is present or so.

To get your code tested, I think you'll want to refactor you modify_links function (https://github.com/balaji-dutt/zettel-link-rewriter/blob/master/zettel-link-rewriter.py#L124) so you can make assertions based on strings, not file objects, like so maybe (my Python isn't so great :D):

def modify_links(file_obj):
    logging.debug("Going to open file %s for processing now.", file)
    try:
        with open(file_obj, encoding="utf8") as infile:
            contents = infile.read()
            logging.debug("Finished processing file %s", file)
            return modify_string(contents)
    except EnvironmentError:
        logging.exception("Unable to open file %s for reading", file)
        return ""

def modify_string(str):
    linelist = []
    for line in iter(foo.splitlines()):
        linelist.append(re.sub(r"(\[\[)((?<=\[\[).*(?=\]\]))(\]\])(?!\()", r"[\2](\2.md)", line))
        # Finds  references that are in style [[foo]] only by excluding links in style [[foo]](bar).
        # Capture group $2 returns just foo
        linelist_final = [re.sub(r"(\[\[)((?<=\[\[)\d+(?=\]\]))(\]\])(\()((?!=\().*(?=\)))(\))",
                                 r"[\2](\2 \5.md)", line) for line in linelist]
        # Finds only references in style [[foo]](bar). Capture group $2 returns foo and capture group $5
        # returns bar
    return "\n".join(linelist_final)

@balaji-dutt
Copy link
Owner

@DivineDominion thank you for the detailed feedback. I guess I was bit too confident earlier - skipping wikilinks inside inline code was pretty easy. Skipping wikilinks inside code blocks on the other hand is proving to be much more difficult. I will continue to work on it.

@DivineDominion
Copy link
Author

Dang, I was afraid of this :) For fenced code blocks, I think a little boolean flag can do the trick for most cases when you consume the Markdown line by line: When a line starts with three backticks, turn "isInCodeBlock = true", and ignore the remaining lines until you get to another set of backticks again.

For classic code blocks with 4 spaces at the front it gets trickier, depending on whether or not the preceding line is a list item, e.g. this should not be treated as code:

- This list item
    is continued here

Also this is not a code block:

- First item
    - Second item

But this is:

    - Looks like a list but isn't

You can wiggle your way towards higher accuracy with a couple more checks, though, to get it into a workable condition.

The problem is that regex are context free, while Markdown is not a context free grammar: the same thing (4 spaces at the beginning of a line) mean different things in different contexts. That's what makes Markdown so hard :)

(If you really think you want to do it accurately, maybe you can find a Markdown Python library that parses the Markdown and provides you with an abstract syntax tree that you can consume and where you can ignore all code blocks and only apply replacements for the remainder.)

@balaji-dutt balaji-dutt linked a pull request Jun 20, 2020 that will close this issue
@balaji-dutt balaji-dutt removed a link to a pull request Jun 20, 2020
@balaji-dutt
Copy link
Owner

balaji-dutt commented Jun 20, 2020

Hi @DivineDominion I've updated the script with a different regex engine/syntax which I believe correctly handles all recognized Markdown code syntax.

Given the following source/input.md file:

input.md

The script will generate dest\input.md file:

output.md

Note that:

  • Wikilinks in inline code blocks are ignored
  • Wiklinks in fenced code blocks are ignored
    • Code indented with 4 spaces inside code blocks are ignored as well.
  • Wikilinks in code indented with 4 spaces or a single tab are ignored.

The script will process wikilinks in :

  • [[wiklink]] format
  • [[wiki]] (link) format
  • [[wiki]](link) format

But it will fail if the wiklink has a new line in the middle. I believe this is expected behavior for markdown file. Please try it out and let me know what you think!

The tests file will eventually be added as well 😃

EDIT I have discovered that there is one wikilink scenario the regex does not handle - wikilinks that are indented by 4 or more spaces in a list. For example:

- First item
  - Second item
    - [[000000000001]]
      - [[000000000001]]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants