tree-sitter-rtf

An RTF parser built using Tree-Sitter. Tree-Sitter is a parser generator tool: it takes a language grammar definition and generates the C code that implements a parser for that grammar.

Useful scripts

This project uses yarn and declares some scripts you can use for development purposes:

  • yarn generate: Regenerates the parser implementation from the grammar definition.
  • yarn test: Runs all the tests in the corpus folder.
  • yarn test:debug: Runs all the tests with debug mode enabled.
  • yarn lint: Runs ESLint and Prettier checks on the grammar file.
  • yarn format: Formats the grammar file using Prettier.

Parsers generation

Using the information in the grammar.js file, Tree-Sitter generates, among other files, src/parser.c and src/tree_sitter/parser.h. These two files contain the whole language definition, exposed as the tree_sitter_rtf function. Used together with the Tree-Sitter C API, that language definition provides a complete RTF parser.

You can generate a Tree-Sitter parser at any time by doing:

yarn generate

A Tree-Sitter grammar is defined as a JavaScript file with access to some built-in functions. If you want to know more about how it works, you can take a look at:

Testing

The result of a Tree-Sitter parse is a concrete syntax tree, which can be represented as an S-expression. Tree-Sitter tests are built around that concept: they are plain text files in the corpus folder, each containing a sequence of input texts and the S-expression of the resulting tree. You can run the Tree-Sitter tests by doing:

yarn test
yarn test:debug # In case you need more info when you get a test failing

When running the above command, Tree-Sitter will grab all the text files in the corpus folder, feed the parser with every input sequence and then compare the result against the expected value.

As an example, consider the test below.

The first block, up to the ------ separator, contains the test name between two ====== separator lines, followed by the raw RTF content (the RTF header and body).

After the ------ separator you will find a representation of the parsed document.

=============================
Minimal RTF document
=============================
{\rtf1\ansi\ansicpg1252\cocoartf2568
\cocoatextscaling1\cocoaplatform1{\fonttbl\f0\fnil\fcharset0 Futura-Bold;}
{\colortbl;\red255\green255\blue255;\red125\green194\blue91;}

\f0\fs24 \cf0 Pedro}
----------------------------
(document
  (fonttbl
    (fontname))
  (colortbl
    (colorvalue
      (staticNumberLiteral)
      (staticNumberLiteral)
      (staticNumberLiteral))
    (colorvalue
      (staticNumberLiteral)
      (staticNumberLiteral)
      (staticNumberLiteral)))
  (textUnit
    (fontIndex)
    (fontSize)
    (textUnitContent)))
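
The corpus layout described above can be sketched in a few lines of JavaScript. This is only an illustration of the file format, not how Tree-Sitter itself reads corpus files: parseCorpus and normalize are hypothetical helper names that do not exist in Tree-Sitter or in this repository, and the sample test inside the snippet is made up for the example rather than taken from real parser output.

```javascript
// Hypothetical helper: collapse whitespace so that two S-expressions can be
// compared regardless of indentation and line breaks.
function normalize(sexp) {
  return sexp.trim().replace(/\s+/g, " ");
}

// Hypothetical helper: split the text of a corpus file into test cases.
// Assumes the layout shown above: a name between two lines of "=" characters,
// then the input, a line of "-" characters, and the expected S-expression.
function parseCorpus(text) {
  const lines = text.split(/\r?\n/);
  const isHeaderRule = (line) => /^={3,}\s*$/.test(line);
  const isDivider = (line) => /^-{3,}\s*$/.test(line);
  const tests = [];
  let i = 0;

  while (i < lines.length) {
    // A test starts with three lines: rule, name, rule.
    const startsTest =
      isHeaderRule(lines[i]) &&
      i + 2 < lines.length &&
      isHeaderRule(lines[i + 2]);
    if (!startsTest) {
      i += 1;
      continue;
    }
    const name = lines[i + 1].trim();
    i += 3;

    // Everything up to the divider is the raw input.
    const input = [];
    while (i < lines.length && !isDivider(lines[i])) input.push(lines[i++]);
    i += 1; // skip the divider itself

    // Everything up to the next header rule (or end of file) is the expected tree.
    const expected = [];
    while (i < lines.length && !isHeaderRule(lines[i])) expected.push(lines[i++]);

    tests.push({
      name,
      input: input.join("\n"),
      expected: normalize(expected.join("\n")),
    });
  }
  return tests;
}

// Illustrative sample only; the expected tree is not real parser output.
const sample = [
  "=====",
  "Sample test",
  "=====",
  "{\\rtf1 Pedro}",
  "-----",
  "(document",
  "  (textUnit",
  "    (textUnitContent)))",
].join("\n");

const cases = parseCorpus(sample);
console.log(cases[0].name);
console.log(cases[0].expected);
```

Normalizing whitespace before comparing means the indentation of the expected tree in a corpus file does not affect the comparison in this sketch.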

Specs

When developing the RTF parser you may need to use some specifications as a reference. The most interesting ones can be found here:

Utils

When working on the parser you may need to create some raw RTF content. To do this you can use the TextEdit app on your Mac. Open the app and create a document. Edit the content as you wish and then save it. You can use example.rtf as the name for simplicity. Once you've saved the document, open it in another editor, such as Visual Studio Code. There you'll be able to read the raw RTF representation of the content you created. Something like this:

{\rtf1\ansi\ansicpg1252\cocoartf2636
\cocoatextscaling0\cocoaplatform0{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\paperw11900\paperh16840\margl1440\margr1440\vieww11520\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0

\f0\fs24 \cf0 Hello RTF world!}

If you modify the content using different formatting options and inspect the changes in Visual Studio Code, you'll notice how the raw RTF content changes to represent the new format. You can use this to create new tests or to validate your implementation against different style configurations.
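
When comparing two saved versions of a document, it can help to list which control words appeared. The snippet below is a rough sketch of that idea, assuming a control word is a backslash followed by letters and an optional numeric parameter. controlWords and addedControlWords are hypothetical helpers that are not part of this repository, and the two sample strings are illustrative, not real TextEdit output.

```javascript
// Hypothetical helper: list the control words found in raw RTF text.
// The second alternative in the pattern consumes control symbols such as
// "\\", "\{" or "\*" so that an escaped backslash is not read as the start
// of a control word.
function controlWords(rtf) {
  const pattern = /\\(?:([a-zA-Z]+)(-?\d+)?|[^a-zA-Z])/g;
  return [...rtf.matchAll(pattern)]
    .filter((match) => match[1] !== undefined)
    .map(([, word, param]) => ({
      word,
      param: param === undefined ? null : Number(param),
    }));
}

// Hypothetical helper: control words present in `after` but not in `before`.
function addedControlWords(before, after) {
  const format = ({ word, param }) => `\\${word}${param ?? ""}`;
  const seen = new Set(controlWords(before).map(format));
  return controlWords(after)
    .map(format)
    .filter((w) => !seen.has(w));
}

// Illustrative strings only.
const before = "{\\rtf1\\ansi \\f0\\fs24 \\cf0 Hello}";
const after = "{\\rtf1\\ansi \\f0\\b\\fs24 \\cf0 Hello}";
console.log(addedControlWords(before, after));
```

This only scans the text with a regular expression and does not understand groups or destinations, so it is a quick inspection aid and not a substitute for the parser.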