Converts PDFs to text using pdf2txt
dbosk committed Dec 17, 2024
1 parent 77aae6b commit 3618c26
Showing 1 changed file with 20 additions and 2 deletions.
22 changes: 20 additions & 2 deletions src/canvaslms/cli/submissions.nw
@@ -905,6 +905,7 @@ def convert_to_md(attachment: canvasapi.file.File,
   <<download [[attachment]] to [[outfile]] in [[tmpdir]]>>
   content_type = getattr(attachment, "content-type")
   <<if [[content_type]] is text, just use [[outfile]] contents>>
+  <<else if [[content_type]] is PDF, convert [[outfile]] using [[pdf2txt]]>>
   <<else convert [[outfile]] using [[pypandoc]]>>
 <<let [[contents]] be the converted [[attachment]]>>=
 contents = convert_to_md(attachment, tmpdir)
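
The chunk order in this hunk reads as a fallback chain: each step either returns the converted text or falls through to the next one. A minimal sketch of that pattern in plain Python follows. The names `first_successful` and `Converter` are hypothetical and do not appear in `submissions.nw`, and the converters are passed in as ordinary callables rather than noweb chunks.

```python
from typing import Callable, Iterable, Optional

# Hypothetical type: a converter takes a file path and returns text,
# or None when it does not apply to the file.
Converter = Callable[[str], Optional[str]]


def first_successful(path: str, converters: Iterable[Converter]) -> str:
    """Return the output of the first converter that produces text."""
    for convert in converters:
        try:
            result = convert(path)
        except Exception:
            # A failing converter is skipped, so the next one gets a try.
            continue
        if result is not None:
            return result
    # Nothing worked: point the reader at the downloaded file instead.
    return f"Cannot convert this file. The file is located at\n\n  {path}\n\n"
```

Under these assumptions, the text, PDF and Pandoc steps would be three entries in `converters`, tried in that order.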
@@ -951,6 +952,8 @@ def text_to_md(content_type):
 This leaves us with the following.
 The advantage of reading the content from the file is that Python will solve
 the encoding for us.
+Instead of using an [[if]] statement, we'll go all Python and use a
+[[try-except]] block.
 <<if [[content_type]] is text, just use [[outfile]] contents>>=
 try:
   md_type = text_to_md(content_type)
@@ -961,15 +964,30 @@ except ValueError:
   pass
 @

-If the content type is not text, we use [[pypandoc]] to convert it to Markdown.
+Now we'll do the same for PDF files.
+We'll use [[pdf2txt]] to convert the PDF to text.
+However, here we'll use an if statement.
+We'll check for the content type to end with [[pdf]], that will capture also
+[[x-pdf]].
+<<else if [[content_type]] is PDF, convert [[outfile]] using [[pdf2txt]]>>=
+if content_type.endswith("pdf"):
+  try:
+    return subprocess.check_output(["pdf2txt", str(outfile)],
+                                   text=True)
+  except subprocess.CalledProcessError:
+    pass
+@
+
+Finally, as a last attempt, we use [[pypandoc]] to try to convert it to
+Markdown.
 Here we'll use Pandoc's ability to infer the file type on its own.
 This means we'll have to download the attachment as a file in a temporary
 location and let Pandoc convert the file to Markdown.
 <<else convert [[outfile]] using [[pypandoc]]>>=
 try:
   return pypandoc.convert_file(outfile, "markdown")
 except Exception as err:
-  return f"Pandoc cannot convert this file. " \
+  return f"Cannot convert this file. " \
          f"The file is located at\n\n {outfile}\n\n"
 @
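
The PDF step added in this commit can be tried outside the literate source with a standalone sketch like the one below. The helper names `is_pdf` and `pdf_to_text` and the `command` parameter are hypothetical and are not part of the commit. The sketch also catches `FileNotFoundError`, which `subprocess.check_output` raises when the executable is not on the `PATH`. The committed chunk catches only `CalledProcessError`, so the extra handler is an assumption about the desired behaviour, not something the commit does.

```python
import subprocess
from pathlib import Path
from typing import Optional


def is_pdf(content_type: str) -> bool:
    """True for any content type ending in "pdf", such as application/x-pdf."""
    return content_type.endswith("pdf")


def pdf_to_text(outfile: Path, content_type: str,
                command: str = "pdf2txt") -> Optional[str]:
    """Return the extracted text, or None so the caller can fall through."""
    if not is_pdf(content_type):
        return None
    try:
        return subprocess.check_output([command, str(outfile)], text=True)
    except subprocess.CalledProcessError:
        # The converter ran but exited with a non-zero status.
        return None
    except FileNotFoundError:
        # The converter is not installed or not on the PATH.
        return None
```

Returning `None` instead of using `pass` lets the caller decide what to try next, which here would be the Pandoc step.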
