question: how infile handler works? #645

Closed
lebensterben opened this issue Aug 29, 2020 · 2 comments

Comments

@lebensterben
Contributor

lebensterben commented Aug 29, 2020

The documentation says autospec can accept a directory of URLs as the argument to --infile, which is handled by the infile_reader function:

def infile_reader(indata, name):
    """
    Parse an infile.
    The infile parser can take 3 different inputs:
    A url to a file
    A directory with multiple files or urls
    A path or filename
    Each file in the directory should scraped to the same dictionary instance.
    """

But I cannot understand how this could work.

infile_reader calls file_handler, which in turn calls check_url_content:

for f in sorted(files, key=sort_files):
    output_dict = file_handler(os.path.join(indata, f), output_dict)

if not os.path.isfile(indata):
    # check that input is plain or raw text and not html
    indata = check_url_content(indata)

check_url_content sends a HEAD request to the url, which in this case is the file name of an infile in the given directory.

if "text/html" in requests.head(url).headers['content-type']:
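For illustration, the header test quoted above can be sketched without a network request. This is a minimal sketch, not autospec's code: is_html_content_type is a hypothetical helper, and it takes a plain dict in place of the headers object that requests returns.

```python
def is_html_content_type(headers):
    """Return True if the Content-Type header indicates an HTML page.

    `headers` is a plain dict standing in for a response's headers
    (hypothetical; requests returns a case-insensitive mapping instead).
    """
    # Use .get so a missing header does not raise KeyError.
    content_type = headers.get("content-type", "")
    return "text/html" in content_type.lower()


html_headers = {"content-type": "text/html; charset=utf-8"}
text_headers = {"content-type": "text/plain"}

print(is_html_content_type(html_headers))
print(is_html_content_type(text_headers))
```

The check only makes sense when the argument is a real URL, since the HEAD request has to reach a server first.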

The problem is that a valid HTTP or HTTPS URL must contain the scheme http:// or https://, and / is a reserved character on Linux file systems: it is the path separator, so it cannot appear in a file name. A file name therefore can never contain http:// or https://, and the path passed to check_url_content can never be a valid URL.

The implication:

  • Suppose I have a directory of infiles whose file names are meant to be URLs. Those names cannot be valid HTTP(S) URLs, so they will not be handled correctly.
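The point can be checked with a short sketch. It assumes a hypothetical directory /tmp/infiles and made-up entry names, and uses a hypothetical helper looks_like_http_url built on urllib.parse; none of these come from autospec.

```python
import os
from urllib.parse import urlparse


def looks_like_http_url(candidate):
    """Return True if candidate parses as an absolute http(s) URL."""
    parts = urlparse(candidate)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Hypothetical directory and entry names, as os.listdir() might return them.
# Neither name can contain "/", so neither can spell out "https://".
indata = "/tmp/infiles"
entries = ["example.com_recipe.bb", "https:__example.com_recipe.bb"]

# Mirrors the os.path.join(indata, f) call in the loop quoted earlier.
joined = [os.path.join(indata, f) for f in entries]
results = [looks_like_http_url(path) for path in joined]

print(results)
print(looks_like_http_url("https://example.com/recipe.bb"))
```

Every joined path starts with the directory, so it has no scheme and fails the check, while a real URL passes it.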

I think this Python module may not be completely or correctly implemented. Correct me if I'm wrong.


P.S. The commit message of aef3399 reads:

The --infile argument now allows a url, file, or directory of files to be passed as the input.

So it was probably never intended to deal with a directory of URLs.

@phmccarty
Contributor

@lebensterben This is a feature we started to implement for autospec, but ultimately we never used it. So if you see any problems with the docs or code relevant to the feature, that is why.

@phmccarty
Contributor

Resolved with #647
