question: how infile handler works? #645

lebensterben · 2020-08-29T19:58:43Z

The documentation says that autospec can accept a directory of URLs as the argument to --infile, and that this case is handled by the infile_reader function:

autospec/autospec/infile_handler.py

Lines 130 to 140 in b110b31

    
    def infile_reader(indata, name):
        """
        Parse an infile.
        The infile parser can take 3 different inputs:
          A url to a file
          A directory with multiple files or urls
          A path or filename
        Each file in the directory should scraped to the same dictionary instance.
        """

But I cannot understand how this could work.

infile_reader calls file_handler, which in turn calls check_url_content:

autospec/autospec/infile_handler.py

Lines 146 to 147 in b110b31

    
    for f in sorted(files, key=sort_files):
        output_dict = file_handler(os.path.join(indata, f), output_dict)

autospec/autospec/infile_handler.py

Lines 112 to 114 in b110b31

    
    if not os.path.isfile(indata):
        # check that input is plain or raw text and not html
        indata = check_url_content(indata)

check_url_content sends a HEAD request to the URL, which in this case is the file name of an infile in the given directory.

autospec/autospec/infile_handler.py

Line 64 in b110b31

    if "text/html" in requests.head(url).headers['content-type']:

The problem is that a valid HTTP or HTTPS URL must contain the scheme http:// or https://, and / is a reserved character on Linux file systems: it is the path separator. That is, / cannot appear in a file name, and therefore cannot appear in a "URL" that is taken from a file name and passed to check_url_content.

The implication is this: suppose I do have a directory of infiles whose file names are some kind of "URLs". Those names cannot be valid HTTP(S) URLs, so they won't be handled correctly.
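To make the point concrete, here is a minimal sketch (not autospec code) that mimics the os.path.join(indata, f) step on a temporary directory and checks whether any resulting string could be an HTTP(S) URL. The helpers looks_like_http_url and joined_entries are hypothetical names introduced only for this illustration, and the example assumes a POSIX file system where ":" is allowed in file names.

```python
import os
import tempfile
from urllib.parse import urlparse


def looks_like_http_url(candidate):
    """Hypothetical helper: True if candidate has an http(s) scheme and a host."""
    parsed = urlparse(candidate)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def joined_entries(directory):
    """Mimic the os.path.join(indata, f) step for every entry in directory."""
    return [os.path.join(directory, f) for f in sorted(os.listdir(directory))]


results = {}
with tempfile.TemporaryDirectory() as tmp:
    # "/" cannot be part of a file name, so the closest thing to a
    # URL-named file has to drop the "//" after the scheme.
    for name in ("https:example.com", "example.com"):
        open(os.path.join(tmp, name), "w").close()

    for path in joined_entries(tmp):
        results[os.path.basename(path)] = looks_like_http_url(path)

# A real URL passes the check; none of the directory entries do.
print(looks_like_http_url("https://example.com/infile.txt"))
print(results)
```

Under these assumptions, neither the bare file name nor the joined path can carry the `//` that an HTTP(S) URL requires, which is why a "directory of URLs" cannot be represented by file names alone.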

I think this Python module may not be completely or correctly implemented. Correct me if I'm wrong.

P.S. The commit message of aef3399 reads:

The --infile argument now allows a url, file, or directory of files to be passed as the input.

So it was probably never intended to deal with a directory of URLs.

phmccarty · 2020-08-31T18:06:38Z

@lebensterben This is a feature we started to implement for autospec, but ultimately we never used it. So if you see any problems with the docs or code relevant to the feature, that is why.

phmccarty · 2020-09-01T23:56:47Z

Resolved with #647

lebensterben mentioned this issue Sep 1, 2020

Removed infile features #647

Merged

phmccarty pushed a commit that referenced this issue Sep 1, 2020

Removed infile features

9f6870c

#645 #646

phmccarty closed this as completed Sep 1, 2020
