
Web Scraping and Language Detector

I read the CSV file with pandas, skipping the bad lines and using the ";" separator because of my PC's configuration. After that I created a for loop that iterates over each of the URLs in the dataframe "df" that holds the CSV contents.
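A minimal sketch of that loading step, assuming a file name like "sites.csv" and a column called "URL"; both names are illustrative, not necessarily the ones used in the repository:

```python
import pandas as pd

# Read the CSV with ";" as the separator and skip malformed rows instead of failing.
# (On pandas older than 1.3 the equivalent option is error_bad_lines=False.)
df = pd.read_csv("sites.csv", sep=";", on_bad_lines="skip")

# Loop over every URL in the dataframe, as described above.
for url in df["URL"]:
    print(url)
```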

It uses a try-except-finally block to handle four exception types from the urllib, urllib3 and ssl modules, for those pages that could not be opened for different reasons. For the pages that can be opened, I used urlopen and BeautifulSoup for the web scraping, and then with soup.get_text() I got the text of the page, which I cleaned with a helper function that joins the text back together without the line breaks and repeated spaces.
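A hedged sketch of that fetch-and-clean step; the specific exception classes and the helper names below are assumptions, since the exact ones in the script are not shown here:

```python
import ssl
from http.client import RemoteDisconnected
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

from bs4 import BeautifulSoup


def clean_text(text):
    # split() breaks on any whitespace (line breaks included), and the join
    # collapses the repeated spaces into single ones.
    return " ".join(text.split())


def fetch_page_text(url):
    try:
        html = urlopen(url, timeout=10).read()
        soup = BeautifulSoup(html, "html.parser")
        return clean_text(soup.get_text())
    except (HTTPError, URLError, ssl.SSLError, RemoteDisconnected):
        # Pages that cannot be opened are reported as having no text.
        return None
    finally:
        pass  # mirrors the try-except-finally shape described above
```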

In the list "IdentifiedLanguage" I saved the PASS or FAIL result for each page, and later, on line 51 of the script, I insert that column into the dataframe, so the user can export "df" with the requested results.
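Putting the pieces together, a sketch of how the PASS/FAIL results could be collected and written back as a column; the criterion shown (a language could be detected for the page) is an assumption, and `fetch_page_text` / `detect_language` are the illustrative helpers sketched above and below:

```python
IdentifiedLanguage = []

for url in df["URL"]:
    text = fetch_page_text(url)
    language = detect_language(text) if text else None
    IdentifiedLanguage.append("PASS" if language else "FAIL")

# Insert the result column so the dataframe can be exported with it.
df["IdentifiedLanguage"] = IdentifiedLanguage
df.to_csv("results.csv", sep=";", index=False)
```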

I configured the Dockerfile with the installation of Python, its modules, the requirements of the different packages, and the main files for deployment on Railway, but for some reason Railway did not trust my GitHub repository, so I could not build and deploy it.

I created a function that uses the langdetect module to detect the language of the visited page.
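A minimal sketch of such a helper, assuming it simply wraps langdetect.detect and returns None when detection fails:

```python
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException


def detect_language(text):
    try:
        # Returns an ISO 639-1 code such as "en" or "es".
        return detect(text)
    except LangDetectException:
        # Raised when the text is too short or ambiguous to classify.
        return None
```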

I committed the latest version of this program to GitHub, but I could not get to the image and bar assessment.
