An algorithm to extract keywords from any sentence using Stanford's Natural Language Processing Log-linear POS Tagger

I used Stanford's log-linear part-of-speech (POS) tagger in Java to process .xml files and extract the sentences inside the <title>.....</title> tags.
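As a rough illustration of the title-extraction step, here is a minimal sketch using only the Java standard library. It is not the repository's actual code: the class name `TitleExtractor`, the method `extractTitles`, and the sample input are hypothetical, and a regular expression is assumed instead of a full XML parser so that query files which are not strictly well-formed XML can still be read.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the repository: pulls the text out of <title> tags.
final class TitleExtractor {
    // DOTALL lets a title span several lines; the reluctant .*? stops at the first closing tag.
    private static final Pattern TITLE =
            Pattern.compile("<title>(.*?)</title>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    static List<String> extractTitles(String xml) {
        List<String> titles = new ArrayList<>();
        Matcher m = TITLE.matcher(xml);
        while (m.find()) {
            // Collapse line breaks and repeated spaces, then trim the ends.
            String title = m.group(1).replaceAll("\\s+", " ").trim();
            if (!title.isEmpty()) {
                titles.add(title);
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        // Hand-written sample input; the real query file layout may differ.
        String sample = "<top>\n<num>1</num>\n<title> Mercedes and its cars </title>\n</top>";
        System.out.println(extractTitles(sample)); // prints [Mercedes and its cars]
    }
}
```

If the input is known to be well-formed XML, the standard `javax.xml.parsers.DocumentBuilder` API would be a more robust choice than a regular expression.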

The extracted sentences were then tagged using the part-of-speech tagger, whose library is available on Stanford NLP's website. You can download the library and find information about its usage here.

In the Java code, I mainly treat nouns, adjectives and verbs as keywords, because these are the kinds of words that actually carry the meaning of a query. For example, take the sentence "Mercedes and its cars": the words of interest here are "Mercedes" and "cars", both of which are tagged as nouns. Details regarding the POS tags can be found in "POS tagging terms meanings.txt" or here.
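The selection rule described above can be sketched as follows. This is an illustrative sketch rather than the repository's code: it assumes the tagger output is a string of `word_TAG` tokens with Penn Treebank tags, and the class name `KeywordFilter` and the hand-tagged sample sentence are hypothetical. Nouns, adjectives and verbs are recognised by the tag prefixes `NN`, `JJ` and `VB`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the repository: keeps nouns, adjectives and verbs.
final class KeywordFilter {
    // Penn Treebank prefixes: NN* = nouns, JJ* = adjectives, VB* = verbs.
    static boolean isKeywordTag(String tag) {
        return tag.startsWith("NN") || tag.startsWith("JJ") || tag.startsWith("VB");
    }

    // Expects tokens of the form word_TAG separated by whitespace (an assumed format).
    static List<String> keywords(String tagged) {
        List<String> result = new ArrayList<>();
        for (String token : tagged.trim().split("\\s+")) {
            // Split on the last underscore so words containing '_' stay intact.
            int sep = token.lastIndexOf('_');
            if (sep <= 0 || sep == token.length() - 1) {
                continue; // skip empty or malformed tokens
            }
            String word = token.substring(0, sep);
            String tag = token.substring(sep + 1);
            if (isKeywordTag(tag)) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Tags written by hand for illustration, not produced by running the tagger.
        System.out.println(keywords("Mercedes_NNP and_CC its_PRP$ cars_NNS")); // prints [Mercedes, cars]
    }
}
```

Matching on tag prefixes keeps the rule short: plural and proper nouns (`NNS`, `NNP`), comparative adjectives (`JJR`) and inflected verbs (`VBD`, `VBZ`) are all covered without listing every tag.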

For the code to function, create a folder and copy all the files (except "title.txt", "query.txt", "reqfile.txt" and "POS tagging terms meanings.txt") into it, then place that folder in your Java IDE's workspace. Import the entire folder and point your IDE to the POS tagger library. The POS tagger I used here is dated 09-06-2017; you can download the latest version from the link provided above. As for the .xml input, "query.txt" is the actual query file: the code reads and processes it to create "title.txt", and then creates "reqfile.txt", the file containing the keywords.

Download basic English Stanford Tagger version 3.9.1
Download full Stanford Tagger version 3.9.1