README
To regenerate the data from scratch:
- compile corpus and syllables (cargo build)
- download the Wikipedia dumps (see wikipedia/download.sh)
- collect the other data and run corpus on it to create the files in words/
- run ./run.sh
SCRIPTS
bn = Bengali
hi = Hindi / Devanagari / Marathi
ta = Tamil
or = Oriya
te = Telugu
gu = Gujarati
pa = Punjabi / Gurmukhi
ml = Malayalam
kn = Kannada
si = Sinhala
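As an illustration of how the two-letter codes above line up with Unicode, the following minimal sketch maps a character to one of the codes by checking which main Unicode block it falls in. The table and the function name script_of are hypothetical and not taken from this repository; the block ranges are the standard Unicode ones, and treating "script" as "main Unicode block" is an assumption.

```rust
// Hypothetical helper, not part of this repository.
// (code, first code point, last code point) for the main Unicode block
// of each script listed in SCRIPTS.
static SCRIPT_BLOCKS: [(&str, u32, u32); 10] = [
    ("bn", 0x0980, 0x09FF), // Bengali
    ("hi", 0x0900, 0x097F), // Devanagari (Hindi, Marathi)
    ("ta", 0x0B80, 0x0BFF), // Tamil
    ("or", 0x0B00, 0x0B7F), // Oriya
    ("te", 0x0C00, 0x0C7F), // Telugu
    ("gu", 0x0A80, 0x0AFF), // Gujarati
    ("pa", 0x0A00, 0x0A7F), // Gurmukhi (Punjabi)
    ("ml", 0x0D00, 0x0D7F), // Malayalam
    ("kn", 0x0C80, 0x0CFF), // Kannada
    ("si", 0x0D80, 0x0DFF), // Sinhala
];

/// Returns the script code whose Unicode block contains `c`,
/// or `None` if `c` is outside all ten blocks.
pub fn script_of(c: char) -> Option<&'static str> {
    let cp = c as u32;
    SCRIPT_BLOCKS
        .iter()
        .find(|&&(_, lo, hi)| lo <= cp && cp <= hi)
        .map(|&(code, _, _)| code)
}
```

For example, script_of('\u{0915}') (DEVANAGARI LETTER KA) returns Some("hi"), while an ASCII letter returns None.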
SOURCES
Indian translations of "Code Swaraj" by Carl Malamud.
Dictionaries:
- http://ltrc.iiit.ac.in/onlineServices/Dictionaries/Shabdanjali/Shabdanjali.tgz
- https://sanskritdocuments.org/hindi/dict/eng-hin_unic.html
- http://ltrc.iiit.ac.in/onlineServices/Dictionaries/eng-hin-utf/eng-hindi-dict-utf8.zip
Wikipedia:
- https://dumps.wikimedia.org/bnwiki/20190801/bnwiki-20190801-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/hiwiki/20190801/hiwiki-20190801-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/tawiki/20190801/tawiki-20190801-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/orwiki/20190801/orwiki-20190801-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/tewiki/20190801/tewiki-20190801-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/guwiki/20190801/guwiki-20190801-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/pawiki/20190801/pawiki-20190801-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/mlwiki/20190801/mlwiki-20190801-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/knwiki/20190801/knwiki-20190801-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/siwiki/20190801/siwiki-20190801-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/bnwiki/20181001/bnwiki-20181001-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/hiwiki/20181001/hiwiki-20181001-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/tawiki/20181001/tawiki-20181001-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/orwiki/20181001/orwiki-20181001-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/tewiki/20181001/tewiki-20181001-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/guwiki/20181001/guwiki-20181001-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/pawiki/20181001/pawiki-20181001-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/mlwiki/20181001/mlwiki-20181001-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/knwiki/20181001/knwiki-20181001-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/siwiki/20181001/siwiki-20181001-pages-articles-multistream.xml.bz2
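The Wikipedia URLs above all follow one pattern built from the script code and the dump date. The sketch below reproduces that pattern; it is not the contents of wikipedia/download.sh, and the names dump_url and all_dump_urls are hypothetical. Dumps this old may no longer be hosted at these addresses.

```rust
// Hypothetical helpers, not part of this repository.
// Script codes and dump dates exactly as they appear in the list above.
static CODES: [&str; 10] = ["bn", "hi", "ta", "or", "te", "gu", "pa", "ml", "kn", "si"];
static DATES: [&str; 2] = ["20190801", "20181001"];

/// Builds the pages-articles-multistream dump URL for one wiki and date,
/// following the pattern used by the URLs listed in this README.
pub fn dump_url(code: &str, date: &str) -> String {
    format!(
        "https://dumps.wikimedia.org/{0}wiki/{1}/{0}wiki-{1}-pages-articles-multistream.xml.bz2",
        code, date
    )
}

/// Returns all URLs in the same order as the list: every code for the
/// first date, then every code for the second date.
pub fn all_dump_urls() -> Vec<String> {
    DATES
        .iter()
        .flat_map(|date| CODES.iter().map(move |code| dump_url(code, date)))
        .collect()
}
```

Calling dump_url("bn", "20190801") yields the first URL in the list, and all_dump_urls() returns all twenty.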
Reddit:
- http://files.pushshift.io/reddit/comments/RC_2018-09.xz
- http://files.pushshift.io/reddit/submissions/RS_2018-09.xz
Hindi news:
- https://www.bhaskar.com/national/
- https://www.jagran.com/
Bengali news:
- https://www.anandabazar.com/
Gujarati news:
- https://www.divyabhaskar.co.in/
Punjabi news:
- https://jagbani.punjabkesari.in/
Tamil news:
- http://www.dinamalar.com/
- https://tamil.samayam.com/
- http://www.dinakaran.com/
Malayalam news:
- https://www.manoramaonline.com/news/latest-news.html
Kannada news:
- http://www.kannadaprabha.com/
Sinhala news:
- http://www.lankadeepa.lk/
- http://www.ada.lk/
- http://divaina.com/daily/
- http://www.rivira.lk/online/