Learning Words
Varnam Dictionary
Varnam uses a single dictionary for a language. This means there can be multiple schemes for the same language but the words for all of them will be common.
The storage location of the dictionary is:
- GNU/Linux:
HOME_DIRECTORY/.local/share/varnam/learnings
- Mac:
HOME_DIRECTORY/.local/share/varnam/learnings
Learn A Word
Do this in a terminal:
varnamcli -s ml -learn വാക്കിവിടെ
The -s ml
mentions the scheme ID. This word will then be added to the Malayalam dictionary.
This also will add the word to Malayalam dictionary :
varnamcli -s ml-inscript -learn വാക്കിവിടെ
Because both scheme ml
and ml-inscript
is for the Malayalam language.
Learn From Files
Varnam has the ability to learn words from any text file.
Ordinary File
Varnam can import Indic language words from a file :
varnamcli -s ml -learn-from-file filename.txt
filename.txt
can be any kind of file, an HTML web page or any digital file where there are words to learn. Varnam find words, calculate the frequency of that word in the file and learns from it. The frequency determines the confidence of that word.
That is, if the word “ഒരിക്കൽ” (orikkal) is found in the file 100 times and “ഒരിടത്ത്” (oriTathth) is found 50 times, then after learning, varnamcli -s ml -t ori
gives :
ഒരിക്കൽ
ഒരിടത്ത്
ഒരി
Frequency File
A frequency file looks like this (<word> <frequency>
) :
മലയാളം 120
ഒരിക്കൽ 100
ഒരിടത്ത് 50
This file can be imported to Varnam with the same command :
varnamcli -s ml -learn-from-file word-frequency.txt
What’s different here is that the frequency value is mentioned. Varnam will learn words and use each frequency as the confidence value.
Some places to find such frequency files or to make frequenct files on your own:
- https://github.com/AI4Bharat/indicnlp_corpus#text-corpora
- https://github.com/AI4Bharat/indicnlp_catalog#monolingual-corpus
IME Learning
If you use Varnam IBus Engine, there is in-built learning while you type. See this page.
Training Words
You can make Varnam use a specific pattern for words.
Train A Word
To train a single word:
varnamcli -s ml -train firefox=ഫയർഫോക്സ്
Now if we give the input “firefox” for transliteratrion:
varnamcli -s ml firefox // ഫയർഫോക്സ്
Train From File
Varnam can train patterns from a file like this:
academy അക്കാഡമി
access ആക്സസ്
accident ആക്സിഡന്റ്
accord അക്കോർഡ്
account അക്കൗണ്ട്
Do this to train each pattern from the file:
varnamcli -s ml -train-from-file my-file.txt
Export Learnings
You can export your local Varnam Learnings data to use on another system or share with others.
Simply run this to do an export:
varnamcli -s ml -export my-export
The export file will have the extension .VLF
(Varnam Learnings File).
Depending on the number of words in your dictionary, the number of files generated will vary. Example:
- my-export-1.vlf
- my-export-2.vlf
- my-export-3.vlf
By default, the maximum number of words in a file will be 30,000. The words are exported according to descending order of words.
The export format is JSON. Here’s a tidied-up sample:
{
"words": [
{
"w": "എന്നു", // Word
"c": 210, // Weight/Confidence
"l": 1631129802 // Learned On
},
{
"w": "സമാനപദങ്ങൾ", // Word
"c": 25, // Weight/Confidence
"l": 1631129802 // Learned On
},
{"w": "അക്കാഡമി", "c": 215, "l": 1631129802},
{"w": "ആക്സസ്", "c": 219, "l": 1631129802},
...
],
"patterns": [
{"p":"academy", "w":"അക്കാഡമി"},
{"p":"access", "w":"ആക്സസ്"},
...
]
}
Import Learnings
You can import any .vlf
file with:
varnamcli -s ml -import my-export-1.vlf
The above will import the one file my-export-1.vlf
.
To import all the .vlf
files in the folder, do this:
varnamcli -s ml -import "*.vlf"
The quotes “ are important above. Without it the wildcard file matching won’t work.