Fg-selective-arabic.bin
“Arabic” is ambiguous: Modern Standard Arabic (MSA), Egyptian, Levantine, Gulf, or Maghrebi. The file’s usefulness depends on which variety it encodes. Most .bin models are trained on MSA unless specified.
Binary formats are:
Common .bin creators:
While traditional Tesseract models used .traineddata files, the .bin format often points to: Fg-selective-arabic.bin
import pynini
lexicon = pynini.Far("arabic_lexicon.far", mode="r")
lm = pynini.Fst.read("kenlm_5gram.bin")
# Compose and prune
selective = pynini.compose(lexicon, lm)
selective = pynini.prune(selective, epsilon=2.5)
selective.write("fg-selective-arabic.bin")
This yields exactly the kind of selective, binary FST that matches the filename. Common
