Vāķ Translate

Model on Hugging Face

Vāķ is Shunya's open-weight translation model for Indian languages. It covers 55 languages and 2,970 any-to-any pairs, reaches a weighted average BLEU of 38.5, and handles all major Indic scripts natively. This page documents vak-translate-1.3b-ct2, the CTranslate2 build of vak-translate-1.3b for fast local inference, released under CC-BY-SA-4.0 as part of the Vāķ suite from Shunya Labs at the India AI Impact Summit 2026.

Highlights

  • 55 Indian languages across 5 language families (Indo-Aryan, Dravidian, Austroasiatic, Sino-Tibetan, Indo-European)
  • 2,970 translation pairs, any-to-any translation between all supported languages
  • 1.3B parameters, encoder-decoder architecture with 24+24 layers
  • Open weights under CC-BY-SA-4.0
  • Weighted average BLEU: 38.5 (by speaker count)
  • Covers 1.17 billion+ native speakers across every region of India
  • First open-weight translation model for many Indian languages including Bhojpuri, Rajasthani, Chhattisgarhi, and Magahi
  • CTranslate2 format for optimized CPU/GPU inference
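The page does not say how the 2,970 figure is derived, but it matches the number of ordered (source, target) pairs among 55 languages, excluding a language paired with itself. A short check of that arithmetic, using only the two language codes that appear in the quick start on this page:

```python
from itertools import permutations

NUM_LANGUAGES = 55

# Ordered pairs (source, target) with source != target.
ordered_pairs = NUM_LANGUAGES * (NUM_LANGUAGES - 1)
print(ordered_pairs)  # 2970

# The same counting on a toy list of two codes gives both directions.
codes = ["eng_Latn", "hin_Deva"]
pairs = list(permutations(codes, 2))
print(pairs)
```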

At a glance

Property | Value
Model | vak-translate-1.3b-ct2
Parameters | ~1.3B (dense), encoder-decoder (M2M-style)
Encoder / decoder layers | 24 / 24
Model dimension | 1024, 16 attention heads, FFN 8192, ReLU
Vocabulary | 256,206 tokens, SentencePiece BPE
Max input length | 512 tokens, max positions 1024
Dropout | 0.1
Languages | 55 Indian, 15+ scripts
Translation pairs | 2,970 any-to-any
Native speakers covered | 1.17 billion+
Weighted BLEU | 38.5 (by speaker count)
CT2 format | CTranslate2 (model.bin)
License | CC-BY-SA-4.0
Stack | ctranslate2 + transformers + sentencepiece + huggingface_hub

Model architecture

Field | Value
Architecture | Encoder-Decoder (M2M-style)
Parameters | ~1.3B (dense)
Encoder Layers | 24
Decoder Layers | 24
Model Dimension | 1024
Attention Heads | 16
FFN Dimension | 8192
Activation | ReLU
Vocab Size | 256,206
Tokenizer | SentencePiece BPE
Max Input Length | 512 tokens
Max Positions | 1024
Dropout | 0.1
Languages | 55 Indian
Translation Pairs | 2,970
Scripts Supported | 15+
CT2 Format | CTranslate2 (model.bin)
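As a rough sanity check, the ~1.3B figure can be approximated from the dimensions in the table. The sketch below is a back-of-the-envelope estimate, not the official count. It assumes a single embedding matrix shared between encoder, decoder, and output projection, and it ignores biases, layer norms, and positional parameters; none of these details are stated on this page.

```python
# Dimensions taken from the architecture table.
VOCAB = 256_206
D_MODEL = 1024
D_FFN = 8192
ENC_LAYERS = 24
DEC_LAYERS = 24

# Assumption: one shared embedding matrix (not confirmed by the source).
embedding = VOCAB * D_MODEL

# One attention block has Q, K, V and output projections: 4 * d_model^2 weights.
attention = 4 * D_MODEL * D_MODEL
# One feed-forward block has two projections: d_model -> d_ffn -> d_model.
feed_forward = 2 * D_MODEL * D_FFN

encoder = ENC_LAYERS * (attention + feed_forward)
# Each decoder layer adds a cross-attention block on top of self-attention.
decoder = DEC_LAYERS * (2 * attention + feed_forward)

total = embedding + encoder + decoder
print(f"{total / 1e9:.2f}B parameters (weights only)")
```

Under these assumptions the estimate lands in the same range as the listed ~1.3B.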

Supported languages

Full language list with BLEU scores (from the model card). BLEU scores are tentative, based on human evaluation (3 independent evaluations per language, 1-5 adequacy scale).

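# | Language | Speakers | BLEU | # | Language | Speakers | BLEU
1 | Hindi | 322.2M | 42 | 28 | Pahari | 3.25M | 20
2 | Bengali | 96.2M | 41 | 29 | Bhili | 3.21M | 23
3 | Marathi | 82.8M | 40 | 30 | Harauti | 2.94M | 23
4 | Telugu | 80.9M | 41 | 31 | Nepali | 2.93M | 36
5 | Tamil | 68.9M | 42 | 32 | Bagheli | 2.68M | 34
6 | Gujarati | 55.0M | 40 | 33 | Sambalpuri | 2.63M | 23
7 | Urdu | 50.7M | 41 | 34 | Dogri | 2.60M | 3
8 | Bhojpuri | 50.6M | 36 | 35 | Garhwali | 2.48M | 35
9 | Kannada | 43.5M | 40 | 36 | Nimadi | 2.31M | 26
10 | Malayalam | 34.8M | 41 | 37 | Konkani | 2.15M | 15
11 | Odia | 34.1M | 39 | 38 | Kumauni | 2.08M | 34
12 | Punjabi | 31.1M | 40 | 39 | Kurukh | 1.98M | 3
13 | Rajasthani | 25.8M | 36 | 40 | Tulu | 1.84M | 3
14 | Chhattisgarhi | 16.3M | 32 | 41 | Manipuri (Meitei) | 1.76M | 3
15 | Assamese | 14.8M | 38 | 42 | Surgujia | 1.74M | 28
16 | Maithili | 13.4M | 37 | 43 | Sindhi | 1.68M | 35
17 | Magahi | 12.7M | 35 | 44 | Bagri | 1.66M | 12
18 | Haryanvi | 9.81M | 23 | 45 | Ahirani | 1.64M | 34
19 | Khortha | 8.04M | 34 | 46 | Banjari | 1.58M | 34
20 | Marwari | 7.83M | 36 | 47 | Brajbhasha | 1.56M | 35
21 | Santali | 6.97M | 3 | 48 | Bodo | 1.46M | 3
22 | Kashmiri | 6.55M | 35 | 49 | Kangri | 1.12M | 3
23 | Bundeli | 5.63M | 35 | 50 | Garo | 1.13M | 3
24 | Mewari | 4.21M | 28 | 51 | Kachchhi | 1.03M | 5
25 | Awadhi | 3.85M | 36 | 52 | Mahasu Pahari | 1.00M | 3
26 | Wagdi | 3.39M | 35 | 53 | Sanskrit | - | 34
27 | Lambadi | 3.28M | 28 | 54 | Kodava | - | 3
55 | Indian English | 250M | 43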

Language families

Indo-Aryan | Dravidian | Austroasiatic | Sino-Tibetan | Indo-European
43 languages | 7 languages | 1 language | 3 languages | 1 language

Performance tiers

Tier | BLEU range | Count | Languages
Strong | 35-43 | 26 | Hindi, Bengali, Marathi, Telugu, Tamil, Gujarati, Urdu, Bhojpuri, Kannada, Malayalam, Odia, Punjabi, Rajasthani, Assamese, Maithili, Magahi, Marwari, Kashmiri, Bundeli, Awadhi, Wagdi, Nepali, Sindhi, Garhwali, Brajbhasha, Indian English
Good | 32-34 | 7 | Chhattisgarhi, Khortha, Bagheli, Kumauni, Ahirani, Banjari, Sanskrit
Adequate | 20-28 | 9 | Haryanvi, Mewari, Lambadi, Bhili, Harauti, Pahari, Sambalpuri, Nimadi, Surgujia
Partial | 5-15 | 3 | Konkani, Bagri, Kachchhi
Experimental | 2-4 | 10 | Dogri, Kurukh, Tulu, Manipuri, Santali, Kangri, Mahasu Pahari, Kodava, Bodo, Garo
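If an application needs to decide how much to trust output for a given language, the tier ranges above can be encoded as a simple lookup. This is an illustrative sketch: `TIERS` and `tier_for_bleu` are hypothetical names, and the ranges are copied as published, so scores that fall in the gaps between tiers (for example 30) return `None` rather than being assigned a tier by guesswork.

```python
# (tier name, lowest BLEU, highest BLEU), copied from the table above.
TIERS = [
    ("Strong", 35, 43),
    ("Good", 32, 34),
    ("Adequate", 20, 28),
    ("Partial", 5, 15),
    ("Experimental", 2, 4),
]


def tier_for_bleu(score):
    """Return the tier whose published range contains score, else None."""
    for name, low, high in TIERS:
        if low <= score <= high:
            return name
    return None


print(tier_for_bleu(42))  # Hindi's score in the language table
print(tier_for_bleu(3))   # Dogri's score in the language table
```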

Evaluation

  • Weighted Average BLEU (by speaker count): 38.5
  • BLEU scores are tentative, based on human evaluation (3 independent evaluations per language, 1-5 adequacy scale)
  • Covers 1.17 billion+ native speakers across 5 language families
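A speaker-weighted average multiplies each language's score by its speaker count and divides by the total number of speakers. The sketch below shows the formula on three rows from the language table. It illustrates the weighting only and is not expected to reproduce the 38.5 figure, since the page does not specify exactly which rows and speaker counts enter the official calculation. `weighted_bleu` is a hypothetical helper name.

```python
def weighted_bleu(rows):
    """Average BLEU weighted by speaker count.

    rows: iterable of (language, speakers_in_millions, bleu) tuples.
    """
    rows = list(rows)
    total_speakers = sum(speakers for _, speakers, _ in rows)
    weighted_sum = sum(speakers * bleu for _, speakers, bleu in rows)
    return weighted_sum / total_speakers


# Three rows copied from the language table (speakers in millions).
sample = [
    ("Hindi", 322.2, 42),
    ("Bengali", 96.2, 41),
    ("Konkani", 2.15, 15),
]
print(round(weighted_bleu(sample), 2))
```

Because Hindi and Bengali dominate the speaker counts in this sample, the low Konkani score barely moves the result, which is the general effect of weighting by speakers.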

Translate on your machine

Install a few Python packages, run the script from the Hugging Face quick start, and read the translated sentence in your terminal. No cloud API required.

What goes in and what comes out

Same example as the model card: one English sentence in, one Hindi sentence out.

What you send in
Hello, how are you?

Language: English, code eng_Latn

Response you get
नमस्ते, आप कैसे हैं?

Language: Hindi, code hin_Deva

The Hindi line is what print(translation) shows when you run the script below with this English input. Wording may vary slightly between runs.

Install

shell
pip install ctranslate2 transformers sentencepiece huggingface_hub

CTranslate2 quick start

python
import ctranslate2
from transformers import NllbTokenizer
from huggingface_hub import snapshot_download

# Download model to local cache and load
model_dir = snapshot_download("shunyalabs/vak-translate-1.3b-ct2")

tokenizer = NllbTokenizer.from_pretrained("shunyalabs/vak-translate-1.3b-ct2")

device = "cuda" if ctranslate2.get_cuda_device_count() > 0 else "cpu"
translator = ctranslate2.Translator(model_dir, device=device)

# Translate English to Hindi
src_lang = "eng_Latn"
tgt_lang = "hin_Deva"

tokenizer.src_lang = src_lang
inputs = tokenizer("Hello, how are you?")
src_tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"])

results = translator.translate_batch(
    [src_tokens],
    target_prefix=[[tgt_lang]],
    beam_size=4,
    max_decoding_length=256,
)

output_tokens = results[0].hypotheses[0]
output_ids = tokenizer.convert_tokens_to_ids(output_tokens)
translation = tokenizer.decode(output_ids, skip_special_tokens=True)
print(translation)
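The At a glance table lists a maximum input length of 512 tokens. For longer text, one option is to split at sentence boundaries and translate the pieces separately. The following is a minimal sketch and is not taken from the model card: `chunk_text` and `count_tokens` are hypothetical names, and the sentence-splitting regex is a simplification that only looks for `.`, `!`, `?`, and the danda `।`.

```python
import re


def chunk_text(text, count_tokens, max_tokens=512):
    """Greedily pack whole sentences into chunks of at most max_tokens.

    count_tokens is any callable that returns the token count of a string.
    A single sentence longer than max_tokens is kept as its own chunk and
    would still need truncation or further splitting.
    """
    sentences = [s for s in re.split(r"(?<=[.!?।])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(current)  # close the current chunk
            current = sentence      # start a new one with this sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Demo with a whitespace token counter so it runs without the model.
demo = chunk_text(
    "One two three. Four five. Six seven eight nine.",
    count_tokens=lambda s: len(s.split()),
    max_tokens=5,
)
print(demo)
```

With the quick-start tokenizer loaded, `count_tokens` could be `lambda s: len(tokenizer(s)["input_ids"])`, so the limit is measured in the same tokens the model sees.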

Translate many sentences at once

As in the model card, pass a list of English sentences to translate_batch; the script prints one Hindi line per sentence.

What you send in

The sun rises in the east.

Water is essential for life.

Education is the most powerful weapon.

Language: English, code eng_Latn

Response you get

Three Hindi lines in your terminal, one for each English sentence above.

Run the batch script below to see the exact text.

Language: Hindi, code hin_Deva

python
# Translate a batch of sentences (English → Hindi).
# Reuses `tokenizer` and `translator` from the quick start above.
texts = [
    "The sun rises in the east.",
    "Water is essential for life.",
    "Education is the most powerful weapon.",
]

tokenizer.src_lang = "eng_Latn"
all_src_tokens = [
    tokenizer.convert_ids_to_tokens(tokenizer(t)["input_ids"])
    for t in texts
]

results = translator.translate_batch(
    all_src_tokens,
    target_prefix=[["hin_Deva"]] * len(texts),
    beam_size=4,
    max_decoding_length=256,
)

# Print one translated line per input sentence, in input order
for result in results:
    ids = tokenizer.convert_tokens_to_ids(result.hypotheses[0])
    print(tokenizer.decode(ids, skip_special_tokens=True))
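Since the quick start and the batch example repeat the same tokenize, translate, decode steps, they can be wrapped in one function that takes the language pair as arguments. This is a sketch rather than part of the model card: `translate` is a hypothetical helper, and it expects the `tokenizer` and `translator` objects created in the quick start to be passed in.

```python
def translate(texts, src_lang, tgt_lang, tokenizer, translator, beam_size=4):
    """Translate a list of strings from src_lang to tgt_lang.

    tokenizer and translator are the objects built in the quick start;
    language codes use the NLLB-style form, e.g. "eng_Latn".
    """
    tokenizer.src_lang = src_lang
    batch = [
        tokenizer.convert_ids_to_tokens(tokenizer(text)["input_ids"])
        for text in texts
    ]
    results = translator.translate_batch(
        batch,
        target_prefix=[[tgt_lang]] * len(batch),  # force the target language
        beam_size=beam_size,
        max_decoding_length=256,
    )
    translations = []
    for result in results:
        ids = tokenizer.convert_tokens_to_ids(result.hypotheses[0])
        translations.append(tokenizer.decode(ids, skip_special_tokens=True))
    return translations
```

For example, `translate(["नमस्ते, आप कैसे हैं?"], "hin_Deva", "eng_Latn", tokenizer, translator)` would run the quick-start pair in the opposite direction.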

Language codes

Vāķ uses NLLB-style language codes: <iso-639-3>_<iso-15924>. Examples from the quick start:

Language | Code
English | eng_Latn
Hindi | hin_Deva
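Because a mistyped code is an easy mistake to make, it can help to check the shape of a code before passing it to the tokenizer. The sketch below validates only the `<iso-639-3>_<iso-15924>` pattern (three lowercase letters, an underscore, and a four-letter script tag with an initial capital). `parse_lang_code` is a hypothetical helper, and a well-formed code is not necessarily one the model supports.

```python
import re

# Three lowercase letters, underscore, capitalized four-letter script tag.
LANG_CODE_RE = re.compile(r"[a-z]{3}_[A-Z][a-z]{3}")


def parse_lang_code(code):
    """Split an NLLB-style code into (language, script) or raise ValueError."""
    if not LANG_CODE_RE.fullmatch(code):
        raise ValueError(f"not an NLLB-style language code: {code!r}")
    language, script = code.split("_")
    return language, script


print(parse_lang_code("eng_Latn"))
print(parse_lang_code("hin_Deva"))
```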

Use cases

Government

Citizen services in every mother tongue | Sovereign deployment, data stays in India | Healthcare, education, judiciary outreach

Developers and Startups

Zero API cost for open-weight models | Build voice-first apps for any language | Fine-tune for domain-specific use cases | 2,970 translation pairs out of the box

Researchers and Academia

Full model weights for research | Benchmark against the global state of the art | Extend to more Indian languages | Advance Indian NLP and speech science

Combining with ASR + TTS

Vāķ is the translation bridge for live 2-way conversation. The dashboard's live-translation feature (and any voice agent you build) uses it in a pipeline like this: