Vāķ Translate
Vāķ is Shunya's open-weight translation model for Indian languages. It covers 55 languages and 2,970 any-to-any translation pairs, reaches a weighted average BLEU of 38.5, and handles all major Indic scripts natively. This page documents vak-translate-1.3b-ct2, the CTranslate2 build of vak-translate-1.3b for fast local inference, released under CC-BY-SA-4.0 as part of the Vāķ suite from Shunya Labs at the India AI Impact Summit 2026.
Highlights
- 55 Indian languages across 5 language families (Indo-Aryan, Dravidian, Austroasiatic, Sino-Tibetan, Indo-European)
- 2,970 translation pairs, any-to-any translation between all supported languages
- 1.3B parameters, encoder-decoder architecture with 24+24 layers
- Open weights under CC-BY-SA-4.0
- Weighted average BLEU: 38.5 (by speaker count)
- Covers 1.17 billion+ native speakers across every region of India
- First open-weight translation model for many Indian languages including Bhojpuri, Rajasthani, Chhattisgarhi, and Magahi
- CTranslate2 format for optimized CPU/GPU inference
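The pair count in the highlights follows from counting every ordered (source, target) combination of two different languages. A minimal check of that arithmetic, using placeholder language names rather than the model's real codes:

```python
from itertools import permutations

NUM_LANGUAGES = 55

# Each ordered (source, target) pair with source != target is one direction.
directed_pairs = NUM_LANGUAGES * (NUM_LANGUAGES - 1)
print(directed_pairs)  # 2970

# Same count by enumeration; "lang0".."lang54" are placeholders, not real codes.
langs = [f"lang{i}" for i in range(NUM_LANGUAGES)]
assert len(list(permutations(langs, 2))) == directed_pairs
```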
At a glance
| Property | Value |
|---|---|
| Model | vak-translate-1.3b-ct2 |
| Parameters | ~1.3B (dense), encoder-decoder (M2M-style) |
| Encoder / decoder layers | 24 / 24 |
| Model dimension | 1024, 16 attention heads, FFN 8192, ReLU |
| Vocabulary | 256,206 tokens, SentencePiece BPE |
| Max input length | 512 tokens, max positions 1024 |
| Dropout | 0.1 |
| Languages | 55 Indian, 15+ scripts |
| Translation pairs | 2,970 any-to-any |
| Native speakers covered | 1.17 billion+ |
| Weighted BLEU | 38.5 (by speaker count) |
| CT2 format | CTranslate2 (model.bin) |
| License | CC-BY-SA-4.0 |
| Stack | ctranslate2 + transformers + sentencepiece + huggingface_hub |
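The 512-token input limit matters for longer documents. One way to stay under it is to split text into sentences and pack them into chunks before translating. The sketch below is an assumption about how a caller might do this, not part of the model card: chunk_text, its naive sentence splitter, and the count_tokens callback are all hypothetical names.

```python
import re
from typing import Callable, List

MAX_INPUT_TOKENS = 512  # limit taken from the table above


def chunk_text(text: str, count_tokens: Callable[[str], int],
               limit: int = MAX_INPUT_TOKENS) -> List[str]:
    """Greedily pack sentences into chunks of at most `limit` tokens."""
    # Naive split after ., !, ? or the Devanagari danda. This is an
    # illustrative assumption, not the model's own segmenter.
    sentences = [s for s in re.split(r"(?<=[.!?।])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > limit:
            # Adding this sentence would exceed the limit: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Toy usage with whitespace "tokens" and a tiny limit.
print(chunk_text("One two. Three four. Five six.", lambda s: len(s.split()), limit=4))
```

With the real tokenizer, count_tokens could be something like `lambda s: len(tokenizer(s)["input_ids"])`. A single sentence longer than the limit is kept as its own oversized chunk, so it would still need separate handling.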
Model architecture
| Field | Value |
|---|---|
| Architecture | Encoder-Decoder (M2M-style) |
| Parameters | ~1.3B (dense) |
| Encoder Layers | 24 |
| Decoder Layers | 24 |
| Model Dimension | 1024 |
| Attention Heads | 16 |
| FFN Dimension | 8192 |
| Activation | ReLU |
| Vocab Size | 256,206 |
| Tokenizer | SentencePiece BPE |
| Max Input Length | 512 tokens |
| Max Positions | 1024 |
| Dropout | 0.1 |
| Languages | 55 Indian |
| Translation Pairs | 2,970 |
| Scripts Supported | 15+ |
| CT2 Format | CTranslate2 (model.bin) |
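The ~1.3B figure can be sanity-checked from the dimensions in the table. The estimate below is a rough sketch that counts only the attention and feed-forward weight matrices plus one embedding table. It ignores biases and layer norms and assumes the embedding matrix is shared between encoder, decoder, and output projection, which the table does not state.

```python
D_MODEL = 1024
FFN_DIM = 8192
ENC_LAYERS = 24
DEC_LAYERS = 24
VOCAB = 256_206

attention = 4 * D_MODEL * D_MODEL   # Q, K, V and output projections
ffn = 2 * D_MODEL * FFN_DIM         # up- and down-projection

encoder = ENC_LAYERS * (attention + ffn)      # self-attention + FFN
decoder = DEC_LAYERS * (2 * attention + ffn)  # self- and cross-attention + FFN
embeddings = VOCAB * D_MODEL                  # assumed shared (see lead-in)

total = encoder + decoder + embeddings
print(f"{total / 1e9:.2f}B")  # 1.37B
```

The result is in line with the rounded ~1.3B in the table; most of the weights sit in the feed-forward layers and the large vocabulary embedding.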
Supported languages
Full language list with BLEU scores (from the model card). BLEU scores are tentative, based on human evaluation (3 independent evaluations per language, 1-5 adequacy scale).
| # | Language | Speakers | BLEU | # | Language | Speakers | BLEU |
|---|---|---|---|---|---|---|---|
| 1 | Hindi | 322.2M | 42 | 28 | Pahari | 3.25M | 20 |
| 2 | Bengali | 96.2M | 41 | 29 | Bhili | 3.21M | 23 |
| 3 | Marathi | 82.8M | 40 | 30 | Harauti | 2.94M | 23 |
| 4 | Telugu | 80.9M | 41 | 31 | Nepali | 2.93M | 36 |
| 5 | Tamil | 68.9M | 42 | 32 | Bagheli | 2.68M | 34 |
| 6 | Gujarati | 55.0M | 40 | 33 | Sambalpuri | 2.63M | 23 |
| 7 | Urdu | 50.7M | 41 | 34 | Dogri | 2.60M | 3 |
| 8 | Bhojpuri | 50.6M | 36 | 35 | Garhwali | 2.48M | 35 |
| 9 | Kannada | 43.5M | 40 | 36 | Nimadi | 2.31M | 26 |
| 10 | Malayalam | 34.8M | 41 | 37 | Konkani | 2.15M | 15 |
| 11 | Odia | 34.1M | 39 | 38 | Kumauni | 2.08M | 34 |
| 12 | Punjabi | 31.1M | 40 | 39 | Kurukh | 1.98M | 3 |
| 13 | Rajasthani | 25.8M | 36 | 40 | Tulu | 1.84M | 3 |
| 14 | Chhattisgarhi | 16.3M | 32 | 41 | Manipuri (Meitei) | 1.76M | 3 |
| 15 | Assamese | 14.8M | 38 | 42 | Surgujia | 1.74M | 28 |
| 16 | Maithili | 13.4M | 37 | 43 | Sindhi | 1.68M | 35 |
| 17 | Magahi | 12.7M | 35 | 44 | Bagri | 1.66M | 12 |
| 18 | Haryanvi | 9.81M | 23 | 45 | Ahirani | 1.64M | 34 |
| 19 | Khortha | 8.04M | 34 | 46 | Banjari | 1.58M | 34 |
| 20 | Marwari | 7.83M | 36 | 47 | Brajbhasha | 1.56M | 35 |
| 21 | Santali | 6.97M | 3 | 48 | Bodo | 1.46M | 3 |
| 22 | Kashmiri | 6.55M | 35 | 49 | Kangri | 1.12M | 3 |
| 23 | Bundeli | 5.63M | 35 | 50 | Garo | 1.13M | 3 |
| 24 | Mewari | 4.21M | 28 | 51 | Kachchhi | 1.03M | 5 |
| 25 | Awadhi | 3.85M | 36 | 52 | Mahasu Pahari | 1.00M | 3 |
| 26 | Wagdi | 3.39M | 35 | 53 | Sanskrit | - | 34 |
| 27 | Lambadi | 3.28M | 28 | 54 | Kodava | - | 3 |
| 55 | Indian English | 250M | 43 | | | | |
Language families
| Indo-Aryan | Dravidian | Austroasiatic | Sino-Tibetan | Indo-European |
|---|---|---|---|---|
| 43 languages | 7 languages | 1 language | 3 languages | 1 language |
Performance tiers
| Tier | BLEU range | Count | Languages |
|---|---|---|---|
| Strong | 35-43 | 26 | Hindi, Bengali, Marathi, Telugu, Tamil, Gujarati, Urdu, Bhojpuri, Kannada, Malayalam, Odia, Punjabi, Rajasthani, Assamese, Maithili, Magahi, Marwari, Kashmiri, Bundeli, Awadhi, Wagdi, Nepali, Sindhi, Garhwali, Brajbhasha, Indian English |
| Good | 32-34 | 7 | Chhattisgarhi, Khortha, Bagheli, Kumauni, Ahirani, Banjari, Sanskrit |
| Adequate | 20-28 | 9 | Haryanvi, Mewari, Lambadi, Bhili, Harauti, Pahari, Sambalpuri, Nimadi, Surgujia |
| Partial | 5-15 | 3 | Konkani, Bagri, Kachchhi |
| Experimental | 2-4 | 10 | Dogri, Kurukh, Tulu, Manipuri, Santali, Kangri, Mahasu Pahari, Kodava, Bodo, Garo |
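If an application needs to decide at runtime how much to trust a language direction, the tier table can be encoded as a lookup. This is a minimal sketch with a hypothetical helper name, tier_for. The ranges are copied from the table, and scores that fall in the gaps between ranges return None.

```python
from typing import Optional

# (tier name, lowest BLEU, highest BLEU), copied from the table above.
TIERS = [
    ("Strong", 35, 43),
    ("Good", 32, 34),
    ("Adequate", 20, 28),
    ("Partial", 5, 15),
    ("Experimental", 2, 4),
]


def tier_for(bleu: int) -> Optional[str]:
    """Return the tier whose range contains `bleu`, or None if no range does."""
    for name, low, high in TIERS:
        if low <= bleu <= high:
            return name
    return None


print(tier_for(42), tier_for(3), tier_for(30))
```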
Evaluation
- Weighted Average BLEU (by speaker count): 38.5
- BLEU scores are tentative, based on human evaluation (3 independent evaluations per language, 1-5 adequacy scale)
- Covers 1.17 billion+ native speakers across 5 language families
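A speaker-weighted average is usually computed as the sum of speakers times score, divided by total speakers. The sketch below assumes that reading and uses only three rows from the language table, so it illustrates the formula and does not reproduce the 38.5 figure. How the two languages without a speaker count (Sanskrit, Kodava) enter the published average is not stated.

```python
def weighted_bleu(rows):
    """rows: iterable of (language, speakers_in_millions, bleu)."""
    total_speakers = sum(speakers for _, speakers, _ in rows)
    return sum(speakers * bleu for _, speakers, bleu in rows) / total_speakers


# Three rows copied from the supported-languages table.
sample = [("Hindi", 322.2, 42), ("Bengali", 96.2, 41), ("Dogri", 2.60, 3)]
print(round(weighted_bleu(sample), 1))  # 41.5 for these three rows only
```

Because the weights are speaker counts, large languages dominate the average: Dogri's score of 3 barely moves the result here.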
Translate on your machine
Install a few Python packages, run the script from the Hugging Face quick start, and read the translated sentence in your terminal. No cloud API required.
What goes in and what comes out
Same example as the model card: one English sentence in ("Hello, how are you?"), one Hindi sentence out.
The Hindi line is what print(translation) shows when you run the script below with this English input. Wording may vary slightly between runs.
Install
pip install ctranslate2 transformers sentencepiece
CTranslate2 quick start
import ctranslate2
from transformers import NllbTokenizer
from huggingface_hub import snapshot_download
# Download model to local cache and load
model_dir = snapshot_download("shunyalabs/vak-translate-1.3b-ct2")
tokenizer = NllbTokenizer.from_pretrained("shunyalabs/vak-translate-1.3b-ct2")
device = "cuda" if ctranslate2.get_cuda_device_count() > 0 else "cpu"
translator = ctranslate2.Translator(model_dir, device=device)
# Translate English to Hindi
src_lang = "eng_Latn"
tgt_lang = "hin_Deva"
tokenizer.src_lang = src_lang
inputs = tokenizer("Hello, how are you?")
src_tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"])
results = translator.translate_batch(
    [src_tokens],
    target_prefix=[[tgt_lang]],
    beam_size=4,
    max_decoding_length=256,
)
output_tokens = results[0].hypotheses[0]
output_ids = tokenizer.convert_tokens_to_ids(output_tokens)
translation = tokenizer.decode(output_ids, skip_special_tokens=True)
print(translation)
Translate many sentences at once
From the model card, pass a list of English sentences; the script prints one Hindi line per sentence.
The sun rises in the east.
Water is essential for life.
Education is the most powerful weapon.
Three Hindi lines in your terminal, one for each English sentence above.
Run the batch script below to see the exact text.
# Translate a batch of sentences (English → Hindi)
texts = [
    "The sun rises in the east.",
    "Water is essential for life.",
    "Education is the most powerful weapon.",
]
tokenizer.src_lang = "eng_Latn"
all_src_tokens = [
    tokenizer.convert_ids_to_tokens(tokenizer(t)["input_ids"])
    for t in texts
]
results = translator.translate_batch(
    all_src_tokens,
    target_prefix=[["hin_Deva"]] * len(texts),
    beam_size=4,
    max_decoding_length=256,
)
for orig, result in zip(texts, results):
    ids = tokenizer.convert_tokens_to_ids(result.hypotheses[0])
    print(tokenizer.decode(ids, skip_special_tokens=True))
Language codes
Vāķ uses NLLB-style language codes: <iso-639-3>_<iso-15924>. Examples from the quick start:
| Language | Code |
|---|---|
| English | eng_Latn |
| Hindi | hin_Deva |
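Since every direction uses the same call with different codes, the quick-start steps can be wrapped in one function. The sketch below is an illustration, not an API shipped with the model: translate and LANG_CODE are hypothetical names, and the regular expression only checks the three-letter-plus-script shape of a code, not whether the model actually supports it.

```python
import re

# Shape check only: three lowercase letters, underscore, four-letter script
# (e.g. eng_Latn, hin_Deva). It does not confirm the model supports the code.
LANG_CODE = re.compile(r"^[a-z]{3}_[A-Z][a-z]{3}$")


def translate(translator, tokenizer, text, src_lang, tgt_lang,
              beam_size=4, max_decoding_length=256):
    """Translate one sentence, mirroring the quick-start steps."""
    for code in (src_lang, tgt_lang):
        if not LANG_CODE.match(code):
            raise ValueError(f"not an NLLB-style code: {code!r}")
    tokenizer.src_lang = src_lang
    src_tokens = tokenizer.convert_ids_to_tokens(tokenizer(text)["input_ids"])
    results = translator.translate_batch(
        [src_tokens],
        target_prefix=[[tgt_lang]],  # force the target language token
        beam_size=beam_size,
        max_decoding_length=max_decoding_length,
    )
    output_ids = tokenizer.convert_tokens_to_ids(results[0].hypotheses[0])
    return tokenizer.decode(output_ids, skip_special_tokens=True)
```

With the translator and tokenizer objects from the quick start, a call would look like `translate(translator, tokenizer, "Hello, how are you?", "eng_Latn", "hin_Deva")`. Passing the objects in as arguments keeps the model loaded once and reused across calls.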
Use cases
- Citizen services in every mother tongue; sovereign deployment, data stays in India; healthcare, education, judiciary outreach
- Zero API cost for open-weight models; build voice-first apps for any language; fine-tune for domain-specific use cases; 2,970 translation pairs out of the box
- Full model weights for research; benchmark against the global state of the art; extend to more Indian languages; advance Indian NLP and speech science
Combining with ASR + TTS
Vāķ is the translation bridge for live 2-way conversation. The dashboard's live-translation feature (and any voice agent you build) uses it in a pipeline like this: