Vāķ Translate


Vāķ is Shunya Labs' open-weight translation model for Indian languages. It covers 55 languages and 2,970 any-to-any pairs, reaches a weighted average BLEU of 38.5, and handles all major Indic scripts natively. This page documents vak-translate-1.3b-ct2, the CTranslate2 build of vak-translate-1.3b for fast local inference. It is released under CC-BY-SA-4.0 as part of the Vāķ suite that Shunya Labs presented at the India AI Impact Summit 2026.

Highlights

55 Indian languages across 5 language families (Indo-Aryan, Dravidian, Austroasiatic, Sino-Tibetan, Indo-European)
2,970 translation pairs, any-to-any translation between all supported languages
1.3B parameters, encoder-decoder architecture with 24+24 layers
Open weights under CC-BY-SA-4.0
Weighted average BLEU: 38.5 (by speaker count)
Covers 1.17 billion+ native speakers across every region of India
First open-weight translation model for many Indian languages including Bhojpuri, Rajasthani, Chhattisgarhi, and Magahi
CTranslate2 format for optimized CPU/GPU inference
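
The 2,970 figure follows from counting every ordered (source, target) combination of the 55 languages. The short check below is only an illustration of that arithmetic. It uses placeholder integer ids rather than real language codes.

```python
from itertools import permutations

NUM_LANGUAGES = 55

# Each language can be a source for every other language,
# so pairs are directed: A -> B and B -> A count separately.
directed_pairs = NUM_LANGUAGES * (NUM_LANGUAGES - 1)

# Same count by explicit enumeration over placeholder ids 0..54.
enumerated = sum(1 for _ in permutations(range(NUM_LANGUAGES), 2))

print(directed_pairs, enumerated)
```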

At a glance

Property	Value
Model	`vak-translate-1.3b-ct2`
Parameters	~1.3B (dense), encoder-decoder (M2M-style)
Encoder / decoder layers	24 / 24
Model dimension	1024, 16 attention heads, FFN 8192, ReLU
Vocabulary	256,206 tokens, SentencePiece BPE
Max input length	512 tokens, max positions 1024
Dropout	0.1
Languages	55 Indian, 15+ scripts
Translation pairs	2,970 any-to-any
Native speakers covered	1.17 billion+
Weighted BLEU	38.5 (by speaker count)
CT2 format	CTranslate2 (`model.bin`)
License	CC-BY-SA-4.0
Stack	ctranslate2 + transformers + sentencepiece + huggingface_hub

Model architecture

Field	Value
Architecture	Encoder-Decoder (M2M-style)
Parameters	~1.3B (dense)
Encoder Layers	24
Decoder Layers	24
Model Dimension	1024
Attention Heads	16
FFN Dimension	8192
Activation	ReLU
Vocab Size	256,206
Tokenizer	SentencePiece BPE
Max Input Length	512 tokens
Max Positions	1024
Dropout	0.1
Languages	55 Indian
Translation Pairs	2,970
Scripts Supported	15+
CT2 Format	CTranslate2 (model.bin)
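
As a sanity check on the "~1.3B" figure, the table values can be combined into a rough parameter estimate. This is a back-of-the-envelope sketch, not an official count. It assumes shared input and output embeddings and ignores biases, layer norms, and positional parameters, none of which are stated in the model card.

```python
# Rough parameter estimate from the architecture table.
# Assumptions (not from the source): shared embeddings; biases,
# layer norms, and positional parameters are ignored.
D_MODEL = 1024
FFN_DIM = 8192
VOCAB = 256_206
ENC_LAYERS = 24
DEC_LAYERS = 24

attention = 4 * D_MODEL * D_MODEL      # Q, K, V, and output projections
feed_forward = 2 * D_MODEL * FFN_DIM   # up and down projections

embedding = VOCAB * D_MODEL
encoder = ENC_LAYERS * (attention + feed_forward)
# Each decoder layer has self-attention plus cross-attention.
decoder = DEC_LAYERS * (2 * attention + feed_forward)

total = embedding + encoder + decoder
print(f"{total / 1e9:.2f}B parameters")
```

Under these assumptions the estimate comes out near 1.37B, in line with the rounded figure in the table.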

Supported languages

Full language list with BLEU scores (from the model card). BLEU scores are tentative, based on human evaluation (3 independent evaluations per language, 1-5 adequacy scale).

#	Language	Speakers	BLEU	#	Language	Speakers	BLEU
1	Hindi	322.2M	42	28	Pahari	3.25M	20
2	Bengali	96.2M	41	29	Bhili	3.21M	23
3	Marathi	82.8M	40	30	Harauti	2.94M	23
4	Telugu	80.9M	41	31	Nepali	2.93M	36
5	Tamil	68.9M	42	32	Bagheli	2.68M	34
6	Gujarati	55.0M	40	33	Sambalpuri	2.63M	23
7	Urdu	50.7M	41	34	Dogri	2.60M	3
8	Bhojpuri	50.6M	36	35	Garhwali	2.48M	35
9	Kannada	43.5M	40	36	Nimadi	2.31M	26
10	Malayalam	34.8M	41	37	Konkani	2.15M	15
11	Odia	34.1M	39	38	Kumauni	2.08M	34
12	Punjabi	31.1M	40	39	Kurukh	1.98M	3
13	Rajasthani	25.8M	36	40	Tulu	1.84M	3
14	Chhattisgarhi	16.3M	32	41	Manipuri (Meitei)	1.76M	3
15	Assamese	14.8M	38	42	Surgujia	1.74M	28
16	Maithili	13.4M	37	43	Sindhi	1.68M	35
17	Magahi	12.7M	35	44	Bagri	1.66M	12
18	Haryanvi	9.81M	23	45	Ahirani	1.64M	34
19	Khortha	8.04M	34	46	Banjari	1.58M	34
20	Marwari	7.83M	36	47	Brajbhasha	1.56M	35
21	Santali	6.97M	3	48	Bodo	1.46M	3
22	Kashmiri	6.55M	35	49	Kangri	1.12M	3
23	Bundeli	5.63M	35	50	Garo	1.13M	3
24	Mewari	4.21M	28	51	Kachchhi	1.03M	5
25	Awadhi	3.85M	36	52	Mahasu Pahari	1.00M	3
26	Wagdi	3.39M	35	53	Sanskrit	-	34
27	Lambadi	3.28M	28	54	Kodava	-	3
55	Indian English	250M	43

Language families

Indo-Aryan	Dravidian	Austroasiatic	Sino-Tibetan	Indo-European
43 languages	7 languages	1 language	3 languages	1 language

Performance tiers

Tier	BLEU range	Count	Languages
Strong	35-43	26	Hindi, Bengali, Marathi, Telugu, Tamil, Gujarati, Urdu, Bhojpuri, Kannada, Malayalam, Odia, Punjabi, Rajasthani, Assamese, Maithili, Magahi, Marwari, Kashmiri, Bundeli, Awadhi, Wagdi, Nepali, Sindhi, Garhwali, Brajbhasha, Indian English
Good	32-34	7	Chhattisgarhi, Khortha, Bagheli, Kumauni, Ahirani, Banjari, Sanskrit
Adequate	20-28	9	Haryanvi, Mewari, Lambadi, Bhili, Harauti, Pahari, Sambalpuri, Nimadi, Surgujia
Partial	5-15	3	Konkani, Bagri, Kachchhi
Experimental	2-4	10	Dogri, Kurukh, Tulu, Manipuri, Santali, Kangri, Mahasu Pahari, Kodava, Bodo, Garo
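
If you want to route or flag output by expected quality, the tier table can be expressed as a small lookup. The sketch below is illustrative: the function name `tier_for` is hypothetical, and the ranges are copied from the table as inclusive bounds. Scores that fall in the gaps between published ranges (for example 30) return `None` rather than being assigned to a tier.

```python
from typing import Optional

# (tier name, lowest BLEU, highest BLEU), inclusive, as listed in the table.
TIERS = [
    ("Strong", 35, 43),
    ("Good", 32, 34),
    ("Adequate", 20, 28),
    ("Partial", 5, 15),
    ("Experimental", 2, 4),
]


def tier_for(bleu: int) -> Optional[str]:
    """Return the tier whose published range contains the score, else None."""
    for name, low, high in TIERS:
        if low <= bleu <= high:
            return name
    return None


print(tier_for(42), tier_for(23), tier_for(3))
```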

Evaluation

Weighted Average BLEU (by speaker count): 38.5
BLEU scores are tentative, based on human evaluation (3 independent evaluations per language, 1-5 adequacy scale)
Covers 1.17 billion+ native speakers across 5 language families
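
"Weighted by speaker count" means each language's BLEU contributes in proportion to its number of speakers. The sketch below shows that calculation on three rows taken from the language table. It uses only a subset, so it illustrates the method and is not meant to reproduce the 38.5 headline figure. The function name `weighted_bleu` is hypothetical.

```python
from typing import List, Tuple

# (language, speakers in millions, BLEU): a three-row subset of the table.
ROWS: List[Tuple[str, float, float]] = [
    ("Hindi", 322.2, 42),
    ("Bengali", 96.2, 41),
    ("Marathi", 82.8, 40),
]


def weighted_bleu(rows: List[Tuple[str, float, float]]) -> float:
    """Speaker-weighted mean: sum(speakers * bleu) / sum(speakers)."""
    total_speakers = sum(speakers for _, speakers, _ in rows)
    weighted_sum = sum(speakers * bleu for _, speakers, bleu in rows)
    return weighted_sum / total_speakers


score = weighted_bleu(ROWS)
print(f"{score:.1f}")
```

Because Hindi has by far the most speakers in this subset, the result sits closer to Hindi's score than a plain average of the three would.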

Translate on your machine

Install a few Python packages, run the script from the Hugging Face quick start, and read the translated sentence in your terminal. No cloud API required.

What goes in and what comes out

Same example as the model card: one English sentence in, one Hindi sentence out.

What you send in

Hello, how are you?

Language: English, code eng_Latn

Response you get

नमस्ते, आप कैसे हैं?

Language: Hindi, code hin_Deva

The Hindi line is what print(translation) shows when you run the script below with this English input. Wording may vary slightly between runs.

Install

pip install ctranslate2 transformers sentencepiece huggingface_hub

CTranslate2 quick start

import ctranslate2
from transformers import NllbTokenizer
from huggingface_hub import snapshot_download

# Download model to local cache and load
model_dir = snapshot_download("shunyalabs/vak-translate-1.3b-ct2")

tokenizer = NllbTokenizer.from_pretrained("shunyalabs/vak-translate-1.3b-ct2")

device = "cuda" if ctranslate2.get_cuda_device_count() > 0 else "cpu"
translator = ctranslate2.Translator(model_dir, device=device)

# Translate English to Hindi
src_lang = "eng_Latn"
tgt_lang = "hin_Deva"

tokenizer.src_lang = src_lang
inputs = tokenizer("Hello, how are you?")
src_tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"])

results = translator.translate_batch(
    [src_tokens],
    target_prefix=[[tgt_lang]],
    beam_size=4,
    max_decoding_length=256,
)

output_tokens = results[0].hypotheses[0]
output_ids = tokenizer.convert_tokens_to_ids(output_tokens)
translation = tokenizer.decode(output_ids, skip_special_tokens=True)
print(translation)

Translate many sentences at once

This batch example, also from the model card, reuses the tokenizer and translator objects created in the quick start. Pass a list of English sentences; the script prints one Hindi line per sentence.

What you send in

The sun rises in the east.

Water is essential for life.

Education is the most powerful weapon.

Language: English, code eng_Latn

Response you get

Three Hindi lines in your terminal, one for each English sentence above.

Run the batch script below to see the exact text.

Language: Hindi, code hin_Deva

# Translate a batch of sentences (English → Hindi)
texts = [
    "The sun rises in the east.",
    "Water is essential for life.",
    "Education is the most powerful weapon.",
]

tokenizer.src_lang = "eng_Latn"
all_src_tokens = [
    tokenizer.convert_ids_to_tokens(tokenizer(t)["input_ids"])
    for t in texts
]

results = translator.translate_batch(
    all_src_tokens,
    target_prefix=[["hin_Deva"]] * len(texts),
    beam_size=4,
    max_decoding_length=256,
)

# Print one translated line per input sentence, in input order
for result in results:
    ids = tokenizer.convert_tokens_to_ids(result.hypotheses[0])
    print(tokenizer.decode(ids, skip_special_tokens=True))
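
The single-sentence and batch scripts share the same steps, so they can be folded into one helper. The function below is a minimal sketch, not part of the model card. The name `translate_texts` is hypothetical, and it expects a `ctranslate2.Translator` and an `NllbTokenizer` created as in the quick start. Only `eng_Latn` and `hin_Deva` are confirmed on this page, so take codes for other languages from the model card.

```python
from typing import List


def translate_texts(
    translator,
    tokenizer,
    texts: List[str],
    src_lang: str,
    tgt_lang: str,
    beam_size: int = 4,
    max_decoding_length: int = 256,
) -> List[str]:
    """Translate a list of sentences, returning one output string per input."""
    if not texts:
        return []

    # The tokenizer uses src_lang to add the source-language token.
    tokenizer.src_lang = src_lang
    source = [
        tokenizer.convert_ids_to_tokens(tokenizer(text)["input_ids"])
        for text in texts
    ]

    # The target-language code is forced as the first decoded token.
    results = translator.translate_batch(
        source,
        target_prefix=[[tgt_lang]] * len(texts),
        beam_size=beam_size,
        max_decoding_length=max_decoding_length,
    )

    translations = []
    for result in results:
        ids = tokenizer.convert_tokens_to_ids(result.hypotheses[0])
        # skip_special_tokens drops the language-code token from the output.
        translations.append(tokenizer.decode(ids, skip_special_tokens=True))
    return translations
```

With the objects from the quick start, `translate_texts(translator, tokenizer, ["Hello, how are you?"], "eng_Latn", "hin_Deva")` would return a one-element list containing the Hindi sentence.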

Language codes

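Vāķ uses NLLB-style language codes of the form <iso-639-3>_<iso-15924>: a three-letter ISO 639-3 language code, an underscore, and a four-letter ISO 15924 script code. Examples from the quick start: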

Language	Code
English	`eng_Latn`
Hindi	`hin_Deva`
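
Before passing a code to the tokenizer, it can help to check that it has the expected shape. The helper below is a small sketch with a hypothetical name, `split_language_code`. It validates only the format (three lowercase letters, underscore, capitalized four-letter script). It does not confirm that the model supports that language.

```python
import re
from typing import Tuple

# Shape check only: e.g. "hin_Deva" -> language "hin", script "Deva".
_CODE_PATTERN = re.compile(r"^([a-z]{3})_([A-Z][a-z]{3})$")


def split_language_code(code: str) -> Tuple[str, str]:
    """Split an NLLB-style code into (language, script) or raise ValueError."""
    match = _CODE_PATTERN.match(code)
    if match is None:
        raise ValueError(f"Not an NLLB-style language code: {code!r}")
    return match.group(1), match.group(2)


print(split_language_code("eng_Latn"))
print(split_language_code("hin_Deva"))
```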

Use cases

Government

Citizen services in every mother tongue
Sovereign deployment, data stays in India
Healthcare, education, judiciary outreach

Developers and Startups

Zero API cost for open-weight models
Build voice-first apps for any language
Fine-tune for domain-specific use cases
2,970 translation pairs out of the box

Researchers and Academia

Full model weights for research
Benchmark against the global state of the art
Extend to more Indian languages
Advance Indian NLP and speech science

Combining with ASR + TTS

Vāķ is the translation bridge for live two-way conversation. The dashboard's live-translation feature, and any voice agent you build, uses it in a pipeline like this: