IIIT-H Launches Patram-7B, India’s First Vision-Language Model for Document Understanding
The 7-billion-parameter model is trained to interpret scanned and photographed documents using natural language instructions.

Researchers from the International Institute of Information Technology, Hyderabad (IIIT-H) have unveiled Patram-7B-Instruct, India’s first indigenous vision-language foundation model tailored for document understanding.
Despite its relatively compact size, Patram has demonstrated performance on par with larger global models such as DeepSeek-VL2 across benchmarks including DocVQA, VisualMRC, and the custom-built Patram-Bench, which reflects Indian document contexts.
“With Patram, we’ve built a model that understands the unique structure and diversity of Indian documents. This is just the beginning of what India can achieve in vision-language AI,” said Dr. Ravi Kiran Sarvadevabhatla, associate professor and lead researcher at IIIT-H.
The model, built in just five months by IIIT-H alumni and student interns with support from TiH-IoT at IIT Bombay, has been open-sourced on Hugging Face and IndiaAI’s AIKosh platform.
Launched by Union Minister Jitendra Singh on June 2 at the BharatGen National Summit in New Delhi, Patram is part of the BharatGen initiative, funded by the Department of Science and Technology (DST), to develop multimodal AI models.
“Patram marks a significant step as India designs state-of-the-art foundational models. With this launch, we integrate language available in all forms: as text, as speech, and as images,” Prof. P. J. Narayanan, Director, IIIT Hyderabad, said.