What is Namsel OCR?

Optical character recognition (OCR) is a technology that converts digital images, such as scans and photographs, into searchable text. Many commercial-grade OCR systems have been developed, but none has provided robust support for the Tibetan language. Namsel OCR was built in response to the need for Tibetan OCR, with the goal of advancing the state of the art in Tibetan-language technology. Using techniques from machine learning and computer vision, Namsel OCR currently achieves accuracy rates of over 99% on many types of machine-print documents.

History

Namsel OCR was created in 2011-2012 by Zach Rowinski, a student at the University of Virginia, in coordination with David Germano and the Tibetan and Himalayan Library. In early 2013, the project moved from Virginia to Berkeley, California, so Zach could continue his work with Prof. Kurt Keutzer in the EECS Department of the University of California, Berkeley. Prof. Keutzer had been working intermittently on the OCR of Tibetan texts since his collaborations with OCR researcher Henry Baird at Bell Labs Research in Murray Hill in the late 1980s. While Prof. Keutzer's work, together with former students Jike Chong and Fares Hedayati, had produced some innovations and publications, it had never yielded a robust OCR system. By joining forces with Prof. Keutzer, Zach has been able to continue his own work and to integrate state-of-the-art techniques into Tibetan OCR. In 2014, the Namsel OCR project began a partnership with the Tibetan Buddhist Resource Center (TBRC) in an effort to digitize and make searchable the TBRC's entire collection of machine-print books. As of mid-2014, the project has digitized over a million pages from the TBRC's collection.