
Morpho AI Solutions Launches Japanese-language Dataset Generation Service for LLMs


Tokyo, Japan – December 19, 2023 – Morpho AI Solutions, Inc. (hereinafter “Morpho AIS”), which is responsible for commercializing AI within the Morpho Group, announced today that it has begun providing an AI-OCR output service for generating Japanese-language LLM training data.

This service provides highly accurate and diverse Japanese-language text data for organizations (such as companies, government agencies, and local governments) considering creating their own LLMs, as well as for AI companies and research institutions developing LLMs.

The Lack of Diversity in the Japanese-language Training Data Used to Create LLMs

Creating high-quality Japanese-language LLMs requires the collection of diverse Japanese-language data. However, most of the Japanese-language text data that can be easily collected dates from the 1990s onward, after the rise of the Internet. Many documents from before 1990 (such as company histories, public relations magazines, public records, and meeting minutes) have yet to be digitized, so this data cannot be collected efficiently. Many organizations creating LLMs therefore find themselves unable to collect diverse Japanese-language training data and must instead use publicly available, shared datasets. This limits their ability to create high-quality LLMs.

The Importance of AI-OCR that Supports Japanese Documents

OCR is essential for digitizing saved documents, but the majority of the OCR products on the market were developed for use with billing statements, receipts, and other forms. Japanese documents have diverse layouts (using vertical writing, horizontal writing, and multiple columns) and mix several character types. This has made it difficult for commercial OCR products to extract Japanese text accurately and in the correct reading order.

The OCR output service provided by Morpho AIS is capable of high-resolution text generation, including correctly identifying text reading order, something that commercial OCR products struggle with. Organizations can therefore use their scanned image data to generate diverse and accurate Japanese-language datasets, assisting with the creation of training data for Japanese-language LLMs.

Service Contents, Features, and Track Record

Service Contents

Digitization of existing documents (company histories, public relations magazines, public records, meeting minutes, etc.) and conversion into LLM training data


1: AI-OCR that supports a wide range of documents, not just forms

– Reproduces the reading order, which is important for LLM input

– Supports roughly 7,000 characters and can read even highly difficult kanji characters


2: Can output text (in various formats) from miscellaneous documents containing images (JPEGs, PDFs, PNGs)

Track Record

This service is already being used to generate text at various organizations, including the National Diet Library.

(Tomigusuku City in Okinawa Prefecture, University of Bologna, Juntendo University, Shiga Prefectural Library, large newspaper companies, etc.)

Requests and Inquiries

A free trial is also available from this page.


FROG AI-OCR is a single package that combines the high-resolution OCR processing of NDLOCR with correction and text output functions, making it easy to perform OCR. All of its functions can be used via the cloud, enabling highly efficient confirmation and correction of output text. FROG AI-OCR uses the National Diet Library’s NDLOCR as its core engine.

About Morpho AI Solutions, Inc.

Morpho AI Solutions is a company engaged in the commercialization of AI (Artificial Intelligence). It promotes the introduction and actual operation of cutting-edge AI technologies, including AI-OCR, in the areas of social infrastructure such as government, electric power, transportation, and manufacturing.

For more information, visit or contact