Deep Learning for Classical Japanese Literature

Kuzushiji-MNIST can be used as a replacement for the normal MNIST dataset. The paper also applies generative modelling to a domain transfer task from unseen Kuzushiji Kanji to Modern Kanji.

Datasets

The Kuzushiji dataset is created by the National Institute of Japanese Literature (NIJL) and is curated by the Center for Open Data in the Humanities (CODH). The full Kuzushiji dataset was released in November 2016, and it now contains 3,999 character types and 403,242 characters.

The authors of this paper pre-processed characters scanned from 35 classical books printed in the 18th century and divided them into 3 datasets:

- Kuzushiji-MNIST: a drop-in replacement for the MNIST dataset (28×28)
- Kuzushiji-49: a much larger but imbalanced dataset containing 48 Hiragana characters and 1 Hiragana iteration mark (28×28)
- Kuzushiji-Kanji: an imbalanced dataset of 3,832 Kanji characters, including rare characters with very few samples (64×64)

One characteristic of Classical Japanese that is very different from Modern Japanese is Hentaigana (変体仮名). Hentaigana are Hiragana characters that have more than one written form, because they were derived from different Kanji. Therefore, one Hiragana class of Kuzushiji-MNIST and Kuzushiji-49 may have many characters mapped to it (as seen in the above image). This makes the Kuzushiji dataset more challenging than the MNIST dataset.

The high class imbalance in Kuzushiji-49 and Kuzushiji-Kanji comes from the frequency with which characters appear in real textbooks, and it is kept that way to represent the real data distribution.

- Kuzushiji-MNIST is balanced.
- Kuzushiji-49 has 49 classes with a total of 266,407 images (28×28).
- Kuzushiji-Kanji has 3,832 classes with a total of 140,426 images, ranging from 1,766 examples to only a single example per class (64×64).

Kuzushiji-Kanji is created for more experimental tasks rather than merely classification and recognition benchmarks.

Fig 7. Kuzushiji-49 Classes

Experiments

Classification Baselines for Kuzushiji-MNIST and Kuzushiji-49

Domain Transfer from Kuzushiji-Kanji to Modern Kanji

The Kuzushiji-Kanji dataset is used for domain transfer from pixel images to vector images (as opposed to previous approaches, which focus on domain transfer from pixel images to pixel images). The proposed model aims to generate the Modern Kanji version of a given Kuzushiji-Kanji input, in both pixel and stroke-based formats. In the figure below, the overall approach is presented.

The authors train two separate Convolutional Variational Autoencoders (VAEs), one on the Kuzushiji-Kanji dataset and another on a pixel version of the KanjiVG dataset rendered to 64×64 pixels for consistency. The architecture of the VAE is identical to [3]. Each dataset is compressed into its own 64-dimensional latent space, z_old and z_new respectively. The KL loss term is not optimized below a certain threshold.

Then, a Mixture Density Network (MDN) with 2 hidden layers is trained to model the density function P(z_new | z_old), approximated as a mixture of Gaussians. We can then sample a latent vector z_new given a latent vector z_old encoded from Kuzushiji-Kanji. The paper says that training two separate VAE models, one on each dataset, is much more efficient and achieves better results than training a single model end-to-end.

In the last step, a sketch-RNN decoder model is trained to generate Modern Kanji based on z_new. There are 3,600 overlapping characters between the two datasets.

- For characters that are not in the overlapping set, the sketch-RNN model is conditioned on z_new encoded from KanjiVG data to generate the stroke data, also from KanjiVG [see (1) in Fig 10].
- For characters that are in the overlapping set, the sketch-RNN model is conditioned on z_new sampled from the MDN, which is itself conditioned on z_old, to generate the stroke data, also from KanjiVG [see (2) in Fig 10].

This helps the sketch-RNN fine-tune aspects of the VAE's latent space that may not capture parts of the Modern Kanji data distribution well when trained only on pixels.

References

[1] Y. LeCun. The MNIST database of handwritten digits, 1998. http://yann.lecun.com/exdb/mnist/
[2] Center for Open Data in the Humanities. Kuzushiji dataset, 2016.
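
To make the MDN step from the domain transfer section more concrete, here is a minimal Python sketch of how a latent vector z_new could be sampled from a mixture of diagonal Gaussians, together with the negative log-likelihood that a network of this kind would typically minimize, and a thresholded KL term. It is an illustration under assumptions, not the paper's code: the function names, the diagonal covariance, the log-standard-deviation parameterization, and the toy sizes (2 dimensions instead of 64, 2 mixture components) are all hypothetical. In a real model the mixture parameters would be produced by the MDN from z_old; here they are hard-coded.

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log of the softmax mixture weights."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]


def sample_z_new(logits, means, log_stds, rng):
    """Draw one z_new from a diagonal Gaussian mixture.

    logits:   K unnormalized mixture weights
    means:    K lists of D component means
    log_stds: K lists of D log standard deviations
    (In an MDN these would all be predicted from z_old.)
    """
    weights = [math.exp(lw) for lw in log_softmax(logits)]
    # Pick a mixture component, then sample each dimension independently.
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return [rng.gauss(mu, math.exp(ls)) for mu, ls in zip(means[k], log_stds[k])]


def mixture_nll(z, logits, means, log_stds):
    """Negative log-likelihood of z under the mixture."""
    log_terms = []
    for log_w, mu_k, ls_k in zip(log_softmax(logits), means, log_stds):
        log_p = log_w
        for x, mu, ls in zip(z, mu_k, ls_k):
            log_p += -0.5 * math.log(2 * math.pi) - ls
            log_p += -0.5 * ((x - mu) / math.exp(ls)) ** 2
        log_terms.append(log_p)
    # log-sum-exp over components
    m = max(log_terms)
    return -(m + math.log(sum(math.exp(t - m) for t in log_terms)))


def kl_with_floor(kl, kl_min):
    """KL term that stops contributing gradient once it falls below kl_min.

    Assumed form of "not optimized below a certain threshold".
    """
    return max(kl, kl_min)


# Toy usage: 2 components, 2 latent dimensions (hypothetical values).
rng = random.Random(0)
logits = [0.0, 1.0]
means = [[0.0, 0.0], [1.0, -1.0]]
log_stds = [[0.0, 0.0], [-1.0, -1.0]]
z_new = sample_z_new(logits, means, log_stds, rng)
print(z_new, mixture_nll(z_new, logits, means, log_stds))
```

The sampled z_new would then be passed to the sketch-RNN decoder; that part is omitted here.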
