• Translation Canvas: A Python software package for in-depth analysis and visualization of machine translation model outputs.
  • LightSeq: A high-performance training and inference library for Transformer models, widely used for machine translation, text generation, visual recognition, and more. With its custom CUDA implementation, it achieves a 10x speed-up over the original TensorFlow seq2seq package and is faster than other implementations.
  • NeurST: A toolbox with readily available models for neural machine translation and speech-to-text translation.
  • Xiaomingbot: an intelligent news-writing robot. [Demo]
  • BLOG: a probabilistic programming language for machine learning
  • Swift: a compiler for the probabilistic programming language BLOG.
  • DynaMMo: a learning toolbox for multi-dimensional co-evolving time series. [GitHub page]
  • CLDS: complex-valued linear dynamical system
  • PLiF: time-shift-invariant feature extraction for time series
  • BoLeRO: occlusion recovery for human motion capture
  • paralearn: a parallel algorithm for learning Markov models and linear dynamical systems (i.e., Kalman filters); the shared state-space form is recalled after this list
  • MLDS: learning dynamical models for tensor time series
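
Several of the time-series tools above (DynaMMo, paralearn, MLDS) build on the standard linear dynamical system (Kalman filter) model. As a reminder, the state-space form they share, written in generic notation not tied to any particular package, is:

```latex
% Linear dynamical system (Kalman filter) in state-space form:
% hidden states z_n evolve linearly and observations x_n are noisy linear
% readouts, with Gaussian transition and observation noise.
\begin{aligned}
  z_{n+1} &= A\, z_n + w_n, \qquad w_n \sim \mathcal{N}(0,\, Q),\\
  x_n     &= C\, z_n + v_n, \qquad v_n \sim \mathcal{N}(0,\, R).
\end{aligned}
```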

Dataset

  • TTNews: a dataset for Chinese document summarization, with 50,000 news articles paired with summaries for training and 4,000 news articles for testing. [Task description] [Training data] [Testing data and evaluation script] [Reports from NLPCC2017 and NLPCC2018]
  • CNewSum: an extended version of TTNews for Chinese document summarization, containing 304,307 documents with human-written summaries, plus additional adequacy-level and deducibility-level labels. [Project URL] (a data-loading sketch follows this list)
  • MLGSum: a multilingual text summarization corpus with 1.2 million articles in 12 languages; the average article length is 570 words. [Project URL] [Data]
  • MTG: a large-scale multilingual dataset for text generation tasks. It consists of 6.9 million raw articles and 400 thousand annotated documents for story generation, question generation, title generation, and summarization in five languages: English, French, German, Spanish, and Chinese. Download data here.
  • SJN-210k: a dataset for jersey number recognition in soccer games. It contains 210,000 images for training and testing. Paper. Download data here.
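
As an illustration of how a summarization corpus like TTNews or CNewSum might be consumed, the sketch below assumes a hypothetical JSON-lines layout with "article" and "summary" fields; the actual released formats may differ, so the field names and file name here are assumptions, not the official schema.

```python
import json

def load_summarization_jsonl(path):
    """Yield (article, summary) pairs from a JSON-lines file.

    The field names "article" and "summary" are assumptions for illustration
    only; the released TTNews / CNewSum files may use different keys, so check
    the project pages linked above before relying on this.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            yield record["article"], record["summary"]

if __name__ == "__main__":
    # Hypothetical file name; substitute the path of the downloaded split.
    n_docs, total_chars = 0, 0
    for article, summary in load_summarization_jsonl("cnewsum_train.jsonl"):
        n_docs += 1
        total_chars += len(article)
    print(n_docs, "documents, average article length", total_chars / max(n_docs, 1))
```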

Past Projects

Intelligent Writing and Text Generation

  • Developing controllable and interpretable methods for effective text generation.
  • Xiaomingbot: an intelligent news-writing robot. [Demo]
  • Bayesian sampling methods for controllable text generation (CGMH, MHA, TSMH): controlling language generation explicitly with various constraints; see the sketch after this list.
  • DEMVAE: a VAE with hierarchical latent priors that addresses the training problem for VAEs with mixture-of-exponential-family priors.
  • Training better data-to-text generation with both data-text pairs and additional raw text: check out the Variational Template Machine, which learns infinitely many templates for generation in the latent space.
  • One embedding is not enough to represent a word! Bayesian Softmax improves text generation.
  • Applications in advertising systems: generating bidwords for sponsored search and news headline editing.
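
The constrained-sampling idea behind CGMH, MHA, and TSMH can be illustrated with a minimal Metropolis-Hastings sketch over word sequences. Everything below (toy_score, VOCAB, the edit moves) is a toy stand-in invented for illustration: a real system scores candidates with an actual language model, handles richer constraints, and keeps the proposal-ratio correction that this sketch omits.

```python
import math
import random

# Hypothetical toy vocabulary and scoring function; only the sampling loop
# reflects the CGMH/MHA/TSMH idea of edit-based Metropolis-Hastings search.
VOCAB = ["the", "cat", "sat", "on", "a", "mat", "dog", "slept", "quietly"]
COMMON = {"the", "a", "on"}

def toy_score(words, keyword):
    """Log-score of a candidate sentence under a hard keyword constraint."""
    if keyword not in words:            # hard constraint: keyword must appear
        return float("-inf")
    fluency = sum(0.5 for w in words if w in COMMON)
    return fluency - 0.3 * len(words)   # mild length penalty

def propose(words):
    """One edit move: replace, insert, or delete a random word."""
    words = list(words)
    i = random.randrange(len(words))
    move = random.choice(["replace", "insert", "delete"])
    if move == "replace":
        words[i] = random.choice(VOCAB)
    elif move == "insert":
        words.insert(i, random.choice(VOCAB))
    elif len(words) > 1:                # never delete the last remaining word
        del words[i]
    return words

def mh_sample(keyword, steps=2000, seed=0):
    """Metropolis sampling over sentences (proposal-ratio correction omitted)."""
    random.seed(seed)
    current = [keyword]                 # start from a constraint-satisfying seed
    cur_score = toy_score(current, keyword)
    for _ in range(steps):
        cand = propose(current)
        cand_score = toy_score(cand, keyword)
        # Accept improvements always; accept worse moves with probability exp(delta).
        if random.random() < math.exp(min(0.0, cand_score - cur_score)):
            current, cur_score = cand, cand_score
    return " ".join(current), cur_score

if __name__ == "__main__":
    print(mh_sample("cat"))
```

Running mh_sample("cat") walks from the single-word seed toward higher-scoring sentences while never leaving the set of sentences that contain the required keyword.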

Probabilistic programming languages and Bayesian inference

Time series learning

Parallel Learning for Sequential Models

Network analysis

  • social network and social media analysis
  • CDEM: fly embryo gene pattern mining (finished)