• Translation Canvas: A Python software package for in-depth analysis and visualization of machine translation model outputs.
  • LightSeq: A high-performance training and inference library for Transformer models, widely used for machine translation, text generation, visual recognition, and more. With its custom CUDA implementation, it achieves a 10x speed-up over the original TensorFlow seq2seq package and is faster than other implementations.
  • NeurST: A toolbox with readily available models for neural machine translation and speech-to-text translation.
  • Xiaomingbot: an intelligent news-writing robot. [Demo]
  • BLOG: a probabilistic programming language for machine learning
  • Swift: a compiler for the probabilistic programming language BLOG.
  • DynaMMo: a learning toolbox for multi-dimensional co-evolving time series. [GitHub page]
  • CLDS: complex-valued linear dynamical system
  • PLiF: time-shift-invariant feature extraction for time series
  • BoLeRO: occlusion recovery for human motion capture
  • paralearn: a parallel algorithm for learning Markov models and linear dynamical systems (i.e., Kalman filters); the shared state-space form is recalled after this list
  • MLDS: learning dynamical models for tensor time series
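
Several of the time-series tools above (DynaMMo, paralearn, MLDS) build on the standard linear dynamical system (Kalman filter) model. As a reminder, the state-space form they share, written in generic notation not tied to any particular package, is:

```latex
% Linear dynamical system (Kalman filter) in state-space form:
% hidden states z_n evolve linearly and observations x_n are noisy linear
% readouts, with Gaussian transition and observation noise.
\begin{aligned}
  z_{n+1} &= A\, z_n + w_n, \qquad w_n \sim \mathcal{N}(0,\, Q),\\
  x_n     &= C\, z_n + v_n, \qquad v_n \sim \mathcal{N}(0,\, R).
\end{aligned}
```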

Dataset

  • TTNews: a dataset for Chinese document summarization, with 50,000 news articles paired with summaries for training and 4,000 news articles for testing. [Task description] [Training data] [Testing data and evaluation script] [Reports from NLPCC2017 and NLPCC2018]
  • CNewSum: an extended version of TTNews for Chinese document summarization, containing 304,307 documents with human-written summaries, plus additional adequacy-level and deducibility-level labels. [Project URL] (a data-loading sketch follows this list)
  • MLGSum: a multilingual text summarization corpus with 1.2 million articles in 12 languages; the average article length is 570 words. [Project URL] [Data]
  • MTG: a large-scale multilingual dataset for text generation tasks. It consists of 6.9 million raw articles and 400 thousand annotated documents for story generation, question generation, title generation, and summarization in five languages: English, French, German, Spanish, and Chinese. Download data here.
  • SJN-210k: a dataset for jersey number recognition in soccer games. It contains 210,000 images for training and testing. Paper. Download data here.
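
As an illustration of how a summarization corpus like TTNews or CNewSum might be consumed, the sketch below assumes a hypothetical JSON-lines layout with "article" and "summary" fields; the actual released formats may differ, so the field names and file name here are assumptions, not the official schema.

```python
import json

def load_summarization_jsonl(path):
    """Yield (article, summary) pairs from a JSON-lines file.

    The field names "article" and "summary" are assumptions for illustration
    only; the released TTNews / CNewSum files may use different keys, so check
    the project pages linked above before relying on this.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            yield record["article"], record["summary"]

if __name__ == "__main__":
    # Hypothetical file name; substitute the path of the downloaded split.
    n_docs, total_chars = 0, 0
    for article, summary in load_summarization_jsonl("cnewsum_train.jsonl"):
        n_docs += 1
        total_chars += len(article)
    print(n_docs, "documents, average article length", total_chars / max(n_docs, 1))
```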

Past Projects

Intelligent Writing and Text Generation

  • Developing controllable and interpretable methods for effective text generation.
  • Xiaomingbot: an intelligent news-writing robot. [Demo]
  • Bayesian sampling methods for controllable text generation (CGMH, MHA, TSMH): controlling language generation explicitly with various constraints; see the sketch after this list.
  • DEMVAE: a VAE with hierarchical latent priors that addresses the training problem for VAEs with mixture-of-exponential-family priors.
  • Training better data-to-text generation with both data-text pairs and additional raw text: check out the Variational Template Machine, which learns infinitely many templates for generation in the latent space.
  • One embedding is not enough to represent a word! Bayesian Softmax improves text generation.
  • Applications in advertising systems: generating bidwords for sponsored search and news headline editing.
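
The constrained-sampling idea behind CGMH, MHA, and TSMH can be illustrated with a minimal Metropolis-Hastings sketch over word sequences. Everything below (toy_score, VOCAB, the edit moves) is a toy stand-in invented for illustration: a real system scores candidates with an actual language model, handles richer constraints, and keeps the proposal-ratio correction that this sketch omits.

```python
import math
import random

# Hypothetical toy vocabulary and scoring function; only the sampling loop
# reflects the CGMH/MHA/TSMH idea of edit-based Metropolis-Hastings search.
VOCAB = ["the", "cat", "sat", "on", "a", "mat", "dog", "slept", "quietly"]
COMMON = {"the", "a", "on"}

def toy_score(words, keyword):
    """Log-score of a candidate sentence under a hard keyword constraint."""
    if keyword not in words:            # hard constraint: keyword must appear
        return float("-inf")
    fluency = sum(0.5 for w in words if w in COMMON)
    return fluency - 0.3 * len(words)   # mild length penalty

def propose(words):
    """One edit move: replace, insert, or delete a random word."""
    words = list(words)
    i = random.randrange(len(words))
    move = random.choice(["replace", "insert", "delete"])
    if move == "replace":
        words[i] = random.choice(VOCAB)
    elif move == "insert":
        words.insert(i, random.choice(VOCAB))
    elif len(words) > 1:                # never delete the last remaining word
        del words[i]
    return words

def mh_sample(keyword, steps=2000, seed=0):
    """Metropolis sampling over sentences (proposal-ratio correction omitted)."""
    random.seed(seed)
    current = [keyword]                 # start from a constraint-satisfying seed
    cur_score = toy_score(current, keyword)
    for _ in range(steps):
        cand = propose(current)
        cand_score = toy_score(cand, keyword)
        # Accept improvements always; accept worse moves with probability exp(delta).
        if random.random() < math.exp(min(0.0, cand_score - cur_score)):
            current, cur_score = cand, cand_score
    return " ".join(current), cur_score

if __name__ == "__main__":
    print(mh_sample("cat"))
```

Running mh_sample("cat") walks from the single-word seed toward higher-scoring sentences while never leaving the set of sentences that contain the required keyword.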

Probabilistic programming languages and Bayesian inference

Time series learning

Parallel Learning for Sequential Models

Network analysis

  • social network and social media analysis
  • CDEM: fly embryo gene pattern mining (finished)