- Translation Canvas: A Python software package for in-depth analysis and visualization of machine translation model outputs.
- LightSeq: A
high-performance training and inference library for Transformer models.
It is widely used for machine translation, text generation, visual recognition, and more.
With its custom CUDA
implementation, it achieves a 10x speed-up over the original
TensorFlow seq2seq package and is faster than other implementations.
- NeurST: A
toolkit with readily available models for neural machine
translation and speech-to-text translation.
- Xiaomingbot: an intelligent news-writing robot. [Demo]
- BLOG: a
probabilistic programming language for machine learning
- Swift: a compiler
for the probabilistic programming language BLOG.
- DynaMMo: learning
toolbox for multi-dimensional co-evolving time series. [GitHub page]
- CLDS: complex-valued linear
dynamical system
- PLiF: time-shift-invariant
feature extraction for time series
- BoLeRO: human motion
capture occlusion recovery
- paralearn: a parallel
algorithm for learning Markov models and linear dynamical systems
(i.e. Kalman filter)
- MLDS: learning dynamical
model for tensor time series
Dataset
- TTNews: a dataset for Chinese document summarization. 50,000
news articles with summaries for training, and 4,000 news articles
for testing. [Task description] [Training data]
[Testing data and evaluation script]
[Reports from NLPCC2017 and NLPCC2018]
- CNewSum: an extended version of TTNews for Chinese document
summarization. It includes 304,307 documents with human-written
summaries, plus additional adequacy-level and
deducibility-level labels. [Project URL]
- MLGSum: a multilingual text summarization corpus with 1.2
million articles in 12 languages. Average length per article is
570 words. [Project URL] [Data]
- MTG: a large-scale multilingual dataset for text generation tasks. It consists of 6.9 million raw articles and 400 thousand annotated documents for story generation, question generation, title generation, and summarization in five languages: English, French, German, Spanish, and Chinese. Download data here.
- SJN-210k: a dataset for jersey number recognition in soccer games. It contains 210,000 images for training and testing. Paper. Download data here.
Past Projects
Intelligent Writing and Text Generation
- developing controllable and interpretable methods for effective
text generation.
- Xiaomingbot: an intelligent news-writing robot. [Demo]
- Bayesian sampling methods for controllable text generation: CGMH,
MHA, and TSMH,
which control language generation explicitly using various
constraints.
- VAE with hierarchical latent priors: solving the training
problem for VAEs with a mixture of exponential family distributions. DEMVAE
- Training better data-to-text generation with both data-text
pairs and additional raw text. Check out the variational
template machine, which learns infinite templates for
generation in the latent space.
- One embedding is not enough to represent a word! Bayesian
Softmax improves text generation.
- Applications in advertising systems: generating bidwords for
sponsored
search and news headline editing.
Probabilistic programming languages and Bayesian inference
Time series learning
Parallel Learning for Sequential Models
Network analysis
- social network and social media analysis
- CDEM: fly embryo gene pattern mining. (finished)