KDD’10tutorial: Indexing and Mining Time Sequences.

Christos Faloutsos and Lei Li (CMU/SCS)

July 23, 2010

DESCRIPTION - OBJECTIVES

How can we find patterns in a sequence of sensor measurements (eg., a sequence of temperatures, or water-pollutant measurements)? How can we compress it? What are the major tools for forecasting and outlier detection? The objective of this tutorial is to provide a concise and intuitive overview of the most important tools, that can help us find patterns in sensor sequences. Sensor data analysis becomes of increasingly high importance, thanks to the decreasing cost of hardware and the increasing on-sensor processing abilities. We review the state of the art in three related fields: (a) fast similarity search for time sequences, (b) linear forecasting with the traditional AR (autoregressive) and ARIMA methodologies, (c) Kalman filters, and (d) non-linear forecasting, for chaotic/self-similar time sequences, using lag-plots and fractals. The emphasis of the tutorial is to give the intuition behind these powerful tools, which is usually lost in the technical literature, as well as to give case studies that illustrate their practical use.

FOILS

The pdf of the foils is here part A and part B

CONTENT AND OUTLINE

GOAL - WHO SHOULD ATTEND

Researchers that want to get up to speed with the major tools in time sequence analysis. Also, practitioners who want a concise, intuitive overview of the state of the art.

PREREQUISITES

None. The emphasis is on the intuition behind all these mathematical tools.

PRESENTERS - BIOS

Christos Faloutsos is a Professor at Carnegie Mellon University. He has received the Presidential Young Investigator Award by the National Science Foundation (1989), the Research Contributions Award in ICDM 2006, the Innovations award in KDD’10, 16 “best paper” awards, and several teaching awards. He has served as a member of the executive committee of SIGKDD; he has published over 200 refereed articles, 11 book chapters and one monograph. He holds five patents and he has given over 30 tutorials and over 10 invited distinguished lectures. His research interests include data mining for graphs and streams, fractals, database performance, and indexing for multimedia and bio-informatics data.

Lei Li is currently a Ph.D. candidate at Computer Science Department, Carnegie Mellon University. Lei Li is received his B.E. degree from the Department of Computer Science and Engineering, Shanghai Jiao Tong University. His research interests include machine learning, data mining, time series analysis, and motion capture data.

[89][34434551][44485249383936][2526][6][545655][11][292][1210][37][21][2340][28,  2714221520][75346550415857351431319][32303133][2442][16171847]

References

[1]   Shivnath Babu and Jennifer Widom. Continuous Queries over Data Streams. SIGMOD Record 109-120, 30(3), 2001.

[2]   Brian T Bartell, Garrison W Cottrell, and Richard K Belew. Latent Semantic Indexing is an Optimal Special Case of Multidimensional Scaling. SIGIR, pages 161–167, 1992.

[3]   Michael W Berry. Large-Scale Sparse Singular Value Computations. The International Journal of Supercomputer Applications, 6(1):13–49, 1992.

[4]   Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Joerg Sander. {LOF}: Identifying Density-Based Local Outliers. In SIGMOD Conference, pages 93–104, Dallas, TX, 2000.

[5]   M Castagli and S Eubank. Nonlinear Modeling and Forecasting. Addison Wesley, 1992.

[6]   Deepayan Chakrabarti and Christos Faloutsos. F4: large-scale automated forecasting using fractals. In Proceedings of the 2002 ACM CIKM International Conference on Information and Knowledge Management, McLean, VA, USA, November 4-9, 2002, pages 2–9, 2002.

[7]   Paolo Ciaccia, Marco Patella, and Pavel Zezula. M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. VLDB, pages 426–435, 1997.

[8]   F J Damerau. A Technique for Computer Detection and Correction of Spelling Errors. CACM, 7(3):171–176, March 1964.

[9]   Ingrid Daubechies. Ten Lectures on Wavelets. Capital City Press, Montpelier, Vermont, 1992.

[10]   Susan T Dumais. Latent Semantic Indexing ({LSI}) and {TREC}-2. In D K Harman, editor, The Second Text Retrieval Conference (TREC-2), pages 105–115, Gaithersburg, MD, March 1994. NIST.

[11]   Christos Faloutsos and King-Ip Lin. FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, California, May 22-25, 1995, pages 163–174, 1995.

[12]   Peter W Foltz and Susan T Dumais. Personalized Information Delivery: An Analysis of Information Filtering Methods. Comm. of ACM (CACM), 35(12):51–60, December 1992.

[13]   Volker Gaede and Oliver Guenther. Multidimensional Access Methods. Computing Surveys, 30(2):170–231, 1998.

[14]   Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi. Data Stream Management: Processing High-Speed Data Streams. Springer, 2009.

[15]   J E Gehrke, Flip Korn, and Divesh Srivastava. On Computing Correlated Aggregates Over Continual Data Streams. ACM Sigmod, May 2001.

[16]   Zoubin Ghahramani and Geoffrey E. Hinton. Parameter estimation for linear dynamical systems. Technical Report CRG-TR-96-2, February 1996.

[17]   Zoubin Ghahramani and Geoffrey E. Hinton. Variational learning for switching state-space models. Neural Computation, 12(4):831–864, 2000.

[18]   N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-gaussian bayesian state estimation. IEE Proceedings F Radar and Signal Processing, 140(2):107–113, 1993.

[19]   Dimitrios Gunopulos and Gautam Das. Time Series Similarity Measures and Time Series Indexing. In SIGMOD Conference, Santa Barbara, CA, 2001.

[20]   A Guttman. R-Trees: {A} Dynamic Index Structure for Spatial Searching. In Proc. ACM SIGMOD, pages 47–57, Boston, Mass, June 1984.

[21]   A Hyvarinen, J Karhunen, and E Oja. Independent Component Analysis. John Wiley and Sons, 2001.

[22]   Christian S Jensen and Stardas Pakalnis. TRAX - Real-World Tracking of Moving Objects. In Christoph Koch, Johannes Gehrke, Minos N Garofalakis, Divesh Srivastava, Karl Aberer, Anand Deshpande, Daniela Florescu, Chee Yong Chan, Venkatesh Ganti, Carl-Christian Kanne, Wolfgang Klas, and Erich J Neuhold, editors, VLDB, pages 1362–1365. ACM, 2007.

[23]   I Jolliffe. Principal Component Analysis. Springer, 2002.

[24]   R. E. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME šC Journal of Basic Engineering, (82 (Series D)):35–45, 1960.

[25]   Eamonn J Keogh, Kaushik Chakrabarti, Sharad Mehrotra, and Michael J Pazzani. Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases. In SIGMOD Conference, Santa Barbara, CA, 2001.

[26]   Eamonn J Keogh, Stefano Lonardi, and Chotirat (Ann) Ratanamahatana. Towards parameter-free data mining. In KDD Conference, pages 206–215, Seattle, WA, 2004.

[27]   Eamonn J Keogh, Themis Palpanas, Victor B Zordan, Dimitrios Gunopulos, and Marc Cardle. Indexing Large Human-Motion Databases. In Mario A Nascimento, M Tamer Özsu, Donald Kossmann, Renée J Miller, José A Blakeley, and K Bernhard Schiefer, editors, VLDB, pages 780–791. Morgan Kaufmann, 2004.

[28]   Vikrant Kobla, David S Doermann, King-Ip Lin, and Christos Faloutsos. Compressed-Domain Video Indexing Techniques Using DCT and Motion Vector Information in MPEG Video. In Storage and Retrieval for Image and Video Databases (SPIE), pages 200–211, 1997.

[29]   Joseph B Kruskal and Myron Wish. Multidimensional scaling. SAGE publications, Beverly Hills, 1978.

[30]   Lei Li, Wenjie Fu, Fan Guo, Todd C Mowry, and Christos Faloutsos. Cut-and-stitch: efficient parallel learning of linear dynamical systems on smps. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24-27, 2008, pages 471–479, 2008.

[31]   Lei Li, James McCann, Christos Faloutsos, and Nancy Pollard. Laziness is a virtue: Motion stitching using effort minimization. In Short Papers Proceedings of EUROGRAPHICS, 2008.

[32]   Lei Li, James McCann, Nancy S Pollard, and Christos Faloutsos. DynaMMo: mining and summarization of coevolving sequences with missing values. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, June 28 - July 1, 2009, pages 507–516, 2009.

[33]   Lei (CMU) Li, B. Aditya (CMU) Prakash, and Christos (CMU) Faloutsos. Parsimonious Linear Fingerprinting for Time Series. In VLDB, Singapore, 2010.

[34]   Vasileios Megalooikonomou, Qiang Wang, Guo Li, and Christos Faloutsos. A Multiresolution Symbolic Representation of Time Series. In Proceedings of the 21st International Conference on Data Engineering, ICDE 2005, 5-8 April 2005, Tokyo, Japan, pages 668–679, 2005.

[35]   I J Oppenheim, A Jain, and D W Greve. A {MEMS} Ultrasonic Transducer for Resident Monitoring of Steel Structures. In SPIE Smart Structures Conference SS05, San Diego, March 2002.

[36]   Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos, and Masafumi Hamamoto. AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases. In Advances in Knowledge Discovery and Data Mining, 8th Pacific-Asia Conference, PAKDD 2004, Sydney, Australia, May 26-28, 2004, Proceedings, pages 519–528, 2004.

[37]   Christos H Papadimitriou, Hisao Tamaki, Prabhakar Raghavan, and Santosh Vempala. Latent Semantic Indexing: {A} Probabilistic Analysis. JOURNAL OF COMPUTER AND SYSTEM SCIENCES, 61(2):217–235, 2000.

[38]   Spiros Papadimitriou, Anthony Brockwell, and Christos Faloutsos. Adaptive, Hands-Off Stream Mining. In VLDB, pages 560–571, 2003.

[39]   Spiros Papadimitriou, Jimeng Sun, and Christos Faloutsos. Streaming Pattern Discovery in Multiple Time-Series. In Proceedings of the 31st International Conference on Very Large Data Bases, Trondheim, Norway, August 30 - September 2, 2005, pages 697–708, 2005.

[40]   William H Press, Saul A Teukolsky, William T Vetterling, and Brian P Flannery. Numerical Recipes in {C}. Cambridge University Press, 2nd edition, 1992.

[41]   Lawrence Rabiner and Biing-Hwang Juang. Fundamentals of Speach Recognition. Prentice Hall, 1993.

[42]   H. Rauch, F Tung, and C T Striebel. Maximum likelihood estimates of linear dynamic systems. AIAA Journal, pages 1445–1450, 1965.

[43]   Yasushi Sakurai, Christos Faloutsos, and Masashi Yamamuro. Stream Monitoring under the Time Warping Distance. In Proceedings of the 23rd International Conference on Data Engineering, ICDE 2007, April 15-20, 2007, The Marmara Hotel, Istanbul, Turkey, pages 1046–1055, 2007.

[44]   Yasushi Sakurai, Spiros Papadimitriou, and Christos Faloutsos. AutoLag: Automatic Discovery of Lag Correlations in Stream Data. In Proceedings of the 21st International Conference on Data Engineering, ICDE 2005, 5-8 April 2005, Tokyo, Japan, pages 159–160, 2005.

[45]   Yasushi Sakurai, Masatoshi Yoshikawa, and Christos Faloutsos. FTW: fast similarity search under the time warping distance. In Proceedings of the Twenty-fourth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 13-15, 2005, Baltimore, Maryland, USA, pages 326–337, 2005.

[46]   T Sauer. Time series prediction using delay coordinate embedding. In A S Weigend and N A Gershenfeld, editors, Time Series Prediction: Forecasting the Future and Understanding the Past. Addison-Wesley, 1994.

[47]   R. H. Shumway and D. S. Stoffer. An approach to time series smoothing and forecasting using the em algorithm. Journal of Time Series Analysis, 3:253–264, 1982.

[48]   Jimeng Sun, Spiros Papadimitriou, and Christos Faloutsos. Online Latent Variable Detection in Sensor Networks. In Proceedings of the 21st International Conference on Data Engineering, ICDE 2005, 5-8 April 2005, Tokyo, Japan, pages 1126–1127, 2005.

[49]   Jimeng Sun, Spiros Papadimitriou, and Christos Faloutsos. Distributed Pattern Discovery in Multiple Streams. In Advances in Knowledge Discovery and Data Mining, 10th Pacific-Asia Conference, PAKDD 2006, Singapore, April 9-12, 2006, Proceedings, pages 713–718, 2006.

[50]   F Takens. Detecting strange attractors in fluid turbulence. In Dynamical Systems and Turbulence. Springer-Verlag., Berlin, 1981.

[51]   Yufei Tao, Christos Faloutsos, Dimitris Papadias, and Bin Liu. Prediction and Indexing of Moving Objects with Unknown Motion Patterns. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Paris, France, June 13-18, 2004, pages 611–622, 2004.

[52]   Mengzhi Wang, Ngai Hang Chan, Spiros Papadimitriou, Christos Faloutsos, and Tara M Madhyastha. Data Mining Meets Performance Evaluation: Fast Algorithms for Modeling Bursty Traffic. In Proceedings of the 18th International Conference on Data Engineering, 26 February - 1 March 2002, San Jose, CA, pages 507–516, 2002.

[53]   Andreas S Weigend and Neil A Gerschenfeld. Time Series Prediction: Forecasting the Future and Understanding the Past. Addison Wesley, 1994.

[54]   Byoung-Kee Yi and Christos Faloutsos. Fast Time Sequence Indexing for Arbitrary Lp Norms. In VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, September 10-14, 2000, Cairo, Egypt, pages 385–394, 2000.

[55]   Byoung-Kee Yi, H V Jagadish, and Christos Faloutsos. Efficient Retrieval of Similar Time Sequences Under Time Warping. In Proceedings of the Fourteenth International Conference on Data Engineering, February 23-27, 1998, Orlando, Florida, USA, pages 201–208, 1998.

[56]   Byoung-Kee Yi, Nikolaos Sidiropoulos, Theodore Johnson, H V Jagadish, Christos Faloutsos, and Alexandros Biliris. Online Data Mining for Co-Evolving Time Sequences. In ICDE, pages 13–22, 2000.

[57]   Y Zhu and D Shasha. Efficient Elastic Burst Detection in Data Streams. In KDD, 2003.

[58]   Yunyue Zhu and Dennis Shasha. StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time. In VLDB, pages 358–369, 2002.