
Title: Tracking Recurring Contexts using Ensemble Classifiers: An Application to Email Filtering
Author(s): I. Katakis, G. Tsoumakas, I. Vlahavas.
Availability: PDF file (23 pages).
Keywords: data streams, classification, concept drift, text mining, text classification, recurring contexts, recurring themes, text streams, email mining, email classification.
Appeared in: Knowledge and Information Systems, Springer, 22(3), pp. 371-391, 2010.
Abstract: Concept drift constitutes a challenging problem for the machine learning and data mining community that frequently appears in real-world stream classification problems. It is usually defined as the unforeseeable concept change of the target variable in a prediction task. In this paper, we focus on the problem of recurring contexts, a special sub-type of concept drift that has not yet received proper attention from the research community. In the case of recurring contexts, concepts may re-appear in the future, and thus older classification models might be beneficial for future classifications. We propose a general framework for classifying data streams by exploiting stream clustering in order to dynamically build and update an ensemble of incremental classifiers. To achieve this, a transformation function that maps batches of examples into a new conceptual representation model is proposed. The clustering algorithm is then applied in order to group batches of examples into concepts and identify recurring contexts. The ensemble is produced by creating and maintaining an incremental classifier for every concept discovered in the data stream. An experimental study is performed using (a) two new real-world concept-drifting datasets from the email domain, (b) an instantiation of the proposed framework and (c) five methods for dealing with drifting concepts. Results indicate the effectiveness of the proposed representation and the suitability of the concept-specific classifiers for problems with recurring contexts.
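The pipeline described in the abstract (map each batch to a conceptual vector, cluster those vectors into concepts, keep one incremental classifier per concept) can be illustrated with a minimal sketch. This is not the authors' implementation: the per-class feature means used as the conceptual vector, the distance-threshold clustering, the nearest-centroid base learner, and all names (`CentroidClassifier`, `conceptual_vector`, `RecurringContextEnsemble`, `threshold`) are assumptions chosen only to keep the example short and self-contained.

```python
import math
from collections import defaultdict


class CentroidClassifier:
    """Assumed base learner: incremental nearest-centroid classifier."""

    def __init__(self):
        self.sums = {}                   # class label -> summed feature vector
        self.counts = defaultdict(int)   # class label -> number of examples

    def update(self, x, y):
        if y not in self.sums:
            self.sums[y] = [0.0] * len(x)
        self.sums[y] = [s + v for s, v in zip(self.sums[y], x)]
        self.counts[y] += 1

    def predict(self, x):
        if not self.sums:
            return None

        def dist(y):
            centroid = [s / self.counts[y] for s in self.sums[y]]
            return math.dist(centroid, x)

        return min(self.sums, key=dist)


def conceptual_vector(batch, classes):
    """Assumed transformation: concatenate per-class feature means of a batch."""
    dim = len(batch[0][0])
    vec = []
    for c in classes:
        rows = [x for x, y in batch if y == c]
        for i in range(dim):
            vec.append(sum(r[i] for r in rows) / len(rows) if rows else 0.0)
    return vec


class RecurringContextEnsemble:
    """One incremental classifier per concept found by threshold clustering."""

    def __init__(self, classes, threshold):
        self.classes = classes
        self.threshold = threshold  # hypothetical max distance to join a concept
        self.centers = []           # centroid of conceptual vectors, per concept
        self.sizes = []             # number of batches assigned, per concept
        self.models = []            # incremental classifier, per concept
        self.active = None          # concept of the most recent labeled batch

    def predict(self, x):
        # Assumes the concept of the last labeled batch is still in force.
        if self.active is None:
            return None
        return self.models[self.active].predict(x)

    def learn_batch(self, batch):
        v = conceptual_vector(batch, self.classes)
        dists = [math.dist(v, c) for c in self.centers]
        if dists and min(dists) <= self.threshold:
            # Recurring context: reuse the closest concept and move its center.
            k = dists.index(min(dists))
            n = self.sizes[k]
            self.centers[k] = [(c * n + a) / (n + 1)
                               for c, a in zip(self.centers[k], v)]
            self.sizes[k] = n + 1
        else:
            # Unseen concept: open a new cluster with a fresh classifier.
            k = len(self.centers)
            self.centers.append(v)
            self.sizes.append(1)
            self.models.append(CentroidClassifier())
        for x, y in batch:
            self.models[k].update(x, y)
        self.active = k
        return k


if __name__ == "__main__":
    ens = RecurringContextEnsemble(classes=["spam", "ham"], threshold=0.5)
    context_a = [([1.0, 0.0], "spam"), ([0.0, 1.0], "ham")]
    context_b = [([1.0, 0.0], "ham"), ([0.0, 1.0], "spam")]  # labels flipped
    for batch in (context_a, context_b, context_a):
        print(ens.learn_batch(batch))  # prints 0, then 1, then 0
```

In the toy run, the third batch is mapped back to concept 0, so the classifier trained during the first batch is reused rather than relearned. A fixed distance threshold is only one way to decide when a batch belongs to a known concept, and its value would need tuning for real data.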
See also: Datasets

This paper has been cited by the following:

1 Žliobaitė I., "Learning under Concept Drift: an Overview", Vilnius University, Faculty of Mathematics and Informatics, Technical Report, 2009.
2 Žliobaitė I., "Adaptive Training Set Formation", PhD Thesis, Vilnius University, 2010.
3 Lu, J., Li, R., Zhang, Y., Zhao, T., Lu, Z. Image annotation techniques based on feature selection for class-pairs, Knowledge and Information Systems, 24(2), 325-337.
4 Gomes, J. B., Menasalvas, E., and Sousa, P. A. C. (2010). CALDS: context-aware learning from data streams. In Proceedings of the First International Workshop on Novel Data Stream Pattern Mining Techniques (Washington, D.C., July 25, 2010). StreamKDD '10. ACM, New York, NY, 16-24.
5 Gomes, J. B., Menasalvas, E., and Sousa, P. A. C. (2010), "Tracking Recurrent Concepts Using Context", Rough Sets and Current Trends in Computing, 7th International Conference, RSCTC 2010, Lecture Notes in Computer Science, 2010, Volume 6086/2010, 168-177, Warsaw, Poland, June 28-30, 2010.
6 Gomes, J. B., Eibe, S., Sousa, P. A. C., Menasalvas, E. (2010), "Context-Aware Stream Ensemble for Ubiquitous Environments", Workshop on Ubiquitous Data Mining, in conjunction with the 19th European Conference on Artificial Intelligence (ECAI 2010), in Lisbon, Portugal, August 16-20, 2010.
7 Peipei Li, Xindong Wu, Xuegang Hu, (2010) Mining Recurring Concept Drifts with Limited Labeled Streaming Data, JMLR: Workshop and Conference Proceedings 13: 241-252, 2nd Asian Conference on Machine Learning (ACML2010), Tokyo, Japan, Nov. 8-10, 2010.
8 Žliobaitė, I. (2010). Three Data Partitioning Strategies for Building Local Classifiers: an experiment. Proc. of SUEMA workshop at ECML/PKDD'10, pp. 151-160.
9 Žliobaitė, I. (2011) "Combining similarity in time and space for training set formation under concept drift", Intelligent Data Analysis 15(4), pp. 589-611.
10 Gomes, J.B., Menasalvas, E., Sousa, P.A.C. (2011) Learning recurring concepts from data streams with a context-aware ensemble, Proceedings of the ACM Symposium on Applied Computing, pp. 994-999.
11 Kamath, K., Caverlee, J. (2011) "Expert-Driven Topical Classification of Short Message Streams", Proceedings of 3rd IEEE Conference on Social Computing (SocialCom 2011), MIT, Boston, USA.
12 Žliobaitė, I. (2011) Three data partitioning strategies for building local classifiers, Studies in Computational Intelligence, 373, pp. 233-250.
13 Žliobaitė, I. (2011). Identifying Hidden Contexts in Classification. Proceedings of PAKDD'11, pp. 277-288.
14 Zhang, Y., Zhu, X., Wu, X., Bond, J.P. (2011) Corrective classification: Learning from data imperfections with aggressive and diverse classifier ensembling, Information Systems, 36 (8), pp. 1135-1157.
15 Hoens, T.R., Chawla, N.V., Polikar, R. (2011) Heuristic Updatable Weighted Random Subspaces for non-stationary environments, Proceedings - IEEE International Conference on Data Mining, ICDM, art. no. 6137228, pp. 241-250.
16 Hosseini, M.J., Ahmadi, Z., Beigy, H. (2011) Pool and accuracy based stream classification: A new ensemble algorithm on data stream classification using recurring concepts detection, Proceedings - IEEE International Conference on Data Mining, ICDM, art. no. 6137433, pp. 588-595.
17 Sarah N. Kohail (2011) “Learning Concept Drift Using Adaptive Training Set Formation Strategy”, MSc Thesis, Faculty of Information Technology, The Islamic University of Gaza, October 2011.
18 Gama, J., Kosina, P. (2011) Learning about the learning process, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 7014 LNCS, pp. 162-172.
19 Grossi, V., Turini, F. (2012) Stream mining: A novel architecture for ensemble-based classification, Knowledge and Information Systems, 30 (2), pp. 247-281.
20 Žliobaitė, I., Bakker, J., Pechenizkiy, M. (2012) Beating the baseline prediction in food sales: How intelligent an intelligent predictor is? Expert Systems with Applications, 39 (1), pp. 806-815.
21 Nishida, Y., Hoshide, T., Fujimura, K. (2012) Improving tweet stream classification by detecting changes in word probability, In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval (SIGIR '12), pp. 971-980.
22 Attar, V., Chaudhary, P., Rahagude, S., Chaudhari, G., Sinha, P. (2012) An instance-window based classification algorithm for handling gradual concept drifts, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 7103 LNAI, pp. 156-172.
23 Li, P., Wu, X., Hu, X. (2012) Mining recurring concept drifts with limited labeled streaming data, ACM Transactions on Intelligent Systems and Technology, 3 (2), art. no. 29.
24 Ahmadi, Z., Beigy, H. (2012) Semi-supervised ensemble learning of data streams in the presence of concept drift, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 7209 LNAI (PART 2), pp. 526-537.
25 Schlüter, T., Conrad, S. (2012) An approach for automatic sleep stage scoring and apnea-hypopnea detection, Frontiers of Computer Science in China, 6 (2), pp. 230-241.
26 Lee, C.-J., Wu, Y.-C., Chen, Y.-C. (2012) Building news sentiment indicators for stock marketing application, International Journal of Advancements in Computing Technology, 4 (2), pp. 103-110.
27 Ying, K., ZhaoJie (2012) Study on image multiple classification based on boosting method, Journal of Convergence Information Technology, 7 (13), pp. 58-65.
28 Gomez, J.C., Boiy, E., Moens, M.-F. (2012) Highly discriminative statistical features for email classification, Knowledge and Information Systems, 31 (1), pp. 23-53.