Title: Dynamic Feature Space and Incremental Feature Selection for the Classification of Textual Data Streams
Author(s): I. Katakis, G. Tsoumakas, I. Vlahavas.
Availability: Click here to download the PDF (Acrobat Reader) file (10 pages).
Keywords: Text Classification, Data Streams, Concept Drift, Dynamic Feature Space, Incremental Feature Selection.
Appeared in: ECML/PKDD-2006 International Workshop on Knowledge Discovery from Data Streams, pp. 107-116, Berlin, Germany, 2006.
Abstract: Real-world text classification applications are of special interest for the machine learning and data mining community, mainly because they introduce and combine a number of special difficulties. They deal with high dimensional, streaming, unstructured, and, on many occasions, concept drifting data. Another important peculiarity of streaming text, not adequately discussed in the relevant literature, is the fact that the feature space is initially unavailable. In this paper, we discuss this aspect of textual data streams. We underline the necessity for a dynamic feature space and the utility of incremental feature selection in streaming text classification tasks. In addition, we describe a computationally undemanding incremental learning framework that could serve as a baseline in the field. Finally, we introduce a new concept drifting dataset which could assist other researchers in the evaluation of new methodologies.
See also: Concept drifting datasets described in the paper

This paper has been cited by the following:

1 Kim, H.J., Chang, J.: Integrating incremental feature weighting into naïve bayes text classifier, Proc. 6th International Conference on Machine Learning and Cybernetics (ICMLC 2007), Volume 2, 2007, pp. 1137-1143.
2 Kim, Han-Joon and Chang, Jae-Young, "Improving Naive Bayes Text Classifiers with Incremental Feature Weighting", KIPS Transactions, Korea Information Processing Society, b15 (5), pp. 457-464, 2008
3 Qu, W., Zhang, Y., Zhu, J., and Qiu, Q. 2009. Mining Multi-label Concept-Drifting Data Streams Using Dynamic Classifier Ensemble. In Proceedings of the 1st Asian Conference on Machine Learning: Advances in Machine Learning (Nanjing, China, November 02 - 04, 2009). Z. Zhou and T. Washio, Eds. Lecture Notes In Artificial Intelligence, vol. 5828. Springer-Verlag, Berlin, Heidelberg, 308-321. DOI= http://dx.doi.org/10.1007/978-3-642-05224-8_24
4 Žliobaitė I., "Learning under Concept Drift: an Overview", Vilnius University, Faculty of Mathematics and Informatics, Technical Report, 2009.
5 Ruan, Guangchen and Tan, Ying "A three-layer back-propagation neural network for spam detection using artificial immune concentration", Soft Computing: A Fusion of Foundations, Methodologies and Applications, 2009.
6 Chih-Chin, L, Chih-Hung, W, Ming-Chi, T. (2009) Feature selection using particle swarm optimization with application in spam filtering, International Journal of Innovative Computing, Information and Control, Volume 5, Issue 2, February 2009, Pages 423-432
7 Han-Joon Kim, "Improving Techniques for Naive Bayes Text Classifiers", in Handbook of Research on Text and Web Mining Technologies, Min Song and Yi-Fang Wu (Eds), pp. 111-127, Idea Group, 2009.
8 Liu, P., Cai, L., Wang, Y., Zhang, L. (2010) Classifying skewed data streams based on reusing data, ICCASM 2010 - 2010 International Conference on Computer Application and System Modeling, Proceedings, 4, art. no. 5620201, pp. V490-V493.
9 Mohammad M. Masud, Qing Chen, Jing Gao, Latifur Khan, Jiawei Han and Bhavani Thuraisingham, "Classification and Novel Class Detection of Data Streams in a Dynamic Feature Space", Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, Volume 6322/2010, ECML PKDD 2010, Barcelona, Spain, September 20-24, 2010
10 Žliobaitė I., "Adaptive Training Set Formation", PhD Thesis, Vilnius University, 2010.
11 Nambiar, U., Faruquie, T., Subramaniam, L.V., Negi, S., Ramakrishnan, G. (2011) Discovering customer intent in real-time for streamlining service desk conversations, International Conference on Information and Knowledge Management, Proceedings, pp. 1383-1388.
12 Hoens, T.R., Chawla, N.V., Polikar, R. (2011) Heuristic Updatable Weighted Random Subspaces for non-stationary environments, Proceedings - IEEE International Conference on Data Mining, ICDM, art. no. 6137228, pp. 241-250.
13 Masud, M.; Chen, Q.; Khan, L.; Aggarwal, C.; Gao, J.; Han, J.; Srivastava, A.; Oza, N. (2012) Classification and Adaptive Novel Class Detection of Feature-Evolving Data Streams, IEEE Transactions on Knowledge and Data Engineering, vol. PP, no. 99, pp. 1.
14 Žliobaitė, I.; Gabrys, B. (2012) Adaptive Preprocessing for Streaming Data, IEEE Transactions on Knowledge and Data Engineering, vol. PP, no. 99, pp. 1.