Research on web page classification-based core characteristics and web structure

Published Online: pp 253-257 https://doi.org/10.1504/IJWMC.2014.062003

The explosive growth of web pages has made web page classification technology a hotspot of web mining research. This paper reports experimental data on a fashion document corpus obtained with several feature selection and classification methods, gives characterising expressions for specific documents based on core feature terms, and puts forward a web page categorisation algorithm based on web structure. In classification experiments on the fashion web page corpus, the algorithm achieves a higher accuracy rate than the other classification algorithms, improving by several points relative to the result before the adjustment based on web structure. The algorithms studied in this paper can also be applied in domains other than fashion web pages.
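The abstract does not spell out how the adjustment based on web structure works, so the following is only an illustrative sketch of one common way such an adjustment can be done: blending each page's content-based class scores with the average scores of the pages it links to. The function names `adjust_with_links` and `predict`, the weight `alpha`, the category labels and the toy data are all assumptions made for this example and are not taken from the paper.

```python
def adjust_with_links(scores, links, alpha=0.3):
    """Blend each page's class scores with the mean scores of its linked pages.

    scores: {page: {label: score}} from any content-based classifier
    links:  {page: [neighbour pages]} (a hypothetical link graph)
    alpha:  weight given to the neighbours (an assumed tuning parameter)
    """
    adjusted = {}
    for page, own in scores.items():
        # Only neighbours that were themselves scored can contribute.
        neighbours = [n for n in links.get(page, []) if n in scores]
        if not neighbours:
            # No usable links: keep the content-based scores unchanged.
            adjusted[page] = dict(own)
            continue
        blended = {}
        for label in own:
            mean_nb = sum(scores[n].get(label, 0.0) for n in neighbours) / len(neighbours)
            blended[label] = (1 - alpha) * own[label] + alpha * mean_nb
        adjusted[page] = blended
    return adjusted


def predict(scores):
    """Pick the highest-scoring label for every page."""
    return {page: max(s, key=s.get) for page, s in scores.items()}


# Toy data: page "c" is borderline on content alone but links to two "shoes" pages.
scores = {
    "a": {"shoes": 0.9, "bags": 0.1},
    "b": {"shoes": 0.8, "bags": 0.2},
    "c": {"shoes": 0.45, "bags": 0.55},
}
links = {"c": ["a", "b"]}

before = predict(scores)
after = predict(adjust_with_links(scores, links, alpha=0.3))
print(before["c"], after["c"])
```

In this toy case the borderline page changes label once its neighbours are taken into account, which is the kind of effect a structure-based adjustment is meant to capture. How much weight the links should receive would have to be tuned on real data.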

Keywords

web page, web mining, text classification, web structure
