Web Document Classification Using Naïve Bayes

This work was carried out in collaboration between all authors. Authors ABA and ODF designed the study, wrote the protocol and supervised the work. Authors JPO and NOA carried out all laboratory work and performed the statistical analysis. Author JPO wrote the first draft of the manuscript. Authors ABA and JPO managed the analyses of the study. Authors ODF and NOA managed the literature searches. All authors read and approved the final manuscript.


Introduction
The Internet is a vast resource of information of different types: text, images, audio and video [1]. The amount of information available on the World Wide Web (WWW) has been increasing at an exponential rate. Web documents contain rich textual information, but the rapid growth of the Internet has made it increasingly difficult for users to locate relevant information quickly.
According to Pierre [2], the number of web pages available on the web is around 1 billion, and another 1.5 million are added daily. This explosive growth has put huge amounts of information at the disposal of anyone with access to the Internet. Hence, how to find a particular web document among this enormous number of pages, and how to classify the pages correctly, is a problem researchers have been trying to solve. Though different search engines are available, they do not return information that closely matches the user's interests and preferences, simply because the information available on the Internet is not well organized. This has led to a great deal of interest in developing useful and efficient tools for organizing web pages properly.
The World Wide Web continues to grow in both the volume of traffic and the size and complexity of Web sites, which makes it difficult to identify the relevant information present on the web [3]. With the growing number of web documents and the amount of online information, web mining plays an important role in extracting useful information from the World Wide Web through Web page classification, also known as web page categorization. Web page classification may be defined as the task of determining whether a web page belongs to a particular category or not.
Web content mining is the discovery of valuable information from web documents, which may contain text, images, hyperlinks, metadata and structured records [4]. Web mining is applied to extract interesting, useful patterns and hidden information from Web documents and Web activities [5].
In this paper, a method of automatically classifying Web documents into a set of categories using the Naïve Bayes algorithm is proposed, with the Waikato Environment for Knowledge Analysis (WEKA) as the machine learning platform. The rest of the paper is organized as follows. Section 2 reviews related works, and Section 3 introduces the classification algorithm, Naïve Bayes. The research methodology is then described in detail, followed by a discussion of the results. The last section concludes the paper.

Review of Related Works
Web mining can be said to have three operations of interest: clustering (e.g., finding natural groupings of users, pages, etc.), association (e.g., which URLs tend to be requested together) and sequential analysis (e.g., the order in which URLs tend to be accessed). In most real-world problems, the clusters and associations in web mining do not have clear-cut boundaries and often overlap considerably [6].
Consequently, an increasing number of approaches have been developed for web document classification, including k-nearest-neighbour (KNN) classification [7,8,9], Naïve Bayes classification [10,11,12], Support Vector Machines (SVM) [13,14,15], decision trees (DT) [16,17], Neural Networks (NN) [18,19] and maximum entropy [8,20]. Among these approaches, the Naïve Bayes text classifier has been widely used because of its simplicity in both the training and classification stages [14]. The Naïve Bayes classifier is an uncomplicated and widely used method for supervised learning. It is one of the fastest learning algorithms and can deal with any number of features and classes [21]. Bayesian classification is based on Bayes theorem. A simple Bayesian classifier, namely the Naïve Bayes classifier, is comparable in performance with decision tree and neural network classifiers [22].
Loan [23] submitted that the Naïve Bayes algorithm improves Web mining tasks through its accurate classification of web documents. Its applications are important in the following areas: e-mail spamming; filtering spam results out of search queries; mining log files for computing system management; machine learning for the Semantic Web; document ranking by text classification; hierarchical text categorization; managing content with automatic classification; and other areas of Web mining.
Ziqiang and Xia [24] proposed a web classification algorithm using Maximum Margin Projection (MMP) and Least Square Support Vector Machines (LS-SVM). The high-dimensional document data is first projected into a lower-dimensional feature space via the MMP algorithm; the LS-SVM classifier is then used to classify the test documents into different classes in terms of the extracted semantic features. Experiments performed on two popular document datasets demonstrated the superior performance of the proposed document classification algorithm.
Guan, Zhou, Xiao, Guo and Yang [25] introduced a fast dimension reduction method for document classification based on imprecise spectrum analysis. It uses a representative matrix composed of the top-k column vectors derived from the original feature vector space and reduces the dimension of a feature vector by computing its product with this representative matrix. Howard, Paull, Biletskiy and Yang [19] developed a fast backpropagation neural network model to build document classifiers, with the information gain method used for feature selection. The words contained in the documents were ranked by information gain, and those carrying the most information for classifying the documents were selected as the input features of the artificial neural network (ANN) classifiers. The neural network developed assumes a three-layer structure with a fast back-propagation learning algorithm.
Rujiang and Xiaoyue [26] proposed a system that uses integrated ontologies and natural language processing techniques to index texts. The traditional word matrix is replaced by a concept-based matrix. For this purpose, a fully automated method for mapping keywords to their corresponding ontology concepts was developed, using SVM for classification. Their results show improved text classification performance. In this paper, we propose a novel approach to classifying web documents using Naïve Bayes with 10-fold cross-validation. The data used was extracted from the website of Ladoke Akintola University of Technology (LAUTECH), Ogbomoso, Nigeria. WEKA, which provides a general-purpose environment for automatic classification and feature selection, was used as the machine learning workbench.

Naive Bayes Approach
Naive Bayes is the simplest Bayesian Network (BN) Classifier, in which each attribute node (which is the attribute variable) has the class node (which is the class variable) as its parent, but does not have any other parent.
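The Naïve Bayes classifier applies to learning tasks where each instance x is described by a conjunction of attribute values and where the target function f(x) can take on any value from some finite set V. A set of training examples of the target function is provided, and a new instance is presented, described by the tuple of attribute values $\langle a_1, a_2, \ldots, a_n \rangle$. The learner is asked to predict the target value, or classification, for this new instance. Using the Bayesian approach to classify the new instance means assigning the most probable target value, $v_{MAP}$, given the attribute values that describe the instance:

$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n) \tag{1}$$

where $v_j$ ranges over the target values in V. Bayes theorem can be used to rewrite the expression above as

$$v_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \ldots, a_n)} \tag{2}$$

$$v_{MAP} = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j) \tag{3}$$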

Waikato Environment for Knowledge Analysis (WEKA)
The Waikato Environment for Knowledge Analysis (WEKA) is a machine learning workbench developed at the University of Waikato. Its purpose is to allow users to access a variety of machine learning techniques for experimentation and comparison using real-world data sets. WEKA is a comprehensive suite of Java class libraries that implement many state-of-the-art machine learning and data mining algorithms. WEKA is freely available on the World Wide Web and accompanies a text on data mining which documents and fully explains all the algorithms it contains [27]. Applications written using the WEKA class libraries can be run on any computer with a web browsing capability; this allows users to apply machine learning techniques to their own data regardless of computer platform.

Research Methodology
The research methodology for this study involves three steps: data collection, data preparation and machine learning. This section summarizes these steps.

Data collection
Generally, the data collection step involves gathering text or web documents. The web documents used in this study were collected from the LAUTECH website. Fig. 1 illustrates a sample web page from the data collected. The web page consists of text, hyperlinks and pictures. Consequently, data preparation is needed to remove all elements of the web page other than text. Although the HTML structure sometimes carries semantics associated with the document class, it is ignored here so as to simplify document processing and document representation. Each web page used was also assigned to a category and given a class label.

Data preparation stage
After extracting the texts in the web pages, the data collected was converted into data sets in Weka's Attribute Relation File Format (ARFF) to be later used in the Machine Learning phase.
All text or web documents (text corpus) obtained from the data collection step were concatenated and saved in a single text file where each document is represented on a separate line in plain text format.This representation uses three attributes: document_name, document_content, and document_class, all of type string.

Data conversion into .arff format
To use WEKA as the machine learning tool, the data must be in .arff format. Three tools can be used for the conversion: Excel, Notepad and MS Word. In this study, the collected data was converted to .arff format using Notepad.
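The conversion can also be scripted. The following is a minimal sketch, not the procedure used in this study (which was done by hand in Notepad), of writing the three string attributes described above into ARFF text. The relation name `lautech_pages`, the helper names `quote_arff` and `to_arff`, and the two sample rows are hypothetical and serve only as illustration.

```python
def quote_arff(value: str) -> str:
    """Quote a string value for an ARFF data row, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def to_arff(relation: str, documents: list[tuple[str, str, str]]) -> str:
    """Build ARFF text from (document_name, document_content, document_class) rows."""
    lines = [
        f"@relation {relation}",
        "",
        "@attribute document_name string",
        "@attribute document_content string",
        "@attribute document_class string",
        "",
        "@data",
    ]
    for name, content, label in documents:
        # Collapse internal whitespace so each document stays on one line.
        content = " ".join(content.split())
        lines.append(",".join(quote_arff(v) for v in (name, content, label)))
    return "\n".join(lines) + "\n"


# Hypothetical sample rows; the real corpus is not reproduced here.
sample = [
    ("ict_centre", "The ICT Centre manages the university's network.", "A"),
    ("physics", "The department offers courses in\nphysics.", "B"),
]
arff_text = to_arff("lautech_pages", sample)
print(arff_text)
```

Each document occupies one line of the `@data` section, which mirrors the one-document-per-line text file described in the data preparation stage.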

Data classification
In this study, two sets of data were collected. The first is a set of pages describing the different units that exist in LAUTECH, such as the ICT Centre, the physical planning unit, the academic planning unit, the health centre, the sports development unit, and the works and maintenance unit.
The second is a set of pages describing some of the departments in LAUTECH, such as biology, physics, computer, chemistry, fine arts, general studies, accounting and earth science. Hence, the data fall into two categories.

Machine learning phase
WEKA is the machine learning tool used in this study for web document classification. The prepared data in .arff format was loaded into WEKA, and the Naïve Bayes algorithm was applied to it after the data had been converted from string to nominal form. In text classification, the number of terms drawn from the document content is usually very large, which leads to a document-term matrix with too many 0's. Hence, a subset of words (bag of words) that best represents the document collection with respect to the classification task was created. The process of removing the unwanted terms is called feature (attribute) selection. WEKA provides a good number of algorithms for this purpose, available through the attribute selection filter. Three file headers were used: document_name, document_content and document_class. The document_class attribute has two classes, A and B, displayed in two different colours: instances in blue belong to class A, while those in red belong to class B. The order of arrangement follows the order in which the web pages occur on the LAUTECH website.
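The bag-of-words and feature selection steps can be illustrated outside WEKA. The sketch below builds a binary document-term matrix and keeps the k words with the highest information gain. Information gain is an assumption made for illustration: the text does not state which attribute selection algorithm was applied. The function names and the four short sample texts are likewise hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a document and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def entropy(counts):
    """Shannon entropy (in bits) of a collection of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, word):
    """Information gain of the binary feature 'word occurs in the document'."""
    base = entropy(Counter(labels).values())
    with_word = [lab for d, lab in zip(docs, labels) if word in d]
    without = [lab for d, lab in zip(docs, labels) if word not in d]
    remainder = 0.0
    for subset in (with_word, without):
        if subset:
            remainder += len(subset) / len(labels) * entropy(Counter(subset).values())
    return base - remainder


def select_features(texts, labels, k):
    """Return the k words with the highest information gain (ties broken alphabetically)."""
    docs = [set(tokenize(t)) for t in texts]
    vocabulary = sorted(set().union(*docs))
    ranked = sorted(vocabulary, key=lambda w: (-information_gain(docs, labels, w), w))
    return ranked[:k]


def document_term_matrix(texts, features):
    """Binary matrix: entry is 1 if the feature word occurs in the document, else 0."""
    docs = [set(tokenize(t)) for t in texts]
    return [[1 if w in d else 0 for w in features] for d in docs]


# Hypothetical miniature corpus with two classes.
texts = [
    "ICT centre network services",
    "health centre clinic services",
    "physics department courses",
    "chemistry department courses",
]
labels = ["A", "A", "B", "B"]
selected = select_features(texts, labels, k=2)
matrix = document_term_matrix(texts, selected)
print(selected, matrix)
```

Restricting the matrix to the selected words removes most of the columns that are 0 for nearly every document, which is the sparsity problem described above.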

Results and Discussion
The network structure describes the structure of the data used. Each variable is followed by a list of its parents, so the class variable has parent document_class; the number in braces is the cardinality of the variable. It shows that there are three class variables in the dataset. All other variables are made binary by running them through a discretization filter.

Log result
The logarithmic score shows the logarithmic values of the network structure for various methods of scoring.

Stratified cross-validation
The stratified cross-validation shows 77% correctly classified instances, 23% incorrectly classified instances, 0 kappa statistic, 0.4838 mean absolute error, 0.3108 root mean squared error, 68.9937% relative absolute error and a total of 100 instances.
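Stratification means that each fold keeps roughly the same class proportions as the full data set. The sketch below shows one simple way to form 10 stratified folds for 100 instances split 66/34 between classes A and B (the class sizes given with the confusion matrix). It is an illustrative assumption, not WEKA's own procedure; the function name `stratified_folds` and the fixed seed are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split instance indices into k folds with roughly equal class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin, continuing where the previous class stopped,
        # so that fold sizes stay balanced as well as class proportions.
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds


labels = ["A"] * 66 + ["B"] * 34
folds = stratified_folds(labels)
for test_fold in folds:
    held_out = set(test_fold)
    train = [i for i in range(len(labels)) if i not in held_out]
    # A classifier would be trained on `train` and evaluated on `test_fold` here.
    print(len(train), len(test_fold))
```

Every instance is used for testing exactly once, and the reported statistics are aggregated over the ten held-out folds.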

Detailed accuracy by class
From Table 4, two important observations can be made. First, all attributes have only one of their values occurring in class B. This is indicated by the fact that one of the counts is always 1, which means that the actual count is 0 (under the Laplace estimator used by the algorithm, the actual value count is incremented by 1). Second, the confusion matrix indicates that 55 of 66 instances in class A were accurately classified, while 22 of 34 instances in class B were accurately classified. This shows the ability of WEKA to classify documents using the Naïve Bayes classifier.
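The summary statistics can be recomputed from the confusion matrix. The sketch below assumes the off-diagonal counts follow directly from the totals stated above (66 - 55 = 11 class A instances predicted as B, and 34 - 22 = 12 class B instances predicted as A). The function name `evaluate` is hypothetical.

```python
def evaluate(confusion, classes):
    """confusion[i][j] = number of instances of actual class i predicted as class j."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(classes)))
    accuracy = correct / total
    # Agreement expected by chance, from row (actual) and column (predicted) totals.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(len(classes))
    ) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    per_class = {}
    for i, name in enumerate(classes):
        predicted = sum(row[i] for row in confusion)
        actual = sum(confusion[i])
        per_class[name] = {
            "precision": confusion[i][i] / predicted,
            "recall": confusion[i][i] / actual,
        }
    return accuracy, kappa, per_class


# Rows are actual classes A and B; the off-diagonal counts are derived, not quoted.
confusion = [[55, 11], [12, 22]]
accuracy, kappa, per_class = evaluate(confusion, ["A", "B"])
print(accuracy, round(kappa, 4), per_class)
```

With these counts the accuracy is 0.77 and Cohen's kappa is approximately 0.4838; the per-class recall is 55/66 for class A and 22/34 for class B.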

Predictions on test data
In Table 5, for all documents from actual class B, the class distribution decisively predicts class B, which means that all the documents under class B were accurately classified. However, the predictions show that the six errors (marked with +) occur in actual class A. These forms show the first and second documents that were classified wrongly: both belong to class A but were wrongly classified as class B.

Conclusion
In this study, the strength of the Naïve Bayes classifier in classifying web documents was demonstrated, and WEKA, by virtue of its performance in classifying web documents, proved to be a good machine learning environment for web mining. The main strength of this approach lies in its ability to classify web documents into the right categories and to do so in a short time (reported as zero seconds).
The result obtained can be improved, to achieve higher web page classification accuracy, by combining Naïve Bayes with other techniques such as Support Vector Machine (SVM) and K-Nearest Neighbor (K-NN).
We can now attempt to estimate the two terms in equation (3) based on the training data. It is easy to estimate each of the $P(v_j)$ simply by counting the frequency with which each target value $v_j$ occurs in the training data. The Naïve Bayes classifier is based on the simplifying assumption that the attribute values are conditionally independent given the target value. In other words, the assumption is that, given the target value of the instance, the probability of observing the conjunction $a_1, a_2, \ldots, a_n$ is just the product of the probabilities for the individual attributes:

$$P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_{i} P(a_i \mid v_j) \tag{4}$$

Substituting this into equation (3), we have the Naïve Bayes classifier:

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i} P(a_i \mid v_j) \tag{5}$$

where $v_{NB}$ denotes the target value output by the Naïve Bayes classifier.
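Equation (5) can be sketched in a few lines of code. The example below is a minimal illustration, not WEKA's implementation: it estimates the class priors and the per-attribute conditional probabilities by counting, adds 1 to every count (the Laplace estimator mentioned in the results), and works in log space to avoid underflow. The class name `NaiveBayes` and the five training rows are hypothetical; each row holds two binary attributes.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Categorical Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, rows, targets):
        self.class_counts = Counter(targets)
        self.total = len(targets)
        # value_counts[(attribute index, class)][value] = frequency in training data
        self.value_counts = defaultdict(Counter)
        # values[attribute index] = set of distinct values seen for that attribute
        self.values = defaultdict(set)
        for row, v in zip(rows, targets):
            for i, a in enumerate(row):
                self.value_counts[(i, v)][a] += 1
                self.values[i].add(a)
        return self

    def log_score(self, row, v):
        # log P(v_j) + sum_i log P(a_i | v_j): equation (5) in log space
        score = math.log(self.class_counts[v] / self.total)
        for i, a in enumerate(row):
            count = self.value_counts[(i, v)][a] + 1
            denominator = self.class_counts[v] + len(self.values[i])
            score += math.log(count / denominator)
        return score

    def predict(self, row):
        # arg max over the target values seen in training
        return max(sorted(self.class_counts), key=lambda v: self.log_score(row, v))


# Hypothetical rows: (contains word 1, contains word 2) with class labels.
rows = [(1, 0), (1, 0), (1, 1), (0, 1), (0, 1)]
targets = ["A", "A", "A", "B", "B"]
model = NaiveBayes().fit(rows, targets)
print(model.predict((1, 0)), model.predict((0, 1)))
```

Because of the add-one smoothing, an attribute value never observed with a class still receives a small non-zero probability, so a single unseen value does not force the whole product in equation (5) to zero.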

Fig. 2. Graphical representation of form showing the six errors of the classifier

Clicking on the first two squares in the plot reveals these two documents, as shown in Fig. 3.