In contrast to the other functional techniques discussed so far, the main focus of this thesis is the review of Data Mining algorithms for Content Negotiation purposes. Many problems of language classification in Internet email were discussed in the previous chapters and sections. The interesting question now is whether the advantages of Data Mining with Machine Learning techniques outweigh the disadvantages. After discussing some basic terms, some popular ML algorithms will be introduced. Each algorithm traverses the same test scenarios to make the results comparable. The test scenarios were selected using worst case data in order to see the real strengths and weaknesses of each algorithm under extreme conditions. These conditions made the popular spam filter software SpamAssassin fail completely, even when using only two categories, Spam and Not-Spam.
Learning Paradigms
In order to be able to recognize previously unseen information, a Machine Learning system has to negotiate and build a World Model in a specific way. The following paradigms describe the main ways of doing so.
Neuronal Learning
Neuronal Learning follows the Nature Model. Patterns are stored in a Network of Neurons. A Neuron is the smallest atom in a Neuronal Network. In Information Technology (IT) terms, it has a couple of input channels, a threshold value, and an output channel. If the input data reaches the threshold value, the neuron sends an impulse (fire signal) to the output channel. The output channel of a neuron is connected to the input channels of other neurons, building a neuronal network. To be able to store knowledge, the neurons have to be connected in the correct way and configured with the correct threshold values. This exactly reflects the Nature Model.
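As a minimal illustration (the class layout and the weight values are assumptions made for the example, not part of any particular network), such a neuron can be modelled in a few lines of Java:

// Minimal model of a single neuron: weighted input channels, a threshold, one output.
public class Neuron {
    private final double[] weights;   // one weight per input channel
    private final double threshold;   // firing threshold

    public Neuron(double[] weights, double threshold) {
        this.weights = weights;
        this.threshold = threshold;
    }

    // Returns 1 (fire signal) if the weighted input sum reaches the threshold, else 0.
    public int fire(double[] inputs) {
        double sum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * inputs[i];
        }
        return sum >= threshold ? 1 : 0;
    }

    public static void main(String[] args) {
        Neuron n = new Neuron(new double[] {1.0, 1.0}, 2.0);
        System.out.println(n.fire(new double[] {1, 0})); // 0: threshold not reached
        System.out.println(n.fire(new double[] {1, 1})); // 1: threshold reached, the neuron fires
    }
}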
Statistical Learning
Statistical Learning builds its model from the statistical distribution of information in the training data. Analyzing input data and calculating variance factors, intervals, extreme values, and the like for selected subsets can greatly help to derive and predict patterns. This statistical conclusion builds the knowledge model. The methods for statistical learning run the gamut from statistics to computational complexity, combinatorics and geometry.
Stochastical Learning
Stochastical Learning, like Statistical Learning, trains on the distribution of information. The main (mathematical) focus lies on stochastic presumptions. In other words, the world model is analyzed stochastically, building up a knowledge model of presumptions. A machine trained on weather data might, for example, hold the presumption that when it is raining, the weather is 20 percent likely to turn nice.
Inductive Learning
Induction or Concept Learning has the task of inducing rules and categories by observing the world model (training data). When learning by example, the techniques must be capable of building class definitions dynamically on their own, because the classes (categories) may change rapidly or may not be adequate.
Empirical Learning
Empirical Learning is the process of examining and comparing a number of different pieces of information, selecting their common observable characteristics, and, based on this, deriving a general pattern for this class of information. As a subset of the Inductive Learning paradigm, this area of learning is often used for Natural Language Processing (NLP) problems.
Algorithms
After having introduced the main learning paradigms, this section gives an overview of the most common Machine Learning Algorithms and Techniques.
Naive Bayes
The Bayes Algorithm, based on Bayes' Theorem and Formula, is one of the most popular Machine Learning algorithms. The idea is as simple as the algorithm itself, and the learning and categorization success is sufficient. Many spam filters use it for Spam Negotiation purposes.
Bayes Theorem
The theorem of Bayes (see figure [bayesTheoremExample]) can be illustrated with a stochastic presumption: for example, the probability of a person having a certain rare blood group is 0.02 percent (prevalence). Assume there is a tool which quickly tests a person for this blood group, and the manufacturer guarantees a test success rate of 99 percent (sensitivity). Among 10,000 people, 0.02 percent, i.e. 2 people, should statistically have the blood group, and 9,998 people should not. The blood group test tool ensures a 99 percent success rate, with 1 percent false positives. Testing the two people who have the blood group will classify both correctly (at the 99 percent rate). Testing the 9,998 people who have other blood types will incorrectly classify about 100 persons (1 percent).
The induction based on this theorem allows the event-result chain to be inverted in order to calculate presumptions. This is called Reverse Induction. The well known formula of Bayes is used to calculate the probability that two or more pieces of information belong to a category. A really good example is the explanation by Paul Graham on his website.
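Written as a formula, Bayes' theorem states

    P(A | B) = P(B | A) · P(A) / P(B).

Applied to the blood group example above, with A standing for "has the rare blood group" and B for "the test is positive":

    P(A | B) = (0.99 · 0.0002) / (0.99 · 0.0002 + 0.01 · 0.9998) ≈ 0.019,

so even with a 99 percent accurate test, a positive result only means roughly a 2 percent chance of actually having the blood group (about 2 true positives against about 100 false positives).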
Naive Bayes
The Naive Bayes Algorithm strictly follows Bayes' Theorem and Formula. Despite its simplicity, the algorithm produces unexpectedly good results, even in worst case scenarios, and outperforms nearly every other Machine Learning algorithm in this respect. On the other hand, classifiers and learners using the Bayes concept suffer from a lack of runtime performance and have problems categorizing text with too many classes (see the test results in the next sections).
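To illustrate the underlying computation, the following hedged Java sketch scores a tokenized message against per-class word counts, using log probabilities and Laplace smoothing; the training data and word lists are invented for the example and are not related to the later experiments:

import java.util.*;

// Minimal Naive Bayes scoring: pick the class c maximizing log P(c) + sum over words of log P(w|c).
public class NaiveBayesSketch {

    // Hypothetical training statistics: word counts per class and documents per class.
    static Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    static Map<String, Integer> docCounts = new HashMap<>();
    static Set<String> vocabulary = new HashSet<>();

    static void train(String clazz, String[] words) {
        docCounts.merge(clazz, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(clazz, k -> new HashMap<>());
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
            vocabulary.add(w);
        }
    }

    static String classify(String[] words) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        int totalDocs = docCounts.values().stream().mapToInt(Integer::intValue).sum();
        for (String clazz : docCounts.keySet()) {
            Map<String, Integer> counts = wordCounts.get(clazz);
            int classTotal = counts.values().stream().mapToInt(Integer::intValue).sum();
            double score = Math.log((double) docCounts.get(clazz) / totalDocs); // log prior
            for (String w : words) {
                int c = counts.getOrDefault(w, 0);
                // Laplace smoothing avoids zero probabilities for unseen words.
                score += Math.log((c + 1.0) / (classTotal + vocabulary.size()));
            }
            if (score > bestScore) { bestScore = score; best = clazz; }
        }
        return best;
    }

    public static void main(String[] args) {
        train("spam", new String[] {"buy", "cheap", "offer"});
        train("ham",  new String[] {"meeting", "tomorrow", "offer"});
        System.out.println(classify(new String[] {"cheap", "offer"})); // expected: spam
    }
}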
Complement Naive Bayes
The Complement Naive Bayes Algorithm is an improved version of the Naive Bayes Algorithm which focuses on a better calculation of textual information (String to Word Vector Transformation). The result is an algorithm which is still entirely based on the theorem of Bayes, but which categorizes textual information more effectively and efficiently.
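As a rough sketch of the idea (leaving out the weight normalization refinements of the full method), the complement variant estimates word probabilities from all classes except c and then assigns the class against which the document carries the least evidence:

    θ(w | not c) = (N_{not c, w} + 1) / (N_{not c} + |V|)

    class(d) = argmin over c of  Σ_{w in d} f_w · log θ(w | not c)

where N_{not c, w} is the number of occurrences of word w in training documents outside class c, N_{not c} is the total word count of those documents, |V| is the vocabulary size, and f_w is the frequency of w in the document d.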
R.I.P.P.E.R.
The Repeated Incremental Pruning to Produce Error Reduction Algorithm (RIPPER), proposed by William W. Cohen as an optimized version of the IREP Algorithm, is a pruning rule learner. It works as follows:
Initialize the ruleset, and for each class, from the least prevalent one to the most frequent one, DO:
- Building Stage: Repeat the Grow and Prune phases until the description length (DL) of the ruleset and examples is 64 bits greater than the smallest DL met so far, or there are no positive examples left, or the error rate is greater than or equal to 50 percent.
- Grow Phase: Grow one rule by greedily adding antecedents (or conditions) to the rule until the rule is perfect (i.e. 100 percent accurate). The procedure checks every possible value of each attribute and selects the condition with the highest information gain (see the gain formula after this listing).
- Prune Phase: Incrementally prune each rule and allow the pruning of any final sequences of the antecedents.
- Optimization Stage: After generating the initial ruleset, generate and prune two variants of each rule from randomized data using the Grow and Prune phases. One variant is generated from an empty rule, while the other is generated by greedily adding antecedents to the original rule; moreover, the pruning metric used here is (TP + TN) / (P + N). Then the smallest possible DL for each variant and for the original rule is computed. The variant with the minimal DL is selected as the final representative of the rule in the ruleset. After all rules have been examined, and if there are still residual positive examples, more rules are generated based on the residual positives, using the Building Stage again.
- Delete any rule from the ruleset that would increase the DL of the entire ruleset if it were kept, and add the resulting ruleset to the overall ruleset.
END DO
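The information gain mentioned in the Grow Phase is the gain known from first-order rule learners such as FOIL: if the rule covers p positive and n negative examples before a candidate condition is added, and p' and n' afterwards, the gain of the condition is

    Gain = p' · ( log2( p' / (p' + n') ) − log2( p / (p + n) ) ),

so conditions that keep many positives while excluding negatives are preferred.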
The RIPPER algorithm, one of the greatest competitors of the Naive Bayes Algorithm, follows a concept of rule integrity.
J48 Decision Tree Inducing
The J48 algorithm is an implementation of the original C4.5 Decision Tree Learning Algorithm invented by Ross Quinlan of the University of Sydney. The data model is a decision tree in which nodes embody decisions based on the values of attributes, and the leaves of the tree represent predictions (category probability). The example (figure [decisiontree]) visualizes the tree model of the decision whether to choose the cabriolet or the coupe, depending on the weather. At each node, the branch with the higher attribute weight marks the preferred path towards the decision. Looking at the nodes of Decision Trees (DT), the similarity with a Bayes Classifier is obvious. Using probability (weight) values to determine the correct categorization of an information atom can be implemented in different ways, which allows for countless variants of the DT algorithm.
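To make the node and leaf idea concrete, the following minimal Java sketch hard-codes a tree in the spirit of the cabriolet/coupe example; the concrete attributes and split values are invented for illustration and are not taken from the figure:

// A tiny hard-coded decision tree: inner nodes test attribute values, leaves return the prediction.
public class CarDecisionTree {

    static String decide(String outlook, double temperature) {
        if (outlook.equals("sunny")) {               // root node: test the outlook attribute
            if (temperature >= 18.0) {               // inner node: test the temperature attribute
                return "cabriolet";                  // leaf: prediction
            }
            return "coupe";
        }
        return "coupe";                              // rain or overcast: take the coupe
    }

    public static void main(String[] args) {
        System.out.println(decide("sunny", 25.0));   // cabriolet
        System.out.println(decide("rainy", 25.0));   // coupe
    }
}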
K-Nearest Neighbors
The K-Nearest Neighbors Algorithm (see figure [knearestcluster]) differs from the usual Training – Learn – Think concept of machine learning algorithms. The training data is simply stored in an initial representation, without the algorithm processing it. The learning process and the categorization process happen at the same time. At so-called Prediction Time the algorithm compares the new information with the stored training set in the database: the K nearest stored instances are queried and examined. If the new information differs too much from all of them, a new information type is opened. This gives the algorithm its real time learning ability. Due to its dynamic nature the algorithm can produce good categorization results. On the other hand, it is possible that the algorithm fails because of insufficient knowledge of new types.
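The following Java sketch shows the prediction time behaviour, assuming numeric feature vectors and Euclidean distance (both assumptions made for the example): the training instances are simply stored, and a new instance receives the majority class among its K nearest stored neighbours.

import java.util.*;

// Minimal K-Nearest Neighbors: store training vectors, classify only at prediction time.
public class KNearestNeighbors {
    static class Example {
        double[] features; String label;
        Example(double[] f, String l) { features = f; label = l; }
    }

    private final List<Example> training = new ArrayList<>();
    private final int k;

    KNearestNeighbors(int k) { this.k = k; }

    void store(double[] features, String label) { training.add(new Example(features, label)); }

    String classify(double[] query) {
        // Sort the stored examples by Euclidean distance to the query (mutates the list; fine for a sketch).
        training.sort(Comparator.comparingDouble(e -> distance(e.features, query)));
        Map<String, Integer> votes = new HashMap<>();
        for (Example e : training.subList(0, Math.min(k, training.size()))) {
            votes.merge(e.label, 1, Integer::sum);
        }
        // Return the majority class among the K nearest neighbours.
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    private static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        KNearestNeighbors knn = new KNearestNeighbors(3);
        knn.store(new double[] {1, 1}, "private");
        knn.store(new double[] {1, 2}, "private");
        knn.store(new double[] {8, 9}, "spam");
        knn.store(new double[] {9, 8}, "spam");
        System.out.println(knn.classify(new double[] {2, 1})); // expected: private
    }
}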
Logistic Regression
The Linear Regression Model is typically stated in the form

    y = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ + ε.

The right hand side may take other forms, but generally comprises a Linear Combination of the parameters β₀, …, βₚ. The term ε represents the unpredicted or unexplained variation in the dependent variable; it is conventionally called the error, whether it is really a measurement error or not. The error term is conventionally assumed to have an expected value of zero, as a nonzero expected value could be absorbed into the constant term β₀.
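Logistic Regression, as used in the later experiments, wraps such a linear combination in the logistic function, so that the output can be read as a class probability:

    P(y = 1 | x) = 1 / (1 + e^{−(β₀ + β₁x₁ + … + βₚxₚ)}).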
P.A.R.T.
The PART Algorithm is a Decision List Inducer generating a list of IF-THEN clauses to match and categorize data. The categorization of information is always ensured, because the last clause is a general rule (Default Else) which matches every kind of information. Reaching the last rule notifies the algorithm that rule improvement is required. The top-down simplicity can lead to precise categorization results. On the other hand, the PART Algorithm can suffer from Information Overload when it accumulates too many IF-THEN clauses with similar classes. The run time of the classification grows roughly linearly with the number of rules, and therefore increases with large amounts of training data and many different classes.
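A hedged Java sketch of such a decision list for the email categories of the later experiment; the individual rules are invented for illustration and are not the clauses PART actually induced:

import java.util.Set;

// A decision list: ordered IF-THEN clauses, closed by a general default rule.
public class EmailDecisionList {

    static String categorize(Set<String> words) {
        if (words.contains("viagra") || words.contains("lottery")) return "spam";
        if (words.contains("invoice") && words.contains("customer")) return "business";
        if (words.contains("lecture") || words.contains("exam"))    return "university";
        // Default Else: matches every remaining email and signals that rule improvement may be needed.
        return "private";
    }

    public static void main(String[] args) {
        System.out.println(categorize(Set.of("exam", "schedule")));   // university
        System.out.println(categorize(Set.of("holiday", "photos"))); // private (default rule)
    }
}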
Multilayer Perceptron
The Multilayer Perceptron Algorithm is an Artificial Neuronal Network (ANN). Based on the Neuronal Learning Paradigm (see [neuronalparadigm]) the algorithm uses a network of neurons to learn and store pattern information. The Multilayer Perceptron Algorithm is the most popular ANN, also called Multilayer Feed-forward Neural Network (MFNN).
Multilayer Perceptron Example Application
For a better understanding of a multilayer neuronal network, a simplified version of a MFNN was trained on the XOR function (see figure [nnXOR]). This simple network is already wired, and the four neurons in it already have the correct threshold values (1 and 2). Given the input (0, 0), none of the lower three neurons fire, because none of the threshold values are reached. The upper Output Neuron does not fire either, because none of the lower ones fire and its threshold of 1 is not reached. Given the input (1, 1), all of the lower three neurons fire, because all threshold values are met; the upper neuron nevertheless does not fire, because its threshold value of 1 is exceeded. Given the input (0, 1) or (1, 0), one of the lower right or left neurons fires. The threshold value of the upper Output Neuron is met, causing it to fire, too.
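Since the exact wiring of the figure is not reproduced here, the following Java sketch uses one common threshold-unit wiring that computes XOR with the threshold values 1 and 2: an OR unit, an AND unit, and an output unit that fires only when the OR unit is active and the AND unit is not.

// XOR with three threshold units: h1 = OR(x1, x2), h2 = AND(x1, x2), out = h1 AND NOT h2.
public class XorNetwork {

    // A threshold unit fires (returns 1) when its weighted input sum reaches the threshold.
    static int unit(double[] weights, double threshold, int... inputs) {
        double sum = 0.0;
        for (int i = 0; i < inputs.length; i++) sum += weights[i] * inputs[i];
        return sum >= threshold ? 1 : 0;
    }

    static int xor(int x1, int x2) {
        int h1 = unit(new double[] {1, 1}, 1, x1, x2);   // fires for (0,1), (1,0) and (1,1)
        int h2 = unit(new double[] {1, 1}, 2, x1, x2);   // fires only for (1,1)
        return unit(new double[] {1, -1}, 1, h1, h2);    // fires when h1 fires and h2 does not
    }

    public static void main(String[] args) {
        System.out.println(xor(0, 0) + " " + xor(0, 1) + " " + xor(1, 0) + " " + xor(1, 1)); // 0 1 1 0
    }
}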
Support Vector Machines
The Support Vector Machine Algorithm (SVM) is a new generation learning algorithm based on advances in statistical learning theory. It is applied to a large number of real world applications, such as text categorization. When used in a categorization and learning context, the SVM generates a hyperplane separating the training classes with a maximum margin distance. Motivated by Vapnik–Chervonenkis theory, a probabilistic bound on the test error is provided, which is minimized when the margin is maximized. Generating a Maximum Margin Hyperplane from given training data is a linear quadratic optimization problem. Several algorithms solve this problem, including J. Platt's Sequential Minimal Optimization Algorithm (SMO), which is used for the later experiments.
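Written out, the basic linearly separable case amounts to the quadratic optimization problem of finding the hyperplane w · x + b = 0 with

    minimize   (1/2) ||w||²
    subject to y_i (w · x_i + b) ≥ 1   for every training example (x_i, y_i), y_i ∈ {−1, +1},

where the margin equals 2 / ||w||, so minimizing ||w|| maximizes the margin.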
SVMs deliver state-of-the-art performance in real-world applications such as text categorization, hand-written character recognition, image classification, bio sequence analysis, etc. Their first introduction in the early 1990s led to an explosion of applications and deepening theoretical analysis that has now established Support Vector Machines, along with neural networks, as one of the standard tools for machine learning and data mining.
The WEKA Framework
The WEKA Machine Learning Framework is a project of the University of Waikato, New Zealand. It implements the most popular Machine Learning methods in a Java framework. A graphical user interface (GUI) and a simple command line interface (CLI) allow training and tests to be run comfortably on the various algorithms.
ARFF File Format
Training and classification data experiments using the WEKA Framework are based on the ARFF File Format. This is a simple text file format containing a header section describing the relation and type of the data in the data section. In WEKA terminology, keys are called attributes, categories are called classes, and atoms of training data are called instances. A simple ARFF file for an XOR function training example is:
@relation XOR_Training_Example
@attribute input1 numeric
@attribute input2 numeric
@attribute class {?, true, false}
@DATA
0, 0, false
0, 1, true
1, 0, true
1, 1, false
See also the simple ARFF file generator in the appendix section [ARFFGEN].
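As a small, hedged example of how such an ARFF file can be used from the WEKA Java API (the file name is an assumption), the following program loads the data, trains a Naive Bayes classifier, and classifies the first instance:

import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;

// Load an ARFF file, train a WEKA classifier, and classify one instance.
public class WekaXorExample {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(new BufferedReader(new FileReader("xor_training.arff")));
        data.setClassIndex(data.numAttributes() - 1);       // the last attribute is the class

        NaiveBayes classifier = new NaiveBayes();
        classifier.buildClassifier(data);                   // training

        double prediction = classifier.classifyInstance(data.instance(0));
        System.out.println("Predicted class: " + data.classAttribute().value((int) prediction));
    }
}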
Algorithm Contest
The major problems of Content Negotiation in Internet Mail and the need for Artificial Intelligence solutions have been introduced and discussed in the first chapters. Now the most popular approaches and techniques have to prove their value in a detailed experiment. Each algorithm mentioned above is trained and tested using a worst case scenario; in this case, the worst case scenario is the training and categorization of Internet Email.
Laboratory Environment
The experiment uses presorted email of the categories Business, Private, University and Spam. The email data is converted into the ARFF file format in steps of 50 training instances: 50, 100, 150, … up to 500 instances. The ARFF attributes are the 150 most frequently used words across all emails. For the worst case categorization experiment, a set of 100 presorted emails was generated in the same way as the training data. The training and test data generation was done on a Mac OS X PowerPC G4 1.25 GHz with 512 MB of RAM and finished after 43 minutes.
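The word attributes for these files were produced by dedicated conversion scripts; as a hedged sketch, a comparable transformation could also be obtained with WEKA's StringToWordVector filter (the file name and the parameter choice below are assumptions):

import java.io.BufferedReader;
import java.io.FileReader;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

// Turn raw email text (a string attribute) into word-based attributes, keeping roughly 150 words.
public class WordVectorSketch {
    public static void main(String[] args) throws Exception {
        Instances raw = new Instances(new BufferedReader(new FileReader("emails_raw.arff")));
        raw.setClassIndex(raw.numAttributes() - 1);

        StringToWordVector filter = new StringToWordVector();
        filter.setWordsToKeep(150);          // roughly the 150 most frequent words
        filter.setInputFormat(raw);

        Instances vectorized = Filter.useFilter(raw, filter);
        System.out.println("Attributes after filtering: " + vectorized.numAttributes());
    }
}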
Each introduced ML algorithm is trained on every training instance file, step by step. After that, each algorithm tries to categorize each instance of the training set; this shows whether the algorithm is able to derive unique rules and patterns. In the second part of the experiment each algorithm has to categorize the unknown set of emails, which shows how well an algorithm can make use of the training data.
The experiment itself was run using the WEKA Framework on an AMD Sempron 3000 Linux system with 1 GB of RAM. The whole experiment took 117 minutes to finish.
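As a hedged sketch of how the two success rates of the following contest can be obtained programmatically with WEKA (the file names and the choice of classifier are assumptions, not the actual experiment scripts):

import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;

// Train on one ARFF training file, then measure success on the training set and on unseen emails.
public class ContestEvaluationSketch {
    public static void main(String[] args) throws Exception {
        Instances train = new Instances(new BufferedReader(new FileReader("train_500.arff")));
        Instances test  = new Instances(new BufferedReader(new FileReader("unknown_100.arff")));
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        NaiveBayes classifier = new NaiveBayes();        // any WEKA classifier can be plugged in here
        classifier.buildClassifier(train);

        Evaluation trainEval = new Evaluation(train);
        trainEval.evaluateModel(classifier, train);      // part 1: re-classify the training data
        Evaluation testEval = new Evaluation(train);
        testEval.evaluateModel(classifier, test);        // part 2: classify the unknown emails

        System.out.println("Training success: " + trainEval.pctCorrect() + " percent");
        System.out.println("Classification success: " + testEval.pctCorrect() + " percent");
    }
}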
Training and Classification Experiment
The Naive Bayes Algorithm (see figure [bayes.naiveBayes]) produces irregular training success rates, which decrease with a rising amount of training data. In contrast, the classification success rate on unknown emails remains rather constant at about 42 percent overall. A lack of runtime performance was noticeable.

The Complement Naive Bayes Algorithm (see figure [bayes.complementnaiveBayes]) is optimized for text classification. This shows in the training success rate, with rather constant values between 65 and 75 percent. The classification of unknown emails is slightly better. The runtime is twice as fast as in the simple Naive Bayes test.

The RIPPER Algorithm (see figure [rules.JRip]), known as one of the strongest competitors of the Bayes Algorithm, shows a clearly better training performance at a constant 90 percent. But the Naive Bayes Algorithm, and especially the Complement Naive Bayes Algorithm, definitely outperforms RIPPER when classifying unknown information.

Training using the J48 Decision Tree Inducer Algorithm (see figure [trees.J48]) is 93 percent effective overall. This is a very good result and shows that J48 is able to build strong and unique decision trees with just a few training instances. The classification success on unknown email is rather average, and even decreases with higher amounts of training.

The dynamic nature of the K-Nearest Neighbor Algorithm shows up nicely in this experiment (see figure [lazy.IBk]). The algorithm learns dynamically during classification and yields constant success values, no matter whether the training data or the unknown information is classified.

The Logistic Regression Algorithm (see figure [functions.Logistic]) produces slightly worse results than the K-Nearest Neighbor Algorithm. The training test was a real success at over 95 percent, but the classification of unknown information suffers from irregularities and worse results.

Training using the PART Algorithm (see figure [rules.PART]) yields success rates of 95 percent. The classification success alternates irregularly between 45 and 52 percent. The decreasing success tendency when classifying unknown information may be caused by too many If-Then clauses of the PART learner, so PART itself seems to suffer from Information Overload in the NLP context.

Neuronal Networks are known to produce good categorization results, but they are also known to produce untraceable irregularities. The Multilayer Perceptron Algorithm (see figure [multilayerPerceptron]) has major training problems between 200 and 300 training instances. This may be the result of a new training progression in the neuronal network which destroys, or conflicts with, already learned structures. The interesting aspect of this training failure is that the classification at that point is rather successful.

The SMO Support Vector Machine Algorithm (see figure [functions.SMO]) produces rather poor training results, starting at an 80 percent success rate. The slowly but constantly increasing success rate suggests that the SMO simply needs more training. The parallelism of the two curves, the training and the classification success rate, demonstrates the efficient use of the learned data.
Training Overall
Plotting and comparing all training results (see figure [allTraining]) clearly shows that the K-Nearest Neighbor and Logistic Regression algorithms are the best at learning and recognizing trained information.
The popular Naive Bayes and Complement Naive Bayes algorithms lose the contest.
Classification Overall
In contrast to the training contest, the Complement Naive Bayes algorithm clearly outperforms all other algorithms in classification. The interesting result is that the Naive Bayes loses again, even though its successor wins. The SMO Support Vector Machine, one of the losers in this experiment, shows an increasing tendency and warrants further investigation and research.
The categorization success rate of 50 percent is a success in this experiment, given that the static approaches failed the experiment with a 0 percent success rate. Therefore AI and ML seem to be a great improvement for Content Negotiation in Internet Mail.
Conclusion and Outlook
The Machine Learning approaches show promise. The training and classification experiments have proven that it is possible to improve the categorization process. The results are still not sufficient, but they show a tendency. Therefore, the topic of Artificial Intelligence for Email Classification warrants further research and investigation, and its progress will be interesting to follow.
Nearly all of the introduced algorithms can be combined with others, which opens up a nearly endless range of variations to research.
In retrospect, using the WEKA Machine Learning Framework greatly simplified the experiments. The major problem of experiments with ML algorithms is the runtime performance and the need for large amounts of memory and computing power. This clearly overstrains private personal computers.
The results of the experiments can only be considered rather theoretical. Unfortunately, there are still implementation problems with different email formats, natural language, and the usual problems with email threads. Therefore the experiment results show only a trend.
The human being will never be superfluous when classifying email; computers can only assist.
Large amounts of implementation work and the requirement for high-end equipment often make AI unattractive. AI is a rather young topic of computer science; new questions and techniques arise day by day, making the topic really interesting. The fact that the Complement Naive Bayes Algorithm is the winner of the experiment shows that good results can be achieved with simple approaches. Looking at the current trend, Support Vector Machines receive rave reviews, although this did not show up in the experiments' results.
It doesn’t have to be that complicated. Looking at the experiments’ results it is reasonable to examine the mathematical and logical background concepts of the mentioned Nature Model. These concepts must be translated to Machine Model. This translation is a technical problem. It seems to be the only barrier, but the worst at the same time.
In the face of Information and Internet Mail Overload, it would be frivolous to disregard the methods of AI. The problems of Information Overload increase rapidly, day by day. Now is the time to counter this trend, because the introduced approaches are by all means realizable.
Finally, Natural Language Processing (NLP) and Classification are not problems of email alone; they are only a subtopic of a large area of applications such as Document Classification, Search Engines, Data Warehousing, and even Teleservice Control in the area of Terror Abatement.