Content Negotiation, as a topic of Computational Linguistics, has developed considerably in recent years, though with repeated setbacks. The main difficulty has proved to be the complexity of human language, which is the subject of semiotics (see [semiotic]).
Machine Versus Human Language
Machine language follows a strict syntax with unambiguous semantics and consistently avoids pragmatics. Computers were built to help human society with structured (and often repetitive) tasks, and their human interface relies on simple devices such as keyboards and screens. If human language were entered into a program, the computer would not understand it. Only commands are understood, and that is what computers were built for. An intellectual (human) discussion with a computer is impossible. Even if the computer could handle the syntax of the language, the semantic and pragmatic problems would prevail.
Example Problems
Syntactic Problems
How mani megabites are laft on your hartdrive? This question is full of syntax errors. Every human being is able to process and understand the question even though it is not syntactically correct. It would take considerable effort to make a computer program tolerant enough to find the words actually meant.
Wieviel Platz ist noch auf Deiner Festplatte? (German: How much space is left on your hard drive?) Changing the language of the question violates the syntax, though many human beings would be able to get the meaning without fully knowing the language. Computers will not.
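The tolerance a human reader shows towards the misspelled question can be approximated, to a very limited degree, by fuzzy string matching. The following minimal Python sketch is an illustration only and not part of any described system. It assumes a tiny hypothetical vocabulary (VOCABULARY) and a hypothetical helper (correct_words), and it uses the standard library function difflib.get_close_matches to map each misspelled word to its closest known word.

```python
import difflib
import re

# Hypothetical vocabulary for illustration only; a real system would need
# a full dictionary and would still fail on other languages.
VOCABULARY = ["how", "many", "megabytes", "are", "left", "on", "your", "harddrive"]


def correct_words(question, vocabulary=VOCABULARY, cutoff=0.6):
    """Replace each word by its closest vocabulary entry, if one is similar enough."""
    words = re.findall(r"[a-zäöüß]+", question.lower())
    corrected = []
    for word in words:
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        # Keep the word unchanged when nothing in the vocabulary is close.
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)


print(correct_words("How mani megabites are laft on your hartdrive?"))
# prints: how many megabytes are left on your harddrive
```

The sketch only repairs spelling against words it already knows. With this vocabulary, the words of the German question are not expected to find a close match and pass through unchanged, which illustrates why spelling tolerance alone does not solve the problem.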
Semantic Problems
How many megabytes are left on your harddrive? The semantics behind this question is a request to output the free disk space, in megabytes, left on the harddrive. Based on their experience, human beings would immediately conclude that the question refers to limited disk space, so for them this is the only possible meaning. A computer cannot draw on experience. If the computer program could somehow understand the question, it would have to choose between several scenarios:
- The number of megabytes does not change when data is saved on the harddrive, only the number of megabytes allocated -> consequently the answer is 100 percent
- What is a megabyte? The computer only knows bits and bytes. Megabytes are a human view of the data.
- How many megabytes are left? -> All, nobody has stolen any
- Megabytes? -> Zero, there are only bytes on my disk
- What is a harddrive? The computer only knows nodes and devices.
- Your harddrive? -> I don’t own a harddrive – you purchased it
Pragmatic Problems
How many megabytes are left on your harddrive? As with the semantic problems of this question, there are two possible meanings, one of which human beings rule out by experience. The two possible meanings (tasks) could be:
- What is the difference between the total number of bytes on the harddrive and the number of bytes already allocated? Convert the result to megabytes.
- What is the total number of bytes on the harddrive? Convert the result to megabytes.
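The two readings listed above lead to two different computations. As a minimal sketch, assuming Python and its standard library function shutil.disk_usage, they could be written as follows. The function names and the choice of binary megabytes (1024 * 1024 bytes) are assumptions made for illustration.

```python
import shutil

BYTES_PER_MEGABYTE = 1024 * 1024  # assumption: binary megabytes


def megabytes_left(path="."):
    """First reading: total bytes minus allocated bytes, in megabytes."""
    usage = shutil.disk_usage(path)
    return (usage.total - usage.used) // BYTES_PER_MEGABYTE


def megabytes_total(path="."):
    """Second reading: total bytes on the drive, in megabytes."""
    usage = shutil.disk_usage(path)
    return usage.total // BYTES_PER_MEGABYTE


print(megabytes_left(), megabytes_total())
```

Both functions are trivial to implement. The difficulty described here lies in deciding which of them the question asks for.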
The man choked the man with the cravat. This statement is the standard example sentence when talking about pragmatics in computational linguistics research. Even for human beings it is difficult to determine the correct meaning of the sentence.
- The man choked the man who was wearing the cravat.
- The man choked the man, using a cravat to do so.
The second meaning seems more plausible because of the association between choking and a cravat. This conclusion rests on human knowledge and experience, which is difficult to implement in computers.
General Content Negotiation Problems
As discussed in the previous sections, Content Negotiation suffers from the difference between machine and human language. It is not only necessary to solve the semiotic problems of a language; the understanding of language itself calls for new approaches. Purely functional algorithms do not fit the situation. Rather, complex (intelligent) algorithms are needed to bridge the differences between computer and human language. This approach leads directly to topics such as Computational Linguistics supported by Artificial Intelligence. Intelligent approaches can be fault tolerant and can be deterministic in finding a sufficient solution. Using knowledge-based (experience-based) algorithms often means a complex footprint and high hardware requirements.
The implementation effort for such algorithms can be very large.
Problems Specific to Internet Mail
Beyond the general content negotiation problems, the negotiation of Internet Mail content is even more complex. The example email thread below outlines a typical worst case that occurs every day in ordinary email use:
A. Email sent by Hans on Thursday
Hi Martina, Joe and Marc,
Tomorrow I will be out of office. If there are questions regarding
our project software call my mobile. If Mr. Chapel calls in tell him
to ask Chris. He is involved in that project, as well.

Cheers, Hans
--
Hans Meyers
CEO / HM Software Services
B. On the same day, Joe answers
Hans,
please don't forget to send your approval for the funding of the
REPO project.

> Hi Martina, Joe and Marc,
> Tomorrow I will be out of office. If there are questions regarding
> our project software call my mobile. If Mr. Chapel calls in tell him
> to ask Chris. He is involved in that project, as well.
>
> Cheers, Hans
> --
> Hans Meyers
> CEO / HM Software Services
C. Hans answers immediately
FYI: Done.
> Hans,
> please don't forget to send your approval for the funding of the
> REPO project.
>
> > Hi Martina, Joe and Marc,
> > Tomorrow I will be out of office. If there are questions regarding
> > our project software call my mobile. If Mr. Chapel calls in tell him
> > to ask Chris. He is involved in that project, as well.
> >
> > Cheers, Hans
> > --
> > Hans Meyers
> > CEO / HM Software Services
D. On Friday, Martina answers
Hans,
did you really think we wouldn't notice?
H A P P Y B I R T H D A Y !
We wish you all the best,
Joe, Marc and Martina
> Hi Martina, Joe and Marc,
> Tomorrow I will be out of office. If there are questions regarding
> our project software call my mobile. If Mr. Chapel calls in tell him
> to ask Chris. He is involved in that project, as well.
>
> Cheers, Hans
> --
> Hans Meyers
> CEO / HM Software Services
Assume the content-negotiating classifier works and knows the five categories mentioned earlier ([emailtypes]). Email A would then be correlated with the category Business, Project or Notification. Email B is a Business email but would incorrectly be correlated with the Project category (funding of projects is business). Hans's acknowledgement, email C, is consequently business, too, but it is a notification as well; it will incorrectly be correlated with the Project category. Email D is unambiguously a private email sent via a business account, yet it could incorrectly be sorted into the Project category. The following sections point out the major issues regarding Internet Mail and Content Negotiation.
Email Use Issues
The use of the medium Internet Mail is not controlled, either by software or by any human being, and is therefore completely individual. It does not follow any rules except those of the email protocol itself.
Foreign Subject Fields
The Subject of an email is not a mandatory field. It is often left blank or polluted by words that have nothing to do with the intention and content of the email. The subject can be changed manually at each step of a thread's life cycle. Therefore the Subject cannot be relied on as a source of information or as a primary key for referencing threads.
Foreign Signatures
Using signatures means polluting the information with redundancy. When a classifier correctly correlates two different mails (using the same signature) into one category, it is difficult (and potentially unreliable) to determine whether the classifier recognized the relatedness of the content or merely noticed the two identical signatures.
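One way to reduce this redundancy is to cut off the signature before classification. The following Python sketch is a simple heuristic under a stated assumption: the signature starts at a line consisting only of "--", as in example email A. The function name strip_signature is hypothetical, and signatures without such a delimiter line are not detected.

```python
def strip_signature(body):
    """Return the body without the signature block, if a delimiter line is found."""
    kept = []
    for line in body.splitlines():
        # Assumption: a line consisting only of "--" (optionally followed by
        # whitespace) starts the signature, as in example email A.
        if line.rstrip() == "--":
            break
        kept.append(line)
    return "\n".join(kept).rstrip()


# Example email A from the thread above.
EMAIL_A = """Hi Martina, Joe and Marc,
Tomorrow I will be out of office. If there are questions regarding
our project software call my mobile. If Mr. Chapel calls in tell him
to ask Chris. He is involved in that project, as well.

Cheers, Hans
--
Hans Meyers
CEO / HM Software Services
"""

print(strip_signature(EMAIL_A))
```

The printed text ends with the line "Cheers, Hans"; the name and company lines of the signature no longer reach the classifier.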
Quoting Noise
When an email is replied to, a thread is opened. Many mail user agents (MUAs) paste the text of the initial mail, quoted, into the reply. This text can be considered noise when new information is added to the reply (see example emails B to D). The quoted text is rated, too, and causes noisy results.
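A common heuristic against quoting noise is to drop every line that begins with the quote marker ">" before rating the text. The sketch below assumes Python and the ">" convention used in the example emails. The function name strip_quoted_lines is hypothetical. Attribution lines and quoting styles that do not use ">" are not handled.

```python
def strip_quoted_lines(body):
    """Keep only the lines that are not quoted with a leading '>'."""
    kept = [line for line in body.splitlines() if not line.lstrip().startswith(">")]
    return "\n".join(kept).strip()


# Example email C from the thread above.
EMAIL_C = """FYI: Done.
> Hans,
> please don't forget to send your approval for the funding of the
> REPO project.
>
> > Hi Martina, Joe and Marc,
> > Tomorrow I will be out of office. If there are questions regarding
> > our project software call my mobile. If Mr. Chapel calls in tell him
> > to ask Chris. He is involved in that project, as well.
> >
> > Cheers, Hans
> > --
> > Hans Meyers
> > CEO / HM Software Services
"""

print(strip_quoted_lines(EMAIL_C))
# prints: FYI: Done.
```

Removing the quotes takes away the noise, but what remains of email C is only "FYI: Done.", which on its own carries almost no information to classify.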
Insufficiency
Many emails contain only short texts because the recipient often already knows the topic. As in a human conversation, an email like the following contains no reference to a specific topic or category:
Hi Joe,
I do agree with your opinion. I guess we can start now…
From this text alone, a machine cannot determine what was meant or which category the email belongs to.
References
References to earlier messages in a thread may cause confusion, too. A reply that consists only of Yes or No supplies no information for categorization. In this case the complete email thread has to be looked up, and the computer has to decide whether the Yes or No changes the topic of the thread.
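Looking up the complete thread can be sketched with the Message-ID and In-Reply-To header fields of the Internet message format. The Python example below is a minimal illustration under assumptions: the two messages and their addresses are invented, the mailbox is a plain dictionary keyed by Message-ID, and the function name thread_of is hypothetical. Many real messages lack these headers or carry wrong values, so a lookup like this will not always succeed.

```python
from email import message_from_string

# Hypothetical messages for illustration; note that the reply changed the subject.
RAW_MESSAGES = [
    "Message-ID: <1@example.com>\nSubject: Project software\n\n"
    "Shall we start on Monday?\n",
    "Message-ID: <2@example.com>\nIn-Reply-To: <1@example.com>\n"
    "Subject: Re: lunch?\n\nYes\n",
]


def thread_of(message_id, mailbox):
    """Follow In-Reply-To headers back to the first known message of the thread."""
    chain = []
    seen = set()
    current = message_id
    while current in mailbox and current not in seen:
        seen.add(current)  # guard against reference loops
        msg = mailbox[current]
        chain.append(msg)
        current = (msg["In-Reply-To"] or "").strip()
    return list(reversed(chain))  # oldest message first


mailbox = {}
for raw in RAW_MESSAGES:
    parsed = message_from_string(raw)
    mailbox[parsed["Message-ID"]] = parsed

bodies = [m.get_payload().strip() for m in thread_of("<2@example.com>", mailbox)]
print(bodies)
# prints: ['Shall we start on Monday?', 'Yes']
```

The lookup only brings the earlier text back into view. Deciding whether the reply "Yes" changes the topic of the thread remains the hard part.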
Email Protocol Issues
Beyond the email use issues caused by its users, there are basic issues regarding the email protocol. Content Negotiation is severely hampered by technical peculiarities of Internet email.
Different Origins and Systems
The email standard is both extended and violated by the variety of today’s email clients and transfer agents. Thanks to the robustness of the Simple Mail Transfer Protocol (RFC 821), the ARPA Internet Text Message format (RFC 822) and the newer MIME extensions, an email can often be delivered, read and answered even if it is not standards compliant. The information it contains is represented in different formats with various additions, which makes it rather difficult to parse. Email parsers have to be rock solid and fault tolerant.
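As an illustration of fault-tolerant parsing, the parser in Python's standard email package does not abort on a malformed message; it records the problems it finds in a defects list and still returns whatever it could read. The sketch below assumes an invented broken message that claims to be multipart without defining a boundary. The function name parse_tolerantly is hypothetical, and the exact defect classes reported may differ between Python versions.

```python
from email import message_from_string, policy

# Hypothetical broken message: it claims to be multipart but defines no boundary.
RAW_MESSAGE = (
    "From: hans@example.com\n"
    "Subject: Out of office\n"
    "Content-Type: multipart/mixed\n"
    "\n"
    "Tomorrow I will be out of office.\n"
)


def parse_tolerantly(raw):
    """Parse raw text and return the message plus the names of recorded defects."""
    msg = message_from_string(raw, policy=policy.default)
    defect_names = [type(defect).__name__ for defect in msg.defects]
    return msg, defect_names


message, defect_names = parse_tolerantly(RAW_MESSAGE)
print(defect_names)           # names of the problems the parser noticed
print(message.get_payload())  # the text is still accessible
```

A classifier can use the recorded defects as a warning that the structure of the message is unreliable, instead of losing the message altogether.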
Lack of Authenticity
The email protocol can guarantee neither authority nor authenticity. It is easy to fake email headers and to lie about the origin. The sender remains anonymous if a simple nickname is used. These issues are typical of spam and virus email, but they can also greatly complicate the correlation of any email with a specific category.
Multipurpose Internet Mail Extensions (MIME)
Around the time the WWW emerged in 1993, the MIME standard was introduced. On the one hand it allows additional content to be sent in different formats; on the other hand it allows styled information to be embedded, such as that provided by the Hypertext Markup Language (HTML). Each part of a MIME email has a content type. This sounds good for categorization and negotiation purposes, but it is not. In fact, it complicates the negotiation process because more information, often alternative versions of the same content, is embedded in one email.
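The effect of alternative contents can be seen in a small example. The Python sketch below builds an invented message that carries the same text twice, once as plain text and once as HTML, lists the content type of every part, and then selects the plain-text part. It uses the standard library class email.message.EmailMessage. The helper names content_types and plain_text are hypothetical.

```python
from email.message import EmailMessage

# Hypothetical message with two alternative versions of the same content.
msg = EmailMessage()
msg["Subject"] = "Project software"
msg.set_content("Tomorrow I will be out of office.")
msg.add_alternative("<p>Tomorrow I will be <b>out of office</b>.</p>", subtype="html")


def content_types(message):
    """List the content type of the message and of every part it contains."""
    return [part.get_content_type() for part in message.walk()]


def plain_text(message):
    """Return the plain-text body, or an empty string if there is none."""
    part = message.get_body(preferencelist=("plain",))
    return part.get_content() if part is not None else ""


print(content_types(msg))
# prints: ['multipart/alternative', 'text/plain', 'text/html']
print(plain_text(msg).strip())
# prints: Tomorrow I will be out of office.
```

A negotiating system has to decide which alternative to rate. Rating both would count the same information twice.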
Markup and Style
The introduction of MIME made it possible to mark up information using styles and images. Images are extremely difficult to negotiate and can contain irrelevant information. Styles, e.g. font selection and colors, mark up the information. This markup is often not useful for negotiation (and difficult to interpret) and causes noisy results.
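Style markup can be removed before the text is rated. The following Python sketch uses the standard library class html.parser.HTMLParser to keep only the text and to skip the contents of style and script elements. The class name TextExtractor, the function name html_to_text and the sample HTML are assumptions made for illustration. Images and text hidden by styling are not handled.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content and ignore tags, style blocks and script blocks."""

    def __init__(self):
        super().__init__()
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        # Join the collected fragments and normalize runs of whitespace.
        return " ".join("".join(self._parts).split())


def html_to_text(html):
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.text()


SAMPLE = '<style>p {color: red}</style><p><font color="blue">Hi</font> <b>Joe</b>, I do agree.</p>'
print(html_to_text(SAMPLE))
# prints: Hi Joe, I do agree.
```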
Attachments
A major feature of MIME emails is attachments: an email can have binary files attached. The negotiation of binary content is often impossible, which makes the result unreliable. Determining whether an attachment is in some way correlated with the information in the email is difficult to realize.
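What a program can obtain about an attachment without interpreting it is mostly metadata. The Python sketch below builds an invented message with a small binary attachment and lists the file name, content type and size of each attachment, using the standard library class email.message.EmailMessage. The file name report.bin and the helper name list_attachments are hypothetical.

```python
from email.message import EmailMessage

# Hypothetical message with one small binary attachment.
msg = EmailMessage()
msg.set_content("Please find the report attached.")
msg.add_attachment(
    b"\x00\x01\x02\x03",
    maintype="application",
    subtype="octet-stream",
    filename="report.bin",
)


def list_attachments(message):
    """Return (file name, content type, size in bytes) for each attachment."""
    result = []
    for part in message.iter_attachments():
        payload = part.get_content()  # bytes for binary content
        result.append((part.get_filename(), part.get_content_type(), len(payload)))
    return result


print(list_attachments(msg))
# prints: [('report.bin', 'application/octet-stream', 4)]
```

File name and content type are supplied by the sender and may be misleading, so they give only a weak hint as to whether the attachment relates to the text of the email.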