看答案考点梳理相似题. . .
Fri, 05/14/2010 - 14:53 — Fuller
根据Coping with the
News: the machine learning way
When dealing with multiple classes there are two possible ways of averaging these
measures(i.e. recall, precision, F1-measure) , namely, macro-average and
micro-average. The macro-average weights equally all the classes, regardless of how many
documents belong to it. The micro-average weights equally all the documents, thus favouring
the performance on common classes. Different classifiers will perform different in common
and rare categories. Learning algorithms are trained more often on more populated classes
thus risking local over-fitting.
示被正确分到该类的实例的个数，b 表示被误分到该类的实例的个数，c 表示属于该类但被误分到其它类
r = a / (a + c), if a + c > 0; otherwise r = 1
p = a / (a + b), if a + b > 0; otherwise p = 1
其中参数β 用来为准确率（p）和召回率（r）赋予不同的权重，当β 取1 时，准确率和召回率被赋予相
Coping with the News: the machine learning
Dr. Rafael A. Calvo [HREF1], Web Engineering Group The University of
Sydney [HREF2], NSW, 2006. firstname.lastname@example.org
Prof. Jae-Moon Lee [HREF1], Hansung University[HREF3], Web Engineering
Group The University of Sydney [HREF2], NSW, 2006.
Abstract News articles represent some of the most popular and commonly accessed
content on the web. This paper describes how machine learning and
automatic document classification techniques can be used for managing
large numbers of news articles. In this paper, we work with more than
800,000 of Reuters news stories and classify them using a Naive Bayes and
k-Nearest Neighbours approach. The articles are stored in newsML format,
commonly used for content syndication. The methodology developed would
enable a web based routing system to automatically filter and deliver news
to users based on an interest profile.
Information overload ? in which individuals are faced with an oversupply of
content ? could become the road rage for the new millennium. In order for
this content to become useful information and empower users, rather than
frustrate or confuse them, we need novel ways of delivering only what is
needed, at the right time and in the right format. News stories are the most
important source of up-to-date information about the world around us, and
represent some of the most often updated, and highest quality content on
the web. Therefore, it is most important to develop ways to processes news
efficiently. In this paper we have studied machine learning models, and
applied them to the automatic classification of large numbers of online news
Language technologies have provided excellent tools for information
retrieval by exploiting the content being indexed. In the beginnings of the
Internet, search engines and classification systems on the web, only used
information from unstructured text to index and find content search engines
(i.e. Altavista); later they used the topology of the net to find which of the
information being indexed was more important (i.e. Google). Meanwhile,
the progress achieved in document classification has not been as noticeable,
and most web page classification is done manually by thousands of human
indexers, in private catalogues or in large scale ones such as Yahoo! and the
Open Directory Project. Instead of using human labour, automatic
document classification tools[3,11,12] use statistical models, similar to
those in information retrieval. These systems can be used to create
hierarchical taxonomies of content as are commonly found on the web, on
e-learning systems and even on traditional library information systems.
News story articles are written by reporters from all over the world. These
reporters often work for news agencies such as the Associated Press and
Reuters. These agencies collect the news, edit it and sell bundles of articles
to the periodicals accessed by web users (e.g. news.yahoo.com, Sydney
Morning Herald, etc). It is important for both the agencies and the periodicals to have an organized well-managed stream of news. News are
normally classified according to taxonomies that are relevant to readers,
(e.g. “politics”, “Iraq” or “Oil”). This classification can be very difficult
because it requires human expertise to spot relationships between the
taxonomy and the documents. Even the experts do not agree on what
should go where and inter-indexer consistency in classification has
considerable variation .
Automatic classification techniques use algorithms that learn from these
human classifications, so they can only do as well as the human training
data provided. In addition, different algorithms can learn different types of
patterns in the data. In order to compare the classification performance of
different algorithms, researchers have a set of standarised benchmarks,
with a particular dataset, and a well defined task. The most popular
classification benchmark during the late nineties was a Reuters collection
called Reuters-21578 (based on Reuters-22173) with 21578 documents,
that had to be classified in about 100 different [4,13]. This benchmark is still
used currently to compare the performance of different algorithms but the
challenges now lie in moving towards larger scale document classification
. In 2002, Reuters released a new research dataset with over 800,000
documents that we discuss in this paper.
There are several Machine Learning (ML) algorithms that have been
successfully used in the past [4,7,11,13]. They include Neural Networks,
Naive Bayes, Support Vector Machines (SVM) and k-Nearest neighbours
(kNN). Each of these methods has their advantages and limitations on
classification performance and scalability. The choice of algorithms will
depend in the application, and the amount of data to be used. In web
applications, efficiency is of particular importance, since the large number of
users and data can make some algorithms unfeasible.
Section 2 of this paper describes the Reuters RCV1 collection used in this
project, focusing on the challenges offered by its size, structure and the
richness of its XML structure. Section 3 describes the ML algorithms we have
used, namely Naive Bayes and k-Nearest Neighbours, section 4 describes
how we have improved a classification framework  in order to make it
scalable for web applications. Section 5 describes the performance results
for the classifier and the improvements in memory and CPU requirements.
Section 6 concludes and summarizes our future work in the area.
The Reuters collection, challenges and solutions
The Reuters RCV1 Corpus  consists of all English news stories (806,791)
published by Reuters in the period between 20/8/1996 and 19/8/1997. The news is stored as files in XML format, using a News ML Document Type
definition (DTD). NewsML is an open standard being developed by the
International Press and Telecommunications Council (IPTC). The news is
written by approximately 2000 reporters and then classified by Reuters
specialists in a number of ways. The classified news articles are then
syndicated by websites such as news.yahoo.com and news.google.com or
periodicals like the Sydney Morning Herald that may or may not have a
Due to seasonal variations, the number of stories per day is not a constant.
In addition, on weekdays there are an average of 2,880 stories per day
compared to 480 on weekends. Approximately 3.7Gb is required for the
storage of the uncompressed XML files.
The NewsML schema contains metadata produced by human indexers about
themes and categories. When two humans index a document differently
they create inter-indexer variations. These can be measured using a
correction rate C=(NC/NE)*100, where NE is the number of stories indexed
by an editor and NC is the number of times an editor has been corrected by
a second editor. Normally untrained editors will have a higher C than more
experts one, but even when they are all experienced correction rates of 10%
are common. In the RCV1 collection there are correction rates of up to 77%.
Since ML algorithms learn from examples ?classifications done by humans-
correction rates are an important limiting factor to their performance.
Performance measures in classification systems are really a measure of how
much they correlate to the human classifiers, it is not possible for a student
to be better than the teacher, or at least it not possible for the teacher to
RCV1 data is stored in XML documents providing the metadata information
normally required by news agencies and periodicals who need to deliver the
stories to end users. NewsML defines a rich schema shown simplified in
Figure 1, we can see that it has entities for title, headline, text, copyright
and several types of classification. For our experiments we have used all
three available classifications (topic, country and industry) as a single task.
Figure 1: An example of newsML document
Web applications will increasingly exploit this type of metadata. As it has
been discussed by researchers studying the concept of the semantic web ,
the next revolution in the Internet will come when web applications have
access to structured collections of information, and set of inference rules
that can be used to perform automated reasoning. This project aims at
producing such applications.
Language Technologies and document classification
Language technologies is an emerging field with researchers from different
backgrounds. Mathematicians, linguistics and computers scientists are
producing theories and software tools that manage content more efficiently.
This research includes speech processing, information retrieval and
document classification. The models in document classification are similar to
those in the other areas, and extensive literature describes them in detail
[4,7,11,14]. For the sake of brevity, in this section we only summarize the
The essence of the classification techniques is that documents and
categories can be represented in high dimensional vector space. Since
classification consists of assigning documents to predefined a categories,
machine learning techniques have to learn the mapping between the two
vector spaces, document to categories.
The simplest model to represent a document as a vector is the binary
representation. In this model, a vector has the dimension of the dictionary, and the occurrence of a word puts a 1 in the corresponding element, all
other elements being 0.
The next level of complexity, and the one we have used, is called Term
Frequency (TF) since the value of the element is equal to the number of
occurrences of the term in the document. If the term appears more often, it
is often because it is more important. This is not always the case, since
words like articles and prepositions often do not add much to the information
value. These terms are called stop-words and normally eliminated from the
document. Other technique to reduce the number of terms is stemming,
where words are reduced to their common stem (i.e. “runner” and “runners”
are reduced to “run”.
Finally, more elaborate models take into account the idea that terms that
appear in many documents are not telling us much about an individual one,
so an Inverse Document Frequency (IDF) weight is used. These models
require computing statistics across the whole corpus, making it much more
computational expensive. Since our goal in this project was to study scalable
techniques that could be used in web applications, we have used the simpler
TF vector representations.
Once the documents have been represented as vectors, they can be further
reduced using statistical reduction techniques such as c2, or term frequency.
These techniques select those terms that have higher impact or correlation
with the classification. We do not discuss them here as several other sources
describe them in detail [4,7,11,13].
Machine learning algorithms are trained on a subset of the corpus, and later
tested on a different subset to measure their performance. Several ML
algorithms have been used successfully in a number of applications [3,11],
including Naive Bayes and kNN both used here.
The success of a classification is often stated with two standard efficacy
measures: precision and recall. Precision is the probability that if a random
document is classified as part of C, this classification is correct. Recall is the
probability that if a random document ought to be classified under C, this
classification is actually made. These two quantities are highly and inversely
correlated. We can have absolute recall by assigning all documents to all
categories, but at the cost of having the worst precision. We can also have
high precision by never assigning a document to a category, but this would
make the classifier useless.
The trade-off between recall and precision is controlled by setting the
classifiers’ parameters. Both values should be provided to describe the
performance. Another common performance measure is the F1-measure:
When dealing with multiple classes there are two possible ways of averaging
these measures, namely, macro-average and micro-average. The
macro-average weights equally all the classes, regardless of how many
documents belong to it. The micro-average weights equally all the
documents, thus favouring the performance on common classes. Different
classifiers will perform different in common and rare categories. Learning
algorithms are trained more often on more populated classes thus risking
Object Buffering and performances
Object Oriented Application Frameworks (OOAF) are software engineering
artefacts that improve reusability of design and implementation [5,6,10].
The classification framework used in this project[HREF4]  was designed
to provide a consistent way to build document categorization systems. It
allows the developer to focus on the quality of the more critical and complex
modules by allowing reuse of common, simple base modules. In this project
we study how the framework can be used in a news stories classification
system and extended the framework to be more scalable as required in web
applications by adding an object buffering module.
One of the most serious limitations when training ML algorithms, is the
amount of data required to achieve satisfactory performance and the
scalability issues that arise when training such algorithms. Memory
requirements are often the first obstacle for managing large corpora,
specially for a text categorization system, where the ML algorithms are
being trained on very large vector spaces and large amounts of data. Most
operating systems have dealt with this problems by adding support for
virtual memory, but these general purpose techniques are not always
enough or appropriate for special applications like ours. In those cases,
applications often have an adaptive buffer management tool that allows the
application to allocate specific amounts of memory.
Figure 2: Run time object diagram
We have extended our text categorization framework  so it can handle
much larger corpora. Figure 2 shows the object diagram of important
classes in the run time of the framework. The circle in Figure 2 represents an
object and the arrow represents a relationship meaning a
category/document ‘HAS-A’ feature vector (FV). The document object
corresponds exactly to a document in the training or testing sets. The
category object corresponds exactly to a category. For the RCV1, there are
over 800K documents, so the framework needs to manage that number of
document objects and the same number of feature vectors. The object
buffer can be configured to store as many objects as possible in RAM
memory, and requesting other ones from the hard disk as they are required.
Thus, there is only a constant amount of objects in the memory, and the
system can be optimised for the available resources and independently on
the amount of data.
The buffered object module we implemented is described in Figure 3. The
module supports a unique persistent Object Identifier (OID) for all objects
stored in it. The framework will then use this OID instead of the object or its
reference. If the framework needs an object to be stored in the buffered
object module, it makes a request to the buffered object manager using the
OID. The buffered object manager first searches for the request object in the
object pool. If object manager finds it, then it returns the object’s reference,
otherwise it requests the object to the buffered file manager which tries to
minimize I/O to the disk. The buffered file manager reads the request object
from the disk, and returns it to the buffered object manager. As the buffered
object manager receives the object from the buffered file manager, it firstly
stores the received object to the buffer pool and returns it to the requester
in the framework.
Figure 3: Architecture of the buffered object module
In order to optimise memory usage, the buffered object manager uses a
constant amount of memory. Our current implementation follows the Least
Recently Used (LRU) strategy . In this strategy, common for paging in
operating systems, the least recently used (read or written) object (or page)
is selected to be taken out of the buffer (paged out). The same rule is also
used in other cache systems when they select which cache entry to flush. In
future work we plan to add new buffering strategies such as: Most Recently
Used (MRU) and others. We expect that the selection of these strategies will
depend on each Machine Learning algorithm. For example, Naive Bayes
might work better with a LRU and kNN with a MRU strategy.
With a scalable framework we are able to show the feasibility of
automatically classifying large amounts of news stories. Since news stories
in the RCV1 dataset are in newsML format the extensions to the framework
need to include more efficient XML parsing. We first tried the Perl and C
implementations of the SAX package. Adapting one of the libraries
implemented in C and integrating it into the framework produced the best
We have tested the extended framework on the Reuters RCV1 data in order
· Assess the feasibility of automatic document classification on large-scale
management of news stories.
· Measure the classification performance of Naive Bayes and kNN
classifiers on this new corpus, and find ways to improve it.
· Measure the improvements over the non-buffered framework in
computing time and memory management. We performed three series of experiments: one to determine how Naive
Bayes and kNN algorithms performed for different amounts of training data,
a second one to see if using the newsML structure could improve
performance by giving different weights to the title of the story, and finally
how they scaled ?in computation time- for different amounts of data.
Although the precision of the classification itself should not be modified by
the extensions to the framework (buffering objects), we expected that being
able to train them with larger amount of data would improve performance.
In the first set of experiments we studied the performance for 10, 50 and
100 days of training data.
The data was selected from the original RCV1 dataset using a simple
strategy where we randomly chose 80% of the total amount of dates as
training set, and 20% as test set. With this approach the total number of
news on each set is not an exact percentage since there are different
number of stories on each day. This sampling strategy has the disadvantage
of producing test sets that might have somewhat different statistics, for
example a particular date (i.e. Bush’s election or September 11) could start
a new type of documents that did not appear in the training set. The notation
XTr-Yte means that the news stories of X days are used as the training
documents and news stories of Y days are used as testing documents. Table
1 and Table 2 describe the data used in each subset. 10Tr, 50Tr and 100Tr
means that we used 10, 50 and 100 days for training.
The second goal of the experiments was to see if the newsML structure could
be used to improve performance and find guidelines on how to do it. We
expected that giving more weight to more important attributes of the
newsML schema (i.e. the title) would be beneficial. The content of title
element is similar to the headline. In the experiments we have fixed the
weighting factor of headlines and text elements to 1, and we tried several
weighting factors for title, T = 0, 1, 3, 7, 17. The weighting factor T means
we give T times the frequency count of the word in the document.
Figure 4 through 7 show the performance results for the different subsets.
Figure 4 and 6 are the results for the Naive Bayes classifier, and indicate that
micro-precision (miP) of 80-90% are achievable. We must remember that:
first, due to the inter-indexer variations, 100% precision is not humanly
automatically possible. Second, and that these results only show the
correlation with the human classifier, this means we assume humans are
always right, and this is not always the case, in fact machines might be
doing it better.
Figure 5 and 7 show the results for the kNN classifier. We can see that the
classification performance is better than for the Naive Bayes classifier. This is particularly true with the recall measures. The computational performance
of kNN is an obstacle and we have not performed tests for all the subsets.
KNN is not optimum for web applications, such as the one we discuss in this
paper, since all the processing is performed at test time (there is no training),
this means that in a real application all the computation would be performed
when the news arrive and need to be redirected. kNN showed to have much
better recall performance (miR=0.55) compared to Naive Bayes (miR=0.19).
This means kNN is able to assign more documents to some category, and
this might be of great importance in some applications. Precision (the
number of correct classifications) was somewhat degraded by using kNN,
but not in a considerable amount.
Analysing the performances for different weighting schemes we find that the
optimum T seems to be around 3. We can also see that increasing the
number of training days from 10 to 50 improved the miP but does not seem
to improve the miF1 measure in a considerable amount.
Training Data Testing Data
Files Bytes Files Bytes
10Tr-2Te 23,298 70MB 5,872 18MB
50Tr-10Te 107,543 331MB 14,936 46MB
100Tr-20Te 228,089 701MB 43,008 133MB
Table 1: Amount of data for the selected subsets
Data Type Categories Categories
10Tr-2Te 670 5.17 43.54
50Tr-10Te 718 5.18 170.58
100Tr-20Te 730 5.22 371.37
Table 2: Categories for the selected subsets
Figure 4: Performance of Naive Bayes for 10Tr-2Te subset
Figure 5: Performance of kNN for 10Tr-2Te subset
Figure 6: Performance of Naive Bayes for 50Tr-10Te subset
Figure 7: Performance of kNN for 50Tr-10Te subset
A third set of experiments was performed to measure computational
efficiency. We measured the three kinds of executing time: preprocessing,
training and testing. The x axis in Figure 8 represents the three subsets
studied and the y axis represents the executing time measured in seconds
The preprocessing includes reading the training documents, vectorising and
storing them to disk as needed. Thus, the preprocessing is the same to all
the algorithms. The training time depends in the machine learning
algorithms, for Naive Bayes we show is linear. Finally, the testing time refers
to categorizing the documents, and it is also dependent on the given algorithm. The executing time of preprocessing, training and testing
increase linearly showing good scalability. This means that the object
buffering module worked well.
Figure 8: Execution Time of Naive Bayes
We have shown the feasibility of using automatic document classification for
large-scale management of news stories. We have focused on scalability
issues that are particularly important to web applications.
The classification performance shown with these experiments is extremely
promising and would allow real web applications to use this type of
functionality. We obtained 0.9 miP, meaning that 9 out of 10 stories were
correctly classified. The miR and therefore the miF1 are somewhat lower
because some classes contain few examples and are harder to classify.
Naive Bayes classifiers produce lower quality classification but seem to be
better suited for applications were classification needs to be performed at
real time, kNN instead produces better classification but places to much load
on classification time since there is not training.
Since we wanted to assess the possibility of using automatic classification
techniques for real web applications, we chose a sampling strategy that
produces lower classification performance, but is more scalable and
We have extended our classification framework in order to manage large
amounts of documents. The object buffer has been successful in increasing the number of documents that we were able to manage by an order of
magnitude and reduced the computational time to approximately one fifth.
Using faster XML parsing for the pre-processing stage showed to be
important in managing large amounts of newsML documents. So very
efficient parsers are required for content syndicators who manage large
quantities of news stories.
This project will continue by further studying the scalability issues of other
algorithms and integrating the classifier into real web application. This will
be done following the integration of the framework to the Postgres database,
as described in , and using it in web application frameworks as
described in .
This project was partially funded by the Australian Research Council,
through a Linkage International Fellowship, Hansung University and the
University of Sydney. The authors also want to acknowledge Ken Williams
for the development and support in using the classification framework.
1. Asit D. and Towsley D.,"An approximate analysis of the LRU and FIFO
buffer replacement schemes", Proceedings of the 1990 ACM
SIGMETRICS conference on Measurement and modeling of computer
systems, pages 143 ? 152, 1990.
2. Berners-Lee T., Hendler J., Lassila O.,"The Semantic Web", Scientific
American. May 2001.
3. Calvo R.A.,"Classifying financial news with neural networks", In 6th
Australasian Document Symposium, December 2001.
4. Calvo, R. A. and H. A. Ceccatto,"Intelligent Document Classification",
Intelligent Data Analysis, 4(5), 2000.
5. Fayad, M., Schmidt D., and Johnson, R. Eds.,"Building Application
Frameworks : Object-oriented foundations of framework design",
John Wiley and Sons 1999.
6. Fayad, M. and Schmidt, D.,"Object-Oriented Application
Frameworks", Commun. ACM 40, 10 (Oct. 1997), 32-38.
7. Manning C.D. and Schutze H.,"Foundations of Statistical Natural
Language Processing", The MIT Press, 1999.
8. Mayfield, J., P. McNamee, C. Costello, C. Piatko and A.
Banerjee,"JHU/APL at TREC 2001: Experiments in Filtering and in Arabic, Video, and Web Retrieval", Proceedings of the Tenth Text
Retrieval Conference (TREC 2001), NIST Special Publication, 2002.
9. Rose, T.G., M. Stevenson and M. Whitehead,"The Reuters Corpus
Volume 1 - from Yesterday's News to Tomorrow's Language
Resources", In Proceedings of the Third International Conference on
Language Resources and Evaluation, Las Palmas de Gran Canaria,
29-31 May 2002.
10.Rumbaugh J., Blaha, M, Premerlani W., Eddy F. and Lorensen
W.,"Object Oriented Modeling and Design", Prentice Hall, 1991.
11.Sebastiani F., "Machine learning in automated text categorization",
ACM Computing Surveys, 34(1):1-47, 2002
12.Williams K. and R.A. Calvo, "A Framework for Text Categorization",
7th Australasian Document Computing Symposium. December 2002.
13.Williams K., R.A. Calvo and Bell, D., "Automatic Categorization of
Questions for a Mathematics Education Service", Artificial Intelligence
in Education Conference, July 2003, Sydney, Australia (to appear).
14.Yang, Y. and X. Liu., "A re-examination of text categorization
methods", In 22nd Annual International SIGIR, pages 42?49, Berkley,
Andrew Treloar, ? 2000. The authors assign to Southern Cross University
and other educational and non-profit institutions a non-exclusive licence to
use this document for personal use and in courses of instruction provided
that the article is used in full and this copyright statement is reproduced. The
authors also grant a non-exclusive licence to Southern Cross University to
publish this document in full on the World Wide Web and on CD-ROM and in
printed form with the conference papers and for the document to be
published on mirrors on the World Wide Web.
一般成年男性血红蛋白<120g/L,成年女性血红蛋白<110g/L为贫血。根据血红蛋白减低的程度贫血可分为四级。轻度：血红蛋白>90g/L、中度：血红蛋白90～60g/L、重度：血红蛋白60～30g/L、极度：血红蛋白<30g/L。 1.生理性减少： 3个月的婴儿至15岁以前的儿童，因生长发育迅速而致造血原料相对不足，红细胞和血红蛋白可较正常人低10%～20%。妊娠中、后期由于孕妇血容量增加使血液稀释，老年人由于骨髓造血功能逐渐减低，均可导致红细胞和血红蛋白含量减少。
浏览：14913 提问时间：2006-10-30 回答
比如：一种物品，它有A、B、C、D四种报价，各自的份额为20%，15%，30%，35%则加权平均数（价）就是 A*20%+B*15%+C*30%+D*35% 黄金报价：