Research on Text Feature Extraction Methods

I. Background Overview

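Text mining is an interdisciplinary field that draws on data mining, machine learning, pattern recognition, artificial intelligence, statistics, computational linguistics, computer networking, and information science. It is a method and toolset for discovering implicit knowledge and patterns in large collections of documents. It grew out of data mining but differs from traditional data mining in many ways. The objects of text mining are massive, heterogeneous, distributed documents (the web); their content is written in human natural language and lacks semantics that a computer can understand. Traditional data mining processes structured data, whereas documents (the web) are semi-structured or unstructured. The first problem text mining faces is therefore how to represent text reasonably in a computer, so that the representation carries enough information to reflect the text's characteristics without becoming so complex that learning algorithms cannot handle it. Of the vast amount of information on the web, 80% is stored as text, and web text mining is an important form of web content mining.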
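As a rough illustration of the representation problem described above, the sketch below turns documents into bag-of-words count vectors. This is only one simple representation, chosen here as an assumption; the excerpt does not say which representation the post goes on to use. The function names (`tokenize`, `build_vocabulary`, `vectorize`) and the sample sentences are hypothetical.

```python
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    # Real text (especially Chinese) would need proper word segmentation.
    return text.lower().split()


def build_vocabulary(documents):
    # Map every distinct term to a column index, in sorted order.
    terms = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {term: idx for idx, term in enumerate(terms)}


def vectorize(text, vocabulary):
    # Count how often each vocabulary term occurs in the text.
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for term, n in counts.items():
        if term in vocabulary:
            vector[vocabulary[term]] = n
    return vector


docs = [
    "text mining finds patterns",
    "data mining finds patterns in data",
]
vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]
```

Each document becomes a fixed-length vector, which is easy for a learning algorithm to process but discards word order; that is the trade-off between information and complexity that the paragraph refers to.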

#3 Ensemble Learning: Collective Wisdom in Machine Learning!


Background:

Overview

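Machine learning methods are widely used in production, daily life, and scientific research, and ensemble learning is one of the most active directions in machine learning.
Ensemble learning trains a series of learners and combines their individual results according to some rule, yielding an ensemble learner that learns better than its base learners.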
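The idea of combining learner outputs "according to some rule" can be sketched with majority voting, one of the simplest combination rules. The three threshold-based base learners below are hypothetical stand-ins, not models from the post, and voting is only one possible rule.

```python
from collections import Counter


def majority_vote(predictions):
    # Return the label that most base learners predicted.
    return Counter(predictions).most_common(1)[0][0]


def ensemble_predict(learners, x):
    # Ask every base learner for a prediction, then combine by voting.
    return majority_vote([learner(x) for learner in learners])


# Hypothetical base learners: simple threshold rules on a 2-feature input.
learners = [
    lambda x: 1 if x[0] > 0.5 else 0,
    lambda x: 1 if x[1] > 0.5 else 0,
    lambda x: 1 if x[0] + x[1] > 1.0 else 0,
]

label = ensemble_predict(learners, (0.9, 0.2))
```

For the input `(0.9, 0.2)` the first and third learners vote 1 and the second votes 0, so the ensemble outputs 1 even though one base learner disagrees.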

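Today, while analyzing and discussing ensemble learning and multi-class ensemble learning, we also raise some current problems in multi-class ensemble learning for your reference.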

Ensemble learning illustration


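Current State of Research

Rich theory

Two-class (binary) ensemble learning already has a fairly mature theoretical foundation. Multi-class ensemble theory builds on binary ensemble theory.

International results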

Bagging (Leo Breiman, 1994, Technical Report No. 421)

Boosting (Robert E. Schapire, 1990, "The Strength of Weak Learnability", Machine Learning (Boston, MA: Kluwer Academic Publishers))

AdaBoost (Yoav Freund and Robert Schapire, introduced in 1995; awarded the Gödel Prize in 2003)

AdaBoost.MH, SAMME, PIBoost, GentleBoost, AdaCost

Results from China:

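Zhou Zhihua and colleagues at Nanjing University proposed the theory of selective ensemble, published in 2001 at IJCAI, a top international conference on artificial intelligence. Zhou Zhihua and colleagues also proposed the idea of twice-learning, which uses ensemble learning as a preprocessing step, published in IEEE Trans. Information Technology in Biomedicine (2003).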

Brief History of Machine Learning

My subjective ML timeline

Since the earliest days of science, technology, and AI, scientists following Blaise Pascal and Von Leibniz have pondered a machine as intellectually capable as humans. Famous writers like Jules

Pascal’s machine performing subtraction and summation – 1642

Machine learning is one of the major branches of AI and a very hot topic in both research and industry. Companies and universities devote many resources to advancing their knowledge. Recent advances in the field have produced very solid results on a range of tasks, comparable to human performance (98.98% on traffic sign recognition, higher than human accuracy).

Here I would like to share a crude timeline of machine learning and mark some of its milestones; it is by no means complete. In addition, you should prepend "to the best of my knowledge" to every claim in the text.

The first step toward today's prevalent ML was proposed by Hebb in 1949, based on a neuropsychological formulation of learning. It is called Hebbian learning theory. Put simply, it tracks correlations between the nodes of a Recurrent Neural Network (RNN). It memorizes commonalities in the network and later serves as a memory. Formally, the argument states:

Let us assume that the persistence or repetition of a reverberatory activity (or “trace”) tends to induce lasting cellular changes that add to its stability.… When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased.[1]
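Hebb's postulate is qualitative, but it is commonly formalized as a weight update proportional to the product of two units' activations. The sketch below assumes that textbook form; the learning rate, the three-unit network, and the function name `hebbian_update` are illustrative choices, not taken from the quoted text.

```python
def hebbian_update(weights, activations, learning_rate=0.1):
    # Strengthen the connection between units i and j in proportion to
    # the product of their activations; self-connections are left at zero.
    n = len(activations)
    return [
        [
            weights[i][j]
            + (learning_rate * activations[i] * activations[j] if i != j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical 3-unit network, starting with no connections.
weights = [[0.0] * 3 for _ in range(3)]
pattern = [1, 1, 0]  # units 0 and 1 fire together, unit 2 stays silent

# Presenting the same pattern repeatedly reinforces the 0-1 connection.
for _ in range(5):
    weights = hebbian_update(weights, pattern)
```

After the repeated presentations, only the connection between the two co-active units has grown; connections involving the silent unit stay at zero, which mirrors the "repeatedly or persistently takes part in firing it" condition in the quote.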

Arthur Samuel

In 1952, Arthur Samuel at IBM developed a program that played checkers. The program was able to observe positions and learn an implicit model that gave better moves in later games. Samuel played many games against the program and observed that it played better over time.