Inverted Pendulum Control Based on Reinforcement Learning
Abstract
Since the 1970s, researchers have explored a variety of learning strategies and methods and have begun to combine learning systems with practical applications, achieving considerable success and advancing the development of machine learning. In 1980, the first international workshop on machine learning, held at Carnegie Mellon University (CMU), marked the rise of machine learning research worldwide. In 1989, Carbonell identified four research directions for machine learning: connectionist learning, symbol-based inductive learning, genetic learning, and analytic learning. A decade later the research focus had shifted: in 1997, Dietterich proposed four new directions: ensembles of classifiers, supervised learning algorithms for massive data sets, reinforcement learning, and learning complex statistical models.
     The terms "reinforcement" and "reinforcement learning" were first introduced into the engineering literature by Minsky in 1954. In 1965, Waltz and K. S. Fu independently proposed the concept in control theory. Progress was slow through the 1960s and 1970s, but from the 1980s onward, advances in artificial neural networks and in computer technology revived interest, and reinforcement learning gradually became an active area of machine learning research. Researchers around the world have since proposed many algorithms and learning strategies and applied reinforcement learning in a wide range of fields: game playing, where the earliest example is Samuel's checkers program; scheduling optimization; and, above all, robotics and control problems, of which the inverted pendulum control system is a typical example.
     In stability control, the inverted pendulum is both a universal and a representative problem. As a piece of equipment it is inexpensive and structurally simple, yet as a controlled object it is complex: a high-order, unstable, multivariable, nonlinear, strongly coupled system that can be stabilized only by effective methods. Moreover, when a new theory or method is proposed but cannot yet be rigorously proved, the inverted pendulum apparatus can be used to verify its correctness and practicality. Research on the inverted pendulum therefore has deep theoretical significance as well as an important engineering background: helicopter and rocket flight, the operation of artificial satellites, and robots lifting weights, performing gymnastics, or walking all involve stabilization problems similar to that of the inverted pendulum. The study of the inverted pendulum is thus of practical importance to modern high technologies such as rocket flight and robot control.
     On the basis of a concise yet thorough survey of machine learning, reinforcement learning, and the inverted pendulum, this thesis applies the idea of reinforcement learning to the control of one-link and two-link inverted pendulums and further analyzes the learning results. The main contributions are as follows:
     First, reinforcement learning is combined with multidimensional linear interpolation to balance a one-link inverted pendulum. The algorithm discretizes the state space, uses a rule table as the representation of the value function, and applies reinforcement learning to learn directly the force required to balance the pendulum. The results show that the learned force is an almost linear function of the state variables, which provides the necessary preparation for learning the coefficients of a linear control equation.
     Second, by learning the coefficients of the control equations of the one-link and two-link inverted pendulums, good control performance is achieved. For the two-link pendulum, the influence of the initial values of the coefficients on learning is analyzed: the initial values affect the learning time to some extent but have almost no effect on the final learning quality. The final results show that the algorithm controls the two-link pendulum well. When the one-link learning result is used as the initial value for two-link learning, the learning time is greatly shortened, so the method extends well from lower-order to higher-order systems. It requires little prior knowledge and is a good learning method for solving this class of control problems.
ABSTRACT
Since the 1970s, researchers have explored many kinds of learning strategies and learning algorithms and, at the same time, combined learning with various applications; the resulting successes accelerated the development of machine learning. In 1980, the first international workshop on machine learning, held at CMU, signaled the rise of machine learning research around the world. In 1989, Carbonell published an article identifying four research directions for machine learning: connectionist machine learning, symbol-based inductive machine learning, genetic machine learning, and analytic machine learning. Ten years later, in 1997, Dietterich proposed four new directions: ensembles of classifiers, supervised learning algorithms for massive data, reinforcement learning, and learning complex statistical models.
    In 1954, the terms "reinforcement" and "reinforcement learning" were first proposed by Minsky and appeared in the engineering literature [26]. In 1965, Waltz and K. S. Fu independently put forward the concept in control theory. During the 1960s and 1970s research on reinforcement learning progressed rather slowly, but after the 1980s, with continuing progress in neural network research and in computer technology, research on reinforcement learning surged and it gradually became an active field of machine learning. Researchers throughout the world have proposed many kinds of learning algorithms and strategies and applied reinforcement learning to many fields: game playing, of which the earliest example is Samuel's checkers program; scheduling optimization; and, most of all, robotics and control problems, a representative example being the control of the inverted pendulum.
    Among stability control problems, the inverted pendulum is both universal and representative. As a piece of equipment its cost is low and its structure is simple; as a controlled object it is much more complex: a high-order, unstable, nonlinear, strongly coupled system that only an effective method can stabilize. When a new theory or method is proposed and cannot yet be strictly proved, the inverted pendulum system can be used to validate its correctness and practicability. Research on the inverted pendulum not only has profound theoretical meaning but also an important engineering background: helicopter and rocket flight, the operation of artificial satellites, and robots lifting weights, doing gymnastics, or walking all pose problems similar to the stable control of the inverted pendulum system. So research on the inverted pendulum is of important practical significance to high technologies such as rocket flight and robot control.
    On the basis of a thorough survey of machine learning, reinforcement learning, and the inverted pendulum, this paper applies the idea of reinforcement learning to the control of the one-link and two-link inverted pendulums and further analyzes the learning results. The innovations in the paper are as follows:
    Firstly, this paper combines the idea of reinforcement learning with multidimensional linear interpolation to balance the one-link inverted pendulum. In the method, the state space is discretized, a rule table serves as the representation of the value function, and reinforcement learning directly learns the force needed to balance the pendulum. The learning results indicate that the learned force is almost linear in the state variables, so it provides the necessary preparation for learning the coefficients of the control equation of the inverted pendulum.
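    To make the table-plus-interpolation idea concrete, the sketch below is an illustrative reading of the method rather than the thesis's code: a four-dimensional cart-pole state (cart position and velocity, pole angle and angular velocity) is discretized onto a grid, one force value is stored per grid node, and the force at an arbitrary continuous state is read out by multidimensional linear interpolation. The grid resolution and state ranges are assumptions made for the example.

```python
import numpy as np

# Illustrative sketch only: a force table over a discretized 4-D cart-pole state
# (x, x_dot, theta, theta_dot), read out with multidimensional linear interpolation.
# Grid size and state ranges are assumptions, not the thesis's settings.
GRID_POINTS = 7
RANGES = [(-2.4, 2.4),   # cart position x (m)
          (-3.0, 3.0),   # cart velocity (m/s)
          (-0.2, 0.2),   # pole angle theta (rad)
          (-2.0, 2.0)]   # pole angular velocity (rad/s)
AXES = [np.linspace(lo, hi, GRID_POINTS) for lo, hi in RANGES]

# The "rule table": one force value per grid node, adjusted by the learning rule.
force_table = np.zeros((GRID_POINTS,) * 4)

def interpolated_force(state, table=force_table):
    """Multilinear interpolation of the tabulated force at a continuous state."""
    idx, frac = [], []
    for s, axis in zip(state, AXES):
        s = float(np.clip(s, axis[0], axis[-1]))
        i = int(np.clip(np.searchsorted(axis, s) - 1, 0, len(axis) - 2))
        idx.append(i)
        frac.append((s - axis[i]) / (axis[i + 1] - axis[i]))
    force = 0.0
    for corner in range(2 ** 4):          # blend the 16 surrounding grid nodes
        weight, node = 1.0, []
        for d in range(4):
            bit = (corner >> d) & 1
            weight *= frac[d] if bit else 1.0 - frac[d]
            node.append(idx[d] + bit)
        force += weight * table[tuple(node)]
    return force

print(interpolated_force([0.1, 0.0, 0.02, -0.1]))   # 0.0 before any learning
```

    In a scheme of this kind the reinforcement signal adjusts the table entries that contributed to the interpolated output; the observation that the learned force is almost linear in the state variables is what later justifies fitting the coefficients of a linear control equation.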
    Secondly, by learning the coefficients of the control equations of the one-link and two-link inverted pendulums, the pendulums can be controlled well. For the two-link inverted pendulum, the influence of the initial values of the coefficients of the control equation on learning is analyzed; the experiments show that the initial values have some influence on the learning time but little influence on the learning quality. The final results show that the algorithm controls the two-link pendulum well. When the one-link learning result is used as the initial value for two-link learning, the learning time is greatly shortened, so the method extends well from lower-order to higher-order systems; it needs little prior knowledge and is a good learning method for this class of control problems.
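    As a sketch of the second idea, the example below applies a linear control law u = -k·s to a linearized one-link cart-pole and improves the coefficient vector k by reward-guided random search, where the reward is the number of steps the pendulum stays within bounds. The dynamics, parameters, and search rule are assumptions standing in for the thesis's model and learning method; the commented last lines indicate how the one-link result could seed the two-link case, the reuse that the thesis reports shortens learning.

```python
import numpy as np

# Illustrative assumptions: linearized one-link cart-pole dynamics, Euler integration,
# reward = survival steps; the random search stands in for the thesis's learning rule.
G, M_CART, M_POLE, L, DT = 9.8, 1.0, 0.1, 0.5, 0.02   # assumed physical parameters

def balance_steps(k, max_steps=2000):
    """Roll out u = -k . s on the linearized cart-pole; return steps before failure."""
    s = np.array([0.0, 0.0, 0.05, 0.0])               # x, x_dot, theta, theta_dot
    for t in range(max_steps):
        u = float(-k @ s)
        th_acc = ((M_CART + M_POLE) * G * s[2] - u) / (M_CART * L)
        x_acc = (u - M_POLE * G * s[2]) / M_CART
        s = s + DT * np.array([s[1], x_acc, s[3], th_acc])
        if abs(s[0]) > 2.4 or abs(s[2]) > 0.2:         # cart or pole out of bounds
            return t
    return max_steps

def learn_gains(init=None, iters=500, sigma=2.0, seed=0):
    """Hill-climb the coefficients of the linear control law on the episode reward."""
    rng = np.random.default_rng(seed)
    k = np.zeros(4) if init is None else np.array(init, dtype=float)
    best = balance_steps(k)
    for _ in range(iters):
        trial = k + sigma * rng.standard_normal(k.size)   # perturb the coefficients
        r = balance_steps(trial)
        if r >= best:                                      # keep the better gain vector
            k, best = trial, r
    return k, best

k1, steps = learn_gains()
print(k1, steps)
# For a two-link pendulum the state and gain vectors would be extended, and
# learn_gains could be seeded with the one-link result padded with zeros.
```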
References
[1] 刘琴。机器学习。武钢职工大学学报,2001,13(2):41-44
    [2] 蔡自兴,徐光祐。人工智能及其应用[M]。北京:清华大学出版社,1996年
    [3] Carbonell J. Introduction: Paradigm s for machine learning[J]. Artificial Intelligence, 1989,40(1): 1-9.
    [4] Dietterich T. Machine learning research: Four current directions(Final draft) [J]. AI Magazine, 1997, 18(4) : 97-136.
    [5] 王珏,石纯一。机器学习研究。广西师范大学学报(自然科学版)。2003,21(2):1-15
    [6] Kaelbling L, Littman M ,Moore A. Reinforcement learning: A survey [J].Journal of Artificial Intelligence Research, 1996, 4: 237-285.
    [7] Wiener N.控制论(中译本)[M].北京:科学出版社,1962.
    [8] Arbib M. Brains, machines and mathematics[M]. New York: McGraw-Hill, 1964.
    [9] Ashby W. Design for a brain: the origin of adaptive behavior[M]. London: Chapman & Hall, 1950.
    [10] Holland J. Adaptation in natural and artificial systems[M]. Ann Arbor:University of Michigan Press, 1975.
    [11] Sutton R, Barto A. Reinforcement learning: An introduction[M]. Cambridge, MA: MIT Press, 1998.
    [12] Goldberg D. Genetic algorithms in search, optimization and machine learning[M]. Reading, MA: Addison-Wesley Publishing Company, 1989.
    [13] Brooks R. Intelligence without reason [A]. John M, Ray R. Proceedings of the 12th international joint conference on artificial intelligence[C]. San Mateo:Morgan Kaufmann Publishers, 1991. 569-595.
    [14] Minsky M. The society of mind[M ]. New York: Simon & Schuster, 1986.
    [15] Picard R W. Affective computing (Technical report 321) [R]. Cambridge,MA:MIT Media Laboratory, 1995.
    [16] 赵凯,王珏。适应性计算[J]。模式识别与人工智能,2002,13(4):407-414。
    [17] Rosenblatt F. The perceptron: A perceiving and recognizing automaton (Technical report 85-460-1) [R]. Ithaca, NY: Cornell Aeronautical Laboratory, 1957.
    [18] Minsky M, Papert S. Perceptrons (Expanded edition 1988)[M]. Cambridge, MA: MIT Press, 1988.
    [19] McCulloch W, Pitts W. A logical calculus of the ideas immanent in nervous activity[J]. Bulletin of Mathematical Biophysics, 1943, 5(1):115-133.
    [20] Minsky M, Papert S. Perceptrons[M]. Cambridge, MA: MIT Press, 1969.
    [21] Rumelhart D E,McClelland J L. Parallel distributed processing[M]. Cambridge,MA:MIT Press, 1986.
    [22] Widrow B, Hoff M. Adaptive switching circuits[A]. IRE WESCON convention record (Part 4)[C]. New York: Institute of Radio Engineers, 1960. 96-104.
    [23] Gold M. Language identification in the limit[J]. Information and Control, 1967, 10(5):447-474.
    [24] Samuel L.Some studies in machine learning using the game of checkers [J]. IBM Journal Research and Development,1967, 11(4):601-618.
    [25] Lester A. Lefton. Psychology. 4th edition,San Francisco,West Publishing Company, 1991,pp.198-199.
    [26] Minsky M L. Theory of neural analog reinforcement systems and its application to the brain model problem [D]. New Jersey, USA :Princeton University,1954
    [27] Bush R R & Mosteller F. Stochastic Models for Learning [M].New York :Wiley,1955.
    [28] Widrow B & Hoff M E. Adaptive switching circuits [A]. In: Anderson J A and Rosenfeld E. Neurocomputing: Foundations of Research[M]. Cambridge, MA: The MIT Press, 1988, 126-134
    [29] Rosenblatt F. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms [M]. Washington DC: Spartan Books, 1961
    [30] Waltz M D &Fu K S. A heuristic approach to reinforcement learning control systems[J]. IEEE Trans. Automatic Control,1965,10 (3) :390- 398
    [31] Widrow B, Gupta N K & Maitra S. Punish/reward: Learning with a critic in adaptive threshold systems[J]. IEEE Trans. on Systems, Man, and Cybernetics, 1973, 3(5):455-465
    [32] Saridis G N. Self-Organizing Control of Stochastic Systems[M]. New York: Marcel Dekker, 1977, 319-332
    [33] Barto AG, Sutton R S and Brouwer P S. Associative search network:a reinforcement learning associative memory[J]. Biological cybernetics, 1981,40:201-211
    [34] Barto A G, Sutton R S and Anderson C W. Neuronlike adaptive elements that can solve difficult learning control problems [J]. IEEE Trans. on Systems, Man, and Cybernetics, 1983, 13(5):834-846
    [35] Sutton R S. Temporal credit assignment in reinforcement learning[D] . Amherst ,MA:University of Massachusetts, 1984
    [36] Sutton R S. Learning to predict by the methods of temporal difference[J]. Machine Learning,1988,3:9 - 44
    [37] Dayan P. The convergence of TD (λ) for general λ [J].Machine Learning, 1992,8:341 - 362
    [38] Wang Lichun and Denbigh P N. Monaural localization using combination of TD (λ) with back propagation [A]. IEEE Int. Conf. on Neural Network[C], San Francisco,USA ,1993,187 - 190
    [39] Cichosz P and Mulawka JJ. Fast and efficient reinforcement learning with truncated temporal differences [A]. Proc. 12th Int. Conf. on Machine Learning [C], Morgan Kaufmann, San Francisco, USA, 1995,99 - 107
    [40] Cichosz P. Truncating temporal differences : on the efficient implementation of TD (λ) for reinforcement learning [J]. J. of Artificial Intelligence Research,1996,12:287 - 318
    [41] Badtke S J and Barto R G. Linear least-squares algorithms for temporal difference learning[J]. Machine Learning, 1996,22:33-57
    [42] Robert E S and Warmuth K. On the worst-case analysis of temporal difference learning algorithms [J]. Machine Learning, 1996, 22:95-121
    [43] Watkins C J C H and Dayan P. Q-learning [J]. Machine Learning, 1992, 8:279-292
    [44] Jing Peng and Ronald J W. Incremental multi-step Q-learning [J]. Machine Learning, 1996, 22:283-291
    [45] Szepesvari C. The asymptotic convergence-rate of Q-learning [A]. Proceedings of Neural Information Processing Systems [C], Cambridge, MA: The MIT Press, 1997, 1064-1070
    [46] Werbos P J. A menu of designs for reinforcement learning over time[A]. In : Miller T ,Sutton R S ,Werbos P J. Neural Networks for Control [M]. Cambridge,MA : The MIT Press, 1990,25 - 44
    [47] Singh S P. Reinforcement learning algorithms for average payoff Markovian decision processes [A]. Proc. 12th National Conf. on Artificial Intelligence [C], Menlo Park, CA, USA: AAAI Press, 1994, 202-207
    [48] Singh S P. Reinforcement learning with replacing eligibility traces[J].Machine Learning,1996,22:159-195
    [49] Schwartz A. A reinforcement learning method for maximizing undiscounted rewards [A]. Proc. 10th Int. Conf. Machine Learning[C], Morgan Kaufmann, San Mateo, CA, USA, 1993, 298-306
    [50] Mahadevan S. Average reward reinforcement learning: foundations, algorithms and empirical results [J]. Machine Learning, 1996, 22:159-195
    [51] Tadepalli P and Ok D. Model-based average reward reinforcement learning[J]. Artificial Intelligence, 1998, 100:177-224
    [52] Williams R J. Simple statistical gradient-following algorithms for connectionist reinforcement learning[J]. Machine Learning, 1992, 8:229-256
    [53] Tesauro G. Practical issues in temporal difference learning [J].Machine Learning ,1992,8:257-277
    [54] Sutton R S. The challenge of reinforcement learning [J].Machine Learning,1992,8:225-227
    [55] Winfried Ilg and Karsten Berns. A learning architecture based on reinforcement learning for adaptive control of the walking machine LAURON[J]. Robotics and Autonomous Systems, 1995, 15:323-334
    [56] Sebastian T and Mitchell T M. Lifelong robot learning[J]. Robotics and Autonomous System, 1995,15:25-46
    [57] 阎平凡,再励学习——原理、算法及其在智能控制中的应用[J],信息与控制,1996,25(1):28-34
    [58] 俞星星,阎平凡。强化学习系统及其基于可靠度最优化的学习算法[J].信息与控制,1997,26(1):332-339
    [59] Xu Ningshou, Wu Zhanglei and Chen Liping. A learning modified generalized predictive controller [A]. 1991 IFAC Symposium on Intelligent Tuning and Adaptive Control (ITAC'91) [C], Singapore, 1991, 231-236
    [60] 杨璐,洪家荣,黄梯云。用加强学习方法解决基于神经网络的时序实时建模问题[J]。哈尔滨工业大学学报,1996,28(4):136-139
    [61] 马莉,蔡自兴。再励学习控制器结构与算法[J]。模式识别与人工智能,1998,11(1):96-100
    [62] 张汝波,顾国昌,张国印。智能机器人行为学习方法研究[A]。中国科协第二届青年学术年会论文集[C],北京,1998,469-471
    [63] 张汝波,周宁,顾国昌,张国印。基于强化学习的智能机器人避碰方法研究[J]。机器人,1999,21(3):204-209
    [64] 张汝波。强化学习研究及其在AUV导航系统中的应用[D]。哈尔滨:哈尔滨工程大学,1999
    [65] 张汝波,顾国昌,刘照德,王醒策。强化学习理论、算法及应用。控制理论与应用.2000,17(5):638-642
    [66] 蒋国飞,吴沧浦。基于Q学习算法和BP神经网络的倒立摆控制[J]。自动化学报,1998,24(5):662-666
    [67] 蒋国飞,高慧琪,吴沧浦。Q学习算法中网格离散化方法的收敛性分析[J]。控制理论与应用,1999,16(2):194-198
    [68] Gullapalli V. A stochastic reinforcement learning algorithm for learning real valued functions [J]. Neural Network, 1992,3(3):671-692
    [69] Tesauro G J. TD-gammon , a self-teaching backgammon program ,achieves master-level play [J]. Neural Computation, 1994,6 (2) :215-219
    [70] Tesauro G J.Temporal difference learning and TD-gammon[J].Communications of the ACM,1995,38 (3):58-68
    [71] Anderson C W. Learning to control an inverted pendulum using neural network [J]. IEEE Control System Magazine, 1989, 30 (4):31-36
    [72] Khan E. Reinforcement control with unsupervised learning [A]. Int. Joint Conference on Neural Network [C],Beijing, 1992,88-93
    [73] Berenji H R. Learning and tuning fuzzy logic controllers through reinforcements [J]. IEEE Trans. on Neural Networks, 1992, 3(5):724-740
    [74] Whitley D ,Dominic S ,Das R and Aanderson C W. Genetic reinforcement learning for neuro control problems [J].Machine Learning, 1993,13:259-284
    [75] Anderson C W and Hittle D C. Synthesis of reinforcement learning neural network and PI control applied to a simulated heating coil [J].Artificial Intelligence Engineering, 1997,11:421-429
    [76] Krose B J A and Van Dam J W M. Adaptive state space quantisation for reinforcement learning of collision-free navigation [A]. Proc. of the 1992 IEEE Int. Conference on Intelligent Robots and Systems[C],Raleigh,NC,USA. 1992,1327-1332
    [77] Millan J D R and Torras C. A reinforcement connectionist approach to robot path finding in non-maze-like environments [J]. Machine Learning, 1992, 8:363-395
    [78] Dillmann K B R and Zachmann U. Reinforcement learning for the control of an autonomous mobile robot [A]. Proc. of the 1992 IEEE Int. Conference on Intelligent Robots and Systems [C], Raleigh,NC, USA, 1992,1808 - 1814
    [79] Lin Longji. Self-improving reactive agent based on reinforcement learning,planning and teaching [J]. Machine Learning, 1992,8:293-321
    [80] Pushkar P and Abdul S. Reinforcement learning of iterative behavior with multiple sensors [J ]. Journal of Applied Intelligence,1994, 4(5) :381-365
    [81] Touzet C F. Neural reinforcement learning for behavior synthesis [J].Robotics and Autonomous System, 1997,22:251-281
    [82] Caironi PVC, Dorigo M. Training and delayed reinforcements in Q-learning agents [J]. Int. J. of Intelligent Systems, 1997,12:659-724
    [83] Beom H B. A sensor-based navigation for a mobile robot using fuzzy logic and reinforcement learning[J].IEEE Trans. on Systems, Man,and Cybernetics, 1995,25(3):464- 477
    [84] Crites R H and Barto A G. Improving elevator performance using reinforcement learning[A] . In : Touretzky D S ,Mozer MC , and M EH. Advances in Neural Information Processing Systems [M].Cambridge, MA :The MIT Press, 1995, 1017-1023
    [85] Majal M. A comparative analysis of reinforcement learning methods [R]. Cambridge, MA: Massachusetts Institute of Technology, ADA259893,1991
    [86] Lima p and Beard R. Using neural networks and dyna algorithm for integrated planning, reacting and learning in systems [ R] . Troy, New York : Rensselaer Polytechnic Institute, NASA93-24743-1993
    [87] Baird L C III. Learning with high dimension, continuous action[R]. Washington DC: Wright Laboratory, ADA280844, 1993
    [88] Zeng Dajun and Katia S. Using case-based reasoning as reinforcement learning framework for optimization in changing criteria [R]. Pittsburgh, Pennsylvania: Carnegie Mellon University, AD-293602, 1995
    [89] Anderson C W. Strategy learning with multi-layer connectionist representation[A]. Int. Conference on Machine Learning [C], Morgan Kaufmann, San Mateo, CA, USA, 1987,103 - 114
    [90] Mills P M and Zomaya A Y. Reinforcement learning using back propagation as building block[A]. IEEE Int. Conference on Neural Network[C],Singapore, 1991, 1554-1559
    [91] Kokar M M and Reveliotis S A. Reinforcement learning: architectures and algorithms [J]. Int. J. of Intelligent Systems, 1993, 8:875-894
    [92] Moore A W and Atkeson C G. Prioritized sweeping: reinforcement learning with less data and less time [J]. Machine Learning, 1993, 13:103-130
    [93] Pack K L . Associative reinforcement learning: function in K-DNF [J]. Machine Learning, 1994, 15:279-293
    [94] 郭茂祖,陈彬,王晓龙。加强学习[J]。计算机科学,1993,25(3):13-15
    [95] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279—292,1992.
    [96] Leslie Pack Kaelbling. Learning in Embedded Systems. The MIT Press, Cambridge, MA, 1993.
    [97] Richard S. Sutton. Integrated architectures for learning,planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, Austin,TX, 1990. Morgan Kaufmann.
    [98] J. H. Schmidhuber.Curious model-building control systems. In Proc. International Joint Conference on Neural Networks, Singapore, volume 2, pages 1458—1463. IEEE, 1991.
    [99] Andrew W. Moore and Christopher G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13,1993.
    [100] David H. Ackley and Michael L. Littman. Generalization and scaling in reinforcement learning. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 550-557, San Mateo, CA, 1990. Morgan Kaufmann.
    [101] 黄永宣。自动平衡倒置摆系统—一个有趣的经典控制理论教学实验装置[J]。控制理论与应用,1987,4(3):92-95。
    [102] Furuta K, T Okutani, H Sone. Computer Control of a Double Inverted Pendulum[J]. Computer and Elect. Engrg, 1978, 1(5):67-84.
    [103] Mori S, Nishihara H, Furuta K. Control of Unstable Mechanical System: Control of Pendulum[J]. Int. J. Control, 1976, 23(5):673-692.
    [104] Furuta K, Kajiwara H, Kosuge K. Digital Control of a Double Inverted Pendulum on an Inclined Rail [J]. Int. J. Control, 1980, 32(5):907-924.
    [105] 梁任秋,赵松,唐悦,等。二节倒立摆的数字控制器设计[J]。控制理论与应用,1987,4(1):115-124。
    [106] 尹征琦,冯祖仁,陈辉堂。采用模拟调节器的二级倒立摆的控制[J]。信息与控制,1985,1:6-10。
    [107] 李士勇。模糊控制·神经控制和智能控制论[M]。哈尔滨:哈尔滨工业大学出版社,1998。
    [108] 窦振中。模糊逻辑控制技术及其应用[M]。北京:北京航空航天大学出版社,1995. 124-127。
    [109] 张乃尧。倒立摆的双闭环模糊控制[J]。控制与决策,1996,11(1):85-88。
    [110] Peng J. Efficient dynamic programming-based learning for control [M]. USA: Northeastern University, 1993.
    [111] 张乃尧,阎平凡。神经网络与模糊控制[M]。北京:清华大学出版社,1998. 252-261。
    [112] Berenji H R, Khedkar P. Learning and tuning fuzzy logic controllers through reinforcements[J]. IEEE Trans. on Neural Networks, 1992, 3(5):724-740.
    [113] 邢-,王磊,戴冠中。基于BP算法的自适应模糊控制系统研究[J]。控制理论与应用,1996,13(6):797-801。
    [114] 史晓霞,张振东,李俊芳。杨屹。二阶倒立摆系统数学模型的建立及意义。河北工业大学学报,2001,30(5):48-51
    [115] 黄苑虹,梁慧冰。从倒立摆装置的控制策略看控制理论的发展和应用。广东工业大学学报,2001,18(3):49-53
    [116] Sutton, R., "Generalization in reinforcement learning: Successful examples using sparse coarse coding", Advances in Neural Information Processing Systems 8, 1996.
    [117] Standfuss, A., Eckmiller, R., "To swing up an inverted pendulum using stochastic real-valued reinforcement learning",Proceedings of the International Conference on Artificial Neural Networks. Part 1 (of 2), p 655, May 26-29 1994, Sorrento, Italy.
    [118] Hirashima, Yoichi, Iiguni, Youji, Inoue, Akira, Masuda, Shiro, "Q-learning algorithm using an adaptive-sized Q-table", Proceedings of the IEEE Conference on Decision and Control, v 2, 1999, p 1599-1604.
    [119] Mustapha, Sidi M., Lachiver, Gerard, "Modified actor-critic reinforcement learning algorithm", Canadian Conference on Electrical and Computer Engineering, v 2, 2000, p 605-609.
    [120] Yan, X.W., Deng, Z.D., Sun, Z.Q., "Genetic Takagi-Sugeno fuzzy reinforcement learning", IEEE International Symposium on Intelligent Control - Proceedings, 2001, p67-72.
    [121] Xin Xu, Han-gen He, Dewen Hu, "Efficient Reinforcement Learning Using Recursive Least-Squares Methods", Journal of Artificial Intelligence Research 16, 2002, 259-292.
    [122] Craig Hennessey, Derek Young, "modern control of an inverted pendulum", ENSC483 project report.
    [123] Sun Chengyi, Sun Yah, "Mind-Evolution-Based Machine Learning: Framework and The Implementation of Optimization", Proc. of IEEE Int. Conf. on Intelligent Engineering Systems (INES'98), 355-359.
    [124] J.S. Albus, (1975a) "A New Approach to Manipulator Control: The Cerebellar Model Articulation Controller (CMAC)," Journal of Dynamic Systems, Measurement and Control, American Soc. of Mechanical Engineers, Sept, 1975.
    [125] Danbing Seto, Lui Sha. Technical Report CMU/SEI-99-TR-023, ESC-TR-99-023, Pittsburgh, November 1999.
