加速强化学习方法研究 (Research on Methods for Accelerating Reinforcement Learning)
Abstract
Reinforcement learning, with its biological relevance and learning autonomy, has attracted great attention in the fields of machine learning and artificial intelligence and has demonstrated its value in many application domains. However, slow learning and low learning efficiency have long been serious obstacles to applying reinforcement learning to complex problems with large state spaces.
    Two classes of methods are currently effective at accelerating reinforcement learning. One is hierarchical reinforcement learning, which speeds up learning through task decomposition; the other is shaping reinforcement learning, which speeds up learning by guiding the agent so as to reduce its search space. Both share the same defect: decomposing the learning task and providing the shaping signal depend on an outside observer. The acceleration either class can deliver is therefore limited by the outside observer's ability to handle the problem at hand; if the observer cannot split the problem into subtasks or supply a shaping signal, both classes of methods fail.
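    For concreteness, the best-known form such external guidance takes in the shaping family (a standard result, not specific to this thesis) is potential-based reward shaping: the observer supplies a potential function Φ over states, and the agent learns from the augmented reward

        r'(s, a, s') = r(s, a, s') + γΦ(s') - Φ(s),

    which Ng, Harada and Russell showed leaves the optimal policy unchanged. The limitation pointed out above is precisely that Φ, like the subtask hierarchy in the hierarchical case, has to be designed by someone outside the agent.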
    This thesis proposes an accelerated reinforcement learning method in which the knowledge and experience acquired in the agent's earlier learning are used to decompose the learning task and to provide shaping signals for its subsequent learning: Shaping-Bayesian-Network-based Reinforcement Learning (SBN-RL). The method accelerates learning both through task decomposition and by shaping the agent to reduce its search space, and both the task decomposition and the shaping signals are derived entirely from the knowledge and experience the agent accumulated in its earlier learning. This removes the dependence on an outside observer altogether, resolves the problem that the acceleration achievable by traditional methods is constrained by the outside observer, and enables the agent not only to learn autonomously but also to accelerate its learning autonomously.
    From the state-action transition sequence obtained in each training episode, the method first derives a chain (State-Cluster) that represents the local state-space knowledge and local transition knowledge gained in that episode. The chains accumulated over many training episodes are then used to build the agent's cognitive model of the entire global state space, the shaping Bayesian network (SBN), which represents and records the knowledge and experience the agent has accumulated during learning. Taking the critical states of the SBN, the states the agent must pass through on any path to the goal state, as phased subgoals, the whole learning task can be decomposed into a series of smaller learning subtasks. This yields the same acceleration through task decomposition as traditional hierarchical reinforcement learning, except that here the decomposition is performed by the agent itself from the SBN it has built, with no dependence on an outside observer. Likewise, the critical states of the SBN, arranged in layers according to their distance from the goal state, give the agent guidance that covers the whole state space from the initial state to the goal state, yielding the same acceleration through a reduced search space as traditional shaping reinforcement learning; but the shaping signals come entirely from the SBN the agent built itself, again completely removing the dependence on an outside observer.
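    The abstract does not spell out how the SBN is constructed, but the following sketch illustrates the underlying intuition: record the state sequence of each successful episode, treat states that occur in every successful trajectory (other than the start and the goal) as candidate "must-pass" critical states, and layer each candidate by its observed distance to the goal. All identifiers are illustrative assumptions, not the thesis's implementation.

        from collections import defaultdict

        def find_critical_states(successful_trajectories, goal_state):
            """Candidate 'must-pass' states: states visited by every successful episode.

            successful_trajectories: list of state sequences, each ending at goal_state.
            Returns a dict mapping each candidate critical state to its layer, i.e. the
            shortest observed number of steps from that state to the goal. (Illustrative
            only; the thesis builds a shaping Bayesian network from accumulated chains.)
            """
            # States that occur in every successful trajectory.
            common = set(successful_trajectories[0])
            for traj in successful_trajectories[1:]:
                common &= set(traj)
            common.discard(goal_state)
            common.discard(successful_trajectories[0][0])  # drop the shared initial state

            # Layer each candidate by its shortest observed distance to the goal.
            layer = defaultdict(lambda: float("inf"))
            for traj in successful_trajectories:
                for i, s in enumerate(traj):
                    if s in common:
                        layer[s] = min(layer[s], len(traj) - 1 - i)
            return dict(layer)

        # Two episodes through a small grid world: state "C" lies on every successful
        # path, so it becomes a phased subgoal two steps from the goal on the shortest
        # observed trajectory.
        episodes = [["S", "A", "C", "D", "G"],
                    ["S", "B", "C", "E", "F", "G"]]
        print(find_critical_states(episodes, "G"))   # {'C': 2}

    A plain intersection of this kind is of course far weaker than the probabilistic model the thesis actually builds; the point is only that must-pass states and their layers can in principle be read off the agent's own accumulated trajectories.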
    Building the shaping Bayesian network from the accumulated chains, so that the agent can decompose its learning task and shape its own learning autonomously, and can therefore not only learn by itself but also accelerate its learning by itself, is the most important contribution of this thesis. The ability to decompose learning tasks and guide learning autonomously is a basic precondition for extending reinforcement learning to complex, large-state-space problems that even an outside observer can hardly grasp or solve.
    In the course of implementing the SBN-RL method, the thesis further studies how chains can be used to speed up convergence of the value function and thus accelerate learning, and how multiple agents can share chains to accelerate learning; how an entire layer of critical states in the SBN can serve as a phased subgoal when no single obvious "must-pass" critical state exists; and how gate states can be discovered and used together with critical states to partition the original state space. It proves that the optimal solution composed from the phased optimal solutions found in the local state spaces is equivalent to the optimal solution found in the original state space, and it discusses how the SBN can be used to improve and extend some existing work on accelerating reinforcement learning.
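    Again as an illustration rather than the thesis's algorithm: one simple way a stored chain of transitions can speed up value-function convergence is to replay it backwards, so that the reward received near the goal propagates through the entire chain in a single sweep instead of one step per episode; a chain shared by several agents can be replayed by each of them in the same way. The Q-update below is the standard tabular rule; the function and argument names are assumptions for the sketch.

        def replay_chain_backwards(Q, chain, actions, alpha=0.1, gamma=0.95):
            """One backward sweep of tabular Q-learning updates over a stored chain.

            Q:       dict mapping (state, action) -> value; missing entries default to 0.
            chain:   list of (state, action, reward, next_state) transitions in the order
                     they were experienced during one episode.
            actions: callable returning the actions available in a given state.
            """
            for state, action, reward, next_state in reversed(chain):
                best_next = max((Q.get((next_state, a), 0.0) for a in actions(next_state)),
                                default=0.0)
                old = Q.get((state, action), 0.0)
                Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
            return Q

        # A three-step chain whose final transition is rewarded: after one backward
        # sweep the value has already reached the first state of the chain.
        chain = [("s0", "right", 0.0, "s1"),
                 ("s1", "right", 0.0, "s2"),
                 ("s2", "right", 1.0, "goal")]
        Q = replay_chain_backwards({}, chain, actions=lambda s: ["left", "right"])
        print(Q[("s0", "right")])   # small but nonzero after a single sweep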
    Finally, the SBN-RL method is validated on a practical problem with a fairly large state space: the optimal control of traffic lights at multiple intersections. For this purpose a multi-intersection urban traffic simulator, MIUTS, was developed, and the SBN-RL method was then applied to the multi-intersection traffic-light optimal control problem in the MIUTS environment, with the objective that all vehicles entering the urban traffic network pass through and leave it in the shortest possible time. This is a very typical multi-agent learning problem with a fairly large state space. The experimental results show that SBN-RL can effectively construct the shaping Bayesian network, clearly partition the phased subtasks, and guide the agents throughout learning so as to reduce the search space. When SBN-RL decomposes the learning task into two subtasks, it needs at least 60% less learning time than the traditional reinforcement learning method Q-learning to reach the same optimal solution, and compared with conventional fixed-timing control, running the traffic lights with the optimal solution found by SBN-RL shortens the time for all vehicles to leave the urban traffic network by 20-30%. SBN-RL is therefore highly effective for multi-agent learning problems of this kind with fairly large state spaces.
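    The MIUTS interface is not described in the abstract, so the fragment below only sketches the usual shape of such an experiment: one independent tabular Q-learner per intersection, each choosing a signal phase from its local observation, with a shared reward that penalises the time vehicles spend inside the network. Every identifier on the environment side (the MIUTS-like env, observe, set_phase, step) is a hypothetical placeholder, not the simulator's actual API.

        import random

        def run_episode(env, Q, intersections, phases, alpha=0.1, gamma=0.95, eps=0.1):
            """One training episode with an independent Q-learner per intersection.

            env is assumed (hypothetically) to expose observe(i), set_phase(i, phase)
            and step() -> (reward, done); Q maps (intersection, state, phase) -> value.
            The reward is taken to be a negative penalty while vehicles remain in the
            network, so maximising it pushes all vehicles through as quickly as possible.
            """
            done = False
            while not done:
                states, chosen = {}, {}
                for i in intersections:
                    s = env.observe(i)                     # local queue/phase observation
                    if random.random() < eps:              # epsilon-greedy phase choice
                        a = random.choice(phases)
                    else:
                        a = max(phases, key=lambda p: Q.get((i, s, p), 0.0))
                    env.set_phase(i, a)
                    states[i], chosen[i] = s, a
                reward, done = env.step()                  # advance the simulation
                for i in intersections:
                    s, a = states[i], chosen[i]
                    s2 = env.observe(i)
                    best = max(Q.get((i, s2, p), 0.0) for p in phases)
                    Q[(i, s, a)] = Q.get((i, s, a), 0.0) + alpha * (
                        reward + gamma * best - Q.get((i, s, a), 0.0))
            return Q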
    Since the agent can build a shaping Bayesian network from its own learned knowledge and experience and then use it to speed up its subsequent learning, the work in this thesis does indeed enable the agent to accelerate its learning autonomously.
