CNCC Technical Forum | Programming Languages and Compilers for AI Chips
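Tenured Associate Professor and doctoral supervisor in the Department of Computer Science and Technology, Tsinghua University. Secretary-General of the ACM China High Performance Computing Expert Committee and a Young Scientist of the Beijing Academy of Artificial Intelligence (BAAI). Main research areas include high-performance computing and compiler optimization. Research results have been published in major international conferences and journals in high-performance computing and related fields, including SC, PPoPP, ICS, MICRO, ASPLOS, ATC, CGO, NSDI, IEEE TPDS, and IEEE TC. The SC14 paper was selected as a Best Paper Finalist, the first time a scholar from mainland China was shortlisted for that award. Served as program committee chair of NPC 2018, program committee member of SC 2018/2019/2020 and PPoPP 2019/2020/2021, editorial board member of the international journal IEEE TPDS, and young editorial board member of FCS and JCST. Coach of the Tsinghua University student supercomputing team, which has won nine world championships under this guidance. In 2015 and 2018 the team swept the overall championships of the three major international supercomputing competitions, SC, ISC, and ASC, achieving a "grand slam". Recipient of the Ministry of Education Science and Technology Progress Award (First Prize), the CCF Outstanding Doctoral Dissertation Award, and the National Natural Science Foundation of China Excellent Young Scientists Fund.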
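Professor and doctoral supervisor in the Department of Computer Science and Technology, Tsinghua University. CCF Distinguished Member and Distinguished Speaker, Deputy Secretary-General of CCF, and honorary member of CCF YOCSEF. Main research areas are operating systems, programming languages, and parallel computing. Has served many times on the program committees of major international conferences in high-performance and parallel computing, such as OSDI, PPoPP, CGO, SC, ICS, PLDI, ASPLOS, and APSys. Also serves as Chair of the ACM China Council and Chair of ChinaSys, the ACM China operating systems chapter. Received the National Science and Technology Progress Award (Second Prize), the State Education Commission Science and Technology Progress Award (Second Prize), and the Beijing Science and Technology Progress Award (Second Prize), once each. Recipient of the National Science Fund for Distinguished Young Scholars.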
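Chair Professor at Peking University, Vice Dean of the School of Electronics Engineering and Computer Science, and Chair of the Department of Computer Science and Technology. Received a Ph.D. in information engineering from the University of Tokyo, Japan, in 1996. Previously served as Professor at the Graduate School of Information Science and Technology, University of Tokyo; Professor and department head at the National Institute of Informatics, Japan; and Changjiang Chair Professor at Peking University. Professor Zhenjiang Hu has long worked on programming languages and on software science and engineering, and has done pioneering work in programming language design, structured functional programming, automatic program synthesis and optimization, parallel programming, the design and implementation of bidirectional transformation languages, and software evolution and maintenance. He received Japan's national best doctoral dissertation award and the Fundamental Research Achievement Award of the Japan Society for Software Science and Technology. He is a Fellow of the Japan Federation of Engineering Societies, a Member of Academia Europaea, an IEEE Fellow, and an ACM Distinguished Scientist.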
Tianqi Chen is currently an Assistant Professor in the Machine Learning Department and the Computer Science Department at Carnegie Mellon University. He received his Ph.D. from the Paul G. Allen School of Computer Science & Engineering at the University of Washington, working with Carlos Guestrin on the intersection of machine learning and systems. He has created three major learning systems that are widely adopted: XGBoost, TVM, and MXNet (co-creator). He is a recipient of the Google Ph.D. Fellowship in Machine Learning.
Talk title: TVM: An Automated Deep Learning Compiler
Abstract: Data, models, and computing are the three pillars that enable machine learning to solve real-world problems at scale. Making progress on these three domains requires not only disruptive algorithmic advances but also systems innovations that can continue to squeeze more efficiency out of modern hardware. Learning systems are at the center of every intelligent application nowadays. However, the ever-growing demand for applications and hardware specialization creates a huge engineering burden for these systems, most of which rely on heuristics or manual optimization. In this talk, I will present a new approach that uses machine learning to automate system optimizations. I will describe our approach in the context of deep learning deployment problems. I will first discuss how to design invariant representations that can lead to transferable statistical cost models, and apply these representations to optimize tensor programs used in deep learning applications. I will then describe the system improvements we made to enable diverse hardware backends. TVM, our end-to-end system, delivers performance across hardware backends that is competitive with state-of-the-art, hand-tuned deep learning frameworks.
Zhihao Jia is an incoming Assistant Professor of Computer Science at CMU (starting Fall 2021). He obtained his Ph.D. at Stanford, working with Alex Aiken and Matei Zaharia. His research interests lie in the intersection of computer systems and machine learning, with a focus on building efficient, scalable, and high-performance systems for ML computations.
Talk title: Automated Discovery of Machine Learning Optimizations
Abstract: As an increasingly important workload, machine learning (ML) applications require different performance optimization techniques from traditional runtimes and compilers. In particular, to accelerate ML applications, it is generally necessary to perform ML computations on heterogeneous hardware and parallelize computations using multiple data dimensions, neither of which is even expressible in traditional compilers and runtimes. In this talk, I will describe my work on automated discovery of performance optimizations to accelerate ML computations. TASO, the Tensor Algebra SuperOptimizer, optimizes the computation graphs of deep neural networks (DNNs) by automatically generating potential graph optimizations and formally verifying their correctness. TASO outperforms rule-based graph optimizers in existing ML systems (e.g., TensorFlow, TensorRT, and TVM) by up to 3x by automatically discovering novel graph optimizations, while also requiring significantly less human effort. FlexFlow is a system for accelerating distributed DNN training. FlexFlow identifies parallelization dimensions not considered in existing ML systems (e.g., TensorFlow and PyTorch) and automatically discovers fast parallelization strategies for a specific parallel machine. Companies and national labs are using FlexFlow to train production ML models that do not scale well in current ML systems, achieving over 10x performance improvement. I will also outline future research directions for further automating ML systems, such as co-designing ML models, software systems, and hardware backends for end-to-end ML deployment.
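Abstract: High-performance intelligent processors, represented by the Cambricon platform, provide a general-purpose deep learning platform whose goal is to deliver strong computing power for current and future intelligent applications. Because future applications are diverse and unpredictable, providing a basic high-level programming language is an indispensable part of building and promoting the platform's ecosystem. To meet this need, we designed Bang, a general-purpose high-level programming language based on C and oriented to both applications and the platform, which addresses the flexible development of user-defined operators. We further apply deep compiler optimization techniques to fully exploit the chip's processing capability.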
Wei Lin is currently Senior Director of the Platform of Artificial Intelligence (PAI) and Chief Architect of the big-data computation platform at Alibaba. Lin has 15+ years of experience specializing in backend/infrastructure, distributed system development, storage, and large-scale computation systems, including batch, streaming, and machine learning.
Talk title: AI Compiler at Alibaba
Abstract: With emerging AI workloads and the growing diversity of computing hardware, AI compilers play a vital role in bridging the gap between the expressive flexibility of models and the underlying high-performance system implementation. In this talk, we will share our experience of deploying an AI compiler in Alibaba's production environment, including: 1. Large-scale deployment of our AI compiler in PAI (Platform of Artificial Intelligence) production clusters, where it has run stably for more than 6 months and saved tens of thousands of GPU hours. We will talk about our aggressive fusion and co-design strategy, in which a cost-based approach is used to find the optimal fusion plan to boost hardware efficiency. In addition, we will share the experience we gained in making sure the compiler can be enabled by default in a large-scale production cluster. 2. An automatic code generation framework named Ansor. This work has been accepted by OSDI 2020 and deployed in our production environment. Compared with existing search strategies, Ansor explores many more optimization combinations and thus can find high-performance programs that are outside the search space of existing state-of-the-art approaches. Our evaluation shows that Ansor improves the execution performance of deep neural networks on the Intel CPU, ARM CPU, and NVIDIA GPU by up to 3.8x, 2.6x, and 1.7x, respectively. 3. Our thoughts on the future direction of AI compilers from an industry perspective, such as the interplay between compiler, runtime, resource scheduling, and distributed execution. We would also like to raise some questions, looking forward to potential interaction between academia and industry.
Shin-Ming Liu is the Chief Architect at Xcalibyte.com. Shin-Ming started as a compiler developer in the early '80s. He has helped build compilation systems from scratch at various companies in Silicon Valley and has had wide influence on modern compilation systems, including the design of gcc and llvm. Besides in-depth compiler development work, Shin-Ming has been the Director of the Java C/C++ ToolChain Lab for HP-UX Server and Director of the HP Kernel Development Lab for the HP 3PAR Storage System, and has developed extensive insight into the computer ecosystem for high-performance computing and software development productivity.
Talk title: Matrix multiply: from 1 to 62806X speedup. Bridging the gap between productivity and performance
Abstract: John Hennessy, in his Stanford lecture, discussed a new era of computing with the challenge to improve GEMM by ~63,000 times. We will further elaborate his vision with a deep dive into the compilation and runtime techniques needed. We also suggest a possible roadmap to bring productivity and performance into his vision. We argue the need for an open source platform that enables multiple languages to coexist in the compilation and runtime while allowing individual chip/accelerator vendors to specialize for their target domain. We will analyze the technical challenges ahead and possible directions moving forward for a thriving industry in AI and data science.
- End -