Event Review (Part 2) | Machine Learning System Trends: A Summary of Expert Insights

Author: Sanyang, Li Baozhu, Li Weidong, Yudi, xixi
Editor: Li Baozhu
In the era of big models, machine learning systems are undergoing unprecedented changes. The rapid expansion of model size has allowed us to witness a huge improvement in AI capabilities. However, this improvement has not only brought new opportunities to various fields, but also led to a series of new technical challenges and practical difficulties.
On December 16, the 2023 Meet TVM · Year-End Party was successfully held at the Shanghai Entrepreneurs Public Training Base. Feng Siyuan, Apache TVM PMC member and Ph.D. at Shanghai Jiao Tong University, served as the host and led a wide-ranging, multi-angle discussion with four guests on the theme "Machine Learning Systems in the Era of Big Models".
The four guests of this roundtable discussion are:
* Wang Chenhan, founder and CEO of OpenBayes Bayesian Computing
* Wu Zhao, head of NIO’s autonomous driving AI engine
* Jin Lesheng, Machine Learning System Engineer at OctoML
* Zhu Hongyu, Machine Learning System Engineer at ByteDance

From left to right: Feng Siyuan, Wang Chenhan, Wu Zhao, Jin Lesheng, Zhu Hongyu
We have summarized this conversation below while preserving the speakers' original intent. Come and hear the guests' insights.
Machine Learning Systems in the Era of Big Models
Stage 1: Discussion Speech
At this stage, big models are the hot topic in every field, whether in the cloud, on the device side, or in vehicles (Tesla FSD V12). All of our guests have run into system optimization problems in the training and deployment of big models, either in their day-to-day work or in discussions. Please take turns introducing the main challenges you have encountered and your solutions.
Wang Chenhan: OpenBayes began training single-modal models in June this year and ranks fifth among domestic large-model startups on the SuperCLUE leaderboard. From the perspective of large-model training technology, the core problem we now face is network latency. Essentially, no chip can run at full capacity on its own cluster.
According to "Scaling Kubernetes to 2,500 Nodes" on OpenAI's official website, peak GPU utilization when training GPT-3 should not exceed 18%, and average utilization is about 12-15%. This means that if you spend 100 million to build a cluster, only 12-15 million of that investment does useful work. From a financial perspective, maximizing data parallelism, pipeline parallelism, and tensor parallelism is actually the biggest challenge in training.
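The financial arithmetic above is straightforward to make explicit. A minimal sketch, using the utilization figures quoted above and the hypothetical 100-million cluster build-out:

```python
# Rough estimate of how much of a cluster investment does useful work
# when average GPU utilization is low (figures quoted above).

def effective_spend(cluster_cost: float, avg_utilization: float) -> float:
    """Portion of the cluster investment that translates into useful compute."""
    return cluster_cost * avg_utilization

cluster_cost = 100_000_000   # hypothetical 100M cluster
low, high = 0.12, 0.15       # average GPU utilization quoted for GPT-3 training

print(effective_spend(cluster_cost, low))   # ~12 million doing useful work
print(effective_spend(cluster_cost, high))  # ~15 million doing useful work
```

The point of the sketch: every percentage point of utilization recovered by better data/pipeline/tensor parallelism translates directly into millions of effective spend, which is why Wang frames parallelism as a financial problem.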
The challenges of deployment and inference in China are mainly complex engineering problems. If GPU memory bandwidth is poor, PCIe optimization is quite troublesome. OpenBayes and many upstream and downstream manufacturers use vLLM, which saves a great deal of engineering work and greatly reduces the inference workload.
Jin Lesheng: The challenges we encountered fall into two main areas:
1. TVM and MLC-LLM run 7B models at good speed, but a single card sometimes cannot hold a larger model such as 70B. Last quarter we tried using tensor parallelism to solve this; the solution has now been open-sourced, and you can try it if you are interested.
2. There is another requirement: we currently only support batch size = 1. That is fine for individual use, but as a serving solution it is far inferior to vLLM, which is what we are developing now.
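The gap between batch size = 1 and a batched serving engine such as vLLM can be illustrated with a toy throughput model. This is a deliberate simplification (uniform request lengths, no prefill cost, and a batched decode step assumed to cost about the same as a single-request step, which roughly holds while decoding is memory-bandwidth-bound):

```python
import math

def decode_steps(num_requests: int, tokens_per_request: int, batch_size: int) -> int:
    """Total decode steps to serve all requests when up to `batch_size`
    requests share each step (toy model: no prefill, uniform lengths)."""
    waves = math.ceil(num_requests / batch_size)
    return waves * tokens_per_request

# 32 requests, 128 generated tokens each
sequential = decode_steps(32, 128, batch_size=1)   # 4096 steps
batched = decode_steps(32, 128, batch_size=32)     # 128 steps
print(sequential // batched)                       # 32x fewer decode steps
```

Under these assumptions, serving requests one at a time costs roughly `batch_size` times more decode steps than batching them, which is why a batch-size-1 engine cannot compete with vLLM as a serving backend.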
Feng Siyuan: I would add that the dominant trend in inference is still unclear. Although Transformer is the mainstream architecture for large models, the methods are still changing in many ways, so whether Transformer can unify large models remains an open question. In scenarios where both the upper and lower layers are uncertain, customizability and agile development may matter more than traditional TVM end-to-end compilation. In my view, there is still plenty of room for improvement in both inference and training of large models.
Stage 2: Targeted Questioning
As the US ban has tightened, the restrictions have expanded from training cards to large-model inference cards. In the short term, what is the most cost-effective solution for large-model cloud inference (while gaming and graphics cards are still allowed)? How long will it take domestic NPUs and GPUs to fill the gap in inference?
Wang Chenhan:The sizes of training and inference models are different, and the usage scenarios and business loads are different, so it is difficult to come up with a unified answer.
For device-side selection, the domestic Rockchip RK3588 is a good option: decent performance and cost-effectiveness, a relatively universal technology stack, and it is cheap and easy to obtain. In addition, NVIDIA's Orin is essentially a cut-down Ampere GPU. Under a q4f16 budget, Orin is under little pressure to run 7B, 14B, or even 34B models, from memory capacity through inference.
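The memory-capacity claim can be sanity-checked with rough arithmetic: in a q4f16 scheme the weights are stored at roughly 4 bits per parameter. This sketch ignores activations, the KV cache, and quantization overhead, so real footprints are somewhat higher:

```python
def q4_weight_gb(params_billion: float) -> float:
    """Approximate weight footprint in GB at 4 bits (0.5 bytes) per parameter."""
    bytes_total = params_billion * 1e9 * 4 / 8
    return bytes_total / 1e9

for size in (7, 14, 34):
    print(f"{size}B -> ~{q4_weight_gb(size):.1f} GB of weights")
# 7B -> ~3.5 GB, 14B -> ~7.0 GB, 34B -> ~17.0 GB: all within reach of
# Orin-class unified memory, which is the point being made above.
```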
For cloud selection, NVIDIA subsequently disclosed three chips: the H20, L20, and L2. According to NVIDIA's official information, their large-model inference performance is probably 70-80% of the L40's. Although the A6000 was later added to the banned list, inventory is relatively large. The A6000's advantage is its large memory, 48 GB, with NVLink; install a pair of them and you get roughly 130% of an A100.
From our contacts with domestic chip manufacturers, we know that everyone is indeed trying to optimize the single Attention Backbone as much as possible.
Stage 2: Targeted Questioning
In the field of domestic chips, how long do you think it will take for a company to truly achieve success on the inference side and be able to take away Nvidia's market share?
Wang Chenhan: I think domestic chip companies will be able to take more than 20% of NVIDIA's market share within 18 months. The main basis for this judgment is that China's favorable policies and continued US sanctions have pushed up the localization rate. Moreover, as far as I know, some domestic manufacturers are already 92% or more compatible with NVIDIA's CUDA instructions and APIs. So I remain very confident in the 18-month prediction.
Stage 2: Targeted Questioning
Why did NIO choose TVM? What are the advantages of TVM in the field of autonomous driving?
Wu Zhao: First of all, it is certainly because I have a TVM technical background, so when building a team I gave priority to TVM. Secondly, in actual projects, an important criterion for judging whether a technology choice is reasonable is whether its architecture can meet business needs.
Autonomous driving is a very complex application scenario with stricter architectural requirements. When choosing a technical route, you must weigh project requirements against project timelines. For NIO's autonomous driving business, the first model, the ET7, was scheduled to begin delivery in March 2022. At that time our team had only half a year to handle the complex models of autonomous driving, so we had to choose an end-to-end solution. Many peer companies were using TensorRT. The problem with TensorRT is that models keep getting more complex and the requirements keep getting stranger, which makes it unsuitable in the long run.
The first issue in autonomous driving is how to fully control performance, accuracy, and other metrics on the vehicle side. Because autonomous driving must handle many corner cases, the algorithm team mostly trains models in the cloud and then deploys them to the vehicle. In this process, if you use the TensorRT black box, you can never fully control its quantization algorithm, and quantization is very important to us.
In addition, MLIR is well suited to traditional compilers, but it demands a large upfront time investment. Given our tight deadline and the need for an end-to-end solution, we gave up on MLIR after evaluation.
Finally, for autonomous driving, the stability of the overall deployment and low CPU usage are crucial. We therefore needed a solution that could be fully controlled and could keep CPU usage down, which a black box cannot deliver.
In summary, the all-white-box TVM was the best option for us at the time.
Stage 3: Discussion Speech
At present, large models and autonomous driving models are both tightly coupled to their platforms, and algorithms, systems, and even chips will co-evolve. Could the guests share their views on this?
Wang Chenhan: I think DSAs and GPGPUs are likely to be interdependent; neither can do without the other. Future chip architectures will not be built around Attention alone. Many new technologies and products have recently emerged in the community, such as Mistral's 7B MoE, Microsoft's RetNet, and the rise of multimodality; the unification of the entire architecture landscape by large language models may turn out to be a brief illusion lasting from March to October of this year. The current AI architecture and the paradigm defined by NVIDIA will most likely persist for a while, but NVIDIA may not maintain its lead forever. There is no doubt that Attention will narrow the gap between NVIDIA and its followers, for example AMD's MI300X and certain domestic chips whose names are not convenient to mention publicly.
Looking at broader trends, the evolution of architectures centered on GPGPU will remain a long-term trend.
Wu Zhao: In real project experience, small changes are possible but big changes are hard. In other words, as long as business needs are basically met, fine-tuning and adaptation can be done for the hardware. However, if Transformer is required for good results but certain hardware supports Transformer very poorly, then from a business perspective we simply will not deploy on that hardware. That is the current state of the industry.
As for challenges, I think there will certainly be some, including those mentioned above, such as RWKV or other RNN-style architectures, which replace Attention's quadratic complexity with linear complexity. But there is a catch: this alone is not enough for the challenge to succeed, because in limited scenarios we can use compression or other means to meet the quality requirements. In that case, with RWKV's ecosystem and quality both behind Transformer's, users have no reason to abandon Transformer for RWKV.
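The complexity gap being described can be illustrated with a deliberately crude per-layer count of token interactions: self-attention compares every token with every other token, while an RNN-style mechanism such as RWKV processes tokens one state update at a time.

```python
def attention_cost(seq_len: int) -> int:
    """Pairwise token interactions in one self-attention layer: O(n^2)."""
    return seq_len * seq_len

def recurrent_cost(seq_len: int) -> int:
    """Sequential state updates in one RNN/RWKV-style layer: O(n)."""
    return seq_len

for n in (1_024, 8_192, 65_536):
    ratio = attention_cost(n) // recurrent_cost(n)
    print(f"seq_len={n}: attention costs {ratio}x the recurrent update count")
# The gap equals seq_len itself, so it widens as context lengths grow,
# which is exactly the appeal of linear-complexity challengers.
```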
So in my opinion, algorithms are the most important driving force. If the algorithmic quality is there, we may then consider other systems and chips on cost-performance grounds.
Jin Lesheng: My view is very similar to Mr. Wu's. I worked on machine learning before and published some AI papers, and I found that ML researchers rarely pay attention to latency or other system metrics; they care more about improving accuracy and reaching SOTA. So I think if a new model emerges whose quality completely surpasses Transformer, it will certainly become mainstream, and all hardware vendors and software stacks will adapt to it. The algorithm will still dominate.
Wang Chenhan: We have previously estimated that RWKV's training cost can drop to about one third at large parameter scales. When building a large-scale model, everyone depends on communication tooling and inter-node communication; once complexity drops from quadratic to linear, communication requirements fall as well.
Although the Attention mechanism began to attract people's attention in 2017, by crawling and analyzing global machine learning-related papers, we found that the number of papers published in 2022 alone exceeded the total of previous years.
There is no doubt that GPT-3, and then ChatGPT, was that milestone. Before ViT appeared, almost no one believed Attention could be used for vision tasks. We always need an event to prove the effectiveness of a model structure: either it works at huge parameter scale, or the mechanism is SOTA on some class of tasks. Looking at RWKV in this light, the reason it has not yet shown the potential to surpass Attention is probably the huge gap in investment budget; RWKV's potential is far from proven.
I think we should predict the Backbone after Attention based on the existing Backbone. At present, it seems that RWKV and Microsoft's RetNet have this potential.
Stage 3: Discussion Speech
Will the future deployment of large models be mainly on the client side or on the cloud?
Wu Zhao: I think the focus in the next 3-5 years will be on the device side. First, the product form of large models will certainly not be Chat alone; there will be many vertical large models. Self-driving cars, mobile phones, and micro robots are all terminal devices, and the demand and compute for this kind of inference are huge; it is unlikely any cloud could support so many scenarios and devices. At the same time, for latency-sensitive applications such as self-driving, end-to-cloud latency is also a factor that must be considered.
Wang Chenhan: It may take longer than we thought for large models to move beyond the cloud. Previously, the general belief was that the cloud would be the main deployment platform within 1-2 years and models would move to the device side in about 5 years. My own judgment is cloud-first within 3-4 years, and device-side within 5-8 years.
Take GPT-3.5 (20B) as an example: quantized to q4f16 it occupies 10 GB+. Leaving power consumption aside, dedicating 10 GB+ of a phone's storage to one model is not something everyone can accept today. In addition, the advance of chip manufacturing processes is slowing, and chip architectures will no longer progress as rapidly as in the past 20 years, so I don't think cloud models can move down to the device side soon.
Feng Siyuan: Regarding the expected development of Transformer, I agree with Chenhan: full separation from the cloud is basically impossible within 5 years. However, a new model might solve part of the compute problem. If you want to deploy a large model on a phone, raw compute is not lacking: an Android phone may have a 35T matrix unit, but that unit is single-batch, so it is largely useless for large-model inference. If a model appears that solves this kind of on-device inference problem, deployment would likely follow within half a year of its release. When such a model will appear, though, is hard to say.
The way models are produced, especially device-side models, is completely different from cloud models; it must be company-led. Manufacturers such as Qualcomm and Apple will design models specifically for deployment on phones or devices. For such a model to be effective it does not need to surpass Transformer, only to approach it. That suits the device side better, and it necessarily ties into differences in model design, training, and tasks.
Wu Zhao: The current mainstream approach is to build a large model in the cloud and then distill a small one. From a practical standpoint, we care more about supporting the business development of vertical applications; there is no need to deploy a model as large as LLaMA. In vertical scenarios, the parameter count may be 1-3B.
Wang Chenhan: Today we discussed architectures and backbones but did not consider data scale. By Shannon's information theory, a matrix of a given size can carry only a limited amount of information, and more aggressive compression inevitably brings losses. So if we want a certain level of capability, say GPT-3.5-level, which as just mentioned is 10 GB+, then even with a more efficient backbone we should expect no less than roughly 7 GB. To host a model of that scale, the device's storage can be expanded, but its compute requirement will not shrink.
I mentioned earlier that process iteration is slowing. In another 5-10 years, the performance we can squeeze out of a chip of a given size may not improve as much as it did in the past 3 years. This is a fact we can already see.
2024 Meet TVM · The future is promising
From Q1 to Q4 of 2023, we successfully held four offline meetups in Shanghai, Beijing and Shenzhen. We are very happy to gather engineers who are concerned about AI compilers in different cities and provide everyone with a platform for learning and communication. In 2024, we will continue to expand the TVM city map and sincerely invite all businesses and community partners to participate in co-creation in various forms. Whether it is recommending lecturers or providing venues and tea breaks, we welcome them all.
Let us work together to create the most active AI compiler community in China!

Friends who haven't watched the guests' talks can click "Event Review (Part 1) | 2023 Meet TVM series of events concluded successfully" to view the full recording~
Follow the WeChat public account "HyperAI 超神经元", reply to the keyword "TVM year-end party" in the background, and get the complete PPT of the guests.
You can also note "TVM Year-End Party", scan the QR code to join the event group, and get the latest event information~

Organizers and partners

As the organizer of this event, the MLC.AI community was established in June 2022. Led by Chen Tianqi, the main inventor of Apache TVM and a well-known young scholar in the field of machine learning, the team launched the MLC online course, which systematically introduced the key elements and core concepts of machine learning compilation.
In November 2022, with the joint efforts of MLC.AI community volunteers, the first complete TVM Chinese documentation was launched and successfully hosted on the HyperAI official website, further providing domestic developers interested in machine learning compilation with the basic settings for accessing and learning a new technology - documentation.
MLC Online Course: https://mlc.ai/
TVM Chinese Documentation: https://tvm.hyper.ai/

HyperAI is China's leading artificial intelligence and high-performance computing community, dedicated to providing high-quality public resources in the field of data science to domestic developers. So far, it has provided domestic download nodes for more than 1,200 public datasets, supported 300+ artificial intelligence and high-performance computing related term queries, included hundreds of industry terms and cases, launched thousands of public datasets and tutorials including large models, and hosted the complete TVM Chinese documentation.
Visit the official website:https://hyper.ai/

OpenBayes Bayesian Computing is a leading high-performance computing service provider in China. By grafting classic software ecosystems and machine learning models onto new-generation heterogeneous chips, it provides industrial enterprises and university scientific research with faster and easier-to-use data science computing products. Its products have been adopted by dozens of large industrial scenarios or leading scientific research institutes.
Visit the official website:https://openbayes.com/

CM Space (Xiamen) is a professional innovation park management company under China Merchants Group, operating the "CM Space" professional incubator in Xiamen. Rooted in the southeast coast and relying on China Merchants Group's three main business strengths in transportation, comprehensive urban and park development, and finance, it focuses on providing artificial intelligence startups with the application scenarios, model validation, seed-stage customers, and other resources they most urgently need in the early stages of development, helping AI companies incubate efficiently.

Shanghai Cloud Base, comprising the Shanghai Cloud Computing Innovation Base and Shanghai Big Data Innovation Base, is one of China's earliest national professional incubators, having promoted the development of the cloud computing industry from 0 to 1. Using a fund + base + platform model, centered on the digital economy and focused on sub-sectors such as cloud computing, cloud native, big data and artificial intelligence, and digital healthcare, it has gathered and incubated nearly a thousand outstanding companies at home and abroad. By connecting the four ecosystems of technology, users, capital, and services, it continues to run the "Scenario Innovation Lab" and "Digital Economy Listing Preparation Camp", building a digital economy industry accelerator.

Homevalley - a one-stop cross-border service platform for global enterprises, is committed to building a market-oriented enterprise service platform with entrepreneurial incubation bases, Homevalley talents, Homevalley enterprise services, and Homevalley cultural communication as its core content. Linking overseas think tanks and market resources in North America, Europe, and Asia, it provides services such as industrial park and incubation base operations, entrepreneurial training, corporate consulting services, investment and financing, overseas talent return development, and global innovation and entrepreneurship activities, while helping Chinese entrepreneurial companies go overseas. Homevalley aims to discover talents, cultivate talents, and achieve talents, helping outstanding young talents realize their dreams and forming a place of home for overseas returnees to start businesses and cultivate talents.