From Computer Vision to Medical AI, a Conversation With Xie Weidi From Shanghai Jiao Tong University: Defining the Problem Is More Important Than Solving It

In 2012, the fabled "doomsday year", the mobile Internet entered a period of explosive growth. With 3G networks becoming widespread, smartphone prices falling, and messaging apps such as WeChat and Mi Chat rising rapidly alongside e-commerce and mobile payment, the field saw a new round of growth. As the foundation for all of these applications, the communications industry seemed to have very promising prospects.
"My understanding at the time was that communication technology was already very mature, and China was at the forefront of the world in terms of technology. The main disputes between countries were more about communication protocols, which had gone beyond the scope of technology," said Xie Weidi, who had completed four years of undergraduate studies at Beijing University of Posts and Telecommunications. Standing at a crossroads in his life, he frankly admitted, "I didn't really like this major. Of course, it's also possible that I didn't understand it well."
Soon afterwards, he chose to study abroad and change direction. He completed his master's, doctoral, and postdoctoral work in computer vision at University College London (UCL) and the University of Oxford. In 2022, he returned to China and joined Shanghai Jiao Tong University, bringing his experience in computer vision to medical artificial intelligence in an attempt to open up a new front.
Professor Xie Weidi's two shifts, from communications to computer vision and then from computer vision to medical artificial intelligence, mark two turning points in his career. The hesitation before each decision, the challenge of tackling a new field, and the sense of accomplishment when results were published are all part of his story.
Recently, HyperAI had the honor of conducting an in-depth interview with Professor Xie Weidi. Drawing on his personal experience, he described his move from computer vision to AI for Healthcare and offered an in-depth analysis of where the field is heading.
General medical AI systems can generate “intelligence emergence”
"Many people don't understand why I want to develop a general medical artificial intelligence system, when models for diagnosing and treating specific diseases are obviously more practical." Now that large models are being applied across every industry, the trade-off between specialization and generalization remains a focus of discussion. Specialized models can be more accurate and practical in a specific field, but their ability to generalize is limited. General models have broad knowledge that can connect different fields, but in any one field they are often not as capable as specialized models.
In Xie Weidi's view, specialized and general models each have advantages and disadvantages, "but developing a general medical AI system is something we must do." He believes that generality means the model can establish hidden connections between data of different modalities, producing so-called "intelligence emergence", which is crucial for diagnosing diseases, especially those with unclear causes. Take the problem of classifying pneumonia A versus pneumonia B: if both images and text are used for training, the multimodal data can be linked at a low level, allowing the model to identify the similarities and differences between the two kinds of pneumonia and classify them. If only images are used for training, the network may not be able to learn this relationship. "So, from the perspective of scientific discovery, the general model is of great value."
Building a general multimodal medical model requires injecting medical knowledge into it as comprehensively as possible. However, medical data is constrained by ethics, safety, quality, and other factors, and is generally difficult to obtain and use. To meet this challenge, Xie Weidi chose to bring the data collection approach of computer vision into the medical field, that is, crawling data from the Internet. "Of course, we know that large models trained this way cannot be used clinically, but the approach is a good way to develop talent and to train the team's ability to handle big data, such as collecting, organizing, and cleaning it."
For example, the team has collected more than 30,000 medical books, crawled 4 million medical articles from PubMed Central, and gathered medical papers and books from the Internet in eight languages, including Chinese, English, Russian, and Japanese, converting them into corpora that can be used to train language models.

The team also mined image-text data publicly available on the Internet, gathering more than 250,000 3D scans and more than one million 2D images from medical papers. In addition, to train a general segmentation model, the team standardized nearly 120 publicly available segmentation datasets of radiological images, comprising more than 30,000 2D/3D images and millions of pixel-level annotations and covering common radiological imaging modalities such as MR, CT, and PET. Recognizing how critical datasets are to medical AI research, the team will open source most of the datasets it has assembled.
When building a general model, the team hopes to jointly train on all of the multimodal data it has obtained, including images, text, genomics, and ECG signals, with lesion localization on images and text-level diagnosis and reporting as the most basic forms of output. Embedding medical knowledge is also essential to achieving general capability. "This is because hospital departments have different tasks, and doctors tend to focus on their own part. We hope that the general model can cover all of the examination information, form a step-by-step chain of thought when handling tasks, and complete tasks such as differential diagnosis," Xie Weidi explained.

When the mentor is "indifferent", quietly accumulate strength
As mentioned above, Xie Weidi applied computer vision methods to the medical field when developing a general medical AI system. He could do so because he had spent nearly 10 years on computer vision research and had built up deep expertise. His initial choice of this field, however, was something of a coincidence.
When he was an undergraduate, Xie Weidi studied at Beijing University of Posts and Telecommunications. "Because I was not interested in communications, my undergraduate grades were very poor. I was worried that I would not be able to find a job, so I chose to study abroad," he said with a smile.
In 2012, Xie Weidi entered University College London (UCL) to pursue a master's degree in computer vision. This time he had found a direction that interested him, and he took his studies very seriously. "My supervisor thought I was quite suited to research in this area, so he suggested that I pursue a doctorate." Because doctoral scholarships in the UK were scarce, the question he faced was whether to continue his studies at his own expense. "My supervisor recommended me to Oxford University, so even if I had to pay for it myself, the investment would be more worthwhile."
Fortunately, in 2014, in order to better advance the AlphaGo project, DeepMind decided to invest more in training AI talent and partnered with Oxford University to offer scholarships. Xie Weidi was among the first recipients of the Oxford-Google DeepMind full scholarship. The scholarship, worth nearly 1 million yuan, relieved his financial pressure just in time, but the real problem he then faced was that his two mentors' laissez-faire attitude almost prevented him from graduating.
"When I was doing my doctorate, I had two very strong mentors. One was Professor Andrew Zisserman in the field of computer vision, who is a member of the Royal Society and one of the founders of the CV field; the other was Professor J Alison Noble, who studied medical imaging and is a member of both the Royal Society and the Royal Academy of Engineering. At the time, each of them assumed that I would be more involved in the other's research, which left me in a situation where neither of them looked after me." The Visual Geometry Group (VGG) at the University of Oxford, where Xie Weidi was based at the time, had attracted wide attention for developing the convolutional neural network VGGNet, and its members generally enjoyed a very high reputation in the international academic community. He had to face not only the widening gap as his peers advanced rapidly, but also the constant search for new research topics.
Under the influence of AlphaGo, deep learning was extremely popular at the time, and Xie Weidi developed a strong interest in generative models and related topics. However, his mentor, Professor Andrew Zisserman, preferred research that was "not hot but more valuable". "At the weekly meeting, my classmates could report their progress to AZ, but I would often walk in with a stack of papers and walk out with a new stack to read." Meanwhile, because medical imaging data is strictly controlled in the UK and research could not proceed without data, he could not get feedback from his other mentor, J Alison Noble. "Up to the year before graduation, I had published only one workshop paper. I told my two mentors that I was afraid I would not be able to graduate if this continued."
As the saying goes, it was "a blessing in disguise". Because many of the topics he proposed were rejected by his supervisor and could not be pursued, he spent his spare time reading almost every computer vision paper of that era. This accumulation laid a solid foundation for his later research. As he put it, "I thought at the time that as long as my supervisor settled on my topic, I could finish it in a few days."
In 2018, with the support of his two mentors, Xie Weidi published 7 papers in computer vision, medical imaging, and other fields, and graduated successfully. AZ also recognized his ability and invited him to stay on as a postdoctoral researcher specializing in computer vision, which he did until he returned to China in 2022.

Knowledge is the most essential difference between computer vision and medicine
Balancing family and work troubles countless people, and Xie Weidi is no exception. "The decision to return to China was a sudden one. Although I had stayed on at Oxford and had seen an offer for an assistant professorship, I gradually realized that the environment there was not suited to my continuing in-depth research. On the other hand, as a new father, I did not have the money or the energy to support my family at the time."
In the author's view, Xie Weidi has a distinctive personality. Beyond the humility and pragmatism that scientific research values, he is also bold. As soon as he decided to return to China, he contacted domestic universities. He did not weigh titles such as "Outstanding Overseas Young Scholar", nor did he "compare offers from three universities". He sent his resume only to Shanghai Jiao Tong University and was hired.

Interestingly, Professor Zhang Ya of Shanghai Jiao Tong University played the role of "HR" during his onboarding, and the two had met through a published journal article. "In 2018, Professor Zhang Ya and her students wanted to reproduce the medical imaging papers I had published, so they added me on WeChat." That contact paved the way for his later return to China. After sending his resume to Professor Zhang Ya, he quickly received a reply. "Fortunately, the school moved the whole process along quickly."
After joining Shanghai Jiao Tong University, he continued his computer vision research and also began to delve into medical artificial intelligence. "At that time, I wanted to try my hand at AI for Science research. Since I had had a lot of exposure to medicine and health and was interested in it, I chose this direction."
Notably, when ChatGPT appeared in 2022, Xie Weidi decided to start with language and set aside medical imaging input, which was very popular at the time. "I think the most fundamental difference between medicine and computer vision is knowledge. Medicine is more about finding evidence and has systematic, standardized knowledge, but it is difficult to embed that knowledge into a vision model for medical images." In his vision, the team can embed medical knowledge into a language model and then align a visual model with the language model, so that the medical knowledge is passed on to the visual model.
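The idea of aligning a visual model with a language model can be illustrated with a CLIP-style symmetric contrastive loss, a common way to pull matching image and text embeddings together. The sketch below is only an illustration under stated assumptions: the article does not specify the team's actual training objective, and the function names, the toy embeddings, and the temperature value are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_row(logits, target_index):
    """Cross-entropy of one row of logits against the index of the correct match."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target_index]


def contrastive_alignment_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_embs[i] and text_embs[i] are assumed to describe the same case.
    The temperature is an illustrative value, not one taken from the article.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # sim[i][j]: scaled cosine similarity between image i and text j
    sim = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Each image should pick out its own text, and each text its own image.
    img_to_txt = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    txt_to_img = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (img_to_txt + txt_to_img)


# Toy check: correctly paired embeddings give a lower loss than swapped pairs.
aligned = contrastive_alignment_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_alignment_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In a setup like this, the text embeddings would come from a language model that already carries medical knowledge, and minimizing the loss would move the image encoder's outputs toward them.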
Perhaps through the influence of Professor Andrew Zisserman, Xie Weidi shows a keen intuition for scientific research. As he said of his mentor: "Many of AZ's topics do not chase short-term hot spots, but focus on long-term value." For example, when the team was developing the visual-language model PMC-CLIP, much of the work was being done for the first time, and the students could not fully understand the significance of the project. Why crawl all the papers on the Internet? Why extract images and annotations to train the model? "Even when we submitted the paper, MICCAI almost rejected it."
Some time later, however, vision-language models suddenly became popular, and the PMC-CLIP paper made the final list for MICCAI's Young Scientist Publication Impact Award, a recognition of the work. "At first, it was difficult for me to convince my students what the research was actually useful for. Maybe I was lucky because the topic I chose happened to be something that everyone became interested in later."
During the interview, Professor Xie Weidi mentioned "luck" many times: being admitted to Oxford University was luck; being among the first recipients of the Oxford-Google DeepMind scholarship was luck; being hired by Shanghai Jiao Tong University after returning to China was luck; even the choice of research direction and technical path was luck. In the author's view, though, most of this luck did not come from nowhere. Perhaps it was set up by an earlier action, or perhaps strength accumulated over time led to the right choice at the decisive moment.
Defining the problem is more important than solving it
Xie Weidi felt lucky that "the topic he chose happened to be what everyone was interested in later." The author, however, believes that the choice of research topic reflects a team leader's particular insight into the field, which Xie Weidi calls "defining the problem". In his opinion, defining a problem is more important than solving it: once a meaningful problem has been defined, countless people will follow up and solve it. It is therefore very important to think about which problems are most worth solving with models at this stage.
Furthermore, solving problems requires "talent, data, and computing power", and none of the three can be missing.
The development of AI4S is still in its early stages. AI practitioners have the advantage in model building and framework optimization, while science practitioners are better at pinpointing the scientific problems in their own fields. Both sides have been exploring a general model for collaboration. Xie Weidi's team chose to work with many teachers and students from the Shanghai Jiao Tong University School of Medicine, drawing on their medical expertise and having them serve as consultants who help the team judge whether a research direction has practical medical value. They also act as "quality inspectors", checking the quality of sampled data to ensure that data cleanliness reaches 90% or above.
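A sampled quality check of this kind can be sketched roughly as follows. This is a minimal illustration, not the team's actual procedure: the function names are hypothetical, the `is_clean` callback stands in for a human reviewer's judgment, and only the 90% threshold comes from the article.

```python
import random


def estimate_cleanliness(records, is_clean, sample_size, seed=0):
    """Estimate the fraction of clean records from a random sample.

    `is_clean` is a placeholder for a reviewer's verdict on one record.
    A fixed seed keeps the sample reproducible for auditing.
    """
    if not records:
        raise ValueError("records must not be empty")
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    clean_count = sum(1 for record in sample if is_clean(record))
    return clean_count / len(sample)


def meets_threshold(cleanliness, threshold=0.9):
    """Return True when the estimated cleanliness reaches the target (90% here)."""
    return cleanliness >= threshold


# Hypothetical crawled records, each carrying a reviewer verdict in "ok".
records = [{"id": i, "ok": i % 10 != 0} for i in range(100)]
rate = estimate_cleanliness(records, lambda r: r["ok"], sample_size=100)
```

With a sample smaller than the full dataset the result is only an estimate, so a larger sample gives reviewers more confidence that the threshold is really met.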
Meanwhile, as the team has matured, the students have mastered web crawling techniques. The next problem they face is that data resources on the Internet are close to exhausted. For this reason, the team hopes to cooperate with hospitals to obtain higher-quality medical data and to try deploying its models. Xie Weidi emphasized that "knowledge-driven" or "data- and knowledge-driven" approaches are more important than purely "data-driven" ones. The team therefore hopes to put medical knowledge at the core and work with teammates to solve more practical problems.
The explainability of medical AI has long been a major concern for doctors. If AI becomes powerful enough to surpass top doctors in diagnostic accuracy, explainability will no longer be an issue. For example, Google's Med-PaLM 2 model has scored as high as 86.5% on USMLE-style medical licensing exam questions. Xie Weidi's team, for its part, has successively released the medical large language models PMC-LLaMA and MMed-LLaMA, the visual-language models MedVInT and RadFM, and the general segmentation model SAT, among others. Many of these models are treated as baselines by the field and have been published in well-known journals and top conferences such as NPJ Digital Medicine, Nature Communications, ICCV, ECCV, NeurIPS, and MICCAI. The pace at which these results are iterating is gradually changing doctors' views of AI, and high-quality collaborations can be expected in the future.
In terms of computing resources and funding, Shanghai Jiao Tong University has provided all-round support for the team's early research and for the future translation of its results into practice. Different teams in the college are also actively exploring opportunities to collaborate, and the academic atmosphere is strong.
Do valuable research
Throughout the conversation, Professor Xie Weidi said many times that he hopes to do valuable research. In his opinion, the team's previous work can only be regarded as "a toy prototype in the academic world", and these small models must be scaled up further before they can be deployed. He hopes the prototypes can serve as references for other researchers and even for industry, showing what kind of data to use, how to process it, how to build and train models, and how to design instructions.
In the future, the team plans to build a clinically oriented "super instruction" set integrating more than 100 tasks that doctors care about, so that the model can focus on real clinical needs. He commented: "Traditional language models are mostly evaluated on multiple-choice questions, but when we talk with doctors, we find that they don't care how high the multiple-choice score is. They are more concerned about whether the model can solve practical problems, such as being competent at clinical tasks."
In addition, the team has begun research at the level of genomics, DNA, RNA, and amino acids, moving beyond its past reliance on images and text. They hope to open up more possibilities for rare disease diagnosis and new drug development, and we look forward to their future results.
For more details, please see Xie Weidi's Google Scholar:
https://scholar.google.com/citations?user=Vtrqj4gAAAAJ&hl=zh-CN