Zhaoxin Unveils KX-7000 CPU: New "Century Avenue" Architecture Aims for Higher Performance and x86-64 Compatibility
Zhaoxin, a Chinese x86 CPU designer formed as a joint venture between VIA Technologies and the Shanghai municipal government, has unveiled its latest CPU, the KaiXian KX-7000. The new processor uses an architecture called "世纪大道" (Century Avenue), named after a major road in Shanghai. The KX-7000 aims to close the performance gap that plagued its predecessor, the KX-6640MA, whose LuJiaZui architecture struggled to run modern applications because of its narrow 2-wide core, sub-3 GHz clock speeds, and limited reordering capacity.

New Architecture Highlights

Century Avenue Architecture:
- Wider Execution: Century Avenue is a 4-wide, AVX2-capable core designed to handle more instructions per cycle, with performance goals resembling those of high-end Intel and AMD CPUs from the early 2010s.
- Clock Speeds: The KX-7000 operates at 3.2 GHz. Zhaoxin claims it can reach 3.5-3.7 GHz, although this has not been consistently observed.
- Cache Setup: The KX-7000 uses a chiplet design similar to AMD's Ryzen, with all eight cores sharing 32 MB of L3 cache. A separate IO die manages DRAM and other I/O functions.

Frontend Design

Instruction Fetch and Decode:
- Instruction Cache: A 64 KB, 16-way instruction cache delivers 16 bytes per cycle, feeding a 4-wide decoder. This setup is conventional, but fetch bandwidth becomes the constraint once average instruction length exceeds 4 bytes (16 bytes per cycle shared among 4 instructions), which is a particular risk in AVX2 workloads.
- Branch Prediction: The 4096-entry branch target buffer (BTB) has a 3-cycle latency, which creates pipeline bubbles after taken branches. The direction predictor has improved, but the BTB latency is still primitive by modern standards.
- Memory Disambiguation: The load/store unit can perform Core 2-style memory disambiguation, allowing loads to execute ahead of stores with unknown addresses, which improves memory pipeline utilization.
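The fetch-bandwidth limit described in the Instruction Cache item can be checked with simple arithmetic. The sketch below is a simplified model, not a description of the actual pipeline: it uses only the 16 bytes per cycle and 4-wide decode figures from the text, ignores branches, alignment, and buffering, and the function name and the sample instruction lengths are illustrative assumptions.

```python
# Figures taken from the text: 16 bytes fetched per cycle, 4-wide decode.
FETCH_BYTES_PER_CYCLE = 16
DECODE_WIDTH = 4


def sustained_ipc(avg_instr_bytes: float) -> float:
    """Upper bound on instructions per cycle delivered by fetch and decode.

    Simplified model (an assumption): no branches, no alignment losses,
    no queues between fetch and decode.
    """
    fetch_limit = FETCH_BYTES_PER_CYCLE / avg_instr_bytes
    return min(DECODE_WIDTH, fetch_limit)


if __name__ == "__main__":
    # The average instruction lengths below are hypothetical inputs.
    for length in (3.0, 4.0, 5.0, 6.0):
        ipc = sustained_ipc(length)
        print(f"{length:.1f} B/instr -> {ipc:.2f} instr/cycle")
```

Under this model the decoder is the limit up to 4 bytes per instruction; beyond that, fetch caps throughput, for example at 3.2 instructions per cycle for a 5-byte average.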
Backend and Execution Units

Execution Capabilities:
- Integer Operations: Century Avenue has three ALU pipes, two of which have integer multipliers, offering two-cycle latency for 64-bit multiplies.
- Floating Point and Vector: The FP/vector unit is surprisingly powerful, capable of executing two 256-bit vector FMA instructions per cycle. However, 256-bit instructions are split into two 128-bit micro-ops for internal tracking, which can limit performance.
- Schedulers: The core uses a semi-unified scheduler setup with large schedulers for ALU, memory, and FP/vector operations, providing more capacity than older Intel designs such as Haswell and Skylake.

Memory Subsystem

Cache Latency and Bandwidth:
- L1 and L2 Caches: The L1D cache is 32 KB and 8-way associative, with 4-cycle load-to-use latency. The L2 cache has poor latency at 15 cycles.
- L3 Cache: Despite an eight-fold increase in L3 capacity to 32 MB, L3 latency remains high at over 80 core cycles. This can be a bottleneck in performance-critical applications.
- DRAM Performance: The memory subsystem suffers from high latency (over 200 ns) and limited bandwidth, struggling to reach 12 GB/s. Non-temporal writes can achieve up to 23.35 GB/s, indicating potential issues in the memory controller's training capabilities.

Performance Benchmarks

Single-Threaded Performance:
- SPEC CPU2017 Results: The KX-7000 performs comparably to AMD's Bulldozer in the integer suite, lagging slightly. In the floating-point suite, it outperforms Bulldozer by 10.4%, showcasing its strong FP/vector capabilities.
- High-IPC Workloads: Tests like 500.perlbench, 548.exchange2, and 525.x264 highlight Century Avenue's improved execution resources, giving it an edge.
- Low-IPC Workloads: The KX-7000 struggles with workloads like 505.mcf and 520.omnetpp, which rely heavily on branch prediction and memory performance.
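The latency figures in the Memory Subsystem section mix core cycles and nanoseconds, so putting them in one unit helps show why memory-bound tests such as 505.mcf suffer. The sketch below assumes the 3.2 GHz clock quoted earlier and treats the "over 80 cycles" and "over 200 ns" figures as lower bounds; the constant and function names are illustrative only.

```python
CLOCK_GHZ = 3.2  # operating clock quoted in the text

# Latencies in core cycles as quoted in the text.
# The L3 value is a lower bound ("over 80 core cycles").
CACHE_LATENCY_CYCLES = {"L1D": 4, "L2": 15, "L3": 80}

# DRAM latency in nanoseconds, also a lower bound ("over 200 ns").
DRAM_LATENCY_NS = 200.0


def cycles_to_ns(cycles: float, clock_ghz: float = CLOCK_GHZ) -> float:
    """Convert core cycles to nanoseconds (1 GHz = 1 cycle per ns)."""
    return cycles / clock_ghz


def ns_to_cycles(ns: float, clock_ghz: float = CLOCK_GHZ) -> float:
    """Convert nanoseconds to core cycles at the given clock."""
    return ns * clock_ghz


if __name__ == "__main__":
    for level, cycles in CACHE_LATENCY_CYCLES.items():
        print(f"{level}: {cycles} cycles = {cycles_to_ns(cycles):.2f} ns")
    dram_cycles = ns_to_cycles(DRAM_LATENCY_NS)
    print(f"DRAM: {DRAM_LATENCY_NS:.0f} ns = {dram_cycles:.0f} cycles")
```

At 3.2 GHz, an L3 hit costs at least 25 ns and a DRAM access at least 640 core cycles, which is the scale of stall that a pointer-chasing workload pays on every cache miss.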
Multithreaded Performance:
- Multicore Strengths: The KX-7000's eight cores offer advantages in certain multithreaded tasks like Y-Cruncher and OpenSSL RSA2048 signatures, where AVX2 and core compute power play crucial roles.
- Weaknesses: In multithreaded benchmarks such as libx264 and 7-Zip, the KX-7000 often falls behind Bulldozer and Intel's Skylake, primarily due to its memory subsystem limitations and low cache bandwidth.

Evaluation by Industry Insiders

Industry experts note that while the KX-7000 marks significant progress for Zhaoxin, it still trails more modern x86-64 processors. The core's approach to AVX2, while aiming for high execution throughput, lacks the balance and efficiency seen in designs from Intel and AMD. The limited frontend sophistication and high memory latencies further hinder its performance, making it feel like a 2005-era core rather than a cutting-edge 2025 offering.

However, the KX-7000's ability to meet basic performance requirements without reliance on foreign chips is a critical achievement in the context of China's push for domestic semiconductor development. This makes it a viable option for local government and enterprise applications, even if it doesn't fully match the performance of Western counterparts.

Company Profile

Zhaoxin is leveraging VIA's x86-64 license and significant government support to develop CPUs for a broad range of applications, from web browsing to gaming. While the company initially focused on low-power, niche markets, the KX-7000 demonstrates a shift towards high-performance computing to meet the demands of a growing domestic tech sector. Despite some shortcomings, the KX-7000 is a solid step forward for Zhaoxin, positioning the company to continue advancing its technology and reducing dependency on Western chipmakers.