
New Libraries Boost Data Science Efficiency: NumExpr, Qwen3, AutoRound


NumExpr vs. NumPy: A Performance Breakthrough in Python Numerical Computation

Recently, a data scientist took a closer look at a library called NumExpr, which claims to perform certain complex numerical computations up to 15 times faster than NumPy, the cornerstone of Python's numerical computing ecosystem. Because NumPy is so widely used in data science, machine learning, data exploration, and model training, the claim attracted significant interest and prompted a set of practical performance tests.

What is NumExpr?

NumExpr is a fast numerical expression evaluator designed specifically for NumPy arrays. It optimizes memory usage and supports multithreading, making array operations more efficient. According to its GitHub page, NumExpr can significantly reduce the time and memory required for computations, particularly on multi-core CPUs.

Setting Up the Development Environment

To test NumExpr, the author recommends creating an isolated Python environment with conda or Miniconda and installing the required packages:

```bash
(base) $ conda create -n numexpr_test python=3.12 -y
(base) $ conda activate numexpr_test
(numexpr_test) $ pip install numexpr
(numexpr_test) $ pip install jupyter
```

Then run `jupyter notebook` on the command line. The Jupyter Notebook interface should open automatically in your browser; if it does not, copy the URL printed in the command-line output.

Performance Comparison Tests

Example 1: Simple Array Addition

The first test performs a simple element-wise operation on two large arrays, a and b (their construction is not shown), repeated 5,000 times.

NumPy version:

```python
time_np_expr = timeit.timeit(lambda: 2*a + 3*b, number=5000)
print(f"NumPy execution time: {time_np_expr} seconds")
```

Result: 12.036807 seconds

NumExpr version:

```python
time_ne_expr = timeit.timeit(lambda: ne.evaluate("2*a + 3*b"), number=5000)
print(f"NumExpr execution time: {time_ne_expr} seconds")
```

Result: 1.807596 seconds

NumExpr showed a remarkable improvement here, running roughly 6.7 times faster than NumPy.

Example 2: Monte Carlo Simulation to Compute π

Next, the Monte Carlo method was used to estimate the value of π, with each helper function called 1,000 times (a possible reconstruction of these helpers appears after Example 3).

NumPy version:

```python
time_np_expr = timeit.timeit(lambda: monte_carlo_pi_numpy(num_samples), number=1000)
print(f"NumPy execution time: {time_np_expr} seconds")
```

Result: 10.642844 seconds

NumExpr version:

```python
time_ne_expr = timeit.timeit(lambda: monte_carlo_pi_numexpr(num_samples), number=1000)
print(f"NumExpr execution time: {time_ne_expr} seconds")
```

Result: 8.077501 seconds

Here NumExpr did not achieve a massive boost, but it still cut the runtime by roughly 24%, about a 1.3x speedup. The more modest gain is attributed to NumExpr's sum() reduction being less optimized than NumPy's.

Example 3: Sobel Image Filter

A Sobel filter, commonly used for edge detection in images, was tested next, with each implementation run 100 times on a large 2-D image array.

NumPy version:

```python
time_np_sobel = timeit.timeit(lambda: sobel_filter_numpy(image), number=100)
print(f"NumPy execution time: {time_np_sobel} seconds")
```

Result: 8.093792 seconds

NumExpr version:

```python
time_ne_sobel = timeit.timeit(lambda: sobel_filter_numexpr(image), number=100)
print(f"NumExpr execution time: {time_ne_sobel} seconds")
```

Result: 4.938702 seconds

NumExpr ran roughly 1.6 times faster than NumPy, another clear improvement in this scenario.
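The article benchmarks monte_carlo_pi_numpy and monte_carlo_pi_numexpr in Example 2 without showing their implementations. The following is a minimal sketch of what such a pair might look like; the function bodies, the use of where() inside the NumExpr expression, and the sample count are assumptions rather than code from the original benchmark.

```python
import numpy as np
import numexpr as ne

# Hypothetical reconstruction of the Example 2 helpers (not the article's code).
# num_samples points are drawn in the unit square; pi is estimated from the
# fraction that lands inside the quarter circle of radius 1.
def monte_carlo_pi_numpy(num_samples):
    x = np.random.rand(num_samples)
    y = np.random.rand(num_samples)
    inside = np.sum(x**2 + y**2 <= 1.0)
    return 4.0 * inside / num_samples

def monte_carlo_pi_numexpr(num_samples):
    x = np.random.rand(num_samples)
    y = np.random.rand(num_samples)
    # The element-wise test and the reduction are fused into one NumExpr call;
    # where() turns the boolean mask into integers so sum() can reduce it.
    inside = ne.evaluate("sum(where(x**2 + y**2 <= 1.0, 1, 0))")
    return 4.0 * int(inside) / num_samples

if __name__ == "__main__":
    num_samples = 10_000_000  # illustrative sample count
    print(monte_carlo_pi_numpy(num_samples), monte_carlo_pi_numexpr(num_samples))
```

If the real helpers are structured like this, the timings above suggest that the reduction step is exactly where NumExpr gives up most of its advantage, which matches the article's explanation.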
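Likewise, Example 3 relies on sobel_filter_numpy and sobel_filter_numexpr, whose bodies the article does not reproduce. A slicing-based sketch of such a pair might look like the following; the stencil layout, variable names, and image size are illustrative assumptions.

```python
import numpy as np
import numexpr as ne

# Hypothetical reconstruction of the Example 3 helpers (not the article's code).
# Both compute the Sobel gradient magnitude of a 2-D float image with a
# slicing-based 3x3 stencil, so no explicit Python loops are involved.
def sobel_filter_numpy(image):
    tl, tm, tr = image[:-2, :-2], image[:-2, 1:-1], image[:-2, 2:]
    ml, mr = image[1:-1, :-2], image[1:-1, 2:]
    bl, bm, br = image[2:, :-2], image[2:, 1:-1], image[2:, 2:]
    gx = (tl + 2 * ml + bl) - (tr + 2 * mr + br)   # horizontal gradient
    gy = (tl + 2 * tm + tr) - (bl + 2 * bm + br)   # vertical gradient
    return np.sqrt(gx**2 + gy**2)

def sobel_filter_numexpr(image):
    tl, tm, tr = image[:-2, :-2], image[:-2, 1:-1], image[:-2, 2:]
    ml, mr = image[1:-1, :-2], image[1:-1, 2:]
    bl, bm, br = image[2:, :-2], image[2:, 1:-1], image[2:, 2:]
    # Each gradient and the final magnitude are single fused NumExpr expressions.
    gx = ne.evaluate("(tl + 2*ml + bl) - (tr + 2*mr + br)")
    gy = ne.evaluate("(tl + 2*tm + tr) - (bl + 2*bm + br)")
    return ne.evaluate("sqrt(gx**2 + gy**2)")

if __name__ == "__main__":
    image = np.random.rand(2048, 2048)  # illustrative image size
    print(np.allclose(sobel_filter_numpy(image), sobel_filter_numexpr(image)))
```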
Example 4: Fourier Series Approximation

Finally, the test computed a Fourier series approximation of a complex periodic function by accumulating odd harmonics (t is the time axis and n_terms the number of terms; their setup is not shown).

NumPy version:

```python
start_time = time.time()
approx_np = np.zeros_like(t)
for n in range(1, n_terms + 1, 2):
    approx_np += (4 / (np.pi * n)) * np.sin(2 * np.pi * n * 5 * t)
numpy_time = time.time() - start_time
print(f"NumPy Fourier series time: {numpy_time:.6f} seconds")
```

Result: 7.765800 seconds

NumExpr version:

```python
start_time = time.time()
approx_ne = np.zeros_like(t)
for n in range(1, n_terms + 1, 2):
    approx_ne = ne.evaluate(
        "approx_ne + (4 / (pi * n)) * sin(2 * pi * n * 5 * t)",
        local_dict={"pi": np.pi, "n": n, "approx_ne": approx_ne, "t": t},
    )
numexpr_time = time.time() - start_time
print(f"NumExpr Fourier series time: {numexpr_time:.6f} seconds")
```

Result: 1.553160 seconds

NumExpr again outperformed NumPy, this time with a speedup of approximately 5 times.

Summary

Across these tests, NumExpr was consistently faster than NumPy, especially on multi-core CPUs. It did not reach the claimed 15x speedup, but gains ranging from roughly 1.3x (the Monte Carlo example) to nearly 7x (simple array arithmetic) are still significant. For data scientists and researchers who need high-performance numerical computing, NumExpr is worth considering: although it does not support every NumPy operation, the performance gains make it a compelling addition to the toolbox.

Industry Insights and Company Background

NumExpr is developed and maintained primarily by members of the Python scientific computing community, which continually works to improve Python's performance and capabilities. Its multithreading and memory-management optimizations have earned positive reviews, and experienced data scientists highlight how it speeds up work on large datasets without sacrificing code readability or maintainability, making it a strong option in the scientific computing field.

Qwen3: A New Generation of Large Language Models

The Qwen team recently announced Qwen3, the latest generation of its large language models and a significant advance for the series. The flagship model, Qwen3-235B-A22B, excels in coding, mathematics, and general capabilities, rivaling top models such as DeepSeek-R1, Grok-3, and Gemini-2.5-Pro. The smaller Mixture-of-Experts (MoE) model, Qwen3-30B-A3B, performs comparably to or better than Qwen2.5-32B despite activating only a fraction of the parameters, and even the smallest model, Qwen3-4B, matches the performance of Qwen2.5-72B-Instruct.

Key Features

Dual Thinking Modes

Qwen3 supports two operating modes:
- Thinking Mode: the model reasons through a problem step by step before answering, providing a more detailed and thorough analysis. This mode suits complex questions.
- Non-Thinking Mode: the model responds quickly to simpler queries, prioritizing speed over depth.

This design lets users trade off computational cost against answer quality based on their specific needs.

Multilingual Support

Qwen3 supports 119 languages and dialects spanning linguistic families such as Indo-European, Sino-Tibetan, and Afro-Asiatic, which broadens its global applicability.

Pre-training Process

Compared with its predecessor Qwen2.5, Qwen3's pre-training corpus has nearly doubled to roughly 36 trillion tokens, drawn from web content, text extracted from PDF documents, and synthetic data generated with Qwen2.5-Math and Qwen2.5-Coder.
Pre-training proceeds in three stages:

1. Initial Stage (S1): the model is trained on over 30 trillion tokens to build foundational language skills and general knowledge.
2. Enhancement Stage (S2): the model is further trained on about 5 trillion tokens of knowledge-dense data such as STEM, coding, and reasoning material.
3. Expansion Stage: high-quality long-context data extends the context length to 32K tokens, so the model can handle longer inputs.

Post-Training Process

To produce a model capable of both step-by-step reasoning and quick responses, Qwen3 uses a four-stage post-training pipeline:

1. Cold Start with Long Chain-of-Thought Reasoning: the model is fine-tuned on diverse long chain-of-thought data covering mathematics, coding, logic puzzles, and STEM questions.
2. Rule-Based Reinforcement Learning: additional compute is devoted to improving the model's exploration and exploitation.
3. Mode Fusion: long chain-of-thought data and regular instruction-tuning data are combined so the two thinking modes integrate seamlessly.
4. General Reinforcement Learning: the model is further refined on more than 20 general-domain tasks to correct undesirable behaviors and improve overall performance.

Usage Guidelines

Qwen3 weights are available on Hugging Face, ModelScope, and Kaggle under the Apache 2.0 license. The models can be served with frameworks such as SGLang and vLLM, or run locally with tools like Ollama, LM Studio, llama.cpp, and KTransformers. To control the model's behavior in multi-turn dialogs, users can add /think or /no_think tags to prompts or system messages.

Agent Capabilities

Qwen3 also performs well at tool calling, and the team recommends Qwen-Agent to get the most out of its agentic capabilities. Qwen-Agent encapsulates tool-calling templates and parsers, reducing programming complexity, and users can define the available tools to extend Qwen3's functionality, with support for both built-in and custom tools.

Community Support

The development of Qwen3 relies heavily on community contributions. The team thanks all participants and encourages more individuals and organizations to join in advancing the project.

Future Outlook

Qwen3 represents a step toward Artificial General Intelligence (AGI) and Artificial Super Intelligence (ASI). The team plans to keep improving the model along several dimensions: scaling up data, increasing parameter counts, extending context length, and adding more modalities. This shift from merely training models to training full agents promises more meaningful assistance to users. Industry observers praise Qwen3's innovation and practicality; as a flagship project of Alibaba Cloud, it is well positioned in the competition among large language models.

AutoRound: Advancements in Post-Training Quantization

As large language models (LLMs) and vision-language models (VLMs) keep growing in scale and complexity, efficient deployment has become a major challenge. Intel's newly released quantization tool, AutoRound, addresses this by reducing model size and inference latency while maintaining high accuracy.

What is AutoRound?

AutoRound is a weight-only post-training quantization (PTQ) method developed by Intel. It uses signed gradient descent to optimize weight rounding and clipping ranges, enabling accurate low-bit quantization (INT2 through INT8) with minimal performance loss.
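To make the core idea concrete, here is a toy sketch of tuning a rounding offset with signed gradient descent. This is not the AutoRound API or its actual training loop: the tensor shapes, learning rate, and step count are illustrative assumptions, and a simple straight-through rounding trick stands in for the method's exact formulation.

```python
import torch

torch.manual_seed(0)
W = torch.randn(64, 64)             # a pretend linear-layer weight
X = torch.randn(256, 64)            # pretend calibration activations
bits = 4
qmax = 2**bits - 1
scale = (W.max() - W.min()) / qmax  # naive asymmetric scale
zero = (-W.min() / scale).round()   # zero point
Y_ref = X @ W.T                     # full-precision layer output to match

# Learnable per-weight rounding offset, kept in [-0.5, 0.5].
v = torch.zeros_like(W, requires_grad=True)

for step in range(200):
    q = W / scale + zero + v
    # Straight-through rounding: forward uses round()+clamp, backward is identity.
    q_hard = q + (q.round().clamp(0, qmax) - q).detach()
    W_q = (q_hard - zero) * scale   # de-quantized weight
    loss = ((X @ W_q.T - Y_ref) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        v -= 1e-2 * v.grad.sign()   # signed gradient descent update
        v.clamp_(-0.5, 0.5)
        v.grad.zero_()

print(f"reconstruction MSE after tuning: {loss.item():.6f}")
```

The appeal of this style of tuning is that only a small offset per weight is learned against a modest calibration set, which is why AutoRound's quantization cost stays low (the tool reportedly needs about 200 steps and roughly 128 calibration samples, as described below).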
In Intel's evaluations, for example, 2-bit AutoRound outperforms mainstream quantization baselines by a factor of 2.1 in relative accuracy.

Key Advantages

High Accuracy in Low-Bit Quantization

Evaluation results show that AutoRound performs well across a range of tasks. At 2-bit precision it maintains clearly better accuracy than other popular methods in the published benchmarks, and at 4-bit precision it remains competitive, as reflected in the low-bit open LLM leaderboard.

Wide Compatibility

- Supported models: AutoRound works with almost all popular LLM architectures, including Qwen, LLaMA, and DeepSeek. Quantized versions of these models can be found in Hugging Face collections, including those from Kaitchup and fbaldassarri.
- VLMs: it supports more than 10 vision-language models, such as Mistral-Small-3.1 and Gemma3, and can also be applied in RTN (round-to-nearest) mode without tuning, at the cost of some accuracy.
- Supported devices: AutoRound runs on CPUs, Intel GPUs, and CUDA devices, giving it broad deployment coverage.

Flexible and Efficient Quantization

AutoRound reaches high accuracy with only about 200 tuning steps and a small calibration set of 128 samples, which keeps computation time and resource use low. For instance, quantizing a 72-billion-parameter model takes about 37 minutes on an NVIDIA A100 GPU with PyTorch 2.6.0.

Using AutoRound

Installation

Install AutoRound with pip:

```bash
pip install auto-round
```

Command-Line Usage

AutoRound offers three configurations: auto-round (default), auto-round-best (highest accuracy), and auto-round-light (fastest quantization). Choose one based on model size and precision requirements; for 2-bit precision, auto-round-best or auto-round is recommended.

```bash
auto-round \
    --model Qwen/Qwen3-0.6B \
    --bits 4 \
    --group_size 128 \
    --format "auto_round,auto_awq,auto_gptq" \
    --output_dir ./tmp_autoround
```

API Usage

AutoRound can also be driven from Python:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen3-0.6B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit symmetric quantization with a group size of 128
bits, group_size, sym = 4, 128, True
autoround = AutoRound(
    model,
    tokenizer,
    bits=bits,
    group_size=group_size,
    sym=sym,
)

output_dir = "./tmp_autoround"
autoround.quantize_and_save(output_dir, format='auto_round,auto_awq,auto_gptq')
```

Inference

AutoRound automatically selects the most suitable backend for inference and suggests better alternatives when they exist, which helps achieve good performance across different devices.
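As a concrete illustration, a checkpoint produced by the API example above can be loaded back through Transformers like any other model. This is a minimal sketch assuming the model was exported in the auto-round format to the ./tmp_autoround directory used above; the prompt and generation settings are arbitrary.

```python
# Minimal inference sketch. Assumes a quantized checkpoint was saved in the
# auto-round format to ./tmp_autoround (directory name taken from the example
# above) and that the auto-round package is installed so the backend loads.
from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_dir = "./tmp_autoround"
model = AutoModelForCausalLM.from_pretrained(
    quantized_dir, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(quantized_dir)

prompt = "Explain weight-only quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```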
Converting GPTQ/AWQ Models to AutoRound Format

Most GPTQ/AWQ models can be converted to the AutoRound format, which improves compatibility and support on Intel devices. Note that the quantization configuration changes during conversion.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoRoundConfig

model_name = "ybelkada/opt-125m-gptq-4bit"

# Passing AutoRoundConfig tells Transformers to load this GPTQ checkpoint
# through the AutoRound backend.
quantization_config = AutoRoundConfig()
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cpu",
    torch_dtype="auto",
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50, do_sample=False)[0]))
```

Conclusion

AutoRound represents a significant step forward in post-training quantization, combining high accuracy, efficiency, and broad compatibility. It makes low-bit quantization practical for large-scale LLM deployments as well as VLM inference at the edge. Users are encouraged to try AutoRound and contribute to its growing community, pushing the boundaries of efficient AI deployment.

Industry Evaluation

Experts in the AI field have praised AutoRound's approach to quantization, noting that it preserves model accuracy while sharply reducing the resources consumed by the quantization process itself, which matters for both large-scale deployments and edge computing. The release of AutoRound further strengthens Intel's position in AI technology.
