
AI Tools Revolutionize Data Processing and Excel Automation

3 days ago

Despite the emergence of more modern data processing tools, Microsoft Excel has remained relevant since its release in 1985. It is still a go-to tool in many workplaces, especially in meetings where quick calculations and chart generation are needed, and its ease of sharing and user-friendly interface make it indispensable for many users. One long-standing complaint from data teams, however, is Excel's lack of built-in documentation, particularly for column names and data types. To address this, a developer created a solution that uses artificial intelligence (AI) to automatically generate data dictionaries, improving the readability and maintainability of Excel files.

The solution involves the following steps:

1. Convert Excel to CSV: the Excel file is converted to CSV format, making it easier for large language models (LLMs) to read and process.
2. Create an AI agent: an agent is created with the Agno framework. It reads the CSV file and generates a data dictionary that includes each column's name, data type, and description.
3. Add comments to the header: the generated data dictionary is added as comments to the header cells of the Excel file, improving its documentation.

To implement this, a virtual environment is set up and several libraries are installed, including Streamlit, OpenPyXL, Pandas, Agno, MCP, and the Google GenAI client (for Gemini 2.0 Flash):

```bash
uv init data-docs
cd data-docs
uv venv
uv add streamlit openpyxl pandas agno mcp google-genai
```

Several key functions drive the process. The snippets below assume the usual imports from pandas, streamlit, json, os, openpyxl (load_workbook, Comment), and Agno (Agent, Gemini, FileTools).

Convert to CSV:

```python
def convert_to_csv(file_path: str):
    # Only the first 10 rows are needed for the LLM to infer types and descriptions
    df = pd.read_excel(file_path).head(10)
    st.write("Converting to CSV... :leftwards_arrow_with_hook:")
    return df.to_csv('temp.csv', index=False)
```

Create the AI agent:

```python
def create_agent(api_key):
    agent = Agent(
        model=Gemini(id="gemini-2.0-flash", api_key=api_key),
        description="""
        You are an agent that reads the temp.csv dataset and determines the data types
        and descriptions of each column. If you can't determine these details, return 'N/A'.
        """,
        tools=[FileTools(read_files=True, save_files=True)],
        retries=2,
        show_tool_calls=True
    )
    return agent
```

Add comments to the header:

```python
def add_comments_to_header(file_path: str, data_dict: str = "data_dict.json"):
    data_dict = json.load(open(data_dict))
    wb = load_workbook(file_path)
    ws = wb.active
    # Attach one comment per header cell, built from the generated data dictionary
    for n, col in enumerate(ws.iter_cols(min_row=1, max_row=1)):
        for header_cell in col:
            header_cell.comment = Comment(f"""
            ColName: {data_dict[str(n)]['ColName']},
            DataType: {data_dict[str(n)]['DataType']},
            Description: {data_dict[str(n)]['Description']}
            """, 'AI Agent')
    st.write("Saving File... :floppy_disk:")
    wb.save('output.xlsx')
    with open('output.xlsx', 'rb') as f:
        st.download_button(
            label="Download output.xlsx",
            data=f,
            file_name='output.xlsx',
            mime='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
        )
```

A Streamlit user interface lets users enter their API key, upload an Excel file, run the AI agent, and follow the process in real time:

```python
if __name__ == "__main__":
    st.set_page_config(layout="centered", page_title="Data Docs", page_icon=":paperclip:",
                       initial_sidebar_state="expanded")
    st.title("Data Docs :paperclip:")
    st.subheader("Generate a data dictionary for your Excel file.")
    st.caption("1. Enter your Gemini API key and the path of the Excel file on the sidebar.")
    st.caption("2. Run the agent.")
    st.caption("3. The agent will generate a data dictionary and add it as comments to the header of the Excel file.")
    st.caption("ColName: | DataType: | Description: ")
    st.divider()

    with st.sidebar:
        api_key = st.text_input("API key: ", placeholder="Google Gemini API key", type="password")
        input_file = st.file_uploader("File upload", type='xlsx')
        agent_run = st.button("Run")
        progress_bar = st.empty()
        progress_bar.progress(0, text="Initializing...")
        st.divider()
        if st.button("Reset Session"):
            st.session_state.clear()
            st.rerun()

    if agent_run:
        convert_to_csv(input_file)
        progress_bar.progress(15, text="Processing CSV...")
        agent = create_agent(api_key)
        st.write("Running Agent... :runner:")
        progress_bar.progress(50, text="AI Agent is running...")
        agent.print_response("""
        1. Use FileTools to read the temp.csv and create a data dictionary.
        2. Save the data dictionary to 'data_dict.json'.
        """, markdown=True)
        st.write("Generating Data Dictionary... :page_facing_up:")
        with open('data_dict.json', 'r') as f:
            data_dict = json.load(f)
        st.json(data_dict, expanded=False)
        add_comments_to_header(input_file, 'data_dict.json')
        st.write("Removing temporary files... :wastebasket:")
        os.remove('temp.csv')
        os.remove('data_dict.json')
        if os.path.exists('output.xlsx'):
            st.success("Done! :white_check_mark:")
            os.remove('output.xlsx')
        progress_bar.progress(100, text="Done!")
```

Industry experts credit Excel's durability to its simplicity and flexibility: despite more advanced tools, it remains widely used for everyday office tasks. Integrating AI to improve Excel's documentation quality and streamline data-team workflows reflects the broader trend of embedding AI in traditional office tools to make routine work more efficient. The developer chose the Agno framework and the Gemini 2.0 Flash model for their strong performance and reliability. The solution's code and documentation are available on GitHub at https://github.com/gurezende/Data-Dictionary-GenAI, and more about the author's work can be found at https://gustorsantos.me.

Choosing the right data processing tool becomes crucial as data volumes grow. For small datasets (under 1 GB), Pandas is often the best choice thanks to its ease of use and rich ecosystem. For medium datasets (1 GB to 50 GB), Polars or DuckDB is recommended, depending on programming preferences and workflow needs. For large datasets (over 50 GB), PySpark is essential for its distributed computing capabilities.

Small Datasets (<1GB)

For datasets smaller than 1 GB, Pandas excels because of its comprehensive documentation and wide user base, and it handles exploratory data analysis and visualization efficiently. Example:

```python
import pandas as pd

df = pd.read_csv("small_data.csv")  # Handles data under 1GB effectively
```

Medium Datasets (1GB to 50GB)

Medium-sized datasets call for more performant, memory-efficient tools. Polars is ideal for Python users who need high performance. Example:

```python
import polars as pl

df = pl.read_csv("medium_data.csv")  # Fast and memory-optimized
```

DuckDB suits SQL users who want to run analysis queries directly against files, without an explicit loading step. Example:

```python
import duckdb

# Zero-copy architecture for fast queries
df = duckdb.query("SELECT * FROM 'medium_large_data.csv' WHERE value > 100").df()
```
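For files near the top of this range, Polars' lazy API can also keep memory usage in check by building a query plan before any data is read. A minimal sketch, reusing the file name from the example above and assuming illustrative column names (`value`, `category`) that are not from the original article:

```python
import polars as pl

# Scan lazily: nothing is materialized until .collect() executes the optimized plan
result = (
    pl.scan_csv("medium_data.csv")
    .filter(pl.col("value") > 100)           # push the filter down into the scan
    .group_by("category")                    # hypothetical grouping column
    .agg(pl.col("value").mean().alias("avg_value"))
    .collect()
)
```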
Large Datasets (>50GB)

Large datasets require distributed computing, which makes PySpark the preferred tool: it can process data from hundreds of gigabytes to petabyte scale across multiple machines. Example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BigDataAnalysis").getOrCreate()
# Distributed reading and schema inference
df = spark.read.csv("really_big_data.csv", header=True, inferSchema=True)
```

Real-World Examples

Server log analysis (10 GB): extracting error patterns.

```python
import duckdb

error_counts = duckdb.query("""
    SELECT error_code, COUNT(*) as count
    FROM 'server_logs.csv'
    GROUP BY error_code
    ORDER BY count DESC
""").df()
```

E-commerce data analysis (30 GB): analyzing customer purchase behavior.

```python
import polars as pl
import duckdb

# Load and transform data using Polars
df = pl.scan_csv("transactions.csv")
df = df.filter(pl.col("purchase_date") > "2023-01-01")

# Perform complex aggregation using DuckDB
duckdb.register("transactions", df.collect())
customer_segments = duckdb.query("""
    SELECT
        customer_id,
        SUM(amount) as total_spent,
        COUNT(*) as num_transactions,
        AVG(amount) as avg_transaction
    FROM transactions
    GROUP BY customer_id
    HAVING COUNT(*) > 5
""").df()
```

IoT sensor data analysis (100 GB+): processing temperature data from multiple devices.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, avg

spark = SparkSession.builder.appName("SensorAnalysis").getOrCreate()
sensor_data = spark.read.parquet("s3://sensors/data/")

# Calculate the average temperature per hour and per sensor
hourly_averages = sensor_data \
    .withWatermark("timestamp", "1 hour") \
    .groupBy(
        window(sensor_data.timestamp, "1 hour"),
        sensor_data.sensor_id
    ) \
    .agg(avg("temperature").alias("avg_temp"))
```

Summary

As data scales grow, selecting the right tool becomes vital: Pandas is optimal for small datasets, Polars and DuckDB for medium datasets, and PySpark for large datasets. Modern workflows often combine these tools for better performance and scalability, for instance using Polars for quick data cleaning, DuckDB for lightweight analysis, and PySpark for heavy tasks, as sketched below. This approach keeps data handling efficient and adaptable as data grows.
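As a rough illustration of that kind of mixed pipeline, the sketch below cleans a file with Polars and hands the result to DuckDB for SQL aggregation. It is only a sketch: the file and column names (`transactions.csv`, `amount`, `customer_id`) are reused from the examples above for illustration, not taken from a specific project.

```python
import duckdb
import polars as pl

# Step 1: quick cleaning with Polars (lazy scan, drop incomplete rows, fix a dtype)
cleaned = (
    pl.scan_csv("transactions.csv")          # illustrative file name
    .drop_nulls()
    .with_columns(pl.col("amount").cast(pl.Float64))
    .collect()
)

# Step 2: lightweight analysis with DuckDB over the in-memory Polars frame
duckdb.register("cleaned_tx", cleaned)
spend_per_customer = duckdb.query("""
    SELECT customer_id, SUM(amount) AS total_spent
    FROM cleaned_tx
    GROUP BY customer_id
""").df()

# Step 3: if the data outgrows a single machine, the same logic can be ported to PySpark.
```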
Industry Reaction

The proposed decision framework has been well received in the data science community. Experts agree that a systematic approach to tool selection improves efficiency and reduces performance bottlenecks: Pandas is likened to a "Swiss Army knife" for small-scale data, Polars and DuckDB offer efficient solutions for medium datasets, and PySpark's distributed computing power makes it indispensable for large ones. Future articles will compare the performance of DuckDB and Polars on medium-sized datasets, providing further guidance.

NVIDIA's Agent Intelligence Toolkit is an open-source library designed to help developers quickly build, evaluate, configure, and accelerate complex AI workflows involving multiple agents. The toolkit integrates existing agents, tools, and processes into a unified framework, offering performance analysis, optimization, scalability, and observability features. To improve flexibility and scalability, NVIDIA has released detailed guides on integrating new agent frameworks, such as Agno.

Key Steps

Step 0: Prerequisites
- Running workflows with the toolkit usually does not require specific GPUs, but self-hosting NVIDIA NIM microservices does.
- Installation instructions are available on the NVIDIA/AIQToolkit GitHub page.

Step 1: Create a New Package
- Create a new folder named agentiq_agno in AgentIQ/packages/.
- Configure the pyproject.toml file to define the new package and its dependencies.
- Register Agno's LLM client and tool wrapper for compatibility with the toolkit.

Step 2: Install the New Package
- Use pip install . to install the newly created Agno package and plugins.

Step 3: Create Custom Workflows
- As an example, create a personal finance assistant that generates tailored financial plans, including budgeting, investment strategies, and savings goals.
- Use the command aiq init workflow agno_personal_finance to generate the necessary files and directory structure.

Step 4: Optimize Workflows with Reusable Functions
- Register the Serp API search capability as a reusable function that can be shared across workflows.
- Modify the config.yml file to declare and inject the new function, serp_api_tool.

Step 5: Install and Run the New Workflow
- After installation, set the environment variables AGENTIQ_NIM_API_KEY and AGENTIQ_SERP_API_KEY.
- Use aiq run --workflow=agno_personal_finance to start the workflow.
- The sample response provides personalized financial advice covering retirement savings, investment strategies, savings rates, expense management, and tax optimization.

Conclusion

NVIDIA's Agent Intelligence Toolkit simplifies the construction of multi-agent systems and enables customizable solutions through its scalable design. The Agno integration further demonstrates the toolkit's flexibility and potential, providing a robust foundation for innovation; Agno's founder noted that the integration significantly improves the performance and user experience of personal finance assistants. As a leading computing platform company, NVIDIA is positioning the toolkit to become a standard for enterprise-level AI agent applications. NVIDIA is currently hosting an "Agent Toolkit Hackathon" encouraging developers to explore creative uses of the toolkit, with prizes including an NVIDIA GeForce RTX 5090 GPU. For more details and documentation, visit NVIDIA's official resources.

Industry experts have likewise welcomed the update, highlighting the improved performance and user experience it brings to personal finance assistants. The integration of Agno with NVIDIA's toolkit underscores the growing synergy between AI and traditional tools, making workflows more efficient and adaptable.
