# AI Agents and OpenPyxl: New Tools for Data Engineering Beyond Pandas

Despite many companies adopting more modern data processing tools, Microsoft Excel has shown remarkable resilience since its debut in 1985. In work environments, particularly in meetings where quick calculations and chart generation are essential, Excel remains the go-to tool. Its ease of sharing and user-friendly nature have made it an indispensable part of many office workflows.

One major complaint from data teams about Excel, however, is the lack of proper documentation, especially regarding column names and data types. To address this, the author developed an AI-powered solution that automatically generates data dictionaries, improving the readability and maintainability of Excel files.

## Solution Steps

1. **Convert Excel to CSV**: Convert the Excel file to CSV format, which is easier for large language models (LLMs) to read and process.
2. **Create an AI agent**: Using the Agno framework, create an AI agent that reads the CSV file and generates a detailed data dictionary containing each column's name, data type, and a brief description.
3. **Add the data dictionary to the header**: Attach the generated data dictionary as comments on the Excel file's header cells, enhancing its documentation.

## Implementation Details

To implement this solution, the author recommends setting up a virtual environment and installing the necessary libraries, such as Streamlit, OpenPyXL, and Pandas. The process begins with converting the Excel file to CSV:

```python
def convert_to_csv(file_path: str):
    df = pd.read_excel(file_path).head(10)
    st.write("Converting to CSV... :leftwards_arrow_with_hook:")
    return df.to_csv('temp.csv', index=False)
```

Next, an AI agent is created using the Agno framework and the Gemini 2.0 Flash model:

```python
def create_agent(api_key):
    agent = Agent(
        model=Gemini(id="gemini-2.0-flash", api_key=api_key),
        description='''
            You are an agent that reads the temp.csv dataset presented to you
            and, based on the name and data type of each column header,
            determines the following information:
            - The data type of each column
            - The description of each column
            If you are unable to determine the data type or description of a
            column, return 'N/A' for that column.
        ''',
        tools=[FileTools(read_files=True, save_files=True)],
        retries=2,
        show_tool_calls=True
    )
    return agent
```

Finally, the data dictionary is added to the Excel file header as cell comments:

```python
def add_comments_to_header(file_path: str, data_dict: str = "data_dict.json"):
    data_dict = json.load(open(data_dict))
    wb = load_workbook(file_path)
    ws = wb.active
    for n, col in enumerate(ws.iter_cols(min_row=1, max_row=1)):
        for header_cell in col:
            header_cell.comment = Comment(f'''
                ColName: {data_dict[str(n)]["ColName"]},
                DataType: {data_dict[str(n)]["DataType"]},
                Description: {data_dict[str(n)]["Description"]}
            ''', 'AI Agent')
    st.write("Saving File... :floppy_disk:")
    wb.save('output.xlsx')
    with open('output.xlsx', 'rb') as f:
        st.download_button(
            label="Download output.xlsx",
            data=f,
            file_name='output.xlsx',
            mime='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
        )
```

A user interface built with Streamlit ties the process together: users enter their API key, upload an Excel file, and run the AI agent, with a progress bar showing the status in real time.

```python
if __name__ == "__main__":
    st.set_page_config(layout="centered", page_title="Data Docs",
                       page_icon=":paperclip:", initial_sidebar_state="expanded")
    st.title("Data Docs :paperclip:")
    st.subheader("Generate a data dictionary for your Excel file.")
    st.caption("1. Enter your Gemini API key and the path of the Excel file on the sidebar.")
    st.caption("2. Run the agent.")
    st.caption("3. The agent will generate a data dictionary and add it as comments to the header of the Excel file.")
    st.caption("Example: ColName: | DataType: | Description: ")
    st.divider()

    with st.sidebar:
        api_key = st.text_input("API Key: ", placeholder="Google Gemini API key", type="password")
        input_file = st.file_uploader("Upload Excel File", type='xlsx')
        agent_run = st.button("Run")
        progress_bar = st.empty()
        progress_bar.progress(0, text="Initializing...")
        st.divider()
        if st.button("Reset Session"):
            st.session_state.clear()
            st.rerun()

    if agent_run:
        convert_to_csv(input_file)
        progress_bar.progress(15, text="Processing CSV...")
        agent = create_agent(api_key)
        st.write("Running Agent... :runner:")
        progress_bar.progress(50, text="AI Agent is running...")
        agent.print_response('''
            1. Use FileTools to read the temp.csv as input to create the data
               dictionary for the columns in the dataset.
            2. Using the FileTools tool, save the data dictionary to a file
               named 'data_dict.json'.
        ''', markdown=True)
        st.write("Generating Data Dictionary... :page_facing_up:")
        with open('data_dict.json', 'r') as f:
            data_dict = json.load(f)
        st.json(data_dict, expanded=False)
        add_comments_to_header(input_file, 'data_dict.json')
        st.write("Removing temporary files... :wastebasket:")
        os.remove('temp.csv')
        os.remove('data_dict.json')
        if os.path.exists('output.xlsx'):
            st.success("Done! :white_check_mark:")
            os.remove('output.xlsx')
            progress_bar.progress(100, text="Done!")
```

## Industry Evaluation and Company Background

Experts in the data science community credit Excel's persistence to its simplicity and flexibility. Despite the availability of more advanced data processing tools, Excel's user-friendly nature keeps it in wide use for everyday office tasks.
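Before moving on, note that the workflow above hinges on `data_dict.json`, whose shape is implied by `add_comments_to_header`: stringified column indices mapping to `ColName`, `DataType`, and `Description` fields. A minimal stdlib-only sketch of that expected shape (the column names and values here are hypothetical, not taken from the article's dataset):

```python
import json

# Hypothetical data dictionary in the shape add_comments_to_header expects:
# stringified column index -> ColName / DataType / Description
data_dict = {
    "0": {"ColName": "customer_id", "DataType": "int64",
          "Description": "Unique customer identifier"},
    "1": {"ColName": "amount", "DataType": "float64",
          "Description": "Transaction amount"},
}

# Round-trip through JSON, as the agent writes the file and the app reads it back
payload = json.dumps(data_dict)
loaded = json.loads(payload)

# Build the comment text for column 0, mirroring the f-string in the app
comment = (f"ColName: {loaded['0']['ColName']}, "
           f"DataType: {loaded['0']['DataType']}, "
           f"Description: {loaded['0']['Description']}")
print(comment)
```

Per the agent's instructions, any field it cannot determine would hold the string `'N/A'` instead, so the comment-building code never has to handle missing keys.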
By integrating AI technology, the author has not only improved the quality of Excel file documentation but also streamlined data team workflows. This underscores the trend of incorporating AI into traditional office tools to enhance productivity and efficiency. The choice of the Agno framework and the Gemini 2.0 Flash model highlights the robust performance and reliability of these technologies in practical applications.

## Data Size and Tool Selection

As data volumes grow, choosing the right data processing tool becomes crucial. This article provides a decision framework to help users select appropriate tools based on data size, performance requirements, and workflow preferences.

### Small Datasets (<1GB)

For datasets under 1GB, Pandas is often the best choice. It is easy to use, well supported, and feature-rich within the Python ecosystem, and it excels at preliminary exploratory analysis and visualization at this scale:

```python
import pandas as pd

df = pd.read_csv("small_data.csv")
```

Pandas' success with small datasets is attributed to its extensive ecosystem, detailed documentation, and large user base. While other tools may offer faster performance, the learning curve often outweighs the benefits.

### Medium Datasets (1GB to 50GB)

When working with datasets between 1GB and 50GB, tools like Polars or DuckDB are more suitable thanks to their higher performance and efficiency.

Polars is ideal for Python users requiring high performance and memory optimization:

```python
import polars as pl

df = pl.read_csv("medium_data.csv")
```

DuckDB is preferred by users who favor SQL and need fast analytical queries:

```python
import duckdb

df = duckdb.query("SELECT * FROM 'medium_large_data.csv' WHERE value > 100").df()
```

### Large Datasets (>50GB)

For datasets exceeding 50GB, distributed computing capabilities are essential.
PySpark is the go-to choice: it is designed for parallel processing across multiple machines, making it suitable for handling data from hundreds of gigabytes to petabytes:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BigDataAnalysis").getOrCreate()
df = spark.read.csv("really_big_data.csv", header=True, inferSchema=True)
```

### Other Considerations

While data size is a primary factor in tool selection, other considerations include team expertise, project needs, and specific performance requirements. A systematic assessment of these factors leads to more informed decisions.

## Real-World Examples

### Example 1: Log File Analysis (10GB)

Extracting error patterns from server logs:

```python
import duckdb

error_counts = duckdb.query("""
    SELECT error_code, COUNT(*) AS count
    FROM 'server_logs.csv'
    GROUP BY error_code
    ORDER BY count DESC
""").df()
```

### Example 2: E-commerce Data Analysis (30GB)

Analyzing customer purchase behavior:

```python
import polars as pl
import duckdb

df = pl.scan_csv("transactions.csv")
df = df.filter(pl.col("purchase_date") > "2023-01-01")
duckdb.register("transactions", df.collect())

customer_segments = duckdb.query("""
    SELECT
        customer_id,
        SUM(amount) AS total_spent,
        COUNT(*) AS num_transactions,
        AVG(amount) AS avg_transaction
    FROM transactions
    GROUP BY customer_id
    HAVING COUNT(*) > 5
""").df()
```

### Example 3: IoT Sensor Data Analysis (100GB+)

Handling IoT sensor data from multiple devices:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, avg

spark = SparkSession.builder.appName("SensorAnalysis").getOrCreate()
sensor_data = spark.read.parquet("s3://sensors/data/")

hourly_averages = sensor_data \
    .withWatermark("timestamp", "1 hour") \
    .groupBy(
        window(sensor_data.timestamp, "1 hour"),
        sensor_data.sensor_id
    ) \
    .agg(avg("temperature").alias("avg_temp"))
```

## Summary

Choosing the right data processing tool is critical as data sizes grow.
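The size thresholds above can be boiled down to a tiny helper. This is a hypothetical sketch of the article's rule of thumb only, not a substitute for weighing team expertise and project needs:

```python
def pick_tool(size_gb: float) -> str:
    """Map an approximate dataset size to the article's suggested tool(s)."""
    if size_gb < 1:
        return "pandas"            # small: richest ecosystem, easiest to use
    if size_gb <= 50:
        return "polars or duckdb"  # medium: high single-machine performance
    return "pyspark"               # large: distributed processing required

print(pick_tool(0.5), "|", pick_tool(10), "|", pick_tool(200))
```

In practice the boundaries are fuzzy: a 40GB dataset on a memory-rich machine may still be comfortable for DuckDB, while a 5GB dataset feeding a Spark pipeline may stay in PySpark for consistency.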
Pandas remains the best choice for small datasets, Polars or DuckDB are recommended for medium-sized datasets, and PySpark is indispensable for large ones. Modern data workflows often combine these tools, leveraging Polars for quick data cleaning, DuckDB for lightweight analysis, and PySpark for heavy-duty tasks. This approach not only boosts performance but also ensures scalability as data volumes increase.

## Industry Evaluation

The decision framework proposed in this article has been widely embraced by the data science community. Experts agree that a systematic tool selection method enhances productivity and reduces the performance bottlenecks caused by inappropriate tool choices. Pandas is hailed as the "Swiss Army knife" of data processing for small-scale tasks, while Polars and DuckDB are effective for medium datasets. PySpark's powerful distributed computing capabilities make it a vital tool for handling large datasets. Future articles will delve deeper into comparing the performance of DuckDB and Polars on medium-sized datasets to further guide tool selection.

GitHub Repository: https://github.com/gurezende/Data-Dictionary-GenAI

Author's Website: https://gustavorsantos.me

References:

1. Agno Documentation: https://docs.agno.com/introduction/agents
2. OpenPyXL Documentation: https://openpyxl.readthedocs.io/en/stable/index.html
3. Streamlit Documentation: https://docs.streamlit.io/
4. Data Documentation Web App: https://excel-datadocs.streamlit.app/
5. UV Installation Guide: https://docs.astral.sh/uv/getting-started/installation/
6. Windsurf Coding Assistant: https://windsurf.com/vscode_tutorial
7. Google Gemini API Key: https://ai.google.dev/gemini-api/docs/api-key
