- author: Dave Ebbelaar
Incorporating Data Analysis Capabilities into Large Language Model Apps with Pandas Dataframe Agents
In today's article, we will explore how to incorporate data analysis capabilities into large language model apps using the pandas dataframe agent from the LangChain library. Language models excel at text-based questions, but large tables and databases with hundreds or even millions of records are a challenge because of token limits: sending all the data to the API quickly exceeds them. With the pandas dataframe agent, we can load Excel or CSV files and ask questions about the datasets in plain text, without writing extensive data analysis code ourselves.
Why pandas dataframe agents?
Pandas dataframe agents provide a powerful tool for analyzing structured data, whether it be your own datasets or company datasets (with necessary permissions). By integrating this toolkit into large language model apps, you can perform data analysis on a wide range of datasets without the need for specialized coding skills. Whether you are a developer working on your own project or building an app for a client, pandas dataframe agents offer a versatile solution for incorporating data analysis capabilities.
To follow along with the examples in this article, you will need a basic understanding of Python and a grasp of LangChain. If you are new to LangChain, I recommend checking out my previous video introduction to get up to speed. You will also need an OpenAI API key, but GPT-4 is not required for the examples we will cover.
To begin, clone the LangChain Experiments repository and navigate to the pandas agent directory. There you will find all the source materials and links mentioned in the article. Setting up the experiments as outlined in the repository instructions is the easiest way to follow along. The examples are demonstrated in VS Code, but you can adapt the code to your preferred environment.
An Example Use Case: Data Science Salaries Dataset
Throughout this article, we will explore a specific use case to put the pandas dataframe agent to the test. We will analyze a data science salaries dataset obtained from Kaggle. It includes information such as job titles, employment types, salaries, and currency conversions, which makes it an interesting dataset for analysis. You can find the dataset in the data folder of the LangChain Experiments repository, or you can download it directly from the official source.
To begin, we will establish a baseline without the dataset: an agent equipped with the SerpAPI tool runs a Google search and performs some calculations to determine the median salary of a senior data scientist in 2023. This baseline serves as a reference point for what the pandas dataframe agent can do with the actual data.
Loading and Analyzing Data with Pandas Dataframe Agents
Now, let's dive into the core functionality of pandas dataframe agents. We will start by loading our own data into the agents. You can load either CSV files or Excel files using the pandas library in Python.
To load a CSV file, use the pd.read_csv() function. Similarly, to load an Excel file, use the pd.read_excel() function. Both functions return a pandas dataframe. In the example code, you can see how we load both a CSV and an Excel file, resulting in identical data frames.
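As a minimal sketch of the loading step, the snippet below reads a few CSV rows from an in-memory string so it runs without any files on disk. The sample rows, the column names, and the load_table helper are assumptions made for illustration; they are not taken from the repository.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for the salaries file.
CSV_TEXT = """work_year,experience_level,job_title,salary_in_usd
2023,SE,Data Scientist,150000
2023,MI,Data Analyst,90000
2022,SE,Data Engineer,140000
"""

# read_csv accepts a file path or any file-like object.
df_csv = pd.read_csv(io.StringIO(CSV_TEXT))


def load_table(path: str) -> pd.DataFrame:
    """Load a CSV or Excel file into a DataFrame, chosen by file extension."""
    if path.lower().endswith((".xlsx", ".xls")):
        # read_excel needs an Excel engine (for example openpyxl) installed.
        return pd.read_excel(path)
    return pd.read_csv(path)


print(df_csv.head())
```

With real files, calling load_table on a CSV path and on an Excel path holding the same sheet should give data frames with the same rows and columns.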
Once the data is loaded, the next step is to initialize the pandas dataframe agent. In LangChain this is done by calling create_pandas_dataframe_agent() with the large language model defined previously and the dataframe. You can also pass any additional parameters and configuration you need.
With the agent initialized, you now have a powerful combination of the large language model and the loaded data frame at your disposal. By calling agent.run() and posing questions about the data set, you can perform basic data exploration and obtain insights. The results will be returned by the agent.
To showcase the inner workings of the agent, you can set the verbose parameter to True. This displays the agent's intermediate steps (its thoughts, the code it runs, and the observations it gets back), so you can see how it arrived at an answer.
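The initialization and query steps can be sketched as follows. This assumes an older LangChain release in which create_pandas_dataframe_agent and ChatOpenAI are importable from the langchain package; newer releases moved the agent to langchain_experimental and changed some arguments, so treat the import paths as assumptions to check against your installed version. The build_agent and ask helpers and the model name are illustrative choices, not part of the library.

```python
import pandas as pd


def build_agent(df: pd.DataFrame, verbose: bool = True):
    """Create a pandas dataframe agent for df (sketch; paths vary by version)."""
    # Imported here so this module loads even where LangChain is not installed.
    # Assumption: legacy import locations; adjust for your LangChain version.
    from langchain.agents import create_pandas_dataframe_agent
    from langchain.chat_models import ChatOpenAI

    # temperature=0 reduces, but does not eliminate, run-to-run variation.
    llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

    # verbose=True prints the agent's thoughts, actions, and observations.
    return create_pandas_dataframe_agent(llm, df, verbose=verbose)


def ask(df: pd.DataFrame, question: str) -> str:
    """Ask one natural-language question about df and return the answer text."""
    agent = build_agent(df)
    return agent.run(question)
```

Calling ask(df, "How many rows and columns are in the dataset?") needs an OpenAI API key in the environment (OPENAI_API_KEY) and network access, which is why nothing is executed at import time here.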
The pandas dataframe agent empowers you to conduct complex data analysis tasks with minimal code, leveraging the capabilities of large language models intelligently. It enables you to explore your data, make observations, and gain insights using straightforward text-based queries.
Exploring Data with the Pandas DataFrame Agent
Data manipulation and exploration can be complex and time-consuming. The Pandas DataFrame Agent makes many of these tasks simpler and faster. In this section, we look at how it can be used to explore and analyze data interactively.
Basic Data Exploration
The Pandas DataFrame Agent is capable of running Python code, which makes it well suited to data exploration. When asked how many rows and columns the dataset has, for example, the agent works out that it needs to evaluate df.shape, runs that expression, and turns the resulting (rows, columns) tuple into a readable answer.
This is particularly useful when presenting data to end users in applications, because the answer comes back as a sentence describing the structure of the dataset rather than as a raw tuple. In the same way, the agent can identify missing values in the DataFrame and summarize the columns present.
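For reference, these are the kinds of pandas expressions such basic questions reduce to. The small DataFrame is made up for illustration, and the exact code the agent generates may differ from run to run.

```python
import pandas as pd

# Made-up data with one missing value in each column.
df = pd.DataFrame(
    {
        "job_title": ["Data Scientist", "Data Analyst", "Data Engineer", None],
        "salary_in_usd": [150000, 90000, None, 120000],
    }
)

# "How many rows and columns are there?"
n_rows, n_cols = df.shape

# "Are there any missing values?" (count of missing entries per column)
missing_per_column = df.isnull().sum()

# "Which columns does the dataset have?"
column_names = list(df.columns)

print(f"{n_rows} rows, {n_cols} columns")
print(missing_per_column)
print(column_names)
```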
Multi-Step Data Exploration
The capabilities of the Pandas DataFrame Agent extend beyond basic data exploration. It can also handle multi-step data exploration tasks that require chaining multiple operations on a DataFrame. This means that complex data analysis tasks, which would typically require careful consideration and knowledge of the Pandas library, can now be done more efficiently using the agent.
For instance, determining the top five jobs with the highest median salary involves several steps: grouping the data by job title, calculating the median salary per group, sorting the values in descending order, and keeping the first five. The Pandas DataFrame Agent can chain these operations and return the result.
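Written out by hand, that chain of operations might look like the sketch below. The job titles, salaries, and column names are invented for the example.

```python
import pandas as pd

# Invented salaries; some job titles appear more than once.
df = pd.DataFrame(
    {
        "job_title": [
            "Data Scientist", "Data Scientist",
            "ML Engineer", "ML Engineer",
            "Data Analyst", "Research Scientist",
            "Data Engineer", "BI Developer",
        ],
        "salary_in_usd": [
            100000, 200000,
            300000, 500000,
            50000, 450000,
            250000, 10000,
        ],
    }
)

top5 = (
    df.groupby("job_title")["salary_in_usd"]  # one group per job title
    .median()                                 # median salary per title
    .sort_values(ascending=False)             # highest median first
    .head(5)                                  # keep the top five
)

print(top5)
```

Sorting in descending order is what makes head(5) return the highest medians; with an ascending sort you would need tail(5) instead.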
Working with Multiple DataFrames
Another noteworthy feature of the Pandas DataFrame Agent is its ability to work with multiple DataFrames simultaneously. By providing the agent with an array or list of DataFrames, it can make sense of the data and execute relevant operations across all the provided DataFrames.
To illustrate this capability, let's consider a scenario where we have multiple DataFrames and we want to determine the number of rows and columns in each. The Pandas DataFrame Agent can handle this task by reading the shape attribute of each DataFrame and reporting the corresponding results.
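The underlying operation is small enough to show directly. The two DataFrames below are placeholders; in the agent setting, the list of frames is what you would hand to the agent in place of a single DataFrame.

```python
import pandas as pd

# Two placeholder DataFrames with different dimensions.
df_a = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
df_b = pd.DataFrame({"x": [1, 2], "y": [3, 4], "z": [5, 6]})

frames = [df_a, df_b]

# shape is an attribute, not a method: no parentheses.
shapes = [frame.shape for frame in frames]

for number, (rows, cols) in enumerate(shapes, start=1):
    print(f"DataFrame {number}: {rows} rows, {cols} columns")
```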
Observations and Limitations
During our experiments with the Pandas DataFrame Agent, we observed several key points:
- The agent's output is generally accurate and the thought process, which represents the necessary code to obtain the answer, is correct.
- The agent is able to recognize context and interpret values based on descriptions. For example, it understands that "ft" refers to a full-time position.
- The agent's output can be inconsistent or non-deterministic, especially when working with external APIs or when there are multiple steps involved in the prompt.
- The agent's handling of mathematical calculations is not always accurate, indicating a limitation in its ability to perform complex arithmetic operations.
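Given the last two points, it can be worth cross-checking a numeric answer against a direct pandas computation. The sketch below is one way to do that; the extract_number and agrees helpers, the sample data, and the example answer strings are all hypothetical, and taking the first number in the text is a simplification that fails if the answer mentions another number (such as a year) first.

```python
import math
import re

import pandas as pd


def extract_number(text: str) -> float:
    """Return the first number in a free-text answer, ignoring thousands commas."""
    match = re.search(r"-?\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in: {text!r}")
    return float(match.group().replace(",", ""))


def agrees(agent_answer: str, reference: float, rel_tol: float = 1e-6) -> bool:
    """Check whether the number in the agent's answer matches the reference."""
    return math.isclose(extract_number(agent_answer), reference, rel_tol=rel_tol)


# Made-up data: three senior (SE) salaries and one mid-level (MI) salary.
df = pd.DataFrame(
    {
        "experience_level": ["SE", "SE", "SE", "MI"],
        "salary_in_usd": [120000, 146000, 180000, 90000],
    }
)

# Reference value computed directly, without the language model.
reference = df.loc[df["experience_level"] == "SE", "salary_in_usd"].median()

print(agrees("The median salary is 146,000 USD.", reference))
print(agrees("The median salary is roughly 150,000 USD.", reference))
```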
Enhancing Data Analysis Efficiency with AI Assistants
As data analysts and data scientists, we often need to find solutions quickly and efficiently. It is easy to lose track of the thread of a problem or to get stuck on a complex mathematical concept. AI assistants can save time here and streamline the data manipulation process.
An AI assistant acts as a valuable tool that can assist us in finding the right answer or piece of code without the hassle of searching through Google or relying on chatbots. By leveraging these intelligent agents, we can directly obtain the desired outcome, such as generating the exact pandas code needed for our analysis. This not only improves our productivity but also helps us avoid potential errors in data manipulation.
Furthermore, the application potential of these AI assistants is vast. Imagine an application where users upload their data and ask questions about it. This would offer a smooth, interactive experience, similar to what we can anticipate from Copilot for the Microsoft Office stack, including Excel, where users should be able to perform complex tasks directly within the spreadsheet application.
For now, an agent like the one shown here is a practical way to get these results. If you are interested in building this technology into your own application, the methods demonstrated in this article are a suitable starting point.
In this article, we explored the incorporation of data analysis capabilities into large language model apps using pandas dataframe agents. By combining the power of pandas with the intelligence of large language models, developers can seamlessly analyze structured data, perform data exploration, and gain valuable insights. This toolkit eliminates the need for extensive coding and empowers users to leverage the capabilities of language models to manipulate and analyze data effectively. Whether you are working on your own data analysis project or building an app for a client, pandas dataframe agents offer a versatile solution that streamlines the process of data analysis.
The pandas dataframe agent is a valuable tool for conducting question-and-answer interactions with data. It simplifies the process of data exploration and analysis, reducing the need for manual code generation and lookup. However, it is important to double-check the agent's answers and be aware of its limitations, particularly when working with complex mathematical calculations.
Next, we will delve deeper into the experiments conducted with the pandas dataframe agent and analyze the results obtained. Stay tuned for the continuation of our exploration of this powerful data analysis tool.