Author: Chad Skelton
Using ChatGPT and the Noteable Plugin for Data Analysis
Introduction
Hi, I'm Chad Skelton, and in this video I will show you how to use ChatGPT and the Noteable plugin for data analysis. If you are new to these tools, please refer to my first video for the basics and setup instructions. You will need the paid version of ChatGPT (ChatGPT Plus) with the Noteable notebook plugin installed, and you will also need to sign up for a free Noteable account. Once these are set up, you are ready to proceed.
Unlocking Powerful Tools with ChatGPT and Noteable
One of the remarkable things about using ChatGPT in combination with Noteable notebooks is that it opens up tools that were previously out of reach for anyone without advanced coding skills. As a data journalist and data analysis instructor, I have long been aware of tools like machine learning, fuzzy matching, and natural language toolkits, but my limited coding background left me without the confidence to use them. ChatGPT, I have found, lets me put these powerful tools to work without extensive coding knowledge.
An Interesting Example: Political Donations Dataset
To illustrate the capabilities of ChatGPT and Noteable, I will use a dataset on political donations in British Columbia. The dataset lists contributors to political parties, classified as unions, corporations, or individuals. I have manually matched 300 donations: 100 each for unions, corporations, and individuals. My goal now is to use machine learning to classify the remaining unmatched donations, and to see whether ChatGPT can handle this kind of task and produce accurate results.
Setting Up the Noteable Notebook
To begin, I have uploaded the relevant files, "donorsmatched.csv" and "donorsunmatched.csv," to my testing project in Noteable. I have also set this project as my default project in ChatGPT. Setting a default project is good practice, as it ensures the required files are easily accessible across chat sessions.
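To make this concrete, here is a minimal sketch of what the notebook's first cell might look like, assuming standard pandas conventions and that the files sit in the project root:

```python
import pandas as pd

# Load the hand-classified donations and the donations still to be classified.
matched = pd.read_csv("donorsmatched.csv")
unmatched = pd.read_csv("donorsunmatched.csv")

# Quick sanity check on row counts and columns.
print(matched.shape, unmatched.shape)
print(matched.head())
```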
Seeking Assistance from ChatGPT
Now, I will seek assistance from ChatGPT for the classification task. I will pretend to have limited knowledge of machine learning and ask for help in classifying the donors in the remaining dataset. After providing the necessary details, ChatGPT responds with a plan that includes loading the data, preprocessing, feature extraction, model training, evaluation, classification, and review.
Exploratory Data Analysis
Before proceeding with the plan, ChatGPT suggests conducting exploratory data analysis to gain insights into the dataset. It examines the distribution of donation amounts and provides some initial conclusions based on example names. For instance, it observes that corporate names often include "LTD" or "Inc," while union names tend to include "Union" or "Association." This information suggests that these naming patterns could be valuable features for our classification model.
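For readers curious what this kind of exploratory step looks like in code, here is a rough sketch; the column names (contributor_name, class, amount) are assumptions on my part, not necessarily the names in the actual files:

```python
# How many examples of each class do we have?
print(matched["class"].value_counts())

# How do donation amounts vary by class?
print(matched.groupby("class")["amount"].describe())

# How often do tell-tale keywords show up in each class?
for keyword in ["ltd", "inc", "union", "association"]:
    hits = matched[matched["contributor_name"].str.lower().str.contains(keyword, na=False)]
    print(keyword, hits["class"].value_counts().to_dict())
```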
Preprocessing and Feature Extraction
In order to prepare the data for classification, ChatGPT recommends preprocessing and feature extraction. One common approach is to use TF-IDF, which transforms text into a numerical representation suitable for model training. Additionally, the donation amount can be included as a feature. At this point, ChatGPT stops and requests permission to continue with the preprocessing and feature extraction steps.
Feature Extraction and Model Training
After giving ChatGPT permission to continue, we move on to feature extraction and model training. As noted above, TF-IDF (Term Frequency-Inverse Document Frequency) transforms the contributor names into a numerical representation the model can work with, and the donation amount is included as an additional feature.
We start by preprocessing the data and extracting TF-IDF values, which represent the importance of each word in the text. We then split the data into a training set of 241 samples and a test set of 61 samples.
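A sketch of this step, continuing with the assumed column names from above, might look like the following (the exact code ChatGPT generated may differ):

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Turn each contributor name into a TF-IDF vector.
vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(matched["contributor_name"])

# Append the donation amount as an extra numeric feature.
X = hstack([X_text, matched[["amount"]].values])
y = matched["class"]

# Hold out roughly 20% of the rows for evaluation (241 train / 61 test).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```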
Next, we train a logistic regression model on this data. It performs poorly: it predicts only the union class and never predicts corporation or individual, most likely because the features do not give the model enough information to distinguish between the classes.
To improve the model's performance, we could try a few things:
Use a more complex model: A random forest or gradient boosting machine might be able to capture more complex patterns in the data.
Perform feature engineering: We could engineer more informative features from the available data, such as the length of the name or the presence of certain keywords (see the sketch after this list).
Collect more data: Increasing the size of the dataset could provide the model with more information to make accurate predictions.
Explore advanced text classification techniques: Since the main feature is the name of the contributor, we could explore techniques such as word embeddings to enhance the model's performance.
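As an aside, the feature engineering in option 2 might look something like this in pandas; again, the column names are assumptions:

```python
# Length of the contributor's name as a numeric feature.
matched["name_length"] = matched["contributor_name"].str.len()

# Flag the presence of tell-tale keywords.
for keyword in ["ltd", "inc", "union", "association"]:
    matched[f"has_{keyword}"] = (
        matched["contributor_name"].str.lower().str.contains(keyword, na=False).astype(int)
    )
```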
Given these options, we decide to try a more complex model. We train a random forest model on the data and find that it performs significantly better than the logistic regression model. It achieves high precision and recall for the corporation and union classes, as well as reasonable precision and recall for the individual class. The overall accuracy of the model on the test set is 89%.
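The random forest step could plausibly look like the following, reusing the train/test split from the earlier sketch (the hyperparameters here are illustrative, not necessarily what ChatGPT chose):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Swap logistic regression for a random forest.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Per-class precision and recall, plus overall accuracy, on the held-out set.
print(classification_report(y_test, forest.predict(X_test)))
```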
With this improved performance, we decide to classify the unmatched donors using this random forest model. After fixing an error caused by blank values, we successfully classify the unmatched donors. The results can be checked in the notebook. However, it is important to note that the accuracy of the model may be affected by the relatively small size of the training data.
To save the classified data, we export it to a CSV file named "classified_donors.csv". This file can be downloaded from the project files in Notable.
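Pulling those two steps together, a plausible version of the final cell, with the blank-value fix included and reusing the vectorizer and forest from the earlier sketches, looks like this:

```python
from scipy.sparse import hstack

# Blank values caused an error on the first attempt, so fill them in.
unmatched["contributor_name"] = unmatched["contributor_name"].fillna("")
unmatched["amount"] = unmatched["amount"].fillna(0)

# Build the same feature matrix the model was trained on.
X_new = hstack([
    vectorizer.transform(unmatched["contributor_name"]),
    unmatched[["amount"]].values,
])

# Predict a class for every unmatched donor and export the result.
unmatched["predicted_class"] = forest.predict(X_new)
unmatched.to_csv("classified_donors.csv", index=False)
```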
During this exercise, we checked the actual accuracy of the machine learning model generated by ChatGPT. We compared the predicted classifications with the correct classifications from the original data set. It was observed that the model made some correct classifications, but also made some mistakes.
By analyzing the data in a pivot table, we found that the model performed reasonably well in identifying corporations and individuals, but had some difficulties distinguishing between corporations and unions. It correctly identified about 90% of the corporations and almost 100% of the individuals and unions in the dataset.
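The pivot-table check can be done with pandas' crosstab; here, answers is a hypothetical dataframe holding the correct class for each donor, joined to the predictions by name:

```python
import pandas as pd

# "answers" (the known correct classes) is hypothetical; join it to the
# predictions on the contributor's name.
checked = unmatched.merge(answers, on="contributor_name")

# Rows are the true classes, columns the model's predictions.
print(pd.crosstab(checked["class"], checked["predicted_class"]))
```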
In the next sections, we will further analyze the results and consider potential improvements to the model.
Exploring the Accuracy of Machine Learning Models
In this section, we will look more closely at the accuracy of machine learning models and what it means for classifying data. Machine learning models are sometimes assumed to be infallible, but in reality every model makes mistakes, and it is worth understanding where and why those mistakes happen.
A Conservative Approach
When working with machine learning models, it is important to understand that they must assign every record to one of the predefined categories, or buckets. Where there are multiple classifications, the model can only decide which bucket fits best, and which kinds of errors it makes depends on the idiosyncrasies of how the model works.
In this case, the model is cautious about labelling donors as individuals, so tuned slightly differently, more individuals could end up classified as corporations. The reverse can also happen, with more corporations mistakenly identified as individuals. These trade-offs are easier to see by examining the pivot table.
Identifying Mistakes
Analyzing the cases where the machine learning model misclassified data offers valuable insights. For instance, some individuals were incorrectly identified as unions. Upon further investigation, it was discovered that these individuals had actually left money to political parties in their wills. The confusion arose because their names were associated with the term "estate."
Similarly, many corporations were inaccurately identified as individuals because their names contained personal names, such as "James M Cody Law Corp" or "Patricia Taylor Law." Other cases, like "Big Kahuna Sports" or "The Innovation Resource Center," are more surprising; these instances highlight the complexities of classification.
Success Rate and Training Data
Despite the occasional misclassifications, the overall success rate of the model is commendable. Keep in mind that these results were achieved with relatively limited training data. In general, providing more training data consistently improves the model's performance. The difficulty of classifying data into different categories depends on the subject matter itself.
Additionally, it is worth noting that Noteable, the platform used for this analysis, records all of the code in a Jupyter notebook. This allows coders to examine the libraries and the specific code used for classification. And even if you are not familiar with machine learning, Noteable's integration with ChatGPT simplifies the process, enabling anyone to use advanced models with little effort.
Unlocking the Power of Explanation
ChatGPT offers another valuable feature: the ability to explain. It can not only use these tools but also explain how they work. For instance, if asked about the random forest model, it can give a high-level overview of its purpose and the principles underlying its operation.
Moreover, ChatGPT can elaborate on specific concepts related to machine learning, such as bootstrap sampling, so you can use it to deepen your understanding of machine learning models and related techniques.
To access these capabilities, you need the paid ChatGPT Plus subscription along with the Noteable plugin. If you are interested in the basics of data analysis with Noteable and ChatGPT, I recommend my first video on the topic. Stay tuned for additional videos exploring what ChatGPT and Noteable can do.
In this article, we have explored how ChatGPT and the Noteable plugin can be used for data analysis tasks. The ability to leverage powerful tools such as machine learning without extensive coding knowledge opens up new possibilities for users like myself. We have also walked through an example of classifying political donation data, and as we take this task further, we will continue to evaluate the effectiveness of the approach and iteratively refine the results. We hope you found this article helpful; feel free to reach out with any questions or feedback.