- author: Chad Skelton
How to Clean Up Data with ChatGPT Plus and the Noteable Plugin
If you are working with a dataset full of annoying typos, ChatGPT Plus and the Noteable plugin can help you clean it up. In this article, we'll walk through the steps for dealing with typos in a dataset.
For demonstration purposes, we will use a dataset on auto crimes that a reporter at the Vancouver Sun previously worked on. Each record contains the category of the auto crime, the year it took place, the city it took place in, the latitude and longitude coordinates, and the number of incidents at that location in that year.
The problem with this dataset is that the city data is about 95% clean and 5% riddled with typos. Even for major cities such as Vancouver and Surrey, the dataset contains dozens of misspelled variants. We need to clean up this data to make it more manageable and useful.
Using ChatGPT Plus and the Noteable Plugin
Here's how you can use ChatGPT Plus and the Noteable plugin to clean up your data:
- Load the dataset into a pandas dataframe.
- Identify the unique city names in the dataset.
- Use a fuzzy matching algorithm to identify potential typos and suggest corrections.
- Apply the corrections to the dataset.
Identify Unique City Names
To determine which cities have typos, we first need a list of the unique city names in the dataset. We can ask ChatGPT Plus and the Noteable plugin to write code that produces this list. The output confirms that the city names do contain a number of typos and variations.
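As a rough illustration of this step, the sketch below counts how often each distinct city name appears. The sample values are hypothetical and stand in for the real dataset, and the standard library is used so the example runs anywhere; with the data in a pandas DataFrame, `df["city"].value_counts()` would give a similar tally, assuming the column is named `city`.

```python
from collections import Counter

# Hypothetical sample of the city column; the real dataset is not reproduced here.
cities = [
    "Vancouver", "Vancouver", "Vancover", "Surrey",
    "Surrey", "Surey", "Burnaby", "Vancouver",
]

def count_city_names(values):
    """Return (name, count) pairs, most frequent first."""
    # Strip stray whitespace so "Surrey " and "Surrey" are counted together.
    cleaned = (value.strip() for value in values)
    return Counter(cleaned).most_common()

city_counts = count_city_names(cities)
for name, count in city_counts:
    print(f"{name}: {count}")
```

Names with high counts are likely correct spellings, while names that appear only once or twice are the candidates for typos.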
Fuzzy Matching Algorithm
To identify potential typos and suggest corrections, we can use a fuzzy matching algorithm. ChatGPT Plus and the Noteable plugin use a library called FuzzyWuzzy, which uses Levenshtein distance to measure the difference between two sequences, in this case city names.
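To make the idea concrete, here is a minimal, self-contained sketch of Levenshtein distance: the number of single-character insertions, deletions, and substitutions needed to turn one string into another. It is an illustration of the metric only, not the FuzzyWuzzy implementation, and the function name `levenshtein` is our own.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character edits needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]

print(levenshtein("Vancover", "Vancouver"))  # one missing letter
print(levenshtein("Surey", "Surrey"))        # one missing letter
```

A small distance relative to the length of the name suggests a typo rather than a genuinely different city.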
We can assume that the correct city names are Vancouver, Burnaby, Richmond, Surrey, Langley, and Kelowna. These are the cities that have thousands of correctly spelled entries in the dataset. Once we have a function that confirms the fuzzy matching works, we can apply it to the city column. We'll create a new column called "corrected city" to store the corrected names, so we can compare the original and corrected values side by side.
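The matching step might look something like the sketch below. To keep it runnable without extra packages, it swaps in `difflib.get_close_matches` from the Python standard library, which scores similarity with a ratio rather than Levenshtein distance, so it is a stand-in for FuzzyWuzzy and not the same algorithm. The function name `correct_city`, the `corrected_city` field, the 0.8 cutoff, and the sample rows are all assumptions made for illustration.

```python
from difflib import get_close_matches

# The six city names the article treats as correct.
CANONICAL_CITIES = ["Vancouver", "Burnaby", "Richmond", "Surrey", "Langley", "Kelowna"]

def correct_city(name, canonical=CANONICAL_CITIES, cutoff=0.8):
    """Return the closest canonical city, or the original name if none is close."""
    # Compare in lowercase so differences in capitalisation do not count as typos.
    lookup = {city.lower(): city for city in canonical}
    matches = get_close_matches(name.strip().lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else name

# Hypothetical rows; each row keeps the original value next to the correction.
rows = [{"city": "Vancover"}, {"city": "Surey"}, {"city": "Kelowna"}, {"city": "Whistler"}]
for row in rows:
    row["corrected_city"] = correct_city(row["city"])
    print(row["city"], "->", row["corrected_city"])
```

Names that are not close to any canonical city are left unchanged, which keeps legitimate smaller cities from being overwritten. The cutoff is a trade-off: a lower value catches more typos but risks wrong matches.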
Saving the Cleaned Data
Once we have corrected the city names, we can replace the original city names in the dataset with the corrected ones. ChatGPT Plus and the Noteable plugin can then save the clean dataset to a new CSV file, keeping both the city and corrected city fields so we can review the work ourselves.
In conclusion, ChatGPT Plus and the Noteable plugin can help you clean up your dataset and deal with annoying typos, making the data more manageable and useful. The process may take some time and patience, but the results are worth it.
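For reference, saving both fields side by side could be sketched as follows with the standard `csv` module. The rows, the `corrected_city` field name, and the file name mentioned afterwards are hypothetical; with pandas, `df.to_csv(path, index=False)` would play the same role.

```python
import csv
import io

def write_cleaned(rows, handle):
    """Write rows to a CSV handle, keeping every field including both city columns."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

# Hypothetical cleaned rows with the original and corrected values kept together.
rows = [
    {"city": "Vancover", "corrected_city": "Vancouver"},
    {"city": "Surrey", "corrected_city": "Surrey"},
]

# An in-memory buffer keeps the example self-contained.
buffer = io.StringIO()
write_cleaned(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a real file handle, for example `open("auto_crime_cleaned.csv", "w", newline="")`, where the file name is only a placeholder. Writing to a new file leaves the original data untouched.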
Using ChatGPT and Noteable for Data Cleaning
When working with datasets, cleaning the data is an essential step. This section covers practical points to keep in mind when using ChatGPT and the Noteable plugin to clean up a dataset:
- Save the cleaned dataset to a new CSV file:
  - You can overwrite the existing file or save to a new one, depending on your preference.
  - It's recommended to save the cleaned dataset to a new file and keep both the city and corrected city fields so that you can review the work yourself.
- Use ChatGPT to correct city names:
  - Sometimes ChatGPT gets confused and forgets that it can actually ask Noteable to do things.
  - If this happens, you might need to explicitly say "do it now".
  - Once the fuzzy matching process is complete, check the status to see whether the corrected city names have been added to the data frame.
  - If everything looks good, save the data frame to a new CSV file.
- Review the work and make corrections:
  - Check the saved CSV file for the corrected city names and fields.
  - Keep in mind that ChatGPT and the Noteable plugin are not perfect, and you may need to make manual corrections, especially when dealing with a larger dataset.
  - Be prepared for technical errors or timeouts when working with this tool.
- Keep in mind that ChatGPT can draw on knowledge about the world:
  - For example, it can recognize common city names in a particular region or country.
  - This can be helpful when cleaning datasets with typos in city names.
It's important to note that the Noteable plugin lets you review the work done by ChatGPT. In some instances, Noteable and ChatGPT may not group entries under the largest cities as expected. It's essential to check the work and make manual corrections if necessary.
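One way to make that review manageable is to list every distinct correction that was applied, along with how often it occurred. The sketch below assumes each row carries `city` and `corrected_city` fields; those names, the sample rows, and the helper `summarize_changes` are hypothetical.

```python
from collections import Counter

def summarize_changes(rows):
    """Return ((original, corrected), count) pairs for rows that were changed."""
    changes = Counter(
        (row["city"], row["corrected_city"])
        for row in rows
        if row["city"] != row["corrected_city"]
    )
    return changes.most_common()

# Hypothetical cleaned rows, including one change that deserves a second look.
rows = [
    {"city": "Vancover", "corrected_city": "Vancouver"},
    {"city": "Vancover", "corrected_city": "Vancouver"},
    {"city": "Surey", "corrected_city": "Surrey"},
    {"city": "Vancouver", "corrected_city": "Vancouver"},
    {"city": "Langley Township", "corrected_city": "Langley"},
]

changes = summarize_changes(rows)
for (original, corrected), count in changes:
    print(f"{original} -> {corrected} ({count} rows)")
```

Scanning this short list is much faster than reading the full file, and it surfaces questionable mappings that may need a manual fix.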
In conclusion, ChatGPT and Noteable are powerful tools for cleaning up datasets. However, they are not perfect and require careful attention. With a critical eye and some manual tweaking, you can produce a clean dataset that is ready for analysis.