- author: Python and Pandas with Reuven Lerner

# Exploring dtypes in Pandas

In this article, we will delve into the topic of dtypes in the popular data analysis library, Pandas. Specifically, we will discuss an interesting aspect that came up during one of my recent corporate Pandas training classes, which prompted a deeper investigation into the current and future behavior of Pandas.

## Understanding the Relationship with NumPy

To grasp the concept of D-types in Pandas, it is essential to first understand its relationship with NumPy. Pandas can be seen as a wrapper around NumPy, with NumPy serving as the underlying manual transmission, while Pandas acts as the convenient automatic transmission.

Let's take a closer look at NumPy to establish a foundation. Consider the following code snippet:

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50])
print(a.dtype)  # int64
```

In this example, we have an array `a` consisting of integers. The `dtype` of `a` is `int64`, meaning that each element in the array takes up 64 bits of memory.

Now, if we assign a floating-point value to a specific index of `a`, NumPy silently truncates the decimal part so that the array's `dtype` can remain `int64`. For example:

```python
a[2] = 1.23  # truncated to 1; dtype remains int64
```

This behavior is intuitive for NumPy users, but it can be unexpected for those new to Pandas.
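One way to avoid the truncation, if that's not what you want, is to declare a float dtype when creating the array. Here is a short illustrative sketch (not from the original article):

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50])
a[2] = 1.23          # silently truncated to 1
print(a[2])          # 1

# Declaring a float dtype up front preserves the fractional part:
b = np.array([10, 20, 30, 40, 50], dtype=np.float64)
b[2] = 1.23
print(b[2])          # 1.23
```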

## dtypes in Pandas

Now let's dive into Pandas and explore how dtypes work within this library. Consider the following code snippet:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])
print(s.dtype)  # int64
```

In this case, we have created a Pandas Series `s` using the `pd.Series()` function. Behind the scenes, this results in an underlying NumPy array. As a result, the `dtype` of `s` is `int64`, just as we saw before.

However, the behavior changes when we assign a floating-point value to an index of `s`:

```python
s[2] = 1.23
```

Instead of truncating the decimal part to keep the `dtype` as `int64`, as NumPy did, Pandas retains the float value and changes the `dtype` of the entire series to `float64`. This variance in behavior often surprises users who are accustomed to NumPy.

Furthermore, if we assign a non-numeric value like a string to an index of `s`, Pandas will change the `dtype` to `object`. For instance:

```python
s[2] = "hello"  # dtype becomes object
```

This means that each element in the series becomes a separate Python object, leading to different behavior compared to NumPy arrays.
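To see the consequence of `object` dtype, we can build a mixed series directly; each element retains its own Python type rather than sharing a fixed-width NumPy representation. A small illustrative sketch:

```python
import pandas as pd

# A series mixing ints and a string gets object dtype:
s = pd.Series([10, 20, "hello", 40, 50])
print(s.dtype)                    # object

# Each element is an ordinary Python object with its own type:
print([type(v).__name__ for v in s])  # ['int', 'int', 'str', 'int', 'int']
```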

## Dealing with NaN Values

An interesting aspect of Pandas is its treatment of NaN (Not a Number) values, especially when compared to NumPy. Let's explore this further.

Consider the code snippet that assigns a NaN value to an index in a Pandas Series:

```python
s[2] = np.nan
```

In this case, the series ends up with dtype `float64`, enabling the inclusion of NaN values within the series. This aligns with NumPy's requirement that an array containing NaN must have a float dtype.
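We can see that float requirement directly in NumPy itself; NaN is defined as a floating-point value, so any array that contains it must have a float dtype. A quick sketch:

```python
import numpy as np

# NaN is a float concept, so an array containing it must be float:
arr = np.array([10, 20, np.nan, 40, 50])
print(arr.dtype)  # float64
```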

To further exemplify this, let's consider the following scenario:

```python
x = pd.Series(['a', 'b', 'c'])
x.loc[1] = np.nan  # NaN can live in an object-dtype series
```

Here, we have a series `x` consisting of strings, so its dtype is `object`. When we assign a NaN value to the series, the dtype remains `object`, allowing the inclusion of NaN without forcing a conversion to float, as happened in the previous example.

## Introducing Nullable Types in Pandas

Currently, Pandas does not offer a straightforward way to prevent dtype changes when assigning different values to a series. However, there are interesting developments in Pandas that provide a solution through the introduction of nullable types.

To demonstrate this, let's revisit the code snippet where we created a series with integers:

```python
s = pd.Series([10, 20, 30, 40, 50])
```

Instead of relying on the default NumPy dtype, we can leverage Pandas' extension types to specify a nullable integer dtype:

```python
s = pd.Series([10, 20, 30, 40, 50], dtype='Int64')
```

By using the `'Int64'` dtype (note the capital "I", distinguishing it from NumPy's `int64`), we guarantee that the series will only consist of integers and eliminate unexpected dtype changes. Additionally, missing values can be included without forcing a dtype change.

To accomplish this, Pandas introduces a new concept called `pd.NA`. Unlike `np.nan`, `pd.NA` can work with multiple data types without the need to change the underlying dtype. This flexibility is a significant advantage over previous approaches.
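To illustrate how `pd.NA` spans multiple dtypes, here is a brief sketch showing the same missing-value marker living inside nullable integer, string, and boolean series:

```python
import pandas as pd

# pd.NA works across the nullable extension dtypes:
ints = pd.Series([1, 2, pd.NA], dtype='Int64')
strs = pd.Series(['a', 'b', pd.NA], dtype='string')
bools = pd.Series([True, False, pd.NA], dtype='boolean')

print(ints.dtype, strs.dtype, bools.dtype)  # Int64 string boolean
print(ints.isna().tolist())                 # [False, False, True]
```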

By enhancing the code snippet above, we can explore this capability:

```python
s[3] = pd.NA  # nullable integer series with a missing value
```

Here, the dtype remains `'Int64'`, and we successfully include a missing value using `pd.NA`. Consequently, we benefit from both the guarantee of integer-only values and the ability to include missing values.

## Working with DataFrames

The ability to define nullable types extends beyond Series and can be applied to DataFrames as well. Consider the following example:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [10, 20, 30],
                   'B': [1.0, 2.0, 3.0],
                   'C': ['hello', 'out', 'there']})
print(df.dtypes)
# A      int64
# B    float64
# C     object
# dtype: object
```

In this case, we have a simple DataFrame with columns consisting of integers, floats, and strings. The `dtypes` attribute reveals that the column types align with the NumPy dtypes: `int64`, `float64`, and `object`.

Suppose we aim to assign missing values to column `A` while ensuring it remains an integer dtype. We can achieve this by calling the `convert_dtypes()` method on the DataFrame:

```python
df = df.convert_dtypes()
df.loc[2, 'A'] = pd.NA
print(df.dtypes)
# A     Int64
# B     Int64
# C    string
# dtype: object
```

By applying the `convert_dtypes()` method, we convert column `A` to a nullable integer type (`Int64`). We can now confidently assign a missing value using `pd.NA`, and the dtype remains intact. This approach allows for more control when working with DataFrames and prevents unexpected dtype changes.

## Understanding Nullable Types in Pandas

In pandas, nullable types allow us to enforce specific data types on our DataFrame columns. This is particularly useful when working with heterogeneous data that includes strings, floats, and other data types.

To illustrate this concept, let's consider a DataFrame, `df`, consisting of columns of ints, floats, and strings. By default, the `df.dtypes` attribute returns the data types of the columns. In this case, we have `int64`, `float64`, and `object` as the numpy types.

If we want to set an element of column `A` to a missing value, we can first use the `df.convert_dtypes()` method. This method returns the same DataFrame, but with nullable types. Previously, assigning NaN to column `A` would have converted it into a float column. However, after `df.convert_dtypes()`, the type of column `A` is `Int64`, a nullable version of the numpy `int64`.

Similarly, column `B`, which started as `float64`, gets the same nullable behavior. This allows us to work with missing values fluidly while preserving the intended data types.

Additionally, pandas introduces a nullable string type. In the above example, if we set `df.loc[2, 'C'] = pd.NA`, the type of column `C` remains `string`. This nullable string dtype is a significant breakthrough: it guarantees that the types of our data will remain intact, alleviating the need to convert everything to floats or objects.

These nullable types also explain why missing values in such a DataFrame are displayed as `<NA>` rather than `NaN`. The former is the modern, preferred way to represent missing values in pandas.

Furthermore, we can utilize the `df.fillna()` method to fill missing values with specific values. However, we need to ensure the value matches the column's data type. For instance, if we try to fill a missing value in a `string` column with an integer value like `10`, an error will occur. To work around this, we can use `df.fillna('hello')` to replace the missing value with the string `'hello'`.
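As a quick illustrative sketch of that type check (the column name and values here are my own, not from the article):

```python
import pandas as pd

c = pd.Series(['hello', pd.NA, 'there'], dtype='string')

# Filling with a string works and keeps the string dtype:
print(c.fillna('world').tolist())  # ['hello', 'world', 'there']

# Filling a string column with an int is rejected:
try:
    c.fillna(10)
except (TypeError, ValueError) as e:
    print('error:', e)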

It's worth noting that nullable types are still considered experimental, but they show great promise. These developments are part of pandas' efforts to diverge from numpy and establish its own identity.

Understanding dtypes in pandas and their relationship with numpy is crucial. By understanding nullable types in pandas, we can enforce specific data types on our DataFrame columns, preserving the integrity of our data. As pandas continues to evolve, we can expect more exciting developments in this area.

If you have any questions or want to learn more about Python and Pandas, please leave a comment below. Don't forget to subscribe to get the latest updates. Stay tuned for more insightful content on this topic.