Data Cleansing – Data Types
Data Cleansing is the process of correcting and adjusting your data (and data types) in the attempt to improve it’s useability. In other words, the way your data is collected, stored, and read directly affects its’ quality. As such, all Data Scientists perform some level of data cleansing after reading in the data in the attempt to increase its’ usability. Below we describe some of the most common areas to watch out for.
Numeric Data
Numeric Data are commonly represented as either Integers or Floating point numbers. Additionally, the Data Type influences the number’s precision. When working with numeric data, it is important to understand the nature of the attribute you are working with to determine which is the best fit. As an example, suppose you are counting the number of students in a classroom at any point in time. In this case, since the number of students are always whole number, representing this as an Integer makes a lot of sense. On the other hand, suppose you are measuring the temperature outside at any given point in time. On this occasion, using a floating point number gives more precision.
Boolean
Boolean data types by definition can take on one of two values, in other words True or False. Consequently, the representation of these two concepts can vary. As an illustration, in some occasions these values can be stored as a text “True” or “False”. In other times, such as during One Hot Encoding, it could also be stored as a 1 (for True) and 0 (for False). Moreover other notations could also arise, such as Yes vs No, Y vs N, etc. You get the point. For this reason it is important to know when your data is a boolean and how it is stored. As it turns out you may need to clean and convert the data into your desired representation.
Date & Time
Anybody who has worked with Dates will know globally there is no unified way to write dates and times. Some countries use MM/DD/YYYY whilst others prefer YYYY-MM-DD. Some places use a 12 hour clock, in contrast others may use a 24 hour clock. This variability all makes working with dates & times slightly tricky.
String
Data written in the form of text are commonly stored as “Strings”. These can be any information represented by alphanumeric characters and symbols. Some of the most common text strings include Names, Addresses, Places, etc. Right out of the box, text stored in Python are either an “Object” datatype, or a String datatype. Using the latter has several benefits such as being able to leverage functions for text manipulation. For example, you could make use of pre-built functions to remove leading & trailing blank spaces.
Data Cleansing – Data Types Example
In our example I have an address book with information about some friends and would like to further analyze the data. Upon reading in my address book the data looks like below
import pandas as pd
mydata = pd.read_csv("addressbook.csv")
mydata.head()
mydata.dtypes
Name object
Phone Number object
Email Address object
Number of Siblings float64
Height float64
Birthday object
Married? int64
dtype: object
Notice in our address book, there are different information about each contact. Not only do we have the names of our contact, we also have their corresponding phone number, email addresses, etc.
First off we look at the “Number of Siblings” field and quickly realize the values are stored as a floating point number. Since we know siblings are always whole numbers, we can convert these from floating point numbers to integers.
# Convert FLOAT to INT
mydata['Number of Siblings']=mydata['Number of Siblings'].astype("int8")
mydata.head()
Secondly, we look at the Birthday and see it is currently stored as an “Object”. Converting the data type into a Date would allow us to further process dates such as time zone conversions, etc. Making use of the format attribute, we can give the proper formatting. In the example below we use the YYYY-MM-DD format.
mydata['Birthday']=pd.to_datetime(mydata['Birthday'], format="%Y-%m-%d")
mydata.dtypes
Name object
Phone Number object
Email Address object
Number of Siblings int8
Height float64
Birthday datetime64[ns]
Married? int64
dtype: object
Thirdly, we look at the “Married?” column and decide this is a boolean stored as 0s and 1s. Converting this into a boolean datatype can be done as below:
mydata['Married?']=mydata['Married?'].astype("boolean")
print(mydata.dtypes)
mydata.head()
Name object
Phone Number object
Email Address object
Number of Siblings int8
Height float64
Birthday datetime64[ns]
Married? boolean
dtype: object
Notice that our “Married?” column is now converted into True/False values as opposed to 0s and 1s.
Finally, we also want to clean up the Name, Phone Number, and Email Addresses. These are stored as text and hence we want to convert these into string data type.
mydata[['Name','Phone Number','Email Address']]=mydata[['Name','Phone Number','Email Address']].astype('string')
mydata.dtypes
Name string
Phone Number string
Email Address string
Number of Siblings int8
Height float64
Birthday datetime64[ns]
Married? boolean
dtype: object
Summary – Data Cleansing Data Type
In summary we have gone over four of the most common data types and cleansing you will encounter. As a recap, we looked at Floating Point numbers and Integers, Text/String data, Booleans, as well as Dates. Stay tuned for our next posting where we will cover more topics related to Data Cleansing.
About Alan Wong… Alan is a part time Digital enthusiast and full time innovator who believes in freedom for all via Digital Transformation. 兼職人工智能愛好者,全職企業家利用數碼科技釋放潛能與自由。 |