Tip Tuesday | Quick CSV Data Analysis with Python and Pandas
If you have a CSV file and quickly need to do a bit of analysis on it, you do not always have to reach for the big guns like loading it into a relational database table or using Apache Spark. The Pandas Python library allows for quick analysis without the overhead.
Let us take this CSV as an example:
"FirstName","LastName","Age","AgeGroup","OrderDate","OrderAmountGBP","ProductName","ProductGroup"
"Emily","Johnson","22","17-25","2023-12-14","35","Wireless Earbuds","Electronics"
"Michael","Brown","45","36-45","2024-01-12","89","Coffee Maker","Kitchen Appliances"
"Sophia","Davis","14","0-16","2023-11-28","20","Graphic Novel","Books"
"Olivia","Wilson","31","26-35","2024-05-19","125","Handbag","Fashion Accessories"
"William","Taylor","58","46-65","2023-08-13","50","Gardening Set","Home & Garden"
"Ava","Jones","9","0-16","2024-03-21","25","Board Game","Toys & Games"
"James","Miller","38","36-45","2024-06-04","120","Laptop","Electronics"
"Isabella","Anderson","21","17-25","2023-09-23","75","Sneakers","Sports & Outdoors"
"Ethan","Moore","52","46-65","2023-12-30","150","Yoga Mat","Health & Wellness"
"Mia","Garcia","28","26-35","2024-04-15","175","Suitcase","Travel"
...
In this CSV file, we have Customers grouped by age and their Orders, which are grouped by product. Let us establish, what is the most popular Product Group for each Age Group. Let us see how that can be done with Python and Pandas:
import pandas as pd
# read the CSV file into a Pandas DataFrame
df = pd.read_csv("./data/orders.csv")
# Group by AgeGroup and ProductGroup, then count the occurrences
age_product_grouped_data = df.groupby(['AgeGroup', 'ProductGroup']).size().reset_index(name='count')
# Find the most popular product group for each age group
most_popular_for_age_group = age_product_grouped_data.loc[age_product_grouped_data.groupby('AgeGroup')['count'].idxmax()]
print(most_popular_for_age_group)
As we can see, 5 lines of code, including the import and print, are sufficient.
Line 4 of the code reads the CSV and creates a Panda DataFrame based on it. Panda DataFrames are much like Spark DataFrames or relational database tables. If all goes well, the CSV file will be parsed and, to an extent, data types recognised. If we want to validate the DataFrame created, we can run an additional command after it is created, df.info()
, to get this schema summary:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 404 entries, 0 to 403
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 FirstName 404 non-null object
1 LastName 404 non-null object
2 Age 404 non-null int64
3 AgeGroup 404 non-null object
4 OrderDate 404 non-null object
5 OrderAmountGBP 404 non-null float64
6 ProductName 404 non-null object
7 ProductGroup 404 non-null object
dtypes: float64(1), int64(1), object(6)
memory usage: 25.4+ KB
Line 7 of the code creates a Series object that is grouped by both the AgeGroup
and ProductGroup
columns. An extra Count column is created, based on the .size()
method, and named count
. The resulting object age_product_grouped_data
is also a DataFrame.
Line 10 of the code identifies the rows within each AgeGroup
in the age_product_grouped_data
DataFrame that have the highest count
value. It then filters the DataFrame to keep only those rows.
The end result looks like this:
AgeGroup ProductGroup count
2 0-16 Toys & Games 62
3 17-25 Electronics 38
8 26-35 Fashion Accessories 31
14 36-45 Kitchen Appliances 25
18 46-65 Home & Garden 32
Admittedly, it is not quite as intuitive as writing a SELECT statement for those of us well versed in SQL. But the above example shows that analyses, especially simpler ones, can easily be done with Python and Pandas.