Tip Tuesday | Quick CSV Data Analysis with Python and Pandas

If you have a CSV file and quickly need to do a bit of analysis on it, you do not always have to reach for the big guns like loading it into a relational database table or using Apache Spark. The Pandas Python library allows for quick analysis without the overhead.

Let us take this CSV as an example:

"FirstName","LastName","Age","AgeGroup","OrderDate","OrderAmountGBP","ProductName","ProductGroup"
"Emily","Johnson","22","17-25","2023-12-14","35","Wireless Earbuds","Electronics"
"Michael","Brown","45","36-45","2024-01-12","89","Coffee Maker","Kitchen Appliances"
"Sophia","Davis","14","0-16","2023-11-28","20","Graphic Novel","Books"
"Olivia","Wilson","31","26-35","2024-05-19","125","Handbag","Fashion Accessories"
"William","Taylor","58","46-65","2023-08-13","50","Gardening Set","Home & Garden"
"Ava","Jones","9","0-16","2024-03-21","25","Board Game","Toys & Games"
"James","Miller","38","36-45","2024-06-04","120","Laptop","Electronics"
"Isabella","Anderson","21","17-25","2023-09-23","75","Sneakers","Sports & Outdoors"
"Ethan","Moore","52","46-65","2023-12-30","150","Yoga Mat","Health & Wellness"
"Mia","Garcia","28","26-35","2024-04-15","175","Suitcase","Travel"
...

In this CSV file, we have Customers grouped by age and their Orders, which are grouped by product. Let us establish, what is the most popular Product Group for each Age Group. Let us see how that can be done with Python and Pandas:

import pandas as pd

# read the CSV file into a Pandas DataFrame
df = pd.read_csv("./data/orders.csv")

# Group by AgeGroup and ProductGroup, then count the occurrences
age_product_grouped_data = df.groupby(['AgeGroup', 'ProductGroup']).size().reset_index(name='count')

# Find the most popular product group for each age group
most_popular_for_age_group = age_product_grouped_data.loc[age_product_grouped_data.groupby('AgeGroup')['count'].idxmax()]

print(most_popular_for_age_group)

As we can see, 5 lines of code, including the import and print, are sufficient.

Line 4 of the code reads the CSV and creates a Panda DataFrame based on it. Panda DataFrames are much like Spark DataFrames or relational database tables. If all goes well, the CSV file will be parsed and, to an extent, data types recognised. If we want to validate the DataFrame created, we can run an additional command after it is created, df.info(), to get this schema summary:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 404 entries, 0 to 403
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   FirstName       404 non-null    object 
 1   LastName        404 non-null    object 
 2   Age             404 non-null    int64  
 3   AgeGroup        404 non-null    object 
 4   OrderDate       404 non-null    object 
 5   OrderAmountGBP  404 non-null    float64
 6   ProductName     404 non-null    object 
 7   ProductGroup    404 non-null    object 
dtypes: float64(1), int64(1), object(6)
memory usage: 25.4+ KB

Line 7 of the code creates a Series object that is grouped by both the AgeGroup and ProductGroup columns. An extra Count column is created, based on the .size() method, and named count. The resulting object age_product_grouped_data is also a DataFrame.

Line 10 of the code identifies the rows within each AgeGroup in the age_product_grouped_data DataFrame that have the highest count value. It then filters the DataFrame to keep only those rows.

The end result looks like this:

   AgeGroup         ProductGroup  count
2      0-16         Toys & Games     62
3     17-25          Electronics     38
8     26-35  Fashion Accessories     31
14    36-45   Kitchen Appliances     25
18    46-65        Home & Garden     32

Admittedly, it is not quite as intuitive as writing a SELECT statement for those of us well versed in SQL. But the above example shows that analyses, especially simpler ones, can easily be done with Python and Pandas.