Tip Tuesday | Use Python For Querying JSON Files

JSON files are prevalent in the IT industry, commonly used in web service API data exchange. Python offers a streamlined alternative to manual analysis or database importation, often proving as efficient and concise as SQL queries.

Let us have a look at an example of a JSON file that consists of a long list of customers like this:

{
  "customers": [
    {"firstName": "Sophia", "lastName": "Miller", "age": 32, "addressCountry": "United States", "addressCity": "New York", "productsPurchased": 5},
    {"firstName": "Liam", "lastName": "Johnson", "age": 28, "addressCountry": "Canada", "addressCity": "Toronto", "productsPurchased": 3},
    {"firstName": "Olivia", "lastName": "Brown", "age": 41, "addressCountry": "United Kingdom", "addressCity": "London", "productsPurchased": 8},
    {"firstName": "Noah", "lastName": "Davis", "age": 22, "addressCountry": "Australia", "addressCity": "Sydney", "productsPurchased": 2},
    {"firstName": "Emma", "lastName": "Wilson", "age": 37, "addressCountry": "Germany", "addressCity": "Berlin", "productsPurchased": 6},
    {"firstName": "Ethan", "lastName": "Taylor", "age": 55, "addressCountry": "France", "addressCity": "Paris", "productsPurchased": 1},
    {"firstName": "Ava", "lastName": "Anderson", "age": 29, "addressCountry": "Japan", "addressCity": "Tokyo", "productsPurchased": 4},
    {"firstName": "Mason", "lastName": "Thomas", "age": 48, "addressCountry": "Brazil", "addressCity": "Rio de Janeiro", "productsPurchased": 7},
    {"firstName": "Isabella", "lastName": "Moore", "age": 34, "addressCountry": "South Africa", "addressCity": "Cape Town", "productsPurchased": 9},
    {"firstName": "James", "lastName": "Harris", "age": 31, "addressCountry": "China", "addressCity": "Shanghai", "productsPurchased": 3},
    ...
  ]
}

To load the file in Python, only a few lines of code are required:

import json
with open("./data.json","r") as json_file:
    data = json.load(json_file)
    customers = data['customers']

The resulting "customers" variable is a list of dictionaries, analogous to a relational table, enabling SQL-like querying operations:

Getting the total number of customers is a one-liner:

total_number_of_customers = len(customers)

Number of customers from the United States:

numer_of_customers_from_us = len([c for c in customers if c['addressCountry'] == 'United States'])

We are using the powerful Python list comprehension to filter the customer list.

The age of the youngest customer:

youngest_customer = min([c['age'] for c in customers])

The age of the oldest customer from the UK:

oldest_customer_from_uk = max([c['age'] for c in customers if c['addressCountry'] == 'United Kingdom'])

The list comprehension allows us to do column selection and filtering just like in a SELECT statement.

Let us try aggregation by querying the total number of products purchased by United States customers:

nr_products_purchased_by_us_customers = sum([c['productsPurchased'] for c in customers if c['addressCountry'] == 'United States'])

List comprehension excels at selecting and filtering data, then applying aggregation functions, often as easily as equivalent SQL operations.

However, for more intricate queries, such as identifying the top 10 countries by product purchases, native Python may require more verbose solutions:

customers.sort(key=itemgetter('addressCountry'))
products_by_country = {}
for key, group in groupby(customers, key=itemgetter('addressCountry')):
    products_by_country[key] = sum(list( [g['productsPurchased'] for g in group] ))
top_10_countries_by_product_purchased = sorted(products_by_country.items(), key=lambda x: -x[1])[:10]

The above code is by no means concise or easy to read.

However, the Python Pandas package offers an elegant solution, rivalling the ease of SQL:

import pandas as pd

With Pandas, the aggregation is concise and easy to read:

df = pd.DataFrame(customers)
country_product_totals = df.groupby('addressCountry')['productsPurchased'].sum()
top_10_countries_by_product_purchased = country_product_totals.sort_values(ascending=False).head(10)

Python, particularly when augmented with specialised libraries, proves to be a highly efficient tool for rapid data analysis, often rivalling the user-friendliness and succinctness of SQL.