Using Python to ‘Wrangle’ Your Messy Data
or How to Organize Your Comic Book Collection Based on Issue Popularity
In addition to being a product manager at Rittman Mead, I consider myself to be a nerd of the highest order. My love of comic books, fantasy, sci-fi and general geekery began long before the word ‘Stark’ had anything to do with Robert Downey Jr. or memes about the impending winter. As such, any chance to incorporate my hobbies into my work is a welcome opportunity. For my next few blog entries, I’ve decided to construct a predictive classification model using comic book sales data; the eventual goal is a model that can accurately predict whether a comic will rocket off the shelves or be a sales dud. The first blog of the series shows some of the pitfalls that can come up when preparing your data for analysis. Data preparation, or data wrangling as it has come to be known, is an imperfect process that usually takes multiple iterations of transformation, evaluation and refactoring before the data is “clean” enough for analysis.
While the steps involved in data wrangling vary based on the state and availability of the raw data, for this blog I have chosen to focus on gathering data from disparate sources, enriching that data by merging their attributes, and restructuring it to facilitate analysis. Comic book sales data is readily available on the interwebs; however, finding that data in a usable format proved to be a more difficult task. In the end, I had to resort to the dreaded process of screen scraping the data from a comic research site. For those of you who are lucky enough to be unfamiliar with it, screen scraping is the process of programmatically downloading HTML data and stripping away the formatting to make it suitable for use. This is generally used as a last resort because web sites are volatile creatures that are prone to change their appearance as often as my teenage kids do preparing to leave the house. However, for the purposes of this series, as my friend Melvin the handyman would often say, “We works with what we gots.”
The heavy lifting in this process was performed by three different Python modules: urllib2, bs4 and csv. Urllib2, as the name implies, provides functions to open URLs. In this case, I found a site that hosted a page containing the estimated issue sales for every month going back to the early 1990s. To extract each month without manually updating a hardcoded URL over and over, I created a script, month_sales_scraper.py, that accepted MONTH and YEAR as arguments.
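A minimal sketch of that fetching step might look like the following (Python 2, since the post leans on urllib2; the source site’s URL pattern below is a placeholder assumption, not the real address):

```python
# month_sales_scraper.py -- minimal sketch (Python 2, matching the post's use of urllib2)
import sys
import urllib2


def fetch_month_html(month, year):
    # Placeholder URL pattern; the real research site's structure will differ
    url = 'http://comic-research-site.example.com/monthly/%s-%s.html' % (year, month)
    response = urllib2.urlopen(url)
    return response.read()  # the raw HTML that a browser would normally render


if __name__ == '__main__':
    # e.g. python month_sales_scraper.py 03 1997
    month_arg, year_arg = sys.argv[1], sys.argv[2]
    html = fetch_month_html(month_arg, year_arg)
```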
The response from the urlopen(url) function call was the full HTML code that is typically rendered by a web browser. In that format, it does me very little good, so I needed to employ a parser to extract the data from the HTML. In this context, a parser is a program that is used to read in a specific document format, break it down into its constituent parts while preserving the established relationships between those parts, and then finally provide a means to selectively access said parts. So an HTML parser would allow me to easily access all the <TD> column tags for a specific table within an HTML document. For my purposes, I chose BeautifulSoup, or bs4.
BeautifulSoup provided search functions that I used to find the specific HTML table containing the sales data and loop through each row, while using the column values to populate a Python object.
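A condensed sketch of that parsing step, with an assumed column order (the real table’s layout will differ), could look something like this:

```python
from bs4 import BeautifulSoup


def parse_sales_table(html, month, year):
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')                    # assume the first table holds the sales figures
    issues = []
    for tr in table.find_all('tr')[1:]:           # skip the header row
        cols = [td.get_text(strip=True) for td in tr.find_all('td')]
        if not cols:
            continue
        data = {
            'year': year,                         # from the script arguments
            'month': month,
            'rank': cols[0],                      # column positions are assumptions
            'title': cols[1],
            'issue_num': cols[2],
            'price': cols[3],
            'publisher': cols[4],
            'est_sales': cols[5],
            'format': None,                       # the real script derives this from ranking-related logic
        }
        issues.append(data)
    return issues
```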
This Python object, named data, contains fields populated with data from different sources. The year and month are populated using the arguments passed to the module. The format field is dynamically set based on logic related to the rankings, and the remaining fields are set based on their source’s position in the HTML table. As you can see, there is a lot of hard coded logic that would need to be updated should the scraped site change its format. However, for now, this logic gets the job done.
The final step of this task was to write those Python objects to a CSV file. The Python csv module provides a writer object whose writerow() function accepts a sequence and writes each of its elements as a column in the CSV.
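A sketch of the CSV write, reusing the assumed field names from the parsing sketch above:

```python
import csv


def write_month_csv(issues, month, year):
    fields = ['year', 'month', 'rank', 'title', 'issue_num',
              'price', 'publisher', 'est_sales', 'format']
    filename = '%s_%s.csv' % (month, year)
    with open(filename, 'wb') as f:               # 'wb' is what the Python 2 csv module expects
        writer = csv.writer(f)
        writer.writerow(fields)                   # header row
        for issue in issues:
            writer.writerow([issue[k] for k in fields])
```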
My first pass raised an exception because the title field contained unicode characters that the CSV writer could not handle.
To rectify this, I had to add a check for unicode and encode the content as UTF-8. Unicode is a standard character set and UTF-8 is one way of encoding it; together they provide the map computers use to identify characters. This covers alphabets and logographic symbols from different languages, as well as common symbols like ®.
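A minimal sketch of that check, using Python 2 semantics where str and unicode are distinct types:

```python
def to_utf8(value):
    # The Python 2 csv writer chokes on unicode objects, so encode them as UTF-8 first
    if isinstance(value, unicode):
        return value.encode('utf-8')
    return value

# ...and in the write loop:
# writer.writerow([to_utf8(issue[k]) for k in fields])
```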
Additionally, there was the matter of reformatting the values of some of the numeric fields to allow math to be performed on them later (i.e. stripping ‘$’ and commas). Other than that, the data load went pretty smoothly, and a file named (MONTH)_(YEAR).CSV was generated for each month.
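A sketch of that numeric clean-up, assuming the values arrive as strings like ‘$3.99’ or ‘55,372’:

```python
def clean_number(value):
    # '$3.99' -> 3.99, '55,372' -> 55372: strip currency symbols and thousands
    # separators so the values can be treated as numbers later on
    cleaned = value.replace('$', '').replace(',', '').strip()
    return float(cleaned) if '.' in cleaned else int(cleaned)
```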
While this generated tens of thousands of rows of comic sales data, it was not enough; it had the volume but not the breadth of information I needed. In order to make an accurate prediction, I needed to feed more variables to the model than just the comic’s title, issue number, and price. The publisher was not relevant, as I decided to limit this exercise to only Marvel comics, and passing in the estimated sales would be cheating, as rank is based on sales. So to enhance my dataset, I pulled metadata about each of the issues down from “the Cloud” using Marvel’s Developer API. Thankfully, since the API is a web service, there was no need for screen scraping.
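The Marvel Developer API is a fairly plain REST service: each request is signed with a timestamp, your public key and an MD5 hash of the timestamp plus both keys. A sketch of such a call (the key values below are placeholders):

```python
import hashlib
import json
import time
import urllib
import urllib2

PUBLIC_KEY = 'your-public-key'     # placeholder
PRIVATE_KEY = 'your-private-key'   # placeholder


def marvel_get(endpoint, **params):
    # Each Marvel API request carries ts, apikey and md5(ts + privateKey + publicKey)
    ts = str(int(time.time()))
    params.update(ts=ts, apikey=PUBLIC_KEY,
                  hash=hashlib.md5(ts + PRIVATE_KEY + PUBLIC_KEY).hexdigest())
    url = 'https://gateway.marvel.com/v1/public/%s?%s' % (endpoint, urllib.urlencode(params))
    return json.loads(urllib2.urlopen(url).read())

# e.g. the first page of series whose titles start with 'A'
series_page = marvel_get('series', titleStartsWith='A', limit=100)
```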
Even then there were a few minor issues, as the API caps each request at 100 results and some letters, like ‘S’, had more than 100 titles. To get around that, I had to pull the list of ‘S’ titles sorted ascending and then descending and combine the results, making sure to remove duplicates. So my advice on this one is to read up on the limitations of any API you are using; it may enforce limits, but you may be able to work around them with some creative querying.
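A sketch of that workaround, reusing the hypothetical marvel_get() helper from the previous sketch and deduplicating on the series ID:

```python
def series_starting_with(letter):
    # The API returns at most 100 results per call, so for crowded letters pull the
    # list sorted ascending and then descending and merge the two halves
    asc = marvel_get('series', titleStartsWith=letter, orderBy='title', limit=100)
    desc = marvel_get('series', titleStartsWith=letter, orderBy='-title', limit=100)
    merged = {}
    for payload in (asc, desc):
        for series in payload['data']['results']:
            merged[series['id']] = series          # keyed by ID, so duplicates collapse
    return merged.values()
```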
At this point I have my list of Marvel series titles, stored in some CSV files that I eventually combined into a single file, MarvelSeriesList.csv, for ease of use. Actually, I have more than that: while retrieving the series titles, I also pulled down the ID for each series and an appropriateness rating. Searching the API by ID will be much more accurate than searching by name, and the appropriateness rating may be useful when building out the prediction model. The next step was to iterate through each row of the CSVs created from the sales data, find the matching ID from MarvelSeriesList.csv and use that ID to retrieve the issue’s metadata from the API.
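Before the matching step, the combined series file gets loaded into memory; a sketch, with assumed column names (title, id, rating):

```python
import csv


def load_series_lookup(path='MarvelSeriesList.csv'):
    # Build {series title: (series ID, appropriateness rating)} for the matching step
    lookup = {}
    with open(path, 'rb') as f:
        for row in csv.DictReader(f):
            lookup[row['title']] = (row['id'], row['rating'])
    return lookup
```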
Again, it took a few passes to get the configuration to work effectively. I used the FuzzyWuzzy library to fuzzy-match the sales titles against the Marvel series titles, and the first time through only about 65% of the titles in the sales file found a match. That was throwing away too much data for my liking, so I had to look at the exceptions and figure out why the matches were falling through. One issue I found was that titles which tacked on the publication year in the Marvel database, like “All-New X-Men (2012)”, had a match score in the 80s when matched against a sales title like “All New X-Men”. This was a pretty consistent issue, so rather than lowering the match threshold, which could let genuine mismatches slip through, I decided to strip the year, if present, on mismatches and run them through the matching process again. This got me almost there.

The only other issue I ran into was that FuzzyWuzzy had trouble matching acronyms/acrostics. ‘S.H.I.E.L.D.’, for example, had a match score in the 50s when matched against ‘SHIELD’, because half the characters (the periods) were missing. Since there were only two titles affected, I built a lookup dictionary of special cases that needed to be translated. For the purposes of this exercise I would still have had enough matches to skip that step, but doing it brought us up to 100% matching between the two sources. Once the matching function was working, I pulled out urllib2 and retrieved all the metadata I could for each of the issues.
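The matching function itself ended up being FuzzyWuzzy plus the two fallbacks described above; a sketch of that logic, with an illustrative threshold and special-case entry:

```python
import re
from fuzzywuzzy import process

# Hand-built translations for the acronym titles FuzzyWuzzy scored poorly (illustrative entry)
SPECIAL_CASES = {'SHIELD': 'S.H.I.E.L.D.'}


def match_title(sales_title, series_titles, threshold=90):
    sales_title = SPECIAL_CASES.get(sales_title, sales_title)
    best, score = process.extractOne(sales_title, series_titles)
    if score >= threshold:
        return best
    # Fallback: strip a trailing publication year like "(2012)" from the Marvel titles and retry
    stripped = [re.sub(r'\s*\(\d{4}\)$', '', t) for t in series_titles]
    best, score = process.extractOne(sales_title, stripped)
    if score >= threshold:
        return series_titles[stripped.index(best)]
    return None
```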
The resulting files contained not only sales data (title, issue number, month rank, estimated sales), but also information about the creative team, issue descriptions, characters, release dates and associated story arcs. This would be enough to get us started with building our predictive classification model.