App Store Data Mining Techniques Revealed – Part 1
The App Store is a treasure trove of data. App Store data can help you pick a category/segment, track trends, find the right price point, chart the total number of apps, track the rate of app approval and much more.
App Store data mining isn’t magic. It’s about finding data that’s exposed in iTunes, extracting it in a machine-parseable format, and doing something with it. This article demonstrates each of those steps; further articles will expand on the topic.
As is almost always the case, this is best explained with an example, so I’ll use a straightforward one: let’s calculate the average selling price (ASP) for the top grossing apps.
iTunes
The App Store in iTunes contains, in its various views, the superset of the available data. The first step is to find places inside the iTunes App Store that expose the data you want to crunch.
Finding a place in iTunes that presents data that we’ll use to calculate the ASP for the top grossing apps is straightforward:
From the App Store home screen, clicking See All in the Top Grossing segment of the Top Charts panel shows the full list of top grossing apps and their prices.
XML
The iTunes store works much like a browser: it speaks HTTP, but instead of rendering HTML it consumes XML.
When iTunes shows the page of Top Grossing apps, each app is represented by a block of XML that contains the title, price, update date, a link to an icon, and so on.
We’ll use this XML to calculate the ASP for the Top Grossing apps, but first we need to capture it and write it to disk:
Proxy
The easiest way to get at iTunes’ XML data is to use a proxy as a man in the middle. Put a proxy in the middle that can also write what it sees to disk and you’re in business.
I highly recommend, and for this article will be using, Karl von Randow’s Charles Proxy. Charles is designed for exactly this kind of task: as you use iTunes (or a browser) it records all of the headers and content that pass through it and provides you with tools to manipulate, search, filter, display and export the data.
The alternative to using the Charles proxy is to roll your own. Before becoming a Charles convert I wrote a proxy in Ruby. I was only interested in the XML data, so I wrote code to filter on content type. Then I needed to decompress the gzipped content, so I wrote code for that. Then I wrote code to name the files in a useful way. Etc.
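If you’re curious what rolling your own entails, a minimal sketch of that kind of logging proxy might look like this. It assumes Ruby’s stdlib WEBrick proxy server; the XML-only filter and the numbered file names are arbitrary choices, not anything iTunes requires:

#!/usr/bin/env ruby
# A bare-bones logging proxy: forwards traffic and, for XML responses,
# un-gzips the body and writes it to a numbered file on disk.
require 'webrick'
require 'webrick/httpproxy'
require 'zlib'
require 'stringio'

count = 0
handler = proc do |_req, res|
  # Only keep responses whose content type mentions XML.
  next unless res['content-type'].to_s.include?('xml')
  body = res.body.to_s
  # The store gzips its responses; decompress before writing.
  if res['content-encoding'] == 'gzip'
    body = Zlib::GzipReader.new(StringIO.new(body)).read
  end
  count += 1
  File.open("capture-#{count}.xml", 'wb') { |f| f.write(body) }
end

proxy = WEBrick::HTTPProxyServer.new(Port: 8888, ProxyContentHandler: handler)
trap('INT') { proxy.shutdown }
proxy.start

Set localhost:8888 as the system HTTP proxy, browse the store in iTunes, and every XML response it receives lands on disk.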
Charles is $50. That’s easily worth it vs. the code you’d otherwise have to write, especially in the exploring phase, where Charles lets you quickly pin down exactly where the data you’re looking for came across the wire. Needless to say, I’ve no commercial interest in Charles; I’m recommending it on its merits. You can see for yourself with a free, 30-day trial.
Locating And Saving The Data
Fire up your proxy and open iTunes. Charles automatically configures OS X to place itself inline as a proxy after you grant it permission to do so.
In iTunes click the iTunes Store item in the left column, select the App Store from the iTunes Store’s top menu bar and click See All in the Top Grossing segment of the Top Charts panel.
Take a look at what’s crossed the wire: iTunes makes HTTP requests to a number of different web hosts. The two that’ll likely be of most interest are a1.phobos.apple.com and ax.itunes.apple.com. The former serves app icon images; the latter serves what we’re after: XML data.
We’re interested in the XML data that iTunes used to render the Top Grossing screen. Finding the right file among the lot of them is most easily accomplished by searching for some bit of text that’ll only show up in the file we’re after.
The best bet here is to pick the title of one of the apps at the tail end of the list; those aren’t likely to be featured and won’t show up in the top-10 list on the App Store’s top-level page. Charles makes quick work of searching: press Command-F to bring up the search dialog, enter the text to find, choose the Session scope to search across all the files, and click Find.
If you picked your search term wisely it’ll show up several times in one file. Double click any row in the search results to view the item. Right-click or Control-click inside the content pane, choose Save Response… and store the results to disk.
Manipulating The Data
The XML data is large — 22,000 lines for our sample — and hard to comprehend.
Rather than trying to fully understand its format, simplify things by searching for prices.
Prices show up in a number of places, e.g. in the alt attributes of some images. Parsing them out of those isn’t ideal. However, prices also show up almost unadorned, nested in this structure:
<TextView topInset="0" truncation="right" leftInset="0" styleSet="basic11" textJust="left" maxLines="1">
  <SetFontStyle normalStyle="matrixTextFontStyle">
    <b>$9.99</b>
  </SetFontStyle>
</TextView>
XPath exists to make it easy to pluck values out of structures like this. This XPath expression will pluck the prices out of our document (the names are lowercase because Hpricot, the parser we’ll use below, normalizes element and attribute names to lowercase):
//textview[@styleset="basic11"]/setfontstyle/b
Using Ruby’s Hpricot library, this short script does the work of grabbing each price, stripping the leading dollar sign and calculating the average:
#!/usr/bin/env ruby
require 'rubygems'
require 'hpricot'

doc = Hpricot(File.read("topgrossing.xml"))

total = 0.0
doc.search('//textview[@styleset="basic11"]/setfontstyle/b').each do |i|
  # i.inner_text is a price like "$9.99"; drop the "$" and accumulate.
  total += i.inner_text[1..-1].to_f
end

# The Top Grossing list holds 100 apps.
puts "Top Grossing Apps' ASP: $#{total / 100.0}"
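One thing to note: the divisor is hard-coded because the Top Grossing list holds 100 apps. If that ever changes, dividing by the number of matched nodes, doc.search(...).size, would be the more robust choice.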
And we’ve arrived at our goal!
More To Come
This example is straightforward: all of the data is contained in the response to a single HTTP request. In a future post I’ll talk about techniques for scripting a series of requests to gather the pieces of a complete data set. Stay tuned!
Great stuff! HTTPScoop is an alternative to Charles and easier to use.
So don’t hold out, what’s the average?!?
Ben: $10.31 as of a week or so ago when I started working on this 🙂
Thanks for this article! This inspired me to write a quick little Python script to send me a daily update email with the top apps in the category I’m targeting. For others who may be interested in doing the same, I used HTTPScoop (per HiQuLABS suggestion) to see the traffic, EditiX to grok the XML and figure out the pieces I needed to strip out, and Python’s elementtree library to parse the XML. When I was implementing this I first ran into a snag when trying to get the data from the URL that showed in HTTPScoop – it gave me a web page rather than the XML – but then I found I needed to set the user agent as if I was iTunes as described on the following web page: https://blogs.oreilly.com/iphone/2008/08/scraping-appstore-reviews.html
Thanks again for the article!
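For anyone who’d rather do the same in Ruby, a rough sketch of that direct-fetch trick might look like this. The URL is a placeholder (substitute whatever request your proxy captured) and the User-Agent string is likewise an assumption, per the post linked above:

#!/usr/bin/env ruby
# Sketch: fetch store XML directly, presenting an iTunes User-Agent.
require 'net/http'
require 'uri'

# Placeholder URL: substitute the request your proxy captured.
uri = URI.parse('http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/viewTop')

req = Net::HTTP::Get.new(uri.request_uri)
# Without an iTunes-like User-Agent the server returns HTML, not XML.
req['User-Agent'] = 'iTunes/9.0 (Macintosh; U; Intel Mac OS X 10.6)'

res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
File.open('topgrossing.xml', 'w') { |f| f.write(res.body) }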
Great article, can’t wait for the next part. The more real data we gather about this business the better.
How do I get the XML for an App Store application by its AppID? I saw the example for reviews, but I want to get the name, description and other fields. Is it possible to get this?
This tool grabs the descriptions, with screenshots where available:
https://www.appstoresdk.com
Everything is now HTTPS. Do you know a new way to extract the charts? The Apple RSS feed only goes up to 200; I heard this feed goes much deeper.
Hi Ric, I’m working at AppTweak. We’ve been able to bypass the SSL encryption and get the full feed, which indeed goes to 1,000 results.
We provide API access to that data through a REST JSON API, available at https://apptweak.io
Send us a message if you still need those 🙂