App Store Data Mining Techniques Revealed – Part 2: Scripting App Store XML Downloads

On December 8, 2009

Welcome back. The first article in this series introduced App Store data mining fundamentals, principally that iTunes works essentially like a browser, except that instead of rendering HTML iTunes uses XML data to generate its views.

In part one, we used a proxy as a man in the middle to save a copy of some interesting data from an iTunes session to disk. Using a proxy is handy for ad-hoc data mining tasks. However, for recurring tasks, it’s handier to leave the proxy and iTunes behind and grab the XML directly. This article will show you how.

We’ll modify our earlier example, which calculated the average selling price of the top-grossing apps, to automatically pull down the XML data it needs:


As was the case before, I’ll be using Charles Proxy to aid my exploration. As you did in part 1, with the proxy running, open up iTunes and navigate to the full list of the Top Grossing apps.

Previously, I used Charles’ search capabilities to find the HTTP request that resulted in the XML for the Top Grossing apps. That’d work here too, but this time I’ll take a different tack:

Charles lets you look at the requests in a browsing session in two ways: a directory-structure view, showing folders, sub-folders and files in a Finder-like hierarchy, and an ordered sequence.

Using the Sequence view and a bit of filtering will quickly get us to the data we’re after. The XML requests to iTunes are served by URLs that contain WebObjects in their paths. Filter on WebObjects and then hunt for the most recent relatively large request and you’ll find the URL is:

User Agent

Paste that URL into a browser or fetch it with curl and you’ll get a web page back. The iTunes HTTP servers match on the request’s User-Agent and return HTML unless the requestor appears to be iTunes. This is what makes it possible to email iTunes URLs (e.g., to your app) and not leave users stranded with a browser trying to render unfamiliar XML.

So, we need to include a User-Agent header identifying ourselves as iTunes. According to Charles, the User-Agent provided by iTunes was:

iTunes/9.0.2 (Macintosh; Intel Mac OS X 10.6.2) AppleWebKit/531.21.8

We’ll include that in our request.
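As a quick sketch of what that means in Ruby: build the request object and set the header before sending it. The path below is illustrative; substitute the URL you dug out of your own Charles session.

```ruby
require 'net/http'

# The User-Agent string Charles captured from iTunes.
ITUNES_UA = 'iTunes/9.0.2 (Macintosh; Intel Mac OS X 10.6.2) AppleWebKit/531.21.8'

# Build a GET request carrying that header. Net::HTTP sends whatever
# headers the request object holds, so the server sees us as iTunes.
req = Net::HTTP::Get.new('/WebObjects/MZStore.woa/wa/viewTopLegacy?id=25204&popId=38&genreId=36')
req['User-Agent'] = ITUNES_UA

puts req['User-Agent']
```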

Putting It All Together

Starting with the code from the previous article in this series, I’ve modified it to make the HTTP request with the correct User-Agent header and run the resulting XML through the parser to pluck out the prices and calculate the ASP. Here’s the result:

#!/usr/bin/env ruby

require 'rubygems'
require 'hpricot'
require 'net/http'

Net::HTTP.start( '', 80 ) do |http|
  doc = Hpricot(http.get('/WebObjects/MZStore.woa/wa/viewTopLegacy?id=25204&popId=38&genreId=36', "User-Agent" => "iTunes/9.0.2 (Macintosh; Intel Mac OS X 10.6.2) AppleWebKit/531.21.8" ).body)

  total = 0.0
  doc.search("//textview[@styleset='basic11']/setfontstyle/b").each do |i|
    total += i.inner_text[1..-1].to_f  # drop the leading "$" and convert
  end

  puts "Top Grossing Apps' ASP: $#{total / 100.0}"
end

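To see what the parsing step is doing, here’s the same price-extraction arithmetic run on a few made-up price strings (the real values, of course, come out of the XML):

```ruby
# Each matched element's inner text is a price string like "$0.99".
# Dropping the first character (the "$") and calling to_f yields the number.
prices = ['$0.99', '$4.99', '$9.99', '$2.99']  # made-up sample values

total = prices.inject(0.0) { |sum, p| sum + p[1..-1].to_f }
asp = total / prices.size

puts format('ASP: $%.2f', asp)  # prints "ASP: $4.74"
```

In the real script the divisor is 100.0 because the Top Grossing list returns 100 apps.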
Of note to a few: iTunes’ servers used to always serve the content up gzipped, requiring an additional step to gunzip it before it could be used. I forgot to include the gunzip code and, when I stopped to think about it, was surprised to see that it worked without it.
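If you do hit a gzipped response, Ruby’s standard library handles it. A minimal sketch, using a locally compressed sample string so it doesn’t depend on the network:

```ruby
require 'zlib'
require 'stringio'

# Simulate a gzipped HTTP body by compressing some sample XML.
xml = '<plist><dict></dict></plist>'
gzipped = StringIO.new
gz = Zlib::GzipWriter.new(gzipped)
gz.write(xml)
gz.close

# This is the step you'd apply to response.body when the
# Content-Encoding response header says gzip:
body = Zlib::GzipReader.new(StringIO.new(gzipped.string)).read

puts body
```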

One More Coming Up

I’ve got one more article in me on this topic. Stay tuned.


6 responses to “App Store Data Mining Techniques Revealed – Part 2: Scripting App Store XML Downloads”

  1. fvisticot says:

    Is there a solution to retrieve ALL the applications from the appStore (100000) with a single request ? or in multiple requests ???

  2. Dan Grigsby says:

    fvisticot: That’d take multiple requests. Needs to be more of a spider, digging out other “links” and fetching them. The final article in the series will demonstrate something similar.

  3. NineTail says:

    Great articles! Never thought about using a proxy to scoop iTunes!

    What data is possible to scoop about apps? I am looking for a way
    to find out how many times a particular app was downloaded or paid for.

    I have tried to see if there is any companies willing to sell this information for market research, but couldn’t find anything apart from services to use for your
    own apps.

  4. zpabao says:

    Actually, that’s what I did back in 2008: pulled all the app info from the App Store and published it on another site using a different layout; you can add additional features on top of that. It’s in Chinese, and I haven’t maintained it for a long time, so the content is not up-to-date; it has more than 70,000 apps currently.
    Hope I can find time to continue it, and I plan to provide an English version.

  5. Jon Lim says:

    Hey Dan!

    Great article – very helpful. So much so that I had my own App Store scraper running in PHP, and it recently stopped working. Requesting data from the App Store now results in a blank file, whereas before I was able to view the XML.

    Am I alone on this front?

  6. marek says:

    I’m also veeery curious about the last article in the series. Can you tell us roughly when it will appear on your blog?