Sunday, September 8, 2013

Matching misspelled brand names -- the easy way

On Friday, Warby Parker's Director of Consumer Insights approached me with some data. They had sent out a survey, one question of which was "list up to five eyeglass brands you are familiar with." She wanted to aggregate the answers but, being free text, they were riddled with misspellings. Resolving the variants manually would take a lot of time and effort, so she asked if there was a better way to standardize them.

With a question like this, your first reaction might be to think of regular expressions. While there are common forms of misspelling, such as transposed characters (ie <--> ei) and (un)doubled consonants (Cincinnati, Mississippi), this is not a tenable approach. First, you are not going to capture all of the variants, common and uncommon. Second, brand names are not necessarily dictionary words, so they may not follow normal spelling rules.

If we had a set of target brands, we might be able to use edit distance to associate some variants, but "d+g" is very far away from "Dolce & Gabbana". That won't work. Besides, we don't have a fixed list of brands: this is an open-ended question and gives rise to open-ended results. For instance, Dale Earnhardt Jr. has a line of glasses (who knew?) and appeared in our results.
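As a quick illustration (not part of the pipeline), here is a sketch using difflib's similarity ratio as a cheap stand-in for edit distance: a plain misspelling scores close to the real name, but an abbreviation like "d+g" barely registers.

# Illustration only: difflib's ratio as a rough proxy for edit similarity.
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print similarity("dolce and gabana", "dolce & gabbana")  # high: an ordinary misspelling
print similarity("d+g", "dolce & gabbana")               # very low: the abbreviation is hopeless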

To get a sense of the problem, here are the variants of just Tommy Hilfiger in our dataset:

tommy helfinger
tommy hf
tommy hildfigers
tommy hilfger
tommy hilfieger
tommy hilfigar
tommy hilfiger
tommy hilfigger
tommy hilfigher
tommy hilfigur
tommy hilfigure
tommy hilfiinger
tommy hilfinger
tommy hilifiger
tommy hillfiger
tommy hillfigger
tommy hillfigur
tommy hillfigure
tommy hillfinger
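To return to the regex idea for a moment: below is a rough, hand-rolled pattern for just this one brand (a sketch, not something I actually used). It catches many of the variants above, but it still misses entries like "tommy hf" and "tommy helfinger", and every brand would need its own pattern.

import re

# Best-effort pattern for Tommy Hilfiger variants -- illustration only.
pattern = re.compile(r"tommy\s+hil+[a-z]*f[a-z]*g[a-z]*r", re.I)

for v in ["tommy hilfiger", "tommy hillfigure", "tommy hf", "tommy helfinger"]:
    print v, bool(pattern.search(v))
# tommy hilfiger True
# tommy hillfigure True
# tommy hf False
# tommy helfinger False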

Even if one of these approaches kind of worked and you could resolve the easy cases, leaving the remainder to resolve manually, it still would not be desirable. We want to run this kind of survey at regular intervals and I'm lazy: I want to write code for the whole problem once and rerun it multiple times later. Set it and forget it.

This kind of problem is something that search engines contend with all the time. So, what would google do? They have a sophisticated set of algorithms that associate document content, and especially link text, with target websites. They also have a ton of data and can reach far along the long tail of these variants. For my purposes, however, it doesn't matter how they do it, only whether I can piggyback off their efforts.

Here was my solution to the problem:

If I type "Diane von Burstenburg" into google, it returns

Showing results for Diane von Furstenberg

and the top result is for dvf.com. This is precisely the behavior I want. It will map all of those Tommy Hilfiger variants to the same website.

We now have a reasonable approach. What about implementation? Google has really locked down API access. Their old JSON API is deprecated; it is still available but limited to 100 queries per day. (I used pygoogle to query it easily with

>>> from pygoogle import pygoogle
>>> g = pygoogle('ray ban')
>>> g.pages = 1
>>> g.get_urls()[0]
u'http://www.ray-ban.com/'

but was shut down by google within minutes). Even if you wget their results pages, they don't contain the search results, which are all ajaxed in. I didn't want to pay for API access for a small one-off project, so I went looking elsewhere. DuckDuckGo has a nice JSON API but its results are limited. I didn't feel like parsing Bing's results page. Yahoo (+ BeautifulSoup + urllib) saves the day!

The following works well, albeit slowly due to my rate limiting (sleeping for 15 seconds between queries):

from bs4 import BeautifulSoup
import urllib
import time
import urlparse

f_out = open("output_terms.tsv", "w")  # output: term, top link, domain (tab-separated)
f_in = open("terms.txt", "r")          # input: list of terms, one per line
for line in f_in.readlines():
    term = line.strip()
    try:
        print term
        # search Yahoo for the exact (quoted) term
        response = urllib.urlopen("http://search.yahoo.com/search?p=" + "\"" + term + "\"")
        soup = BeautifulSoup(response)
        # the first result link on Yahoo's results page has id="link-1"
        link = soup.find("a", {"id": "link-1"})['href']
        # reduce the full link to its scheme://host/ domain
        parsed_uri = urlparse.urlparse(link)
        domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
        f_out.write(term + "\t" + link + "\t" + domain + "\n")
        time.sleep(15)  # rate limit
    except TypeError:
        print "ERROR with " + term

f_in.close()
f_out.close()

where terms.txt was the set of unique terms after I lowercased each term and removed hyphens, ampersands, and " and ".
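That preparation step looked roughly like the following sketch (the normalize function and the raw_answers.txt file name are mine for illustration, not from the actual run):

# Sketch of the term cleanup: lowercase, drop hyphens, ampersands and " and ",
# collapse whitespace, then deduplicate. raw_answers.txt is a hypothetical input
# file with one raw survey answer per line.
def normalize(term):
    term = term.strip().lower()
    term = term.replace("-", " ").replace("&", " ")
    term = term.replace(" and ", " ")
    return " ".join(term.split())

raw = [line.strip() for line in open("raw_answers.txt") if line.strip()]
unique_terms = sorted(set(normalize(t) for t in raw))
out = open("terms.txt", "w")
out.write("\n".join(unique_terms) + "\n")
out.close()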

The search code above outputs rows such as:

under amour http://www.underarmour.com/shop/us/en/ http://www.underarmour.com/
under armer http://www.underarmour.com/shop/us/en/ http://www.underarmour.com/
under armor http://www.underarmour.com/shop/us/en/ http://www.underarmour.com/
under armour http://www.underarmour.com/shop/us/en/ http://www.underarmour.com/
underarmor http://www.underarmour.com/shop/us/en/ http://www.underarmour.com/

It is not perfect by any means. Those Tommy Hilfiger variants result in a mix of tommy.com and tommyhilfiger.com, which is Hilfiger's fault for having confusing / poor SEO. More importantly, about 10% of the results map to wikipedia:

karl lagerfeld http://en.wikipedia.org/wiki/Karl_Lagerfeld

For these, I did

cat output_terms.tsv | grep wikipedia | awk -F '\t' '{print $1}' > wikipediaterms.txt

and reran these through my code using this query instead:

response = urllib.urlopen("http://search.yahoo.com/search?p=" + "\"" + term + "\"" + " -wikipedia")

This worked well and Lagerfeld now maps to

karl lagerfeld http://www.karl.com/ http://www.karl.com/

(There are of course still errors: Catherine Deneuve maps to http://www.imdb.com/name/nm0000366/ http://www.imdb.com/, a perfectly reasonable response. I tried queries of the form '"searchterm" +glasses' for greater context, but the overall results were not great: lots of ebay links started appearing.)

Now I have a hands-free process that seems to capture most of the variants and has trouble mostly on low-frequency, genuinely ambiguous cases. It can easily be run and rerun for future surveys. Laziness for the win! I don't even care if it fails to map very uncommon variants because, in this case, we don't need perfect data. We will aggregate the websites and filter out anything with frequency less than, say, X, so the odd terms this process got wrong don't matter. In other words, we care most about the first few bars of our ordered histogram.
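For completeness, that final aggregation step could look roughly like this sketch (the threshold value and the normalized_answers.txt file name are placeholders, not part of the actual pipeline):

from collections import Counter

# term -> domain lookup built by the Yahoo step
term_to_domain = {}
for line in open("output_terms.tsv"):
    term, link, domain = line.rstrip("\n").split("\t")
    term_to_domain[term] = domain

# one normalized survey answer per line (hypothetical file)
answers = [line.strip() for line in open("normalized_answers.txt") if line.strip()]

MIN_COUNT = 5  # the "X" above; tune to the survey size
counts = Counter(term_to_domain[a] for a in answers if a in term_to_domain)

for domain, n in counts.most_common():
    if n >= MIN_COUNT:
        print domain, n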

Data scientists need to be good at lateral thinking. One skill is not to focus too much on the perfect algorithmic solution to a problem but, when possible, to find a quick, cheap and dirty solution that gets you what you want. If that means a simple hack to piggyback off a huge team of search engine engineers and their enormous corpus, so much the better.

