How did we make our New Year’s card? It is compiled from the data in our search indices. We ran a wildcard search across a sample of about 30 GB of the search index, counted the occurrences of each term, sorted by count, and kept the top 250 terms. All of this can be done in a single operation.

Try doing this in a relational database. You would have to run a select on every column of every table, tokenize each field to get the individual words, and then count and sort. That would take forever. In our case, it took about 9 seconds to run the query over the 30 GB sample and parse the result.

It should be noted that we removed some ‘stop words’ such as ‘the’ and ‘and’, as well as potentially customer-specific data such as names. We also gave extra weight to ‘2016’ and ‘Cliffhanger’ for better presentation.
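To make the counting steps concrete, here is a minimal sketch in plain Python of the same pipeline: tokenize, drop stop words, count, apply extra weight, sort, and keep the top terms. It is not the query we ran against the search index; the function name `top_terms`, the tokenizer pattern, the stop-word list, and the boost values are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption, not the list used for the card).
STOP_WORDS = {"the", "and", "a", "of", "to", "in"}

# Hypothetical weights for terms that should stand out on the card.
BOOSTS = {"2016": 3.0, "cliffhanger": 3.0}


def top_terms(documents, limit=250, stop_words=STOP_WORDS, boosts=BOOSTS):
    """Return the `limit` highest-weighted (term, weight) pairs."""
    counts = Counter()
    for text in documents:
        # Lowercase, then split into runs of letters and digits.
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            if token not in stop_words:
                counts[token] += 1

    # Multiply each raw count by its boost (1.0 when no boost is defined).
    weighted = {term: n * boosts.get(term, 1.0) for term, n in counts.items()}

    # Sort by weight, highest first; break ties alphabetically.
    ranked = sorted(weighted.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]


if __name__ == "__main__":
    sample = [
        "The Cliffhanger team and the search index",
        "Search the index in 2016",
    ]
    for term, weight in top_terms(sample, limit=3):
        print(term, weight)
```

A search index avoids most of this work at query time because the text is already tokenized when it is indexed, so counting terms is a lookup rather than a scan over raw fields.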