Yelp Business Rating Predictions

In this project, I applied machine learning algorithms to predict Yelp business ratings. The data consist of meta-data elements (such as location, type of business, and, for restaurants, the type of food they serve) which can be used as general estimators of a business's performance (in this example, the *star* rating in the Yelp system).

As with the Capital One Labs dataset, there are often many features (the world, and companies, have so much data!). We can use the data in different ways to create a variety of models for prediction. In this example, I will walk through creating several models based on different subsets of these features. We will mainly be using variations of sklearn's transformers, estimators, and predictors. The beauty of this architecture is that the algorithms can be customized to suit the needs of many different problems.

We can make the models as simple or as complex as we'd like, as long as we remember that the goal is simply to arrive at a good prediction. As we saw before, this does not always entail using all of the data! To begin, let's build a predictive model using the average rating in each city. It might be the case that certain cities simply have different rating regimes (Minneapolis might consistently have higher ratings than Brooklyn - those Minnesotans are just so nice! ;) ). To do this we'll use the average rating in each city as our prediction.

For a simple predictive methodology, it is often not necessary to roll a complex fit object. In this example, we are using a single metric which will have a single value per city, so it makes sense to use a fast data structure such as a hash/dictionary. The cities will be the keys and the average ratings will be the values in our key:value pairs. Let's start by loading the data into a data frame and performing a simple groupby operation.

import json
import numpy as np
import pandas as pd
from sklearn import cross_validation

with open('yelp_train_academic_dataset_business.json', "r") as f:
    dataset = [json.loads(line) for line in f]
    Yelp_Business_df = pd.DataFrame(dataset)

CityModel = Yelp_Business_df[['stars', 'city']]
CityModel_train, CityModel_test = cross_validation.train_test_split(CityModel, test_size=0.2)

#map each city to its average star rating in the training split
CityDict = CityModel_train.groupby('city')['stars'].mean().to_dict()

#now can input queries to dictionary to get mean rating for city.
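As a rough sketch, here is how we might score this lookup model on the held-out split, predicting the global training mean for any city not seen in training (global_mean, predictions, and rmse are just illustrative names):

#score the city-lookup model on the held-out split
global_mean = CityModel_train['stars'].mean()
predictions = CityModel_test['city'].map(lambda c: CityDict.get(c, global_mean))
rmse = np.sqrt(((CityModel_test['stars'] - predictions) ** 2).mean())
print rmse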

This simple model is not great at predicting a business rating. But can we do better? Average ratings can also vary greatly by location within a city, so we can use geographical coordinates (latitude and longitude) to build a more precise model. Here we are just adding complexity to a simple idea.

We need to choose an appropriate algorithm that accounts for the non-linearity of our modeling variables, latitude and longitude. You can imagine that neither has a linear correlation with the star value, so we need to determine whether there are trends within latitude/longitude regions. We could implement this in a few ways. The simplest model might be to define groups based on geographical location, such as zip code, and then apply a method similar to the one above. But what if there are trends within certain zip codes which are more difficult to discern, or are more fine-grained than a zip code? To uncover this "invisible" structure, we can use either K-nearest neighbors (KNN) or Random Forest regression. I will walk through KNN in this example.

KNN is a non-parametric algorithm. The easiest way to understand it is to consider the solution for one nearest neighbor. If we are trying to predict a y_test value for a test point based on some features (x_test), our best prediction is obtained by finding the training point (x_train) closest to x_test and using its y_train value as our prediction. As we scale up the number of nearest neighbors, we simply take the average of those neighbors' y_train values to predict y_test. We will optimize the number of nearest neighbors by applying a grid search.
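As a toy illustration with made-up coordinates and ratings, the prediction for k = 3 is just the mean star rating of the three closest training businesses:

#toy KNN example - the coordinates and star ratings below are made up
import numpy as np
from sklearn import neighbors

x_train = np.array([[44.98, -93.27], [44.97, -93.26], [44.99, -93.28], [40.70, -73.99]])
y_train = np.array([4.5, 4.0, 5.0, 3.0])

knn = neighbors.KNeighborsRegressor(n_neighbors=3)
knn.fit(x_train, y_train)
print knn.predict([[44.98, -93.265]]) #mean of the three nearby neighbors: 4.5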

from sklearn import neighbors, cross_validation, grid_search

#LATLONG MODEL

LatLon_df = Yelp_Business_df[['latitude', 'longitude']]
x_latlong = np.array(LatLon_df)
stars = Yelp_Business_df['stars'].values

#setup regression and cross-validation. Grid search over the nearest-neighbor parameter and keep the best fit with respect to the cross-validated score (R^2, the default for a regressor)

cv = cross_validation.ShuffleSplit(len(stars), n_iter=20, test_size=0.2, random_state=42)
param_grid = { "n_neighbors": range(4, 100, 1) }
KNN = grid_search.GridSearchCV(neighbors.KNeighborsRegressor(),
                               param_grid=param_grid, cv=cv)

print KNN.fit(x_latlong, stars)
#GridSearchCV refits the best estimator on the entire dataset for us (refit=True by default)
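As a quick sketch of how we might inspect the fitted search (predicting on a few training rows purely for illustration):

#inspect the winning neighborhood size and its cross-validated score
print KNN.best_params_
print KNN.best_score_

#the refit best estimator can now predict star ratings from lat/long pairs
print KNN.predict(x_latlong[:5])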

So this will now infer the number of lat-long nearest neighbors to use in the model, based on the cross-validated score. If you deploy this model, you'll see it does better than the city model, but is still a weak predictor. So far we've used primarily location features for modeling. What about restaurant-specific information? For example, you can imagine the type of cuisine might influence ratings. This could even depend on the population demographics in the surrounding area (we could think about linking census data here too!). Let's start simple and use the food category labels in Yelp as predictors.

#CATEGORY MODEL

'''
For this model, we need to convert the categorical features into a list of dictionaries.

We want the form to be:
[
{restaurant: 1, steakhouses: 1, pizza: 1, shopping: 1},
{restaurant: 1, steakhouse: 1, handyman: 1},
...
]
'''
from sklearn.feature_extraction import DictVectorizer
from sklearn import linear_model

categories = Yelp_Business_df['categories'].tolist()

#one {category: 1} dictionary per business
catlistdicts = [{cat: 1 for cat in cats} for cats in categories]

DV = DictVectorizer()
X = DV.fit_transform(catlistdicts)
y = stars #from above

cv = cross_validation.ShuffleSplit(len(y), n_iter=20, test_size=0.2, random_state=42)
LR = linear_model.LinearRegression()

AccScore_linear = cross_validation.cross_val_score(LR, X, y, cv=cv)

print AccScore_linear
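To get a feel for what the model picks up on, here is a rough sketch of how we could inspect the largest category coefficients (fitting once on the full matrix purely for inspection; coef_by_category is an illustrative name):

#pair each coefficient with its category name and sort
LR.fit(X, y)
coef_by_category = sorted(zip(LR.coef_, DV.get_feature_names()), reverse=True)
print coef_by_category[:10]  #categories associated with the highest ratings
print coef_by_category[-10:] #categories associated with the lowest ratings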

The only problem with this type of model can be the over-representation of some categories. We don't want to bias a category's power in the prediction simply because it occurs more frequently. One way to handle this is to use TF-IDF, which will normalize each category based on its total count in the entire set.

#this re-weights the category counts with TF-IDF as an alternative to the raw counts above
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()
X = tfidf.fit_transform(X) #X is the DictVectorizer count matrix from above
y = stars #from above

cv = cross_validation.ShuffleSplit(len(y), n_iter=20, test_size=0.2, random_state=42)
LR = linear_model.LinearRegression()

tfidf_linear = cross_validation.cross_val_score(LR, X, y, cv=cv)

print tfidf_linear
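A quick way to compare the two category models is the mean cross-validated score of each:

#compare the average scores of the raw-count and TF-IDF category models
print np.mean(AccScore_linear), np.mean(tfidf_linear)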

Let's continue using text/unstructured data for predictions. In the Yelp business data, there are certain attributes which can be nested. For example, a venue could have the following attributes: { 'Attire': 'formal', 'Accepts Credit Cards': False, 'Ambience': {'romantic': False, 'classy': True }}. These can be transformed in a similar manner to the above, but first we'll need to flatten the dictionaries. Once we encode these features, we can feed them into a regression algorithm.
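For the example venue above, the flattened form we are aiming for would look roughly like this (nested keys joined with an underscore, booleans cast to 0/1, and any other value recorded as presence):

#what the flattened dictionary for the example venue above would look like
flattened_example = {'Attire': 1, 'Accepts Credit Cards': 0,
                     'Ambience_romantic': 0, 'Ambience_classy': 1}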

#Attribute Model - we need to flatten the attributes for each venue

Attributes_df = Yelp_Business_df[['attributes']]

def flatten(d, parent_key='', sep='_'):
    '''
    Recursively flatten a nested attribute dict into {flat_key: numeric_value} pairs
    '''
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, dict):
            #recurse into nested attributes such as 'Ambience'
            items.extend(flatten(v, new_key, sep=sep).items())
        elif isinstance(v, bool):
            #booleans become 0/1 (checked before int, since bool is a subclass of int)
            items.append((new_key, int(v)))
        elif isinstance(v, int):
            items.append((new_key, v))
        else:
            #any other value (e.g. 'Attire': 'formal') is recorded as presence
            items.append((new_key, 1))
    return dict(items)

#flatten the attribute dictionary of each venue
flat_attribute_dicts = [flatten(d) for d in Attributes_df['attributes']]

DV = DictVectorizer()
X = DV.fit_transform(flat_attribute_dicts)
y = stars #from above

cv = cross_validation.ShuffleSplit(len(y), n_iter=20, test_size=0.2, random_state=42)
LR = linear_model.LinearRegression()

AttributeModel_linear = cross_validation.cross_val_score(LR, X, y, cv=cv)
print AttributeModel_linear

One powerful component of the sklearn library is our ability to combine different models. In the above examples, we built features from a variety of sources. We can combine these all into one model using a FeatureUnion. To do this we'll need to create a transformer class that wraps each model's estimator. Since FeatureUnions assume that only transformations are applied to the data, each wrapped model transforms the input data into predictions. From there, we can feed the results into a regression algorithm.

#in addition to the above:
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.base import TransformerMixin

#create a class to output the prediction in the transform method

class ModelTransformer(TransformerMixin):
    '''
    to make an estimator behave like a transformer
    '''

    def __init__(self, model, name):
        self.model = model
        self.name = name

    def fit(self, *args, **kwargs):
        self.model.fit(*args, **kwargs)
        return self

    def transform(self, X, **transform_params):
        #the prediction now happens in the transform method, returned as a single feature column
        return np.array(self.model.predict(X)).reshape(-1, 1)

pipeline = Pipeline([