Product Reviews: Text Attributes

Contents

  1. Summary of Results
  2. Data Source and Discussion
  3. Length
  4. Language Diversity
  5. Descriptor Density
  6. Most Frequent Descriptors
  7. Long-Word Density
  8. Personability
  9. Storytelling
  10. Social Network Markers
  11. Augmenting our Dataset
  12. Summarize Data
  13. Future Work

<h2>Summary of Results</h2>

This project is a step toward assessing whether a review is considered helpful based on the substance of its text. To this end, we aim to better understand the structure of our collection of reviews, and do so by augmenting the given dataset with eight variables that reveal text attributes. Specifically, we establish the following:

  1. Length: the number of tokens in the review
  2. LanguageDiversity: the number of unique words divided by the length
  3. DescriptorDensity: the share of tokens that are adjectives or adverbs
  4. MostFrequentDescriptors: up to five descriptors that occur more than once in the review
  5. LongWordDensity: the share of tokens longer than five characters
  6. Personability: the density of personal pronouns and possessives, excluding “it”
  7. Storytelling: the density of words denoting the passage of time or a sequence of events
  8. SocialNetworkBinary: whether the review mentions a familial or social connection

Additional, but brief, discussion for motivation is given in the respective subsections below.

The general structure of these attributes can be glimpsed through their respective distributions:

import matplotlib.pyplot as plt
columns_for_hist = ['Length','LanguageDiversity','DescriptorDensity','LongWordDensity','Personability','Storytelling','SocialNetworkBinary']

fig = plt.figure(figsize=[15, 30])

for i, col in enumerate(columns_for_hist):
    #setup position on canvas
    ax = fig.add_subplot(4, 2, i+1)
    
    #the actual plot
    ax.hist(reviews_sample[col])
    ax.set_title(col)
    ax.xaxis.set_ticks_position('bottom')

plt.subplots_adjust(hspace = 0.2)
plt.show()

[figure: histograms of the seven attribute distributions listed above]

Notably, the above results indicate that, with the possible exceptions of Length, Storytelling, and SocialNetworkBinary, there is enough variation within the dataset that we should proceed with inspecting their impact on the helpfulness score of each review.

Note: we don’t expect any single attribute to have much independent value, and future modelling will need to account for their interactions, i.e., we expect substantial collinearity. For example, short reviews like “This was great!” tend to realize higher language diversity scores simply because few of their tokens repeat.
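
A quick first look at these interactions is a pairwise correlation matrix over the new numeric attributes; a minimal sketch, assuming the augmented reviews_sample constructed later in this write-up:

#rough collinearity check: pairwise correlations among the new attributes
attr_cols = ['Length','LanguageDiversity','DescriptorDensity','LongWordDensity','Personability','Storytelling','SocialNetworkBinary']
print(reviews_sample[attr_cols].corr().round(2))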

<h2>Data Source and Discussion</h2>

The data are made available by The Stanford Network Analysis Project. Included are 568,454 reviews from 256,059 users spanning 74,258 products. The variables included for each observation are:

reviews.columns
Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')

Before beginning the actual work, we will arbitrarily choose two reviews to serve as working examples and walk through the reasoning behind our desired attributes. While we establish a good number of attributes, most are fairly intuitive, so we keep the associated discussions as short as possible. In the course of establishing these attributes, we will write the general functions we will need when we apply our reasoning to the entire dataset.

Note: we will refer to an instance of a review as either “review” or “text”, with no difference in meaning.

#setup our standard environment and a dataframe of the dataset
import pandas as pd
import numpy as np
reviews = pd.read_csv('Reviews.csv')
#we will use the nltk package to perform our NLP
#punkt is the tokenizer
import nltk
nltk.download('punkt')
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Thomas\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
True
#use python builtin to efficiently perform the requisite counting
from collections import Counter
#arbitrarily choose our two worked examples and convert each into the two formats that the package requires
#note: we retain the raw text as its own variable; this will make some methods easier to apply, e.g., inspecting html structure
raw1 = reviews['Text'][75165]
tokens1 = nltk.word_tokenize(raw1)
ex_text1 = nltk.Text(tokens1)

raw2 = reviews['Text'][275845]
tokens2 = nltk.word_tokenize(raw2)
ex_text2 = nltk.Text(tokens2)

Note: it is possible to remove all punctuation marks before applying the previous step. There are arguments both for keeping and for removing them, but we have chosen to leave the reviews as unchanged as possible. We should be aware that, particularly with such short text instances, some of our measurements will be affected by this choice. In particular, heavy punctuation users will yield longer lengths and, hence, reduced density measurements; the same goes for users who include links or other html elements. Removing punctuation would also negatively affect the tokenization process: because we assume the language contained in the reviews is more casual, we expect a high incidence of contractions, which the tokenizer handles by splitting them into two tokens, and stripping apostrophes beforehand would break that behavior. More thought and experimentation will be required on this topic in future work to find the optimal setup.
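
For instance, the tokenizer splits contractions at the apostrophe and peels punctuation off into tokens of its own:

nltk.word_tokenize("I'm sure it isn't.")
['I', "'m", 'sure', 'it', 'is', "n't", '.']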

For reference, our working examples are:

raw1
"This was an Excellent value of buying on Amazon versus the pet stores.<br />It was packaged and labeled well for shipment.<br /><br />However - the manufacturer's plastic top of the box itself is a little cheap and less functional than I expected.  Frankly, the whole top has to come off each time I want to give my dog a single chew.<br /><br />So - I'm actually putting these chews into a Dog treat jar that has a open-and-close top with a tab.<br /><br />The manufacturer could simply change the bulk packaging to a bag.  Frankly, a Ziploc bag would be just as functional (and probably have a lower manufacturing cost) than the box."
raw2
'Product was fresh, and well wrapped.  All done with obvious care.'

<h2>Length</h2>

We start with the simplest metric, the length of the review. For now, there is no need to write a general function for this.

len(ex_text1)
147
len(ex_text2)
14

Note: this is a measurement of the number of tokens, not a strict word count.
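
For example, punctuation marks count toward the token total:

len(nltk.word_tokenize('This was great!'))
4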

<h2>Language Diversity</h2>

Normalizing by the length of the text, we will establish a simple measure of how diverse or rich the language is: the number of unique tokens divided by the length of the text.

def langDiversity(text):
    return len(set(text))/len(text)
langDiversity(ex_text1)
0.6122448979591837
langDiversity(ex_text2)
0.9285714285714286
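
This makes the short-review bias noted in the summary concrete: with only a handful of tokens, few repeat, so the score approaches 1.

langDiversity(nltk.word_tokenize('This was great!'))
1.0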

<h2>Descriptor Density</h2>

We look at the use of all adjectives and adverbs, and divide their count by the length of the text to establish a density score.

#part-of-speech tagging requires the following model to function
nltk.download('averaged_perceptron_tagger')
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Thomas\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
True
def descrDens(text):
    #returns an array of 2-tuples (word, POS)
    pos_tags = nltk.pos_tag(text)
    
    #tags corresponding to adjectives (JJ) and adverbs (RB), where the R and S suffixes indicate comparative and superlative forms, respectively
    pos = ["JJ", "JJR", "JJS", "RB", "RBR", "RBS"]
    
    #we won't use the actual words here, so just establish a count to increment
    count = 0
    
    #perform the actual counting
    for x in pos_tags:
        if x[1] in pos:
            count += 1
            
    return count/len(text)
descrDens(ex_text1)
0.11564625850340136
descrDens(ex_text2)
0.21428571428571427

<h2>Most Frequent Descriptors</h2>

To complement the previous metric, we will collect the 5 most frequently used adjectives and adverbs in each review. But, to improve our tracking of trends within the text, we will only include descriptors that occur more than once.

#this returns the desired results, but we should revisit its efficiency
def freqDescr(text):
    #returns an array of 2-tuples (word, POS)
    pos_tags = nltk.pos_tag(text)
    
    #tags corresponding to adjectives (JJ) and adverbs (RB), where the R and S suffixes indicate comparative and superlative forms, respectively
    pos = ["JJ", "JJR", "JJS", "RB", "RBR", "RBS"]
    
    #collect the descriptor words
    descr = [x[0] for x in pos_tags if x[1] in pos]
    
    #remove words that occur once
    for key,value in Counter(descr).items():
        if value == 1:
            descr.remove(key)
            
    #limit our results to the top 5, will return an array of 2-tuples (word, count)
    return Counter(descr).most_common(5)
freqDescr(ex_text1)
[('functional', 2)]
freqDescr(ex_text2)
[]
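
Note: the singleton-removal loop above works because each key with a count of one occurs exactly once in the list. An equivalent, slightly more direct formulation filters the counts instead; a sketch with the same behavior:

def freqDescr_alt(text):
    #tag, keep the descriptor words, then keep only the top 5 that occur more than once
    pos_tags = nltk.pos_tag(text)
    pos = {"JJ", "JJR", "JJS", "RB", "RBR", "RBS"}
    descr = [word for word, tag in pos_tags if tag in pos]
    return [(word, count) for word, count in Counter(descr).most_common(5) if count > 1]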

<h2>Long-Word Density</h2>

We take 5 as the average length of a word in English (based on some searching, there seem to be a number of different estimates, but they all cluster around five), and we define a word as long if it is longer than this average. We divide the count of long words by the length of the text to establish the density score.

def longWord(token):
    long = [word for word in token if len(word)>5]
    return len(long)/len(token)
longWord(tokens1)
0.19727891156462585
longWord(tokens2)
0.21428571428571427

<h2>Personability</h2>

As a measure of the personability or relatability of the review, we will measure pronoun and possessive usage as a density. This is based on the assumption that other users respond favorably to being able to project themselves into the experiences of others, and that this process is made slightly easier through the use of personal pronouns.

#the mechanics here are very much like the descriptor density function
def pronDens(text):
    #returns an array of 2-tuples (word, POS)
    pos_tags = nltk.pos_tag(text)
    
    #flags corresponding to personal pronouns (PRP) and possessive version (PRP$)
    pos = ["PRP", "PRP$" ]
    
    #we won't use the actual words here, so just establish a count to increment
    count = 0
    
    #perform the actual counting
    for x in pos_tags:
        #exclude "it", "its", and "itself": while tagged as personal pronouns, they are not personable enough for this application
        if 'it' not in x[0].lower():
            if x[1] in pos:
                count += 1
            
    return count/len(text)
pronDens(ex_text1)
0.027210884353741496
pronDens(ex_text2)
0.0
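
As a quick sanity check on the “it” exclusion, we would expect “I” to be counted but “it” to be filtered out, giving 1 personable pronoun among 3 tokens:

pronDens(nltk.word_tokenize('I love it'))
0.3333333333333333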

<h2>Storytelling</h2>

As another perspective on the reasoning underlying the previous metric, we assume that users are able to better identify with those reviews that have a stronger narrative structure. In order to measure the level of storytelling, we will measure the density of the usage of words that denote the passage of time or a sequence of events. The technical implementation here is quite simple (establish key words and check for inclusion), but the ease of this method does create some pitfalls. Notably, such keyword lists are only as useful as they are exhaustive and some included words may have alternate meanings.

#functional, but each membership test against a list is linear; a set makes the lookups constant-time
def storyDens(tokens):
    #adjust our tokens to ignore case
    tokens_uniform = [x.lower() for x in tokens]
    
    #establish our chosen identifiers (as a set, with duplicates removed)
    story_words = {'before','first','next','sometimes','earlier','begin','times','finally','previously','starting','consequently','time','conclusion','formerly','initially','following','occasionally','conclude','previous','originally','turn','periodically','end','past','onset','second','rarely','ultimately','prior','beginning','soon','seldomly','finish','preceding','then','yesterday','outset','henceforth','often','lastly','last','third','occasion','until','subsequently','now','advance','start','later','intermittently','after','afterwards','annual','anytime','belated','day','days','delay','delayed','early','evening','everyday','future','hour','late','midnight','midmorning','midafternoon','afternoon','minute','moment','momentarily','month','morning','night','nighttime','noon','present','schedule','season','someday','sometime','spring','summer','sunrise','sunset','today','tomorrow','tonight','week','winter','fall','autumn','year'}
    
    #establish a count to increment
    count = 0
    
    #perform the actual counting
    for x in tokens_uniform:
        if x in story_words:
            count += 1
            
    return count/len(tokens)
storyDens(ex_text1)
0.006802721088435374
storyDens(ex_text2)
0.0

<h2>Social Network Markers</h2>

As yet another angle of attack on the relatability of a review, we will measure the use of words that connote familial or social connection, e.g., son, husband, kid, friend. Because these words are less likely to appear multiple times and because of their assumed high collinearity with the pronoun density, we will establish this metric as a binary.

#the initial setup is similar to storyDens; the exit conditions are simple
def socialBinary(tokens):
    #adjust our tokens to ignore case
    tokens_uniform = [x.lower() for x in tokens]
    
    #establish our chosen identifiers (as a set, with duplicates removed)
    social_words = {'father','mother','dad','mom','son','daughter','brother','sister','husband','wife','parent','child','sibling','grandfather','grandmother','grandpa','grandma','grandson','granddaughter','grandparent','grandchild','grandchildren','uncle','aunt','cousin','nephew','niece','stepfather','stepdad','stepmother','stepmom','stepson','stepdaughter','stepsister','stepbrother','kid','kids','grandkids','friend','buddy','pal','acquaintance','classmate','partner','roommate','mate','neighbor','associate','colleague','assistant','boss','supervisor','employee','co-worker','teammate'}
    
    #check inclusion, with an exit condition if realized
    for x in tokens_uniform:
        if x in social_words:
            return 1
            
    #this is only reached if the above loop completed without finding any inclusion
    return 0
socialBinary(tokens1)
0
socialBinary(tokens2)
0

<h2>Augmenting our Dataset</h2>

With functions now defined for our metrics, we will apply them to our dataset. Before committing to the full dataset, we will take a slice as a testing environment and ensure we do not hit any errors along the way.

test_reviews = reviews.iloc[:20,:]

First, if we look at the functions we wrote above, we find that three of them utilize the tokens of the review and five utilize the NLTK text. While it will cost more in storage, we will opt to establish the tokens and NLTK text as their own columns. This has the advantage that each review only needs to be tokenized once, after which every metric can be applied row-wise, reducing the computational time.

In order to do this, we will append the two columns to the dataframe, then define functions that will establish the tokens and NLTK text given a row from our dataframe, then use the apply method to populate our columns.

test_reviews = test_reviews.join(pd.DataFrame(
    {
        'Tokens': np.nan,
        'TextNLTK':  np.nan
    }, index=test_reviews.index
))
def tokenize(row):
    raw = row['Text']
    tokens = nltk.word_tokenize(raw)
    return tokens
def text_to_textNLTK(row):
    raw = row['Text']
    tokens = nltk.word_tokenize(raw)
    text =  nltk.Text(tokens)
    return text
test_reviews['Tokens'] = test_reviews.apply(tokenize, axis=1)
test_reviews['TextNLTK'] = test_reviews.apply(text_to_textNLTK, axis=1)
test_reviews
Id ProductId UserId ProfileName HelpfulnessNumerator HelpfulnessDenominator Score Time Summary Text Tokens TextNLTK
0 1 B001E4KFG0 A3SGXH7AUHU8GW delmartian 1 1 5 1303862400 Good Quality Dog Food I have bought several of the Vitality canned d... [I, have, bought, several, of, the, Vitality, ... (I, have, bought, several, of, the, Vitality, ...
1 2 B00813GRG4 A1D87F6ZCVE5NK dll pa 0 0 1 1346976000 Not as Advertised Product arrived labeled as Jumbo Salted Peanut... [Product, arrived, labeled, as, Jumbo, Salted,... (Product, arrived, labeled, as, Jumbo, Salted,...
2 3 B000LQOCH0 ABXLMWJIXXAIN Natalia Corres "Natalia Corres" 1 1 4 1219017600 "Delight" says it all This is a confection that has been around a fe... [This, is, a, confection, that, has, been, aro... (This, is, a, confection, that, has, been, aro...
3 4 B000UA0QIQ A395BORC6FGVXV Karl 3 3 2 1307923200 Cough Medicine If you are looking for the secret ingredient i... [If, you, are, looking, for, the, secret, ingr... (If, you, are, looking, for, the, secret, ingr...
4 5 B006K2ZZ7K A1UQRSCLF8GW1T Michael D. Bigham "M. Wassir" 0 0 5 1350777600 Great taffy Great taffy at a great price. There was a wid... [Great, taffy, at, a, great, price, ., There, ... (Great, taffy, at, a, great, price, ., There, ...
5 6 B006K2ZZ7K ADT0SRK1MGOEU Twoapennything 0 0 4 1342051200 Nice Taffy I got a wild hair for taffy and ordered this f... [I, got, a, wild, hair, for, taffy, and, order... (I, got, a, wild, hair, for, taffy, and, order...
6 7 B006K2ZZ7K A1SP2KVKFXXRU1 David C. Sullivan 0 0 5 1340150400 Great! Just as good as the expensive brands! This saltwater taffy had great flavors and was... [This, saltwater, taffy, had, great, flavors, ... (This, saltwater, taffy, had, great, flavors, ...
7 8 B006K2ZZ7K A3JRGQVEQN31IQ Pamela G. Williams 0 0 5 1336003200 Wonderful, tasty taffy This taffy is so good. It is very soft and ch... [This, taffy, is, so, good, ., It, is, very, s... (This, taffy, is, so, good, ., It, is, very, s...
8 9 B000E7L2R4 A1MZYO9TZK0BBI R. James 1 1 5 1322006400 Yay Barley Right now I'm mostly just sprouting this so my... [Right, now, I, 'm, mostly, just, sprouting, t... (Right, now, I, 'm, mostly, just, sprouting, t...
9 10 B00171APVA A21BT40VZCCYT4 Carol A. Reed 0 0 5 1351209600 Healthy Dog Food This is a very healthy dog food. Good for thei... [This, is, a, very, healthy, dog, food, ., Goo... (This, is, a, very, healthy, dog, food, ., Goo...
10 11 B0001PB9FE A3HDKO7OW0QNK4 Canadian Fan 1 1 5 1107820800 The Best Hot Sauce in the World I don't know if it's the cactus or the tequila... [I, do, n't, know, if, it, 's, the, cactus, or... (I, do, n't, know, if, it, 's, the, cactus, or...
11 12 B0009XLVG0 A2725IB4YY9JEB A Poeng "SparkyGoHome" 4 4 5 1282867200 My cats LOVE this "diet" food better than thei... One of my boys needed to lose some weight and ... [One, of, my, boys, needed, to, lose, some, we... (One, of, my, boys, needed, to, lose, some, we...
12 13 B0009XLVG0 A327PCT23YH90 LT 1 1 1 1339545600 My Cats Are Not Fans of the New Food My cats have been happily eating Felidae Plati... [My, cats, have, been, happily, eating, Felida... (My, cats, have, been, happily, eating, Felida...
13 14 B001GVISJM A18ECVX2RJ7HUE willie "roadie" 2 2 4 1288915200 fresh and greasy! good flavor! these came securely packed... the... [good, flavor, !, these, came, securely, packe... (good, flavor, !, these, came, securely, packe...
14 15 B001GVISJM A2MUGFV2TDQ47K Lynrie "Oh HELL no" 4 5 5 1268352000 Strawberry Twizzlers - Yummy The Strawberry Twizzlers are my guilty pleasur... [The, Strawberry, Twizzlers, are, my, guilty, ... (The, Strawberry, Twizzlers, are, my, guilty, ...
15 16 B001GVISJM A1CZX3CP8IKQIJ Brian A. Lee 4 5 5 1262044800 Lots of twizzlers, just what you expect. My daughter loves twizzlers and this shipment ... [My, daughter, loves, twizzlers, and, this, sh... (My, daughter, loves, twizzlers, and, this, sh...
16 17 B001GVISJM A3KLWF6WQ5BNYO Erica Neathery 0 0 2 1348099200 poor taste I love eating them and they are good for watch... [I, love, eating, them, and, they, are, good, ... (I, love, eating, them, and, they, are, good, ...
17 18 B001GVISJM AFKW14U97Z6QO Becca 0 0 5 1345075200 Love it! I am very satisfied with my Twizzler purchase.... [I, am, very, satisfied, with, my, Twizzler, p... (I, am, very, satisfied, with, my, Twizzler, p...
18 19 B001GVISJM A2A9X58G2GTBLP Wolfee1 0 0 5 1324598400 GREAT SWEET CANDY! Twizzlers, Strawberry my childhood favorite ca... [Twizzlers, ,, Strawberry, my, childhood, favo... (Twizzlers, ,, Strawberry, my, childhood, favo...
19 20 B001GVISJM A3IV7CL2C13K2U Greg 0 0 5 1318032000 Home delivered twizlers Candy was delivered very fast and was purchase... [Candy, was, delivered, very, fast, and, was, ... (Candy, was, delivered, very, fast, and, was, ...

To perform the row-wise operations, the apply method feeds a row into the given function. As such, we will proceed as follows: (1) append our attribute columns with NaN values, (2) define adjusted versions of our attribute functions so that they may leverage our newly created Tokens and TextNLTK columns within the context of an entire row observation, and (3) apply these new functions to populate our newly created columns.

test_reviews = test_reviews.join(pd.DataFrame(
    {
        'Length': np.nan,
        'LanguageDiversity':  np.nan,
        'DescriptorDensity':  np.nan,
        'MostFrequentDescriptors':  np.nan,
        'LongWordDensity':  np.nan,
        'Personability':  np.nan,
        'Storytelling':  np.nan,
        'SocialNetworkBinary':  np.nan
    }, index=test_reviews.index
))
#we didn't define a function for length, but we will do so here given the row considerations
def rowLength(row):
    text = row['TextNLTK']
    return len(text)
test_reviews['Length'] = test_reviews.apply(rowLength, axis=1)
def rowLangDiversity(row):
    text = row['TextNLTK']
    return langDiversity(text)
test_reviews['LanguageDiversity'] = test_reviews.apply(rowLangDiversity, axis=1)
def rowDescrDens(row):
    text = row['TextNLTK']
    return descrDens(text)
test_reviews['DescriptorDensity'] = test_reviews.apply(rowDescrDens, axis=1)
def rowFreqDescr(row):
    text = row['TextNLTK']
    return freqDescr(text)
test_reviews['MostFrequentDescriptors'] = test_reviews.apply(rowFreqDescr, axis=1)
def rowLongWord(row):
    token = row['Tokens']
    return longWord(token)
test_reviews['LongWordDensity'] = test_reviews.apply(rowLongWord, axis=1)
def rowPronDens(row):
    text = row['TextNLTK']
    return pronDens(text)
test_reviews['Personability'] = test_reviews.apply(rowPronDens, axis=1)
def rowStoryDens(row):
    tokens = row['Tokens']
    return storyDens(tokens)
test_reviews['Storytelling'] = test_reviews.apply(rowStoryDens, axis=1)
def rowSocialBinary(row):
    tokens = row['Tokens']
    return socialBinary(tokens)
test_reviews['SocialNetworkBinary'] = test_reviews.apply(rowSocialBinary, axis=1)
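
For reference, the eight wrappers above share one shape; a more compact equivalent maps each new column to its source column and metric function (a sketch with the same behavior as the block above):

#map each new column to (source column, metric function); Length is simply len
metric_map = {
    'Length': ('TextNLTK', len),
    'LanguageDiversity': ('TextNLTK', langDiversity),
    'DescriptorDensity': ('TextNLTK', descrDens),
    'MostFrequentDescriptors': ('TextNLTK', freqDescr),
    'LongWordDensity': ('Tokens', longWord),
    'Personability': ('TextNLTK', pronDens),
    'Storytelling': ('Tokens', storyDens),
    'SocialNetworkBinary': ('Tokens', socialBinary),
}
for col, (source, func) in metric_map.items():
    test_reviews[col] = test_reviews[source].apply(func)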

And our dataframe is now given by:

test_reviews
Id ProductId UserId ProfileName HelpfulnessNumerator HelpfulnessDenominator Score Time Summary Text Tokens TextNLTK Length LanguageDiversity DescriptorDensity MostFrequentDescriptors LongWordDensity Personability Storytelling SocialNetworkBinary
0 1 B001E4KFG0 A3SGXH7AUHU8GW delmartian 1 1 5 1303862400 Good Quality Dog Food I have bought several of the Vitality canned d... [I, have, bought, several, of, the, Vitality, ... (I, have, bought, several, of, the, Vitality, ... 51 0.803922 0.156863 [(better, 2)] 0.294118 0.078431 0.000000 0
1 2 B00813GRG4 A1D87F6ZCVE5NK dll pa 0 0 1 1346976000 Not as Advertised Product arrived labeled as Jumbo Salted Peanut... [Product, arrived, labeled, as, Jumbo, Salted,... (Product, arrived, labeled, as, Jumbo, Salted,... 37 0.837838 0.135135 [] 0.324324 0.000000 0.000000 0
2 3 B000LQOCH0 ABXLMWJIXXAIN Natalia Corres "Natalia Corres" 1 1 4 1219017600 "Delight" says it all This is a confection that has been around a fe... [This, is, a, confection, that, has, been, aro... (This, is, a, confection, that, has, been, aro... 109 0.660550 0.146789 [(tiny, 2)] 0.211009 0.027523 0.009174 1
3 4 B000UA0QIQ A395BORC6FGVXV Karl 3 3 2 1307923200 Cough Medicine If you are looking for the secret ingredient i... [If, you, are, looking, for, the, secret, ingr... (If, you, are, looking, for, the, secret, ingr... 46 0.847826 0.108696 [] 0.239130 0.108696 0.000000 0
4 5 B006K2ZZ7K A1UQRSCLF8GW1T Michael D. Bigham "M. Wassir" 0 0 5 1350777600 Great taffy Great taffy at a great price. There was a wid... [Great, taffy, at, a, great, price, ., There, ... (Great, taffy, at, a, great, price, ., There, ... 32 0.718750 0.125000 [] 0.062500 0.031250 0.000000 0
5 6 B006K2ZZ7K ADT0SRK1MGOEU Twoapennything 0 0 4 1342051200 Nice Taffy I got a wild hair for taffy and ordered this f... [I, got, a, wild, hair, for, taffy, and, order... (I, got, a, wild, hair, for, taffy, and, order... 88 0.738636 0.147727 [(only, 2)] 0.181818 0.079545 0.000000 1
6 7 B006K2ZZ7K A1SP2KVKFXXRU1 David C. Sullivan 0 0 5 1340150400 Great! Just as good as the expensive brands! This saltwater taffy had great flavors and was... [This, saltwater, taffy, had, great, flavors, ... (This, saltwater, taffy, had, great, flavors, ... 57 0.824561 0.175439 [] 0.263158 0.017544 0.000000 0
7 8 B006K2ZZ7K A3JRGQVEQN31IQ Pamela G. Williams 0 0 5 1336003200 Wonderful, tasty taffy This taffy is so good. It is very soft and ch... [This, taffy, is, so, good, ., It, is, very, s... (This, taffy, is, so, good, ., It, is, very, s... 30 0.833333 0.233333 [] 0.200000 0.066667 0.000000 0
8 9 B000E7L2R4 A1MZYO9TZK0BBI R. James 1 1 5 1322006400 Yay Barley Right now I'm mostly just sprouting this so my... [Right, now, I, 'm, mostly, just, sprouting, t... (Right, now, I, 'm, mostly, just, sprouting, t... 29 0.896552 0.172414 [] 0.172414 0.137931 0.034483 0
9 10 B00171APVA A21BT40VZCCYT4 Carol A. Reed 0 0 5 1351209600 Healthy Dog Food This is a very healthy dog food. Good for thei... [This, is, a, very, healthy, dog, food, ., Goo... (This, is, a, very, healthy, dog, food, ., Goo... 29 0.827586 0.206897 [] 0.206897 0.103448 0.000000 0
10 11 B0001PB9FE A3HDKO7OW0QNK4 Canadian Fan 1 1 5 1107820800 The Best Hot Sauce in the World I don't know if it's the cactus or the tequila... [I, do, n't, know, if, it, 's, the, cactus, or... (I, do, n't, know, if, it, 's, the, cactus, or... 184 0.554348 0.130435 [(n't, 3), (hot, 3), (once, 2)] 0.168478 0.081522 0.005435 0
11 12 B0009XLVG0 A2725IB4YY9JEB A Poeng "SparkyGoHome" 4 4 5 1282867200 My cats LOVE this "diet" food better than thei... One of my boys needed to lose some weight and ... [One, of, my, boys, needed, to, lose, some, we... (One, of, my, boys, needed, to, lose, some, we... 72 0.736111 0.111111 [(higher, 2)] 0.152778 0.069444 0.013889 0
12 13 B0009XLVG0 A327PCT23YH90 LT 1 1 1 1339545600 My Cats Are Not Fans of the New Food My cats have been happily eating Felidae Plati... [My, cats, have, been, happily, eating, Felida... (My, cats, have, been, happily, eating, Felida... 86 0.686047 0.162791 [(new, 3), (now, 2)] 0.151163 0.093023 0.046512 0
13 14 B001GVISJM A18ECVX2RJ7HUE willie "roadie" 2 2 4 1288915200 fresh and greasy! good flavor! these came securely packed... the... [good, flavor, !, these, came, securely, packe... (good, flavor, !, these, came, securely, packe... 19 0.842105 0.263158 [] 0.263158 0.052632 0.000000 0
14 15 B001GVISJM A2MUGFV2TDQ47K Lynrie "Oh HELL no" 4 5 5 1268352000 Strawberry Twizzlers - Yummy The Strawberry Twizzlers are my guilty pleasur... [The, Strawberry, Twizzlers, are, my, guilty, ... (The, Strawberry, Twizzlers, are, my, guilty, ... 24 0.916667 0.083333 [] 0.250000 0.125000 0.000000 1
15 16 B001GVISJM A1CZX3CP8IKQIJ Brian A. Lee 4 5 5 1262044800 Lots of twizzlers, just what you expect. My daughter loves twizzlers and this shipment ... [My, daughter, loves, twizzlers, and, this, sh... (My, daughter, loves, twizzlers, and, this, sh... 29 0.862069 0.068966 [] 0.344828 0.068966 0.000000 1
16 17 B001GVISJM A3KLWF6WQ5BNYO Erica Neathery 0 0 2 1348099200 poor taste I love eating them and they are good for watch... [I, love, eating, them, and, they, are, good, ... (I, love, eating, them, and, they, are, good, ... 45 0.777778 0.111111 [] 0.155556 0.200000 0.022222 0
17 18 B001GVISJM AFKW14U97Z6QO Becca 0 0 5 1345075200 Love it! I am very satisfied with my Twizzler purchase.... [I, am, very, satisfied, with, my, Twizzler, p... (I, am, very, satisfied, with, my, Twizzler, p... 28 0.821429 0.142857 [] 0.285714 0.214286 0.000000 0
18 19 B001GVISJM A2A9X58G2GTBLP Wolfee1 0 0 5 1324598400 GREAT SWEET CANDY! Twizzlers, Strawberry my childhood favorite ca... [Twizzlers, ,, Strawberry, my, childhood, favo... (Twizzlers, ,, Strawberry, my, childhood, favo... 157 0.649682 0.063694 [] 0.254777 0.038217 0.006369 0
19 20 B001GVISJM A3IV7CL2C13K2U Greg 0 0 5 1318032000 Home delivered twizlers Candy was delivered very fast and was purchase... [Candy, was, delivered, very, fast, and, was, ... (Candy, was, delivered, very, fast, and, was, ... 31 0.774194 0.129032 [] 0.161290 0.064516 0.000000 0

With these working as intended in our test environment, and acknowledging a hardware limitation on my end, we will follow the same steps for a sample drawn from our dataframe, without further discussion. When done, we will export to a csv to allow for quick entry into future analysis.

reviews = reviews.join(pd.DataFrame(
    {
        'Tokens': np.nan,
        'TextNLTK':  np.nan
    }, index=reviews.index
))
#apply in chunks to keep memory manageable; earlier chunks used this same statement over the preceding row ranges, and only the final chunk is shown
reviews.iloc[550000:,10] = reviews.iloc[550000:,9:12].apply(tokenize, axis=1)
#we will save a version of the dataframe with the populated tokens
reviews.to_csv('reviews_with_tokens.csv', index=False) 
#establish our sample
#we will use a routine sample size for now; the ideal sample size deserves further thought
reviews_sample = reviews.sample(n=1000, random_state=1)
#populate the NLTK text objects
#note: the NLTK Text objects need to be built in an active session, hence we didn't populate them when we saved the tokens
reviews_sample['TextNLTK'] = reviews_sample.apply(text_to_textNLTK, axis=1)
reviews_sample = reviews_sample.join(pd.DataFrame(
    {
        'Length': np.nan,
        'LanguageDiversity':  np.nan,
        'DescriptorDensity':  np.nan,
        'MostFrequentDescriptors':  np.nan,
        'LongWordDensity':  np.nan,
        'Personability':  np.nan,
        'Storytelling':  np.nan,
        'SocialNetworkBinary':  np.nan
    }, index=reviews_sample.index
))
reviews_sample['Length'] = reviews_sample.apply(rowLength, axis=1)
reviews_sample['LanguageDiversity'] = reviews_sample.apply(rowLangDiversity, axis=1)
#sluggish based on my use of NLTK; need to optimize
reviews_sample['DescriptorDensity'] = reviews_sample.apply(rowDescrDens, axis=1)
#need to optimize
reviews_sample['MostFrequentDescriptors'] = reviews_sample.apply(rowFreqDescr, axis=1)
reviews_sample['LongWordDensity'] = reviews_sample.apply(rowLongWord, axis=1)
#need to optimize
reviews_sample['Personability'] = reviews_sample.apply(rowPronDens, axis=1)
reviews_sample['Storytelling'] = reviews_sample.apply(rowStoryDens, axis=1)
reviews_sample['SocialNetworkBinary'] = reviews_sample.apply(rowSocialBinary, axis=1)
#we will drop our NLTK-related columns and the original text before saving our sample dataframe (for memory considerations)
del reviews_sample['Text']
del reviews_sample['Tokens']
del reviews_sample['TextNLTK']
#save final augmented version of dataframe for quicker entry into analysis
reviews_sample.to_csv('reviews_sample_augmented.csv', index=False) 

And, our final result is given by:

reviews_sample.head()
Id ProductId UserId ProfileName HelpfulnessNumerator HelpfulnessDenominator Score Time Summary Length LanguageDiversity DescriptorDensity MostFrequentDescriptors LongWordDensity Personability Storytelling SocialNetworkBinary
288312 288313 B000ENUC3S AN66F3Q4QNU43 Donna Speaker 0 0 5 1340496000 Cherry Pie Larabar 39 0.717949 0.102564 [(low, 2)] 0.076923 0.076923 0.000000 0
431726 431727 B002TMV3CG A3G007LQX6KGOD SevereWX 0 0 5 1287878400 Melitta Coffee 112 0.687500 0.151786 [] 0.294643 0.071429 0.008929 0
110311 110312 B004867T24 A11LNY2OLQSUSV M. Castillo 0 0 5 1331769600 great treat 42 0.880952 0.166667 [] 0.119048 0.095238 0.000000 0
91855 91856 B004U7KPY0 A1QCYVHWO5934U PistolaMia "PistolaMia" 0 0 5 1332806400 Daily Calming 25 0.800000 0.280000 [] 0.240000 0.000000 0.000000 0
338855 338856 B000FD78R0 A30U2QQN2FFHE9 J. Amicucci 2 3 5 1271376000 Best Canned Artichokes Out There! 279 0.562724 0.186380 [(not, 5), (too, 3), (just, 2), (couple, 2), (... 0.182796 0.096774 0.028674 0

<h2>Summarize Data</h2>

Because this project has focused on expanding the dataset we were given with potentially useful attributes of the text review (rather than performing the full analysis of how these contribute to the helpfulness score, which will be the focus of future work), we will end with a simple summary of our data through a view of the distributions of our new attributes.

import matplotlib.pyplot as plt
columns_for_hist = ['Length','LanguageDiversity','DescriptorDensity','LongWordDensity','Personability','Storytelling','SocialNetworkBinary']

fig = plt.figure(figsize=[15, 30])

for i, col in enumerate(columns_for_hist):
    #setup position on canvas
    ax = fig.add_subplot(4, 2, i+1)
    
    #the actual plot
    ax.hist(reviews_sample[col])
    ax.set_title(col)
    ax.xaxis.set_ticks_position('bottom')

plt.subplots_adjust(hspace = 0.2)
plt.show()

[figure: histograms of the seven attribute distributions listed above]

<h2>Future Work</h2>

I would also like to establish metrics for spelling-error density (NLTK can’t do this, but the Pattern module seems like a good option); how clean the structure of the review is (that is, I assume reviews that are closer to bullet points are more often marked as helpful than walls of text), probably in the form of some weighted scoring; the occurrence of abbreviations and “sms-speech”, e.g., “u”, “gr8”; and whether an edit has occurred.

I need to spend some more time optimizing this workflow. NLTK doesn’t scale very well across numerous short texts, so I need to think through precisely where it’s needed and the most efficient method to pipe it in.
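
One concrete optimization along these lines: the workflow above calls nltk.pos_tag up to three times per review (descriptor density, frequent descriptors, and personability). Tagging each review once with nltk.pos_tag_sents and deriving those metrics from the shared tags should cut that cost; a minimal sketch, assuming a version of reviews_sample that still carries its Tokens column:

#tag every tokenized review once, in a single batched call
reviews_sample['POSTags'] = nltk.pos_tag_sents(reviews_sample['Tokens'].tolist())

#metrics can then consume the precomputed tags rather than re-tagging, e.g.:
def descrDensFromTags(pos_tags):
    #density of adjectives and adverbs, computed from existing (word, tag) pairs
    pos = {"JJ", "JJR", "JJS", "RB", "RBR", "RBS"}
    return sum(1 for word, tag in pos_tags if tag in pos) / len(pos_tags)

reviews_sample['DescriptorDensity'] = reviews_sample['POSTags'].apply(descrDensFromTags)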

Most of these attributes were either low-hanging fruit or aspects whose impact I was simply curious about, and this project was intended to build some creative muscles. Expanding this work with more established methodologies is certainly necessary, but will take more research.

I have omitted sentiment analysis. It would be simple enough to apply some density scoring for chosen word groups, but I prefer to delay this feature until I can perform a deeper dive.

The most glaring omission must certainly be that we have not performed any regression to measure the utility of these attributes. We will do so in a follow-up project.