Input Encoding for Machine Learning

Machine learning methods tend to rely on numerical input. Many types of data, such as user reviews, come in textual form; some conversion is needed to go from e.g. a review text to some sort of numerical format that can be used as input for a neural net.

One of the more common approaches is discussed here.

The Pang/Lee Sentiment Polarity Dataset

This dataset of 2000 movie reviews categorized into positive and negative sentiment classes has been used in a number of studies, and it still provides a good basis for comparing machine learning methods.

http://www.cs.cornell.edu/people/pabo/movie-review-data/

Download and unpack the .tar file in the current directory of your notebook or Python script.
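
For convenience, downloading and unpacking can also be done from Python; a minimal sketch, assuming the archive is named review_polarity.tar.gz as listed on the download page:

import tarfile
import urllib.request

url = 'http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz'
urllib.request.urlretrieve(url, 'review_polarity.tar.gz')
with tarfile.open('review_polarity.tar.gz') as tar:
    tar.extractall('.')   # should create txt_sentoken/pos/ and txt_sentoken/neg/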

Files are organized in the directories txt_sentoken/pos/ and txt_sentoken/neg/, with each review in an individual file, such as

txt_sentoken/pos/cv754_7216.txt

kolya is one of the richest films i've seen in some time . zdenek sverak plays a confirmed old bachelor ( who's likely to remain so ) , who finds his life as a czech cellist increasingly impacted by the five-year old boy that he's taking care of . though it ends rather abruptly-- and i'm whining , 'cause i wanted to spend more time with these characters-- the acting , writing , and production values are as high as , if not higher than , comparable american dramas . this father-and-son delight-- sverak also wrote the script , while his son , jan , directed-- won a golden globe for best foreign language film and , a couple days after i saw it , walked away an oscar . in czech and russian , with english subtitles .

We observe

  • the text is already split into tokens with whitespace (blank or newline) as separator
  • the text is all lower-case; this is both a blessing and a curse:
    • for our hot word approach this is fine
    • entity recognition e.g. of proper names would be more difficult

We also see typical problems in data from real sources such as movie reviews: the double dash '--' was not separated from the preceding word. Expect things like that to happen; never rely on anything to work perfectly.
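
If we wanted to repair this particular case, one option is to pad the double dash with spaces before splitting; a minimal sketch:

text = "it ends rather abruptly-- and i'm whining"
print(text.replace('--', ' -- ').split())
# ['it', 'ends', 'rather', 'abruptly', '--', 'and', "i'm", 'whining']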

  • We can use the read() function to get the whole content of the file
  • and then split() without a parameter, i.e. splitting on any whitespace, to get the tokens

Let's print the content of a positive and a negative review and check whether the splitting works as expected:

In [395]:
filelst = ['txt_sentoken/pos/cv754_7216.txt','txt_sentoken/neg/cv435_24355.txt']
for fn in filelst:
    print(fn)
    print(open(fn).read().split())
    
txt_sentoken/pos/cv754_7216.txt
['kolya', 'is', 'one', 'of', 'the', 'richest', 'films', "i've", 'seen', 'in', 'some', 'time', '.', 'zdenek', 'sverak', 'plays', 'a', 'confirmed', 'old', 'bachelor', '(', "who's", 'likely', 'to', 'remain', 'so', ')', ',', 'who', 'finds', 'his', 'life', 'as', 'a', 'czech', 'cellist', 'increasingly', 'impacted', 'by', 'the', 'five-year', 'old', 'boy', 'that', "he's", 'taking', 'care', 'of', '.', 'though', 'it', 'ends', 'rather', 'abruptly--', 'and', "i'm", 'whining', ',', "'cause", 'i', 'wanted', 'to', 'spend', 'more', 'time', 'with', 'these', 'characters--', 'the', 'acting', ',', 'writing', ',', 'and', 'production', 'values', 'are', 'as', 'high', 'as', ',', 'if', 'not', 'higher', 'than', ',', 'comparable', 'american', 'dramas', '.', 'this', 'father-and-son', 'delight--', 'sverak', 'also', 'wrote', 'the', 'script', ',', 'while', 'his', 'son', ',', 'jan', ',', 'directed--', 'won', 'a', 'golden', 'globe', 'for', 'best', 'foreign', 'language', 'film', 'and', ',', 'a', 'couple', 'days', 'after', 'i', 'saw', 'it', ',', 'walked', 'away', 'an', 'oscar', '.', 'in', 'czech', 'and', 'russian', ',', 'with', 'english', 'subtitles', '.']
txt_sentoken/neg/cv435_24355.txt
['a', 'couple', 'of', 'criminals', '(', 'mario', 'van', 'peebles', 'and', 'loretta', 'devine', ')', 'move', 'into', 'a', 'rich', "family's", 'house', 'in', 'hopes', 'of', 'conning', 'them', 'out', 'of', 'their', 'jewels', '.', 'however', ',', 'someone', 'else', 'steals', 'the', 'jewels', 'before', 'they', 'are', 'able', 'to', 'get', 'to', 'them', '.', 'writer', 'mario', 'van', 'peebles', 'delivers', 'a', 'clever', 'script', 'with', 'several', 'unexpected', 'plot', 'twists', ',', 'but', 'director', 'mario', 'van', 'peebles', 'undermines', 'his', 'own', 'high', 'points', 'with', 'haphazard', 'camera', 'work', ',', 'editing', 'and', 'pacing', '.', 'it', 'felt', 'as', 'though', 'the', 'film', 'should', 'have', 'been', 'wrapping', 'up', 'at', 'the', 'hour', 'mark', ',', 'but', 'alas', 'there', 'was', 'still', '35', 'more', 'minutes', 'to', 'go', '.', 'daniel', 'baldwin', '(', 'i', "can't", 'believe', "i'm", 'about', 'to', 'type', 'this', ')', 'gives', 'the', 'best', 'performance', 'in', 'the', 'film', ',', 'outshining', 'the', 'other', 'talented', 'members', 'of', 'the', 'cast', '.', '[r]']

There is the occasional weird element, such as '[r]', but generally the data looks good.

We can now encode our data into a numerical format.

Strictly speaking we already have that, by just taking the ordinal values ord(c) of the characters in the character set, or even the bit vector encoding each character:

In [396]:
[ ord(c) for c in 'films' ]
Out[396]:
[102, 105, 108, 109, 115]
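
The bit vectors mentioned above can be derived from these values; a minimal sketch using 8 bits per character, which covers ASCII:

[ format(ord(c), '08b') for c in 'films' ]
# ['01100110', '01101001', '01101100', '01101101', '01110011']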

Our simple neural net would have a hard time learning from this character-level representation. Nevertheless, similar character-level methods have been applied successfully with other network architectures.

For our simple single-layer feed-forward neural network, encoding the individual words is the more common approach.

One-hot Encoding

This method takes a list of words w and transforms each text t into an encoding vector e where

e[i] = 1 if t contains w[i], and 0 otherwise

This results in a numeric vector of 0s and 1s with length equal to the number of words in w.

For a list of H hot words and N observations we get an N x H matrix.
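
Written as a function, the definition translates almost literally; a minimal sketch (the helper name one_hot is ours):

def one_hot(t, w):
    ts = set(t)                            # membership test is fast on a set
    return [ int(wi in ts) for wi in w ]   # e[i] = 1 if t contains w[i], 0 otherwise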

  • This method is simple and easy to implement
  • It is also very fast
  • It works moderately well for common tasks like sentiment detection
  • It completely ignores the word order

To get some idea of what to use for our hot words we turn the token lists into sets and use intersection():

In [397]:
s1 = set(open(filelst[0]).read().split())
s2 = set(open(filelst[1]).read().split())
print('tokens in both:')
print(s1.intersection(s2))
tokens in both:
{'of', 'the', 'best', 'are', 'couple', 'script', 'it', 'and', 'i', 'though', 'film', '(', 'this', ')', 'a', 'with', '.', 'as', 'to', 'his', 'more', "i'm", 'high', 'in', ','}

Let's illustrate the idea on our two sample reviews with a very small list of hot words.

In [398]:
import numpy as np

hotwords = ['golden','high','mostly','script','haphazard']
enc = []
for fn in filelst:
    tokens = open(fn).read().split()
    enc.append([ int(hotwords[i] in tokens) for i in range(len(hotwords)) ])
enc = np.asarray(enc)
print(enc)
[[1 1 0 1 0]
 [0 1 0 1 1]]

Some words, like golden and haphazard, carry obvious meaning for sentiment. A particular challenge in this approach is to select a suitable set of hot words for encoding. We take the easy way and simply

  • read in the whole corpus
  • get a list of the most frequent words as candidates

We read the corpus by reading each file in the directory pos/ and assigning a y-value of 1, and do the same for neg/ with a y-value of 0.

In [399]:
import os
from collections import defaultdict
import random

data = []
y = []
d = 'txt_sentoken/'
for sd in ['pos/','neg/']:
    for fn in os.listdir(d + sd):
        data.append( set(open(d+sd+fn).read().split()) )
        y.append( int(sd=='pos/') )
        
print(len(data), len(y))
2000 2000

To get an idea of how that worked we print some data.

Since the reviews are now sets of words there is no longer any order, so we pick random words.

In [400]:
random.seed(17)
# random.sample() requires a sequence in recent Python versions, so convert the sets to lists
print(random.sample(list(data[0]), 10), y[0])
print(random.sample(list(data[-1]), 10), y[-1])
['bit', 'truly', ',', 'find', 'search', 'and', 'fact', 'really', 'become', 'an'] 1
['for', 'can', 'completely', 'backbone', 'more', 'if', "we've", '.', 'guilty', 'cushions'] 0

For the hot words we have to come up with some selection criteria.

  • they have to be present in many reviews, otherwise the encoding will be mostly 0
  • they have to convey meaning, so very frequent words should be excluded

Let's count for each word the number of reviews it occurs in (since each review is now a set, a word counts at most once per review):

In [401]:
wcnt = defaultdict(int)
for wset in data:
    for w in wset:
        wcnt[w] += 1
        

To get an idea of the result we print the most frequent words:

In [402]:
print(sorted(wcnt, key=wcnt.get, reverse=True)[:100])
['.', 'the', 'of', 'and', 'to', ',', 'a', 'is', 'in', 'that', 'with', 'it', 'for', 'as', ')', '(', 'but', 'this', 'on', 'an', 'are', 'by', 'be', 'his', 'one', 'who', 'film', 'at', 'from', 'not', 'have', 'he', 'has', 'all', '"', 'movie', 'i', 'out', 'was', 'more', 'so', 'like', 'about', 'when', 'they', 'up', 'you', 'or', 'some', 'if', 'what', 'just', 'which', 'into', 'only', 'their', 'there', 'even', "it's", 'than', ':', '?', 'time', 'can', 'no', 'most', 'good', 'him', 'much', 'her', 'would', 'other', 'been', 'get', 'its', 'also', 'will', 'do', 'after', 'story', 'them', 'two', 'first', 'character', 'we', 'way', 'make', 'well', 'see', 'very', 'does', 'while', 'any', 'characters', 'too', 'because', 'where', 'little', 'how', 'had']
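
To pick sensible limits for the selection below it helps to look at the counts themselves; a small sketch:

for w in sorted(wcnt, key=wcnt.get, reverse=True)[:5]:
    print(w, wcnt[w])   # word and number of reviews containing it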

Now we employ a very simple strategy for hot word selection, using lower and upper limits on the number of occurrences:

In [403]:
hotw = [ k for k in wcnt if wcnt[k] > 100 and wcnt[k] < 500 ]
print(hotw)
print(len(hotw))
["can't", 'guess', 'used', 'original', 'special', 'clever', 'finally', 'excellent', 'highly', 'often', 'talk', 'put', 'school', 'material', 'especially', 'definitely', 'shot', 'entire', 'single', 'considering', 'oh', 'word', 'must', 'audiences', 'approach', 'truly', 'getting', 'smart', 'quite', 'project', 'help', 'leave', 'million', 'probably', 'feel', 'bit', 'found', 'video', 'opening', 'effects', 'beginning', 'successful', "i'm", 'times', 'bunch', 'watching', 'amazing', '3', 'onto', 'become', 'our', 'supposed', 'yourself', 'yes', 'simply', 'budget', 'possibly', 'lives', 'paul', 'certain', 'appears', 'following', 'hardly', 'crew', 'question', 'ago', 'soundtrack', 'perfectly', 'fine', 'each', 'case', 'martin', 'comedy', 'unique', 'mr', 'murder', 'talent', 'pictures', 'huge', 'given', 'bruce', 'hard', 'got', 'plan', 'follows', 'entertaining', 'always', 'true', 'mention', 'entirely', 'presence', 'score', 'red', 'classic', 'stuff', 'effect', 'living', 'familiar', 'leads', 'along', 'worth', 'drama', 'opens', 'star', 'personal', 'scott', 'later', 'point', 'room', 'might', 'line', 'turn', 'father', 'consider', 'music', '*', 'tale', 'general', 'violent', 'share', 'completely', 'proves', 'face', 'doubt', 'van', "film's", 'subject', 'second', 'feature', 'elements', 'sets', 'history', 'battle', 'fairly', 'exactly', 'direction', 'sequences', 'level', 'involving', 'production', 'immediately', 'use', 'mark', 'reason', 'american', 'himself', 'night', 'impressive', 'giving', 'interesting', 'child', 'constantly', 'whose', 'screenplay', 'various', 'themselves', 'picture', 'de', 'small', 'eyes', 'throughout', 'already', 'otherwise', 'shows', 'poor', 'looks', 'minute', 'within', 'sense', 'sure', 'friends', 'wrong', 'under', 'top', 'recent', 'rich', 'boy', 'comic', 'element', 'fun', 'doing', 'set', 'clearly', 'john', 'stupid', 'series', 'somewhat', 'order', 'attention', 'figure', 'sequence', 'came', 'attempt', 'aspect', 'silly', 'relationship', 'certainly', 'version', 'making', 'latest', 'note', 'biggest', 'perhaps', 'mother', 'effective', 'straight', 'brief', 'ultimately', "who's", 'matter', 'job', 'clear', 'upon', 'famous', 'particularly', 'example', 'late', 'others', 'career', 'background', 'style', 'performances', 'camera', 'home', 'merely', 'female', 'open', 'none', 'similar', 'jack', 'wife', 'obvious', 'dialogue', 'opportunity', 'james', 'team', 'details', 'early', 'lack', 'laughs', 'piece', 'during', 'group', 'less', "she's", 'giant', 'interested', '--', 'running', 'among', 'sounds', "i've", 'watch', 'heart', 'hilarious', 'anything', 'space', 'slow', 'deal', 'convincing', 'light', 'dream', 'member', 'studio', 'including', 'business', 'due', 'earlier', "wasn't", 'done', 'quickly', 'robert', 'needs', 'sex', 'lee', 'key', 'year', 'far', 'wonder', 'alone', 'screenwriter', 'number', 'parents', 'contains', 'society', 'roles', 'sees', 'entertainment', 'major', 'planet', 'written', 'day', "i'd", 'alien', "you're", 'apart', 'motion', 'herself', 'thought', 'actress', 'company', 'girl', 'half', "let's", 'couple', 'king', 'earth', 'released', 'soon', 'jim', 'short', 'four', 'truth', 'happens', 'hands', 'thriller', 'yet', 'until', 'directed', 'dark', "you'll", 'called', 'fight', 'tell', 'ground', 'lies', 'genre', 'took', 'maybe', 'oscar', 'lost', 'basically', 'horror', 'hour', 'death', 'okay', 'unfortunately', 'slowly', 'fall', 'believe', 'once', 'guy', 'sometimes', 'features', 'next', 'wait', 'fiction', 'stories', 'state', 'central', 'inside', 'itself', 'god', 'men', 'three', 'evil', 'main', 'power', 'nice', 
'else', 'seriously', 'form', 'idea', 'appear', 'final', 'cinema', 'favorite', 'anyone', 'white', 'lines', 'eye', 'try', 'thus', 'words', 'house', 'women', 'ending', 'review', 'expect', 's', 'ben', 'except', 'hit', 'joke', 'stop', 'cut', 'killing', 'spend', 'brings', 'minutes', 'lots', 'flaws', 'suddenly', 'laugh', 'turned', 'humor', 'minor', 'ways', 'start', 'problem', 'remember', 'flick', 'include', 'overall', 'works', 'starts', 'hours', 'chance', 'trouble', 'forget', 'previous', 'hand', 'puts', 'rather', 'science', 'alive', 'break', 'begin', 'several', 'son', 'known', 'obviously', 'perfect', 'pay', 'high', 'kids', 'tom', 'someone', 'powerful', 'provide', 'occasionally', 'says', 'problems', 'five', 'casting', 'want', 'situation', 'course', 'play', 'moments', 'working', 'large', 'hero', 'different', 'instead', "'", 'talking', 'named', 'moves', 'future', '1', 'finds', 'knows', 'slightly', 'friend', 'nor', 'meets', 'romance', 'third', 'playing', 'british', 'fan', 'either', 'return', 'thrown', 'seem', 'neither', 'fans', 'happy', 'possible', 'human', 'having', 'gives', 'easy', '2', 'difficult', 'easily', 'discovers', 'former', 'local', 'left', 'forced', 'further', 'hell', "we're", 'becomes', 'likely', 'taken', 'cinematic', 'am', 'manages', 'nature', 'place', 'title', 'against', '&', 'feels', 'fast', 'run', 'past', 'cheap', 'everything', 'wish', 'cinematography', 'uses', 'whether', 'looking', 'road', 'able', 'control', 'theater', 'sort', 'george', 'together', 'told', 'viewer', 'common', 'pretty', 'myself', 'focus', 'leaving', 'show', "wouldn't", 'beyond', 'mostly', 'eventually', 'turns', 'side', 'annoying', 'critics', 'create', 'coming', 'memorable', 'appearance', 'free', 'missing', 'tells', 'care', 'mind', 'live', 'town', 'e', 'stay', 'crime', 'offers', 'realistic', 'complex', 'head', 'mission', 'everyone', 'writer', 'deserves', 'money', 'police', '-', 'wanted', 'hollywood', 'experience', 'boring', 'dead', 'keeps', 'peter', 'usual', 'keep', 'sound', "you've", 'delivers', 'seemingly', 'remains', 'dumb', 'full', 'pace', 'starring', 'dramatic', 'david', 'beautiful', 'falls', 'supporting', 'cannot', 'surprisingly', 'involved', 'learn', 'expected', 'exciting', 'development', 'actor', 'thinks', 'across', 'despite', 'theme', 'somehow', 'america', 'chris', 'cop', 'attempts', 'give', 'killer', 'fear', "what's", "won't", 'depth', 'tv', 'rating', 'predictable', 'name', 'private', 'richard', 'lead', 'believable', 'liked', 'looked', 'extremely', 'wants', 'wasted', 'behind', 'rest', 'unlike', 'woman', 'worked', 'provides', 'means', 'fire', 'important', 'killed', 'escape', 'created', 'whom', 'suspense', 'prison', 'setting', 'moving', 'trying', 'taking', 'add', 'law', 'gave', 'blood', 'hate', 'ones', 'violence', 'emotional', 'telling', 'scary', 'black', 'runs', 'sent', 'person', 'brought', 'book', 'apparently', 'effort', 'said', 'stars', 'impossible', 'guys', 'chase', 'kid', 'cool', 'type', 'ride', 'gun', 'begins', 'enjoy', 'wild', 'based', 'solid', 'nearly', 'art', 'present', "i'll", 'release', 'happened', 'shots', 'reality', 'move', 'filmmakers', 'parts', "didn't", 'actual', 'close', 'seeing', 'basic', 'enjoyable', 'leaves', 'totally', 'moment', 'talented', 'wonderful', 'strange', 'felt', 'ask', 'hope', 'let', 'water', 'showing', 'modern', 'save', 'whole', 'events', 'middle', 'william', 'game', 'viewers', 'climax', 'surprise', 'ten', 'kevin', 'fellow', 'york', 'stands', 'gone', 'plenty', 'quality', 'thinking', 'ends', 'r', 'feeling', 'secret', 'novel', 'writing', 'asks', 'happen', 'language', 'meanwhile', 
'strong', 'pull', 'rated', 'tension', 'city', 'reasons', 'conclusion', 'children', 'went', 'follow', 'incredibly', 'amount', 'subtle', 'decide', 'kill', "they're", 'family', 'air', "aren't", 'body', 'kind', 'read', 'directing', "haven't", 'summer', 'tone', 'car', 'understand', 'husband', 'visual', 'call', 'indeed', 'leading', 'mysterious', 'bring', 'heard', 'dr', 'daughter', 'act', 'message', 'longer', 'produced', 'saw', 'bill', 'seemed', 'tries', 'knew', 'party', 'age', 'usually', 'steve', 'popular', 'brothers', 'above', 'imagine', 'situations', 'led', 'saying', 'class', 'front', 'computer', 'chemistry', 'interest', 'hold', 'television', 'hear', 'admit', 'complete', 'romantic', 'greatest', 'particular', 'premise', 'force', 'result', 'days', 'michael', 'credit', 'ridiculous', 'sit', 'shown', 'simple', 'typical', 'points', 'quick', 'english', 'realize', '10', 'questions', 'manner', 'voice', 'need', 'serious', 'decides', 'office', 'using', 'whatever', 'dog', 'absolutely', 'war', 'towards', 'credits', 'success', 'box', 'girlfriend', 'caught', 'thanks', 'anyway', 'spent', 'potential', 'members', 'sequel', 'jokes', 'brother', "couldn't", 'flat', 'deep', 'recently', 'purpose', 'amusing', 'ideas', 'williams', 'mean', 'intelligence', 'cold', 'writers', 'change', 'agent', 'brilliant', 'atmosphere', 'meet', 'rock', 'terrible', 'sweet', 'fails', 'list', 'ability', 'worse', 'stand', 'die', 'outside', 'wrote', 'married', 'choice', 'joe', 'villain', 'street', 'decent', 'energy', 'worst', 'becoming', 'somewhere', 'near', 'sexual', 'six', "movie's", 'intelligent', 'view', 'aside', "year's", 'waste', 'country', 'filled', 'charm', 'mystery', 'robin', 'dull', 'awful', 'mess']
892

Our single-layer feed-forward net is quite efficiently implemented, and an encoding vector of several hundred columns is computationally feasible for a few thousand observations, even on low-end computers.

In [404]:
X = np.asarray([ [ int(hotw[i] in obs) for i in range(len(hotw)) ] for obs in data ])

y = np.asarray(y)
print(X.shape, y.shape)
(2000, 892) (2000,)

Our data is now converted into numerical input. The array consists of 0s and 1s only.

In [405]:
print(X)
[[1 1 1 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [1 1 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 [0 0 0 ... 1 1 0]]
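
For much larger hot word lists a dense array can get wasteful, since most entries are 0; a sparse representation is one option. A minimal sketch, assuming SciPy is available:

from scipy import sparse

Xs = sparse.csr_matrix(X)   # stores only the non-zero entries
print(Xs.shape, Xs.nnz)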

How many bits are hot for the average observation:

In [406]:
np.mean(np.sum(X, axis=1))
Out[406]:
89.729

That is about 10% of the 892 hot words per review; this seems reasonable.

Recall our single-layer neural net with sigmoid activation function:

In [407]:
f = lambda x: 1 / (1 + np.exp(-x))   # sigmoid activation
f_ = lambda x: f(x) * (1 - f(x))     # derivative of the sigmoid

def nn1(X, y, l=0.01, epochs=200):
    w = np.random.rand(X.shape[1]) - 0.5          # random weights in [-0.5, 0.5)
    for ep in range(epochs):
        h = np.dot(X, w)                          # net input for all observations
        o = f(h)                                  # output of the sigmoid unit
        w += np.dot(-l * (o - y) * f_(h), X)      # gradient descent weight update
        if ep % (epochs/5) == 0: print(sum(abs(o-y))/len(X))   # mean absolute error
    return w, sum(abs(o-y)/len(X))

w, e = nn1(X, y)
0.47022495366872763
0.18892601323783778
0.12239850467735822
0.09382104286182694
0.07704960401683984

Even with this very simple approach, both for the neural net and for the encoding, the error drops dramatically from the initial value of about 50% expected for random guessing.

Of course this performance on the training data is not very relevant for practical problems where we want to detect the sentiment of new observations.

A more useful measure is the performance on a separate part of the dataset that serves as test data.

For this purpose we perform a split into training and testing data. Obviously we have to split X and y in the same fashion, with observations and classes still corresponding.

Here is just one way of doing this.

In [408]:
ix = random.sample(range(len(X)), len(X))
print(ix[:10])
X = X[ix,:]
y = y[ix]

print(y[:10])
[281, 1128, 127, 286, 1664, 1621, 1792, 402, 309, 1807]
[1 0 1 1 0 0 0 1 1 0]
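
For reference, an equivalent shuffle can be done with NumPy; a sketch of the alternative:

ix = np.random.permutation(len(X))
X, y = X[ix,:], y[ix]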

The data are now in random order. We use an 80:20 split for training and testing, a common choice.

In [409]:
lim = int(0.8 * len(X))
X_train, y_train = X[:lim,], y[:lim]
X_test, y_test = X[lim:,], y[lim:]

w, e = nn1(X_train, y_train)
0.4974441479033672
0.17823658768218686
0.10732427726643483
0.09274026775036188
0.0771213974318786

Now the net is trained, and we apply the weight vector to the test data, i.e. we only apply the feed-forward part of the network code:

In [410]:
def fwd(X, y, w):
    # feed-forward pass only; returns the mean absolute error on the given data
    h = np.dot(X, w)
    o = f(h)
    return sum(abs(o-y)/len(X))

fwd(X_test, y_test, w)
Out[410]:
0.25182068494963505

As expected, the performance on the test data is much worse than on the training data.
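
Note that the numbers reported are mean absolute errors of the sigmoid output, not classification accuracies. To count actual misclassifications we can threshold the output at 0.5; a minimal sketch (the helper name accuracy is ours):

def accuracy(X, y, w):
    o = f(np.dot(X, w))                           # feed-forward pass
    return np.mean((o > 0.5).astype(int) == y)    # fraction of correct class labels

print(accuracy(X_test, y_test, w))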

EXERCISES:

  • find other data sets with text input that needs to be categorized
  • apply the code, interpret the results
  • make a few changes and see what happens
  • experiment with other forms of encoding, such as using the number of occurrences instead of 0/1
  • write the code again from scratch, without copying from anywhere

The last step will really bring your programming skills to the next level.
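
As a starting point for the count-based encoding exercise, a minimal sketch (note that the observations would then have to be kept as token lists rather than sets; the helper name count_encode is ours):

from collections import Counter

def count_encode(tokens, hotwords):
    cnt = Counter(tokens)                 # occurrences of each token in this review
    return [ cnt[w] for w in hotwords ]   # count instead of 0/1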
