-
Splitting text into words to count them serially
-
Counting the words of a single book this way may be easy, but fitting every book on one disk is impossible.
-
Each word in the documents can be mapped to a reducer based on its key.
-
Before the reduce step, there are multiple keys (words), each with the value 1.
-
The reducers then produce the total count for each word (key), so the result holds one value per word: the number of times it occurs.
-
As a result, if we run the code on the sentence 'Hello my name is Dave, Dave is my name', it produces all the (key, value) tuples above.
-
(Recall string substitution: placeholder 0 is replaced with cleaned_data and placeholder 1 with the value 1.)
-
The code above is the 'mapper' function.
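A minimal sketch of what such a mapper might look like, assuming tab-separated `key\tvalue` output. The function name `mapper`, the list return value, and the cleaning rule (strip surrounding punctuation, lowercase) are assumptions made for illustration; only `cleaned_data` and the `{0}`/`{1}` substitution come from the notes.

```python
import string

def mapper(text):
    """Return one 'word\\t1' line for every word in text."""
    lines = []
    for word in text.split():
        # Hypothetical cleaning rule: drop surrounding punctuation, lowercase.
        cleaned_data = word.strip(string.punctuation).lower()
        if cleaned_data:
            # Placeholder 0 becomes the word, placeholder 1 becomes the count 1.
            lines.append("{0}\t{1}".format(cleaned_data, 1))
    return lines

mapped = mapper("Hello my name is Dave, Dave is my name")
```

Each word yields its own line, so a repeated word such as 'dave' appears once per occurrence, always with the value 1.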
-
Then we shuffle the pairs to the reducers based on their keys. If we have two reducers, the keys are split in half between them.
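The split can be sketched as follows, assuming the keys are sorted and then divided into equal-sized contiguous ranges, one per reducer. The function name `shuffle` and the range-based split are assumptions for illustration; a real framework may assign keys differently (for example by hashing the key).

```python
def shuffle(pairs, num_reducers=2):
    """Sort (key, value) pairs and split the distinct keys evenly across reducers."""
    keys = sorted({key for key, _ in pairs})
    # Ceiling division: how many distinct keys each reducer receives at most.
    chunk = -(-len(keys) // num_reducers) or 1
    assignment = {key: i // chunk for i, key in enumerate(keys)}
    buckets = [[] for _ in range(num_reducers)]
    for key, value in sorted(pairs):
        # Every pair with the same key lands in the same bucket.
        buckets[assignment[key]].append((key, value))
    return buckets

pairs = [("hello", 1), ("my", 1), ("name", 1), ("is", 1), ("dave", 1),
         ("dave", 1), ("is", 1), ("my", 1), ("name", 1)]
first_half, second_half = shuffle(pairs, num_reducers=2)
```

What matters is that all pairs sharing a key reach the same reducer; otherwise no single reducer could produce the full count for that word.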
-
The reducer takes a line such as 'my\t1'.
-
It splits the line on '\t', producing a list: ['my', '1'] (the count is still a string at this point).
-
It then checks that the list really has length 2; otherwise it breaks.
-
If the old key differs from the key we currently have, it initializes: it assigns the new key and sets word_count = 0.
-
It then adds the count (which is 1). After that, whenever the same key arrives again, it just increments word_count by count.
-
Finally, we print each key with its count, provided the key is not None.
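The reducer steps above can be sketched as one function, assuming its input lines are already sorted by key. The names `reducer` and `old_key` and the list return value (instead of printing) are assumptions for illustration; `word_count` and `count` follow the notes.

```python
def reducer(lines):
    """Sum the counts per key from 'key\\tcount' lines that are sorted by key."""
    totals = []
    old_key = None
    word_count = 0
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) != 2:
            # Stop on a malformed line, as the notes describe;
            # using `continue` here would skip the line instead.
            break
        key, count = parts[0], int(parts[1])
        if old_key is not None and key != old_key:
            # The key changed: emit the finished key and reset the counter.
            totals.append((old_key, word_count))
            word_count = 0
        old_key = key
        word_count += count
    if old_key is not None:
        # Emit the last key, which no later key change would flush.
        totals.append((old_key, word_count))
    return totals

counts = reducer(["dave\t1", "dave\t1", "hello\t1", "is\t1", "is\t1"])
```

Only the current key and its running total are kept in memory, which is why the input has to arrive grouped by key.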
-
Note that this relies on all the keys having been shuffled, which means the keys arrive sorted.
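A small sketch of why the sorted order matters, using `itertools.groupby`, which (like the reducer described above) only merges keys that are adjacent. The helper name `count_adjacent` and the sample pairs are assumptions for illustration.

```python
from itertools import groupby

def count_adjacent(pairs):
    """Sum the values of consecutive pairs that share the same key."""
    return [(key, sum(value for _, value in group))
            for key, group in groupby(pairs, key=lambda pair: pair[0])]

pairs = [("my", 1), ("name", 1), ("my", 1)]
unsorted_counts = count_adjacent(pairs)         # 'my' is reported twice
sorted_counts = count_adjacent(sorted(pairs))   # 'my' is reported once
```

On the unsorted input the two 'my' pairs are not adjacent, so they end up as two separate entries; after sorting they are merged into a single count.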