-
This lesson teaches us how to analyze a Twitter dataset.
-
The dataset is social network data in general, and tweets from Twitter in particular.
-
The lesson introduces the Aggregation Framework, MongoDB's powerful data analysis tool, which we use to explore the kind of data we have been working with.
-
Here are the steps to find the user who tweeted the most, based on the structure of the tweet data above.
-
The Aggregation Framework in MongoDB implements this.
-
The framework uses a pipeline to solve the problem.
-
First it uses the $group operator. The _id field (the unique key) means that all tweets are grouped by the user's screen name. Here "$user.screen_name" is not an operator; the leading $ means "the value of the user.screen_name field". Then, for every tweet with the same screen name, count is incremented by one.
-
The $sort stage then sorts the results by count, in descending (-1) order.
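The two stages described above can be sketched as a pipeline written as plain Python data. This is a minimal sketch: with pymongo it would be passed to collection.aggregate(pipeline), and the collection itself is assumed. The most_active_users helper and the sample tweets are hypothetical and only mimic what the stages compute.

```python
from collections import Counter

# Two-stage pipeline: group tweets by screen name and count them, then sort.
# With pymongo this list would be passed to collection.aggregate(pipeline);
# the collection is an assumption and is not created here.
pipeline = [
    {"$group": {"_id": "$user.screen_name", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]


def most_active_users(tweets):
    """Pure-Python illustration of what the two stages above compute."""
    counts = Counter(t["user"]["screen_name"] for t in tweets)
    grouped = [{"_id": name, "count": n} for name, n in counts.items()]
    return sorted(grouped, key=lambda d: d["count"], reverse=True)


# Hypothetical sample data, heavily simplified.
sample_tweets = [
    {"user": {"screen_name": "alice"}},
    {"user": {"screen_name": "bob"}},
    {"user": {"screen_name": "alice"}},
]
top = most_active_users(sample_tweets)
print(top[0])
```

The first element of the result is the user who tweeted the most.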
-
This is a two-stage pipeline in the Aggregation Framework.
-
An aggregation pipeline can consist of a single stage or a series of stages that together produce a result.
-
Here we reshape the tweets into an intermediate form (based on what we want) and then sort them in the $sort stage.
-
Aggregation operators:
-
$project: reshapes documents, keeping or computing only the fields we want, either for the next stage or as the final result.
-
$match: filters documents.
-
$group: combines multiple documents that share the given key into a single document, computed with accumulator operators. The $group accumulators are as follows:
-
$sum
-
$first
-
$last
-
$max
-
$min
-
$avg
-
$push: works with arrays; collects every value in the group into an array, duplicates included.
-
$addToSet: works with arrays; treats the array as a set, so a value is added only if it is not already present.
-
$skip: skips a given number of documents at the start of the input.
-
$limit: limits the number of documents passed on. A value of 3 means only the first three documents get through.
-
$unwind: expands an array field of a document into multiple documents, one per array element. Each copy carries the same data and differs only in the value of the array field. This is useful with Twitter data, where we may want to group by hashtag.
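The effect of $unwind can be sketched in plain Python. The unwind helper and the flat hashtags field are hypothetical simplifications. The pipeline underneath assumes that hashtags are stored under entities.hashtags, each with a text field; adjust the paths if your documents differ.

```python
def unwind(doc, field):
    """Emulate $unwind on a top-level array field: one copy of doc per element."""
    return [{**doc, field: value} for value in doc[field]]


# Hypothetical, simplified tweet; real tweets are assumed to nest hashtags
# under entities.hashtags.
tweet = {"_id": 1, "text": "hello", "hashtags": ["mongodb", "python"]}
unwound = unwind(tweet, "hashtags")
for doc in unwound:
    print(doc)

# Pipeline that groups tweets by hashtag (field paths are assumptions).
hashtag_pipeline = [
    {"$unwind": "$entities.hashtags"},
    {"$group": {"_id": "$entities.hashtags.text", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
```

Each unwound document keeps the same _id and text, so a later $group stage can count tweets per hashtag.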
-
This produces a four-stage aggregation pipeline.
-
friends: the users I follow
-
followers: the users who follow me
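A four-stage pipeline that uses both counts might look like the sketch below. The goal (finding the user with the highest followers-to-friends ratio) is an assumption made for illustration, as are the field paths user.followers_count and user.friends_count and the output field names.

```python
# Hypothetical four-stage pipeline: $match, $project, $sort, $limit.
ratio_pipeline = [
    # Keep only users with non-zero counts, which also avoids dividing by zero.
    {"$match": {"user.friends_count": {"$gt": 0},
                "user.followers_count": {"$gt": 0}}},
    # Reshape each tweet to a screen name and a computed ratio.
    {"$project": {"screen_name": "$user.screen_name",
                  "ratio": {"$divide": ["$user.followers_count",
                                        "$user.friends_count"]}}},
    # Highest ratio first.
    {"$sort": {"ratio": -1}},
    # Keep only the top document.
    {"$limit": 1},
]

stage_names = [next(iter(stage)) for stage in ratio_pipeline]
print(stage_names)
```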
-
This is the query that finds the user who included the most user mentions.
-
This produces the hashtags as an array of unique values, with no duplicates.
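The difference between $push and $addToSet can be sketched in plain Python. The group_values helper and the sample pairs are hypothetical. The pipeline underneath assumes hashtags are grouped per user, with hashtags stored under entities.hashtags.text. Note that MongoDB does not guarantee the order of elements produced by $addToSet.

```python
def group_values(pairs, unique):
    """Group (user, hashtag) pairs per user.

    unique=True mimics $addToSet (no duplicates); unique=False mimics $push.
    """
    grouped = {}
    for user, tag in pairs:
        tags = grouped.setdefault(user, [])
        if not unique or tag not in tags:
            tags.append(tag)
    return grouped


# Hypothetical data: alice used the same hashtag twice.
pairs = [("alice", "mongodb"), ("alice", "mongodb"), ("alice", "python")]
pushed = group_values(pairs, unique=False)
as_set = group_values(pairs, unique=True)
print(pushed["alice"], as_set["alice"])

# Pipeline form (field paths and the output field name are assumptions).
unique_hashtags_pipeline = [
    {"$unwind": "$entities.hashtags"},
    {"$group": {"_id": "$user.screen_name",
                "unique_hashtags": {"$addToSet": "$entities.hashtags.text"}}},
    {"$sort": {"_id": 1}},
]
```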
-
A pipeline can contain multiple stages that use the same operator.
-
This one finds the user with the most unique user mentions, that is, the user who mentions the largest number of distinct users.
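One way to count unique mentions is sketched below, using $unwind and $group twice each. The field paths (entities.user_mentions.screen_name, user.screen_name) and the mset field name are assumptions. The unique_mention_counts helper is a hypothetical pure-Python equivalent, included only to show what the pipeline computes.

```python
unique_mentions_pipeline = [
    # One document per mention.
    {"$unwind": "$entities.user_mentions"},
    # Per tweeting user, the set of distinct users they mentioned.
    {"$group": {"_id": "$user.screen_name",
                "mset": {"$addToSet": "$entities.user_mentions.screen_name"}}},
    # One document per distinct mentioned user.
    {"$unwind": "$mset"},
    # Count the distinct mentions per tweeting user.
    {"$group": {"_id": "$_id", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 10},
]


def unique_mention_counts(tweets):
    """Pure-Python illustration of the pipeline above (without the limit)."""
    mentioned = {}
    for tweet in tweets:
        name = tweet["user"]["screen_name"]
        for mention in tweet["entities"]["user_mentions"]:
            mentioned.setdefault(name, set()).add(mention["screen_name"])
    result = [{"_id": n, "count": len(s)} for n, s in mentioned.items()]
    return sorted(result, key=lambda d: d["count"], reverse=True)


# Hypothetical sample: alice mentions bob twice and carol once.
sample = [
    {"user": {"screen_name": "alice"},
     "entities": {"user_mentions": [{"screen_name": "bob"},
                                    {"screen_name": "carol"}]}},
    {"user": {"screen_name": "alice"},
     "entities": {"user_mentions": [{"screen_name": "bob"}]}},
    {"user": {"screen_name": "bob"},
     "entities": {"user_mentions": [{"screen_name": "alice"}]}},
]
counts = unique_mention_counts(sample)
print(counts)
```

Repeated mentions of the same user are counted once, because the set collapses them before the second grouping.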
-
We can index our database to speed up our queries.
-
To do this we list the fields in the order we query them, leftmost first: hashtag, then username.
-
Keep in mind that although reads become faster, writes become slower, because the index has to be updated on every write as well.
-
Here are the index commands from the mongo shell.
-
If we execute the second line, it takes a few seconds to run, because the dataset has 7 million documents.
-
But once we set an index (tg), the query returns results immediately.
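The indexing steps can be sketched in Python rather than the mongo shell. This is a sketch under assumptions: the field paths are guesses about the tweet layout, and collection is assumed to be a pymongo Collection, whose create_index and find methods are used here. The helper names are hypothetical.

```python
# Assumed field paths; adjust to the actual document structure.
HASHTAG_FIELD = "entities.hashtags.text"
USER_FIELD = "user.screen_name"

# Single-field index on the hashtag text (1 = ascending).
index_keys = [(HASHTAG_FIELD, 1)]

# Compound index: hashtag is the leftmost field, username comes second.
# Queries on hashtag alone, or on hashtag plus username, can use it.
compound_index_keys = [(HASHTAG_FIELD, 1), (USER_FIELD, 1)]


def query_for(hashtag):
    """Build the filter document for tweets that contain the given hashtag."""
    return {HASHTAG_FIELD: hashtag}


def find_by_hashtag(collection, hashtag):
    """Create the index if needed, then run the query.

    collection is assumed to be a pymongo Collection.
    """
    collection.create_index(index_keys)
    return list(collection.find(query_for(hashtag)))


print(query_for("mongodb"))
```

Timing the find call before and after creating the index is a simple way to see the difference on a large collection.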
-
We can also index a geospatial field (e.g. location), but its value must follow the [x, y] format.
-
Then we can query based on the $near operator.
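A geospatial index and a $near query can be sketched as plain Python data. The field name location comes from the example above; the helper names are hypothetical, and the index type is assumed to be MongoDB's flat "2d" type, which matches the [x, y] format. With pymongo, geo_index_keys would be passed to collection.create_index and the query to collection.find.

```python
# "2d" is MongoDB's flat geospatial index type for [x, y] pairs.
geo_index_keys = [("location", "2d")]


def is_valid_point(value):
    """Check that value follows the [x, y] format: exactly two numbers."""
    return (
        isinstance(value, (list, tuple))
        and len(value) == 2
        and all(isinstance(c, (int, float)) and not isinstance(c, bool)
                for c in value)
    )


def near_query(field, x, y):
    """Build a filter that returns documents ordered by distance from [x, y]."""
    return {field: {"$near": [x, y]}}


print(near_query("location", 1, 2))
print(is_valid_point([1.5, 2]), is_valid_point([1]))
```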