Analyzing Data

  • This lesson shows how to analyze data using a Twitter dataset.


  • The dataset is social network data in general, and Twitter data in particular.
  • The lesson introduces the Aggregation Framework, MongoDB's powerful data analysis tool, to explore the kind of data we have been working with.


  • Here are the steps to find the user who tweeted the most, based on the structure of the Twitter data above.

  • The Aggregation Framework in MongoDB implements this.
  • The framework uses a pipeline to solve the problem.
  • First it uses the $group operator. The _id field is the grouping key, so all tweets are grouped by the unique user screen name. Here "$user.screen_name" is not an operator; the leading $ means "the value of the user.screen_name field". For every tweet with the same screen name, count is incremented by one.
  • The $sort stage then sorts by count in descending (-1) order.
  • These are the two stages performed by the aggregation pipeline.
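  • The two stages described above can be sketched in Python. This is a minimal sketch, not the lesson's original code: it assumes `collection` is a pymongo Collection (or any object with a compatible `aggregate()` method), and the names `PIPELINE` and `most_tweets` are hypothetical.

```python
# Two-stage pipeline: group tweets by screen name, then sort by count.
PIPELINE = [
    # _id is the grouping key; "$user.screen_name" reads the field's value.
    # {"$sum": 1} adds one to count for every tweet in the group.
    {"$group": {"_id": "$user.screen_name", "count": {"$sum": 1}}},
    # -1 sorts in descending order, so the most active user comes first.
    {"$sort": {"count": -1}},
]


def most_tweets(collection):
    """Return one document per user, most active user first.

    `collection` is assumed to behave like a pymongo Collection.
    """
    return list(collection.aggregate(PIPELINE))
```

  • Each result document has the shape {"_id": <screen name>, "count": <number of tweets>}.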




  • An aggregation pipeline can consist of a single stage or a series of stages that produce a result.
  • Here we reshape the tweets in an intermediate stage (based on what we want) and then sort them in the $sort stage.
  • Aggregation operators:
    • $project: reshapes the documents, keeping or computing only the fields we want, for the next stage or as the result.
    • $match: filters documents.
    • $group: combines multiple documents that share the given key into a single document, computed with accumulator operators. The $group accumulators are as follows:
      • $sum
      • $first
      • $last
      • $max
      • $min
      • $avg
      • $push: deals with arrays; collects the value from every document in the group into an array.
      • $addToSet: deals with arrays; behaves like a set, adding a value to the array only if it is not already present.
    • $skip: skips a given number of documents from the start of the input.
    • $limit: limits the number of documents passed on; 3 means only the first three are let through.
    • $unwind: expands an array field of a document into multiple documents, each identical except that the array field holds a single value from the array. This is useful for Twitter data, where we may want to group by hashtag.
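  • To make $unwind more concrete, here is a small pure-Python imitation of what it does to one document. It is an illustration only, not MongoDB's implementation; the `unwind` helper and the sample tweet are hypothetical, and only a top-level array field is handled.

```python
import copy


def unwind(doc, field):
    """Imitate $unwind for a top-level array field (illustration only).

    Returns one copy of `doc` per array element, with `field` replaced
    by that single element. A missing or empty array yields no documents.
    """
    results = []
    for value in doc.get(field, []):
        clone = copy.deepcopy(doc)  # leave the input document untouched
        clone[field] = value
        results.append(clone)
    return results


# Hypothetical sample document with two hashtags.
tweet = {"user": "alice", "hashtags": ["mongodb", "python"]}
unwound = unwind(tweet, "hashtags")
```

  • Here `unwound` holds two documents, one per hashtag, which a later $group stage could then count by hashtag.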



  • This produces a 4-stage aggregation pipeline.

  • friends: who I follow
  • followers: who follows me
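  • The original 4-stage pipeline is not reproduced here. As one plausible sketch, assuming the goal is the user with the highest followers-to-friends ratio, the four stages could be $match, $project, $sort and $limit. The field names `user.friends_count` and `user.followers_count` follow the Twitter tweet format; `RATIO_PIPELINE` and `highest_ratio` are hypothetical names, and `collection` is assumed to be a pymongo Collection.

```python
# Assumed 4-stage pipeline: filter, reshape, sort, keep the top result.
RATIO_PIPELINE = [
    # Stage 1: keep only users with at least one friend and one follower
    # (this also avoids dividing by zero in stage 2).
    {"$match": {"user.friends_count": {"$gt": 0},
                "user.followers_count": {"$gt": 0}}},
    # Stage 2: reshape each tweet to a screen name and a computed ratio.
    {"$project": {"screen_name": "$user.screen_name",
                  "ratio": {"$divide": ["$user.followers_count",
                                        "$user.friends_count"]}}},
    # Stage 3: highest ratio first.
    {"$sort": {"ratio": -1}},
    # Stage 4: only the first document is let through.
    {"$limit": 1},
]


def highest_ratio(collection):
    """Run the assumed pipeline; `collection` must provide aggregate()."""
    return list(collection.aggregate(RATIO_PIPELINE))
```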

  • This function finds the user who included the most user mentions.
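  • A minimal sketch of such a function, assuming mentions are stored in the `entities.user_mentions` array as in the Twitter tweet format. The names `MENTIONS_PIPELINE` and `most_mentions` are hypothetical, and `collection` is assumed to be a pymongo Collection.

```python
MENTIONS_PIPELINE = [
    # One document per mention, so each mention can be counted.
    {"$unwind": "$entities.user_mentions"},
    # Count the mentions made by each tweeting user.
    {"$group": {"_id": "$user.screen_name", "count": {"$sum": 1}}},
    # Most mentions first, then keep only the top user.
    {"$sort": {"count": -1}},
    {"$limit": 1},
]


def most_mentions(collection):
    """Return the user who included the most user mentions."""
    return list(collection.aggregate(MENTIONS_PIPELINE))
```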


  • This produces an array of unique hashtags, with no value repeated.
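  • One way to get such an array is $addToSet inside $group. The sketch below assumes we want the unique hashtags used by each user and that hashtag text lives in `entities.hashtags.text`, as in the Twitter tweet format; `HASHTAG_PIPELINE`, `unique_hashtags_by_user` and the output field name are hypothetical.

```python
HASHTAG_PIPELINE = [
    # One document per hashtag occurrence.
    {"$unwind": "$entities.hashtags"},
    # $addToSet keeps each hashtag text once per user, unlike $push,
    # which would keep duplicates.
    {"$group": {"_id": "$user.screen_name",
                "unique_hashtags": {"$addToSet": "$entities.hashtags.text"}}},
    # Sort by screen name for a stable, readable result.
    {"$sort": {"_id": 1}},
]


def unique_hashtags_by_user(collection):
    """Return, per user, the set of hashtags they used (as an array)."""
    return list(collection.aggregate(HASHTAG_PIPELINE))
```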



  • A pipeline can contain multiple stages that use the same operator.
  • This one finds the user with the most unique user mentions (the user who mentions the largest number of distinct users).
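  • A sketch of such a pipeline, using $unwind and $group twice each. It assumes the Twitter field `entities.user_mentions.screen_name`; the intermediate field `mset` and the names `UNIQUE_MENTIONS_PIPELINE` and `most_unique_mentions` are hypothetical.

```python
UNIQUE_MENTIONS_PIPELINE = [
    # First $unwind: one document per mention.
    {"$unwind": "$entities.user_mentions"},
    # First $group: collect the distinct mentioned users per tweeting user.
    {"$group": {"_id": "$user.screen_name",
                "mset": {"$addToSet": "$entities.user_mentions.screen_name"}}},
    # Second $unwind: one document per distinct mentioned user.
    {"$unwind": "$mset"},
    # Second $group: count the distinct mentions per tweeting user.
    {"$group": {"_id": "$_id", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 1},
]


def most_unique_mentions(collection):
    """Return the user who mentions the most distinct users."""
    return list(collection.aggregate(UNIQUE_MENTIONS_PIPELINE))
```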

  • We can index our database to speed up our queries.
  • To do this we specify the indexed fields in order, e.g. hashtag --> username; a query that filters on the leftmost field(s) can use the index.
  • Keep in mind that although reads become faster, writes become slower, because the index has to be updated as well.
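  • As a sketch, an index ordered hashtag then username could be created from Python as follows. The field paths `entities.hashtags.text` and `user.screen_name` are assumptions based on the Twitter tweet format, `add_hashtag_user_index` is a hypothetical helper, and `collection` is assumed to be a pymongo Collection, whose `create_index()` accepts a list of (field, direction) pairs.

```python
ASCENDING = 1  # same value as pymongo.ASCENDING

# Order matters: hashtag is the leftmost field, username comes second.
INDEX_KEYS = [
    ("entities.hashtags.text", ASCENDING),
    ("user.screen_name", ASCENDING),
]


def add_hashtag_user_index(collection):
    """Create the compound index and return whatever create_index returns
    (with pymongo, the name of the index)."""
    return collection.create_index(INDEX_KEYS)
```

  • With this index, a query on hashtag alone, or on hashtag and username together, can use it; a query on username alone generally cannot, because username is not the leftmost field.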

  • Here are the index commands from the mongo shell.
  • If we execute the second line, it takes a few seconds to run, because the dataset holds 7 million records.
  • But once we set the index (tg), the query returns its results immediately.



  • We can give the field any name (e.g. location), but its value must follow the [x, y] format.

  • Then we can query using the $near operator.
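  • A minimal sketch of both steps from Python, assuming a field named `location` holding [x, y] pairs and a pymongo Collection passed in as `collection`. The helper names are hypothetical; "2d" is the index type string that pymongo exposes as `pymongo.GEO2D`.

```python
GEO2D = "2d"  # same value as pymongo.GEO2D


def add_geo_index(collection, field="location"):
    """Create a 2d geospatial index on `field` ([x, y] values)."""
    return collection.create_index([(field, GEO2D)])


def near_query(x, y, field="location"):
    """Build a filter matching documents ordered by distance to [x, y]."""
    return {field: {"$near": [x, y]}}


def find_near(collection, x, y, field="location", n=3):
    """Return the n documents closest to the point [x, y]."""
    return list(collection.find(near_query(x, y, field)).limit(n))
```

  • For example, `find_near(collection, 41.94, -87.65)` would return the three documents whose `location` is closest to that point; the coordinates here are made up for illustration.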