Analyzing Data

2018-04-05 00:00 | Source

This dataset is typical social network data, in this particular case from Twitter.
This section introduces the Aggregation Framework, MongoDB's powerful data analysis tool, to analyze the kind of data we've been working on.

Here are the steps to extract the user who tweeted the most, based on the structure of the Twitter data above.

The Aggregation Framework in MongoDB implements this using a pipeline.
First it uses the $group operator, where _id (the unique key) means that we group all the tweets by the user's screen name. The "$user.screen_name" is not an operator; the leading dollar sign means "the value of the field user.screen_name". Then, for every tweet with the same username, the count is incremented by one.
The $sort stage then sorts the groups by count, in descending (-1) order.
Together these form a two-stage aggregation pipeline.
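The two stages described above can be sketched in Python as follows. This is a minimal sketch, assuming tweet documents that carry a `user.screen_name` field and a PyMongo-style collection object; the names `make_pipeline` and `most_tweets` are illustrative and do not come from the original post.

```python
def make_pipeline():
    """Build the two-stage pipeline: group by screen name, then sort by count."""
    return [
        # Stage 1: one output document per distinct screen name.
        # "$user.screen_name" reads the field's value; {"$sum": 1} adds one per tweet.
        {"$group": {"_id": "$user.screen_name", "count": {"$sum": 1}}},
        # Stage 2: highest count first (-1 means descending).
        {"$sort": {"count": -1}},
    ]


def most_tweets(collection):
    """Return the top result, or None if the collection is empty.

    `collection` is assumed to be a pymongo Collection (anything with an
    aggregate(pipeline) method that returns an iterable of documents).
    """
    return next(iter(collection.aggregate(make_pipeline())), None)
```

The pipeline is an ordinary list of dictionaries, so it can be built and inspected without a database connection; only `most_tweets` needs a live collection.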

An aggregation pipeline can consist of a single stage or a series of stages to get a result.
Here we reshape the tweets into an intermediate form (based on what we want) and then perform the sorting in the $sort stage.
Aggregation operators:
- $project: reshapes each document (keeping, renaming, or computing fields as we want) before passing it to the next stage or returning it as the result.
- $match: filters documents.
- $group: combines multiple documents that share the same grouping key into a single document. The accumulator operators available in $group are as follows:
  - $sum
  - $first
  - $last
  - $max
  - $min
  - $avg
  - $push: works with arrays; appends every value to an array.
  - $addToSet: works with arrays; treats the array as a set, so a value is added only if it is not already present.
- $skip: skips a given number of documents from the start of the input.
- $limit: limits the number of documents passed on; a value of 3 means only the first three are let through.
- $unwind: expands an array field into multiple documents, one per array element; each copy carries the same data, except that the array field holds a single element. This is useful with Twitter data, where we may want to group by hashtag.
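To make $unwind more concrete, here is a small sketch in Python. The helper `unwind_local` is a pure-Python imitation for a top-level array field, written only for illustration (it is not how MongoDB implements the stage), and `TOP_HASHTAGS` is a hypothetical pipeline that assumes tweets store their hashtags in `entities.hashtags`, each with a `text` field.

```python
def unwind_local(docs, field):
    """Pure-Python imitation of $unwind for a top-level array field."""
    for doc in docs:
        # Like $unwind's default behaviour, a missing or empty array yields nothing.
        for value in doc.get(field) or []:
            # Copy the document, replacing the array with one of its elements.
            yield {**doc, field: value}


# Hypothetical pipeline: the three most used hashtags.
TOP_HASHTAGS = [
    {"$unwind": "$entities.hashtags"},                                      # one document per hashtag
    {"$group": {"_id": "$entities.hashtags.text", "count": {"$sum": 1}}},  # count each hashtag
    {"$sort": {"count": -1}},                                               # most used first
    {"$limit": 3},                                                          # keep only the first three
]

sample = [{"id": 1, "tags": ["a", "b"]}, {"id": 2, "tags": []}]
print(list(unwind_local(sample, "tags")))
# prints [{'id': 1, 'tags': 'a'}, {'id': 1, 'tags': 'b'}]
```

The document with the empty array disappears from the output, which is why $unwind is usually followed by a $group that counts or collects the individual values.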

This produces an array of hashtags in which no value is repeated.
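A pipeline of this kind might look like the sketch below. It assumes the same hypothetical field layout (`entities.hashtags` with a `text` field, and `user.screen_name`); the output field name `unique_hashtags` and the function name are made up for the example.

```python
# Hypothetical pipeline: collect each user's distinct hashtags.
UNIQUE_HASHTAGS_BY_USER = [
    # One document per (tweet, hashtag) pair.
    {"$unwind": "$entities.hashtags"},
    # $addToSet keeps each hashtag text only once per user.
    {"$group": {
        "_id": "$user.screen_name",
        "unique_hashtags": {"$addToSet": "$entities.hashtags.text"},
    }},
]


def unique_hashtags_by_user(collection):
    """Run the pipeline; `collection` is assumed to be a pymongo Collection."""
    return list(collection.aggregate(UNIQUE_HASHTAGS_BY_USER))
```

Using $push instead of $addToSet in the same position would keep duplicates, so a hashtag used in five tweets would appear five times. The order of elements produced by $addToSet is not specified.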

A pipeline can contain multiple stages that use the same operator.
This one finds the users with the most unique user mentions (that is, the users who mention the largest number of distinct users).
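One way to write such a pipeline is sketched below, with $unwind and $group each appearing twice. It assumes that mentions are stored in `entities.user_mentions`, each with a `screen_name`; the intermediate field name `mentioned` and the final limit of 10 are arbitrary choices for the example.

```python
# Hypothetical pipeline: users ranked by how many distinct users they mention.
MOST_UNIQUE_MENTIONS = [
    # 1. One document per (tweet, mentioned user) pair.
    {"$unwind": "$entities.user_mentions"},
    # 2. Per author, collect the set of distinct mentioned screen names.
    {"$group": {
        "_id": "$user.screen_name",
        "mentioned": {"$addToSet": "$entities.user_mentions.screen_name"},
    }},
    # 3. Unwind the set so that each distinct mention becomes its own document.
    {"$unwind": "$mentioned"},
    # 4. Count the distinct mentions per author.
    {"$group": {"_id": "$_id", "count": {"$sum": 1}}},
    # 5. Largest count first, then keep the top ten.
    {"$sort": {"count": -1}},
    {"$limit": 10},
]


def most_unique_mentions(collection):
    """Run the pipeline; `collection` is assumed to be a pymongo Collection."""
    return list(collection.aggregate(MOST_UNIQUE_MENTIONS))
```

The second $unwind and $group pair exists only to turn the size of each set into a number that $sort can order by.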

We can index our database to speed up our queries.
To do this we list the fields in the order we query them, leftmost first: hashtag --> username.
Keep in mind that although reads become faster, writes become slower, because the index has to be updated on every write as well.

Here are the index commands from the mongo shell.
If we execute the second line, it takes a few seconds to run, because the data set has 7 million entries.
But once we set the index (tg), the query returns its results immediately.
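The original shell commands are not reproduced here, so the following is only a sketch of how an equivalent index could be created from Python. The field names `hashtag` and `username` are placeholders taken from the ordering mentioned earlier, not the real field names of the data set, and `add_index` is an illustrative helper.

```python
# Compound index keys, leftmost field first; 1 means ascending order.
# The field names are placeholders for illustration.
INDEX_KEYS = [("hashtag", 1), ("username", 1)]


def add_index(collection):
    """Create the compound index and return its name.

    `collection` is assumed to be a pymongo Collection, whose
    create_index method accepts a list of (field, direction) pairs.
    """
    return collection.create_index(INDEX_KEYS)
```

With this ordering, a query that filters on `hashtag` alone, or on `hashtag` together with `username`, can use the index, whereas a query on `username` alone generally cannot, because `username` is not a leading field of the index.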