Auditing the data
-
Data type: each field should hold one expected data type (for example, "two" vs 2).
-
Range: values must fall within a sensible range.
-
Regex: the data must follow some pattern.
-
Unique: each element must be unique.
-
Cross-field constraints: the values of related fields must be consistent with each other.
-
Foreign-key constraints: a field may refer to an id in some other database, and that id must exist.
-
Membership: a value must belong to one particular group.
-
These are the common problems we face when wrangling data.
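-
The checks listed above can be sketched in a few lines of Python. This is only a minimal illustration: the field names (`population`, `postcode`, `continent`, `country_id`), the allowed continents, and the five-digit postcode pattern are all made up for the example and do not come from any real dataset.

```python
import re

# Hypothetical rules, invented purely for illustration.
ALLOWED_CONTINENTS = {"Asia", "Europe", "Africa"}
POSTCODE_PATTERN = re.compile(r"\d{5}")


def audit_record(record, seen_ids, country_ids):
    """Return the names of the validity checks that one record fails."""
    problems = []
    # Data type: population should be an integer, not a string such as "two".
    if not isinstance(record["population"], int):
        problems.append("type")
    # Range: a population cannot be negative.
    elif record["population"] < 0:
        problems.append("range")
    # Regex: the postcode must consist of exactly five digits.
    if not POSTCODE_PATTERN.fullmatch(record["postcode"]):
        problems.append("regex")
    # Unique: no two records may share an id.
    if record["id"] in seen_ids:
        problems.append("unique")
    seen_ids.add(record["id"])
    # Foreign key: the referenced country id must exist in the other table.
    if record["country_id"] not in country_ids:
        problems.append("foreign-key")
    # Membership: the continent must be one of the allowed values.
    if record["continent"] not in ALLOWED_CONTINENTS:
        problems.append("membership")
    return problems
```

Calling `audit_record` on every row with a shared `seen_ids` set gives a per-row list of failed checks, which can then be counted to see which kind of problem dominates.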
-
Wikipedia has a nice dataset in the infoboxes of its articles. In this case we choose cities.
-
DBpedia has done a good job of exporting the Wikipedia infoboxes to CSV files.
-
We can see that there are null fields and differing data formats.
-
Wikipedia's information is filled in by people, and human input brings human error.
-
Some data types need special attention and treatment.
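-
One way to find out which types actually occur in a field is to classify every raw CSV value and count the results. The sketch below assumes that missing values appear as an empty string or the text `NULL`, and that multi-valued fields are written as `{a|b}`; both are assumptions about the export format, and `value_type` and `audit_types` are hypothetical helper names.

```python
from collections import Counter


def value_type(value):
    """Classify one raw CSV string by the type it appears to hold."""
    if value in ("", "NULL"):   # assumed markers for a missing value
        return type(None)
    if value.startswith("{"):   # assumed "{a|b}" notation for an array
        return list
    try:
        int(value)
        return int
    except ValueError:
        pass
    try:
        float(value)
        return float
    except ValueError:
        return str


def audit_types(rows, field):
    """Count how often each apparent type occurs in one field."""
    return Counter(value_type(row[field]) for row in rows)
```

Rows read with `csv.DictReader` can be passed in directly; a field that reports more than one type besides `NoneType` is a candidate for cleaning.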
-
Once we have cleaned and wrangled the data, we will move it into MongoDB.
-
Some fields are related to other fields. In this case, population density is derived from the land area and the total population. We may want to cross-check these inputs so that the values stay consistent with each other.
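-
A cross-field check of this kind could look like the sketch below. It assumes the area is given in square kilometres and the density in people per square kilometre, and it accepts a 5% relative difference; the function name and the tolerance are arbitrary choices for the example.

```python
import math


def density_is_consistent(population, area_km2, density, rel_tol=0.05):
    """Check that the reported density roughly equals population / area."""
    if area_km2 <= 0:
        # A non-positive area makes the record invalid on its own.
        return False
    expected = population / area_km2
    return math.isclose(expected, density, rel_tol=rel_tol)
```

A record that fails this check does not say which of the three fields is wrong, only that at least one of them needs a closer look.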
-
If we look at the cities, we may find that some of them are measured in millimetres, which is why the numbers have so many digits. It is not sensible to measure cities in millimetres (which is the reason DBpedia also includes the measurement in km).
-
You also want to audit for accuracy, comparing the data against a source whose reliability we trust (a gold standard).
-
In this example we rely on the ISO standard for country names.
-
Looking carefully, some countries have more than one name (an array).
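-
Checking a field against a gold standard can be sketched as a set lookup. The three-country set below is only a stand-in for a real reference list, and the `{a|b}` notation for a country with several names is an assumption about how the array is written in the CSV.

```python
# Tiny stand-in for a trusted reference list of country names.
GOLD_COUNTRIES = {"France", "Germany", "Indonesia"}


def parse_names(raw):
    """Split an assumed "{a|b}" array value into a list of names."""
    if raw.startswith("{") and raw.endswith("}"):
        return [name.strip() for name in raw[1:-1].split("|")]
    return [raw.strip()]


def is_known_country(raw):
    """Accept the value if any of its names appears in the gold standard."""
    return any(name in GOLD_COUNTRIES for name in parse_names(raw))
```

Accepting a value when any one of its names matches is a lenient choice; a stricter audit could require every name in the array to match.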
-
Auditing is very situation-dependent and needs additional resources/references to audit against.
-
Data may be coming from various sources.
-
It's important to maintain consistency across all data.
-
The example above is a location based on user input, GPS, or IP address.
-
We also want to choose the most reliable source and compare the others against it.
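-
As a rough sketch, the sources can be ranked by how much we trust them, and every other source compared with the best one available. The ranking below (GPS first, then user input, then IP address) is an assumption made for the example, not a rule, and `resolve_location` is a hypothetical helper.

```python
# Assumed ranking: sources listed earlier are trusted more.
SOURCE_PRIORITY = ["gps", "user_input", "ip_address"]


def resolve_location(readings):
    """Return (trusted value, sources that disagree with it)."""
    for source in SOURCE_PRIORITY:
        if readings.get(source) is not None:
            trusted = readings[source]
            break
    else:
        # No source reported a location at all.
        return None, []
    conflicts = [
        s for s in SOURCE_PRIORITY
        if readings.get(s) not in (None, trusted)
    ]
    return trusted, conflicts
```

The returned list of conflicting sources is what gets audited: a record with no conflicts is consistent, while one with conflicts needs a decision about which value to keep.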
-
Sometimes the unique identifier of our data changes.
-
These are the UIDs for three companies; over the years they changed names or went missing (from the stock list).
-
Last is auditing uniformity: how we represent each field with a single data type.
-
Some values may have been entered manually to fit some other system.
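-
A minimal sketch of enforcing uniformity is to coerce every value of a numeric field to one type and flag whatever cannot be converted. The helper name `to_float` is hypothetical, and treating a comma as a thousands separator is an assumption that would be wrong for data that uses a decimal comma.

```python
def to_float(value):
    """Coerce a mixed-type value to float, or None if it cannot be parsed."""
    if value is None:
        return None
    if isinstance(value, (int, float)):
        return float(value)
    # Assumption: commas are thousands separators, as in "1,234.5".
    text = str(value).strip().replace(",", "")
    try:
        return float(text)
    except ValueError:
        return None
```

Values that come back as `None` are the ones worth inspecting by hand, since they are usually the manually entered ones.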