Getting Lots of Data and Artificial Data Synthesis

  • It can now be concluded that one of the best ways to build a high-performance learning algorithm is to take a low-bias algorithm and train it on a huge amount of data.
  • But sometimes we can't get that much data. What do we do then?
  • We can create artificial data, and this comes in two flavours:
    • Creating a large dataset essentially from scratch.
    • Taking the small dataset we already have and amplifying it into a much larger one.
  • As before, the example ML problem is character recognition: given an image of text, predict which character it contains.
  • In the figure on the left, we are given a set of training examples, each image labeled with the character it shows.
  • Nowadays every computer ships with a huge library of font types (ttf files), and we can easily find many more font files with a search engine, giving us a lot of variety.
  • Based on that, we can synthesize data.
  • So this is what we do: take a character rendered from a ttf font, paste it onto backgrounds of various colors, including backgrounds like the ones in the training set, and then apply some blur or other variation so the character blends in with the background colour. A minimal sketch of this step is shown below.
  • That is how we create synthetic data, as shown in the figure on the right.
  • Those are the steps for creating artificial data from scratch.
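A minimal sketch of the from-scratch synthesis step, assuming Pillow is installed; the font path, background image path, and parameter values are placeholders for illustration, not part of the original notes:

```python
import random
from PIL import Image, ImageDraw, ImageFont, ImageFilter

def synthesize_character(char, font_path, background):
    """Render one character from a .ttf font onto a background patch."""
    font = ImageFont.truetype(font_path, size=24)             # font file is a placeholder
    patch = background.copy().convert("L")                    # grayscale background patch
    draw = ImageDraw.Draw(patch)
    # Random position and gray level so each synthetic example looks a bit different
    draw.text((random.randint(0, 8), random.randint(0, 8)), char,
              fill=random.randint(0, 80), font=font)
    # Slight blur so the glyph blends into the background
    return patch.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.5, 1.5)))

# Example: 100 synthetic images of the character "A" from one font and one background
background = Image.open("background_patch.png")               # placeholder path
synthetic = [synthesize_character("A", "some_font.ttf", background) for _ in range(100)]
```

In practice you would loop over many fonts, backgrounds, and characters to get the variety described above.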
  • Next is how we amplify our existing small dataset.
  • Based on the figure on the left (the gridlines are only for visual purposes), we create another sixteen distorted variants of the original image; one simple way to do this is sketched below.
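A minimal sketch of this amplification step, assuming NumPy and SciPy are available; small random rotations and shifts are an illustrative choice, not the specific warping used in the course:

```python
import numpy as np
from scipy import ndimage

def amplify(image, n_variants=16, max_angle=10, max_shift=2):
    """Return n_variants randomly rotated and shifted copies of a 2-D image array."""
    variants = []
    for _ in range(n_variants):
        angle = np.random.uniform(-max_angle, max_angle)            # small random rotation
        dx, dy = np.random.uniform(-max_shift, max_shift, size=2)   # small random shift
        distorted = ndimage.rotate(image, angle, reshape=False, mode="nearest")
        distorted = ndimage.shift(distorted, (dy, dx), mode="nearest")
        variants.append(distorted)
    return variants

# Every variant keeps the label of the original image, so one example becomes seventeen.
```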
  • The same idea of synthesizing distorted copies of the original data applies not only to images but also to audio.
  • Given an original audio clip, we can amplify our data by mixing in different background sounds, as sketched below.
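A minimal sketch of the audio version, assuming the clean recording and the background sound are 1-D NumPy arrays at the same sample rate; the SNR-based mixing and the placeholder signals are assumptions for illustration:

```python
import numpy as np

def add_background(clean, background, snr_db=10.0):
    """Mix a background sound into a clean signal at roughly the given signal-to-noise ratio."""
    # Loop or trim the background so it matches the clean signal's length
    reps = int(np.ceil(len(clean) / len(background)))
    bg = np.tile(background, reps)[:len(clean)]
    # Scale the background to hit the requested SNR
    clean_power = np.mean(clean ** 2)
    bg_power = np.mean(bg ** 2) + 1e-12
    scale = np.sqrt(clean_power / (bg_power * 10 ** (snr_db / 10)))
    return clean + scale * bg

# Example: three noisy variants of the same utterance at different noise levels
clean = np.random.randn(16000)      # placeholder one-second "utterance"
crowd = np.random.randn(48000)      # placeholder background recording
noisy_variants = [add_background(clean, crowd, snr) for snr in (20, 10, 5)]
```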

  • The distortions we synthesize should be distortions that are meaningful for the algorithm to learn.
  • In the case of character recognition, we often see text printed on a flag, on a balloon, or on other surfaces that warp the text. These are the kinds of distortions we want the algorithm to learn.
  • In contrast, if meaningless distortions are introduced, the learning algorithm gains nothing: adding purely random Gaussian noise to the pixels of each character usually contributes nothing to the system (see the short sketch below).
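For contrast, a minimal sketch of the kind of distortion that usually does not help, per-pixel Gaussian noise; the sigma value and the assumption that pixels lie in [0, 1] are illustrative:

```python
import numpy as np

def add_pixel_noise(image, sigma=0.1):
    """Add independent Gaussian noise to every pixel; real test images are rarely corrupted this way."""
    return np.clip(image + np.random.normal(0.0, sigma, image.shape), 0.0, 1.0)
```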
  • Choosing good distortions is something of an art, and sometimes the best approach is simply to experiment.
  • So here's additional advice regarding acquiring more data.
  • Just to be reminded once again: verify that you would actually benefit from much more data. PLOT THE LEARNING CURVE (a sketch follows) to make sure your algorithm has low bias; if not, add new features or more hidden units to the neural network first.
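A minimal sketch of that check using scikit-learn's learning_curve on placeholder data; the logistic-regression model and the random data are stand-ins for whatever classifier and dataset you are actually using:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = np.random.randn(500, 20), np.random.randint(0, 2, 500)   # placeholder data

sizes, train_scores, cv_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 8))

plt.plot(sizes, 1 - train_scores.mean(axis=1), label="training error")
plt.plot(sizes, 1 - cv_scores.mean(axis=1), label="cross-validation error")
plt.xlabel("training set size")
plt.ylabel("error")
plt.legend()
plt.show()
# If both curves plateau high and close together (high bias), more data will not
# help much; add features or hidden units first.
```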
  • Then, especially when working as a team, the second question to ask is: "How much work would it be to get ten times as much data as we currently have?"
  • If you can figure out a way to get much more data in a short amount of time (a few days), you can be a star on the team, because adding data often gives a huge performance boost to the learning algorithm.
  • Or we can collect and label the data ourselves. Sometimes we would be surprised at how little time it takes to multiply our data. Say one example takes a human 10 seconds to label; labeling another 10,000 examples, divided across the team, may take only a few days (see the quick calculation below), it will still give a performance boost, and such a result will make us stars on the team.
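A quick back-of-the-envelope check of that claim; the team size of five is an assumed number, not from the notes:

```python
seconds_per_example = 10        # time for a human to label one example
examples_needed = 10_000
team_size = 5                   # assumed team size

total_hours = seconds_per_example * examples_needed / 3600
hours_per_person = total_hours / team_size
print(total_hours, hours_per_person)   # about 27.8 hours in total, roughly 5.6 per person
```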
  • There is also crowdsourcing, where we pay people to label our training data for us; Amazon Mechanical Turk is a well-known example.
  • So we have talked about how to create data from scratch and how to amplify a small existing dataset.
  • We also discussed analyzing our data first, before deciding whether to gather more.
  • Finally, increasing our data tenfold with a relatively small amount of time (a few days or weeks) or money (crowdsourcing) can give a huge boost to our learning algorithm.