KNN(ML) With MongoDB (NoSQL)

A new, economical, and proficient approach to machine learning

As discussed in our other blog post (Machine Learning with NoSQL), combining machine learning with NoSQL is powerful, and many startups and entrepreneurship programs lack funds in the beginning. A solution to that lack of funds is described in our “PAY as you GROW” blog post.

Today we will discuss machine learning with NoSQL in more detail. The first questions that arise are:

Why do it?

Why can’t we get data from Hadoop, HDFS, or large CSV files?

Why use MongoDB or any other NoSQL database?

Let's begin. For the sake of this article, we will focus only on MongoDB as the NoSQL database.

We use KNN as the machine learning algorithm for this article, and code containing a proof of concept (POC) is provided on GitHub. The KNN (k-nearest neighbors) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems. It predicts the output for a new data point from the k labeled training points closest to it.
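To make the idea concrete, here is a minimal sketch of KNN classification in plain Python. It is an illustration only, not the code from the proof of concept: the function name knn_classify, the two-number feature vectors, and the "dog" / "not dog" labels are all assumptions made up for this example.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    train: list of (features, label) pairs (hypothetical toy format)
    query: a sequence of numbers with the same length as each features entry
    """
    # Sort the training points by Euclidean distance to the query, keep the k closest.
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    # Count the labels of those neighbors and return the most common one.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Made-up labeled data: two numeric features per example.
train = [
    ((1.0, 1.0), "dog"),
    ((1.2, 0.8), "dog"),
    ((0.9, 1.1), "dog"),
    ((5.0, 5.0), "not dog"),
    ((5.2, 4.9), "not dog"),
    ((4.8, 5.1), "not dog"),
]

print(knn_classify(train, (1.1, 1.0)))
```

This version sorts the whole training set for every query, which is fine for a toy example but is exactly the kind of thing that gets costly as the data grows.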

A supervised machine learning algorithm is one that relies on properly labeled input data to learn a function that produces an appropriate output when given new unlabeled data.

Let us understand what supervised learning is through an example:

Imagine the student as a computer and the teacher as a supervisor, and suppose we want the student (computer) to learn what a dog looks like. The teacher shows the student several different pictures, some of which are dogs, while the rest could be pictures of anything else, such as a cat, a pig, or a cow. When the picture is a dog, the teacher says "This is a dog!" When it is not a dog, the teacher says "This is not a dog!" After doing this several times, the teacher shows the student a random picture and asks "Is it a dog?" The student then answers "This is a dog!" or "This is not a dog!" based on what the student has learned from the previous pictures.

That is supervised machine learning.

This brings us to the most important ingredient in machine learning: data. Without quality data, the results will be poor no matter how accurate the algorithm is.

So what do we do when the data sets are large?

Many people would say to use Hadoop or another distributed system. They forget that Hadoop and similar distributed systems have a very steep learning curve and are expensive in terms of talent and investment, which brings us back to the lack-of-funds issue. The other solution we often hear is to use large CSV files.

That is a nice solution, but let's think practically: how many of us today can afford computers with plenty of memory? If you are a startup driven only by ideas and innovation, it is very hard to afford machines that do not run short of memory once they have to load a CSV file bigger than 100 MB.

To get around these issues, we suggest a NoSQL database (MongoDB). With MongoDB we can handle large data sets without the memory hogging that comes with large CSV files, and the learning curve is gentler than Hadoop's.
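One way to keep memory use low is to read documents one at a time and remember only the k closest seen so far. The sketch below illustrates that idea under stated assumptions and is not necessarily how the proof of concept does it: the function name knn_from_documents and the field names "x", "y", and "label" are hypothetical. It accepts any iterable of dictionary-like documents. With the pymongo driver, the cursor returned by a collection's find() method is such an iterable, so it could be passed in directly; here a plain generator stands in for it so that the example runs without a database.

```python
import heapq
import math
from collections import Counter

def knn_from_documents(documents, query, k=3,
                       feature_keys=("x", "y"), label_key="label"):
    """Majority-vote KNN over a stream of documents, keeping only k in memory.

    documents: any iterable of dict-like records (for example a database cursor)
    feature_keys / label_key: hypothetical field names, adjust to your schema
    """
    heap = []  # holds (-distance, arrival index, label) for the k closest so far
    for i, doc in enumerate(documents):
        features = [doc[key] for key in feature_keys]
        d = math.dist(features, query)
        if len(heap) < k:
            heapq.heappush(heap, (-d, i, doc[label_key]))
        elif -d > heap[0][0]:
            # heap[0] is the farthest of the current k; replace it with a closer point.
            heapq.heapreplace(heap, (-d, i, doc[label_key]))
    votes = Counter(label for _, _, label in heap)
    return votes.most_common(1)[0][0]

def fake_cursor():
    """Stand-in for a database cursor: yields one made-up document at a time."""
    yield {"x": 1.0, "y": 1.0, "label": "dog"}
    yield {"x": 5.0, "y": 5.0, "label": "not dog"}
    yield {"x": 1.2, "y": 0.8, "label": "dog"}
    yield {"x": 5.2, "y": 4.9, "label": "not dog"}
    yield {"x": 0.9, "y": 1.1, "label": "dog"}
    yield {"x": 4.8, "y": 5.1, "label": "not dog"}

print(knn_from_documents(fake_cursor(), (1.1, 1.0)))
```

Because only k entries are held at any time, memory use does not grow with the size of the collection, although each query still has to scan every document.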

AARK Technology Hub, as a company, always strives to provide innovative and proficient solutions.

This AARK Technology Hub proof of concept is freely available on our GitHub, and you can easily download it and use it anywhere.