{"id":4265,"date":"2017-11-30T15:47:51","date_gmt":"2017-11-30T23:47:51","guid":{"rendered":"https:\/\/www.couchbase.com\/blog\/?p=4265"},"modified":"2023-06-14T00:34:23","modified_gmt":"2023-06-14T07:34:23","slug":"zero-effort-machine-learning-couchbase-spark-mllib","status":"publish","type":"post","link":"https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/","title":{"rendered":"Zero Effort Machine Learning with Couchbase and Spark MLlib"},"content":{"rendered":"<p>Over the past few years, machine learning has proven to be a technology in which companies should invest heavily. You can easily find dozens of papers describing how company X saved tons of money by adding some level of AI to its processes.<br \/>\nSurprisingly, I still see many industries that are skeptical about it, and others that think it is &#8220;cool&#8221; but do not have a concrete use in mind yet.<\/p>\n<p>I believe this dissonance comes down to two main factors: many companies have no idea how AI fits into their business, and for most developers it still sounds like black magic.<\/p>\n<p>That is why I would like to show you today how you can get started with machine learning with almost zero effort.<\/p>\n<h4>Linear Regression<\/h4>\n<p>At the most basic level of machine learning, we have something called Linear Regression, which is roughly an algorithm that tries to &#8220;explain&#8221; a number by assigning a weight to each feature in a set. Let&#8217;s see some examples:<\/p>\n<ul>\n<li>\u00a0The price of a house could be explained by things like size, location, number of bedrooms and bathrooms.<\/li>\n<li>\u00a0The price of a car could be explained by its model, year, mileage, condition, etc.<\/li>\n<li>\u00a0The time spent on a given task could be predicted by the number of subtasks, level of difficulty, worker experience, etc.<\/li>\n<\/ul>\n<p>There are plenty of use cases where Linear Regression (or other Regression types) can be 
used, but let&#8217;s focus on the first one, house prices.<\/p>\n<p>Imagine we are running a real estate company in a particular region of the country. As we are an old company, we have records of which houses were sold in the past and for how much.<\/p>\n<p>In this case, each row in our historical data will look like this:<\/p>\n<pre class=\"lang:js decode:true\">{\r\n\"id\": 7129300520,\r\n\"date\": \"20141013T000000\",\r\n\"price\": 221900,\r\n\"bedrooms\": 3,\r\n\"bathrooms\": 1,\r\n\"sqft_living\": 1180,\r\n\"sqft_lot\": 5650,\r\n\"floors\": 1,\r\n\"waterfront\": 0,\r\n\"view\": 0,\r\n\"condition\": 3,\r\n\"grade\": 7,\r\n\"sqft_above\": 1180,\r\n\"sqft_basement\": 0,\r\n\"yr_built\": 1955,\r\n\"yr_renovated\": 0,\r\n\"zipcode\": 98178,\r\n\"lat\": 47.5112,\r\n\"long\": -122.257,\r\n\"sqft_living15\": 1340,\r\n\"sqft_lot15\": 5650\r\n}<\/pre>\n<p>&nbsp;<\/p>\n<h3>The problem &#8211; How to price a house<\/h3>\n<p>Now, imagine you just joined the company and you have to sell the following house:<\/p>\n<pre class=\"lang:js decode:true\">{\r\n\"id\": 1000001,\r\n\"date\": \"20150422T000000\",\r\n\"bedrooms\": 6,\r\n\"bathrooms\": 3,\r\n\"price\": null,\r\n\"sqft_living\": 2400,\r\n\"sqft_lot\": 9373,\r\n\"floors\": 2,\r\n\"waterfront\": 0,\r\n\"view\": 0,\r\n\"condition\": 3,\r\n\"grade\": 7,\r\n\"sqft_above\": 2400,\r\n\"sqft_basement\": 0,\r\n\"yr_built\": 1991,\r\n\"yr_renovated\": 0,\r\n\"zipcode\": 98002,\r\n\"lat\": 47.3262,\r\n\"long\": -122.214,\r\n\"sqft_living15\": 2060,\r\n\"sqft_lot15\": 7316\r\n}\r\n<\/pre>\n<p><strong>For how much would you sell it?<\/strong><\/p>\n<p>The question above would be very challenging if you had never sold a similar house in the past. 
Luckily, you have the right tool for the job: Linear Regression.<\/p>\n<h3>The Answer &#8211; Predicting house prices with Linear Regression<\/h3>\n<p>Before you go further, you will need to install the following items:<\/p>\n<ul>\n<li>\u00a0<a href=\"https:\/\/www.couchbase.com\/downloads\/\">Couchbase Server 5<\/a><\/li>\n<li>\u00a0<a href=\"https:\/\/spark.apache.org\/releases\/spark-release-2-2-0.html\">Spark 2.2<\/a><\/li>\n<li>\u00a0<a href=\"https:\/\/www.scala-sbt.org\/download.html\">SBT<\/a> (as we are using Scala)<\/li>\n<\/ul>\n<h4>\u00a0Loading the Dataset<\/h4>\n<p>With your Couchbase Server running, go to the administrative portal (usually at http:\/\/127.0.0.1:8091) and create a new bucket called <strong>houses_prices<\/strong>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-4281 aligncenter\" src=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2017\/11\/bucket_creation-300x258.png\" alt=\"\" width=\"333\" height=\"286\" srcset=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/bucket_creation-300x258.png 300w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/bucket_creation-1024x881.png 1024w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/bucket_creation-768x661.png 768w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/bucket_creation-20x17.png 20w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/bucket_creation.png 1072w\" sizes=\"auto, (max-width: 333px) 100vw, 333px\" \/><\/p>\n<p>Now, let&#8217;s clone our tutorial code:<\/p>\n<pre class=\"lang:default decode:true \">git clone https:\/\/github.com\/couchbaselabs\/couchbase-spark-mllib-sample.git<\/pre>\n<p>In the root folder there is a file called <strong>house_prices_train_data.zip<\/strong>. It is our dataset, which I borrowed from an old machine learning course on <a 
href=\"https:\/\/www.coursera.org\/learn\/ml-foundations\/\">Coursera<\/a>. Please unzip it and then run the following command:<\/p>\n<pre class=\"lang:default decode:true\">.\/cbimport json -c couchbase:\/\/127.0.0.1 -u YOUR_USER -p YOUR_PASSWORD -b houses_prices -d &lt;PATH_TO_UNZIPED_FILE&gt;\/house_prices_train_data -f list -g key::%id% -t 4<\/pre>\n<p><strong>TIP<\/strong>: If you are not familiar with <strong>cbimport\u00a0<\/strong>please <a href=\"https:\/\/developer.couchbase.com\/documentation\/server\/current\/tools\/cbimport.html\">check this tutorial<\/a><\/p>\n<p>If your command ran successfully, you should notice that your <strong>houses_prices<\/strong> bucket has been populated:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-4282 aligncenter\" src=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2017\/11\/filled_bucket-300x135.png\" alt=\"\" width=\"749\" height=\"337\" srcset=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/filled_bucket-300x135.png 300w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/filled_bucket-1024x461.png 1024w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/filled_bucket-768x346.png 768w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/filled_bucket-1536x691.png 1536w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/filled_bucket-2048x922.png 2048w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/filled_bucket-20x9.png 20w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/filled_bucket-1320x594.png 1320w\" sizes=\"auto, (max-width: 749px) 100vw, 749px\" \/><\/p>\n<p>Let&#8217;s also quickly add a primary index for it:<\/p>\n<pre class=\"lang:default decode:true\">CREATE PRIMARY INDEX ON `houses_prices`<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-4283 aligncenter\" 
src=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2017\/11\/index_creation-300x172.png\" alt=\"\" width=\"680\" height=\"390\" srcset=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/index_creation-300x172.png 300w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/index_creation-1024x586.png 1024w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/index_creation-768x439.png 768w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/index_creation-1536x879.png 1536w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/index_creation-20x11.png 20w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/index_creation-1320x755.png 1320w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/index_creation.png 1654w\" sizes=\"auto, (max-width: 680px) 100vw, 680px\" \/><\/p>\n<h4><\/h4>\n<h4>Time to Code!<\/h4>\n<p>Our environment is ready, it is time to code!<\/p>\n<p>In the <a href=\"https:\/\/github.com\/couchbaselabs\/couchbase-spark-mllib-sample\/blob\/master\/src\/main\/scala\/LinearRegressionExample.scala\">LinearRegressionExample<\/a> class we start by creating the Spark context with our bucket credentials:<\/p>\n<pre class=\"lang:scala decode:true\">val spark = SparkSession\r\n    .builder()\r\n    .appName(\"SparkSQLExample\")\r\n    .master(\"local[*]\") \/\/ use the JVM as the master, great for testing\r\n    .config(\"spark.couchbase.nodes\", \"127.0.0.1\") \/\/ connect to couchbase on localhost\r\n    .config(\"spark.couchbase.bucket.houses_prices\", \"\") \/\/ open the houses_prices bucket with empty password\r\n    .config(\"com.couchbase.username\", \"YOUR_USER\")\r\n    .config(\"com.couchbase.password\", \"YOUR_PASSWORD\")\r\n    .getOrCreate()<\/pre>\n<p>and then we load all the data from the database:<\/p>\n<pre class=\"lang:scala decode:true\">val houses = 
spark.read.couchbase()<\/pre>\n<p>As Spark uses a lazy approach, the data is not loaded until it is really needed. You can clearly see the beauty of the <strong>Couchbase Connector<\/strong> above: we just converted JSON documents into a Spark Dataframe with zero effort.<\/p>\n<p>With other databases, for example, you would be required to export the data to a CSV file in some specific format, copy it to your machine, load it, and run some extra procedures to convert it to a dataframe (not to mention the cases where the generated file is too big).<\/p>\n<p>In the real world you would need to do some filtering instead of just grabbing all the data. Again, our connector is there for you, as you can even run N1QL queries with it:<\/p>\n<pre class=\"lang:scala decode:true\">\/\/loading documents by their type\r\nval airlines = spark.read.couchbase(EqualTo(\"type\", \"airline\"))\r\n\r\n\/\/loading data using N1QL\r\n\/\/ This query groups airports by country and counts them.\r\nval query = N1qlQuery.simple(\"\" +\r\n    \"select country, count(*) as count \" +\r\n    \"from `travel-sample` \" +\r\n    \"where type = 'airport' \" +\r\n    \"group by country \" +\r\n    \"order by count desc\")\r\n\r\nval schema = StructType(\r\n   StructField(\"count\", IntegerType) ::\r\n   StructField(\"country\", StringType) :: Nil\r\n)\r\n\r\nval rdd = spark.sparkContext.couchbaseQuery(query).map(\r\n      r =&gt; Row(r.value.getInt(\"count\"), r.value.getString(\"country\")))\r\nspark.createDataFrame(rdd, schema).show()<\/pre>\n<p><strong>TIP<\/strong>: There are plenty of examples of how to use the Couchbase connector <a href=\"https:\/\/github.com\/couchbaselabs\/couchbase-spark-samples\/tree\/master\/src\/main\/scala\">here<\/a>.<\/p>\n<p>Our dataframe still looks exactly like what we had in our database:<\/p>\n<pre class=\"lang:scala decode:true \">houses.show(10)<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-4284 aligncenter\" 
src=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2017\/11\/dataframe_data-300x59.png\" alt=\"\" width=\"906\" height=\"178\" srcset=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/dataframe_data-300x59.png 300w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/dataframe_data-768x152.png 768w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/dataframe_data-1536x304.png 1536w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/dataframe_data-20x4.png 20w\" sizes=\"auto, (max-width: 906px) 100vw, 906px\" \/><\/p>\n<p>There are two different types of data here: &#8220;<em>scalar numbers<\/em>&#8221; such as <strong>bathrooms<\/strong>\u00a0and <strong>sqft_living<\/strong>,\u00a0and &#8220;<em>categorical variables<\/em>&#8221; such as <strong>zipcode<\/strong>\u00a0and <strong>yr_renovated<\/strong>. Those categorical variables are not just simple numbers; they carry a deeper meaning because they describe a property. The zipcode, for example, represents the location of the house.<\/p>\n<p>Linear Regression cannot use that kind of categorical variable directly, because it would treat the zipcode as a quantity. Since zipcode seems to be a relevant field for predicting the price of a house, we have to convert it to a\u00a0<strong>dummy variable<\/strong>, which is a fairly simple process:<\/p>\n<ol>\n<li>Select the distinct values of the target column.\u00a0<strong>Ex:\u00a0<\/strong><span class=\"lang:default decode:true crayon-inline \">SELECT DISTINCT(ZIPCODE) FROM HOUSES_PRICES<\/span><\/li>\n<li>Turn each distinct value into its own column. 
<strong>Ex:<\/strong> zipcode_98002, zipcode_98188, zipcode_98059<\/li>\n<li>Update those new columns with 1s and 0s according to the value of the zipcode content:<\/li>\n<\/ol>\n<p><strong>Ex:<\/strong><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-4285\" src=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2017\/11\/data_before_transformation-300x179.png\" alt=\"\" width=\"300\" height=\"179\" srcset=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/data_before_transformation-300x179.png 300w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/data_before_transformation-20x12.png 20w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/data_before_transformation.png 438w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/p>\n<p>The table above will be transformed to:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-4288\" src=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2017\/11\/data_after_transformation-300x74.png\" alt=\"\" width=\"690\" height=\"170\" srcset=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/data_after_transformation-300x74.png 300w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/data_after_transformation-768x190.png 768w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/data_after_transformation-20x5.png 20w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/data_after_transformation.png 964w\" sizes=\"auto, (max-width: 690px) 100vw, 690px\" \/><\/p>\n<p>That is what we are doing on the line below:<\/p>\n<pre class=\"lang:scala decode:true\">val df = transformCategoricalFeatures(houses)<\/pre>\n<p>Converting categorical variables is a very standard procedure and Spark already has some utilities to do this work for you:<\/p>\n<pre class=\"lang:scala 
decode:true\">def transformCategoricalFeatures(dataset: Dataset[_]): DataFrame = {\r\n    val df1 = encodeFeature(\"zipcode\", \"zipcodeVec\", dataset)\r\n    val df2 = encodeFeature(\"yr_renovated\", \"yr_renovatedVec\", df1)\r\n    val df3 = encodeFeature(\"condition\", \"conditionVec\", df2)\r\n    encodeFeature(\"grade\", \"gradeVec\", df3)\r\n}\r\n\r\ndef encodeFeature(featureName: String, outputName: String, dataset: Dataset[_]): DataFrame = {\r\n    val indexer = new StringIndexer()\r\n        .setInputCol(featureName)\r\n        .setOutputCol(featureName + \"Index\")\r\n        .fit(dataset)\r\n\r\n    val indexed = indexer.transform(dataset)\r\n\r\n    val encoder = new OneHotEncoder()\r\n      .setInputCol(featureName + \"Index\")\r\n      .setOutputCol(outputName)\r\n\r\n    encoder.transform(indexed)\r\n}<\/pre>\n<p><strong>NOTE:<\/strong> The final dataframe will not look exactly like the example shown above, as it is already optimized to avoid the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Sparse_matrix\">Sparse Matrix<\/a> problem.<\/p>\n<p>Now we can select the fields we would like to use and group them in a vector called <strong>features<\/strong>. As this linear regression implementation expects a field called\u00a0<strong>label<\/strong>, we also have to rename the\u00a0<strong>price<\/strong> column:<\/p>\n<pre class=\"lang:scala decode:true  \">\/\/using almost all columns as features, no special feature engineering here\r\nval features = Array(\"sqft_living\", \"bedrooms\",\r\n     \"gradeVec\", \"waterfront\",\r\n     \"bathrooms\", \"view\",\r\n     \"conditionVec\", \"sqft_above\",\r\n     \"sqft_basement\",\r\n     \"sqft_lot\", \"floors\",\r\n     \"yr_built\", \"zipcodeVec\", \"yr_renovatedVec\")\r\n\r\nval assembler = new VectorAssembler()\r\n    .setInputCols(features)\r\n    .setOutputCol(\"features\")\r\n\r\n\/\/the Linear Regression implementation expects a column called \"label\"\r\nval renamedDF = 
assembler.transform(df.withColumnRenamed(\"price\", \"label\"))<\/pre>\n<p>You can play around with those features, removing or adding them as you wish. Later you can try, for example, removing the &#8220;<em>sqft_living<\/em>&#8221; feature to see how much worse the algorithm performs.<\/p>\n<p>Finally, we will only use houses in which the price is not null to train our machine learning algorithm, as our whole goal is to make our Linear Regression &#8220;learn&#8221; how to predict the price from a given set of features.<\/p>\n<pre class=\"lang:scala decode:true \">\/\/\"price\" was renamed to \"label\" above, so we filter on the new name\r\nval data = renamedDF.filter(\"label is not null\").select(\"label\", \"features\")\r\n<\/pre>\n<p>Here is where the magic happens. First we split our data into training (<em>80%<\/em>) and test (<em>20%<\/em>) sets, although for the purpose of this article we will ignore the test data. Then we create our LinearRegression instance and <strong>fit<\/strong> it to our training data.<\/p>\n<pre class=\"lang:scala decode:true\">\/\/let's split our data into test and training (a common thing during model selection)\r\nval splits = data.randomSplit(Array(0.8, 0.2), seed = 1L)\r\nval trainingData = splits(0).cache()\r\n\/\/let's ignore the test data for now as we are not doing model selection\r\nval testData = splits(1)\r\n\r\nval lr = new LinearRegression()\r\n    .setMaxIter(1000)\r\n    .setStandardization(true)\r\n    .setRegParam(0.1)\r\n    .setElasticNetParam(0.8)\r\n\r\nval lrModel = lr.fit(trainingData)\r\n\r\n<\/pre>\n<p><em>The <strong>lrModel<\/strong>\u00a0variable is already a trained model capable of predicting house prices!<\/em><\/p>\n<p>Before we start predicting things, let&#8217;s just check some metrics of our trained model:<\/p>\n<pre class=\"lang:scala decode:true\">println(s\"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}\")\r\nval trainingSummary = lrModel.summary\r\nprintln(s\"numIterations: ${trainingSummary.totalIterations}\")\r\nprintln(s\"objectiveHistory: 
[${trainingSummary.objectiveHistory.mkString(\",\")}]\")\r\ntrainingSummary.residuals.show()\r\nprintln(s\"RMSE: ${trainingSummary.rootMeanSquaredError}\")\r\nprintln(s\"r2: ${trainingSummary.r2}\")\r\n\r\n<\/pre>\n<p>The one you should care about here is called <a href=\"https:\/\/en.wikipedia.org\/wiki\/Root-mean-square_deviation\">RMSE &#8211; Root Mean Squared Error<\/a>, which is roughly the\u00a0<strong>typical deviation between what our model predicts and the actual sale price<\/strong>, measured here on the training data.<\/p>\n<pre class=\"lang:default decode:true\">RMSE: 147556.0841305963\r\nr2: 0.8362288980410875<\/pre>\n<p>Roughly speaking, we miss the actual price by about <em>$147,556<\/em>, which is not bad at all considering we barely did any <a href=\"https:\/\/en.wikipedia.org\/wiki\/Feature_engineering\">feature engineering<\/a> and did not remove any outliers (some houses might have inexplicably high or low prices, which can distort your Linear Regression).<\/p>\n<p>There is only one house with a missing price in this dataset, exactly the one we looked at in the beginning:<\/p>\n<pre class=\"lang:scala decode:true \">\/\/\"price\" was renamed to \"label\", so the missing price shows up as a null label\r\nval missingPriceData = renamedDF.filter(\"label is null\")\r\n    .select(\"features\")\r\n\r\nmissingPriceData.show()<\/pre>\n<p>And now we can finally predict the expected house price:<\/p>\n<pre class=\"lang:scala decode:true \">\/\/printing out the predicted values\r\nval predictedValues = lrModel.transform(missingPriceData)\r\npredictedValues.select(\"prediction\").show()\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-4287\" src=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2017\/11\/predicted_price.png\" alt=\"\" width=\"292\" height=\"164\" srcset=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/predicted_price.png 292w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2017\/11\/predicted_price-20x11.png 20w\" sizes=\"auto, (max-width: 292px) 100vw, 292px\" 
\/><\/p>\n<p>Awesome, isn&#8217;t it?<\/p>\n<p>For production purposes, you would still need to do <a href=\"https:\/\/en.wikipedia.org\/wiki\/Model_selection\">model selection<\/a> first, check other metrics of your regression, and save the model instead of training it on the fly, but it&#8217;s amazing how much can be done with less than 100 lines of code!<\/p>\n<p>If you have any questions, feel free to ask me on twitter at <a href=\"https:\/\/twitter.com\/deniswsrosa\">@deniswsrosa<\/a>\u00a0 or on our <a href=\"https:\/\/www.couchbase.com\/forums\/\">forums<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The past few years we noticed how machine learning had been proven to be a technology in which companies should invest massively, you can easily find dozens of papers talking about how company X saved tons of money by adding [&hellip;]<\/p>\n","protected":false},"author":8754,"featured_media":13873,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"inline_featured_image":false,"footnotes":""},"categories":[1],"tags":[],"ppma_author":[9059],"class_list":["post-4265","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.7.1 (Yoast SEO v25.7) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Zero Effort Machine Learning with Couchbase and Spark MLlib - The Couchbase Blog<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Zero Effort Machine Learning with Couchbase and Spark MLlib\" 
\/>\n<meta property=\"og:description\" content=\"The past few years we noticed how machine learning had been proven to be a technology in which companies should invest massively, you can easily find dozens of papers talking about how company X saved tons of money by adding [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/\" \/>\n<meta property=\"og:site_name\" content=\"The Couchbase Blog\" \/>\n<meta property=\"article:published_time\" content=\"2017-11-30T23:47:51+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-06-14T07:34:23+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2017\/11\/bucket_creation-300x258.png\" \/>\n<meta name=\"author\" content=\"Denis Rosa, Developer Advocate, Couchbase\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@deniswsrosa\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Denis Rosa, Developer Advocate, Couchbase\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"9 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/\"},\"author\":{\"name\":\"Denis Rosa, Developer Advocate, Couchbase\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/#\/schema\/person\/fe3c5273e805e72a5294611a48f62257\"},\"headline\":\"Zero Effort Machine Learning with Couchbase and Spark MLlib\",\"datePublished\":\"2017-11-30T23:47:51+00:00\",\"dateModified\":\"2023-06-14T07:34:23+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/\"},\"wordCount\":1261,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/#organization\"},\"image\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2022\/11\/couchbase-nosql-dbaas.png\",\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/\",\"url\":\"https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/\",\"name\":\"Zero Effort Machine Learning with Couchbase and Spark MLlib - The Couchbase 
Blog\",\"isPartOf\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2022\/11\/couchbase-nosql-dbaas.png\",\"datePublished\":\"2017-11-30T23:47:51+00:00\",\"dateModified\":\"2023-06-14T07:34:23+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/#primaryimage\",\"url\":\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2022\/11\/couchbase-nosql-dbaas.png\",\"contentUrl\":\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2022\/11\/couchbase-nosql-dbaas.png\",\"width\":1800,\"height\":630},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.couchbase.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Zero Effort Machine Learning with Couchbase and Spark MLlib\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/#website\",\"url\":\"https:\/\/www.couchbase.com\/blog\/\",\"name\":\"The Couchbase Blog\",\"description\":\"Couchbase, the NoSQL 
Database\",\"publisher\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.couchbase.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/#organization\",\"name\":\"The Couchbase Blog\",\"url\":\"https:\/\/www.couchbase.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2023\/04\/admin-logo.png\",\"contentUrl\":\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2023\/04\/admin-logo.png\",\"width\":218,\"height\":34,\"caption\":\"The Couchbase Blog\"},\"image\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/#\/schema\/logo\/image\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/#\/schema\/person\/fe3c5273e805e72a5294611a48f62257\",\"name\":\"Denis Rosa, Developer Advocate, Couchbase\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/#\/schema\/person\/image\/be0716f6199cfb09417c92cf7a8fa8d6\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/f8d1f5c13115122cab89d0f229b904480bfe20d3dfbb093fe9734cda5235d419?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/f8d1f5c13115122cab89d0f229b904480bfe20d3dfbb093fe9734cda5235d419?s=96&d=mm&r=g\",\"caption\":\"Denis Rosa, Developer Advocate, Couchbase\"},\"description\":\"Denis Rosa is a Developer Advocate for Couchbase and lives in Munich - Germany. He has a solid experience as a software engineer and speaks fluently Java, Python, Scala and Javascript. 
Denis likes to write about search, Big Data, AI, Microservices and everything else that would help developers to make a beautiful, faster, stable and scalable app.\",\"sameAs\":[\"https:\/\/x.com\/deniswsrosa\"],\"url\":\"https:\/\/www.couchbase.com\/blog\/author\/denis-rosa\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Zero Effort Machine Learning with Couchbase and Spark MLlib - The Couchbase Blog","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/","og_locale":"en_US","og_type":"article","og_title":"Zero Effort Machine Learning with Couchbase and Spark MLlib","og_description":"The past few years we noticed how machine learning had been proven to be a technology in which companies should invest massively, you can easily find dozens of papers talking about how company X saved tons of money by adding [&hellip;]","og_url":"https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/","og_site_name":"The Couchbase Blog","article_published_time":"2017-11-30T23:47:51+00:00","article_modified_time":"2023-06-14T07:34:23+00:00","og_image":[{"url":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2017\/11\/bucket_creation-300x258.png","type":"","width":"","height":""}],"author":"Denis Rosa, Developer Advocate, Couchbase","twitter_card":"summary_large_image","twitter_creator":"@deniswsrosa","twitter_misc":{"Written by":"Denis Rosa, Developer Advocate, Couchbase","Est. 
reading time":"9 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/#article","isPartOf":{"@id":"https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/"},"author":{"name":"Denis Rosa, Developer Advocate, Couchbase","@id":"https:\/\/www.couchbase.com\/blog\/#\/schema\/person\/fe3c5273e805e72a5294611a48f62257"},"headline":"Zero Effort Machine Learning with Couchbase and Spark MLlib","datePublished":"2017-11-30T23:47:51+00:00","dateModified":"2023-06-14T07:34:23+00:00","mainEntityOfPage":{"@id":"https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/"},"wordCount":1261,"commentCount":0,"publisher":{"@id":"https:\/\/www.couchbase.com\/blog\/#organization"},"image":{"@id":"https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/#primaryimage"},"thumbnailUrl":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2022\/11\/couchbase-nosql-dbaas.png","inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/","url":"https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/","name":"Zero Effort Machine Learning with Couchbase and Spark MLlib - The Couchbase 
Blog","isPartOf":{"@id":"https:\/\/www.couchbase.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/#primaryimage"},"image":{"@id":"https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/#primaryimage"},"thumbnailUrl":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2022\/11\/couchbase-nosql-dbaas.png","datePublished":"2017-11-30T23:47:51+00:00","dateModified":"2023-06-14T07:34:23+00:00","breadcrumb":{"@id":"https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/#primaryimage","url":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2022\/11\/couchbase-nosql-dbaas.png","contentUrl":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2022\/11\/couchbase-nosql-dbaas.png","width":1800,"height":630},{"@type":"BreadcrumbList","@id":"https:\/\/www.couchbase.com\/blog\/zero-effort-machine-learning-couchbase-spark-mllib\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.couchbase.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Zero Effort Machine Learning with Couchbase and Spark MLlib"}]},{"@type":"WebSite","@id":"https:\/\/www.couchbase.com\/blog\/#website","url":"https:\/\/www.couchbase.com\/blog\/","name":"The Couchbase Blog","description":"Couchbase, the NoSQL 
Database","publisher":{"@id":"https:\/\/www.couchbase.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.couchbase.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.couchbase.com\/blog\/#organization","name":"The Couchbase Blog","url":"https:\/\/www.couchbase.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.couchbase.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2023\/04\/admin-logo.png","contentUrl":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2023\/04\/admin-logo.png","width":218,"height":34,"caption":"The Couchbase Blog"},"image":{"@id":"https:\/\/www.couchbase.com\/blog\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/www.couchbase.com\/blog\/#\/schema\/person\/fe3c5273e805e72a5294611a48f62257","name":"Denis Rosa, Developer Advocate, Couchbase","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.couchbase.com\/blog\/#\/schema\/person\/image\/be0716f6199cfb09417c92cf7a8fa8d6","url":"https:\/\/secure.gravatar.com\/avatar\/f8d1f5c13115122cab89d0f229b904480bfe20d3dfbb093fe9734cda5235d419?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/f8d1f5c13115122cab89d0f229b904480bfe20d3dfbb093fe9734cda5235d419?s=96&d=mm&r=g","caption":"Denis Rosa, Developer Advocate, Couchbase"},"description":"Denis Rosa is a Developer Advocate for Couchbase and lives in Munich, Germany. He has solid experience as a software engineer and is fluent in Java, Python, Scala, and JavaScript. 
Denis likes to write about search, Big Data, AI, microservices, and everything else that helps developers build a beautiful, fast, stable, and scalable app.","sameAs":["https:\/\/x.com\/deniswsrosa"],"url":"https:\/\/www.couchbase.com\/blog\/author\/denis-rosa\/"}]}},"authors":[{"term_id":9059,"user_id":8754,"is_guest":0,"slug":"denis-rosa","display_name":"Denis Rosa, Developer Advocate, Couchbase","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/f8d1f5c13115122cab89d0f229b904480bfe20d3dfbb093fe9734cda5235d419?s=96&d=mm&r=g","author_category":"","last_name":"Rosa, Developer Advocate, Couchbase","first_name":"Denis","job_title":"","user_url":"","description":"Denis Rosa is a Developer Advocate for Couchbase and lives in Munich, Germany. He has solid experience as a software engineer and is fluent in Java, Python, Scala, and JavaScript. Denis likes to write about search, Big Data, AI, microservices, and everything else that helps developers build a beautiful, fast, stable, and scalable 
app."}],"_links":{"self":[{"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/posts\/4265","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/users\/8754"}],"replies":[{"embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/comments?post=4265"}],"version-history":[{"count":0,"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/posts\/4265\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/media\/13873"}],"wp:attachment":[{"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/media?parent=4265"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/categories?post=4265"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/tags?post=4265"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=4265"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}