{"id":17010,"date":"2025-04-07T10:48:50","date_gmt":"2025-04-07T17:48:50","guid":{"rendered":"https:\/\/www.couchbase.com\/blog\/?p=17010"},"modified":"2025-07-08T09:15:31","modified_gmt":"2025-07-08T16:15:31","slug":"pyspark-ga-couchbase-spark-connector","status":"publish","type":"post","link":"https:\/\/www.couchbase.com\/blog\/ko\/pyspark-ga-couchbase-spark-connector\/","title":{"rendered":"Build Highly Scalable AI\/ML Applications With Couchbase and PySpark"},"content":{"rendered":"<p>We are excited to announce the General Availability (GA) of Python support for the <a href=\"https:\/\/docs.couchbase.com\/spark-connector\/current\/pyspark.html\" target=\"_blank\" rel=\"noopener\">Couchbase Spark Connector<\/a>, bringing first-class integration between Couchbase Server and <a href=\"https:\/\/docs.couchbase.com\/spark-connector\/current\/index.html\" target=\"_blank\" rel=\"noopener\">Apache Spark<\/a> to Python data engineers. This GA release means the connector is production-ready and fully supported, enabling PySpark applications to seamlessly read from and write to Couchbase. With Couchbase\u2019s high-performance NoSQL database (with the SQL++ query language) and Spark\u2019s distributed processing engine, data engineers can now easily combine these technologies to build fast, scalable data pipelines and analytics workflows. In short, the Couchbase Spark Connector for PySpark unlocks efficient, parallel data integration \u2013 allowing you to leverage Spark for ETL\/ELT, real-time analytics, machine learning, and more on data stored in Couchbase.<\/p>\n<p>In this post, we\u2019ll cover how to get started with the PySpark connector, demonstrate basic read\/write operations (both key-value and query-based) for both Couchbase operational databases and Capella Columnar databases, and share performance tuning tips to get the best throughput. 
Whether you\u2019ve been using the Couchbase Spark Connector in Scala or you\u2019re new to Couchbase-Spark integration, this guide will help you quickly ramp up using PySpark for your data engineering needs.<\/p>\n<h2>Why PySpark?<\/h2>\n<p>Adding PySpark support to the existing Couchbase Spark Connector was driven by growing demand from data engineers and developers who prefer Python for its simplicity and its extensive ML ecosystem in data science and engineering workflows. This support ensures that teams already using Python can now integrate Couchbase (whether you are using <a href=\"https:\/\/www.couchbase.com\/products\/capella\/\" target=\"_blank\" rel=\"noopener\">Couchbase Capella (DBaaS)<\/a>, <a href=\"https:\/\/www.couchbase.com\/products\/server\/\" target=\"_blank\" rel=\"noopener\">self managed operational database<\/a> or <a href=\"https:\/\/www.couchbase.com\/products\/analytics\/\" target=\"_blank\" rel=\"noopener\">Capella Columnar<\/a> database) into Python-based Spark workflows, enabling broader adoption and streamlined data processes.<\/p>\n<p>Python\u2019s dominance in AI\/ML use cases, supported by frameworks such as SparkML, PyTorch, TensorFlow, H2O, DataRobot, scikit-learn, and SageMaker, as well as popular exploratory data analysis tools like Matplotlib and Plotly, further underscores the need for PySpark integration. Additionally, PySpark compatibility unlocks accelerated ETL and ML pipelines leveraging GPU acceleration (Spark RAPIDS) and facilitates sophisticated feature engineering and data wrangling tasks using widely adopted libraries such as Pandas, NumPy, and Spark&#8217;s built-in feature engineering APIs. This new support significantly streamlines data processes and expands adoption opportunities for Couchbase in data science and engineering teams.<\/p>\n<h2>Getting started with Couchbase PySpark<\/h2>\n<p>Getting started is straightforward. 
The Couchbase Spark Connector is distributed as a single JAR (Java archive) that you add to your Spark environment. You can obtain the connector from the official <a href=\"https:\/\/docs.couchbase.com\/spark-connector\/current\/download-links.html\" target=\"_blank\" rel=\"noopener\">Couchbase download site<\/a> or via <a href=\"https:\/\/mvnrepository.com\/artifact\/com.couchbase.client\/spark-connector\" target=\"_blank\" rel=\"noopener\">Maven coordinates<\/a>. Once you have the JAR, using it in PySpark is as simple as configuring your Spark session with the connector and Couchbase connection settings.<\/p>\n<p><b>1. Get or create a Couchbase operational database or Capella Columnar database<\/b><\/p>\n<p>The fastest way to start with Couchbase is to use our <a href=\"https:\/\/cloud.couchbase.com\/sign-in\" target=\"_blank\" rel=\"noopener\">Capella DBaaS<\/a>. Once there, you can either find your existing database or create an <a href=\"https:\/\/docs.couchbase.com\/cloud\/clusters\/databases.html\" target=\"_blank\" rel=\"noopener\">operational<\/a> or <a href=\"https:\/\/docs.couchbase.com\/columnar\/admin\/prepare-project.html\" target=\"_blank\" rel=\"noopener\">columnar<\/a> (for analytics) database. Alternatively, you can use our <a href=\"https:\/\/www.couchbase.com\/products\/server\/\" target=\"_blank\" rel=\"noopener\">self managed Couchbase<\/a>.<\/p>\n<p><b>2. Install PySpark (if not already)<\/b><\/p>\n<p>If you are working in a Python environment, install PySpark using pip. For example, in a virtual environment:<\/p>\n<pre class=\"nums:false lang:default decode:true\">pip install pyspark<\/pre>\n<p>This will install Apache Spark for use with Python. If you\u2019re running on an existing Spark cluster or Databricks, PySpark may already be available.<\/p>\n<p><b>3. 
Include the Couchbase Spark Connector JAR<\/b><\/p>\n<p><a href=\"https:\/\/docs.couchbase.com\/spark-connector\/current\/download-links.html#using-from-pyspark\" target=\"_blank\" rel=\"noopener\">Download<\/a> the <code>spark-connector-assembly-&lt;version&gt;.jar<\/code> for the latest connector release. Then, when creating your Spark session or submitting your job, provide this JAR in the configuration. You can do this by setting the <code>--jars<\/code> option in <code>spark-submit<\/code> or via the SparkSession builder in code (as shown below).<\/p>\n<p><b>4. Configure the Couchbase connection<\/b><\/p>\n<p>You need to specify the Couchbase cluster connection string and credentials (username and password). In Capella, you can find the connection string on the \u201cConnect\u201d tab for operational databases and under <strong>Settings-&gt;Connection String<\/strong> for Columnar. Optionally, specify a default bucket or scope if needed (though you can also specify bucket\/scope per operation).<\/p>\n<p>Below is a <b>quick PySpark example<\/b> that sets up a <code>SparkSession<\/code> to connect to a Couchbase cluster and then reads some data:<\/p>\n<pre class=\"nums:false lang:default decode:true\">from pyspark.sql import SparkSession\r\n\r\n# Initialize SparkSession with Couchbase connector and connection settings.\r\n# master(\"local[*]\") runs Spark locally for this example; omit or adjust it for a Spark cluster.\r\nspark = SparkSession.builder \\\r\n    .appName(\"CouchbaseIntegrationExample\") \\\r\n    .master(\"local[*]\") \\\r\n    .config(\"spark.jars\", \"\/path\/to\/spark-connector-assembly-&lt;version&gt;.jar\") \\\r\n    .config(\"spark.couchbase.connectionString\", \"couchbases:\/\/&lt;YOUR_CLUSTER_HOSTNAME&gt;\") \\\r\n    .config(\"spark.couchbase.username\", \"&lt;USERNAME&gt;\") \\\r\n    .config(\"spark.couchbase.password\", \"&lt;PASSWORD&gt;\") 
\\\r\n    .getOrCreate()\r\n\r\n# Test the connection by reading a few documents from Couchbase (using a sample bucket)\r\ndf = spark.read.format(\"couchbase.query\") \\\r\n    .option(\"bucket\", \"bucket_name\") \\\r\n    .option(\"scope\", \"scope_name\") \\\r\n    .option(\"collection\", \"collection_name\") \\\r\n    .load()\r\n\r\ndf.printSchema()\r\ndf.show(5)<\/pre>\n<p>In the code above, we configure the Spark session to include the Couchbase connector JAR and point it to a Couchbase cluster. We then create a DataFrame <code>df<\/code> by reading from the <code>bucket_name<\/code> bucket (specifically the <code>scope_name.collection_name<\/code> collection) via the Query service.<\/p>\n<p>For the rest of this post, we assume you have loaded our sample dataset <a href=\"https:\/\/docs.couchbase.com\/scala-sdk\/current\/ref\/travel-app-data-model.html\" target=\"_blank\" rel=\"noopener\">travel-sample<\/a>, which is easy to do for Couchbase <a href=\"https:\/\/docs.couchbase.com\/cloud\/get-started\/run-first-queries.html\" target=\"_blank\" rel=\"noopener\">Capella operational<\/a> or <a href=\"https:\/\/docs.couchbase.com\/columnar\/intro\/examples.html#travel-sample\" target=\"_blank\" rel=\"noopener\">Columnar<\/a>.<\/p>\n<h2>Read\/write to Couchbase using PySpark<\/h2>\n<p>Once your Spark session is connected to Couchbase, you can perform both <b>key-value operations<\/b> (for writes) and <b>query operations<\/b> (using SQL++ for both reads and writes) through DataFrames.<\/p>\n<p>The following table shows the formats the Spark connector supports for reading from and writing to Couchbase operational and Columnar databases:<\/p>\n<table style=\"border: 1px solid Gainsboro;\">\n<tbody>\n<tr style=\"border: 1px solid Gainsboro;\">\n<td style=\"border: 1px solid Gainsboro; width: 15%;\"><\/td>\n<td style=\"border: 1px solid Gainsboro;\"><b>Couchbase\/Capella operational 
database<\/b><\/td>\n<td style=\"border: 1px solid Gainsboro;\"><b>Capella Columnar database<\/b><\/td>\n<\/tr>\n<tr style=\"border: 1px solid Gainsboro;\">\n<td style=\"border: 1px solid Gainsboro; background-color: white;\">Read operations<\/td>\n<td style=\"border: 1px solid Gainsboro; background-color: white;\"><code>read.format(\"couchbase.query\")<\/code><\/td>\n<td style=\"border: 1px solid Gainsboro; background-color: white;\"><code>read.format(\"couchbase.columnar\")<\/code><\/td>\n<\/tr>\n<tr style=\"border: 1px solid Gainsboro;\">\n<td style=\"border: 1px solid Gainsboro;\">Write operations<\/td>\n<td style=\"border: 1px solid Gainsboro;\"><code>write.format(\"couchbase.kv\")<\/code> (recommended; uses the Data service)<br \/>\n<code>write.format(\"couchbase.query\")<\/code><\/td>\n<td style=\"border: 1px solid Gainsboro;\"><code>write.format(\"couchbase.columnar\")<\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>Reading from Couchbase with a Query DataFrame<\/h3>\n<p>The Couchbase Spark Connector allows you to load data from a Couchbase bucket as a Spark DataFrame via SQL++ queries. Using the DataFrame reader with format <code>couchbase.query<\/code>, you can specify a bucket (and scope\/collection) and optional query parameters. 
For example, to read all documents from a collection or a subset defined by a filter:<\/p>\n<pre class=\"nums:false lang:default decode:true\"># Read all documents from a Couchbase collection using the Query service\r\nairlines_df = spark.read.format(\"couchbase.query\") \\\r\n    .option(\"bucket\", \"travel-sample\") \\\r\n    .option(\"scope\", \"inventory\") \\\r\n    .option(\"collection\", \"airline\") \\\r\n    .load()\r\n\r\n# Example: filter the DataFrame using Spark (will push down to Couchbase where possible)\r\nusa_airlines_df = airlines_df.filter(\"country = 'United States'\")\r\nusa_airlines_df.show(5)<\/pre>\n<p>In this example, <code>airlines_df<\/code> loads all documents from the <code>travel-sample.inventory.airline<\/code> collection into a Spark DataFrame. We then apply a filter to find airlines based in the United States. The connector will attempt to <a href=\"https:\/\/docs.couchbase.com\/spark-connector\/current\/spark-sql.html#aggregate-push-down\" target=\"_blank\" rel=\"noopener\">push down<\/a> filters to Couchbase so that unnecessary data isn\u2019t transferred (i.e., it will include the <code>WHERE country = 'United States'<\/code> clause in the SQL++ query it runs, if possible). The result, <code>usa_airlines_df<\/code>, can be used like any other DataFrame in Spark (for example, you could join it with other DataFrames, apply aggregations, etc.).<\/p>\n<p>Under the hood, the <a href=\"https:\/\/docs.couchbase.com\/spark-connector\/current\/spark-sql.html#_dataframe_partitioning\" target=\"_blank\" rel=\"noopener\">connector partitions<\/a> the query results into multiple tasks if configured (more on this in <i>Performance Tuning<\/i> below), and uses Couchbase\u2019s Query service (powered by the SQL++ engine) to retrieve the data. Each Spark partition corresponds to a subset of data retrieved by an equivalent SQL++ query. 
This allows parallel reads from Couchbase, leveraging the distributed nature of both Spark and Couchbase.<\/p>\n<h3>Writing to Couchbase with Key-Value (KV) operations (recommended)<\/h3>\n<p>The connector also supports writing data to Couchbase, either via the <b>Data service (KV)<\/b> or via the Query service (executing SQL++ <code>INSERT\/UPSERT<\/code> commands for you). The <b>recommended<\/b> way for most use cases is to use the <b>Key-Value data source<\/b> (<code>format(\"couchbase.kv\")<\/code>) for <a href=\"https:\/\/docs.couchbase.com\/spark-connector\/current\/spark-sql.html#_dataframe_persistence\" target=\"_blank\" rel=\"noopener\">better performance<\/a>. In key-value mode, each Spark task will write documents directly to Couchbase data nodes.<\/p>\n<p>When writing a DataFrame to Couchbase, you must ensure there is a unique ID for each document (since Couchbase requires a document ID). By default, the connector looks for a column named <code>__META_ID<\/code> (or <code>META_ID<\/code> in newer versions) in the DataFrame for the document ID. You can also specify a custom ID field via the <code>idFieldName<\/code> option.<\/p>\n<p>For example, suppose we have a Spark DataFrame <code>new_airlines_df<\/code> that we want to write to Couchbase. 
It has a column <code>airline_id<\/code> that should serve as the Couchbase document key, and the rest of the columns are the document content:<\/p>\n<pre class=\"nums:false lang:default decode:true\"># Assume new_airlines_df is a DataFrame we want to write to Couchbase\r\n# It contains an \"airline_id\" column to use as the document ID.\r\nnew_airlines_df.write.format(\"couchbase.kv\") \\\r\n    .option(\"bucket\", \"mybucket\") \\\r\n    .option(\"scope\", \"myscope\") \\\r\n    .option(\"collection\", \"airlines\") \\\r\n    .option(\"idFieldName\", \"airline_id\") \\\r\n    .save()<\/pre>\n<h3>Writing to Couchbase with Query (SQL++) operations<\/h3>\n<p>While we recommend the Data service (KV) approach above because it is typically faster than the Query service, you can also write via the Query service by using <code>format(\"couchbase.query\")<\/code> on write. This will internally execute SQL++ UPSERT statements for each row. This may be useful if you need to leverage a SQL++ feature (for example, server-side transformations), but for straightforward inserts\/updates, the KV approach is more efficient.<\/p>\n<pre class=\"nums:false lang:default decode:true\">df.write.format(\"couchbase.query\") \\\r\n    .option(\"bucket\", \"mybucket\") \\\r\n    .option(\"scope\", \"myscope\") \\\r\n    .option(\"collection\", \"airlines\") \\\r\n    .mode(\"overwrite\") \\\r\n    .save()<\/pre>\n<p>In the next section, let us modify these basic read\/write cases for Couchbase\u2019s latest analytics product &#8211; Capella Columnar.<\/p>\n<h2>PySpark support for Capella Columnar<\/h2>\n<p>One of the key new features in the Couchbase Spark Connector GA is Capella Columnar support. 
Capella Columnar is a JSON-native analytical database service in Couchbase Capella that stores data in a column-oriented format for high-performance analytics.<\/p>\n<h3>Reading Columnar-Formatted Data with PySpark<\/h3>\n<p>Reading data from a Couchbase Capella Columnar cluster in PySpark is similar to reading from a Couchbase operational cluster, with three changes:<\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\">Use <code>format(\"couchbase.columnar\")<\/code> to specify that the connection is for the Columnar service.<\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\">Retrieve the connection string for Columnar from the Capella UI.<\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\">Specify which dataset to load by providing the database, scope, and collection names (analogous to bucket\/scope\/collection in Couchbase) as options.<\/li>\n<\/ol>\n<p>Once Spark is configured, you can use the Spark DataFrame reader API to load data from the columnar service:<\/p>\n<pre class=\"nums:false lang:default decode:true\">from pyspark.sql import SparkSession\r\n# Initialize SparkSession with Couchbase configs (assuming connector jar is available)\r\nspark = SparkSession.builder \\\r\n    .appName(\"Couchbase Spark Connector Columnar Example\") \\\r\n    .config(\"spark.couchbase.connectionString\", \"couchbases:\/\/your.columnar.connection.string\") \\\r\n    .config(\"spark.couchbase.username\", \"YourColumnarUsername\") \\\r\n    .config(\"spark.couchbase.password\", \"YourColumnarPassword\") \\\r\n    .getOrCreate()\r\n\r\n# Read a DataFrame from Couchbase Capella Columnar (travel-sample.inventory.airline collection)\r\nairlines_df = spark.read.format(\"couchbase.columnar\") \\\r\n    .option(\"database\", \"travel-sample\") \\\r\n    .option(\"scope\", \"inventory\") 
\\\r\n    .option(\"collection\", \"airline\") \\\r\n    .load()<\/pre>\n<p>In this example, the resulting <code>airlines_df<\/code> is a normal Spark DataFrame; you can inspect it, run transformations, and perform actions like <code>.count()<\/code> or <code>.show()<\/code> as usual. For instance, <code>airlines_df.show(5)<\/code> will print a few airline documents, and <code>airlines_df.count()<\/code> will return the number of documents in the collection. Under the hood, the connector automatically infers a schema for the JSON documents by sampling up to a certain number of records (by default 1000). All fields that consistently appear in the sampled documents become columns in the DataFrame, with appropriate Spark data types.<\/p>\n<p>Note that if your documents have varying schemas, the inference might produce a schema that includes the union of all fields (fields not present in some documents will be null in those rows). In cases where the schema is evolving or you want to restrict which records are considered, you can provide an explicit filter (predicate) to the reader, as described next.<\/p>\n<h3>Querying a Columnar Dataset in Couchbase via Spark<\/h3>\n<p>Often you may not want to load an entire collection, especially if it\u2019s large. You can optimize performance by pushing down filter predicates directly to the Capella Columnar service when loading data, avoiding unnecessary data transfer. Use the <code>filter<\/code> option to apply a SQL++ WHERE clause predicate during the read operation. 
For instance, to load only airlines based in the United States:<\/p>\n<pre class=\"nums:false lang:default decode:true\">usa_airlines_df = spark.read.format(\"couchbase.columnar\") \\\r\n    .option(\"database\", \"travel-sample\") \\\r\n    .option(\"scope\", \"inventory\") \\\r\n    .option(\"collection\", \"airline\") \\\r\n    .option(\"filter\", \"country = 'United States'\") \\\r\n    .load()\r\n\r\nprint(usa_airlines_df.count())  # Only airlines where country = 'United States'<\/pre>\n<p>The connector executes this filter directly at the source, retrieving only relevant documents. You can also push down projections (selecting specific fields) and aggregations in some cases \u2013 the connector will offload <a href=\"https:\/\/docs.couchbase.com\/spark-connector\/current\/spark-sql.html#aggregate-push-down\" target=\"_blank\" rel=\"noopener\">simple aggregates<\/a> like <code>COUNT<\/code>, <code>MIN<\/code>, <code>MAX<\/code>, and <code>SUM<\/code> to the Columnar engine whenever possible, rather than computing them in Spark, for better performance.<\/p>\n<p>Once data is loaded into a DataFrame, you can perform standard <a href=\"https:\/\/docs.couchbase.com\/spark-connector\/current\/columnar.html#spark-sql\" target=\"_blank\" rel=\"noopener\">Spark transformations<\/a>, joins, and aggregations. 
For example, to count airlines per country, you can create a temporary view and run a Spark SQL query on it as follows:<\/p>\n<pre class=\"nums:false lang:default decode:true\">airlines_df.createOrReplaceTempView(\"airlines_view\")\r\nresult_df = spark.sql(\"\"\"\r\n    SELECT country, COUNT(*) AS airline_count\r\n    FROM airlines_view\r\n    GROUP BY country\r\n    ORDER BY airline_count DESC\r\n\"\"\")\r\nresult_df.show(10)<\/pre>\n<p>This query runs entirely within the Spark engine, giving you the flexibility to integrate Couchbase data seamlessly into complex analytical workflows.<\/p>\n<p>Having covered basic reads and writes, let\u2019s move on to how you can tune performance when moving large volumes of data between Couchbase and Spark.<\/p>\n<h2>Performance tuning tips<\/h2>\n<p>To maximize throughput and efficiency when using the Couchbase PySpark Connector, consider the following best practices.<\/p>\n<h3>Tuning your read operations<\/h3>\n<p><b>Use Query Partitioning for parallelism<\/b><br \/>\n(<a href=\"https:\/\/www.couchbase.com\/products\/capella\/\" target=\"_blank\" rel=\"noopener\">Couchbase Capella (DBaaS)<\/a>, <a href=\"https:\/\/www.couchbase.com\/products\/server\/\" target=\"_blank\" rel=\"noopener\">self managed operational database<\/a> or <a href=\"https:\/\/www.couchbase.com\/products\/analytics\/\" target=\"_blank\" rel=\"noopener\">Capella Columnar<\/a>)<\/p>\n<p>When reading via the Query service for an operational or Columnar database, take advantage of the connector\u2019s ability to partition the query results. You can specify a <a href=\"https:\/\/docs.couchbase.com\/spark-connector\/current\/spark-sql.html#_dataframe_partitioning\" target=\"_blank\" rel=\"noopener\">partitionCount<\/a> (and a numeric partitioning field with lower\/upper bounds) for the DataFrame read. 
A good rule of thumb is to set <code>partitionCount<\/code> to <b>at least the total number of query service CPU cores<\/b> available in your Couchbase cluster. This ensures Spark will run multiple queries in parallel, leveraging all query nodes. For example, if your Couchbase cluster\u2019s Query service has 8 cores in total, set <code>partitionCount &gt;= 8<\/code> so that at least 8 parallel SQL++ queries will be issued. This can dramatically increase read throughput by utilizing all query nodes concurrently. Note that you must have enough cores in your Spark cluster as well to run that many parallel queries.<\/p>\n<p><b>Leverage covering indexes for query efficiency<\/b><br \/>\n(<a href=\"https:\/\/www.couchbase.com\/products\/capella\/\" target=\"_blank\" rel=\"noopener\">Couchbase Capella (DBaaS)<\/a>, <a href=\"https:\/\/www.couchbase.com\/products\/server\/\" target=\"_blank\" rel=\"noopener\">self managed operational database<\/a>)<\/p>\n<p>If using SQL++ queries, try to query through <a href=\"https:\/\/docs.couchbase.com\/server\/current\/n1ql\/n1ql-language-reference\/covering-indexes.html\" target=\"_blank\" rel=\"noopener\">covering indexes<\/a> whenever possible. A covering index is an index that includes <i>all<\/i> fields your query needs, so the query can be served entirely from the index without fetching from the data service. Covered queries avoid the extra network hop to fetch full documents, thus <b>delivering better performance<\/b>. Design your Couchbase secondary indexes to include the fields you filter on <i>and<\/i> the fields you return, if feasible. 
This might mean creating specific indexes for your Spark jobs that cover exactly the data needed.<\/p>\n<p><b>Ensure index replicas to avoid bottlenecks<\/b><br \/>\n(<a href=\"https:\/\/www.couchbase.com\/products\/capella\/\" target=\"_blank\" rel=\"noopener\">Couchbase Capella (DBaaS)<\/a>, <a href=\"https:\/\/www.couchbase.com\/products\/server\/\" target=\"_blank\" rel=\"noopener\">self managed operational database<\/a>)<\/p>\n<p>Along with using covering indexes, make sure your indexes are replicated across multiple index nodes. <a href=\"https:\/\/docs.couchbase.com\/server\/current\/learn\/services-and-indexes\/indexes\/index-replication.html#index-replication\" target=\"_blank\" rel=\"noopener\">Index replication<\/a> not only provides high availability, but also allows queries to be <b>load-balanced across index copies on different nodes for higher throughput<\/b>. In practice, if you have (for example) 3 index nodes, replicating important indexes across them means the Spark connector\u2019s parallel queries can hit different index nodes rather than all pounding a single node.<\/p>\n<h3>Tuning your write operations<\/h3>\n<p><b>Prefer the Data service for bulk writes<\/b><br \/>\n(<a href=\"https:\/\/www.couchbase.com\/products\/capella\/\" target=\"_blank\" rel=\"noopener\">Couchbase Capella (DBaaS)<\/a>, <a href=\"https:\/\/www.couchbase.com\/products\/server\/\" target=\"_blank\" rel=\"noopener\">self managed operational database<\/a>)<\/p>\n<p>We recommend using the key-value data source (<a href=\"https:\/\/docs.couchbase.com\/spark-connector\/current\/spark-sql.html#_dataframe_persistence\" target=\"_blank\" rel=\"noopener\">Data service<\/a>) rather than the Query service for write operations. Writing through the Data service (direct KV upserts) is typically <b>several times faster<\/b> than doing SQL++-based inserts. In fact, internal benchmarks have shown writing via KV can be around <i>3x faster<\/i> than using SQL++ in Spark jobs. 
This is because the Data service can ingest documents in parallel directly to the nodes responsible, with lower latency per operation. Note that indexes are updated asynchronously after KV writes, so newly written documents may take a short time to become visible to index-backed queries.<\/p>\n<p><b>Increase write partitions for Query service writes<\/b><br \/>\n(<a href=\"https:\/\/www.couchbase.com\/products\/capella\/\" target=\"_blank\" rel=\"noopener\">Couchbase Capella (DBaaS)<\/a>, <a href=\"https:\/\/www.couchbase.com\/products\/server\/\" target=\"_blank\" rel=\"noopener\">self managed operational database<\/a>)<\/p>\n<p>While not recommended, if you decide to use <code>couchbase.query<\/code> for writing (for example, to perform server-side transformations while writing), optimize performance by using a high number of write partitions. You can repartition your DataFrame before writing so that Spark runs many concurrent write tasks. A rough guideline is to use on the order of <b>hundreds of partitions<\/b> for large-scale writes via SQL++. For instance, using about <i>128 partitions per Query node CPU<\/i> is a starting point some users have found effective. This means if you have 8 query cores, try ~1024 partitions. The idea is to flood the query service with enough parallel UPSERT statements to maximize throughput. Be cautious and find the right balance for your cluster \u2013 too high concurrency could overload the query service. Monitor Couchbase\u2019s query throughput and adjust accordingly.<\/p>\n<p>By following these tuning tips \u2013 aligning partition counts with cluster resources, indexing smartly, and choosing the right service for the job \u2013 you can achieve optimal performance for Couchbase-Spark integration. 
Keep an eye on both Spark\u2019s job metrics and Couchbase\u2019s performance stats (available in the Couchbase UI and logs) to identify any bottlenecks (e.g., if one query node is doing all the work, or if the network is saturated) and adjust the configuration as needed.<\/p>\n<h2>Community and support<\/h2>\n<p>Couchbase PySpark support is built on the Couchbase Spark Connector, which <a href=\"https:\/\/github.com\/couchbase\/couchbase-spark-connector\" target=\"_blank\" rel=\"noopener\">is open source<\/a>, and we encourage you to contribute, provide feedback, and join the conversation. You can access our comprehensive <a href=\"https:\/\/docs.couchbase.com\/spark-connector\/current\/pyspark.html\" target=\"_blank\" rel=\"noopener\">documentation<\/a>, or join the <a href=\"https:\/\/www.couchbase.com\/forums\/\" target=\"_blank\" rel=\"noopener\">Couchbase Forums<\/a> or <a href=\"https:\/\/discord.com\/invite\/K7NPMPGrPk\" target=\"_blank\" rel=\"noopener\">Couchbase Discord<\/a>.<\/p>\n<h2>Further reading<\/h2>\n<p>For more information and detailed documentation, please refer to the official <a href=\"https:\/\/docs.couchbase.com\/spark-connector\/current\/index.html\" target=\"_blank\" rel=\"noopener\">Couchbase Spark Connector documentation<\/a> and the relevant section on <a href=\"https:\/\/docs.couchbase.com\/spark-connector\/current\/pyspark.html\" target=\"_blank\" rel=\"noopener\">PySpark<\/a>:<\/p>\n<ul>\n<li style=\"list-style-type: none;\">\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><a href=\"https:\/\/docs.couchbase.com\/spark-connector\/current\/pyspark.html\" target=\"_blank\" rel=\"noopener\">Couchbase PySpark Documentation<\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><a href=\"https:\/\/github.com\/couchbase\/couchbase-spark-connector\" target=\"_blank\" rel=\"noopener\">Couchbase Spark Connector GitHub Repository<\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><a 
href=\"https:\/\/www.couchbase.com\/forums\/tag\/spark\" target=\"_blank\" rel=\"noopener\">Couchbase Forums (Spark Connector section)<\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><a href=\"https:\/\/github.com\/couchbase\/couchbase-spark-connector\/tree\/master\/src\/test\/pyspark\/examples\/basic\" target=\"_blank\" rel=\"noopener\">Couchbase PySpark Connector Jupyter Notebook Example<\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><a href=\"https:\/\/github.com\/couchbase\/couchbase-spark-connector\/blob\/master\/src\/test\/pyspark\/examples\/ml\/pyspark_ml_example_hotel_cancellations.ipynb\" target=\"_blank\" rel=\"noopener\">Couchbase PySpark ML Example Jupyter Notebook: Hotel Cancellations<\/a><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>Happy coding!<\/p>\n<p>The Couchbase Team<\/p>\n<p><br style=\"font-weight: 400;\" \/><br style=\"font-weight: 400;\" \/><\/p>\n","protected":false},"excerpt":{"rendered":"<p>We are excited to announce the General Availability (GA) of the Python support for Couchbase Spark Connector, bringing first-class integration between Couchbase Server and Apache Spark to Python data engineers\u200b. 
This GA release means the connector is production-ready and fully [&hellip;]<\/p>\n","protected":false},"author":85357,"featured_media":17013,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"inline_featured_image":false,"footnotes":""},"categories":[1815,10129,2242,2294,2225,1816,10133,9417,9139,9141,1812],"tags":[10105,10104],"ppma_author":[9987],"class_list":["post-17010","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-best-practices-and-tutorials","category-columnar","category-connectors","category-analytics","category-cloud","category-couchbase-server","category-engineering","category-performance","category-python","category-scala","category-n1ql-query","tag-data-engineering","tag-pyspark"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v27.3 (Yoast SEO v27.3) - https:\/\/yoast.com\/product\/yoast-seo-premium-wordpress\/ -->\n<title>Build Highly Scalable AI\/ML Applications With Couchbase and PySpark - The Couchbase Blog<\/title>\n<meta name=\"description\" content=\"Couchbase Spark Connector now supports PySpark! Build fast, scalable data pipelines and ML apps with Couchbase and Apache Spark in Python.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.couchbase.com\/blog\/ko\/pyspark-ga-couchbase-spark-connector\/\" \/>\n<meta property=\"og:locale\" content=\"ko_KR\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Build Highly Scalable AI\/ML Applications With Couchbase and PySpark\" \/>\n<meta property=\"og:description\" content=\"Couchbase Spark Connector now supports PySpark! 
Build fast, scalable data pipelines and ML apps with Couchbase and Apache Spark in Python.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.couchbase.com\/blog\/ko\/pyspark-ga-couchbase-spark-connector\/\" \/>\n<meta property=\"og:site_name\" content=\"The Couchbase Blog\" \/>\n<meta property=\"article:published_time\" content=\"2025-04-07T17:48:50+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-07-08T16:15:31+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2025\/04\/Pyspark-1.png\" \/>\n\t<meta property=\"og:image:width\" content=\"2560\" \/>\n\t<meta property=\"og:image:height\" content=\"1340\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Vishal Dhiman, Sr. Product Manager\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Vishal Dhiman, Sr. Product Manager\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"12\ubd84\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/pyspark-ga-couchbase-spark-connector\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/pyspark-ga-couchbase-spark-connector\\\/\"},\"author\":{\"name\":\"Vishal Dhiman, Sr. 
Product Manager\",\"@id\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/#\\\/schema\\\/person\\\/853c7ac2867fb9e801ff769321364961\"},\"headline\":\"Build Highly Scalable AI\\\/ML Applications With Couchbase and PySpark\",\"datePublished\":\"2025-04-07T17:48:50+00:00\",\"dateModified\":\"2025-07-08T16:15:31+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/pyspark-ga-couchbase-spark-connector\\\/\"},\"wordCount\":2546,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/pyspark-ga-couchbase-spark-connector\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/wp-content\\\/uploads\\\/sites\\\/1\\\/2025\\\/04\\\/Pyspark-1.png\",\"keywords\":[\"data engineering\",\"PySpark\"],\"articleSection\":[\"Best Practices and Tutorials\",\"Columnar\",\"Connectors\",\"Couchbase Analytics\",\"Couchbase Capella\",\"Couchbase Server\",\"Engineering\",\"High Performance\",\"Python\",\"Scala\",\"SQL++ \\\/ N1QL Query\"],\"inLanguage\":\"ko-KR\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/pyspark-ga-couchbase-spark-connector\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/pyspark-ga-couchbase-spark-connector\\\/\",\"url\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/pyspark-ga-couchbase-spark-connector\\\/\",\"name\":\"Build Highly Scalable AI\\\/ML Applications With Couchbase and PySpark - The Couchbase 
Blog\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/pyspark-ga-couchbase-spark-connector\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/pyspark-ga-couchbase-spark-connector\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/wp-content\\\/uploads\\\/sites\\\/1\\\/2025\\\/04\\\/Pyspark-1.png\",\"datePublished\":\"2025-04-07T17:48:50+00:00\",\"dateModified\":\"2025-07-08T16:15:31+00:00\",\"description\":\"Couchbase Spark Connector now supports PySpark! Build fast, scalable data pipelines and ML apps with Couchbase and Apache Spark in Python.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/pyspark-ga-couchbase-spark-connector\\\/#breadcrumb\"},\"inLanguage\":\"ko-KR\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/pyspark-ga-couchbase-spark-connector\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/pyspark-ga-couchbase-spark-connector\\\/#primaryimage\",\"url\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/wp-content\\\/uploads\\\/sites\\\/1\\\/2025\\\/04\\\/Pyspark-1.png\",\"contentUrl\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/wp-content\\\/uploads\\\/sites\\\/1\\\/2025\\\/04\\\/Pyspark-1.png\",\"width\":2560,\"height\":1340,\"caption\":\"Couchbase PySpark connector released\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/pyspark-ga-couchbase-spark-connector\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Build Highly Scalable AI\\\/ML Applications With Couchbase and 
PySpark\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/#website\",\"url\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/\",\"name\":\"The Couchbase Blog\",\"description\":\"Couchbase, the NoSQL Database\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"ko-KR\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/#organization\",\"name\":\"The Couchbase Blog\",\"url\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/wp-content\\\/uploads\\\/2023\\\/04\\\/admin-logo.png\",\"contentUrl\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/wp-content\\\/uploads\\\/2023\\\/04\\\/admin-logo.png\",\"width\":218,\"height\":34,\"caption\":\"The Couchbase Blog\"},\"image\":{\"@id\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/#\\\/schema\\\/person\\\/853c7ac2867fb9e801ff769321364961\",\"name\":\"Vishal Dhiman, Sr. 
Product Manager\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/wp-content\\\/uploads\\\/sites\\\/1\\\/2024\\\/09\\\/vishal-dhiman-couchbase.jpg58e586f8e4645cc672ef6f140799b4b3\",\"url\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/wp-content\\\/uploads\\\/sites\\\/1\\\/2024\\\/09\\\/vishal-dhiman-couchbase.jpg\",\"contentUrl\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/wp-content\\\/uploads\\\/sites\\\/1\\\/2024\\\/09\\\/vishal-dhiman-couchbase.jpg\",\"caption\":\"Vishal Dhiman, Sr. Product Manager\"},\"url\":\"https:\\\/\\\/www.couchbase.com\\\/blog\\\/ko\\\/author\\\/vishald\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Build Highly Scalable AI\/ML Applications With Couchbase and PySpark - The Couchbase Blog","description":"Couchbase Spark \ucee4\ub125\ud130\uac00 \uc774\uc81c PySpark\ub97c \uc9c0\uc6d0\ud569\ub2c8\ub2e4! Python\uc5d0\uc11c Couchbase\uc640 Apache Spark\ub85c \ube60\ub974\uace0 \ud655\uc7a5 \uac00\ub2a5\ud55c \ub370\uc774\ud130 \ud30c\uc774\ud504\ub77c\uc778\uacfc ML \uc571\uc744 \uad6c\ucd95\ud558\uc138\uc694.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.couchbase.com\/blog\/ko\/pyspark-ga-couchbase-spark-connector\/","og_locale":"ko_KR","og_type":"article","og_title":"Build Highly Scalable AI\/ML Applications With Couchbase and PySpark","og_description":"Couchbase Spark Connector now supports PySpark! 
Build fast, scalable data pipelines and ML apps with Couchbase and Apache Spark in Python.","og_url":"https:\/\/www.couchbase.com\/blog\/ko\/pyspark-ga-couchbase-spark-connector\/","og_site_name":"The Couchbase Blog","article_published_time":"2025-04-07T17:48:50+00:00","article_modified_time":"2025-07-08T16:15:31+00:00","og_image":[{"width":2560,"height":1340,"url":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2025\/04\/Pyspark-1.png","type":"image\/png"}],"author":"Vishal Dhiman, Sr. Product Manager","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Vishal Dhiman, Sr. Product Manager","Est. reading time":"12\ubd84"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.couchbase.com\/blog\/pyspark-ga-couchbase-spark-connector\/#article","isPartOf":{"@id":"https:\/\/www.couchbase.com\/blog\/pyspark-ga-couchbase-spark-connector\/"},"author":{"name":"Vishal Dhiman, Sr. Product Manager","@id":"https:\/\/www.couchbase.com\/blog\/#\/schema\/person\/853c7ac2867fb9e801ff769321364961"},"headline":"Build Highly Scalable AI\/ML Applications With Couchbase and PySpark","datePublished":"2025-04-07T17:48:50+00:00","dateModified":"2025-07-08T16:15:31+00:00","mainEntityOfPage":{"@id":"https:\/\/www.couchbase.com\/blog\/pyspark-ga-couchbase-spark-connector\/"},"wordCount":2546,"commentCount":0,"publisher":{"@id":"https:\/\/www.couchbase.com\/blog\/#organization"},"image":{"@id":"https:\/\/www.couchbase.com\/blog\/pyspark-ga-couchbase-spark-connector\/#primaryimage"},"thumbnailUrl":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2025\/04\/Pyspark-1.png","keywords":["data engineering","PySpark"],"articleSection":["Best Practices and Tutorials","Columnar","Connectors","Couchbase Analytics","Couchbase Capella","Couchbase Server","Engineering","High Performance","Python","Scala","SQL++ \/ N1QL 
Query"],"inLanguage":"ko-KR","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.couchbase.com\/blog\/pyspark-ga-couchbase-spark-connector\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.couchbase.com\/blog\/pyspark-ga-couchbase-spark-connector\/","url":"https:\/\/www.couchbase.com\/blog\/pyspark-ga-couchbase-spark-connector\/","name":"Build Highly Scalable AI\/ML Applications With Couchbase and PySpark - The Couchbase Blog","isPartOf":{"@id":"https:\/\/www.couchbase.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.couchbase.com\/blog\/pyspark-ga-couchbase-spark-connector\/#primaryimage"},"image":{"@id":"https:\/\/www.couchbase.com\/blog\/pyspark-ga-couchbase-spark-connector\/#primaryimage"},"thumbnailUrl":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2025\/04\/Pyspark-1.png","datePublished":"2025-04-07T17:48:50+00:00","dateModified":"2025-07-08T16:15:31+00:00","description":"Couchbase Spark \ucee4\ub125\ud130\uac00 \uc774\uc81c PySpark\ub97c \uc9c0\uc6d0\ud569\ub2c8\ub2e4! 
Python\uc5d0\uc11c Couchbase\uc640 Apache Spark\ub85c \ube60\ub974\uace0 \ud655\uc7a5 \uac00\ub2a5\ud55c \ub370\uc774\ud130 \ud30c\uc774\ud504\ub77c\uc778\uacfc ML \uc571\uc744 \uad6c\ucd95\ud558\uc138\uc694.","breadcrumb":{"@id":"https:\/\/www.couchbase.com\/blog\/pyspark-ga-couchbase-spark-connector\/#breadcrumb"},"inLanguage":"ko-KR","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.couchbase.com\/blog\/pyspark-ga-couchbase-spark-connector\/"]}]},{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/www.couchbase.com\/blog\/pyspark-ga-couchbase-spark-connector\/#primaryimage","url":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2025\/04\/Pyspark-1.png","contentUrl":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2025\/04\/Pyspark-1.png","width":2560,"height":1340,"caption":"Couchbase PySpark connector released"},{"@type":"BreadcrumbList","@id":"https:\/\/www.couchbase.com\/blog\/pyspark-ga-couchbase-spark-connector\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.couchbase.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Build Highly Scalable AI\/ML Applications With Couchbase and PySpark"}]},{"@type":"WebSite","@id":"https:\/\/www.couchbase.com\/blog\/#website","url":"https:\/\/www.couchbase.com\/blog\/","name":"\uce74\uc6b0\uce58\ubca0\uc774\uc2a4 \ube14\ub85c\uadf8","description":"NoSQL \ub370\uc774\ud130\ubca0\uc774\uc2a4, Couchbase","publisher":{"@id":"https:\/\/www.couchbase.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.couchbase.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"ko-KR"},{"@type":"Organization","@id":"https:\/\/www.couchbase.com\/blog\/#organization","name":"\uce74\uc6b0\uce58\ubca0\uc774\uc2a4 
\ube14\ub85c\uadf8","url":"https:\/\/www.couchbase.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/www.couchbase.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2023\/04\/admin-logo.png","contentUrl":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2023\/04\/admin-logo.png","width":218,"height":34,"caption":"The Couchbase Blog"},"image":{"@id":"https:\/\/www.couchbase.com\/blog\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/www.couchbase.com\/blog\/#\/schema\/person\/853c7ac2867fb9e801ff769321364961","name":"\ube44\uc0ec \ub514\ub9cc, Sr. \uc81c\ud488 \uad00\ub9ac\uc790","image":{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/09\/vishal-dhiman-couchbase.jpg58e586f8e4645cc672ef6f140799b4b3","url":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/09\/vishal-dhiman-couchbase.jpg","contentUrl":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/09\/vishal-dhiman-couchbase.jpg","caption":"Vishal Dhiman, Sr. Product Manager"},"url":"https:\/\/www.couchbase.com\/blog\/ko\/author\/vishald\/"}]}},"acf":[],"authors":[{"term_id":9987,"user_id":85357,"is_guest":0,"slug":"vishald","display_name":"Vishal Dhiman, Sr. 
Product Manager","avatar_url":{"url":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/09\/vishal-dhiman-couchbase.jpg","url2x":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/09\/vishal-dhiman-couchbase.jpg"},"0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.couchbase.com\/blog\/ko\/wp-json\/wp\/v2\/posts\/17010","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.couchbase.com\/blog\/ko\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.couchbase.com\/blog\/ko\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/ko\/wp-json\/wp\/v2\/users\/85357"}],"replies":[{"embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/ko\/wp-json\/wp\/v2\/comments?post=17010"}],"version-history":[{"count":0,"href":"https:\/\/www.couchbase.com\/blog\/ko\/wp-json\/wp\/v2\/posts\/17010\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/ko\/wp-json\/wp\/v2\/media\/17013"}],"wp:attachment":[{"href":"https:\/\/www.couchbase.com\/blog\/ko\/wp-json\/wp\/v2\/media?parent=17010"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/ko\/wp-json\/wp\/v2\/categories?post=17010"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/ko\/wp-json\/wp\/v2\/tags?post=17010"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/ko\/wp-json\/wp\/v2\/ppma_author?post=17010"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}