Databricks

Databricks is a unified data analytics platform built on Apache Spark that supports data engineering, machine learning, and collaborative data science at scale.

Files.com integrates with Databricks through the Files.com S3-Compatible Endpoint. Databricks connects to your Files.com site the same way it connects to any S3-compatible bucket and loads data directly from Files.com folders. Databricks supports this through its connect to cloud storage feature, which lets you use your Files.com site as external cloud storage.

Configuring Files.com

Your Files.com site includes an S3-Compatible Endpoint. Generate an S3-compatible API Key to create the access key ID and secret access key that Databricks needs. The API Key must be associated with a user account on your site.

Configure the Files.com folder permissions for the folder you want Databricks to access so that the user associated with the access key has access.

Verify that the user can access the folder and see the expected files before configuring Databricks. The user must have the correct permissions on the folder, including "Read" permissions at a minimum.

Configuring Databricks

Refer to the Databricks documentation on connecting to S3-compatible cloud storage for details on how to access your Files.com site from Databricks.

Databricks supports path-style URL addressing. You can use either default or your custom subdomain as the bucket name for your Files.com site. For example, if your site is mysite.files.com, the custom subdomain is mysite, so the bucket name is mysite.

With path-style addressing, the endpoint URLs for S3-compatible connections to your Files.com site are:

  • https://s3-<your_custom_subdomain>.files.com
  • https://s3.files.com
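
The two endpoint forms and the bucket naming rule above can be sketched as a pair of small helpers. This is only an illustration: the function names s3_endpoint and bucket_name are hypothetical, and treating a missing subdomain as the generic endpoint with the default bucket is an assumption made for the example.

```python
from typing import Optional


def s3_endpoint(subdomain: Optional[str] = None) -> str:
    """Return the S3-compatible endpoint URL for a Files.com site.

    With a custom subdomain, use the subdomain-specific form;
    otherwise fall back to the generic endpoint (assumption for this sketch).
    """
    if subdomain:
        return f"https://s3-{subdomain}.files.com"
    return "https://s3.files.com"


def bucket_name(subdomain: Optional[str] = None) -> str:
    """Return the bucket name: the custom subdomain, or "default"."""
    return subdomain or "default"


if __name__ == "__main__":
    # "mysite" is the example subdomain used in this article.
    print(s3_endpoint("mysite"), bucket_name("mysite"))
    print(s3_endpoint(), bucket_name())
```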

Use your Files.com S3-Compatible Endpoint credentials when configuring your cluster's Spark configuration. For example:

spark.hadoop.fs.s3a.endpoint    https://s3-mysite.files.com
spark.hadoop.fs.s3a.access.key    YOUR_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key    YOUR_SECRET_KEY
spark.hadoop.fs.s3a.path.style.access    true
spark.hadoop.fs.s3a.connection.ssl.enabled    true
spark.hadoop.fs.s3a.impl    org.apache.hadoop.fs.s3a.S3AFileSystem
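
If you manage several clusters, it may help to generate these lines rather than type them by hand. The following is a minimal sketch that builds the same six properties shown above and renders them as key/value lines. The helper names files_com_spark_conf and render_spark_conf are hypothetical, and the credential strings are placeholders.

```python
def files_com_spark_conf(endpoint: str, access_key: str, secret_key: str) -> dict:
    """Build the s3a properties from this article for one Files.com endpoint."""
    if not endpoint.startswith("https://"):
        # The example configuration enables SSL, so expect an https endpoint.
        raise ValueError("endpoint must be an https:// URL")
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": "true",
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }


def render_spark_conf(conf: dict) -> str:
    """Render properties as one 'key value' pair per line."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


if __name__ == "__main__":
    conf = files_com_spark_conf(
        "https://s3-mysite.files.com", "YOUR_ACCESS_KEY", "YOUR_SECRET_KEY"
    )
    print(render_spark_conf(conf))
```

The rendered text is intended to be pasted into the cluster's Spark configuration. Avoid committing real keys to source control alongside a script like this.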

Data on your Files.com site can then be accessed using the Databricks s3a connector:

s3a://mysite/path/to/folder/myfile.csv

or

s3a://default/path/to/folder/myfile.csv
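
As a rough illustration of how these paths are put together, the sketch below joins a bucket name and a folder path into an s3a URI. The helper name s3a_uri is hypothetical, and stripping leading slashes from the path is a convenience chosen for this example.

```python
def s3a_uri(bucket: str, path: str) -> str:
    """Join a bucket name and a Files.com path into an s3a:// URI."""
    if not bucket:
        raise ValueError("bucket name is required (your subdomain or 'default')")
    # Drop leading slashes so the result never contains a doubled separator.
    return f"s3a://{bucket}/{path.lstrip('/')}"


if __name__ == "__main__":
    print(s3a_uri("mysite", "path/to/folder/myfile.csv"))
    print(s3a_uri("default", "/path/to/folder/myfile.csv"))
```

In a Databricks notebook, the resulting string could then be passed to a reader such as spark.read.csv, assuming the cluster has already been configured as described above.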

Verify that Databricks can successfully load and unload data files from your Files.com folder before putting the integration into production.

Troubleshooting Databricks Connections

Most issues come from incorrect access permissions. Confirm that the S3-Compatible Endpoint access key Databricks is using, and its associated user, have the correct permissions on the target folder, including "Read" permissions at a minimum. Make sure that the permissions of the key and its user have not changed since they were first set up.