A summary of different file systems in Databricks.
DBFS (Databricks File System) is a distributed file system that maps cloud object storage to a file-system interface for ease of use.
There are two different styles of notation when working with file paths in DBFS.
The first is the Spark API format: dbfs:/some/path/.
The second is the file API format: /dbfs/some/path/.
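The two notations differ only in their prefix, so converting between them is mechanical. Below is a minimal sketch of that mapping in Python; the helper names are my own for illustration, not part of any Databricks API:

```python
def spark_to_file_api(path: str) -> str:
    """Map a Spark API path (dbfs:/...) to the file API (FUSE) form (/dbfs/...)."""
    if not path.startswith('dbfs:/'):
        raise ValueError(f'not a Spark API path: {path}')
    return '/dbfs/' + path[len('dbfs:/'):]


def file_api_to_spark(path: str) -> str:
    """Map a file API path (/dbfs/...) back to the Spark API form (dbfs:/...)."""
    if not path.startswith('/dbfs/'):
        raise ValueError(f'not a file API path: {path}')
    return 'dbfs:/' + path[len('/dbfs/'):]
```

The same file is reachable through both forms; only the interface you hand the path to differs.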
We use the Spark API format when:
- reading or writing data with Spark:
  spark.write.format('csv').save('dbfs:/path/to/my-csv')
  spark.read.load('dbfs:/path/to/my-csv')
- using the dbutils.fs interface:
  dbutils.fs.ls('dbfs:/some/path')
- using the %fs magic for DBFS operations:
  %fs ls dbfs:/path/to/file
- querying files directly in SQL:
  SELECT * FROM JSON.`dbfs:/FileStore/shared_uploads/my-json-table`

ref:
We use the file API format when:
- using shell commands:
  %sh ls -l /dbfs/path/to/file
- opening files from Python:
  f = open('/dbfs/path/to/file')

When we want to refer to a file path on the local driver node in bash or Python code, we just use the path as is:
%sh ls /usr/include/zlib.h
f = open('/usr/include/zlib.h')

However, if we want to access the local driver file system through the dbutils.fs interface,
we need to use the file:/path/to/file notation:
dbutils.fs.ls('file:/usr/include/zlib.h')
%fs ls file:/usr/include/zlib.h

We can also move a file from the local file system to DBFS:

dbutils.fs.mv("file:/tmp/some_file.csv", "dbfs:/tmp/my_file.csv")

ref:
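Putting the prefixes together, here is a small sketch that classifies a path by which notation it uses, following the rules above. This is an illustrative helper of my own, not a Databricks API:

```python
def path_notation(path: str) -> str:
    """Classify a Databricks path string by its prefix, per the rules summarized above."""
    if path.startswith('dbfs:/'):
        return 'spark-api'   # Spark reads/writes, dbutils.fs, %fs, SQL
    if path.startswith('/dbfs/'):
        return 'file-api'    # %sh and Python open() via the FUSE mount
    if path.startswith('file:/'):
        return 'local-uri'   # dbutils.fs / %fs acting on driver-local files
    return 'local'           # plain driver-local path for %sh and Python
```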
To read a file under a workspace folder with Spark, we need to use the full path with the file: notation.
For example:

df = spark.read.csv('file:/some/csv/file/path/under/workspace', header=True)

ref:
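Since Spark needs the file: prefix for workspace (driver-local) paths, building that URI can be captured in a tiny helper. The function name and the example path below are hypothetical, shown only to make the rule concrete:

```python
def local_uri(path: str) -> str:
    """Prefix an absolute driver-local path with file: so Spark reads it from the local file system."""
    if not path.startswith('/'):
        raise ValueError(f'expected an absolute path, got: {path}')
    return 'file:' + path
```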