The Resource class is arguably the most important class of the whole Frictionless Framework. It's based on the Data Resource Standard and the Tabular Data Resource Standard.
Let's create a data resource:
from frictionless import Resource
resource = Resource('table.csv') # from a resource path
resource = Resource('resource.json') # from a descriptor path
resource = Resource({'path': 'table.csv'}) # from a descriptor
resource = Resource(path='table.csv') # from arguments
As you can see, it's possible to create a resource from different kinds of sources, whose type will be detected automatically (e.g. whether it's a descriptor or a path). It's possible to make this step more explicit:
from frictionless import Resource
resource = Resource(path='data/table.csv') # from a path
resource = Resource('data/resource.json') # from a descriptor
The standards support a great deal of resource metadata, all of which is available in Frictionless Framework as well:
from frictionless import Resource
resource = Resource(
    name='resource',
    title='My Resource',
    description='My Resource for the Guide',
    path='table.csv',
    # it's possible to provide all the official properties like mediatype, etc
)
print(resource)
{'name': 'resource',
'type': 'table',
'title': 'My Resource',
'description': 'My Resource for the Guide',
'path': 'table.csv',
'scheme': 'file',
'format': 'csv',
'mediatype': 'text/csv'}
If you have created a resource, for example, from a descriptor, you can access these properties:
from frictionless import Resource
resource = Resource('resource.json')
print(resource.name)
# and others
name
And edit them:
from frictionless import Resource
resource = Resource('resource.json')
resource.name = 'new-name'
resource.title = 'New Title'
resource.description = 'New Description'
# and others
print(resource)
{'name': 'new-name',
'type': 'table',
'title': 'New Title',
'description': 'New Description',
'path': 'table.csv',
'scheme': 'file',
'format': 'csv',
'mediatype': 'text/csv'}
Like any of the Metadata classes, the Resource class can be saved as JSON or YAML:
from frictionless import Resource
resource = Resource('table.csv')
resource.to_json('resource.json') # Save as JSON
resource.to_yaml('resource.yaml') # Save as YAML
You might have noticed that we had to duplicate the Resource(...) statement in some examples. The reason is that Resource is a streaming interface: once it's read, you need to open it again. Let's show this in an example:
from pprint import pprint
from frictionless import Resource
resource = Resource('capital-3.csv')
resource.open()
pprint(resource.read_rows())
pprint(resource.read_rows())
# We need to re-open: there is no data left
resource.open()
pprint(resource.read_rows())
# We need to close manually: no context manager is used
resource.close()
[{'id': 1, 'name': 'London'},
{'id': 2, 'name': 'Berlin'},
{'id': 3, 'name': 'Paris'},
{'id': 4, 'name': 'Madrid'},
{'id': 5, 'name': 'Rome'}]
[]
[{'id': 1, 'name': 'London'},
{'id': 2, 'name': 'Berlin'},
{'id': 3, 'name': 'Paris'},
{'id': 4, 'name': 'Madrid'},
{'id': 5, 'name': 'Rome'}]
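The one-shot behaviour above mirrors ordinary Python generators: once a generator is exhausted, it yields nothing until you create (open) it again. A plain-Python sketch of the same semantics, with hypothetical inline rows standing in for a resource:

```python
def row_stream():
    """A generator that, like an opened Resource, can only be consumed once."""
    for row in [{'id': 1, 'name': 'London'}, {'id': 2, 'name': 'Berlin'}]:
        yield row

stream = row_stream()       # "open" the resource
first_pass = list(stream)   # reads all rows
second_pass = list(stream)  # exhausted: no data left
stream = row_stream()       # "re-open" to read again
third_pass = list(stream)

print(first_pass)   # two rows
print(second_pass)  # []
print(third_pass)   # the same two rows again
```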
At the same time, you can read data from a resource without opening and closing it explicitly. In this case, Frictionless Framework opens and closes the resource for you, so it's effectively a one-time operation:
from pprint import pprint
from frictionless import Resource
resource = Resource('capital-3.csv')
pprint(resource.read_rows())
[{'id': 1, 'name': 'London'},
{'id': 2, 'name': 'Berlin'},
{'id': 3, 'name': 'Paris'},
{'id': 4, 'name': 'Madrid'},
{'id': 5, 'name': 'Rome'}]
The Resource class is also a metadata class that provides various read and stream functions. The extract functions always read rows into memory; Resource can do the same, but it also gives you a choice of output form: rows, cells, text, or bytes. Let's try reading all of them:
from frictionless import Resource
resource = Resource('country-3.csv')
pprint(resource.read_bytes())
pprint(resource.read_text())
pprint(resource.read_cells())
pprint(resource.read_rows())
(b'id,capital_id,name,population\n1,1,Britain,67\n2,3,France,67\n3,2,Germany,8'
b'3\n4,5,Italy,60\n5,4,Spain,47\n')
('id,capital_id,name,population\n'
 '1,1,Britain,67\n'
 '2,3,France,67\n'
 '3,2,Germany,83\n'
 '4,5,Italy,60\n'
 '5,4,Spain,47\n')
[['id', 'capital_id', 'name', 'population'],
['1', '1', 'Britain', '67'],
['2', '3', 'France', '67'],
['3', '2', 'Germany', '83'],
['4', '5', 'Italy', '60'],
['5', '4', 'Spain', '47']]
[{'id': 1, 'capital_id': 1, 'name': 'Britain', 'population': 67},
{'id': 2, 'capital_id': 3, 'name': 'France', 'population': 67},
{'id': 3, 'capital_id': 2, 'name': 'Germany', 'population': 83},
{'id': 4, 'capital_id': 5, 'name': 'Italy', 'population': 60},
{'id': 5, 'capital_id': 4, 'name': 'Spain', 'population': 47}]
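The four representations build on one another: bytes are decoded into text, text is parsed into cells, and cells are typed into rows. A plain-Python sketch of that pipeline using only the standard library (the type coercion here is a deliberately naive stand-in for schema-driven typing):

```python
import csv
import io

raw = b'id,capital_id,name,population\n1,1,Britain,67\n2,3,France,67\n'

text = raw.decode('utf-8')                   # bytes -> text
cells = list(csv.reader(io.StringIO(text)))  # text -> cells (all strings)
header, *data = cells
rows = [                                     # cells -> typed rows
    {key: int(value) if value.isdigit() else value
     for key, value in zip(header, record)}
    for record in data
]

print(rows[0])  # {'id': 1, 'capital_id': 1, 'name': 'Britain', 'population': 67}
```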
It's really handy to read all your data into memory, but that's not always possible with very large files. For such cases, Frictionless provides streaming functions:
from pprint import pprint
from frictionless import Resource
with Resource('country-3.csv') as resource:
    pprint(resource.byte_stream)
    pprint(resource.text_stream)
    pprint(resource.cell_stream)
    pprint(resource.row_stream)
    for row in resource.row_stream:
        print(row)
<frictionless.system.loader.ByteStreamWithStatsHandling object at 0x7f4e11e0c8e0>
<_io.TextIOWrapper name='country-3.csv' encoding='utf-8'>
<itertools.chain object at 0x7f4e11e0d900>
<generator object TableResource.__open_row_stream.<locals>.row_stream at 0x7f4e11902c70>
{'id': 1, 'capital_id': 1, 'name': 'Britain', 'population': 67}
{'id': 2, 'capital_id': 3, 'name': 'France', 'population': 67}
{'id': 3, 'capital_id': 2, 'name': 'Germany', 'population': 83}
{'id': 4, 'capital_id': 5, 'name': 'Italy', 'population': 60}
{'id': 5, 'capital_id': 4, 'name': 'Spain', 'population': 47}
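Because a stream yields one row at a time, you can process files larger than memory, or stop early without reading the rest. A plain-Python sketch using itertools.islice on a generator, where the hypothetical rows() helper stands in for resource.row_stream:

```python
from itertools import islice

def rows():
    # Stand-in for resource.row_stream: yields one row at a time, lazily
    for i in range(1_000_000):
        yield {'id': i, 'name': f'city-{i}'}

# Only the first three rows are ever materialized in memory
first_three = list(islice(rows(), 3))
print(first_three)
```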
This functionality has been released as a feature preview and a request for comments. The implementation is raw and doesn't cover many edge cases.
Indexing a resource, in Frictionless terms, means loading a data table into a database. Let's explore how this feature works in different modes.
All the examples use SQLite for simplicity.
This mode is supported for any database supported by sqlalchemy. Under the hood, Frictionless will infer the Table Schema and populate the data table as it normally reads data. This means that type errors are replaced by null values; in general, indexing is guaranteed to finish successfully for any data, even highly invalid data.
frictionless index table.csv --database sqlite:///index/project.db --name table
frictionless extract sqlite:///index/project.db --table table --json
──────────────────────────────────── Index ─────────────────────────────────────
[table] Indexed 3 rows in 0.205 seconds
──────────────────────────────────── Result ────────────────────────────────────
Succesefully indexed 1 tables
{
"project": [
{
"id": 1,
"name": "english"
},
{
"id": 2,
"name": "中国人"
}
]
}
import sqlite3
from frictionless import Resource, formats
resource = Resource('table.csv')
resource.index('sqlite:///index/project.db', name='table')
print(Resource('sqlite:///index/project.db', control=formats.sql.SqlControl(table='table')).extract())
{'project': [{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]}
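Conceptually, normal-mode indexing can be sketched with only the standard library: coerce each cell to its inferred type, and write failed coercions as NULL so the load always finishes. The table name, columns, and data below are illustrative, not Frictionless internals:

```python
import csv
import io
import sqlite3

# Hypothetical CSV with one invalid value in an integer column
data = 'id,population\n1,67\n2,not-a-number\n3,83\n'

def coerce(value):
    # Type errors become NULL instead of aborting the load
    try:
        return int(value)
    except ValueError:
        return None

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE country (id INTEGER, population INTEGER)')
for record in csv.DictReader(io.StringIO(data)):
    con.execute('INSERT INTO country VALUES (?, ?)',
                (coerce(record['id']), coerce(record['population'])))

print(con.execute('SELECT * FROM country').fetchall())
# [(1, 67), (2, None), (3, 83)]
```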
It requires the sqlite3 command to be available.
Fast mode is supported for SQLite and PostgreSQL databases. It infers the Table Schema from a data sample and indexes the data using COPY in PostgreSQL and .import in SQLite. For big data files this mode is 10-30x faster than normal indexing, but the speed comes at a price: if there is invalid data, the indexing will fail.
frictionless index table.csv --database sqlite:///index/project.db --name table --fast
frictionless extract sqlite:///index/project.db --table table --json
──────────────────────────────────── Index ─────────────────────────────────────
[table] Indexed 30 bytes in 0.223 seconds
──────────────────────────────────── Result ────────────────────────────────────
Succesefully indexed 1 tables
{
"project": [
{
"id": 1,
"name": "english"
},
{
"id": 2,
"name": "中国人"
}
]
}
import sqlite3
from frictionless import Resource, formats
resource = Resource('table.csv')
resource.index('sqlite:///index/project.db', name='table', fast=True)
print(Resource('sqlite:///index/project.db', control=formats.sql.SqlControl(table='table')).extract())
{'project': [{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]}
To ensure that the data will be successfully indexed, it's possible to use the fallback option. If fast indexing fails, Frictionless will start over in normal mode and finish the process successfully.
frictionless index table.csv --database sqlite:///index/project.db --name table --fast --fallback
import sqlite3
from frictionless import Resource, formats
resource = Resource('table.csv')
resource.index('sqlite:///index/project.db', name='table', fast=True, fallback=True)
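The fast/fallback trade-off can be sketched in plain Python: the fast path trusts the sample-inferred types and raises on the first bad value, while the fallback coerces cell by cell and always finishes. This is an illustration of the control flow, not the actual COPY/.import implementation:

```python
def fast_load(records):
    # Fast path: trusts the sample-inferred types; any bad value raises
    return [(int(id_), int(population)) for id_, population in records]

def normal_load(records):
    # Normal path: coerces cell by cell, turning type errors into None
    def coerce(value):
        try:
            return int(value)
        except ValueError:
            return None
    return [(coerce(id_), coerce(population)) for id_, population in records]

records = [('1', '67'), ('2', 'not-a-number'), ('3', '83')]
try:
    result = fast_load(records)
except ValueError:
    result = normal_load(records)  # fall back and finish successfully

print(result)  # [(1, 67), (2, None), (3, 83)]
```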
Another option is to provide a path to the QSV binary. In this case, initial schema inference is based on the whole data file, which guarantees that the table is valid type-wise:
frictionless index table.csv --database sqlite:///index/project.db --name table --fast --qsv qsv_path
import sqlite3
from frictionless import Resource, formats
resource = Resource('table.csv')
resource.index('sqlite:///index/project.db', name='table', fast=True, qsv_path='qsv_path')
The scheme, also known as the protocol, indicates which loader Frictionless should use to read or write data. It can be file (default), text, http, https, s3, and others.
from frictionless import Resource
with Resource(b'header1,header2\nvalue1,value2', format='csv') as resource:
    print(resource.scheme)
    print(resource.to_view())
buffer
+----------+----------+
| header1 | header2 |
+==========+==========+
| 'value1' | 'value2' |
+----------+----------+
The format, also known as the extension, helps Frictionless choose the proper parser to handle the file. Popular formats are csv, xlsx, json, and others.
from frictionless import Resource
with Resource(b'header1,header2\nvalue1,value2.csv', format='csv') as resource:
    print(resource.format)
    print(resource.to_view())
csv
+----------+--------------+
| header1 | header2 |
+==========+==============+
| 'value1' | 'value2.csv' |
+----------+--------------+
Frictionless automatically detects the encoding of files, but sometimes detection can be inaccurate. It's possible to provide an encoding manually:
from frictionless import Resource
with Resource('country-3.csv', encoding='utf-8') as resource:
    print(resource.encoding)
    print(resource.path)
utf-8
country-3.csv
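Explicit encoding matters because the same bytes decode differently under different encodings. A plain-Python illustration of the mojibake that a wrong guess produces:

```python
raw = 'Müller'.encode('utf-8')  # 'ü' becomes two bytes in UTF-8

print(raw.decode('utf-8'))    # Müller
print(raw.decode('latin-1'))  # MÃ¼ller: mojibake from the wrong encoding
```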
By default, Frictionless uses the first file found in a zip archive. It's possible to adjust this behaviour:
from frictionless import Resource
with Resource('table-multiple-files.zip', innerpath='table-reverse.csv') as resource:
    print(resource.compression)
    print(resource.innerpath)
    print(resource.to_view())
zip
table-reverse.csv
+----+-----------+
| id | name |
+====+===========+
| 1 | '中国人' |
+----+-----------+
| 2 | 'english' |
+----+-----------+
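The same default-vs-explicit selection can be sketched with the standard library's zipfile module; the archive and its inner file names here are illustrative:

```python
import io
import zipfile

# Build a small archive in memory with two inner files
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('table.csv', 'id,name\n1,english\n')
    archive.writestr('table-reverse.csv', 'id,name\n1,中国人\n')

with zipfile.ZipFile(buffer) as archive:
    default = archive.namelist()[0]  # the "first file" default behaviour
    with archive.open('table-reverse.csv') as inner:  # explicit inner path
        content = inner.read().decode('utf-8')

print(default)
print(content)
```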
It's possible to adjust compression detection by providing the algorithm explicitly. For the example below it's not required as it would be detected anyway:
from frictionless import Resource
with Resource('table.csv.zip', compression='zip') as resource:
    print(resource.compression)
    print(resource.to_view())
zip
+----+-----------+
| id | name |
+====+===========+
| 1 | 'english' |
+----+-----------+
| 2 | '中国人' |
+----+-----------+
Please read Table Dialect Guide for more information.
Please read Table Schema Guide for more information.
Please read Checklist Guide for more information.
Please read Pipeline Guide for more information.
Resource's stats can be accessed with resource.stats:
from frictionless import Resource
resource = Resource('table.csv')
resource.infer(stats=True)
print(resource.stats)
<frictionless.resource.stats.ResourceStats object at 0x7f4e10c7a070>
Resource representation. This class is one of the cornerstones of the Frictionless framework. It loads a data source, and allows you to stream its parsed contents. At the same time, it's a metadata class for data description.

```python
with Resource("data/table.csv") as resource:
    resource.header == ["id", "name"]
    resource.read_rows() == [
        {'id': 1, 'name': 'english'},
        {'id': 2, 'name': '中国人'},
    ]
```
(*, source: Optional[Any] = None, control: Optional[Control] = None, packagify: bool = False, name: Optional[str] = , title: Optional[str] = None, description: Optional[str] = None, homepage: Optional[str] = None, profile: Optional[str] = None, licenses: List[Dict[str, Any]] = NOTHING, sources: List[Dict[str, Any]] = NOTHING, path: Optional[str] = None, data: Optional[Any] = None, scheme: Optional[str] = None, format: Optional[str] = None, datatype: Optional[str] = , mediatype: Optional[str] = None, compression: Optional[str] = None, extrapaths: List[str] = NOTHING, innerpath: Optional[str] = None, encoding: Optional[str] = None, hash: Optional[str] = None, bytes: Optional[int] = None, fields: Optional[int] = None, rows: Optional[int] = None, dialect: Union[Dialect, str] = NOTHING, schema: Union[Schema, str] = NOTHING, basepath: Optional[str] = None, detector: Detector = NOTHING, package: Optional[Package] = None) -> None
# TODO: add docs
Optional[Any]
# TODO: add docs
Optional[Control]
# TODO: add docs
bool
Resource name according to the specs. It should be a slugified name of the resource.
Optional[str]
Type of the resource
ClassVar[str]
Resource title according to the specs. It should be a human-oriented title of the resource.
Optional[str]
Resource description according to the specs. It should be a human-oriented description of the resource.
Optional[str]
A URL for the home on the web that is related to this package. For example, github repository or ckan dataset address.
Optional[str]
A fully-qualified URL that points directly to a JSON Schema that can be used to validate the descriptor
Optional[str]
The license(s) under which the resource is provided. If omitted it's considered the same as the package's licenses.
List[Dict[str, Any]]
The raw sources for this data resource. It MUST be an array of Source objects. Each Source object MUST have a title and MAY have path and/or email properties.
List[Dict[str, Any]]
Path to data source
Optional[str]
Inline data source
Optional[Any]
Scheme for loading the file (file, http, ...). If not set, it'll be inferred from `source`.
Optional[str]
File source's format (csv, xls, ...). If not set, it'll be inferred from `source`.
Optional[str]
Frictionless Framework specific data type as "table" or "schema"
Optional[str]
Mediatype/mimetype of the resource e.g. “text/csv”, or “application/vnd.ms-excel”. Mediatypes are maintained by the Internet Assigned Numbers Authority (IANA) in a media type registry.
Optional[str]
Source file compression (zip, ...). If not set, it'll be inferred from `source`.
Optional[str]
List of paths to concatenate to the main path. It's used for multipart resources.
List[str]
Path within the compressed file. It defaults to the first file in the archive (if the source is an archive).
Optional[str]
Source encoding. If not set, it'll be inferred from `source`.
Optional[str]
# TODO: add docs
Optional[str]
# TODO: add docs
Optional[int]
# TODO: add docs
Optional[int]
# TODO: add docs
Optional[int]
# TODO: add docs
Union[Dialect, str]
# TODO: add docs
Union[Schema, str]
# TODO: add docs
Optional[str]
File/table detector. For more information, please check the Detector documentation.
Detector
Parental to this resource package. For more information, please check the Package documentation.
Optional[Package]
# TODO: add docs
ResourceStats
Whether the resource is tabular
ClassVar[bool]
A basepath of the resource. The normpath of the resource is a join of `basepath` and `path`.
Optional[str]
File's bytes used as a sample. These buffer bytes are used to infer characteristics of the source file (e.g. encoding, ...).
types.IBuffer
Byte stream in form of a generator
types.IByteStream
Whether the table is closed
bool
Whether resource is not path based
bool
Whether resource is multipart
bool
Normalized path of the resource; raises if not set
Optional[str]
Normalized paths of the resource
List[str]
All paths of the resource
List[str]
Stringified resource location
str
Whether resource is remote
bool
Text stream in form of a generator
types.ITextStream
Close the resource as "filelike.close" does
() -> None
Dereference underlying metadata. If any of the underlying metadata is provided as a string, it will be replaced by the corresponding metadata object.
Describe the given source as a resource
(source: Optional[Any] = None, *, name: Optional[str] = None, type: Optional[str] = None, stats: bool = False, **options: Any) -> Metadata
Infer metadata
(*, stats: bool = False) -> None
List dataset resources
(*, name: Optional[str] = None) -> List[Resource]
Open the resource as "io.open" does
Read bytes into memory
(*, size: Optional[int] = None) -> bytes
Read data into memory
(*, size: Optional[int] = None) -> Any
Read text into memory
(*, size: Optional[int] = None) -> str
Create a copy from the resource
(**options: Any) -> Self
Validate resource
(checklist: Optional[Checklist] = None, *, name: Optional[str] = None, on_row: Optional[types.ICallbackFunction] = None, parallel: bool = False, limit_rows: Optional[int] = None, limit_errors: int = 1000) -> Report