This guide assumes basic familiarity with the Frictionless Framework. To learn more, please read the Introduction and Quick Start.
Extracting data means reading tabular data from a source. We can use various customizations for this process such as providing a file format, table schema, limiting fields or rows amount, and much more. This guide will discuss the main extract
functions (extract
, extract_resource
, extract_package
) and will then go into more advanced details about the Resource Class
, Package Class
, Header Class
, and Row Class
. The output from the extract function is in 'utf-8' encoding scheme.
Let's see this with some real files:
Download
country-3.csv
to reproduce the examples (right-click and "Save link as").
cat country-3.csv
id,capital_id,name,population
1,1,Britain,67
2,3,France,67
3,2,Germany,83
4,5,Italy,60
5,4,Spain,47
with open('country-3.csv') as file:
print(file.read())
id,capital_id,name,population
1,1,Britain,67
2,3,France,67
3,2,Germany,83
4,5,Italy,60
5,4,Spain,47
Download
capital-3.csv
to reproduce the examples (right-click and "Save link as").
cat capital-3.csv
id,name
1,London
2,Berlin
3,Paris
4,Madrid
5,Rome
with open('capital-3.csv') as file:
print(file.read())
id,name
1,London
2,Berlin
3,Paris
4,Madrid
5,Rome
To start, we will extract data from a resource:
frictionless extract country-3.csv
─────────────────────────────────── Dataset ────────────────────────────────────
dataset
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ name ┃ type ┃ path ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
│ country-3 │ table │ country-3.csv │
└───────────┴───────┴───────────────┘
──────────────────────────────────── Tables ────────────────────────────────────
country-3
┏━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┓
┃ id ┃ capital_id ┃ name ┃ population ┃
┡━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━┩
│ 1 │ 1 │ Britain │ 67 │
│ 2 │ 3 │ France │ 67 │
│ 3 │ 2 │ Germany │ 83 │
│ 4 │ 5 │ Italy │ 60 │
│ 5 │ 4 │ Spain │ 47 │
└────┴────────────┴─────────┴────────────┘
from pprint import pprint
from frictionless import extract
rows = extract('country-3.csv')
pprint(rows)
{'country-3': [{'capital_id': 1, 'id': 1, 'name': 'Britain', 'population': 67},
{'capital_id': 3, 'id': 2, 'name': 'France', 'population': 67},
{'capital_id': 2, 'id': 3, 'name': 'Germany', 'population': 83},
{'capital_id': 5, 'id': 4, 'name': 'Italy', 'population': 60},
{'capital_id': 4, 'id': 5, 'name': 'Spain', 'population': 47}]}
The high-level interface for extracting data provided by Frictionless is a set of extract
functions:
extract
: detects the source file type and extracts data accordinglyresource.extract
: returns a data tablepackage.extract
: returns a map of the package's tablesAs described in more detail in the Introduction, a resource is a single file, such as a data file, and a package is a set of files, such as a data file and a schema.
The command/function would be used as follows:
frictionless extract your-table.csv
frictionless extract your-resource.json --type resource
frictionless extract your-package.json --type package
from frictionless import extract
rows = extract('capital-3.csv')
resource = extract('capital-3.csv', type="resource")
package = extract('capital-3.csv', type="package")
The extract
functions always reads data in the form of rows, into memory. The lower-level interfaces will allow you to stream data, which you can read about in the Resource Class section below.
A resource contains only one file. To extract a resource, we have three options. First, we can use the same approach as above, extracting from the data file itself:
frictionless extract capital-3.csv
─────────────────────────────────── Dataset ────────────────────────────────────
dataset
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ name ┃ type ┃ path ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
│ capital-3 │ table │ capital-3.csv │
└───────────┴───────┴───────────────┘
──────────────────────────────────── Tables ────────────────────────────────────
capital-3
┏━━━━┳━━━━━━━━┓
┃ id ┃ name ┃
┡━━━━╇━━━━━━━━┩
│ 1 │ London │
│ 2 │ Berlin │
│ 3 │ Paris │
│ 4 │ Madrid │
│ 5 │ Rome │
└────┴────────┘
from pprint import pprint
from frictionless import extract
rows = extract('capital-3.csv')
pprint(rows)
{'capital-3': [{'id': 1, 'name': 'London'},
{'id': 2, 'name': 'Berlin'},
{'id': 3, 'name': 'Paris'},
{'id': 4, 'name': 'Madrid'},
{'id': 5, 'name': 'Rome'}]}
Our second option is to extract the resource from a descriptor file by using the extract_resource
function. A descriptor file is useful because it can contain different metadata and be stored on the disc.
As an example of how to use extract_resource
, let's first create a descriptor file (note: this example uses YAML for the descriptor, but Frictionless also supports JSON):
from frictionless import Resource
resource = Resource('capital-3.csv')
resource.infer()
# as an example, in the next line we will append the schema
resource.schema.missing_values.append('3') # will interpret 3 as a missing value
resource.to_yaml('capital.resource-test.yaml') # use resource.to_json for JSON format
You can also use a pre-made descriptor file.
Now, this descriptor file can be used to extract the resource:
frictionless extract capital.resource-test.yaml
─────────────────────────────────── Dataset ────────────────────────────────────
dataset
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ name ┃ type ┃ path ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
│ capital-3 │ table │ capital-3.csv │
└───────────┴───────┴───────────────┘
──────────────────────────────────── Tables ────────────────────────────────────
capital-3
┏━━━━━━┳━━━━━━━━┓
┃ id ┃ name ┃
┡━━━━━━╇━━━━━━━━┩
│ 1 │ London │
│ 2 │ Berlin │
│ None │ Paris │
│ 4 │ Madrid │
│ 5 │ Rome │
└──────┴────────┘
from pprint import pprint
from frictionless import extract
rows = extract('capital.resource.yaml')
pprint(rows)
{'capital-invalid': [{'id': 1, 'name': 'London', 'name2': 'Britain'},
{'id': 2, 'name': 'Berlin', 'name2': 'Germany'},
{'id': 3, 'name': 'Paris', 'name2': 'France'},
{'id': 4, 'name': 'Madrid', 'name2': 'Spain'},
{'id': 5, 'name': 'Rome', 'name2': 'Italy'},
{'id': 6, 'name': 'Zagreb', 'name2': 'Croatia'},
{'id': 7, 'name': 'Athens', 'name2': 'Greece'},
{'id': 8, 'name': 'Vienna', 'name2': 'Austria'},
{'id': 8, 'name': 'Warsaw', 'name2': None},
{'id': None, 'name': None, 'name2': None},
{'id': None, 'name': 'Tokio', 'name2': 'Japan'}]}
So what has happened in this example? We set the textual representation of the number "3" to be a missing value. In the output we can see how the id
number 3 now appears as None
representing a missing value. This toy example demonstrates how the metadata in a descriptor can be used; other values like "NA" are more common for missing values.
You can read more advanced details about the Resource Class below.
The third way we can extract information is from a package, which is a set of two or more files, for instance, two data files and a corresponding metadata file.
As a primary example, we provide two data files to the extract
command which will be enough to detect that it's a dataset. Let's start by using the command-line interface:
frictionless extract *-3.csv
─────────────────────────────────── Dataset ────────────────────────────────────
dataset
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ name ┃ type ┃ path ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
│ capital-3 │ table │ capital-3.csv │
│ country-3 │ table │ country-3.csv │
└───────────┴───────┴───────────────┘
──────────────────────────────────── Tables ────────────────────────────────────
capital-3
┏━━━━┳━━━━━━━━┓
┃ id ┃ name ┃
┡━━━━╇━━━━━━━━┩
│ 1 │ London │
│ 2 │ Berlin │
│ 3 │ Paris │
│ 4 │ Madrid │
│ 5 │ Rome │
└────┴────────┘
country-3
┏━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┓
┃ id ┃ capital_id ┃ name ┃ population ┃
┡━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━┩
│ 1 │ 1 │ Britain │ 67 │
│ 2 │ 3 │ France │ 67 │
│ 3 │ 2 │ Germany │ 83 │
│ 4 │ 5 │ Italy │ 60 │
│ 5 │ 4 │ Spain │ 47 │
└────┴────────────┴─────────┴────────────┘
from pprint import pprint
from frictionless import extract
data = extract('*-3.csv')
pprint(data)
{'capital-3': [{'id': 1, 'name': 'London'},
{'id': 2, 'name': 'Berlin'},
{'id': 3, 'name': 'Paris'},
{'id': 4, 'name': 'Madrid'},
{'id': 5, 'name': 'Rome'}],
'country-3': [{'capital_id': 1, 'id': 1, 'name': 'Britain', 'population': 67},
{'capital_id': 3, 'id': 2, 'name': 'France', 'population': 67},
{'capital_id': 2, 'id': 3, 'name': 'Germany', 'population': 83},
{'capital_id': 5, 'id': 4, 'name': 'Italy', 'population': 60},
{'capital_id': 4, 'id': 5, 'name': 'Spain', 'population': 47}]}
We can also extract the package from a descriptor file using the package.extract
function (Note: see the Package Class section for the creation of the country.package.yaml
file):
frictionless extract country.package.yaml
─────────────────────────────────── Dataset ────────────────────────────────────
dataset
┏━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ name ┃ type ┃ path ┃
┡━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
│ capital │ table │ capital-3.csv │
│ country │ table │ country-3.csv │
└─────────┴───────┴───────────────┘
──────────────────────────────────── Tables ────────────────────────────────────
capital
┏━━━━┳━━━━━━━━┓
┃ id ┃ name ┃
┡━━━━╇━━━━━━━━┩
│ 1 │ London │
│ 2 │ Berlin │
│ 3 │ Paris │
│ 4 │ Madrid │
│ 5 │ Rome │
└────┴────────┘
country
┏━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┓
┃ id ┃ capital_id ┃ name ┃ population ┃
┡━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━┩
│ 1 │ 1 │ Britain │ 67 │
│ 2 │ 3 │ France │ 67 │
│ 3 │ 2 │ Germany │ 83 │
│ 4 │ 5 │ Italy │ 60 │
│ 5 │ 4 │ Spain │ 47 │
└────┴────────────┴─────────┴────────────┘
from frictionless import Package
package = Package('country.package.yaml')
pprint(package.extract())
{'capital': [{'id': 1, 'name': 'London'},
{'id': 2, 'name': 'Berlin'},
{'id': 3, 'name': 'Paris'},
{'id': 4, 'name': 'Madrid'},
{'id': 5, 'name': 'Rome'}],
'country': [{'capital_id': 1, 'id': 1, 'name': 'Britain', 'population': 67},
{'capital_id': 3, 'id': 2, 'name': 'France', 'population': 67},
{'capital_id': 2, 'id': 3, 'name': 'Germany', 'population': 83},
{'capital_id': 5, 'id': 4, 'name': 'Italy', 'population': 60},
{'capital_id': 4, 'id': 5, 'name': 'Spain', 'population': 47}]}
You can read more advanced details about the Package Class below.
The following sections contain further, advanced details about the
Resource Class
,Package Class
,Header Class
, andRow Class
.
The Resource class provides metadata about a resource with read and stream functions. The extract
functions always read rows into memory; Resource can do the same but it also gives a choice regarding output data which can be rows
, data
, text
, or bytes
. Let's try reading all of them.
It's a byte representation of the contents:
from pprint import pprint
from frictionless import Resource
resource = Resource('country-3.csv')
pprint(resource.read_bytes())
(b'id,capital_id,name,population\n1,1,Britain,67\n2,3,France,67\n3,2,Germany,8'
b'3\n4,5,Italy,60\n5,4,Spain,47\n')
It's a textual representation of the contents:
from frictionless import Resource
resource = Resource('country-3.csv')
pprint(resource.read_text())
('id,capital_id,name,population\n'
'1,1,Britain,67\n'
'2,3,France,67\n'
'3,2,Germany,83\n'
'4,5,Italy,60\n'
'5,4,Spain,47\n')
For a tabular data there are raw representaion of the tabular contents:
from frictionless import Resource
resource = Resource('country-3.csv')
pprint(resource.read_cells())
[['id', 'capital_id', 'name', 'population'],
['1', '1', 'Britain', '67'],
['2', '3', 'France', '67'],
['3', '2', 'Germany', '83'],
['4', '5', 'Italy', '60'],
['5', '4', 'Spain', '47']]
For a tabular data there are row available which is are normalized lists presented as dictionaries:
from frictionless import Resource
resource = Resource('country-3.csv')
pprint(resource.read_rows())
[{'id': 1, 'capital_id': 1, 'name': 'Britain', 'population': 67},
{'id': 2, 'capital_id': 3, 'name': 'France', 'population': 67},
{'id': 3, 'capital_id': 2, 'name': 'Germany', 'population': 83},
{'id': 4, 'capital_id': 5, 'name': 'Italy', 'population': 60},
{'id': 5, 'capital_id': 4, 'name': 'Spain', 'population': 47}]
For a tabular data there is the Header object available:
from frictionless import Resource
with Resource('country-3.csv') as resource:
pprint(resource.header)
['id', 'capital_id', 'name', 'population']
It's really handy to read all your data into memory but it's not always possible if a file is very big. For such cases, Frictionless provides streaming functions:
from frictionless import Resource
with Resource('country-3.csv') as resource:
resource.byte_stream
resource.text_stream
resource.list_stream
resource.row_stream
The Package class provides functions to read the contents of a package. First of all, let's create a package descriptor:
frictionless describe *-3.csv --json > country.package.json
from frictionless import describe
package = describe('*-3.csv')
package.to_json('country.package.json')
Note that --json is used here to output the descriptor in JSON format. Without this, the default output is in YAML format as we saw above.
We can create a package from data files (using their paths) and then read the package's resources:
from frictionless import Package
package = Package('*-3.csv')
pprint(package.get_resource('country-3').read_rows())
pprint(package.get_resource('capital-3').read_rows())
[{'id': 1, 'capital_id': 1, 'name': 'Britain', 'population': 67},
{'id': 2, 'capital_id': 3, 'name': 'France', 'population': 67},
{'id': 3, 'capital_id': 2, 'name': 'Germany', 'population': 83},
{'id': 4, 'capital_id': 5, 'name': 'Italy', 'population': 60},
{'id': 5, 'capital_id': 4, 'name': 'Spain', 'population': 47}]
[{'id': 1, 'name': 'London'},
{'id': 2, 'name': 'Berlin'},
{'id': 3, 'name': 'Paris'},
{'id': 4, 'name': 'Madrid'},
{'id': 5, 'name': 'Rome'}]
The package by itself doesn't provide any read functions directly because it's just a contrainer. You can select a pacakge's resource and use the Resource API from above for data reading.