Edit page in Livemark
(2024-11-22 08:02)

Extracting Data

This guide assumes basic familiarity with the Frictionless Framework. To learn more, please read the Introduction and Quick Start.

Extracting data means reading tabular data from a source. We can use various customizations for this process such as providing a file format, table schema, limiting fields or rows amount, and much more. This guide will discuss the main extract functions (extract, extract_resource, extract_package) and will then go into more advanced details about the Resource Class, Package Class, Header Class, and Row Class. The output from the extract function is in 'utf-8' encoding scheme.

Let's see this with some real files:

Download country-3.csv to reproduce the examples (right-click and "Save link as").

cat country-3.csv
id,capital_id,name,population
1,1,Britain,67
2,3,France,67
3,2,Germany,83
4,5,Italy,60
5,4,Spain,47
with open('country-3.csv') as file:
    print(file.read())
id,capital_id,name,population
1,1,Britain,67
2,3,France,67
3,2,Germany,83
4,5,Italy,60
5,4,Spain,47

Download capital-3.csv to reproduce the examples (right-click and "Save link as").

cat capital-3.csv
id,name
1,London
2,Berlin
3,Paris
4,Madrid
5,Rome
with open('capital-3.csv') as file:
    print(file.read())
id,name
1,London
2,Berlin
3,Paris
4,Madrid
5,Rome

To start, we will extract data from a resource:

frictionless extract country-3.csv
─────────────────────────────────── Dataset ────────────────────────────────────
               dataset
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ name      ┃ type  ┃ path          ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
│ country-3 │ table │ country-3.csv │
└───────────┴───────┴───────────────┘
──────────────────────────────────── Tables ────────────────────────────────────
                country-3
┏━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┓
┃ id ┃ capital_id ┃ name    ┃ population ┃
┡━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━┩
│ 1  │ 1          │ Britain │ 67         │
│ 2  │ 3          │ France  │ 67         │
│ 3  │ 2          │ Germany │ 83         │
│ 4  │ 5          │ Italy   │ 60         │
│ 5  │ 4          │ Spain   │ 47         │
└────┴────────────┴─────────┴────────────┘
from pprint import pprint
from frictionless import extract

rows = extract('country-3.csv')
pprint(rows)
{'country-3': [{'capital_id': 1, 'id': 1, 'name': 'Britain', 'population': 67},
               {'capital_id': 3, 'id': 2, 'name': 'France', 'population': 67},
               {'capital_id': 2, 'id': 3, 'name': 'Germany', 'population': 83},
               {'capital_id': 5, 'id': 4, 'name': 'Italy', 'population': 60},
               {'capital_id': 4, 'id': 5, 'name': 'Spain', 'population': 47}]}

Extract Functions

The high-level interface for extracting data provided by Frictionless is a set of extract functions:

As described in more detail in the Introduction, a resource is a single file, such as a data file, and a package is a set of files, such as a data file and a schema.

The command/function would be used as follows:

frictionless extract your-table.csv
frictionless extract your-resource.json --type resource
frictionless extract your-package.json --type package
from frictionless import extract

rows = extract('capital-3.csv')
resource = extract('capital-3.csv', type="resource")
package = extract('capital-3.csv', type="package")

The extract functions always reads data in the form of rows, into memory. The lower-level interfaces will allow you to stream data, which you can read about in the Resource Class section below.

Extracting a Resource

A resource contains only one file. To extract a resource, we have three options. First, we can use the same approach as above, extracting from the data file itself:

frictionless extract capital-3.csv
─────────────────────────────────── Dataset ────────────────────────────────────
               dataset
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ name      ┃ type  ┃ path          ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
│ capital-3 │ table │ capital-3.csv │
└───────────┴───────┴───────────────┘
──────────────────────────────────── Tables ────────────────────────────────────
   capital-3
┏━━━━┳━━━━━━━━┓
┃ id ┃ name   ┃
┡━━━━╇━━━━━━━━┩
│ 1  │ London │
│ 2  │ Berlin │
│ 3  │ Paris  │
│ 4  │ Madrid │
│ 5  │ Rome   │
└────┴────────┘
from pprint import pprint
from frictionless import extract

rows = extract('capital-3.csv')
pprint(rows)
{'capital-3': [{'id': 1, 'name': 'London'},
               {'id': 2, 'name': 'Berlin'},
               {'id': 3, 'name': 'Paris'},
               {'id': 4, 'name': 'Madrid'},
               {'id': 5, 'name': 'Rome'}]}

Our second option is to extract the resource from a descriptor file by using the extract_resource function. A descriptor file is useful because it can contain different metadata and be stored on the disc.

As an example of how to use extract_resource, let's first create a descriptor file (note: this example uses YAML for the descriptor, but Frictionless also supports JSON):

from frictionless import Resource

resource = Resource('capital-3.csv')
resource.infer()
# as an example, in the next line we will append the schema
resource.schema.missing_values.append('3') # will interpret 3 as a missing value
resource.to_yaml('capital.resource-test.yaml') # use resource.to_json for JSON format

You can also use a pre-made descriptor file.

Now, this descriptor file can be used to extract the resource:

frictionless extract capital.resource-test.yaml
─────────────────────────────────── Dataset ────────────────────────────────────
               dataset
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ name      ┃ type  ┃ path          ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
│ capital-3 │ table │ capital-3.csv │
└───────────┴───────┴───────────────┘
──────────────────────────────────── Tables ────────────────────────────────────
    capital-3
┏━━━━━━┳━━━━━━━━┓
┃ id   ┃ name   ┃
┡━━━━━━╇━━━━━━━━┩
│ 1    │ London │
│ 2    │ Berlin │
│ None │ Paris  │
│ 4    │ Madrid │
│ 5    │ Rome   │
└──────┴────────┘
from pprint import pprint
from frictionless import extract

rows = extract('capital.resource.yaml')
pprint(rows)
{'capital-invalid': [{'id': 1, 'name': 'London', 'name2': 'Britain'},
                     {'id': 2, 'name': 'Berlin', 'name2': 'Germany'},
                     {'id': 3, 'name': 'Paris', 'name2': 'France'},
                     {'id': 4, 'name': 'Madrid', 'name2': 'Spain'},
                     {'id': 5, 'name': 'Rome', 'name2': 'Italy'},
                     {'id': 6, 'name': 'Zagreb', 'name2': 'Croatia'},
                     {'id': 7, 'name': 'Athens', 'name2': 'Greece'},
                     {'id': 8, 'name': 'Vienna', 'name2': 'Austria'},
                     {'id': 8, 'name': 'Warsaw', 'name2': None},
                     {'id': None, 'name': None, 'name2': None},
                     {'id': None, 'name': 'Tokio', 'name2': 'Japan'}]}

So what has happened in this example? We set the textual representation of the number "3" to be a missing value. In the output we can see how the id number 3 now appears as None representing a missing value. This toy example demonstrates how the metadata in a descriptor can be used; other values like "NA" are more common for missing values.

You can read more advanced details about the Resource Class below.

Extracting a Package

The third way we can extract information is from a package, which is a set of two or more files, for instance, two data files and a corresponding metadata file.

As a primary example, we provide two data files to the extract command which will be enough to detect that it's a dataset. Let's start by using the command-line interface:

frictionless extract *-3.csv
─────────────────────────────────── Dataset ────────────────────────────────────
               dataset
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ name      ┃ type  ┃ path          ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
│ capital-3 │ table │ capital-3.csv │
│ country-3 │ table │ country-3.csv │
└───────────┴───────┴───────────────┘
──────────────────────────────────── Tables ────────────────────────────────────
   capital-3
┏━━━━┳━━━━━━━━┓
┃ id ┃ name   ┃
┡━━━━╇━━━━━━━━┩
│ 1  │ London │
│ 2  │ Berlin │
│ 3  │ Paris  │
│ 4  │ Madrid │
│ 5  │ Rome   │
└────┴────────┘
                country-3
┏━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┓
┃ id ┃ capital_id ┃ name    ┃ population ┃
┡━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━┩
│ 1  │ 1          │ Britain │ 67         │
│ 2  │ 3          │ France  │ 67         │
│ 3  │ 2          │ Germany │ 83         │
│ 4  │ 5          │ Italy   │ 60         │
│ 5  │ 4          │ Spain   │ 47         │
└────┴────────────┴─────────┴────────────┘
from pprint import pprint
from frictionless import extract

data = extract('*-3.csv')
pprint(data)
{'capital-3': [{'id': 1, 'name': 'London'},
               {'id': 2, 'name': 'Berlin'},
               {'id': 3, 'name': 'Paris'},
               {'id': 4, 'name': 'Madrid'},
               {'id': 5, 'name': 'Rome'}],
 'country-3': [{'capital_id': 1, 'id': 1, 'name': 'Britain', 'population': 67},
               {'capital_id': 3, 'id': 2, 'name': 'France', 'population': 67},
               {'capital_id': 2, 'id': 3, 'name': 'Germany', 'population': 83},
               {'capital_id': 5, 'id': 4, 'name': 'Italy', 'population': 60},
               {'capital_id': 4, 'id': 5, 'name': 'Spain', 'population': 47}]}

We can also extract the package from a descriptor file using the package.extract function (Note: see the Package Class section for the creation of the country.package.yaml file):

frictionless extract country.package.yaml
─────────────────────────────────── Dataset ────────────────────────────────────
              dataset
┏━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ name    ┃ type  ┃ path          ┃
┡━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
│ capital │ table │ capital-3.csv │
│ country │ table │ country-3.csv │
└─────────┴───────┴───────────────┘
──────────────────────────────────── Tables ────────────────────────────────────
    capital
┏━━━━┳━━━━━━━━┓
┃ id ┃ name   ┃
┡━━━━╇━━━━━━━━┩
│ 1  │ London │
│ 2  │ Berlin │
│ 3  │ Paris  │
│ 4  │ Madrid │
│ 5  │ Rome   │
└────┴────────┘
                 country
┏━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┓
┃ id ┃ capital_id ┃ name    ┃ population ┃
┡━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━┩
│ 1  │ 1          │ Britain │ 67         │
│ 2  │ 3          │ France  │ 67         │
│ 3  │ 2          │ Germany │ 83         │
│ 4  │ 5          │ Italy   │ 60         │
│ 5  │ 4          │ Spain   │ 47         │
└────┴────────────┴─────────┴────────────┘
from frictionless import Package

package = Package('country.package.yaml')
pprint(package.extract())
{'capital': [{'id': 1, 'name': 'London'},
             {'id': 2, 'name': 'Berlin'},
             {'id': 3, 'name': 'Paris'},
             {'id': 4, 'name': 'Madrid'},
             {'id': 5, 'name': 'Rome'}],
 'country': [{'capital_id': 1, 'id': 1, 'name': 'Britain', 'population': 67},
             {'capital_id': 3, 'id': 2, 'name': 'France', 'population': 67},
             {'capital_id': 2, 'id': 3, 'name': 'Germany', 'population': 83},
             {'capital_id': 5, 'id': 4, 'name': 'Italy', 'population': 60},
             {'capital_id': 4, 'id': 5, 'name': 'Spain', 'population': 47}]}

You can read more advanced details about the Package Class below.

The following sections contain further, advanced details about the Resource Class, Package Class, Header Class, and Row Class.

Resource Class

The Resource class provides metadata about a resource with read and stream functions. The extract functions always read rows into memory; Resource can do the same but it also gives a choice regarding output data which can be rows, data, text, or bytes. Let's try reading all of them.

Reading Bytes

It's a byte representation of the contents:

from pprint import pprint
from frictionless import Resource

resource = Resource('country-3.csv')
pprint(resource.read_bytes())
(b'id,capital_id,name,population\n1,1,Britain,67\n2,3,France,67\n3,2,Germany,8'
 b'3\n4,5,Italy,60\n5,4,Spain,47\n')

Reading Text

It's a textual representation of the contents:

from frictionless import Resource

resource = Resource('country-3.csv')
pprint(resource.read_text())
('id,capital_id,name,population\n'
 '1,1,Britain,67\n'
 '2,3,France,67\n'
 '3,2,Germany,83\n'
 '4,5,Italy,60\n'
 '5,4,Spain,47\n')

Reading Cells

For a tabular data there are raw representaion of the tabular contents:

from frictionless import Resource

resource = Resource('country-3.csv')
pprint(resource.read_cells())
[['id', 'capital_id', 'name', 'population'],
 ['1', '1', 'Britain', '67'],
 ['2', '3', 'France', '67'],
 ['3', '2', 'Germany', '83'],
 ['4', '5', 'Italy', '60'],
 ['5', '4', 'Spain', '47']]

Reading Rows

For a tabular data there are row available which is are normalized lists presented as dictionaries:

from frictionless import Resource

resource = Resource('country-3.csv')
pprint(resource.read_rows())
[{'id': 1, 'capital_id': 1, 'name': 'Britain', 'population': 67},
 {'id': 2, 'capital_id': 3, 'name': 'France', 'population': 67},
 {'id': 3, 'capital_id': 2, 'name': 'Germany', 'population': 83},
 {'id': 4, 'capital_id': 5, 'name': 'Italy', 'population': 60},
 {'id': 5, 'capital_id': 4, 'name': 'Spain', 'population': 47}]

Reading a Header

For a tabular data there is the Header object available:

from frictionless import Resource

with Resource('country-3.csv') as resource:
    pprint(resource.header)
['id', 'capital_id', 'name', 'population']

Streaming Interfaces

It's really handy to read all your data into memory but it's not always possible if a file is very big. For such cases, Frictionless provides streaming functions:

from frictionless import Resource

with Resource('country-3.csv') as resource:
    resource.byte_stream
    resource.text_stream
    resource.list_stream
    resource.row_stream

Package Class

The Package class provides functions to read the contents of a package. First of all, let's create a package descriptor:

frictionless describe *-3.csv --json > country.package.json
from frictionless import describe

package = describe('*-3.csv')
package.to_json('country.package.json')

Note that --json is used here to output the descriptor in JSON format. Without this, the default output is in YAML format as we saw above.

We can create a package from data files (using their paths) and then read the package's resources:

from frictionless import Package

package = Package('*-3.csv')
pprint(package.get_resource('country-3').read_rows())
pprint(package.get_resource('capital-3').read_rows())
[{'id': 1, 'capital_id': 1, 'name': 'Britain', 'population': 67},
 {'id': 2, 'capital_id': 3, 'name': 'France', 'population': 67},
 {'id': 3, 'capital_id': 2, 'name': 'Germany', 'population': 83},
 {'id': 4, 'capital_id': 5, 'name': 'Italy', 'population': 60},
 {'id': 5, 'capital_id': 4, 'name': 'Spain', 'population': 47}]
[{'id': 1, 'name': 'London'},
 {'id': 2, 'name': 'Berlin'},
 {'id': 3, 'name': 'Paris'},
 {'id': 4, 'name': 'Madrid'},
 {'id': 5, 'name': 'Rome'}]

The package by itself doesn't provide any read functions directly because it's just a contrainer. You can select a pacakge's resource and use the Resource API from above for data reading.