
Basic Examples

Let's start with an example dataset. We will look at a few raw data files that have recently been collected by an anthropologist. The anthropologist wants to publish this data in an open repository so her colleagues can also use it. Before publishing, she wants to add metadata and check the data for errors. We are here to help, so let's start by exploring the data. We can see that the quality of the data is far from perfect; in fact, the first row contains a comment from the anthropologist! To be able to use this data, we need to clean it up a bit.

Download countries.csv to reproduce the examples (right-click and "Save link as").

cat countries.csv
# clean this data!
id,neighbor_id,name,population
1,Ireland,Britain,67
2,3,France,n/a,find the population
3,22,Germany,83
4,,Italy,60
5
with open('countries.csv') as file:
    print(file.read())
# clean this data!
id,neighbor_id,name,population
1,Ireland,Britain,67
2,3,France,n/a,find the population
3,22,Germany,83
4,,Italy,60
5

As we can see, this data contains information about European countries and their populations. It also looks like two fields are related through a country's identifier: neighbor_id is a foreign key referencing id.

Describing Data

First of all, we're going to describe our dataset. Frictionless uses the powerful Frictionless Data Specifications, which are very handy for describing data at different levels: a table's structure and types (Table Schema), a single data file (Data Resource), and a collection of data files (Data Package).
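If we only need one of those levels, we can describe it directly. This is a small sketch, not part of the original walkthrough, and it assumes a Frictionless version where describe() accepts a type argument:

from frictionless import describe

# Describe only the inferred Table Schema of the file
# (the type argument is available in recent Frictionless versions)
schema = describe('countries.csv', type='schema')
print(schema)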

Let's describe the countries table:

frictionless describe countries.csv # optionally add --stats to get statistics
─────────────────────────────────── Dataset ────────────────────────────────────
               dataset
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ name      ┃ type  ┃ path          ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
│ countries │ table │ countries.csv │
└───────────┴───────┴───────────────┘
──────────────────────────────────── Tables ────────────────────────────────────
                   countries
┏━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┓
┃ id      ┃ neighbor_id ┃ name   ┃ population ┃
┡━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━┩
│ integer │ string      │ string │ string     │
└─────────┴─────────────┴────────┴────────────┘
from pprint import pprint
from frictionless import describe

resource = describe('countries.csv')
pprint(resource)
{'name': 'countries',
 'type': 'table',
 'path': 'countries.csv',
 'scheme': 'file',
 'format': 'csv',
 'mediatype': 'text/csv',
 'encoding': 'utf-8',
 'dialect': {'headerRows': [2]},
 'schema': {'fields': [{'name': 'id', 'type': 'integer'},
                       {'name': 'neighbor_id', 'type': 'string'},
                       {'name': 'name', 'type': 'string'},
                       {'name': 'population', 'type': 'string'}]}}

As we can see, Frictionless was smart enough to understand that the first row contains a comment. That's good, but we still have a few problems:

- we use "n/a" as a missing value marker, but Frictionless doesn't know that
- neighbor_id should be an integer field, but the "Ireland" value made it detect a string
- population should be an integer field, but the "n/a" marker made it detect a string
- there is a relation between the fields: neighbor_id is a foreign key referencing id

Let's update our metadata and save it to disk:

Open this file in your favorite editor and update it as shown below.

frictionless describe countries.csv --yaml > countries.resource.yaml
editor countries.resource.yaml
from frictionless import Detector, describe

# Declare which strings mark missing values so they don't affect type detection
detector = Detector(field_missing_values=["", "n/a"])
resource = describe("countries.csv", detector=detector)

# Fix the wrongly detected type and declare the relation between the fields
resource.schema.set_field_type("neighbor_id", "integer")
resource.schema.foreign_keys.append(
    {"fields": ["neighbor_id"], "reference": {"resource": "", "fields": ["id"]}}
)
resource.to_yaml("countries.resource.yaml")

Let's see what we have created:

cat countries.resource.yaml
name: countries
type: table
path: countries.csv
scheme: file
format: csv
mediatype: text/csv
encoding: utf-8
dialect:
  headerRows:
    - 2
schema:
  fields:
    - name: id
      type: integer
    - name: neighbor_id
      type: integer
    - name: name
      type: string
    - name: population
      type: integer
  missingValues:
    - ''
    - n/a
  foreignKeys:
    - fields:
        - neighbor_id
      reference:
        resource: ''
        fields:
          - id
with open('countries.resource.yaml') as file:
    print(file.read())
name: countries
type: table
path: countries.csv
scheme: file
format: csv
mediatype: text/csv
encoding: utf-8
dialect:
  headerRows:
    - 2
schema:
  fields:
    - name: id
      type: integer
    - name: neighbor_id
      type: integer
    - name: name
      type: string
    - name: population
      type: integer
  missingValues:
    - ''
    - n/a
  foreignKeys:
    - fields:
        - neighbor_id
      reference:
        resource: ''
        fields:
          - id

It has the same metadata we saw above, but it also includes our edits: the missing value markers and the corrected data types. Note that we didn't have to fix the population type manually, because declaring the proper missing values let Frictionless detect it as an integer automatically. Now we have a resource descriptor. In the next section, we will show why metadata matters and how to use it.
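As a quick sanity check (not part of the original guide), we can load the saved descriptor back and print the field types it declares:

from frictionless import Resource

# Load the resource from its descriptor; the schema comes from the YAML file
resource = Resource("countries.resource.yaml")
for field in resource.schema.fields:
    print(field.name, field.type)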

Extracting Data

It's time to try extracting our data as a table. As a first naive attempt, we will ignore the metadata we saved in the previous step:

frictionless extract countries.csv
─────────────────────────────────── Dataset ────────────────────────────────────
               dataset
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ name      ┃ type  ┃ path          ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
│ countries │ table │ countries.csv │
└───────────┴───────┴───────────────┘
──────────────────────────────────── Tables ────────────────────────────────────
                 countries
┏━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┓
┃ id ┃ neighbor_id ┃ name    ┃ population ┃
┡━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━┩
│ 1  │ Ireland     │ Britain │ 67         │
│ 2  │ 3           │ France  │ n/a        │
│ 3  │ 22          │ Germany │ 83         │
│ 4  │ None        │ Italy   │ 60         │
│ 5  │ None        │ None    │ None       │
└────┴─────────────┴─────────┴────────────┘
from pprint import pprint
from frictionless import extract

rows = extract('countries.csv')
pprint(rows)
{'countries': [{'id': 1,
                'name': 'Britain',
                'neighbor_id': 'Ireland',
                'population': '67'},
               {'id': 2,
                'name': 'France',
                'neighbor_id': '3',
                'population': 'n/a'},
               {'id': 3,
                'name': 'Germany',
                'neighbor_id': '22',
                'population': '83'},
               {'id': 4,
                'name': 'Italy',
                'neighbor_id': None,
                'population': '60'},
               {'id': 5,
                'name': None,
                'neighbor_id': None,
                'population': None}]}

Actually, it doesn't look terrible, but in reality data like this is not quite useful:

- the numeric fields are read as strings, so the data can't be exported, for example, to SQL as-is
- the "n/a" marker is still a literal string instead of a proper missing value

Note that the extracted output uses the 'utf-8' encoding. Let's use the metadata we saved to extract the data with the help of the Frictionless Data specifications:

frictionless extract countries.resource.yaml
─────────────────────────────────── Dataset ────────────────────────────────────
               dataset
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ name      ┃ type  ┃ path          ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
│ countries │ table │ countries.csv │
└───────────┴───────┴───────────────┘
──────────────────────────────────── Tables ────────────────────────────────────
                 countries
┏━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┓
┃ id ┃ neighbor_id ┃ name    ┃ population ┃
┡━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━┩
│ 1  │ None        │ Britain │ 67         │
│ 2  │ 3           │ France  │ None       │
│ 3  │ 22          │ Germany │ 83         │
│ 4  │ None        │ Italy   │ 60         │
│ 5  │ None        │ None    │ None       │
└────┴─────────────┴─────────┴────────────┘
from pprint import pprint
from frictionless import extract

rows = extract('countries.resource.yaml')
pprint(rows)
{'countries': [{'id': 1,
                'name': 'Britain',
                'neighbor_id': None,
                'population': 67},
               {'id': 2,
                'name': 'France',
                'neighbor_id': 3,
                'population': None},
               {'id': 3,
                'name': 'Germany',
                'neighbor_id': 22,
                'population': 83},
               {'id': 4,
                'name': 'Italy',
                'neighbor_id': None,
                'population': 60},
               {'id': 5,
                'name': None,
                'neighbor_id': None,
                'population': None}]}

It's now much better! Numeric fields are proper numbers, and there are no more textual missing value markers. We can't see it in the command line, but in Python the missing values are now None, so the data can be exported, for example, to SQL. However, it's still not ready to be published. In the next section, we will validate it!
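As a small illustration (not part of the original guide), we can stream the typed rows in Python and see the None values directly:

from frictionless import Resource

# Open the resource via its descriptor and stream the typed rows;
# missing values come back as Python None
with Resource('countries.resource.yaml') as resource:
    for row in resource.row_stream:
        print(row['id'], row['neighbor_id'], row['population'])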

Validating Data

Data validation with Frictionless is as easy as describing or extracting data:

frictionless validate countries.csv
─────────────────────────────────── Dataset ────────────────────────────────────
                    dataset
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ name      ┃ type  ┃ path          ┃ status  ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ countries │ table │ countries.csv │ INVALID │
└───────────┴───────┴───────────────┴─────────┘
──────────────────────────────────── Tables ────────────────────────────────────
                                   countries
┏━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Row ┃ Field ┃ Type         ┃ Message                                         ┃
┡━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 4   │ 5     │ extra-cell   │ Row at position "4" has an extra value in field │
│     │       │              │ at position "5"                                 │
│ 7   │ 2     │ missing-cell │ Row at position "7" has a missing cell in field │
│     │       │              │ "neighbor_id" at position "2"                   │
│ 7   │ 3     │ missing-cell │ Row at position "7" has a missing cell in field │
│     │       │              │ "name" at position "3"                          │
│ 7   │ 4     │ missing-cell │ Row at position "7" has a missing cell in field │
│     │       │              │ "population" at position "4"                    │
└─────┴───────┴──────────────┴─────────────────────────────────────────────────┘
from pprint import pprint
from frictionless import validate

report = validate('countries.csv')
pprint(report.flatten(["rowNumber", "fieldNumber", "type"]))
[[4, 5, 'extra-cell'],
 [7, 2, 'missing-cell'],
 [7, 3, 'missing-cell'],
 [7, 4, 'missing-cell']]

Ahh, we saw that coming. The data is not valid; there are some missing and extra cells. But wait a minute: in the first step, we created a metadata file with more information about our table. Let's use it.

frictionless validate countries.resource.yaml
─────────────────────────────────── Dataset ────────────────────────────────────
                    dataset
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ name      ┃ type  ┃ path          ┃ status  ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ countries │ table │ countries.csv │ INVALID │
└───────────┴───────┴───────────────┴─────────┘
──────────────────────────────────── Tables ────────────────────────────────────
                                   countries
┏━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Row ┃ Field ┃ Type         ┃ Message                                         ┃
┡━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 3   │ 2     │ type-error   │ Type error in the cell "Ireland" in row "3" and │
│     │       │              │ field "neighbor_id" at position "2": type is    │
│     │       │              │ "integer/default"                               │
│ 4   │ 5     │ extra-cell   │ Row at position "4" has an extra value in field │
│     │       │              │ at position "5"                                 │
│ 5   │ None  │ foreign-key  │ Row at position "5" violates the foreign key:   │
│     │       │              │ for "neighbor_id": values "22" not found in the │
│     │       │              │ lookup table "" as "id"                         │
│ 7   │ 2     │ missing-cell │ Row at position "7" has a missing cell in field │
│     │       │              │ "neighbor_id" at position "2"                   │
│ 7   │ 3     │ missing-cell │ Row at position "7" has a missing cell in field │
│     │       │              │ "name" at position "3"                          │
│ 7   │ 4     │ missing-cell │ Row at position "7" has a missing cell in field │
│     │       │              │ "population" at position "4"                    │
└─────┴───────┴──────────────┴─────────────────────────────────────────────────┘
from pprint import pprint
from frictionless import validate

report = validate('countries.resource.yaml')
pprint(report.flatten(["rowNumber", "fieldNumber", "type"]))
[[3, 2, 'type-error'],
 [4, 5, 'extra-cell'],
 [5, None, 'foreign-key'],
 [7, 2, 'missing-cell'],
 [7, 3, 'missing-cell'],
 [7, 4, 'missing-cell']]

Now it's even worse, but regarding data validation errors, the more, the better, actually. Thanks to the metadata, we were able to reveal some critical errors:

- a type error: "Ireland" is not a valid integer value for the neighbor_id field
- a broken relation: the neighbor_id value "22" does not exist in the id field
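If we need more than the flattened view, the report object can also be inspected programmatically. This is a small sketch, not part of the original guide; the exact error attributes may vary between Frictionless versions:

from frictionless import validate

# Validate against the descriptor and walk the errors of the first (and only) table
report = validate('countries.resource.yaml')
print(report.valid)
for error in report.tasks[0].errors:
    print(error.type, '-', error.message)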

In the next section, we will clean up the data.

Transforming Data

We will use the metadata to fix all the data type problems automatically. There are only two things we need to handle manually:

- Germany's neighbor_id is wrong: "22" should be "2"
- France's population is marked as "n/a" and needs a real value

The pipeline below fixes both and also filters out the incomplete last row:

cat > countries.pipeline.yaml <<EOF
steps:
  - type: cell-replace
    fieldName: neighbor_id
    pattern: '22'
    replace: '2'
  - type: cell-replace
    fieldName: population
    pattern: 'n/a'
    replace: '67'
  - type: row-filter
    formula: population
  - type: field-update
    name: neighbor_id
    descriptor:
      type: integer
  - type: field-update
    name: population
    descriptor:
      type: integer
  - type: table-normalize
  - type: table-write
    path: countries-cleaned.csv
EOF
frictionless transform countries.csv --pipeline countries.pipeline.yaml
## Schema

+-------------+---------+------------+
| name        | type    | required   |
+=============+=========+============+
| id          | integer |            |
+-------------+---------+------------+
| neighbor_id | integer |            |
+-------------+---------+------------+
| name        | string  |            |
+-------------+---------+------------+
| population  | integer |            |
+-------------+---------+------------+

## Table

+----+-------------+---------+------------+
| id | neighbor_id | name    | population |
+====+=============+=========+============+
|  1 | None        | Britain |         67 |
+----+-------------+---------+------------+
|  2 |           3 | France  |         67 |
+----+-------------+---------+------------+
|  3 |           2 | Germany |         83 |
+----+-------------+---------+------------+
|  4 | None        | Italy   |         60 |
+----+-------------+---------+------------+
from pprint import pprint
from frictionless import Resource, Pipeline, steps

pipeline = Pipeline(steps=[
    # Fix the two cell-level problems
    steps.cell_replace(field_name='neighbor_id', pattern='22', replace='2'),
    steps.cell_replace(field_name='population', pattern='n/a', replace='67'),
    # Drop rows without a population (the incomplete last row)
    steps.row_filter(formula='population'),
    # Correct the field types, as in the CLI pipeline above
    steps.field_update(name='neighbor_id', descriptor={"type": "integer"}),
    steps.field_update(name='population', descriptor={"type": "integer"}),
    steps.table_normalize(),
    steps.table_write(path="countries-cleaned.csv"),
])

source = Resource('countries.csv')
target = source.transform(pipeline)
pprint(target.read_rows())
[{'id': 1, 'neighbor_id': None, 'name': 'Britain', 'population': 67},
 {'id': 2, 'neighbor_id': 3, 'name': 'France', 'population': 67},
 {'id': 3, 'neighbor_id': 2, 'name': 'Germany', 'population': 83},
 {'id': 4, 'neighbor_id': None, 'name': 'Italy', 'population': 60}]

Finally, we've got the cleaned version of our data, which can be exported to a database or published. We used CSV as the output format, but we could have used Excel, JSON, SQL, and others.
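For example (a sketch, not part of the original walkthrough; it assumes the corresponding format plugin is installed, e.g. pip install frictionless[excel]), the cleaned resource could be written out in another format:

from frictionless import Resource

# Re-read the cleaned CSV and write it out as an Excel file
resource = Resource('countries-cleaned.csv')
resource.write('countries-cleaned.xlsx')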

cat countries-cleaned.csv
id,neighbor_id,name,population
1,,Britain,67
2,3,France,67
3,2,Germany,83
4,,Italy,60
with open('countries-cleaned.csv') as file:
    print(file.read())
id,neighbor_id,name,population
1,,Britain,67
2,3,France,67
3,2,Germany,83
4,,Italy,60

Basically, that's it: we now have a valid data file and a corresponding metadata file. They can be shared with other people or archived without fear of type errors or other problems that make research data hard to reproduce.
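As a final check (again a sketch, not part of the original guide), we can validate the cleaned file and confirm that no errors are left:

from frictionless import validate

# The cleaned file should pass validation out of the box
report = validate('countries-cleaned.csv')
print(report.valid)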

ls countries-cleaned.*
countries-cleaned.csv
import os

files = [f for f in os.listdir('.') if os.path.isfile(f) and f.startswith('countries-cleaned.')]
print(files)
['countries-cleaned.csv']

In the next articles, we will explore more advanced Frictionless functionality.