Command-line interface

The manage.py script at the root of the repository provides access to a suite of command-line tools to create and manage the summary tables of one or more collections.

To see all available commands, change to the directory containing your copy of the repository, and run:

./manage.py --help

To see the help message for a specific command, run, for example:

./manage.py add --help

add

Creates a schema containing summary tables about one or more collections.

This command will fail if the schema already exists.

Note

If you notice slow queries and are using solid-state drives, consider tuning PostgreSQL by decreasing random_page_cost:

ALTER TABLESPACE pg_default SET (random_page_cost = 1.0);

Summarize one collection

Replace ID with a Kingfisher Process collection ID, replace NOTE with your name and a description of your purpose, and run:

./manage.py add ID "NOTE"

For example:

./manage.py add 123 "Created by Morgan A. to measure procurement indicators"

This creates a schema named summary_collection_123.

Customize the schema’s name

To customize the last part of the schema’s name (for example, collection_123 in summary_collection_123), set the --name argument to a string of letters, numbers and/or underscores. For example:

./manage.py add 123 "The note" --name the_name

This creates a schema named summary_the_name.

Summarize many collections

Instead of passing one collection ID, you can pass many collection IDs, separated by commas. For example:

./manage.py add 4,5,6 "Created by Morgan A. to compare field coverage"

This creates a schema named summary_collection_4_5_6.

If you need to summarize more than five collections, then you must customize the schema’s name.

Create persistent tables for all summary tables

By default, some summary tables are database views and not persistent tables, in order to save disk space. Use the --tables-only option to make all summary tables into persistent tables (similar to materialized views).

./manage.py add 123 "The note" --name the_name --tables-only

Use this option if:

  • The summaries are large (tables are faster to query than views)

  • You want to allow a user to access the schema’s tables, but not Kingfisher Process’ tables

  • You want to make it easier for a user to discover the foreign key relationships between tables (for example, using \d <table> instead of \d+ <view> followed by \d <table>)

  • You are creating the Entity Relationship Diagram

Calculate JSON paths in each JSON object in each summary table

The --field-lists option adds a field_list column to each summary table, which contains all JSON paths (excluding array indices) in the object that the row describes. For example, a field_list value in the awards_summary table will contain the JSON paths in an award object. A field_list value is a JSONB object in which keys are paths and values are numbers of occurrences.

./manage.py add 123 "The note" --field-lists

This can be used to check for the presence of multiple fields. For example, to count the number of awards that have at least one document with an id and at least one item with an id, run:

SELECT count(*) FROM summary_collection_1.awards_summary WHERE field_list ?& ARRAY['documents/id', 'items/id'];

This could also be written as:

SELECT count(*) FROM summary_collection_1.awards_summary WHERE field_list ? 'documents/id' AND field_list ? 'items/id';

The ?& operator tests whether all keys in the right-hand array exist in the left-hand object. The ? operator tests whether one key exists in the left-hand object.

To count the number of awards that have either at least one document with an id or at least one item with an id, run:

SELECT count(*) FROM summary_collection_1.awards_summary WHERE field_list ?| ARRAY['documents/id', 'items/id'];

This could also be written as:

SELECT count(*) FROM summary_collection_1.awards_summary WHERE field_list ? 'documents/id' OR field_list ? 'items/id';

The ?| operator tests whether any key in the right-hand array exists in the left-hand object.

remove

Drops a schema.

Replace NAME with the last part of a schema’s name (the part after summary_), and run:

./manage.py remove NAME

This is equivalent to:

DROP SCHEMA summary_NAME CASCADE;

index

Lists the schemas, with collection IDs and creator’s notes.

./manage.py index

Outputs:

| Name             |   Collections | Note                                                                         |
|------------------|---------------|------------------------------------------------------------------------------|
| collection_4_5_6 | 4, 5, 6       | Created by Morgan A. to compare field coverage (2020-07-31 14:53:38)         |
| collection_123   | 1             | Created by Morgan A. to measure procurement indicators (2020-01-02 03:04:05) |

To list the schemas only, Connect to the database used by Kingfisher Summarize, using the connecting settings you configured earlier, and run:

\dn

Upgrade Kingfisher Summarize

If the new version of Kingfisher Summarize makes changes to SQL statements, you might want to re-create the collection-specific schemas, by running remove then add for the selected collections:

  1. Get the schema’s collections, for example:

    SELECT collection_id from summaries.selected_collections WHERE schema = 'summary_collection_4_5_6';
    
  2. Remove the schema, for example:

    ./manage.py remove collection_4_5_6
    
  3. Re-add the schema, for example:

    ./manage.py add 4,5,6 "NOTE"