Package: backend.datasets

Submodules:

dataset

This module contains the classes used to standardise, format and combine the data sources used to create the map.

The classes defined in this module are:

DataResolution, DataFrequency

The dataclasses defined in this module are:

Dataset, MasterDataset

class backend.datasets.dataset.DataFrequency

Bases: enum.Enum

Defines a time resolution.

Example

DataFrequency.LIVE

Parameters

Enum (str) – Define the DataFrequency as “live” or “static”

LIVE = 'live'
STATIC = 'static'
class backend.datasets.dataset.DataResolution

Bases: enum.Enum

Defines a geographic resolution.

Example

DataResolution.LA

Parameters

Enum (str) – Define the DataResolution as “LA”, “LSOA” or “MSOA”.

LA = ('LA', 'Local Authority')
LSOA = ('LSOA', 'Lower Super Output Area')
MSOA = ('MSOA', 'Middle Super Output Area')
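The member listings above can be reproduced with the standard library alone. The following is a minimal stand-in, not the module's own source: it only redeclares the members with the values documented here, so anything the real classes add beyond those values is not shown.

```python
from enum import Enum


class DataFrequency(Enum):
    # Plain string values, as listed in the reference above.
    LIVE = "live"
    STATIC = "static"


class DataResolution(Enum):
    # Each value is a (short code, full name) tuple, as listed above.
    LA = ("LA", "Local Authority")
    LSOA = ("LSOA", "Lower Super Output Area")
    MSOA = ("MSOA", "Middle Super Output Area")


# Members can be looked up by value, and tuple values can be unpacked.
frequency = DataFrequency("live")
short_code, full_name = DataResolution.LSOA.value
print(frequency, short_code, full_name)
```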
class backend.datasets.dataset.Dataset(data: pandas.core.frame.DataFrame, res: backend.datasets.dataset.DataResolution, key_col: str, key_is_code: bool, csv_name: str, keep_cols: list = None, bracketed_data_cols: list = None, rename: dict = None)

Bases: object

Class to handle transformations to source datasets, returning them in a standardised format.

data

The source dataset to standardise. Expected to have a row for each geographic area with at least one column defining the area name or code.

Type

pd.DataFrame

res

The DataResolution type of the data. See the class for options.

Type

DataResolution

key_col

Name of the column containing the unique area key.

Type

str

key_is_code

Whether key_col contains area codes (True) or area names (False).

Type

bool

csv_name

Name of the csv to write out to, if called. The resolution is automatically prepended to the name when the file is written.

Type

str

keep_cols

List of columns to keep in the standardised dataset. If columns have been renamed, use the names given after renaming. Must also include the renamed key_col with its formal name (‘LSOA11CD’, ‘LSOA11NM’, ‘lad19cd’, ‘lad19nm’).

Type

list, optional

bracketed_data_cols

List of columns that have data in the format NUMBER (PERCENTAGE).

Type

list, optional

rename

Dictionary in format {‘old_name’ : ‘new_name’ } for columns to be renamed.

Type

dict, optional

std_data_

Standardised data. It has a name column and a code column, plus the columns chosen to keep; their contents and column names may be updated based on the args provided.

Type

pd.DataFrame

bracketed_data_cols: list = None
clean_bracketed_data()

For a df with columns in the format ‘NUMBER (PERCENTAGE)’ this function extracts the data into two new columns and deletes the original column.

Notes

The two new columns will be named the same as the original column, but with _count or _pct appended.

Returns

DataFrame with each col replaced by two new columns with the count and percentage.

Return type

pd.DataFrame
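To illustrate the ‘NUMBER (PERCENTAGE)’ split described above, here is a minimal sketch using pandas string methods. The helper name split_bracketed, the regular expression and the sample column over_65 are assumptions made for illustration; the actual method may parse the values differently.

```python
import pandas as pd


def split_bracketed(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Hypothetical stand-in for the behaviour of clean_bracketed_data."""
    out = df.copy()
    # Assumed pattern: a number, then a second number inside brackets,
    # optionally followed by a percent sign.
    pattern = r"^\s*([\d.,]+)\s*\(\s*([\d.]+)\s*%?\s*\)\s*$"
    for col in cols:
        parts = out[col].astype(str).str.extract(pattern)
        # New columns reuse the original name with _count / _pct appended.
        out[f"{col}_count"] = pd.to_numeric(
            parts[0].str.replace(",", "", regex=False)
        )
        out[f"{col}_pct"] = pd.to_numeric(parts[1])
        # The original bracketed column is removed.
        out = out.drop(columns=col)
    return out


# Illustrative values only, not taken from any real source dataset.
example = pd.DataFrame(
    {"lad19nm": ["Cardiff", "Swansea"], "over_65": ["1,200 (18.5)", "950 (20.1%)"]}
)
cleaned = split_bracketed(example, ["over_65"])
print(cleaned)
```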

static clean_keys(df, res, key_col, key_is_code=True)

Ensures the df key column (i.e. the column used for joining) is correctly formatted for joins in the next steps. Accepts the key as a code or name, at LA or LSOA level. Will rename the key column if it does not have the standard name.

Notes

Renaming of the key columns depends on the DataResolution. For LSOA they will be LSOA11CD and LSOA11NM for code and name respectively. For LA they will be lad19cd and lad19nm respectively.

Parameters
  • df (pd.DataFrame) – Dataframe to be cleaned

  • res (DataResolution) – Accepts ‘LA’ or ‘LSOA’ as resolution of the data

  • key_col (str) – Name of column containing the code or name

  • key_is_code (bool, optional) – If True, key_col is a code. If False, key_col is a name. By default True

Returns

Returns dataframe with key_col stripped of whitespace, filtered to only Welsh areas and renamed for consistency.

Return type

pd.DataFrame

Raises

Exception – When the number of rows after merging on the new key column names does not match the DataResolution.
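The cleaning steps described for clean_keys (rename, strip whitespace, filter by merging, check the row count) can be sketched as follows. This is a sketch under stated assumptions, not the method's source: the function name, the STANDARD_NAMES mapping, the two-row reference table standing in for the real LA lookup, and the exact row-count check are all hypothetical.

```python
import pandas as pd

# Assumed mapping from (resolution code, key_is_code) to the standard key name.
STANDARD_NAMES = {
    ("LA", True): "lad19cd",
    ("LA", False): "lad19nm",
    ("LSOA", True): "LSOA11CD",
    ("LSOA", False): "LSOA11NM",
}


def clean_keys_sketch(df, reference, res_code, key_col, key_is_code=True):
    """Hypothetical stand-in: rename, strip, then filter by an inner merge."""
    std_col = STANDARD_NAMES[(res_code, key_is_code)]
    out = df.rename(columns={key_col: std_col})
    out[std_col] = out[std_col].astype(str).str.strip()
    # An inner merge keeps only rows whose key appears in the reference table.
    merged = out.merge(reference[[std_col]], on=std_col, how="inner")
    if len(merged) != len(reference):
        raise Exception(
            f"Expected {len(reference)} rows for {res_code}, got {len(merged)}"
        )
    return merged


# Tiny illustrative reference table; the real lookup covers every area.
reference = pd.DataFrame({"lad19cd": ["W06000011", "W06000015"]})
raw = pd.DataFrame(
    {"Area Code": [" W06000015", "W06000011 ", "E06000023"], "value": [1, 2, 3]}
)
cleaned_keys = clean_keys_sketch(raw, reference, "LA", "Area Code")
print(cleaned_keys)
```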

csv_name: str = None
csv_path()

Generates a name to output to csv.

Returns

csv_name prepended with the data resolution and appended with .csv

Return type

str

data: pd.DataFrame = None
property is_standardised

Returns bool of whether the standardised data has been generated.

keep_cols: list = None
key_col: str = None
key_is_code: bool = None
read_keys()

Reads and returns the LSOA and LA geopandas dataframes as constants ‘LSOA’, ‘LA’.

rename: dict = None
res: DataResolution = None
standardise()

Based on attributes, applies the correct functions to standardise the datasets.

The function will set self.std_data_

Returns

This function will return the class instance.

Return type

Dataset

standardise_keys()

Given dataframe and chosen cols, will use LA or LSOA geopandas dataframes to create standardised columns for area codes and names

Returns

self.data with standardised key codes and names. If keep_cols is used then only those columns will be returned.

Return type

pd.DataFrame

Raises
  • Exception – When the number of rows generated does not match the DataResolution

  • ValueError – When the DataResolution is not LA or LSOA

property standardised_data

Returns the std_data_ object pd.DataFrame.

std_data_: pandas.core.frame.DataFrame = None
write()

Writes the standardised data to csv in the cleaned folder, using naming convention of resolution_name.csv. If the same file already exists it will not write.

Raises

ValueError – If the data has not already been standardised.
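The write behaviour described above (resolution_name.csv naming, no overwrite, ValueError before standardisation) might look roughly like this. The function write_if_missing, its parameters and the use of the short resolution code as the prefix are assumptions for illustration; the location of the cleaned folder is not given here, so a temporary directory stands in for it.

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_if_missing(std_data, res_code, csv_name, folder):
    """Hypothetical stand-in for the behaviour of Dataset.write."""
    if std_data is None:
        # Mirrors the documented ValueError for data not yet standardised.
        raise ValueError("Data must be standardised before it is written.")
    # Assumed naming: resolution code, underscore, csv_name, then .csv.
    path = Path(folder) / f"{res_code}_{csv_name}.csv"
    if path.exists():
        return False  # an existing file is left untouched
    std_data.to_csv(path, index=False)
    return True


frame = pd.DataFrame({"lad19cd": ["W06000015"], "population": [100]})
with tempfile.TemporaryDirectory() as tmp:
    first = write_if_missing(frame, "LA", "population", tmp)
    second = write_if_missing(frame, "LA", "population", tmp)
    written_files = sorted(p.name for p in Path(tmp).iterdir())
print(first, second, written_files)
```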

class backend.datasets.dataset.MasterDataset(datasets: List[backend.datasets.dataset.Dataset], res: backend.datasets.dataset.DataResolution, freq: backend.datasets.dataset.DataFrequency, from_csv: bool = True)

Bases: object

Used to call or generate the merged ‘master’ dataset used to write to json. Can be used to generate the ‘live’ or ‘static’ master datasets. Will write out to csv if it does not already exist, or if user chooses ‘from_csv’ to be False.

Notes

When the master dataset is created, certain data transformations are assumed, which create variables from expected columns in LSOA/LA STATIC or LIVE datasets. These are detailed in the _create_master_dataset method.

datasets

A list of Dataset instances to be merged into the master dataset.

Type

List[Dataset]

res

The DataResolution of the data. Accepts LA or LSOA.

Type

DataResolution

freq

The DataFrequency of the data. Accepts STATIC or LIVE.

Type

DataFrequency

from_csv

Whether the master dataset should be built from previously generated csv.

Type

bool

master_dataset_

The final merged dataset.

Type

pd.DataFrame

datasets: List[Dataset] = None
property file_path

Returns str filepath to write csv to, based on freq and res

freq: DataFrequency = None
from_csv: bool = True
property master_dataset

Returns the master dataset, and generates it if it does not exist.

master_dataset_: pandas.core.frame.DataFrame = None
res: DataResolution = None
static write(data, filepath)

Writes data to csv on the given filepath.
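The from_csv behaviour described for MasterDataset can be sketched as a load-or-build helper. This is an illustration only: the function load_or_build, the outer merge on a shared key column and the file name LA_STATIC_master.csv are assumptions, and the transformations applied in _create_master_dataset are not reproduced.

```python
import tempfile
from functools import reduce
from pathlib import Path

import pandas as pd


def load_or_build(frames, key, filepath, from_csv=True):
    """Hypothetical stand-in for how the master dataset is obtained."""
    path = Path(filepath)
    if from_csv and path.exists():
        # Reuse the previously generated master csv.
        return pd.read_csv(path)
    # Otherwise merge every standardised frame on the shared key and write out.
    merged = reduce(
        lambda left, right: left.merge(right, on=key, how="outer"), frames
    )
    merged.to_csv(path, index=False)
    return merged


population = pd.DataFrame({"lad19cd": ["W06000011", "W06000015"], "population": [10, 20]})
broadband = pd.DataFrame({"lad19cd": ["W06000011", "W06000015"], "broadband": [0.9, 0.8]})

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "LA_STATIC_master.csv"  # hypothetical file name
    built = load_or_build([population, broadband], "lad19cd", target)
    # The file now exists, so the source frames are not needed again.
    reread = load_or_build([], "lad19cd", target, from_csv=True)
print(reread)
```

Passing from_csv=False in this sketch always rebuilds and overwrites the csv, which matches the documented way to integrate newly added sources.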

live

Handles live data from reading to master dataset generation.

This module reads, standardises and defines the master dataset for data sources from the data/live folder.

Notes

This module imports classes from the dataset module. The LA_LIVE master dataset definition has from_csv=False. This means that the live dataset master will always be regenerated from the source files given here, rather than read from the existing master csv.

To add a new live datasource, follow the existing examples: define a SOURCE_ constant that is read from file, pass it to a Dataset definition, and then add that Dataset to the LA_LIVE datasets list to ensure it is included.

static

Module that generates the cleaned datasets for each variable from raw.

This module standardises and defines the master dataset for data sources from the data/static folder.

Notes

This module imports classes from the dataset module, and source dataset constants from the static_source_datasets module.

To add a new datasource, follow the existing examples for a SOURCE_ constant that is passed to a Dataset definition, and then add the Dataset constant to the MasterDataset datasets list to ensure it is included.

The LSOA_STATIC and LA_STATIC master datasets have from_csv=True. This means that the master datasets will always be read from a previously generated master csv, rather than regenerated from source files. If new sources are added, this will need to be run at least once with from_csv=False to integrate new sources into a new master csv.

static_source_datasets

Handles reading of datasets from static source data folder.

generate_gp_online

Generates gp_online.csv

Functions contained in this module take data of patients registered with My Health Online from local data folders. The gp_to_area function matches each GP practice to a postcode, and then matches that postcode to a Local Authority, Lower Super Output Area and Middle Super Output Area.

The mhol_to_pct function then calculates the overall percentage of patients registered online at the area level chosen.

To run, these functions require local data that is not published in the repository.

backend.datasets.generate_gp_online.gp_to_area(gp_data, postcode_lookup, gp_lookup)

Using data and lookup tables, generates a new dataframe with the LA and LSOA matched to each GP practice code.

Parameters
  • gp_data (pd.DataFrame) – Data on number of patients per GP practice. Should contain cols GP Practice Code, MHOL patient count, Patients (2019).

  • postcode_lookup (pd.DataFrame) – A df with a column of postcodes (pscds) and columns matching each postcode to an LSOA, LA and MSOA area.

  • gp_lookup (pd.DataFrame) – A df with a column of GP practice IDs (practice_ID) and postcodes (postcode).

Returns

A df where each row is a GP practice matched to a postcode, LA, LSOA and MSOA, along with associated data columns from gp_data.

Return type

pd.DataFrame
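The two-step matching described for gp_to_area (practice to postcode, then postcode to area) can be sketched with two pandas merges. The column names GP Practice Code, practice_ID, postcode and pscds come from the parameter descriptions above; the area column names (lsoa11cd, msoa11cd, ladcd), the function name and every value in the sample frames are assumptions made for illustration.

```python
import pandas as pd


def match_gp_to_area(gp_data, postcode_lookup, gp_lookup):
    """Hypothetical stand-in for the matching done by gp_to_area."""
    # Step 1: attach a postcode to each GP practice.
    with_postcode = gp_data.merge(
        gp_lookup, left_on="GP Practice Code", right_on="practice_ID", how="left"
    )
    # Step 2: attach the area codes for that postcode.
    return with_postcode.merge(
        postcode_lookup, left_on="postcode", right_on="pscds", how="left"
    )


# Made-up rows purely to show the shape of the inputs.
gp_data = pd.DataFrame(
    {
        "GP Practice Code": ["P001", "P002"],
        "MHOL patient count": [150, 300],
        "Patients (2019)": [1000, 1500],
    }
)
gp_lookup = pd.DataFrame(
    {"practice_ID": ["P001", "P002"], "postcode": ["AB1 2CD", "EF3 4GH"]}
)
postcode_lookup = pd.DataFrame(
    {
        "pscds": ["AB1 2CD", "EF3 4GH"],
        "lsoa11cd": ["W01000001", "W01000002"],
        "msoa11cd": ["W02000001", "W02000002"],
        "ladcd": ["W06000015", "W06000011"],
    }
)
matched = match_gp_to_area(gp_data, postcode_lookup, gp_lookup)
print(matched[["GP Practice Code", "postcode", "ladcd", "lsoa11cd"]])
```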

backend.datasets.generate_gp_online.mhol_to_pct(df, res: backend.datasets.dataset.DataResolution)

For a dataframe in which each row is a GP practice mapped to LA or LSOA codes and names, sums “patients_total” and “MHOL_true” across each area, and creates a new col with the total percentage for each area.

Parameters
  • df (pd.DataFrame) – Each row is a unique GP practice with an area code and name, and patients_total and MHOL_true columns.

  • res (DataResolution, optional) – The area resolution of the data, specified using the DataResolution class that is imported from the dataset module. By default DataResolution.LA

Returns

Returns df with rows as each area with sum of “patients_total”, “MHOL_true” and new col “MHOL_pct” which is the percentage of MHOL_true out of patients_total.

Return type

pd.DataFrame
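The aggregation described for mhol_to_pct amounts to a group-by sum followed by a ratio. Below is a minimal sketch, assuming the area is identified by a code column and a name column passed in explicitly; the function name to_pct, those two parameters and the sample rows are hypothetical, whereas the real function takes a DataResolution instead.

```python
import pandas as pd


def to_pct(df, code_col, name_col):
    """Hypothetical stand-in for the aggregation done by mhol_to_pct."""
    # Sum patients and online registrations over every practice in an area.
    totals = df.groupby([code_col, name_col], as_index=False)[
        ["patients_total", "MHOL_true"]
    ].sum()
    # Percentage of registered patients who are signed up online.
    totals["MHOL_pct"] = totals["MHOL_true"] / totals["patients_total"] * 100
    return totals


# Made-up practices: two in one area, one in another.
practices = pd.DataFrame(
    {
        "lad19cd": ["W06000015", "W06000015", "W06000011"],
        "lad19nm": ["Cardiff", "Cardiff", "Swansea"],
        "patients_total": [500, 1500, 1000],
        "MHOL_true": [100, 400, 100],
    }
)
by_area = to_pct(practices, "lad19cd", "lad19nm")
print(by_area)
```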