Package: backend.datasets¶

Submodules:¶

dataset¶

This module contains the classes used to standardise, format and combine the data sources used to create the map.

The classes defined in this module are: DataResolution, DataFrequency.
The dataclasses defined in this module are: Dataset, MasterDataset.

class backend.datasets.dataset.DataFrequency¶

Bases: enum.Enum

Defines a time resolution.

Example

DataFrequency.LIVE

Parameters: Enum (str) – Define the DataFrequency as “live” or “static”

LIVE = 'live'¶

STATIC = 'static'¶

class backend.datasets.dataset.DataResolution¶

Bases: enum.Enum

Defines a geographic resolution.

Example

DataResolution.LA

Parameters: Enum (str) – Define the DataResolution as “LA”, “LSOA”, “MSOA”.

LA = ('LA', 'Local Authority')¶

LSOA = ('LSOA', 'Lower Super Output Area')¶

MSOA = ('MSOA', 'Middle Super Output Area')¶
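As a minimal sketch of how the two enums behave, the block below re-declares them locally so that it runs without the package installed. The member names and values are copied from the listings above; anything else the real classes define is not shown.

```python
from enum import Enum


class DataFrequency(Enum):
    LIVE = "live"
    STATIC = "static"


class DataResolution(Enum):
    # Each value is a (short code, description) tuple, as listed above.
    LA = ("LA", "Local Authority")
    LSOA = ("LSOA", "Lower Super Output Area")
    MSOA = ("MSOA", "Middle Super Output Area")


# Unpack the tuple value of a resolution member.
short_code, description = DataResolution.LSOA.value
print(short_code, description)

# Members can be looked up by value as well as by name.
print(DataFrequency("live") is DataFrequency.LIVE)
```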

class backend.datasets.dataset.Dataset(data: pandas.core.frame.DataFrame, res: backend.datasets.dataset.DataResolution, key_col: str, key_is_code: bool, csv_name: str, keep_cols: list = None, bracketed_data_cols: list = None, rename: dict = None)¶

Bases: object

Class to handle transformations to source datasets, returning them in a standardised format.

data¶

The source dataset to standardise. Expected to have a row for each geographic area with at least one column defining the area name or code.

Type: pd.DataFrame

res¶

The DataResolution type of the data. See the class for options.

Type: DataResolution

key_col¶

The name of the column containing the unique area key.

Type: str

key_is_code¶

Whether key_col holds area codes (True) or area names (False).

Type: bool

csv_name¶

Name of the csv to write out to, if called. The resolution is automatically prepended to the name when the file is written.

Type: str

keep_cols¶

List of columns to keep in the standardised dataset. If columns have been renamed, use the names given after renaming. Must also include the renamed key_col with its formal name (‘LSOA11CD’, ‘LSOA11NM’, ‘lad19cd’, ‘lad19nm’).

Type: list, optional

bracketed_data_cols¶

List of columns that have data in the format NUMBER (PERCENTAGE).

Type: list, optional

rename¶

Dictionary in format {‘old_name’ : ‘new_name’ } for columns to be renamed.

Type: dict, optional

std_data_

Standardised data. It has a name column and a code column, plus the columns chosen to keep. The contents and names of those columns may be updated based on the args provided.

Type: pd.DataFrame

bracketed_data_cols: list = None¶

clean_bracketed_data()¶

For a df with columns in the format ‘NUMBER (PERCENTAGE)’ this function extracts the data into two new columns and deletes the original column.

Notes

The two new columns will be named the same as the original column, but with _count or _pct appended.

Returns: DataFrame with each col replaced by two new columns with the count and percentage.
Return type: pd.DataFrame
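The split described above can be sketched with pandas string methods. This is an illustration only: split_bracketed is a hypothetical helper rather than the package's implementation, and the handling of thousands separators and an optional % sign is an assumption about the input.

```python
import pandas as pd


def split_bracketed(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Hypothetical helper: split 'NUMBER (PERCENTAGE)' columns in two."""
    out = df.copy()
    pattern = r"^\s*([\d.,]+)\s*\(\s*([\d.]+)\s*%?\s*\)\s*$"
    for col in cols:
        # Group 0 is the number before the brackets, group 1 the number inside.
        parts = out[col].astype(str).str.extract(pattern)
        # Assumption: commas are thousands separators and can be dropped.
        out[f"{col}_count"] = pd.to_numeric(parts[0].str.replace(",", "", regex=False))
        out[f"{col}_pct"] = pd.to_numeric(parts[1])
        # The original combined column is removed once it has been split.
        out = out.drop(columns=col)
    return out


example = pd.DataFrame({"area": ["A", "B"], "online": ["120 (45.5)", "1,300 (62%)"]})
cleaned = split_bracketed(example, ["online"])
print(cleaned)
```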

static clean_keys(df, res, key_col, key_is_code=True)¶

Ensures the df key column (i.e. the column used for joining) is correctly formatted for joins in the next steps. Accepts the key as a code or name, at LA or LSOA level. Will rename the key column if it does not have the standard name.

Notes

Renaming of the key columns depends on the DataResolution. For LSOA they will be LSOA11CD and LSOA11NM for code and name respectively. For LA they will be lad19cd and lad19nm respectively.

Parameters

df (pd.DataFrame) – Dataframe to be cleaned
res (DataResolution) – Accepts ‘LA’ or ‘LSOA’ as resolution of the data
key_col (str) – Name of column containing the code or name
key_is_code (bool, optional) – If True the key_col is a code. If false, key_col is a name. By default True

Returns

Returns dataframe with key_col stripped of whitespace, filtered to only Welsh areas and renamed for consistency.

Return type

pd.DataFrame

Raises

Exception – When the number of rows after merging on the new key column names does not match the number expected for the DataResolution.
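A rough sketch of the cleaning steps, under stated assumptions: tidy_keys is a hypothetical stand-in for clean_keys, the resolution is passed as a plain string rather than a DataResolution member, and Welsh areas are identified by codes that start with "W". The merge against the key dataframes and the row-count check are left out.

```python
import pandas as pd

# Standard key names taken from the documentation; the lookup itself is illustrative.
STANDARD_KEYS = {
    ("LSOA", True): "LSOA11CD",
    ("LSOA", False): "LSOA11NM",
    ("LA", True): "lad19cd",
    ("LA", False): "lad19nm",
}


def tidy_keys(df, res_name, key_col, key_is_code=True):
    """Hypothetical stand-in for clean_keys; res_name is 'LA' or 'LSOA'."""
    out = df.copy()
    # Strip stray whitespace so that joins on the key do not silently miss rows.
    out[key_col] = out[key_col].astype(str).str.strip()
    if key_is_code:
        # Assumption: Welsh area codes start with "W".
        out = out[out[key_col].str.startswith("W")]
    return out.rename(columns={key_col: STANDARD_KEYS[(res_name, key_is_code)]})


raw = pd.DataFrame({"code": [" W06000015 ", "E06000001"], "value": [1, 2]})
tidy = tidy_keys(raw, "LA", "code")
print(tidy)
```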

csv_name: str = None¶

csv_path()¶

Generates a name to output to csv.

Returns: csv_name prepended with the data resolution and appended with .csv
Return type: str

data: pd.DataFrame = None¶

property is_standardised¶: Returns bool of whether the standardised data has been generated.

keep_cols: list = None¶

key_col: str = None¶

key_is_code: bool = None¶

read_keys()¶: Reads and returns the LSOA and LA geopandas dataframes as constants ‘LSOA’, ‘LA’.

rename: dict = None¶

res: DataResolution = None¶

standardise()¶

Based on attributes, applies the correct functions to standardise the datasets.

The function will set self.std_data_

Returns: This function will return the class instance.
Return type: Dataset

standardise_keys()¶

Given a dataframe and chosen cols, uses the LA or LSOA geopandas dataframes to create standardised columns for area codes and names.

Returns

self.data with standardised key codes and names. If keep_cols is used then only those columns will be returned.

Return type

pd.DataFrame

Raises

Exception – When the number of rows generated does not match the DataResolution
ValueError – When the DataResolution is not LA or LSOA

property standardised_data¶: Returns the std_data_ object pd.DataFrame.

std_data_: pandas.core.frame.DataFrame = None¶

write()¶

Writes the standardised data to csv in the cleaned folder, using naming convention of resolution_name.csv. If the same file already exists it will not write.

Raises: ValueError – If the data has not already been standardised.
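The write-once behaviour can be illustrated as follows. write_once is a hypothetical function, the output folder is passed in explicitly because the location of the cleaned folder is not documented here, and the file name follows the resolution_name.csv convention described above.

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_once(std_data, folder, res_code, csv_name):
    """Hypothetical stand-in for Dataset.write, not the package's code."""
    if std_data is None:
        raise ValueError("Standardise the data before writing it.")
    path = Path(folder) / f"{res_code}_{csv_name}.csv"
    # Skip the write when a file with the same name is already present.
    if not path.exists():
        std_data.to_csv(path, index=False)
    return path


with tempfile.TemporaryDirectory() as tmp:
    first = write_once(pd.DataFrame({"v": [1]}), tmp, "LA", "example")
    first_text = first.read_text()
    # Same target file, different data: the existing file is left untouched.
    write_once(pd.DataFrame({"v": [2]}), tmp, "LA", "example")
    second_text = first.read_text()
```

One consequence of this design is that a stale csv has to be deleted by hand before a changed dataset is written again.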

class backend.datasets.dataset.MasterDataset(datasets: List[backend.datasets.dataset.Dataset], res: backend.datasets.dataset.DataResolution, freq: backend.datasets.dataset.DataFrequency, from_csv: bool = True)¶

Bases: object

Used to call or generate the merged ‘master’ dataset used to write to json. Can be used to generate the ‘live’ or ‘static’ master datasets. Writes the master dataset out to csv if the file does not already exist, or if from_csv is set to False.

Notes

When the master dataset is created, certain data transformations are assumed, which create variables from expected columns in LSOA/LA STATIC or LIVE datasets. These are detailed in the _create_master_dataset method.

datasets¶

A list of Dataset instances to be merged into the master dataset.

Type: List[Dataset]

res¶

The DataResolution of the data. Accepts LA or LSOA.

Type: DataResolution

freq¶

The DataFrequency of the data. Accepts STATIC or LIVE.

Type: DataFrequency

from_csv¶

Whether the master dataset should be built from previously generated csv.

Type: bool

master_dataset_

The final merged dataset.

Type: pd.DataFrame

datasets: List[Dataset] = None¶

property file_path¶: Returns str filepath to write csv to, based on freq and res

freq: DataFrequency = None¶

from_csv: bool = True¶

property master_dataset¶: Returns the master dataset, and generates it if it does not exist.

master_dataset_: pandas.core.frame.DataFrame = None¶

res: DataResolution = None¶

static write(data, filepath)¶: Writes data to csv on the given filepath.
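The from_csv behaviour amounts to a read-or-rebuild pattern, sketched below. load_or_build is a hypothetical function, and the outer merge on a shared key column is an assumption. The transformations applied in _create_master_dataset are not reproduced.

```python
import tempfile
from functools import reduce
from pathlib import Path

import pandas as pd


def load_or_build(frames, key, path, from_csv=True):
    """Hypothetical sketch: reuse an existing master csv or rebuild it."""
    path = Path(path)
    if from_csv and path.exists():
        return pd.read_csv(path)
    # Assumption: the standardised datasets are combined with outer merges on the key.
    master = reduce(lambda left, right: left.merge(right, on=key, how="outer"), frames)
    master.to_csv(path, index=False)
    return master


a = pd.DataFrame({"lad19cd": ["W06000015", "W06000011"], "x": [1, 2]})
b = pd.DataFrame({"lad19cd": ["W06000015", "W06000011"], "y": [3, 4]})

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "master.csv"
    # from_csv=False forces a rebuild from the source frames.
    built = load_or_build([a, b], "lad19cd", target, from_csv=False)
    # from_csv=True reads the csv written above and ignores the frames passed in.
    cached = load_or_build([a], "lad19cd", target, from_csv=True)
```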

live¶

Handles live data from reading to master dataset generation.

This module reads, standardises and defines the master dataset for data sources from the data/live folder.

Notes

This module imports classes from the dataset module. The LA_LIVE master dataset definition has from_csv=False. This means that the live dataset master will always be regenerated from the source files given here, rather than read from the existing master csv.

To add a new live datasource, follow the existing examples: define a SOURCE_ constant that is read from file, pass it to a Dataset definition, and then add that Dataset definition to the LA_LIVE datasets list to ensure it is included.

static¶

Module that generates the cleaned datasets for each variable from raw.

This module standardises and defines the master dataset for data sources from the data/static folder.

Notes

This module imports classes from the dataset module, and source dataset constants from the static_source_datasets module.

To add a new datasource, follow the existing examples for a SOURCE_ constant that is passed to a Dataset definition, and then add the Dataset constant to the MasterDataset datasets list to ensure it is included.

The LSOA_STATIC and LA_STATIC master datasets have from_csv=True. This means that the master datasets will always be read from a previously generated master csv, rather than regenerated from source files. If new sources are added, this will need to be run at least once with from_csv=False to integrate new sources into a new master csv.

static_source_datasets¶

Handles reading of datasets from static source data folder.

generate_gp_online¶

Generates gp_online.csv

Functions contained in this module take data of patients registered with My Health Online from local data folders. The gp_to_area function matches each GP practice to a postcode, and then matches that postcode to a Local Authority, Lower Super Output Area and Middle Super Output Area.

The mhol_to_pct function then calculates the overall percentage of patients registered online at the area level chosen.

To run, these functions require local data that is not published in the repository.

backend.datasets.generate_gp_online.gp_to_area(gp_data, postcode_lookup, gp_lookup)¶

Using data and lookup tables, generates a new dataframe with the LA and LSOA matched to each GP practice code.

Parameters

gp_data (pd.DataFrame) – Data on number of patients per GP practice. Should contain cols GP Practice Code, MHOL patient count, Patients (2019).
postcode_lookup (pd.DataFrame) – A df with a column of postcodes (pscds) and columns matching each postcode to an LSOA, LA and MSOA area.
gp_lookup (pd.DataFrame) – A df with a column of GP practice IDs (practice_ID) and postcodes (postcode).

Returns

A df where each row is a GP practice matched to a postcode, LA, LSOA and MSOA, along with associated data columns from gp_data.

Return type

pd.DataFrame
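The matching described for gp_to_area can be sketched as two left merges. match_gp_to_area is a hypothetical function. The columns GP Practice Code, practice_ID, postcode and pscds come from the parameter descriptions above, while lsoa_code, la_code and msoa_code are placeholder names and every value is made up for illustration.

```python
import pandas as pd


def match_gp_to_area(gp_data, postcode_lookup, gp_lookup):
    """Hypothetical sketch of the practice -> postcode -> area matching."""
    # Step 1: attach a postcode to each GP practice.
    with_postcode = gp_data.merge(
        gp_lookup, left_on="GP Practice Code", right_on="practice_ID", how="left"
    )
    # Step 2: attach the area columns for that postcode.
    return with_postcode.merge(
        postcode_lookup, left_on="postcode", right_on="pscds", how="left"
    )


# Placeholder data: none of these codes or postcodes are real.
gp_data = pd.DataFrame(
    {
        "GP Practice Code": ["P1", "P2"],
        "MHOL patient count": [10, 30],
        "Patients (2019)": [100, 300],
    }
)
gp_lookup = pd.DataFrame({"practice_ID": ["P1", "P2"], "postcode": ["AA1 1AA", "BB2 2BB"]})
postcode_lookup = pd.DataFrame(
    {
        "pscds": ["AA1 1AA", "BB2 2BB"],
        "lsoa_code": ["LSOA_A", "LSOA_B"],
        "la_code": ["LA_A", "LA_A"],
        "msoa_code": ["MSOA_A", "MSOA_B"],
    }
)

matched = match_gp_to_area(gp_data, postcode_lookup, gp_lookup)
print(matched)
```

Left merges keep every practice in gp_data, so a practice with no matching postcode shows up with missing area values instead of being dropped.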

backend.datasets.generate_gp_online.mhol_to_pct(df, res: backend.datasets.dataset.DataResolution)¶

For a dataframe with rows of each GP practice mapped to LA or LSOA codes and names, sums “patients_total” and “MHOL_true” across each area, and creates a new col with the total percentage for each area.

Parameters

df (pd.DataFrame) – Each row is a unique GP practice with an area code and name, and patients_total and MHOL_true columns.
res (DataResolution, optional) – The area resolution of the data, specified using the DataResolution class that is imported from the dataset module. By default DataResolution.LA

Returns

Returns df with rows as each area with sum of “patients_total”, “MHOL_true” and new col “MHOL_pct” which is the percentage of MHOL_true out of patients_total.

Return type

pd.DataFrame
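The aggregation can be sketched with a groupby. to_pct is a hypothetical function that takes the area code and name columns explicitly rather than a DataResolution, and it assumes the percentage is expressed on a 0 to 100 scale.

```python
import pandas as pd


def to_pct(df, code_col, name_col):
    """Hypothetical sketch: area-level totals and percentage registered online."""
    grouped = df.groupby([code_col, name_col], as_index=False)[
        ["patients_total", "MHOL_true"]
    ].sum()
    # Assumption: MHOL_pct is a percentage between 0 and 100.
    grouped["MHOL_pct"] = grouped["MHOL_true"] * 100 / grouped["patients_total"]
    return grouped


# Placeholder data: three practices across two made-up areas.
practices = pd.DataFrame(
    {
        "lad19cd": ["A1", "A1", "B2"],
        "lad19nm": ["Area A", "Area A", "Area B"],
        "patients_total": [100, 300, 200],
        "MHOL_true": [10, 30, 50],
    }
)

by_area = to_pct(practices, "lad19cd", "lad19nm")
print(by_area)
```

Summing before dividing gives the overall percentage for the area, which differs from the mean of the per-practice percentages when practice sizes vary.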