Package: backend.datasets
Submodules:
dataset
This module contains the classes used to standardise, format and combine the data sources used to create the map.
- The classes defined in this module are:
DataResolution, DataFrequency
- The dataclasses defined in this module are:
Dataset, MasterDataset
-
class
backend.datasets.dataset.
DataFrequency
¶ Bases:
enum.Enum
Defines a time resolution.
Example
DataFrequency.LIVE
- Parameters
Enum (str) – Define the DataFrequency as “live” or “static”
-
LIVE
= 'live'¶
-
STATIC
= 'static'¶
-
class
backend.datasets.dataset.
DataResolution
¶ Bases:
enum.Enum
Defines a geographic resolution.
Example
DataResolution.LA
- Parameters
Enum (str) – Define the DataResolution as “LA”, “LSOA”, “MSOA”.
-
LA
= ('LA', 'Local Authority')¶
-
LSOA
= ('LSOA', 'Lower Super Output Area')¶
-
MSOA
= ('MSOA', 'Middle Super Output Area')¶
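The two enums above can be illustrated with a minimal stand-in. The classes below only mirror the member names and values listed in this reference; they are not the package's source, and the unpacking of the tuple value is an assumption about how the short code and long name might be read.

```python
from enum import Enum


class DataFrequency(Enum):
    """Stand-in mirroring the documented members."""

    LIVE = "live"
    STATIC = "static"


class DataResolution(Enum):
    """Stand-in mirroring the documented members."""

    LA = ("LA", "Local Authority")
    LSOA = ("LSOA", "Lower Super Output Area")
    MSOA = ("MSOA", "Middle Super Output Area")


# Each DataResolution value is a (short code, long name) tuple.
short_code, long_name = DataResolution.LSOA.value

# Members can also be looked up from their value.
frequency = DataFrequency("live")
resolution = DataResolution(("LA", "Local Authority"))
```

Because the DataResolution values are tuples, a plain string such as "LA" is not a valid lookup value; the member name (DataResolution["LA"]) or the full tuple is needed.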
-
class
backend.datasets.dataset.
Dataset
(data: pandas.core.frame.DataFrame, res: backend.datasets.dataset.DataResolution, key_col: str, key_is_code: bool, csv_name: str, keep_cols: list = None, bracketed_data_cols: list = None, rename: dict = None)¶ Bases:
object
Class to handle transformations to source datasets, returning them in a standardised format.
-
data
¶ The source dataset to standardise. Expected to have a row for each geographic area with at least one column defining the area name or code.
- Type
pd.DataFrame
-
res
¶ The DataResolution type of the data. See the class for options.
- Type
DataResolution
-
key_col
¶ The name of the column containing the unique area key.
- Type
str
-
key_is_code
¶ Whether key_col holds an area code (True) or an area name (False).
- Type
bool
-
csv_name
¶ Name of the csv to write out to, if called. The resolution is automatically prepended to the name when written.
- Type
str
-
keep_cols
¶ List of columns to keep in the standardised dataset. If columns have been renamed, use the names given after renaming. Must also include the renamed key_col with its formal name (‘LSOA11CD’, ‘LSOA11NM’, ‘lad19cd’, ‘lad19nm’).
- Type
list, optional
-
bracketed_data_cols
¶ List of columns that have data in the format NUMBER (PERCENTAGE).
- Type
list, optional
-
rename
¶ Dictionary in format {‘old_name’ : ‘new_name’ } for columns to be renamed.
- Type
dict, optional
-
std_data_
Standardised data. Contains a name column and a code column, plus the columns chosen to keep; their contents and column names may be updated based on the arguments provided.
- Type
pd.DataFrame
-
bracketed_data_cols
: list = None¶
-
clean_bracketed_data
()¶ For a df with columns in the format ‘NUMBER (PERCENTAGE)’ this function extracts the data into two new columns and deletes the original column.
Notes
The two new columns will be named the same as the original column, but with _count or _pct appended.
- Returns
DataFrame with each col replaced by two new columns with the count and percentage.
- Return type
pd.DataFrame
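The split described above can be sketched in plain Python. This is an illustration only: the real method works on a DataFrame, while the version below uses row dictionaries to stay dependency-free. The helper names, the sample column over_65, and the tolerance for thousands separators and an optional percent sign are all assumptions.

```python
import re

# Matches strings such as "1,300 (40%)": a count, then a bracketed percentage.
BRACKETED = re.compile(r"^\s*([\d,]+(?:\.\d+)?)\s*\(\s*([\d.]+)\s*%?\s*\)\s*$")


def split_bracketed(value):
    """Return (count, pct) parsed from 'NUMBER (PERCENTAGE)', or (None, None)."""
    match = BRACKETED.match(value)
    if match is None:
        return None, None
    count = float(match.group(1).replace(",", ""))
    pct = float(match.group(2))
    return count, pct


def clean_bracketed_rows(rows, bracketed_cols):
    """Replace each bracketed column with <col>_count and <col>_pct."""
    cleaned = []
    for row in rows:
        # Drop the original bracketed columns, keep everything else.
        new_row = {k: v for k, v in row.items() if k not in bracketed_cols}
        for col in bracketed_cols:
            count, pct = split_bracketed(row[col])
            new_row[f"{col}_count"] = count
            new_row[f"{col}_pct"] = pct
        cleaned.append(new_row)
    return cleaned


rows = [
    {"area": "A", "over_65": "120 (12.5)"},
    {"area": "B", "over_65": "1,300 (40%)"},
]
result = clean_bracketed_rows(rows, ["over_65"])
```

Values that do not fit the pattern come back as None in both new columns rather than raising, which is a design choice of this sketch and not documented behaviour.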
-
static
clean_keys
(df, res, key_col, key_is_code=True)¶ Ensures the df key column (i.e. the column used for joining) is correctly formatted for joins in the next steps. Accepts the key as a code or name, at LA or LSOA level. Will rename the key column if it does not have the standard name.
Notes
Renaming of the key columns depends on the DataResolution. For LSOA they will be LSOA11CD and LSOA11NM for code and name respectively. For LA they will be lad19cd and lad19nm respectively.
- Parameters
df (pd.DataFrame) – Dataframe to be cleaned
res (DataResolution) – Accepts ‘LA’ or ‘LSOA’ as resolution of the data
key_col (str) – Name of column containing the code or name
key_is_code (bool, optional) – If True, key_col is a code. If False, key_col is a name. By default True
- Returns
Returns dataframe with key_col stripped of whitespace, filtered to only Welsh areas and renamed for consistency.
- Return type
pd.DataFrame
- Raises
Exception – When the number of rows after merging on the new key column names cannot match the DataResolution.
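The strip, filter and rename steps can be sketched as follows. This is a simplified stand-in on row dictionaries, not the package's implementation: it assumes filtering is done by checking keys against a reference set (the real method merges against the geography tables), it omits the row-count check, and it uses the LA name spelling lad19nm given in the keep_cols description. The function name and sample codes are hypothetical.

```python
# Standard key names quoted in this reference, keyed by (resolution, key_is_code).
STANDARD_KEYS = {
    ("LSOA", True): "LSOA11CD",
    ("LSOA", False): "LSOA11NM",
    ("LA", True): "lad19cd",
    ("LA", False): "lad19nm",
}


def clean_key_rows(rows, res, key_col, reference_keys, key_is_code=True):
    """Strip, filter and rename the key column of a list of row dicts."""
    standard_name = STANDARD_KEYS[(res, key_is_code)]
    cleaned = []
    for row in rows:
        key = str(row[key_col]).strip()
        if key not in reference_keys:
            continue  # not present in the reference table, so dropped
        new_row = {k: v for k, v in row.items() if k != key_col}
        new_row[standard_name] = key
        cleaned.append(new_row)
    return cleaned


reference = {"W06000001", "W06000002"}  # hypothetical reference codes
rows = [
    {"code": " W06000001 ", "value": 10},
    {"code": "E06000001", "value": 20},  # absent from the reference set
]
cleaned = clean_key_rows(rows, "LA", "code", reference)
```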
-
csv_name
: str = None¶
-
csv_path
()¶ Generates a name to output to csv.
- Returns
csv_name prepended with the data resolution and appended with .csv
- Return type
str
-
data
: pd.DataFrame = None¶
-
property
is_standardised
¶ Returns bool of whether the standardised data has been generated.
-
keep_cols
: list = None¶
-
key_col
: str = None¶
-
key_is_code
: bool = None¶
-
read_keys
()¶ Reads and returns the LSOA and LA geopandas dataframes as constants ‘LSOA’, ‘LA’.
-
rename
: dict = None¶
-
res
: DataResolution = None¶
-
standardise
()¶ Based on attributes, applies the correct functions to standardise the datasets.
The function will set self.std_data_
- Returns
This function will return the class instance.
- Return type
Dataset
-
standardise_keys
()¶ Given the dataframe and chosen columns, uses the LA or LSOA geopandas dataframes to create standardised columns for area codes and names.
- Returns
self.data with standardised key codes and names. If keep_cols is used then only those columns will be returned.
- Return type
pd.DataFrame
- Raises
Exception – When the number of rows generated does not match the DataResolution
ValueError – When the DataResolution is not LA or LSOA
-
std_data_
: pandas.core.frame.DataFrame = None¶
-
write
()¶ Writes the standardised data to csv in the cleaned folder, using naming convention of resolution_name.csv. If the same file already exists it will not write.
- Raises
ValueError – If the data has not already been standardised.
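The naming convention and the write-once behaviour can be sketched with the standard library. The function names, the underscore separator taken from "resolution_name.csv", the use of the short resolution code in the file name, and the temporary folder standing in for the cleaned folder are all assumptions made for illustration.

```python
import csv
import tempfile
from pathlib import Path


def csv_file_name(res_code, csv_name):
    """Build '<resolution>_<name>.csv', e.g. 'LA_population.csv'."""
    return f"{res_code}_{csv_name}.csv"


def write_once(rows, folder, res_code, csv_name):
    """Write rows to csv unless the file exists; return True if written."""
    if rows is None:
        raise ValueError("Data must be standardised before writing.")
    path = Path(folder) / csv_file_name(res_code, csv_name)
    if path.exists():
        return False  # an existing file is left untouched
    with path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return True


with tempfile.TemporaryDirectory() as folder:
    rows = [{"lad19cd": "W06000001", "value": 1}]
    first = write_once(rows, folder, "LA", "population")
    second = write_once(rows, folder, "LA", "population")
    written = sorted(p.name for p in Path(folder).iterdir())
```

The second call returns False because the file from the first call is already present.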
-
-
class
backend.datasets.dataset.
MasterDataset
(datasets: List[backend.datasets.dataset.Dataset], res: backend.datasets.dataset.DataResolution, freq: backend.datasets.dataset.DataFrequency, from_csv: bool = True)¶ Bases:
object
Used to call or generate the merged ‘master’ dataset used to write to json. Can be used to generate the ‘live’ or ‘static’ master datasets. Will write out to csv if it does not already exist, or if user chooses ‘from_csv’ to be False.
Notes
When the master dataset is created, certain data transformations are assumed, which create variables from expected columns in LSOA/LA STATIC or LIVE datasets. These are detailed in the _create_master_dataset method.
-
res
¶ The DataResolution of the data. Accepts LA or LSOA.
- Type
DataResolution
-
freq
¶ The DataFrequency of the data. Accepts STATIC or LIVE.
- Type
DataFrequency
-
from_csv
¶ Whether the master dataset should be built from previously generated csv.
- Type
bool
-
master_dataset_
The final merged dataset.
- Type
pd.DataFrame
-
datasets
: List[Dataset] = None¶
-
property
file_path
¶ Returns str filepath to write csv to, based on freq and res
-
freq
: DataFrequency = None¶
-
from_csv
: bool = True¶
-
property
master_dataset
¶ Returns the master dataset, and generates it if it does not exist.
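A property that generates its value on first access might look like the sketch below. LazyMaster and its build callable are hypothetical stand-ins; only the attribute name master_dataset_ and the property name master_dataset come from this reference, and the from_csv branch is left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class LazyMaster:
    """Hypothetical stand-in showing generate-on-first-access."""

    build: Callable[[], List[dict]]
    master_dataset_: Optional[List[dict]] = None

    @property
    def master_dataset(self):
        # Generate the dataset only if it has not been generated yet.
        if self.master_dataset_ is None:
            self.master_dataset_ = self.build()
        return self.master_dataset_


build_calls = []


def build():
    """Record each call so repeated generation would be visible."""
    build_calls.append(1)
    return [{"lad19cd": "W06000001", "value": 1}]


master = LazyMaster(build=build)
first = master.master_dataset
second = master.master_dataset
```

Both accesses return the same object, and build runs once.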
-
master_dataset_
: pandas.core.frame.DataFrame = None¶
-
res
: DataResolution = None¶
-
static
write
(data, filepath)¶ Writes data to csv on the given filepath.
-
live
Handles live data from reading to master dataset generation.
This module reads, standardises and defines the master dataset for data sources from the data/live folder.
Notes
This module imports classes from the dataset module. The LA_LIVE master dataset definition has from_csv=False. This means that the live dataset master will always be regenerated from the source files given here, rather than read from the existing master csv.
To add a new live datasource, follow the existing examples: define a SOURCE_ constant that is read from file, pass it to a Dataset definition, and then add that Dataset to the LA_LIVE datasets list to ensure it is included.
static
Module that generates the cleaned datasets for each variable from raw.
This module standardises and defines the master dataset for data sources from the data/static folder.
Notes
This module imports classes from the dataset module, and source dataset constants from the static_source_datasets module.
To add a new datasource, follow the existing examples for a SOURCE_ constant that is passed to a Dataset definition, and then add the Dataset constant to the MasterDataset datasets list to ensure it is included.
The LSOA_STATIC and LA_STATIC master datasets have from_csv=True. This means that the master datasets will always be read from a previously generated master csv, rather than regenerated from source files. If new sources are added, this will need to be run at least once with from_csv=False to integrate new sources into a new master csv.
static_source_datasets
Handles reading of datasets from static source data folder.
generate_gp_online
Generates gp_online.csv
Functions contained in this module take data of patients registered with My Health Online from local data folders. The gp_to_area function matches each GP practice to a postcode, and then matches that postcode to a Local Authority, Lower Super Output Area and Middle Super Output Area.
The mhol_to_pct function then calculates the overall percentage of patients registered online at the area level chosen.
To run, these functions require local data that is not published in the repository.
-
backend.datasets.generate_gp_online.
gp_to_area
(gp_data, postcode_lookup, gp_lookup)¶ Using data and lookup tables, generates a new dataframe with the LA and LSOA matched to each GP practice code.
- Parameters
gp_data (pd.DataFrame) – Data on number of patients per GP practice. Should contain cols GP Practice Code, MHOL patient count, Patients (2019).
postcode_lookup (pd.DataFrame) – A df with a column of postcodes (pscds) and columns matching each postcode to an LSOA, LA and MSOA area.
gp_lookup (pd.DataFrame) – A df with a column of GP practice IDs (practice_ID) and postcodes (postcode).
- Returns
A df where each row is a GP practice matched to a postcode, LA, LSOA and MSOA, along with associated data columns from gp_data.
- Return type
pd.DataFrame
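The two-step match (practice to postcode, then postcode to areas) can be sketched with dictionary lookups. This is not the package's code: the function name, the area column names lsoa, la and msoa, and all sample values are hypothetical, while practice_ID, postcode, pscds and GP Practice Code are the column names given above. Unmatched practices are dropped here, which may differ from the real behaviour.

```python
def match_practices_to_areas(gp_data, postcode_lookup, gp_lookup):
    """Attach area columns to each GP practice row via its postcode."""
    postcode_by_practice = {r["practice_ID"]: r["postcode"] for r in gp_lookup}
    areas_by_postcode = {r["pscds"]: r for r in postcode_lookup}
    matched = []
    for row in gp_data:
        postcode = postcode_by_practice.get(row["GP Practice Code"])
        areas = areas_by_postcode.get(postcode)
        if areas is None:
            continue  # no postcode or no area match for this practice
        matched.append({**row, **areas})
    return matched


# All values below are made up for illustration.
gp_data = [
    {"GP Practice Code": "P1", "MHOL patient count": 200, "Patients (2019)": 1000},
    {"GP Practice Code": "P2", "MHOL patient count": 50, "Patients (2019)": 500},
]
gp_lookup = [{"practice_ID": "P1", "postcode": "AB1 2CD"}]
postcode_lookup = [
    {"pscds": "AB1 2CD", "lsoa": "LSOA-1", "la": "LA-1", "msoa": "MSOA-1"},
]
matched = match_practices_to_areas(gp_data, postcode_lookup, gp_lookup)
```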
-
backend.datasets.generate_gp_online.
mhol_to_pct
(df, res: backend.datasets.dataset.DataResolution)¶ For a dataframe with rows of each GP practice mapped to LA or LSOA codes and names, sums “patients_total” and “MHOL_true” across each area, and creates a new col with the total percentage for each area.
- Parameters
df (pd.DataFrame) – Each row is a unique GP practice with an area code and name, and patients_total and MHOL_true columns.
res (DataResolution, optional) – The area resolution of the data, specified using the DataResolution class that is imported from the dataset module. By default DataResolution.LA
- Returns
Returns df with rows as each area with sum of “patients_total”, “MHOL_true” and new col “MHOL_pct” which is the percentage of MHOL_true out of patients_total.
- Return type
pd.DataFrame
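The aggregation can be sketched in plain Python. The function name and sample rows are hypothetical, and expressing MHOL_pct on a 0 to 100 scale is an assumption; the source only states that it is the percentage of MHOL_true out of patients_total.

```python
from collections import defaultdict


def mhol_pct_by_area(rows, area_col):
    """Sum patients_total and MHOL_true per area and add MHOL_pct."""
    totals = defaultdict(lambda: {"patients_total": 0, "MHOL_true": 0})
    for row in rows:
        totals[row[area_col]]["patients_total"] += row["patients_total"]
        totals[row[area_col]]["MHOL_true"] += row["MHOL_true"]
    return [
        {
            area_col: area,
            **sums,
            # Assumed 0-100 scale for the percentage.
            "MHOL_pct": 100 * sums["MHOL_true"] / sums["patients_total"],
        }
        for area, sums in totals.items()
    ]


# Made-up practices: two in area "A", one in area "B".
rows = [
    {"lad19cd": "A", "patients_total": 1000, "MHOL_true": 200},
    {"lad19cd": "A", "patients_total": 3000, "MHOL_true": 800},
    {"lad19cd": "B", "patients_total": 500, "MHOL_true": 50},
]
result = mhol_pct_by_area(rows, "lad19cd")
```

The percentage is computed from the summed counts, not by averaging per-practice percentages, so larger practices carry proportionally more weight.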