class documentation

class Preprocessor(object):

Base class for preprocessing.

Static Method check_file_exists Check whether all files in file_list exist.
Static Method deduplicate Find duplicate files under a subdirectory.
Static Method load_results Load the results from the given files.
Method __init__ No summary
Method generate_file_lists Generate the file lists.
Method parse_gnd Parse a single gnd(xml) file.
Method save_to_outdir Save the files to the output directory.
Method save_to_pkl Save all the info and data to the pkl file (using pickle).
Instance Variable age_logger Undocumented
Instance Variable duplicates Undocumented
Instance Variable gender_logger Undocumented
Instance Variable logger Undocumented
Instance Variable name_logger Undocumented
Instance Variable output_dir Undocumented
Instance Variable root Undocumented
Instance Variable total_files Undocumented
Instance Variable unique_files Undocumented
Method _is_good_sample Check whether the sample is good.
Method _keep_latest_audiograms Keep the latest audiograms.
Method _parse_age Parse the age of the person.
Method _parse_create_time Parse the time at which this examination was done.
Method _parse_data Parse the data of the person.
Method _parse_gender Parse the gender of the person.
Method _parse_name Parse the name of the person.
@staticmethod
def check_file_exists(file_list):

Check whether all files in file_list exist.

Parameters
file_list (List[str]): paths of the files
Returns
bool: True if all files exist, False otherwise.
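
A minimal sketch of how such a check might be written, assuming every entry is expected to be a regular file. The body is illustrative and is not taken from the class itself.

```python
import os
from typing import List


def check_file_exists(file_list: List[str]) -> bool:
    """Return True only if every path in file_list is an existing file."""
    # all() stops at the first missing path; an empty list yields True.
    return all(os.path.isfile(path) for path in file_list)
```

Note that under this assumption an empty file_list is reported as "all files exist".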
@staticmethod
def deduplicate(directory):

Find duplicate files under a subdirectory.

Parameters
directory (str): directory name
Returns
Tuple[List, List, List]: Undocumented
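
The return value is undocumented, so the following is only a sketch under two assumptions: that "duplicate" means identical file content, and that the three lists are total, duplicate, and unique files (mirroring generate_file_lists). The name find_duplicates is hypothetical.

```python
import hashlib
import os
from typing import List, Tuple


def find_duplicates(directory: str) -> Tuple[List[str], List[str], List[str]]:
    """Hypothetical helper: split files under `directory` by content hash."""
    total, duplicates, unique = [], [], []
    seen = set()
    for dirpath, dirnames, filenames in os.walk(directory):
        # Sort in place so the traversal order (and the "first" copy) is stable.
        dirnames.sort()
        for filename in sorted(filenames):
            path = os.path.join(dirpath, filename)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            total.append(path)
            if digest in seen:
                # Assumption: a file is a duplicate if its content was seen before.
                duplicates.append(path)
            else:
                seen.add(digest)
                unique.append(path)
    return total, duplicates, unique
```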
@staticmethod
def load_results(file_paths):

Load the results from the given files.

Parameters
file_paths (list): paths of the files
Returns
List[List[str]]: lists of the files
def __init__(self, root, output_dir='dataset/processed'):
Parameters
root (str): root folder
output_dir (str, optional): directory of the outputs. Defaults to "dataset/processed".
def generate_file_lists(self):

Generate the file lists.

Generate the lists of total, duplicate, and unique files and save the results under the root folder.

Returns
Tuple[List, List, List]: total, duplicate, and unique file lists
def parse_gnd(self, path):

Parse a single gnd(xml) file.

We use a few _parse_xxx() functions to extract the following information:
1. name
2. gender
3. age
4. create_time: the time at which this examination was done
5. data: the report data
6. good_flag: True if this sample is a good sample

Parameters
path (str): path of the gnd(xml) file
Returns
Tuple[str, str, float, datetime, dict, bool]: name, gender, age, create_time, data, good_flag
def save_to_outdir(self, file_list, output_dir, output_name='filepath_mapping.json'):

Save the files to the output directory.

The filename format is "{create_time}-{name}.gnd".

Parameters
file_list (List[str]): list of files.
output_dir (str): directory of the output.
output_name (str): filename of the output. Defaults to "filepath_mapping.json".
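
As an illustration of the filename format, the sketch below builds one output name. The helper build_output_name and the timestamp layout are assumptions; the documentation only fixes the overall "{create_time}-{name}.gnd" shape.

```python
from datetime import datetime


def build_output_name(create_time: datetime, name: str) -> str:
    """Hypothetical helper: format one "{create_time}-{name}.gnd" filename."""
    # Assumption: a compact, sortable timestamp; the real layout is not documented.
    stamp = create_time.strftime("%Y%m%d%H%M%S")
    return f"{stamp}-{name}.gnd"


print(build_output_name(datetime(2020, 1, 2, 3, 4, 5), "alice"))
# prints: 20200102030405-alice.gnd
```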
def save_to_pkl(self, mapping_path, output_file):

Save all the info and data to the pkl file (using pickle).

Parameters
mapping_path (str): path of the filepath mapping.
output_file (str): path of the output pkl file.
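
A minimal sketch of writing and reading such a file with the standard pickle module. The function names and the record layout are hypothetical; the structure actually stored by save_to_pkl is not documented here.

```python
import pickle
from typing import Any, Dict


def dump_records(records: Dict[str, Any], output_file: str) -> None:
    """Hypothetical helper: serialize parsed records to a pkl file."""
    with open(output_file, "wb") as handle:
        pickle.dump(records, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_records(output_file: str) -> Dict[str, Any]:
    """Hypothetical helper: read the records back."""
    # Only unpickle files from a trusted source; pickle can execute code on load.
    with open(output_file, "rb") as handle:
        return pickle.load(handle)
```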
age_logger: Undocumented
duplicates: Undocumented
gender_logger: Undocumented
logger: Undocumented
name_logger: Undocumented
output_dir: Undocumented
root: Undocumented
total_files: Undocumented
unique_files: Undocumented

def _is_good_sample(self, flag_list):

Check whether the sample is good.

Currently, a sample is considered good only if it has a valid name, time, and data.

Parameters
flag_list (List[bool]): match flags
Returns
bool: True if this sample is good
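
The rule above can be sketched as follows. Because the order of the entries in flag_list is not documented, this sketch uses a dictionary keyed by field name instead of a list; that layout and the function name are assumptions.

```python
from typing import Dict


def is_good_sample(flags: Dict[str, bool]) -> bool:
    """Hypothetical variant: good only if name, time, and data all matched."""
    required = ("name", "create_time", "data")
    # Gender and age flags are deliberately ignored, following the stated rule.
    return all(flags.get(key, False) for key in required)
```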
def _keep_latest_audiograms(self, examinations):

Keep the latest audiograms.

A single gnd file may contain several audiogram examinations for the same patient (because of misoperation, system error, etc.). We only keep the audiograms of the latest examination.

Parameters
examinations (list): audiograms of all examinations
Returns
list: audiograms of the latest examination
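
One way to select the latest examination, sketched under the assumption that each examination is a dictionary with a "time" (datetime) and an "audiograms" (list) entry. Both keys and the function name are hypothetical.

```python
from datetime import datetime
from typing import Any, Dict, List


def keep_latest(examinations: List[Dict[str, Any]]) -> List[Any]:
    """Hypothetical helper: return the audiograms of the newest examination."""
    if not examinations:
        return []
    # max() with a key picks the examination with the most recent timestamp.
    latest = max(examinations, key=lambda exam: exam["time"])
    return latest["audiograms"]
```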
def _parse_age(self, path):

Parse the age of the person.

Parameters
path (str): path of the gnd(xml) file
Returns
float: person's age
match_flag (bool): whether the age is matched
def _parse_create_time(self, soup):

Parse the time at which this examination was done.

Parameters
soup (BeautifulSoup): file content in BeautifulSoup format
Returns
datetime: examination time
match_flag (bool): whether the time is matched
def _parse_data(self, soup):

Parse the data of the person.

Parameters
soup (BeautifulSoup): file content in BeautifulSoup format
Returns
dict: person's data
match_flag (bool): whether the data is matched
def _parse_gender(self, path, soup):

Parse the gender of the person.

Parameters
path (str): path of the gnd(xml) file
soup (BeautifulSoup): file content in BeautifulSoup format
Returns
str: person's gender, one of ["male", "female"]
match_flag (bool): whether the gender is matched
def _parse_name(self, path, soup):

Parse the name of the person.

Parameters
path (str): path of the gnd(xml) file
soup (BeautifulSoup): file content in BeautifulSoup format
Returns
str: person's name
match_flag (bool): whether the name is matched