Better understanding COVID-19 through data mining

The Project

In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19).

What Is The Idea?

The idea is to create a language model trained on the body text of the research papers available in this dataset. The resulting language model can then be used downstream, for example to build a classifier.

Importing the necessary libraries

I am using fastai to create the language model, so fastai.text is imported. The json module is used to parse the JSON data, and pandas is used later to build a dataframe of the parsed papers.

from fastai.text import *
from copy import deepcopy   # used later in format_bib
import numpy as np
import pandas as pd          # used later to build the dataframe
import json

Let’s Set All The Paths

Let’s set all the paths required to work on our data.

  • outputPath is where we will save our CSV files, models, etc.
  • datasetPath is the directory containing the dataset.
path = Path('/kaggle/input')
outputPath = Path('/kaggle/working')
datasetPath = path/'CORD-19-research-challenge/'
  • biorxivMedrxiv is the path to the directory which contains the JSON files for the research papers sourced from bioRxiv/medRxiv.
  • pmcCustomLicense is the path to the directory for the custom-license (PMC) papers.
pmcCustomLicense = datasetPath/'custom_license/custom_license'
biorxivMedrxiv = datasetPath/'biorxiv_medrxiv/biorxiv_medrxiv'

How To Parse The Data From JSON?

The data from the research papers is stored as JSON in this dataset. We need to parse the JSON files before we can use the data.

  • The load_files function below goes through the file paths passed to it and appends each one to a new list fileList. This way the contents of the original lists (i.e. the file paths) become part of one bigger list, fileList. An example usage is shown below the function definition.

def load_files(fileNames: tuple):
    fileList = []
    for file in fileNames:
        fileList.append(file)

    return fileList

Example usage:

filePathList1 = ['path1', 'path2', 'path3']
filePathList2 = ['path4', 'path5', 'path6']
load_files(filePathList1 + filePathList2)
# Output -> ['path1', 'path2', 'path3', 'path4', 'path5', 'path6']

files = load_files(biorxivMedrxiv.iterdir())
  • The filePath function iterates through the list files and converts each item, i.e. each Path object, into string format.
  • Then it puts those string paths into the filePath list and returns it.
def filePath(files):
    filePath = []
    
    for file in files:
        filePath.append(str(file))
        
    return filePath
filePaths = filePath(files)
  • getRawFiles goes through each file path from the list filePaths and uses the json.load() method to read the contents.
  • Finally, it appends each JSON object to the rawFiles list.
def getRawFiles(files: list):
    rawFiles = []

    for fileName in files:
        with open(fileName, 'rb') as f:
            rawFile = json.load(f)
        rawFiles.append(rawFile)

    return rawFiles
rawFiles = getRawFiles(filePaths)
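For reference, each parsed file is a nested dictionary. Below is a rough sketch of the fields that the helper functions further down rely on; the values are made up for illustration and are not taken from the dataset.

# Illustrative structure only; real CORD-19 files contain more fields than shown here
sampleRawFile = {
    'paper_id': 'abc123',
    'metadata': {
        'title': 'An example title',
        'authors': [
            {'first': 'Jane', 'middle': ['Q'], 'last': 'Doe',
             'affiliation': {'institution': 'Example University',
                             'location': {'settlement': 'Boston', 'country': 'USA'}}}
        ]
    },
    'abstract': [{'section': 'Abstract', 'text': 'Example abstract text.'}],
    'body_text': [{'section': 'Introduction', 'text': 'Example body text.'}],
    'bib_entries': {'BIBREF0': {'title': 'A cited paper', 'authors': [],
                                'venue': 'Example Journal', 'year': 2019}}
}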
  • format_name joins the name parts in the sequence <first name> <middle name> <last name>.
  • The function also takes care of the case where there is no middle name; then it simply returns <first name> <last name>.
  • format_affiliation: if location details are present in the JSON, they are put into a list and returned.
  • If an institution is present, the institution and location details are joined together.
  • format_authors joins each author’s name with their affiliation, if affiliations are requested and available.
  • format_body extracts the text of each section and appends it to the body string.
  • format_bib joins the ‘title’, ‘authors’, ‘venue’ and ‘year’ of each bibliography entry together to form a string.
def format_name(author):
    middle_name = " ".join(author['middle'])

    if author['middle']:
        return " ".join([author['first'], middle_name, author['last']])
    else:
        return " ".join([author['first'], author['last']])

def format_affiliation(affiliation):
    text = []
    location = affiliation.get('location')
    if location:
        text.extend(list(affiliation['location'].values()))

    institution = affiliation.get('institution')
    if institution:
        text = [institution] + text
    return ", ".join(text)

def format_authors(authors, with_affiliation=False):
    name_ls = []

    for author in authors:
        name = format_name(author)
        if with_affiliation:
            affiliation = format_affiliation(author['affiliation'])
            if affiliation:
                name_ls.append(f"{name} ({affiliation})")
            else:
                name_ls.append(name)
        else:
            name_ls.append(name)

    return ", ".join(name_ls)

def format_body(body_text):
    texts = [(di['section'], di['text']) for di in body_text]
    texts_di = {di['section']: "" for di in body_text}

    for section, text in texts:
        texts_di[section] += text

    body = ""

    for section, text in texts_di.items():
        body += section
        body += "\n\n"
        body += text
        body += "\n\n"

    return body

def format_bib(bibs):
    if type(bibs) == dict:
        bibs = list(bibs.values())
    bibs = deepcopy(bibs)
    formatted = []

    for bib in bibs:
        bib['authors'] = format_authors(
            bib['authors'],
            with_affiliation=False
        )
        formatted_ls = [str(bib[k]) for k in ['title', 'authors', 'venue', 'year']]
        formatted.append(", ".join(formatted_ls))

    return "; ".join(formatted)
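As a quick sanity check, here is how these helpers behave on a hypothetical author entry (the entry itself is made up, not taken from the dataset):

# Hypothetical author entry shaped like the CORD-19 metadata these helpers expect
author = {'first': 'Jane', 'middle': ['Q'], 'last': 'Doe',
          'affiliation': {'institution': 'Example University',
                          'location': {'settlement': 'Boston', 'country': 'USA'}}}

print(format_name(author))
# Jane Q Doe
print(format_authors([author], with_affiliation=True))
# Jane Q Doe (Example University, Boston, USA)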
  • The resulting dataframe has the following columns: ‘paper_id’, ‘title’, ‘authors’, ‘affiliations’, ‘abstract’, ‘text’, ‘bibliography’, ‘raw_authors’, ‘raw_bibliography’.
def generate_clean_df(all_files):
    cleaned_files = []

    for file in all_files:
        features = [
            file['paper_id'],
            file['metadata']['title'],
            format_authors(file['metadata']['authors']),
            format_authors(file['metadata']['authors'],
                           with_affiliation=True),
            format_body(file['abstract']),
            format_body(file['body_text']),
            format_bib(file['bib_entries']),
            file['metadata']['authors'],
            file['bib_entries']
        ]

        cleaned_files.append(features)

    col_names = ['paper_id', 'title', 'authors',
                 'affiliations', 'abstract', 'text',
                 'bibliography', 'raw_authors', 'raw_bibliography']

    clean_df = pd.DataFrame(cleaned_files, columns=col_names)
    clean_df.head()

    return clean_df

pd.set_option('display.max_columns', None)
cleanDf = generate_clean_df(rawFiles)
cleanDf.head()


cleanDf.to_csv(outputPath/'cleandf.csv')

The Data

We then create a databunch from the CSV file that we created from the JSON files.

  • The data in the databunch is fetched from the ‘text’ column of the CSV file.
  • 10% of the data is reserved as validation data.
  • The language model that we will create down the line is self-supervised, so it takes its labels from the data itself.
  • label_for_lm() handles this labelling for us.
  • After the databunch is created, it is saved to a pickle file.
  • The next time this code runs we don’t need to recreate the databunch; we can simply load it from the pickle file.
def createDataBunchForLanguageModel(outputPath: Path,
                                    csvName: str,
                                    textCol: str,
                                    pickleFileName: str,
                                    splitBy: float,
                                    batchSize: int):
    data_lm = TextList.from_csv(outputPath,
                                f'{csvName}.csv',
                                cols=textCol)\
                  .split_by_rand_pct(splitBy)\
                  .label_for_lm()\
                  .databunch(bs=batchSize)

    data_lm.save(f'{pickleFileName}.pkl')
createDataBunchForLanguageModel(outputPath,
                                'cleandf',
                                'text',
                                'cleanDf',
                                0.1,
                                48)
def loadData(outputPath: Path,
             databunchFileName: str,
             batchSize: int,
             showBatch: bool= False):
    
    data_lm = load_data(outputPath,
                       f'{databunchFileName}.pkl',
                       bs=batchSize)
    
    if showBatch:
        data_lm.show_batch()
        
    return data_lm

Building The Learner

We use ULMFiT to create a language model on the research-paper corpus and then fine-tune it.

learner = language_model_learner(loadData(outputPath,
                                          'cleanDf',
                                          48,
                                          showBatch=False),
                                 AWD_LSTM,
                                 drop_mult=0.3)
def plotLearningRate(learner, skip_end):
    learner.lr_find()
    learner.recorder.plot(skip_end=skip_end)
plotLearningRate(learner, 15)
learner.fit_one_cycle(1, 1e-02, moms=(0.8, 0.7))


learner.unfreeze()
learner.fit_one_cycle(10, 1e-3, moms=(0.8, 0.7))
learner.save('fineTuned')

Testing

Let’s see if the model has connected the information and knowledge from the corpus. We first load the saved model and then generate predictions from a seed phrase.

learner = learner.load('fineTuned')
TEXT = "Range of incubation periods for the disease in humans"
N_WORDS = 40
N_SENTENCES = 2
print("\n".join(learner.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))


def save(learner,
         saveEncoder: bool = True):
    
    if saveEncoder:
        learner.save_encoder('fine_tuned_encoder')
        
    learner.save('fine_tuned_model')
save(learner)
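The encoder saved above is what the classifier mentioned at the beginning would build on. As a rough sketch (not part of this kernel), assuming we had a labelled dataframe labelledDf with ‘text’ and ‘label’ columns, the fastai v1 workflow might look like this:

# Sketch only: labelledDf and its 'text'/'label' columns are assumptions, not part of this kernel
data_lm = loadData(outputPath, 'cleanDf', 48)

data_clas = TextList.from_df(labelledDf, outputPath, cols='text',
                             vocab=data_lm.vocab)\
                .split_by_rand_pct(0.1)\
                .label_from_df(cols='label')\
                .databunch(bs=48)

classifier = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
classifier.load_encoder('fine_tuned_encoder')  # reuse the fine-tuned language model encoder
classifier.fit_one_cycle(1, 1e-2)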

End Notes

I hope that this kernel and this language model will be helpful to anyone else working on this dataset or any other COVID-19 related data.
