Data Scientist: PubChem API Implementation

The sections that explain how Mandy – the Intelligent Companion works are listed below in the order in which the application runs them, starting when it receives a question or comment from the user:

What is Mandy – the Intelligent Companion?
Wit.ai Technology Implementation
PubChem API Implementation
Youtube Audio Transcription Implementation

PubChem is a public database that contains around 13 million registered chemical compounds, which can be consulted through different API endpoints. Mandy uses the chemical's name and the identified intention to search for information that helps users know what to do with chemical compounds or with products that contain them.

The main Python libraries used in the PubChem API implementation are the following:

• re
• requests
• json
• pubchempy
• smtplib, ssl

Of these, re, json, smtplib and ssl are part of the Python standard library and need no installation. requests and pubchempy are third-party packages that you can install with pip install [library's name] in your terminal.

This section relies on three functions, each of which returns information related to the intention that wit.ai identified in the user's message.

Wit.ai Identified Intention    Function Integrated
Chemical compatibility         • handling_store
Storage                        • toxicity
                               • ghs_classification
Information                    • ghs_classification
                               • handling_store
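
The intention-to-function mapping above can be expressed as a small lookup. The sketch below is only an illustration: the intent keys (chemical_compatibility, storage, information) and the helper functions_for_intent are hypothetical names, not the identifiers that Mandy or wit.ai actually use.

```python
# Hypothetical intent names mapped to the extraction functions described in this section.
INTENT_FUNCTIONS = {
    "chemical_compatibility": ["handling_store"],
    "storage": ["toxicity", "ghs_classification"],
    "information": ["ghs_classification", "handling_store"],
}


def functions_for_intent(intent):
    """Return the names of the functions to run for an intent (empty list if unknown)."""
    return INTENT_FUNCTIONS.get(intent, [])


print(functions_for_intent("storage"))
```

Keeping the mapping in one dictionary means that a new intention only needs a new entry rather than another branch in the code.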

Chemical Compatibility Intention

Products are sometimes mixed, stirred, combined or otherwise put into contact, and the interaction between two chemicals, or between portions of the same chemical, may produce a third chemical, byproduct or waste that is harmful or undesired. Users should therefore know in advance whether a chemical is likely to produce such a product. For example, you may have baking soda and detergents at home that could cause a problem when they are mixed or exposed to certain environments.

Once wit.ai has identified that the user is looking for information about the chemical compatibility of a compound, Mandy uses a PubChem API endpoint to look up the compound and the most relevant information related to its compatibility. The endpoint returns the data in JSON format, so besides compatibility data you also receive data on structure, chemical and physical properties, safety procedures, compound classification and much more. There are more than twenty categories from which you can learn about a specific chemical compound.

The API endpoint that Mandy queries is this:

API_ENDPOINT='https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/'+str(cid)+'/JSON'

It is recommended that you use an online JSON viewer so that you become well acquainted with the structure and sections that you can extract and use in your own application. We suggest this online JSON viewer XXX or any other, more sophisticated one.
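
As an alternative to an online viewer, the response can be inspected locally. The following is a minimal sketch, assuming a plain GET request is sufficient for this endpoint. The helper names build_endpoint, fetch_record and preview, and the 30-second timeout, are illustrative choices and not part of Mandy's code.

```python
import json

PUG_VIEW_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/"


def build_endpoint(cid):
    """Build the PUG View URL for a PubChem compound ID."""
    return PUG_VIEW_BASE + str(cid) + "/JSON"


def fetch_record(cid, timeout=30):
    """Download the full record for a compound (needs network access)."""
    import requests  # imported here so the other helpers work without the dependency

    resp = requests.get(build_endpoint(cid), timeout=timeout)
    resp.raise_for_status()  # fail loudly on HTTP errors such as an unknown CID
    return resp.json()


def preview(record, max_chars=500):
    """Pretty-print the beginning of a record so its nesting is easy to read."""
    return json.dumps(record, indent=2)[:max_chars]


# Example (requires network access and a valid cid):
# print(preview(fetch_record(cid)))
```

Truncating the pretty-printed text keeps the output readable, since a full record can run to thousands of lines.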

An example of the json structure that you will be working with is presented below:

Notice that you can extract data from any level of nesting with the json_extract function and its extract subfunction. These functions parse the API endpoint response and return only the entries that match the requested value.
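
To make the nesting concrete, the sketch below runs a json_extract helper on a small hand-written sample. The sample only imitates the shape of a PUG View response (nested Section lists with TOCHeading and Information entries). Its text values are placeholders, not real PubChem data, and the real response contains many more fields.

```python
# Abbreviated, hand-written sample shaped like a PUG View response.
# The strings are placeholders, not data copied from PubChem.
sample = {
    "Record": {
        "RecordType": "CID",
        "RecordNumber": 0,
        "RecordTitle": "Example compound",
        "Section": [
            {
                "TOCHeading": "Safety and Hazards",
                "Section": [
                    {
                        "TOCHeading": "Handling and Storage",
                        "Section": [
                            {
                                "TOCHeading": "Storage Conditions",
                                "Information": [
                                    {"Value": {"StringWithMarkup": [
                                        {"String": "Placeholder storage text."}
                                    ]}}
                                ],
                            }
                        ],
                    }
                ],
            }
        ],
    }
}


def json_extract(obj, key):
    """Collect every dict that holds `key` as one of its direct scalar values."""
    found = []

    def extract(node):
        if isinstance(node, dict):
            for value in node.values():
                if isinstance(value, (dict, list)):
                    extract(value)  # descend into nested containers
                elif value == key:
                    found.append(node)  # keep the whole dict, not just the value
        elif isinstance(node, list):
            for item in node:
                extract(item)

    extract(obj)
    return found


sections = json_extract(sample, "Handling and Storage")
text = sections[0]["Section"][0]["Information"][0]["Value"]["StringWithMarkup"][0]["String"]
print(text)
```

Because the helper returns the whole matching dictionary, the caller can then walk down to the subsection and text it needs, as the last lines do.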

Handling_store Function

handling_store is the function that extracts from PubChem the data related to the handling procedures and guidelines recommended when manipulating chemicals. It requires two parameters: chemical_compound, which is the chemical's name, and cid, which is the ID that PubChem uses to identify a specific chemical compound. For example, benzene is registered with the ID 241, so your API endpoint should be structured in this way:

API_ENDPOINT='https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/'+str(241)+'/JSON'
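
If only the chemical's name is known, the cid can be looked up first. The sketch below assumes the pubchempy function get_cids is used for that lookup. The helper resolve_cid and its lookup parameter are hypothetical and are not taken from Mandy's code. The lookup parameter exists only so that the function can be exercised without network access.

```python
def resolve_cid(chemical_compound, lookup=None):
    """Return the first PubChem CID for a compound name, or None if nothing matches.

    `lookup` is any callable taking (name, namespace) and returning a list of CIDs.
    It defaults to pubchempy.get_cids, which needs network access.
    """
    if lookup is None:
        import pubchempy  # imported lazily so the function can be tested offline

        lookup = pubchempy.get_cids
    cids = lookup(chemical_compound, "name")
    return cids[0] if cids else None


# Example (requires network access and the pubchempy package):
# cid = resolve_cid("benzene")
```

Returning None for an unknown name lets the caller decide how to answer the user instead of failing with an exception.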

Here is the code for this function. It contains the json_extract and extract subfunctions explained above.

import json
import requests

def handling_store(chemical_compound, cid):
    # Build the PUG View URL for the compound ID
    API_ENDPOINT = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/' + str(cid) + '/JSON'
    # The data is public, so a plain GET without an authorization header is enough
    resp = requests.get(API_ENDPOINT)
    textt = json.loads(resp.content)

    def json_extract(obj, key):
        # Collect every dict that has `key` as one of its direct values
        arr = []
        def extract(obj, arr, key):
            if isinstance(obj, dict):
                for k, v in obj.items():
                    if isinstance(v, (dict, list)):
                        extract(v, arr, key)
                    elif v == key:
                        arr.append(obj)
            elif isinstance(obj, list):
                for item in obj:
                    extract(item, arr, key)
            return arr
        values = extract(obj, arr, key)
        return values

    # Sections whose heading is 'Handling and Storage'
    result = json_extract(textt, 'Handling and Storage')
    response_title = 'Handling and Storage:\n\n'

    # No data was returned for this compound
    if len(result) == 0:
        response_api = "There are no records of hazard classification, so it may not be dangerous; please consult other professional resources"
        return response_title + response_api

    # Text of the first entry in the first subsection
    result_handling_storage = result[0]['Section'][0]['Information'][0]['Value']['StringWithMarkup'][0]['String']
    return result_handling_storage

Storage Intention

Any home or company warehouse has a place set aside to store chemicals, whether they are considered hazardous or non-hazardous. It is critical that users who store chemicals know the safety procedures to apply in the places where those chemicals are kept. It is important to know not only what to do, but also the correct classification, so that the right procedures are applied to the right chemicals. The toxicity and ghs_classification functions extract the most critical data that users need in order to store chemicals properly. GHS stands for Globally Harmonized System, a classification system that identifies whether a chemical is flammable, explosive, etc. and provides specific recommendations to avoid unsafe practices.

Here is the code that you can modify if you are interested in extracting data related to the toxicity of chemical products.

import json
import requests

def toxicity(chemical_compound, cid):
    # Build the PUG View URL for the compound ID
    API_ENDPOINT = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/' + str(cid) + '/JSON'
    # The data is public, so a plain GET without an authorization header is enough
    resp = requests.get(API_ENDPOINT)
    textt = json.loads(resp.content)

    def json_extract(obj, key):
        # Collect every dict that has `key` as one of its direct values
        arr = []
        def extract(obj, arr, key):
            if isinstance(obj, dict):
                for k, v in obj.items():
                    if isinstance(v, (dict, list)):
                        extract(v, arr, key)
                    elif v == key:
                        arr.append(obj)
            elif isinstance(obj, list):
                for item in obj:
                    extract(item, arr, key)
            return arr
        values = extract(obj, arr, key)
        return values

    # Sections whose heading is 'Toxicity Summary'
    result = json_extract(textt, 'Toxicity Summary')
    response_title = 'Toxicity Summary:\n'

    # No toxicity summary was returned for this compound
    if len(result) == 0:
        response_api = "There are no records of hazard classification, so it may not be dangerous; please consult other professional resources"
        return response_title + response_api

    # Text of the first entry in the section
    result_toxicity = result[0]['Information'][0]['Value']['StringWithMarkup'][0]['String']
    return result_toxicity

Here is the code that you can modify if you are interested in extracting data regarding the Globally Harmonized System hazard classification.

import json
import requests

def ghs_classification(chemical_compound, cid):
    # Build the PUG View URL for the compound ID
    API_ENDPOINT = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/' + str(cid) + '/JSON'
    # The data is public, so a plain GET without an authorization header is enough
    resp = requests.get(API_ENDPOINT)
    textt = json.loads(resp.content)

    def json_extract(obj, key):
        # Collect every dict that has `key` as one of its direct values
        arr = []
        def extract(obj, arr, key):
            if isinstance(obj, dict):
                for k, v in obj.items():
                    if isinstance(v, (dict, list)):
                        extract(v, arr, key)
                    elif v == key:
                        arr.append(obj)
            elif isinstance(obj, list):
                for item in obj:
                    extract(item, arr, key)
            return arr
        values = extract(obj, arr, key)
        return values

    # The GHS section and, inside it, the entry that lists the pictograms
    result = json_extract(textt, 'GHS Classification')
    results = json_extract(textt, 'Pictogram(s)')
    response_title = "GHS Classification:\n"

    # No GHS section or no pictograms: the compound has no hazard classification on record
    if len(result) == 0 or len(results) == 0:
        response_api = "There are no records of hazard classification, so it may not be dangerous; please consult other professional resources"
        return response_title + response_api

    # Each markup item describes one pictogram; 'Extra' holds its label
    ghs_class = results[0]['Value']['StringWithMarkup'][0]['Markup']
    response_api = ""
    for ghs in ghs_class:
        response_api += ghs['Extra'] + " "
    response_ghs_classification = response_title + response_api
    return response_ghs_classification
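
The pictogram step can be tried offline on a hand-written entry. The sketch below assumes that the 'Pictogram(s)' entry carries one markup item per pictogram with its label under 'Extra', as the function above reads it. The sample data and the helper name pictogram_names are illustrative only.

```python
def pictogram_names(information):
    """Join the 'Extra' label of every pictogram in a 'Pictogram(s)' entry."""
    markup = information["Value"]["StringWithMarkup"][0].get("Markup", [])
    return " ".join(item["Extra"] for item in markup if "Extra" in item)


# Hand-written sample entry; the labels are examples of GHS pictogram names.
sample_information = {
    "Name": "Pictogram(s)",
    "Value": {
        "StringWithMarkup": [
            {
                "String": "",
                "Markup": [
                    {"Type": "Icon", "Extra": "Flammable"},
                    {"Type": "Icon", "Extra": "Health Hazard"},
                ],
            }
        ]
    },
}

print(pictogram_names(sample_information))  # prints "Flammable Health Hazard"
```

Using .get with an empty default means an entry without markup yields an empty string instead of raising a KeyError.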

Information Search Intention

Wit.ai may also identify that you are searching for data to help you handle and use a chemical compound only occasionally, in which case you just need to be aware of the compound's basic data. The handling_store and ghs_classification functions can extract data that serves this intention. For example, executing these functions tells you the type of hazard that the chemicals represent and the best procedures to follow when handling them in specific environments.

You can extend or reduce the data extracted by these functions so that you get a more complete report of the best practices for managing hazardous or non-hazardous chemicals.
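
One possible way to extend the extracted data is to collect every text string under a section rather than only its first entry. The sketch below is an assumption about how that could be done. The helper section_strings and the sample section are hypothetical and use placeholder text.

```python
def section_strings(section):
    """Recursively collect every 'String' value found under a PUG View section."""
    strings = []
    if isinstance(section, dict):
        for key, value in section.items():
            if key == "String" and isinstance(value, str):
                strings.append(value)
            else:
                strings.extend(section_strings(value))
    elif isinstance(section, list):
        for item in section:
            strings.extend(section_strings(item))
    return strings


# Hand-written sample section with placeholder text.
sample_section = {
    "TOCHeading": "Handling and Storage",
    "Section": [
        {
            "TOCHeading": "Storage Conditions",
            "Information": [
                {"Value": {"StringWithMarkup": [{"String": "First placeholder."}]}},
                {"Value": {"StringWithMarkup": [{"String": "Second placeholder."}]}},
            ],
        }
    ],
}

print(section_strings(sample_section))
```

To reduce the report instead, the returned list can be sliced or filtered before it is joined into the final text.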

You can verify, modify and improve these functions. The code for the handling_store and ghs_classification functions is the same as listed in the sections above.

The final step of the PubChem implementation is to configure the content_sorted function. The function keeps the data in a few main variables that are sent by email afterwards. The main variables that you have to modify are ureceiver and usender: ureceiver is the destination email address and usender is the email address from which all the information is sent.

The emails contain the data extracted by the handling_store, ghs_classification and toxicity functions.
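
The content_sorted function itself is not reproduced here. As a rough sketch of the emailing step, the code below builds a message from the three extracted texts and sends it with smtplib and ssl. The helper names build_report and send_report, the subject line, and the SMTP host, port and password handling are all assumptions that you would replace with your own provider's settings.

```python
import smtplib
import ssl
from email.message import EmailMessage


def build_report(usender, ureceiver, handling, ghs, tox):
    """Assemble the extracted texts into a plain-text email message."""
    msg = EmailMessage()
    msg["From"] = usender
    msg["To"] = ureceiver
    msg["Subject"] = "Chemical safety report"  # illustrative subject line
    msg.set_content("\n\n".join([handling, ghs, tox]))
    return msg


def send_report(msg, password, host="smtp.gmail.com", port=465):
    """Send the message over SMTP with SSL (host and port are assumptions)."""
    context = ssl.create_default_context()
    with smtplib.SMTP_SSL(host, port, context=context) as server:
        server.login(msg["From"], password)
        server.send_message(msg)


# Example (requires real credentials and network access):
# report = build_report(usender, ureceiver, handling_text, ghs_text, toxicity_text)
# send_report(report, password)
```

Separating message building from sending makes it possible to check the email content without connecting to a mail server.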

In summary, you now know how to extract data from PubChem in an automated way. You may want to modify these pieces of code to build a more sophisticated application that involves chemicals. PubChem stores not only chemicals and their properties, but also patents and other useful information in which chemistry is involved.

To continue learning about Mandy – the Intelligent Companion, check these links to the other technologies and code that you can modify. The sections are listed in the order in which the application runs them, starting when it receives a question or comment from the user:

What is Mandy – the Intelligent Companion?
Wit.ai Technology Implementation
PubChem API Implementation
Youtube Audio Transcription Implementation