BioNodes: Finding Associations Among Medical Keywords Using NLTK in Python


A few months ago I started a project with the goal of discovering relationships among medical concepts. The hypothesis was that tons of medical papers are released every day and no one researcher has the time to sift through all of them. If an automated process can go through them and discover relationships among these concepts, then it can help a medical researcher uncover associations or find articles that were not on their radar.

Taking Baby Steps

This idea is not unique, but the implementation greatly impacts the user experience. I decided to start with a simple implementation and enhance different parts of the system iteratively. The overall process consisted of these steps:

  1. Crawl the source sites and scrape the articles.
  2. Have a simple algorithm to extract the most important entities (keywords) in those articles (or just use the ones provided by the author).
  3. Create a simple graph to connect these keywords based on how they show up in the articles.
  4. Put a simple UI on top of it as the presentation layer.

Sources for Medical Papers

In order for this idea to work, the application needs lots of data. Not being a medical researcher myself, I searched the web looking for sites with lots of up-to-date medical papers. I found these two sites:

Crawling and Extracting Keywords

For crawling I use requests to make calls to the source server, BeautifulSoup for extracting data out of HTML pages, and PDFMiner to extract text from PDF files since the articles are uploaded as PDF files. For extracting the keywords I use NLTK.

1. Extracting Text from PDF using PDFMiner

The crawler has a method that extracts the text from a PDF file given its URL. It first loads the PDF as a stream, and then passes that stream to the high_level module of pdfminer to extract the text.
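As a rough sketch of this step (not the project's actual code), the snippet below downloads a PDF with requests and hands the in-memory stream to extract_text from pdfminer.high_level, which is part of the pdfminer.six package. The function names and the optional extract parameter are hypothetical; the parameter exists only so the example can be tried without a real PDF.

```python
import io


def pdf_bytes_to_text(pdf_bytes, extract=None):
    """Extract text from the raw bytes of a PDF file.

    `extract` is a hypothetical hook for swapping in another extractor;
    by default pdfminer's high-level API is used.
    """
    if extract is None:
        # Imported lazily so the module loads even without pdfminer.six.
        from pdfminer.high_level import extract_text as extract
    # extract_text accepts a file path or a binary file-like object.
    return extract(io.BytesIO(pdf_bytes))


def pdf_url_to_text(url, timeout=30):
    """Download the PDF at `url` and return its text."""
    import requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return pdf_bytes_to_text(response.content)
```

Keeping the download separate from the extraction makes it possible to re-run the extraction on cached files without hitting the source server again.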

2. Extract keywords if they are provided by the author

If keywords are provided in the article, we take them. This is done using this regular expression:

re.findall("(?<=Keywords:)(.*)(?=\\n)", content)
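To show how that expression behaves, here is a small sketch that applies it to a piece of text and turns the match into a list. The function name, the splitting on commas and semicolons, and the lowercasing are my assumptions for illustration; the original only shows the regular expression itself.

```python
import re

# Same pattern as above, written as a raw string.
KEYWORDS_PATTERN = r"(?<=Keywords:)(.*)(?=\n)"


def author_keywords(content):
    """Return the author-provided keywords, or an empty list if none are found."""
    matches = re.findall(KEYWORDS_PATTERN, content)
    if not matches:
        return []
    # Assumption: keywords on the line are separated by commas or semicolons.
    parts = re.split(r"[;,]", matches[0])
    return [part.strip().lower() for part in parts if part.strip()]


sample = "Some title\nKeywords: Influenza; Vaccination, Immunity\nAbstract text\n"
print(author_keywords(sample))
```

Note that the pattern only matches when the keyword line ends with a newline, so a keyword line at the very end of the text would be missed.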

3. Use NLTK to extract keywords if keywords are not provided by the author

I originally tried going through the entire article and extracting keywords. This produced a lot of keywords and not all of them seemed relevant. After realizing that the titles of these articles are usually very verbose and descriptive, I decided to extract the keywords only from the titles, and that produced a decent result.

  1. Remove _stop words_ from the keywords we got from the previous step. Stop words are the most common words in a language, and in our case they should be removed because we don’t want them to end up as keywords. There are entire articles on the web that cover this topic. Thankfully, NLTK makes this process pretty easy. Once you download the English stop words, they are available in the stopwords module in nltk.corpus. We assign the unique English stop words to our variable: Crawler.stop_words = set(stopwords.words("english"))
  2. We also _lemmatize_ the keywords. This is the process of “grouping together the inflected forms of a word so they can be analyzed as a single item”. For instance, “walk” is the base form of the word “walking”. So there is no point in having both “walk” and “walking” as keywords.
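The two steps above could look roughly like the sketch below. It is an illustration under my own assumptions, not the project's code: the function names and the simple regular-expression tokenizer are hypothetical, and the NLTK pieces are loaded in a separate helper so the filtering logic can be tried with any stop-word set and lemmatizer.

```python
import re


def title_keywords(title, stop_words, lemmatize):
    """Extract keywords from a title: drop stop words, lemmatize, de-duplicate."""
    # Assumption: a token is a run of at least two letters or hyphens.
    tokens = re.findall(r"[a-z][a-z\-]+", title.lower())
    keywords = []
    for token in tokens:
        if token in stop_words:
            continue
        lemma = lemmatize(token)
        if lemma not in keywords:  # keep first occurrence, preserve order
            keywords.append(lemma)
    return keywords


def nltk_tools():
    """Return NLTK's English stop words and a lemmatizing function.

    Requires the nltk package; the corpora are downloaded on first use.
    """
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    nltk.download("stopwords", quiet=True)
    nltk.download("wordnet", quiet=True)
    lemmatizer = WordNetLemmatizer()
    # lemmatize() treats words as nouns unless a pos argument is given.
    return set(stopwords.words("english")), lemmatizer.lemmatize
```

With NLTK installed, usage would be along the lines of `stop_words, lemmatize = nltk_tools()` followed by `title_keywords(title, stop_words, lemmatize)`. Because the lemmatizer defaults to nouns, reducing a verb form such as “walking” to “walk” needs the part of speech passed explicitly (pos="v").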

Creating a Graph inside Neo4j to Store Keywords and their Associations

Neo4j is a graph database that makes modelling the keywords, articles, and their associations easy. It allows us to find all the ways a keyword is associated with another keyword. It’s very similar to how LinkedIn can show us how we’re connected to one another through other colleagues and acquaintances.

Neo4j is not the only option. Some alternatives are:

  • ArangoDB
  • AgensGraph
  • OrientDB
  • A pure relational database like PostgreSQL

The keywords are associated with each other according to two rules. As an example, suppose article 1 contains keywords A, B, and C, and article 2 contains keywords C and D.

  • All the keywords within the same article are directly associated with each other. In this example, keywords A, B, and C are directly associated with each other because they all appear in article 1. Similarly, keywords C and D are directly associated with one another because they appear in article 2. In other words, their relationship has a “path of length 1”.
  • If two articles share a keyword, the keywords that do not appear in the same article are also associated with one another, but their relationship has a “path of length 2”. In this example, keywords A and B are associated with keyword D, but their paths have a length of 2. If a third article referenced keyword A from article 1, the other keywords in that article would be associated with keyword D from article 2, but the length of that path would be 3. When we query the associations we can decide the maximum length that still gives us a meaningful association between keywords.
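To make the path lengths concrete, here is a small in-memory sketch of the association rules described above. It uses a plain dictionary and a breadth-first search instead of Neo4j, purely for illustration; the function names and the max_length default are hypothetical.

```python
from collections import deque
from itertools import combinations


def build_graph(articles):
    """Connect every pair of keywords that appear in the same article."""
    graph = {}
    for keywords in articles.values():
        for keyword in keywords:
            graph.setdefault(keyword, set())
        for first, second in combinations(keywords, 2):
            graph[first].add(second)
            graph[second].add(first)
    return graph


def path_length(graph, start, goal, max_length=3):
    """Shortest path length between two keywords, or None beyond max_length."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if distance == max_length:
            continue  # do not explore past the allowed length
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return distance + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None


articles = {"article 1": ["A", "B", "C"], "article 2": ["C", "D"]}
graph = build_graph(articles)
print(path_length(graph, "A", "B"))  # same article
print(path_length(graph, "A", "D"))  # connected through keyword C
```

A graph database does this traversal for us, but the idea is the same: the maximum length acts as a cut-off for how distant an association we are still willing to show.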

Connecting to Neo4j in Python

The DbConnection class provides a singleton driver method that can be used anywhere in the application. This way the connection to the database is established only once.
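A minimal sketch of such a class, assuming the official neo4j Python driver, might look like this. The environment variable names, the default URI, and the _create_driver helper are my assumptions, not the project's actual configuration.

```python
import os


class DbConnection:
    """Holds a single shared Neo4j driver for the whole application."""

    _driver = None

    @classmethod
    def _create_driver(cls):
        # Imported lazily so this module loads even without the neo4j package.
        from neo4j import GraphDatabase

        # Hypothetical configuration: read connection details from the environment.
        uri = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
        user = os.environ.get("NEO4J_USER", "neo4j")
        password = os.environ.get("NEO4J_PASSWORD", "")
        return GraphDatabase.driver(uri, auth=(user, password))

    @classmethod
    def driver(cls):
        """Return the shared driver, creating it on first use."""
        if cls._driver is None:
            cls._driver = cls._create_driver()
        return cls._driver
```

Any part of the application can then call `DbConnection.driver()` and get the same driver object back. This simple version is not thread-safe on first use, which is fine for a sketch but worth keeping in mind in a multi-threaded server.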

Creating nodes in Neo4j

Creating a node takes only a few lines, using the driver from the DbConnection class described above.
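One way this could look with the official neo4j driver is sketched below. The function name is hypothetical, and I use MERGE rather than CREATE so that running the crawler twice does not produce duplicate keyword nodes; that choice is my assumption and may differ from the original code.

```python
# Cypher statement with a $name parameter instead of string concatenation.
CREATE_KEYWORD = "MERGE (k:keyword {name: $name})"


def create_keyword(driver, name):
    """Create a keyword node if it does not exist yet.

    `driver` is expected to behave like a neo4j driver, for example the
    object returned by a singleton such as DbConnection.driver().
    """
    with driver.session() as session:
        session.run(CREATE_KEYWORD, name=name)
```

Passing the keyword as a query parameter avoids quoting problems when a keyword contains an apostrophe or other special characters.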

Creating Relationship Between Keywords

The Cypher query for this is as simple as below:

"MATCH (k1:keyword {name:'Keyword A'}), (k2:keyword {name:'Keyword B'}) CREATE (k1)-[:associated_with]->(k2);"
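In practice the keyword names would not be hard-coded. The sketch below runs a parameterized version of the query for every pair of keywords in one article. The function name is hypothetical, the session is assumed to behave like a neo4j driver session, and MERGE is used instead of CREATE (my assumption) so that repeated runs do not duplicate relationships.

```python
from itertools import combinations

# Parameterized variant of the query above.
ASSOCIATE = (
    "MATCH (k1:keyword {name: $first}), (k2:keyword {name: $second}) "
    "MERGE (k1)-[:associated_with]->(k2)"
)


def associate_keywords(session, keywords):
    """Associate every pair of keywords that appear in the same article."""
    # Sorting gives each pair a stable direction, so A->B is never also stored as B->A.
    pairs = list(combinations(sorted(set(keywords)), 2))
    for first, second in pairs:
        session.run(ASSOCIATE, first=first, second=second)
    return pairs
```

Since the relationship is stored in one direction only, queries that look for paths between keywords should match it without a direction, for example `(a)-[:associated_with]-(b)`.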

The UI and Finishing Touches

The main application is a Flask application. Its endpoints call various methods in our main library and allow us to get all the keywords, get the path among keywords, get the list of articles, do a full text search in articles, etc.
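As an illustration only, two such endpoints could be wired up as in the sketch below. The routes, function names, and the in-memory ARTICLES list are hypothetical placeholders; the real application reads from Neo4j through its own library.

```python
# Placeholder data standing in for what the real library would return.
ARTICLES = [
    {"title": "Article 1", "keywords": ["influenza", "vaccination"]},
    {"title": "Article 2", "keywords": ["vaccination", "immunity"]},
]


def all_keywords(articles):
    """Return every distinct keyword, sorted alphabetically."""
    return sorted({kw for article in articles for kw in article["keywords"]})


def articles_for(articles, keyword):
    """Return the titles of the articles that mention the keyword."""
    return [a["title"] for a in articles if keyword in a["keywords"]]


def create_app(articles=ARTICLES):
    """Build the Flask app; Flask is imported here so the helpers work without it."""
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/keywords")
    def keywords():
        return jsonify(all_keywords(articles))

    @app.route("/articles")
    def list_articles():
        # Example request: /articles?keyword=vaccination
        keyword = request.args.get("keyword", "")
        return jsonify(articles_for(articles, keyword))

    return app
```

Keeping the lookups in plain functions and the routes as thin wrappers makes the logic easy to test without starting a web server.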

A Little Demo

This is what the UI looks like:

  1. “influenza” and all the associated keywords are displayed in the “Graph” and all the articles that mention “influenza” appear in the “Articles” section.
  2. We search for “immunity” and add it to the Selected Keywords. Now we can see the path between “influenza” and “immunity”. They are connected to each other through the “vaccination” keyword. We select “vaccination” in the graph to add it to the list of “Selected Keywords”.
  3. Now we can see the articles that include “vaccination” and “influenza”, and those that include “vaccination” and “immunity”. Clicking on any of those articles takes us to the MedRxiv website.

What’s Next

These are some of the main issues:

  • A more sophisticated NLP algorithm can improve the extraction of the keywords.
  • I tried hosting this in AWS but it cost too much. We need a decent machine for scraping and extracting keywords. I tried hosting the main application on a small instance and running the crawler on a schedule using AWS Fargate, but the cost of running it for just 1 hour every day was too much.

Final Thoughts

I hope you find this project useful. Please feel free to use any part of it in your own project. If you have any ideas for improvement I would love to hear them. If you would like to contribute and make pull requests, that would be awesome as well. Thank you.

Big data and full stack engineer | Founder at AtlasRain consulting
