Our data consists of articles about asthma that also discuss adverse effects. The model rests on one base assumption: because every article concerns adverse reactions to drugs or other chemicals, a word that frequently co-occurs with a side effect is associated with that side effect.
We run one logistic regression per side effect to estimate how strongly each drug co-occurs with it. Because there are more than 1,000 individual side effects, we group them into 53 general disease categories, such as heart disease, nose disease, and metabolic diseases. These categories are the labels of the logistic regressions. The features are the drugs and the 15,000 most frequent words in the articles. Each regression then produces an odds ratio for every feature, which measures how strongly that drug or word co-occurs with the side-effect category.
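As a rough illustration of the per-category regressions described above, the following dependency-free Python sketch fits one logistic regression per disease category on binary presence features and exponentiates each coefficient to get an odds ratio. Everything specific in it is an assumption made for the example: the feature names, the eight toy articles, the labels, the gradient-descent fit, and the small L2 penalty. It is not the project's actual code or data; in practice a library implementation would likely replace `fit_logistic`.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, lr=0.5, epochs=2000, l2=0.01):
    """Fit weights and bias by batch gradient descent on the L2-penalised log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) - label
            grad_b += err
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def odds_ratios(X, y, feature_names):
    """Odds ratio of each feature = exp(fitted coefficient)."""
    w, _ = fit_logistic(X, y)
    return {name: math.exp(wj) for name, wj in zip(feature_names, w)}

# Hypothetical toy data: one row per article, 1 = the drug or word is present.
FEATURES = ["drug_a", "drug_b", "wheezing"]
ARTICLES = [
    [1, 0, 1], [1, 0, 0], [1, 1, 0], [0, 1, 0],
    [0, 1, 1], [0, 0, 1], [1, 0, 1], [0, 1, 0],
]
# Hypothetical labels: 1 = the article mentions that disease category.
LABELS = {
    "heart disease": [1, 1, 1, 0, 0, 0, 1, 0],
    "nose disease":  [0, 0, 0, 1, 1, 0, 0, 1],
}

# One regression per category, as in the text (53 in the real model).
ratios = {cat: odds_ratios(ARTICLES, y, FEATURES) for cat, y in LABELS.items()}

for cat, per_feature in ratios.items():
    ranked = sorted(per_feature.items(), key=lambda kv: kv[1], reverse=True)
    print(cat, [(name, round(value, 2)) for name, value in ranked])
```

An odds ratio above 1 means the feature makes the category more likely to be mentioned, and a value below 1 means less likely. The L2 penalty is there only to keep the coefficients finite when a toy feature separates the labels perfectly.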
We then pool the odds ratios of all drugs across all disease categories and use them to set thresholds for the likelihood of association. If an odds ratio is higher than 75 percent of the pooled drug odds ratios, the likelihood of association is high. If it is higher than between 50 and 75 percent of them, it is medium. When the user types in a drug, our model runs the 53 logistic regressions to find the odds ratio for each of the 53 side-effect categories. After finding the best-fitting odds ratios, the model returns an ordered list of the drugs, starting with the one most likely to co-occur with the side effect. The model is easy to understand, interpret, and train.
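The thresholding step can be sketched in a few lines of Python. The sketch below assigns a band to an odds ratio according to the share of pooled drug odds ratios that fall below it. The pooled values, the example drug profile, and the function names are hypothetical. The "low" band for values below the 50 percent mark is also an assumption, since the text only names the high and medium bands. Ordering the categories for one queried drug is just one reading of the lookup step.

```python
from bisect import bisect_left

def likelihood_band(odds_ratio, pooled_odds_ratios):
    """Band an odds ratio by the share of pooled odds ratios that are lower."""
    ranked = sorted(pooled_odds_ratios)
    share_below = bisect_left(ranked, odds_ratio) / len(ranked)
    if share_below > 0.75:
        return "high"
    if share_below >= 0.50:
        return "medium"
    return "low"  # assumption: the source names only "high" and "medium"

def rank_categories(drug_profile, pooled_odds_ratios):
    """Order one drug's categories from highest to lowest odds ratio."""
    ordered = sorted(drug_profile.items(), key=lambda kv: kv[1], reverse=True)
    return [(cat, value, likelihood_band(value, pooled_odds_ratios))
            for cat, value in ordered]

# Hypothetical pooled odds ratios of all drugs across all categories.
POOLED = [0.4, 0.7, 0.9, 1.0, 1.2, 1.5, 2.1, 3.8]

# Hypothetical odds ratios for a single queried drug.
DRUG_PROFILE = {
    "heart disease": 3.8,
    "nose disease": 1.0,
    "metabolic diseases": 1.5,
}

report = rank_categories(DRUG_PROFILE, POOLED)
for category, value, band in report:
    print(f"{category}: odds ratio {value}, association {band}")
```

Comparing against the share of lower values, rather than against fixed odds-ratio cut-offs, keeps the bands relative to whatever distribution the fitted regressions produce.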