The final PubMed text data contain 3,,, tokens. The MeSH terms consist of descriptor terms, qualifier terms and supplementary concept record terms. MeSH descriptor terms, also known as main headings, describe the core subjects of a PubMed article.
Thus, in this study we focus on the MeSH descriptor terms. Note that it is not straightforward to handle punctuation marks such as commas in MeSH descriptor terms, given their different uses. Hence, in this work we simply removed them from the MeSH descriptor terms (this pre-processing step is to be improved in the future, but is beyond the scope of this work) and converted the words to lowercase. Our word embeddings were trained with the following empirically chosen hyper-parameters. For the sampling strategy, the two parameters p and q were set to 2 and 1, respectively.
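The sampling strategy described here follows a node2vec-style second-order random walk, where p controls the likelihood of returning to the previous node and q controls exploration outward. A minimal sketch of the transition weighting with a toy MeSH-like graph (the node names and this helper are illustrative, not the authors' code):

```python
# Hypothetical sketch of node2vec-style sampling bias controlled by the
# return parameter p and the in-out parameter q (set to 2 and 1 in the text).

def transition_weight(prev, cur, nxt, neighbors, p=2.0, q=1.0):
    """Unnormalized probability of stepping cur -> nxt in a 2nd-order walk."""
    if nxt == prev:                 # returning to the previous node
        return 1.0 / p
    if nxt in neighbors[prev]:      # staying close (distance 1 from prev)
        return 1.0
    return 1.0 / q                  # moving outward (distance 2 from prev)

# toy MeSH-like graph as an adjacency dict (illustrative node names)
graph = {
    "heart": {"aorta", "blood"},
    "aorta": {"heart", "blood"},
    "blood": {"heart", "aorta", "plasma"},
    "plasma": {"blood"},
}

# with p=2, backtracking is half as likely as a local step
w_back = transition_weight("heart", "blood", "heart", graph)   # 1/p = 0.5
w_local = transition_weight("heart", "blood", "aorta", graph)  # 1.0
w_out = transition_weight("heart", "blood", "plasma", graph)   # 1/q = 1.0
```

With p = 2 and q = 1, the walk is biased away from immediately revisiting the previous node while treating local and outward steps equally, which keeps the sampled MeSH sequences exploratory.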
The dimension of the word vectors was set to , and the negative sample size was set to , similar to Bojanowski et al. Word embeddings are commonly used and evaluated in two types of BioNLP tasks: intrinsic and extrinsic. For intrinsic tasks, word embeddings are used to calculate or predict the semantic similarity between words, terms or sentences. For extrinsic tasks, word embeddings are used as the input for various downstream NLP tasks, such as relation extraction or text classification.
Chiu et al. found that the optimal context window size differs between intrinsic and extrinsic evaluation. In our preliminary experiments, we observed similar results: when setting the context window size to 20 and 5, our word embeddings achieved the highest performance in intrinsic and extrinsic evaluation, respectively. Hence, in this work we followed their lead and created two specialized, task-dependent sets of word embeddings by setting the context window size to 20 and 5, respectively. Our BioWordVec data are freely available on Figshare. Both sets are in binary format and contain 2,, distinct words in total, where 2,, words come from PubMed and 15, from MeSH.
All words were converted to lowercase, and the number of dimensions is . Our word embeddings can effectively integrate the MeSH term sequences to improve the representation of such terms or concepts. A good word embedding method should yield similarly high cosine similarity scores for related term pairs; on the other hand, it is difficult to determine how high their absolute cosine similarity scores should be.
In particular, our word embeddings can make good use of the subword information and internal structure of words to improve the representations of rare words, which is highly valuable for BioNLP applications. For evaluation, we first use word embeddings to calculate a cosine similarity score for each term pair. We compared our method with several state-of-the-art methods 1 , 8 , 11 , including those of Mikolov et al. and, using the same word2vec model, Chiu et al. The results suggest that the subword information and MeSH data are valuable and helpful in the biomedical domain.
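The subword mechanism behind this rare-word benefit can be illustrated with a fastText-style composition, where a word's vector is built from its character n-gram vectors, so rare or out-of-vocabulary words still receive a representation. The n-gram vector table below is random toy data, not BioWordVec; only the n-gram extraction mirrors the fastText convention (boundary symbols, n = 3 to 6):

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with boundary symbols, as in fastText."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

rng = np.random.default_rng(0)
ngram_vecs = {}                 # toy n-gram -> vector table (random demo data)

def embed(word, dim=8):
    """Compose a word vector as the mean of its character n-gram vectors."""
    grams = char_ngrams(word)
    for g in grams:             # lazily initialize unseen n-grams for the demo
        ngram_vecs.setdefault(g, rng.standard_normal(dim))
    return np.mean([ngram_vecs[g] for g in grams], axis=0)

# morphologically related words share many n-grams, hence similar vectors
v1, v2 = embed("cardiomyopathy"), embed("cardiomyopathies")
cos = float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```

Because the two variants share most of their character n-grams, their composed vectors are correlated even though neither word needs to appear in the training vocabulary as a whole unit.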
We also noticed that the biomedical corpus was more suitable in this case than the general English corpus. Both the Mikolov et al. and Chiu et al. embeddings share a common vocabulary with ours, and the performance improvement achieved by our method is greater on these two common word sets. We demonstrate here the application of BioWordVec in two separate use cases: finding similar sentences and extracting biomedical relations. Word embeddings are often used to calculate sentence pair similarity. In the general domain, the SemEval Semantic Textual Similarity (SemEval STS) challenge has been organized for over five years, calling for effective models to measure sentence similarity. Averaged word embeddings are used as a baseline in the challenges: each sentence is transformed into a vector by averaging the word vectors of the words in the sentence, and sentence pair similarity is measured by the similarity between the averaged vectors using common measures such as cosine and Euclidean similarity.
Sentence similarity is also critical in the biomedical and clinical domains 24 . We conducted a case study to quantify the effectiveness of the proposed embeddings in the task of computing sentence pair similarity on clinical texts. The top-ranked submission model used averaged embeddings with different similarity functions, which was shown to be effective at capturing sentence similarity. We applied the averaged word embedding approach and adopted cosine, Euclidean and City Block similarity to measure the averaged vectors.
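The averaged-embedding baseline with the three similarity measures can be sketched as follows. The toy vectors stand in for real BioWordVec vectors, and the distance-to-similarity conversions are one common convention, not necessarily the exact functions used in the study:

```python
import numpy as np

def sentence_vec(tokens, emb):
    """Average the word vectors of the in-vocabulary tokens."""
    return np.mean([emb[t] for t in tokens if t in emb], axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_sim(u, v):
    return 1.0 / (1.0 + np.linalg.norm(u - v))      # distance -> similarity

def cityblock_sim(u, v):
    return 1.0 / (1.0 + np.abs(u - v).sum())        # L1 distance -> similarity

# toy 4-dimensional embeddings for a few clinical-sounding words
rng = np.random.default_rng(42)
emb = {w: rng.standard_normal(4) for w in
       ["patient", "denies", "chest", "pain", "reports", "no"]}

s1 = sentence_vec("patient denies chest pain".split(), emb)
s2 = sentence_vec("patient reports no chest pain".split(), emb)
scores = (cosine(s1, s2), euclidean_sim(s1, s2), cityblock_sim(s1, s2))
```

Each sentence collapses to a single vector, so pair similarity reduces to a vector comparison; the three measures often rank pairs similarly but differ in scale.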
Our proposed embeddings achieved higher correlations with all three similarity measures, demonstrating that they capture semantic meaning more effectively.
Given that word embeddings are often used as the input to recent deep-learning-based methods for various biomedical NLP tasks 4 , 5 , 6 , below we evaluate their effect in two biomedical relation extraction tasks: protein-protein interaction (PPI) extraction and drug-drug interaction (DDI) extraction.
The former is a binary relation extraction task, whereas the latter is a multi-class relation extraction task. Following previous studies 4 , 28 , we use precision, recall and F-score as evaluation metrics and choose the same baseline methods. For the binary relation extraction task, we implemented a convolutional neural network (CNN) model and used a dropout layer with a dropout rate of 0.
For the input of our CNN model, we combine the position embeddings with the word embeddings, as this combination has been shown to be effective. The PPI extraction experiments were evaluated with -fold document-level cross-validation. Knowledge from MeSH was helpful in all datasets.
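Combining position embeddings with word embeddings can be sketched as building, for each token, the concatenation of its word vector with two position vectors indexed by the clipped relative distance to each entity mention. The dimensions, clipping range, and lookup-table initialization below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

WORD_DIM, POS_DIM, MAX_DIST = 8, 3, 10          # toy sizes (assumptions)
rng = np.random.default_rng(1)
# position lookup table covering relative distances in [-MAX_DIST, MAX_DIST]
pos_table = rng.standard_normal((2 * MAX_DIST + 1, POS_DIM))

def position_ids(n_tokens, entity_idx):
    """Clipped relative distance of every token to the entity, shifted >= 0."""
    d = np.arange(n_tokens) - entity_idx
    return np.clip(d, -MAX_DIST, MAX_DIST) + MAX_DIST

def build_input(word_vecs, e1_idx, e2_idx):
    """Concatenate word vectors with two position embeddings per token."""
    p1 = pos_table[position_ids(len(word_vecs), e1_idx)]
    p2 = pos_table[position_ids(len(word_vecs), e2_idx)]
    return np.concatenate([word_vecs, p1, p2], axis=1)

word_vecs = rng.standard_normal((6, WORD_DIM))  # 6-token toy sentence
x = build_input(word_vecs, e1_idx=1, e2_idx=4)  # shape (6, WORD_DIM + 2*POS_DIM)
```

The resulting matrix is what a convolution layer would slide over, so each filter sees both lexical content and where the token sits relative to the two candidate entities.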
Since DDI extraction is a multi-class relation extraction task, we compute the micro average to evaluate the overall performance 37 . In the DDI corpus, the training set and test set contain 27, and 5, instances, respectively. To further evaluate the performance of different word embeddings on more complex neural models, we also conducted a comparison experiment using a recent state-of-the-art DDI extraction model 4 , a hierarchical RNN with an input attention layer based on the sentence sequence and shortest dependency path.
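Micro averaging pools the per-class true positives, false positives and false negatives before computing precision, recall and F-score. A small sketch, assuming the common DDI convention of scoring only the positive interaction types (the label names and the exclusion of the negative "none" class are assumptions for illustration):

```python
def micro_prf(gold, pred, labels):
    """Micro-averaged precision/recall/F over the classes in `labels`."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p in labels:
            if p == g:
                tp += 1            # correct positive prediction
            else:
                fp += 1            # predicted a positive class wrongly
        if g in labels and p != g:
            fn += 1                # missed (or mislabeled) a gold positive
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# toy predictions over hypothetical DDI-style interaction types
gold = ["advise", "effect", "none", "mechanism", "int", "none"]
pred = ["advise", "none", "none", "mechanism", "effect", "int"]
labels = {"advise", "effect", "mechanism", "int"}   # positive classes only
p, r, f = micro_prf(gold, pred, labels)             # -> (0.5, 0.5, 0.5)
```

Note that a cross-class error counts as a false positive for the predicted class and a false negative for the gold class, which is why micro averaging can differ from plain accuracy once the negative class is excluded.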
We also noticed that our method achieved a more significant advantage with the simple CNN model than with the complex RNN model. For example, the F-score advantage of our method over Mikolov et al. is larger on the CNN model; when employing the state-of-the-art RNN model, the improvement in F-score reduces to 0. This is likely because the state-of-the-art DDI extraction model 4 already integrates the shortest dependency path information and part-of-speech embeddings, and uses multiple layers of bidirectional long short-term memory networks (LSTMs) to boost performance.
Mikolov, T. Distributed representations of words and phrases and their compositionality.
Mnih, A. Learning word embeddings efficiently with noise-contrastive estimation.
Bengio, Y. A neural probabilistic language model. Journal of Machine Learning Research 3 , —
Zhang, Y. Drug–drug interaction extraction via hierarchical RNNs on sequence and shortest dependency paths. Bioinformatics 34 , —
Tang, D. Learning sentiment-specific word embedding for twitter sentiment classification.
Ebell, M. Bridging semantics and syntax with graph algorithms—state-of-the-art of extracting biomedical relations.
Pyysalo, S.
Jiang, J.
Such models are not particularly good at learning rare or out-of-vocabulary (OOV) words in the training data. Unlike previous studies such as 16 , 17 , 18 that aim to learn embeddings for nodes, we transform MeSH ID sequences into word sequences so that they can be treated equally to PubMed sentences during the learning of word embeddings.
Ganguly, D. Word embedding based generalized language model for information retrieval. Pennington, J. Glove: global vectors for word representation.
Chiu, B. How to train good word embeddings for biomedical NLP.
Wang, Y. A comparison of word embeddings for the biomedical natural language processing. Journal of Biomedical Informatics 87 , 12–20
Smalheiser, N. Unsupervised low-dimensional vector representations for words, phrases and text that are transparent, scalable, and produce similarity metrics that are complementary to neural embeddings.
Bojanowski, P. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5 , —
Faruqui, M. Retrofitting word vectors to semantic lexicons.
Yamada, I. Joint learning of the embedding of words and entities for named entity disambiguation.
Han, X. Joint representation learning of text and knowledge for knowledge graph completion.
Cao, Y. Bridge text and knowledge by learning multi-prototype entity mention embedding.
Perozzi, B. DeepWalk: online learning of social representations.
Tang, J. LINE: Large-scale information network embedding.
Grover, A. node2vec: scalable feature learning for networks.