ICPSR – Machine Learning for the Analysis of Text as Data
July 16 @ 9:00 am - July 20 @ 5:00 pm
Brice Acree, Ohio State University
Quantitative analysis of digitized text represents an exciting and challenging frontier of data science across a broad spectrum of disciplines. From the analysis of physicians’ notes to identify patients with diabetes to the assessment of global happiness through the analysis of speech on Twitter, patterns in massive text corpora have led to important scientific advancements. In this course, we will cover several central computational and statistical methods for the analysis of text as data. Topics will include the manipulation and summarization of text data, dictionary methods of text analysis, prediction and classification with textual data, document clustering, text reuse measurement, and statistical topic models. Each method will be illustrated with hands-on examples using R. Participants will develop an understanding of the challenges and opportunities presented by the analysis of text as data, as well as the practical computational skills to complete independent analyses. The R packages covered in this course include tm, lda, textreuse, glmnet, and openNLP.
One distinguishing focus of this course will be the use of text analytics for the reliable and valid development and testing of scientific theory. Most methods of text analysis have been developed with predictive or descriptive motivations. For each method we cover, we will review how it has been, and can be, applied to draw theoretical inferences about the processes that generate text.
Prerequisites: Participants should be familiar with linear and generalized linear models (e.g., logit and Poisson) and have at least some exposure to the R environment before the workshop. The class will review aspects of R on the first day. No prior knowledge of text processing or modeling is assumed.
Fee: Members = $1700; Non-members = $3200