ICPSR – Machine Learning for the Analysis of Text as Data
Quantitative analysis of digitized text represents an exciting and challenging frontier of data science across a broad spectrum of disciplines. From the analysis of physicians’ notes to identify patients with diabetes, to the assessment of global happiness through the analysis of speech on Twitter, patterns in massive text corpora have led to important scientific advancements.
In this course we will cover several central computational and statistical methods for the analysis of text as data. Topics will include the manipulation and summarization of text data, dictionary methods of text analysis, prediction and classification with textual data, document clustering, text reuse measurement, and statistical topic models.
Each method will be illustrated with hands-on examples using R. Participants will develop an understanding of the challenges and opportunities presented by the analysis of text as data, as well as the practical computational skills to complete independent analyses. The R packages covered in this course include tm, lda, textreuse, glmnet, and openNLP.
One distinguishing focus of this course will be the use of text analytics for the reliable and valid development and testing of scientific theory. Most methods of text analysis have been developed with predictive or descriptive motivations. For each method we cover, we will review how it has been, and can be, applied to draw theoretical inferences about the processes that generate text.
Instructor: Brice Acree
Dates: June 19 – June 23, 2017
Times: 9:00 AM – 4:30 PM
Location: 219 Davis Library, UNC-Chapel Hill, Chapel Hill, NC
For registration details, click here.
- Members: $1700
- Non-members: $3200