Approaches to Data Science with Text

Info

This course introduces students to a set of methods to transform, model, analyze, and reason about text as data. Over the course of the semester, we'll learn to apply natural language processing methods to problems in the social sciences and humanities where the data is in the form of text.

The objective of the course is for students to learn to apply text processing libraries, including scikit-learn, gensim, spaCy, and Hugging Face, to such problems; learn techniques to collect and label text; perform exploratory data analysis; learn to statistically test hypotheses using textual data; represent text both in terms of features of linguistic structure and as low-dimensional distributed representations of words, sentences, and documents; perform text-driven prediction; and learn about ethical issues surrounding the use of text as data.

The course is targeted towards undergraduate students from various disciplines, such as computer science, law, and sociology. No formal technical background is assumed, though some programming knowledge in Python is expected (tutorials on Python will be shared before the course starts).

What is this course not about? This is not a course for learning the intricate details of the algorithms or model architectures that power natural language processing methods; see CS 329 if you want to learn that. Instead, we'll focus on using NLP methods as algorithmic instruments to perform measurements on text data and, through practice, learn the underlying challenges in this enterprise.

Office hours

Sandeep Soni (PAIS 588)

Wednesday, 11am-12pm (in person); Friday, 11am-12pm (via Zoom); or by appointment

Prerequisites

QTM 151 or CS 170; no technical background in data science is assumed but students are expected to know the basics of programming, such as in Python.

Syllabus

Please see the course syllabus on this page

Grading

Please see the grade requirements on this page

Project

Students will work in groups of 3 or 4 on a project with the following components.

Proposal and literature review

Students will propose a research question, explain why it is an interesting question worth asking, provide a sketch of the tools, methods, and timeline for the deliverables, and situate their proposed work with respect to the existing scientific literature on the topic, identifying the gap it will fill (Deliverable: 2 pages; minimum 5 sources).

Midterm report

Students will submit a midterm report describing the results from initial experiments. The report should emphasize describing the methodology, establishing a concrete set of experiments to answer the empirical question in the project, and establishing a validation strategy for the final experimentation (Deliverable: 4 pages; minimum 10 sources).

Final report

The most important deliverable of the project is a final report that will include a complete description of the work. The report will summarize the data and how they were collected, the methods, and the experimental details and results, and will include a thorough analysis. The report should be of high quality according to the standards used to judge a conference submission (Deliverable: 4 pages, not including references).

To create the final report, you must use the template from this repo.

Presentation

Teams will present their work by preparing a poster and presenting it to the class and other Emory students and faculty. The poster should give an adequate but high-level summary of the project (Deliverable: a poster).

Policies

Academic Integrity

All students will follow the Emory Honor Code. With the exception of the group project, in which collaboration is allowed and encouraged, all submissions (homework assignments and problem sets) must be completed independently. The use of large language models (e.g., ChatGPT) and other generative AI technologies is discouraged for both writing and source code. In both writing and source code, cite the appropriate source if you mention or use someone else's work. All submission deadlines for homework assignments and project deliverables will be strictly enforced; exceptions will be made on a case-by-case basis and only if the student has a valid reason for needing one. Students who violate the Honor Code may be subject to a variety of sanctions and are likely to fail the course.

Students with Disabilities

We will strive to make the class accessible to all students. To this end, if you need disability-related accommodations and have an accommodation letter from OAS, please inform me.