This course introduces students to a set of methods to transform, model, analyze, and reason about text as 
									data. Over the course of the semester, we'll learn to apply natural language processing methods to problems 
									that span the areas of social sciences and humanities where the data is in the form of text.
								
								
									The objective of the course is for students to learn to apply text processing libraries, 
									including scikit-learn, 
									gensim, spaCy, 
									and Hugging Face, to text-based problems; learn techniques to collect and label 
									text; perform exploratory data analysis; learn to statistically test hypotheses using textual data; represent 
									text both in terms of linguistic structure features and low-dimensional distributed representations of words, 
									sentences, and documents; perform text-driven prediction; and learn about ethical issues surrounding the use 
									of text as data.
								
								
									The course is aimed at undergraduate students from various disciplines, such as computer science, law, 
									and sociology. No formal technical background is assumed, though some programming knowledge in Python is 
									expected (tutorials on Python will be shared before the course starts).
								
								
									 What is this course not about? This is not a course for learning the intricate details of the 
									algorithms or model architectures that power natural language processing methods; see CS 329 if you want 
									to learn that. Instead, we'll focus on using NLP methods as algorithmic instruments to perform measurements 
									on text data, and, through practice, learn the underlying challenges in this enterprise.
								
								
								
									
										Sandeep Soni (PAIS 588)
										
Wednesday, 11am-12pm (in person); Friday, 11am-12pm (via Zoom); or by appointment 
									
								
							
					
							
								
								
									QTM 151 or CS 170. No technical background in data science is assumed, but students are expected to know the 
									basics of programming, for example in Python.
								
							
							
						
							
							
							    Students will work in groups of 3 or 4 on a project with the following components.
							
							Proposal and literature review 
							 
							    Students will propose a research question, explain why it is worth asking, 
							    provide a sketch of the tools, methods, and timeline for the deliverables, and situate their 
							    proposed work in the existing scientific literature on the topic by identifying the gap it will fill. 
							    (Deliverable: 2 pages; minimum 5 sources)
							
    							Midterm report 
 
							
							    Students will submit a midterm report describing the results from initial experiments. 
							    The emphasis in this report should be on describing the methodology, establishing a concrete set of experiments 
							    to answer the empirical question in the project, and establishing a validation strategy for the final 
							    experimentation. (Deliverable: 4 pages; minimum 10 sources)
							
							 Final report 
 
							 
							    The most important deliverable of the project is a final report that will include a complete description of 
							    the work. The report will summarize the data and their collection methodology, methods, experimental 
							    details and results, plus a thorough analysis. The report should be of high quality according to the standards
							    used to judge a conference submission (Deliverable: 4 pages, not including references)
							
							
							   To create the final report, you must use the template from this repo. 
							
							 Presentation 
							
							    Teams will present their work by preparing a poster and presenting it to the class and other Emory students/faculty. 
							    The poster should give an adequate but high-level summary of the project. 
							    (Deliverable: a poster) 
							
						
					
					
						
							
      
						Academic Integrity
						
							All students will follow the Emory honor code. 
							With the exception of the group project, in which collaboration is allowed and encouraged, all submissions (homework assignments 
							and problem sets) must be completed independently.
							The use of large language models (e.g., ChatGPT) and other generative AI technologies is discouraged for both writing and source code.
							For both writing and source code, cite the appropriate source if you mention or use someone else's work.
							All submission deadlines for homework assignments and project deliverables will be strictly enforced; 
							exceptions will be made on a case-by-case basis and only if the student has a valid reason for needing an exception. 
							Students who violate the Honor Code may be subject to a variety of sanctions and are likely to fail the course.
						
      						Students with Disabilities
      						 
							We will strive to make the class accessible to all students. To this end, if you need disability-related accommodations and 
							have an accommodation letter from OAS, please inform me.