|Professor:||Jonathan North Washington|
|Office hours:||T 13:30-15:00 & by appointment; also available by messaging on Google Chat/Hangouts|
|Email/messaging:||jwashin1@swarthno scrapers please...more.edu|
|Meeting time:||TTh 9:55-11:10|
|Meeting modality:||Mixed (in person as possible)|
|Physical classroom:||Clothier 16|
|Online classroom:||Gather (see Moodle for meeting URL)|
|Course Piazza site:||LING 073|
|Course Moodle site:||LING073-01-CPSC013-01-S22|
In terms of goals for student learning, this course is designed for students to learn to develop language technology for communities lacking in existing resources. Here's the official description:
This course explores the possibilities for creating computational resources for languages for which vast collections of text don’t exist. Students will choose a language lacking in computational resources and develop tools for it. The focus will be on creating nuanced symbolic representations of the language that can be employed by computers, to the benefit of both language researchers who wish to test grammatical models, and language communities which lack the social capital to benefit from corporately developed resources. Topics covered include input methods and spell-checking, morphological analysis and disambiguation, syntactic parsing, building corpora, and rule-based machine translation, with an emphasis on anti-colonial methodologies and free/open-source technologies.
Prerequisites: LING 001 (or equivalent), or CPSC 021 (or equivalent), or permission of the instructor.
Additionally, students will:
Towards these goals, and related to particular course activities, students will
The general structure of the course will be centred around student projects. At the beginning of the course, each student will choose an under-resourced language to work on (in consultation with the professor), and will, with a partner, spend the semester developing materials for the language as lab assignments.
In general, we will spend both class meetings per week (Tuesday and Thursday) on each topic, where the first day (Tuesday) will be more focussed on discussing the topic (overview, general issues and solutions, etc.) and the second day (Thursday) will be a lab day dedicated to guided lab work on the problem. The week's assignment will generally be due after the lab day (by the end of the day Friday), so the lab day provides an opportunity to get started on the lab and get assistance from the professor or course assistant on difficult areas. Additional optional lab hours will be scheduled near the end of the week to give an extra opportunity for assistance from the professor or course assistant.
This course has a prerequisite of LING 001 (or equivalent), or CPSC 021 (or equivalent), or the permission of the instructor; i.e., some background in either computer science or linguistics is needed. Background beyond the introductory level in either field (or both!) will give students an advantage, but nothing beyond a previous intro to at least one of them is necessary.
All required skills will be imparted throughout the course. There will be no conventional programming required of students, but we will be using command-line tools and several different types of declarative syntax. No previous knowledge of linguistics is required for students with CS background, but a focus of the course will be coming to understand linguistic phenomena (by implementing models of them computationally). No previous knowledge of CS is required for students with a linguistics background, but a focus of the course will be implementing models (of linguistic phenomena) computationally. The challenge of the course is less about learning the computational formalisms or understanding the patterns in the languages from a formal linguistics perspective (though skills in either field will be leveraged and skills in both fields will be strengthened by the course), and more in learning to use the formalisms to implement models of the linguistic patterns computationally. It is expected that some students will grasp these different aspects of the course with different levels of ease, which provides a great opportunity for students to share knowledge and skills with one another.
No textbook is required, but you will need to have access to the following resources:
We'll be using Gather (listed above, and linked to from the Moodle course) for our meetings.
We will be using Swarthmore GitHub (github.swarthmore.edu) for a number of purposes. Most assignments will be submitted by a script that automatically clones the relevant repo as class begins.
Also, you'll need to be able to access Moodle (moodle.swarthmore.edu). Some materials we use for the course will be available there (readings, etc.), as will your grades, so make sure you can access it as soon as possible. If you have any trouble with it, notify me as soon as you can. Non-Swarthmore TriCo students may not have access to Moodle immediately at the beginning of the semester—let me know if this is the case for you, and I will make sure you have access to resources in some other way.
We'll be using Piazza (piazza.com, listed above, and linked to from the Moodle course) for responding to readings. You can also use it as a way to ask questions about assignments and other course content.
The course website (listed above, and linked to from the Moodle course) contains the schedule for the semester, which will be updated regularly with links to various resources. It's recommended to check the website at least a couple of times per week. I will make announcements about any major changes.
The course wiki (wikis.swarthmore.edu/ling073) will be where we organise the resources developed in the course. It is there not just for your professor and classmates, but for anyone in the world to access, and so may end up attracting the attention of speakers of the languages or other people interested in them. It will also be a model for future students of this course to look to.
I hold regular office hours (listed above), and can be available at other times by appointment—just send me an e-mail letting me know when you might prefer to meet.
If you are having any trouble with class, such as with understanding a concept or completing an assignment, please don't hesitate to ask me for help. I'm here to help you learn, so I encourage you to take advantage of my availability.
But even if you're not having trouble, it never hurts to come to office hours from time to time. We can discuss course content, ideas for the final project, or whatever's on your mind in a more relaxed atmosphere.
In addition, we will hold lab hours at a time to be announced every week (probably in the evening). The course assistant and/or the instructor will be available for consultation. Attendance is entirely optional, but it never hurts to come to lab hours. At the very least, it's a relaxed atmosphere to work in solidarity with your classmates and with access to me and/or the course assistant if you have a quick question or the like. It's an especially great time to make progress on your assignments!
Show up on time and silence cell phones. Food and drinks are generally not allowed in lab, per the policies for the room. However, I don't mind as long as you don't damage the equipment or disturb your classmates. If you need to step out of the class for any reason (bathroom, emergency phone call, etc.), please do so with minimum disruption (i.e., don't ask for permission).
Due to the nature of the course, we will be using computers in almost every class. This brings about the potential for a number of distractions, so please use the computer only for relevant classroom activities. In other words, please refrain from any sort of non-class-related activities, including messaging (e-mail, social media, etc.), homework for other courses, or even catching up on course reading. Even the best multitaskers are still not participating fully when they're engaging in unrelated endeavours. If it's too difficult to avoid the temptation of these other distractions, you may try strategies like disabling the computer's internet connection, using a filter for web usage, or similar.
Note on pronouns: I'm happy to be referred to by any pronoun, and welcome all of you to share the pronouns you would like to be referred to by in this class. If you notice anyone using the wrong pronouns, including me, please feel free to let us know so we can get it right.
All material covered during course-related activities—including assigned readings, quizzes, and labs—should be assumed to be required course content, and will be assumed background for later activities. It is each student's responsibility to attend all classes to learn the material covered. If you must miss a class (e.g., for an athletic or religious reason), it is courteous to notify your professor ahead of time if at all possible, but it will be your responsibility to learn about missed material from classmates. It is not my responsibility to make up for your absence or re-teach the material. (That said, let me know if you're having trouble making something up, and we'll figure something out.) With so few class meetings dedicated to each topic and the cumulative nature of the topics, missing one day can be a very big deal—so I really recommend trying not to miss class. Also see COVID considerations below.
The assigned readings are to be read in advance of the class dates they're assigned for. The readings complement in-class activities and provide the necessary background; however, you should not assume that they will be fully summarized or reviewed in class. Students should be prepared to evaluate, integrate, or respond to the readings in class discussions.
Any excuse for missing any course-related activities will need to be handled by your class dean. Please see the Medical Excuse Policy (http://www.swarthmore.edu/student-health/medical-excuse-policy), and remember to contact your class dean as soon as you can so that they can work with you.
Readings will generally be due at the beginning of class on Tuesdays. You will be expected to respond to the reading on Piazza at least 24 hours before Tuesday's class starts (i.e., by Monday morning) so that you and your classmates can then follow up on one another's posts.
Lab assignments will generally be due at the end of each week (Friday evening). Work on the assignment must be complete in order to move on to the next topic, so it is essential that assignments be submitted on time.
You will submit lab assignments almost exclusively on GitHub and the course wiki (each assignment will say explicitly how to submit it), both of which keep timestamps. These two methods also both allow for incremental submissions, so you may often commit and push (GitHub) and save your work (wiki) as you work on it. This means both that I can see exactly what was there at the deadline and that partial work may be there as of the deadline.
Any work submitted between the deadline and when the assignments are graded (usually not before the next day) will receive only half credit—e.g., if you submit about 75% of the assignment before the deadline and 100% of the assignment is there when it is graded, you can at most receive 87.5% on the assignment.
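The arithmetic behind that example can be sketched in a few lines of Python. This is only an illustration of one reading of the half-credit rule (on-time work counts in full, work added between the deadline and grading counts for half); the function name `max_credit` and its parameters are hypothetical and are not part of any course tooling.

```python
def max_credit(on_time: float, at_grading: float) -> float:
    """Upper bound on credit under the half-credit rule for late work.

    on_time    -- fraction of the assignment present at the deadline (0.0-1.0)
    at_grading -- fraction present when the assignment is graded (0.0-1.0)
    """
    if not 0.0 <= on_time <= at_grading <= 1.0:
        raise ValueError("expected 0 <= on_time <= at_grading <= 1")
    late = at_grading - on_time   # portion added after the deadline
    return on_time + 0.5 * late   # assumption: the late portion counts for half


# The example from the text: 75% on time, 100% present at grading.
print(max_credit(0.75, 1.0))  # 0.875, i.e. at most 87.5%
```

Under the same reading, an assignment submitted entirely after the deadline but before grading could receive at most 50%.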
Using words or ideas from another source without attribution constitutes plagiarism, and misrepresenting another student's work as your own (or allowing another student to misrepresent your work as their own) is cheating. Please see the student handbook for the College's policies on academic misconduct (http://www.swarthmore.edu/student-handbook/academic-policies#academic_misconduct). Suspected cases of academic misconduct will be pursued to the full extent of College policy, including referral to the College Judicial Committee.
You are always expected to do your own work on assignments. You may (and are encouraged to) ask one another for and provide one another with assistance on assignments. If you are providing assistance, you must not provide the solution—you may only provide guidance that will help the other student(s) find the solution on their own. If the work in this course were a real-world FOSS project, providing the solution would be okay (and perhaps even encouraged), but the requirement that each student be evaluated on their own work is incompatible with this model (at least on the surface).
With every assignment you should include an AUTHORS file in the top directory, just like you might find in an open source project. If you receive assistance on any assignment from anywhere (a classmate, a website, a native speaker of the language, a stranger on the internet, etc.), please acknowledge them in the AUTHORS file.
In some instances you may work with your classmates. For lab assignments where you are working in a group or with a partner, you may divide the work as appropriate, within the parameters of the assignment. Ideally this means team-coding, where each person takes turns at the keyboard, with alternation also between who's committing the code. You may also discuss generalities of lab assignments with your classmates, such as what is expected from you, and even examine each other's work. If you are able to make sense of the way someone implemented something for another language and it inspires a way to implement something in the language you're working on, then you've used critical thinking to solve a problem, and have learned something, so this is very much encouraged too. And of course, any discussion of course content in general is strongly encouraged.
In short, submitting work that is not your own or providing a classmate with a solution will be considered academic misconduct and will be addressed as such (see above-mentioned policies). So please just be honest. And if you have any questions about what's considered acceptable, ask me first.
Let us implement as a class a few policies related to the ongoing pandemic in an effort to keep as many people safe as possible. Not all students on campus are vaccinated against SARS-CoV-2, and members of my household are not currently eligible for vaccination. I am a potential vector for virus transmission in both directions. These individuals represent both added risk to and added vulnerability in our community. The policies below were designed to keep everyone involved as safe as possible. I also want to make sure that everyone keeps in mind the continuing risk and vulnerability in our community.
There are some special considerations to take into account given that some of this course may be conducted online.
There's a form on Moodle for anonymous feedback on the course. I encourage everyone to periodically consider what's going well and what they would like to see changed about the course and let me know via the form.
If you are vulnerable because you or someone you live with has health or other issues that put you at higher risk, please reach out to make alternate arrangements with me as soon as possible. I don't need to know any details, just that you deem your situation to be high risk and would like alternate arrangements.
College policy is that masks must be worn at all times in classrooms, so please make sure that you practice correct and consistent mask use during class. If you're not entirely sure what this entails, I'd be happy to help clarify.
Due to the limited effectiveness of cloth face-coverings in limiting the spread of recent variants of COVID-19, and despite the fact that College policy may not require it, I'd like to ask that everyone wear N95, KN95, or equivalent respirators during in-person instruction. A mask of this sort can be provided for you to use if needed. They may be reused multiple times; please look up guidance on that. Wearing a cloth mask over a disposable surgical mask is another allowed option.
Recent College communications have said that masks may be lowered when speaking in class. However, due to the safety issues involved, I need to ask that you not do that.
We will decide on class-specific rules to determine whether masks may be lowered briefly to drink or not, based on the most conservative level of comfort in the room. If you have any concerns about any of this, or questions about what was decided, please let me know. Class policy may be adjusted accordingly after the initial decision, but only in a more conservative direction. You may submit anonymous comments through the anonymous course feedback area on Moodle (and, of course, I won't retaliate or reveal your identity to anyone even if you tell me your concerns directly).
The College is not requiring or recommending any sort of physical distancing. However, in the interest of everyone's safety, I would like to ask that an effort be made. Please try to space your seating away from others as much as possible, and try to maintain reasonable distance from others.
I plan to "live-stream" every class meeting by having a computer logged in to Gather. This allows for hybrid instruction whenever needed, for example if someone is not able to physically come to class (e.g., if you are required to quarantine or isolate; see also illness sections below).
I would also like to record every class, because there may be days when some of you are not able to attend at all, even online, and will want to use the recording to review the material. We will discuss on the first day whether everyone is okay with this.
If we do decide to record classes, then these recordings may not be shared with anyone outside of this class!
Social time. The course meeting platform should allow students to join any time, even when I am not present. I encourage everyone to arrive a few minutes early if possible to just hang out and get to know your classmates better. Furthermore, the meeting is recurring (meaning we'll use the same link for all our meetings this semester), so you may use the meeting any time outside of class time as well, e.g. to discuss assignments with one another or similar.
At the end of each class, I will also wait to leave the meeting until everyone else has left. This is to encourage you to stick around if there's anything you'd like to talk about.
Communication during class. During online instruction, the Gather chat will be available for use during class. We may leverage it for certain purposes, but I don't expect most of our interactions to take place through that modality. If you'd like to speak and can't find a moment to interject into whatever is going on, please raise your hand physically so that I can see in the video, or use the Gather hand-raising feature. I may not notice either, so you may also simply interrupt if needed. There will probably be a lot of awkwardness around these issues, and that's okay.
Engagement. I expect everyone to engage with the course (see "Engagement" below), but I recognise that engagement will look different for every student. This applies regardless of modality, although the range of what engagement will look like will depend on modality. See also the section on course etiquette above.
Camera privacy. When conducting class online, no one is required to turn on their camera if they don't want to. I do hope that most of you will become comfortable turning on your cameras in most environments, and encourage you to build your comfort with this if you haven't yet. It is especially okay to disable your camera if the class is being recorded.
If you are feeling unwell, and your symptoms are consistent with COVID, please do not come to class in person. The slightest sniffle, headache, scratchy throat, etc., is a legitimate excuse for missing class, as long as you make an effort to make up any material you miss. I do not need details and will not require any sort of formal excuse for you to miss class in person. However, if you must miss class entirely (i.e., are not able to join virtually), please be in touch.
If you're unable to attend class in person and are feeling up to it, you are encouraged to attend class virtually. While this is a no-questions-asked policy, please try not to make a habit of attending virtually if instruction is otherwise in person. This option should be reserved for cases of illness or other situations out of your control.
If you must miss class, regardless of modality, you should strive to make up any material you missed. To make up material, please start by consulting with a classmate and the relevant class recording (if we decide to make recordings), and I will be available to help fill in the gaps.
I will also plan not to show up to class in person if I'm feeling any symptoms of illness, or if I have to quarantine due to an illness in my household. In this situation, if I am able, I will conduct class online. I will just ask that one of you who does show up in person log into the virtual classroom on the projector in the physical classroom (e.g., using the classroom computer) so that everyone in the classroom is able to interact with me. I may also ask Media Services to assist with this. I will try to let everyone know ahead of time so that you may choose to attend the virtual class from home if you prefer. If illness prevents me from joining even online on a given day, you should still expect notification before the beginning of class.
I will be in my office (or the Phonetics Lab) during regularly scheduled in-person office hours, but I am also happy to meet using our online classroom, either during that time or during a different time. Please first message me at my Swarthmore email address on Google Hangouts or Google Chat, and I'll let you know whether I'm available or not (no response means not available). Consider this the equivalent of knocking on my office door. I won't be sitting on Gather waiting for students to join the meeting during office hours, but if I'm available, I will get a notification if you message me on one of those services. We can also conduct the entire conversation through chat if you'd like. This is an option outside of regular office hours too, though I cannot guarantee an immediate response at other times. And as always, if regular office hours are not convenient, I'm happy to schedule a meeting at another time—just send me a message (by email or one of the chat services) and let me know what might be convenient for you.
We may revise these and other COVID-related class practices as needed and as appropriate throughout the semester. College policy may change, for example, and we can adjust accordingly. My priority in all this, again, is our collective and individual health and safety.
Keep in mind that the course schedule may be adjusted as part of this revision; I will keep you updated on all of this.
Most of your assignments will be graded with a fine-grained measure of completion and correctness based on normal letter grades and grade points (A = 4.0, B = 3.0, C = 2.0, D = 1.0, and F = 0.0), with the standard modifiers + (one-third of a grade point higher) and - (one-third of a grade point lower). In addition, intermediate grades using parentheses or a slash may be used, giving the following correspondence between letter grade and grade points:
What you will be graded on for each assignment varies in specifics, but generally it will include mastery of the relevant linguistic generalisations and interaction with the language and existing resources, mastery of the computational formalism(s), evaluation of what you did, organisation and cleanliness of your code, documentation (both on the wiki and e.g. in your README), and completion of the assignment.
The grade in this course is broken down into the following components. Each component is expounded upon following the table.
Lab assignments will be due nearly every week of class. Each assignment will be a new tool (or analysis) for the language you are working with throughout the course.
Usually at least one class session will be dedicated to working on the assignment, so you can get a head start on it, and work through any problems that might come up during the assignment.
Some labs may not be entirely applicable to all languages; these labs will include an alternate option, with data for another language provided. You may only submit this alternate assignment for credit if you've consulted with the professor first. It's your responsibility to start each assignment early enough to consult with the professor in time to do the alternative assignment if your language will not work for the assignment. Such assignments will make it clear what's necessary for you to identify in the language, and the professor is available to help you figure out how your language fits the requirements.
Your midterm demonstration will be a short presentation clearly outlining what you have developed so far in the semester, how well it performs, and some examples of one or two issues unique to your language that you find particularly interesting (whether solved yet or not). The amount of time available for this presentation will be announced ahead of time and will depend on how many students are in the class—it will probably be very short (on the order of a couple of minutes). You'll be expected to use the time efficiently and not go over. A short question-answer section may also be included.
The final project will expand your work throughout the semester into one final domain, to be chosen from among the topics discussed over the last days of class, or another topic relevant to the course.
You should consider ahead of time what you might be interested in—that may be either interesting to work on or useful for a language community—and speak with the professor about how to approach the problem. Several options will be provided which include some guidelines for how to complete them; there will be some options both for those who are less technologically adventurous but are willing to do difficult work with a language and for those who are more technologically capable but not as interested in doing linguistic analyses. If you're not sure what might be a good idea given your background and strengths in the course up to that point, please talk with the professor.
If the project involves a translation pair, then you may collaborate in groups as you did when working on translation pairs for lab assignments. Your project should include, among other things, an evaluation component (i.e., test how well what you did works), and should be released publicly with an open source license (even if not fully useful [yet]). During our final exam time, you will give a poster presentation on your project. The content of the project and the presentation will each constitute half of your final project grade. More information on the project will be provided later in the semester.
I do not grade on attendance, but you will be graded on engagement in the class, and this requires attendance. Beyond simply showing up and participating, you're encouraged to contribute to discussions by asking questions, answering prompts, making relevant comments, working with classmates on in-class activities, etc. You will not be ridiculed for asking even simple questions—I want to make sure everyone grasps the concepts, and many are not as straightforward as they may first seem (or as I think they are).
You are also expected to come to class prepared for discussion; this includes having completed readings and assignments due by class.
You are encouraged to engage in relevant discussion electronically as well—e.g., via Piazza and in issues posted on GitHub. The course will also have an IRC channel, which you're encouraged to be logged into when you can. This is a good way to get support from your classmates (and your professor!) outside of class. Just be sure not to share solutions to assignments!
Assigned readings for a class period are included in engagement. You should read the assigned reading, respond to the prompt on Piazza at least one day (24 hours) before the class meeting, and respond to two classmates' responses (with an attempt to make sure everyone's response is responded to at least once) by the class meeting.
If you believe you need accommodations for a disability or a chronic medical condition, please contact Student Disability Services via email at email@example.com to arrange an appointment to discuss your needs. As appropriate, the office will issue students with documented disabilities or medical conditions a formal Accommodations Letter. Since accommodations require early planning and are not retroactive, please contact Student Disability Services as soon as possible. For details about the accommodations process, visit the Student Disability Services website (https://www.swarthmore.edu/office-academic-success/welcome-to-student-disability-services). You are also welcome to contact me privately to discuss your academic needs. However, all disability-related accommodations must be arranged, in advance, through Student Disability Services.
|week||date||topic||to read (by class) / due (on Friday)|
|1||18 Jan|| || |
| ||20 Jan|| || |
|2||25 Jan||What (and why) is CL (and NLP)?||Long (2007) - Chilean Mapuches in language row with Microsoft|
| ||27 Jan|| ||lab 1 - documentation of resources + Initial corpus assembly|
|3||1 Feb|| || |
| ||3 Feb|| ||lab 2 - keyboard layout|
|4||8 Feb|| ||Janhunen & Gruzdeva (2016) - Bringing the orthography of an indigenous language to the digital age: The case of Nivkh in the Russian Far East|
| ||10 Feb|| ||lab 3 - Grammar documentation|
|5||15 Feb||FSTs and morphology||Bird (2009) - Natural Language Processing and Linguistic Fieldwork|
| ||17 Feb|| ||lab 4 - Basic morphological analyser|
|6||22 Feb||FSTs and phonology|| |
| ||24 Feb|| ||lab 5 - Basic morphological generator|
|7||1 Mar|| ||Moshagen & Trosterud (2019) - Rich Morphology, No Corpus – And We Still Made It. The Sámi Experience|
| ||3 Mar|| ||lab 6 - Basic CG disambiguator|
| ||8 Mar|| || |
| ||10 Mar|| || |
|8||15 Mar||Natural Language Processing methodology (guest)|| |
| ||17 Mar||Midterm project demos||midterm project demos (for class)|
|9||22 Mar||Community project presentation (guest)|| |
| ||24 Mar||Community project LAB (guest)||special lab - Community service projects|
|10||29 Mar|| ||Khanna et al. (2021) - Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages (§1, §2, §5)|
| ||31 Mar|| ||lab 7 - Lexical transfer|
|11||5 Apr||Models of development, FOSS|| |
| ||7 Apr|| ||lab 8 - Lexical selection|
|12||12 Apr|| ||Mahelona (2020) - Te reo Māori Speech Recognition: A Story of Community, Trust, and Sovereignty|
| ||14 Apr|| ||lab 9 - Contrastive grammar|
|13||19 Apr|| ||Romero (2016) - Bill Gates speaks Kʼicheeʼ! The corporatization of linguistic revitalization in Guatemala|
| ||21 Apr|| ||lab 10 - Structural transfer|
|14||26 Apr|| || |
| ||28 Apr|| ||lab 11 - Polished basic RBMT system|