Week 8

Goals

The main goals for Week 8 are to:

  • Understand the proof that Regular Expressions are equivalent in power to Deterministic Finite Automata. This is a long proof, split over 4 videos, but is something we hope everyone understands in detail. The methods used in the proof are similar to ones we have seen before in showing equivalence of NAND-Circ and Boolean circuits, but is a fair bit more complex. This general method of showing equivalence is something we will see again in this class, and you will encounter many more times in other classes and work that you will do.

  • Become familiar with the power and strangeness of nondeterminism. For finite automata, we prove this week that despite is seeming super-power, nondeterministic finite automata cannot compute any functions (or recognize any languages) that cannot be computed by deterministic finite automata. The main topic for Weeks 8-11 will be considering this question for more powerul computing models (whi

Schedule

Day “Monday” Cohort “Tuesday” Cohort “Wednesday” Cohort “Thursday” Cohort “Friday” Cohort
Wed 13 Oct Preparation (Week 7) (Week 7) (Week 7) (Week 7)
Thurs 14 Oct Preparation Preparation (Week 7) (Week 7) (Week 7)
Fri 15 Oct Preparation Preparation Preparation (Week 7) (Week 7)
Sat 16 Oct Prep Cohort Meeting Preparation Preparation Preparation (Week 7)
Sun 17 Oct Revision Prep Cohort Meeting Preparation Preparation Preparation
Mon 18 Oct Assessed Meeting Revision Prep Cohort Meeting Preparation Preparation
Tues 19 Oct Writeup Due Assessed Meeting Revision Prep Cohort Meeting Preparation
Wed 20 Oct (Week 9) Writeup Due Assessed Meeting Revision Prep Cohort Meeting
Thurs 21 Oct (Week 9) (Week 9) Writeup Due Assessed Meeting Revision
Fri 22 Oct (Week 9) (Week 9) (Week 9) Writeup Due Assessed Meeting
Sat 23 Oct (Week 9) (Week 9) (Week 9) (Week 9) Writeup Due

Cohort Problems

These are the problems you should discuss in your Cohort Meeting, and everyone in your cohort should be prepared to present and discuss solutions to at the Assessed Cohort Meeting:

The problems are posted here and we think its a good idea to look at them early, but you’re not expected to be able to solve them until after doing the readings and watching the videos below.

Programming

Before you begin on this assignment, you will need to download three python files and place them in the same project:

The NFA program implements nfa simulation and various closure constructions. The regex program defines a class that reads regular expressions and converts them to NFAs. The week8 program contains example uses of the NFA and regex functionality, and also contains the stub for the huntingtons function you must implement.

You’ll need to use the following functions (example usage of all of them are in week8.py):

  • regex.regex_to_NFA(the_regex): If the_regex is a string, the function regex_to_NFA found in regex.py returns and NFA that computes the language indicated by the_regex
  • the_nfa.execute(the_string): If the_nfa is an instance of the NFA class defined in NFA.py, then invoking the class method execute on string the_string will return a Boolean value indicating whether or not the_string belonged to the language of the NFA.
  • the_nfa.toDot(): If the_nfa is an instance of the NFA class defined in NFA.py, then invoking the class method toDot() will result in a dot (which is a graph specification language) description of the NFA being printed to standard out. You can then copy-paste that description here to see the actual NFA: https://dreampuf.github.io/GraphvizOnline/

What to do

First, read through all of week8.py and follow the instructions therein to make sure you understand what’s going on. After this you’ll implement the huntingtons function.

For the huntingtons function you’re asked to implement, the idea is for you to use regular expressions to diagnose whether a person has Huntington’s disease by looking at their genome.

A DNA sequence is a string of the characters a, t, g, c, representing the sequence of nucleotides that comprise an individual’s DNA molecules. Very often, the human genome has regions where the same nucleotide sequences repeat many times. Sometimes, certain numbers of these repeats will result in a genetic disorder.

Huntington’s Disease is caused by having too many consecutive copies of the sequence “cag”. By looking at a DNA sequence, we can categorize how a person might be affected by Huntington’s in the following way:

  • An individual with less than 26 sequential repeats of “cag” in their genome is considered to be “normal”.
  • An individual with between 26 and 35 repeats of “cag” is considered to be a “carrier”, and may give the disease to their children.
  • An individual with between 36 and 39 repeats is said to be “at risk”, and may or may not ever show symptoms.
  • An individual with 40 or more repeats is said to be “affected”, and will eventually show symptoms of the disease.

Your huntingtons function should take a DNA sequence (a string from the alphabet a,g,t,c) and determine the classification for the individual (i.e. normal, carrier, at risk, or affected). To do this, define a regex for each category, convert each to a nfa, then check which category the given DNA sequence falls into. Once you have determined the category, the function should return the appropriate string of: “normal”, “carrier”, “at risk”, or “affected”

Note that the DNA sequence may have characters before or after the “cag” repeats.

Source describing Huntingtons.

Reading

This week continues with the material in Chapter 6: Functions with Infinite domains, Automata, and Regular expressions that we started on last week.

Section 6.4.2 of the book includes a proof that regular expressions are equivalent in power to deterministic finite automata, but it is quite different from the proof we do in the videos. Ambitious students are encouraged to understand both proofs, and to consider the advantages and disadvantages of the approach taken in the book compared to our approach (which is the one done more traditionally in theory of computation courses, at least going back to Mike Sipser’s courses in the 1980s).

There are many books (and other videos) that do present material similar to what is in the videos this week, but we are not assigning any additional readings.

Videos

You can play all the videos using this playlist, but don’t forget to take breaks: Week 8 Playlist (Note that this content was week 7 in prior semesters).

These videos are edited from these cs3102 classes (we don’t generally recommend watching the unedited versions, but they are available if you want to):

Nondeterministic Nathans (1:23)

Warm-Up: Complementing the XOR Language (5:54)

Equivalence of Regular Expressions and FSAs (6:12)

Proving FSAs are as Powerful as Regular Expressions (Part 1: Proof Strategy) (4:35)

Proving FSAs are as Powerful as Regular Expressions (Part 2: Bases, Union) (22:55)

Nondeterminism (11:14)

Nondeterministic Finite Automata (NFAs) (13:48)

Equivalence of NFAs and DFAs (10:01)

Proving FSAs are as Powerful as Regular Expressions (Part 3: Concatenation) (14:58)

Proving FSAs are as Powerful as Regular Expressions (Part 4: Kleene Star) (8:21)