Jekyll2021-10-03T14:27:26-07:00https://hunglvosu.github.io/feed.xmlHung Le / Home Pagepersonal descriptionHung LeCMPSCI 611 : Advanced Algorithms2021-03-18T00:00:00-07:002021-03-18T00:00:00-07:00https://hunglvosu.github.io/posts/2021/03/Syllabus-Algs<p><strong>Objectives</strong>: This course provides students with skills in designing efficient algorithms. After completing this course, students are expected to be able to formulate an algorithmic problem, design an algorithm for the problem, prove the correctness, and analyze the running time. This course will illustrate these skills through various algorithmic problems and important design techniques.</p>
<p><strong>Prerequisites</strong>: Students are expected to have mathematical maturity and knowledge of COMPSCI 311 or equivalent.</p>
<p><strong>Location</strong>: Agricultural Engineering Building, Room 119.</p>
<p><strong>Teaching Staff</strong>:</p>
<ul>
<li>Instructor: Hung Le.
<ul>
<li>Email: hungle@cs.umass.edu</li>
<li>Office: 332 CS Building</li>
<li>Weekly Office Hours: Monday 11am-12pm, Friday 3pm-4pm.</li>
</ul>
<p>If my office hours do not work for you and you want to see me, you can either talk to me right after class (preferred) or set up an appointment by email.</p>
</li>
<li>
<p>TAs:</p>
<ul>
<li>Hamid Mozaffari (hamid@cs.umass.edu), office hours: TBA</li>
</ul>
</li>
<li>Graders:
<ul>
<li>Fenil Manish Doshi (fdoshi@umass.edu)</li>
<li>Shanmukh Swaroop Srinivas (shanmukhswar@umass.edu)</li>
<li>Divya Katkam (dkatkam@umass.edu)</li>
<li>Mohith Akhilesh Dhulipalla (mdhulipalla@umass.edu)</li>
</ul>
</li>
</ul>
<p><strong>Grading</strong></p>
<ul>
<li>Homework (40%): Homework is <strong>bi-weekly</strong> and includes 6 assignments. The lowest-scoring assignment will be weighted at only 50%.</li>
<li>Weekly Quizzes (10%): We will have 11 quizzes in total, and the lowest quiz score will be dropped.</li>
<li>Midterm 1 (15%): <strong>Thu, Oct 07</strong>. Midterm 1 will cover divide and conquer, greedy algorithms, and dynamic programming.</li>
<li>Midterm 2 (15%): <strong>Tue, Nov 16</strong>. Midterm 2 will cover randomized algorithms, network flow, and linear programming.</li>
<li>Final (20%): Scheduled by the university and will be comprehensive.</li>
</ul>
<p><strong>Academic Honesty and Collaboration Policy:</strong></p>
<ul>
<li>You must do exams and quizzes on your own. No collaboration is allowed.</li>
<li>You may collaborate with <strong>at most 2 other students</strong> on homework. You must name everyone you collaborated with in your submissions. Collaboration is <strong>verbal</strong> only. The write-up must be your own. You are NOT allowed to talk about the homework with anyone outside your group (except the TAs and the instructor). You are NOT allowed to consult any material on the Internet to do your homework.</li>
<li>You are allowed to bring at most 2 A4 pages of cheat sheets to the exams. NO other materials are allowed.</li>
<li>DO ask if you have any questions regarding academic honesty.</li>
</ul>
<p>As members of the College of Information and Computer Sciences at UMass Amherst, we expect everyone to behave responsibly and honorably. In particular, we expect each of you not to give, receive, or use aid in examinations, nor to give, receive, or use unpermitted aid in any academic work. Doing your part in observing this code, and ensuring that others do likewise is essential for having a community of respect, integrity, fairness, and trust.
If you cheat in a course, you are taking away from your own opportunity to learn and develop as a professional. You also hurt your colleagues, and this will hurt people you will work with in the future, who expect an honest and responsible professional.</p>
<p>As faculty, we pledge to use academic policies designed for fairness, avoiding situations that are conducive to violating academic honesty, as well as unreasonable or unusual procedures that assume dishonesty. We will follow the university’s <a href="https://www.umass.edu/honesty/">Academic Honesty Policy and Procedures</a>. This means we will report instances of dishonesty, which may lead to formal sanction and/or failing the course.</p>
<p><strong>Late Policy:</strong> You have <strong>one late day</strong> to use on any HW of your choice. Otherwise, late submissions will not be graded unless you have a good medical reason. Try your best to honor the deadlines.</p>
<p><strong>Posting Policy:</strong> You are not allowed to post any material in this course to public websites without the permission of the instructor.</p>
<p><strong>Tentative topics:</strong></p>
<ul>
<li>Divide and Conquer (3 lectures)</li>
<li>Dynamic Programming (3 lectures)</li>
<li>Greedy Algorithms (3 lectures)</li>
<li>Randomized Algorithms (3 lectures)</li>
<li>Network Flow (3 lectures)</li>
<li>Linear Programming (3 lectures)</li>
<li>NP-Completeness (2 lectures)</li>
<li>Approximation Algorithms (3 lectures)</li>
</ul>
<p><strong>Required Textbook:</strong> Lectures will be based on <a href="http://jeffe.cs.illinois.edu/teaching/algorithms/">Jeff Erickson’s notes</a>. Slides will be posted on Moodle.</p>
<p><strong>Optional Textbooks:</strong></p>
<ul>
<li>Introduction to Algorithms by Cormen, Leiserson, Rivest, and Stein.</li>
<li>Algorithm Design by Kleinberg and Tardos (KT).</li>
<li>Algorithms by Dasgupta, Papadimitriou, Vazirani (DPV).</li>
<li>Randomized Algorithms by Motwani and Raghavan (MR).</li>
<li>Probability and Computing by Mitzenmacher and Upfal (MU).</li>
<li>Approximation Algorithms by Vazirani.</li>
</ul>
<p><strong>Schedule:</strong></p>
<p>The following schedule is tentative and subject to change.</p>
<table>
<thead>
<tr>
<th>Date</th>
<th>Topics</th>
<th>Readings</th>
</tr>
</thead>
<tbody>
<tr>
<td>02 Sept</td>
<td>Intro, Master theorem, Mergesort</td>
<td><a href="http://jeffe.cs.illinois.edu/teaching/algorithms/book/01-recursion.pdf">Erickson’s note on recursion</a></td>
</tr>
<tr>
<td>07 Sept</td>
<td>Closest Pair, Matrix Multiplication</td>
<td><a href="https://people.eecs.berkeley.edu/~vazirani/algorithms/chap2.pdf">DPV’s chapter 2</a></td>
</tr>
<tr>
<td>09 Sept</td>
<td>Fast Fourier Transform</td>
<td><a href="http://jeffe.cs.illinois.edu/teaching/algorithms/notes/A-fft.pdf">Erickson’s note on FFT</a></td>
</tr>
<tr>
<td>14 Sept</td>
<td>Intro Greedy, Job Scheduling</td>
<td><a href="http://jeffe.cs.illinois.edu/teaching/algorithms/book/04-greedy.pdf">Erickson’s note on greedy algs</a></td>
</tr>
<tr>
<td>16 Sept</td>
<td>Minimum Spanning Tree</td>
<td><a href="http://jeffe.cs.illinois.edu/teaching/algorithms/book/07-mst.pdf">Erickson’s note on MST</a></td>
</tr>
<tr>
<td>21 Sept</td>
<td>Matroid</td>
<td><a href="http://jeffe.cs.illinois.edu/teaching/algorithms/notes/E-matroids.pdf">Erickson’s note on matroid</a></td>
</tr>
<tr>
<td>23 Sept</td>
<td>Subset Sum, Optimal BST</td>
<td><a href="http://jeffe.cs.illinois.edu/teaching/algorithms/book/03-dynprog.pdf">Erickson’s note on DP</a></td>
</tr>
<tr>
<td>28 Sept</td>
<td>SSSP and APSP</td>
<td><a href="http://jeffe.cs.illinois.edu/teaching/algorithms/book/08-sssp.pdf">Erickson’s note on SSSP</a> and <a href="http://jeffe.cs.illinois.edu/teaching/algorithms/book/09-apsp.pdf">APSP</a></td>
</tr>
<tr>
<td>30 Sept</td>
<td>TSP and Independent Set on Trees</td>
<td><a href="https://people.eecs.berkeley.edu/~vazirani/algorithms/chap6.pdf">DPV’s chapter 6</a> and <a href="http://jeffe.cs.illinois.edu/teaching/algorithms/book/03-dynprog.pdf">Erickson’s note on DP</a></td>
</tr>
<tr>
<td>05 Oct</td>
<td>Nuts and Bolts, Quicksort</td>
<td><a href="http://jeffe.cs.illinois.edu/teaching/algorithms/notes/02-nutsbolts.pdf">Erickson’s note on Randomized Algs</a></td>
</tr>
<tr>
<td>07 Oct</td>
<td>Midterm 1</td>
<td>Covering D&C, DP, and Greedy</td>
</tr>
<tr>
<td>12 Oct</td>
<td>Balls and Bins, Chernoff’s Bounds</td>
<td><a href="http://jeffe.cs.illinois.edu/teaching/algorithms/notes/05-hashing.pdf">Erickson’s note on Hashing</a></td>
</tr>
<tr>
<td>14 Oct</td>
<td>Bloom Filter</td>
<td><a href="http://jeffe.cs.illinois.edu/teaching/algorithms/notes/06-bloom.pdf">Erickson’s note on filtering and streaming</a></td>
</tr>
<tr>
<td>19 Oct</td>
<td>Maxflow-Mincut</td>
<td><a href="http://jeffe.cs.illinois.edu/teaching/algorithms/book/10-maxflow.pdf">Erickson’s note on Maxflow</a></td>
</tr>
<tr>
<td>21 Oct</td>
<td>Applications of Maxflow</td>
<td><a href="http://jeffe.cs.illinois.edu/teaching/algorithms/book/11-maxflowapps.pdf">Erickson’s note on Applications of Maxflow</a></td>
</tr>
<tr>
<td>26 Oct</td>
<td>Maxflow in Strongly PolyTime</td>
<td><a href="http://jeffe.cs.illinois.edu/teaching/algorithms/book/10-maxflow.pdf">Erickson’s note on Maxflow</a></td>
</tr>
<tr>
<td>28 Oct</td>
<td>Introduction to Linear Programming</td>
<td><a href="http://jeffe.cs.illinois.edu/teaching/algorithms/notes/H-lp.pdf">Erickson’s note on LP</a></td>
</tr>
<tr>
<td>02 Nov</td>
<td>LP Duality</td>
<td><a href="http://jeffe.cs.illinois.edu/teaching/algorithms/notes/H-lp.pdf">Erickson’s note on LP</a></td>
</tr>
<tr>
<td>04 Nov</td>
<td>Simplex Algorithm</td>
<td><a href="http://jeffe.cs.illinois.edu/teaching/algorithms/notes/I-simplex.pdf">Erickson’s note on Simplex Algorithm</a></td>
</tr>
<tr>
<td>09 Nov</td>
<td>P vs NP</td>
<td><a href="http://jeffe.cs.illinois.edu/teaching/algorithms/book/12-nphard.pdf">Erickson’s note on NP-hardness</a></td>
</tr>
<tr>
<td>11 Nov</td>
<td>Veterans Day</td>
<td> </td>
</tr>
<tr>
<td>16 Nov</td>
<td>Midterm 2</td>
<td>Covering Randomized Algorithms, Maxflow, and LP</td>
</tr>
<tr>
<td>18 Nov</td>
<td>NP-complete Problems</td>
<td><a href="http://jeffe.cs.illinois.edu/teaching/algorithms/book/12-nphard.pdf">Erickson’s note on NP-hardness</a></td>
</tr>
<tr>
<td>23 Nov</td>
<td>Vertex Cover, Set Cover</td>
<td><a href="http://jeffe.cs.illinois.edu/teaching/algorithms/notes/J-approx.pdf">Erickson’s note on approximation algorithms</a></td>
</tr>
<tr>
<td>25 Nov</td>
<td>Thanksgiving</td>
<td> </td>
</tr>
<tr>
<td>30 Nov</td>
<td>TSP, $k$-Center</td>
<td><a href="http://jeffe.cs.illinois.edu/teaching/algorithms/notes/J-approx.pdf">Erickson’s note on approximation algorithms</a></td>
</tr>
<tr>
<td>02 Dec</td>
<td>Subset Sum</td>
<td><a href="http://jeffe.cs.illinois.edu/teaching/algorithms/notes/J-approx.pdf">Erickson’s note on approximation algorithms</a></td>
</tr>
<tr>
<td>07 Dec</td>
<td>Review</td>
<td> </td>
</tr>
<tr>
<td>10 Dec - 16 Dec</td>
<td>Final Exam (exact date will be announced later)</td>
<td>Covering everything</td>
</tr>
</tbody>
</table>
<p><strong>Platforms:</strong> We will use Moodle for general logistics, Campuswire for discussion, and Gradescope for homework assignments.</p>
<p><strong>Equity and Inclusion Statement</strong>: We are committed to fostering a culture of diversity and inclusion, where everyone is treated with dignity and respect. This course is for everyone. This course is for you, regardless of your age, background, citizenship, disability, sex, education, ethnicity, family status, gender, gender identity, geographical origin, language, military experience, political views, race, religion, sexual orientation, socioeconomic status, or work experience. Because of that, we should realize that we will be bringing different skills to the course, and we will all be learning from and with each other. We may have different backgrounds and skills in courses taken, mathematical, algorithmic, coding or testing background, ways to communicate orally and in writing, working alone or in groups, or plans for professional careers.</p>
<p>Please be kind and courteous. There’s no need to be mean or rude. Respect that people have differences of opinion, and work and approach problems differently. There is seldom a single right answer to complicated questions. Please keep unstructured critique to a minimum; any criticism should be constructive.</p>
<p>Disruptive behavior is not welcome, and insulting, demeaning, or harassing anyone is unacceptable. In particular, we don’t tolerate behavior that excludes people in socially marginalized groups. If you feel you have been or are being harassed or made uncomfortable by someone in this class, please contact a member of the course staff immediately, or if you feel uncomfortable doing so, contact the Dean of Students office.</p>
<p>This course is for all of us. We will all learn from each other. Welcome!</p>
<p><strong>Accommodations for Disabilities</strong>: The University of Massachusetts Amherst is committed to making reasonable, effective and appropriate accommodations to meet the needs of students with disabilities and help create a barrier-free campus. If you have a disability and require accommodations, please register with Disability Services, located in 161 Whitmore Hall, (413) 545-0892, to have an accommodation letter sent to your faculty. Information on services and materials for registering is available on the <a href="https://www.umass.edu/disability/">University of Massachusetts Amherst Disability Services</a> page.</p>Hung LeProspective students2020-09-01T00:00:00-07:002020-09-01T00:00:00-07:00https://hunglvosu.github.io/posts/2020/09/ProsStudents<p>Thank you for your interest in working with me. I enjoy working with students. And yes, <strong>I am looking for PhD students</strong> starting from Fall 2021 (application due by December 15, 2020). If you are interested, please get in touch with me <strong>after</strong> you complete your application. Please check <a href="https://www.cics.umass.edu/admissions/application-instructions">here</a> for the general requirements; the GRE is NOT required for PhD admission. A good math background will be appreciated. Note that I cannot answer questions regarding your chances of being admitted, so please do not ask.</p>Hung LeProgramming Assignment 1 Instructions2020-07-16T00:00:00-07:002020-07-16T00:00:00-07:00https://hunglvosu.github.io/posts/2020/07/PA1<p><strong>Due by Jan 28, 2019 11:55 pm</strong></p>
<p><strong>Note</strong> <a href="https://hunglvosu.github.io/res/HW1.pdf">written homework 1</a> is up.</p>
<h1 id="problem-specification">Problem Specification</h1>
<p><strong>Goal:</strong> In this assignment, we will apply the locality sensitive hashing technique learned in the lecture to a question dataset. The goal is: <b>for each</b> question X, find a set of questions Y in the data set such that Sim(X,Y) ⩾ 0.6, where the similarity is Jaccard.</p>
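<p>As a quick illustration of the similarity measure named in the goal, here is a minimal sketch of Jaccard similarity on word sets. The function name <code>jaccard</code>, the treatment of two empty sets, and the two example questions are assumptions made for illustration; they are not part of the assignment data.</p>

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two iterables, treated as sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # convention chosen for this sketch: two empty sets are identical
    return len(set_a & set_b) / len(set_a | set_b)


# Two made-up questions, split into words
x = "how do i learn python quickly".split()
y = "how do i learn python fast".split()

# 5 shared words out of 7 distinct words
print(jaccard(x, y))
```

<p>With these two questions the similarity is 5/7, so they would count as similar under the 0.6 threshold.</p>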
<p><strong>Input Format:</strong> The datasets are given in tsv (tab-separated) format. The file contains two columns: <font face="Courier" size="3" color="blue">qid</font> and <font face="Courier" size="3" color="blue">question</font>. The four datasets, provided in a single zip-compressed file, are:</p>
<ol> <li><font face="Verdana,Arial,Helvetica" size="3" color="blue">question_4k.tsv</font>: This dataset contains 4,000 questions.</li>
<li><font face="Verdana,Arial,Helvetica" size="3" color="blue">question_50k.tsv</font>: This dataset contains 50,000 questions.</li>
<li><font face="Verdana,Arial,Helvetica" size="3" color="blue">question_150k.tsv</font>: This dataset contains 150,000 questions.</li>
<li><font face="Verdana,Arial,Helvetica" size="3" color="blue">question_290k.tsv</font>: This dataset contains 290,000 questions.</li>
</ol>
<p>The dataset can be downloaded from <a href="https://hunglvosu.github.io/res/question-dataset.zip">here</a>.</p>
<p><strong>Output Format:</strong> The output must be given in tsv format, with two columns: <font face="Courier" size="3" color="blue">qid</font> and <font face="Courier" size="3" color="blue">similar-qids</font>, where <font face="Courier" size="3" color="blue">qid</font> is the qid of the queried question and <font face="Courier" size="3" color="blue">similar-qids</font> is the set of similar questions given by their qids. The column <font face="Courier" size="3" color="blue">similar-qids</font> is comma-separated. If a question has no similar question, then this column is empty. Below is an example of the output format:
<br />
<br /></p>
<table border="0">
<tr>
<th>qid</th>
<th>similar-qids</th>
</tr>
<tr>
<td>11</td>
<td></td>
</tr>
<tr>
<td>13</td>
<td>145970</td>
</tr>
<tr>
<td>15</td>
<td>229098,280602,6603,204128,164826,238609,65667,139632,265843,143673,217736,38330</td>
</tr>
</table>
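<p>The output layout shown in the table can be produced with a few lines of Python. This is only a sketch: the helper name <code>write_similar</code> is hypothetical, and writing a header row is an assumption based on the table above, so compare against the sample output file before relying on it.</p>

```python
import csv
import io


def write_similar(rows, out):
    """Write (qid, iterable of similar qids) pairs as TSV, one question per row."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["qid", "similar-qids"])  # header row: an assumption
    for qid, similar in rows:
        # similar-qids is a comma-separated list; empty if there are none
        writer.writerow([qid, ",".join(str(s) for s in similar)])


# In-memory buffer used here instead of a real file such as question_sim_4k.tsv
buffer = io.StringIO()
write_similar([(11, []), (13, [145970]), (15, [229098, 280602])], buffer)
print(buffer.getvalue())
```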
<p><br />
The way to interpret the above sample output is: the question of qid 11 has no similar question, the question of qid 13 has 1 similar question of qid 145970 and the question of qid 15 has 12 similar questions. You can download a sample output tsv file <a href="https://hunglvosu.github.io/res/sample_output.tsv">here</a>. The name of the output file must be <font face="Courier" size="3" color="blue">question_sim_[*].tsv</font> where [*] is replaced by the size of the dataset. For example, the output of the 4k question data set must be <font face="Courier" size="3" color="blue">question_sim_4k.tsv</font>.
<br />
<br />
There are two questions in this assignment. The first question is worth 15 points and the second question is worth 35 points, for a total of 50 points.
<br />
<br />
<b>Question 1 (15 points): </b> Implement the naive algorithm that, for each question, loops through the dataset, computes the Jaccard similarity, and outputs the questions with similarity at least 0.6. For full score, your algorithm must run in <b>less than 3 minutes</b> on the dataset <font face="Verdana,Arial,Helvetica" size="3" color="blue">question_4k.tsv</font>.
<br />
<br />
<b>Question 2 (35 points):</b> Implement the locality sensitive hashing algorithm we learned in the class, with x = 0.6, s = 14 and r = 6, where s is the number of hash tables (we use b instead in the lecture slide) and r is the size of the minhash signature. For full score, your algorithm must run in <b>less than 10 minutes</b> on the dataset <font face="Verdana,Arial,Helvetica" size="3" color="blue">question_150k.tsv</font>.
<br />
<br />
<b>Note 1:</b> As you may recall from the lecture, two non-similar questions can be mapped to the same location in the locality sensitive data structure. This is called a false positive. You must remove all false positives before writing to the output file.
<br />
<br />
<b>Note 2:</b> Submit your code and output data to Connex.</p>
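<p>To get a feel for the parameters s = 14 and r = 6, the sketch below computes how likely two questions are to land in the same bucket of at least one hash table. It assumes the standard analysis: in a single table keyed by r minhash values, two questions with Jaccard similarity x collide with probability x<sup>r</sup>, and the s tables are independent. Whether the lecture uses exactly this scheme is an assumption, and the function name is hypothetical.</p>

```python
def candidate_probability(x, r, s):
    """Probability that two sets with Jaccard similarity x share a bucket
    in at least one of s tables, each keyed by r minhash values."""
    return 1.0 - (1.0 - x ** r) ** s


# Similarity values chosen only for illustration
for sim in (0.3, 0.6, 0.8, 0.9):
    print(sim, round(candidate_probability(sim, r=6, s=14), 3))
```

<p>Under these assumptions, a pair sitting right at the 0.6 threshold becomes a candidate only about half of the time, while dissimilar pairs occasionally collide as well. The latter are the false positives that Note 1 asks you to filter out with an exact Jaccard check.</p>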
<h1 id="faq">FAQ</h1>
<p><br />
<b>Q1:</b> Will 50k and 290k question datasets be graded?
<br />
<b>Answer:</b> No. They are provided for learning purposes.
<br />
<br />
<b>Q2:</b> How can we generate a random number in Python3?
<br />
<b>Answer:</b> <a href="https://hunglvosu.github.io/res/random-gen.py">Here</a> is the example code that I use for generating a random 64-bit integer in my implementation.
<br />
<br />
<b>Q3:</b> What kind of hash function do you recommend for computing the minHash signature?
<br />
<b>Answer:</b> In my implementation, I use the linear hash function h(x) = (a*x + b) mod p, where a and b are two random 64-bit integers and p is a 64-bit prime integer. I set p = 15373875993579943603 for all hash functions.</p>
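<p>The linear hash function from Q3 can be turned into a minhash signature roughly as follows. This is a minimal sketch, not the instructor's implementation: the helper names, the fixed seed, and the integer word ids are assumptions made for illustration, and the modulus is simply the value of p quoted above.</p>

```python
import random

P = 15373875993579943603  # the modulus p quoted in the answer to Q3


def make_hash_functions(count, seed=0):
    """Draw (a, b) pairs of random 64-bit integers, one pair per hash function."""
    rng = random.Random(seed)  # seeded only so that the sketch is reproducible
    return [(rng.getrandbits(64), rng.getrandbits(64)) for _ in range(count)]


def minhash_signature(word_ids, hash_functions):
    """For each h(x) = (a*x + b) mod p, keep the minimum over a non-empty set."""
    return tuple(min((a * x + b) % P for x in word_ids) for a, b in hash_functions)


funcs = make_hash_functions(6)  # r = 6 values per signature
sig_1 = minhash_signature({3, 17, 42}, funcs)
sig_2 = minhash_signature({42, 3, 17}, funcs)
print(sig_1 == sig_2)  # True: the same set always yields the same signature
```

<p>In an LSH implementation, a signature like this would serve as the key into one hash table, with an independent set of r hash functions drawn for each of the s tables.</p>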
<p><br />
<br />
<b>Q4:</b> How can I map a string (and a word specifically for this homework) to an integer so that I can feed it to the linear hash function in Q3?
<br />
<b>Answer:</b> I recommend the FNV hash function. You can download and install it by following the instructions <a href="https://pypi.org/project/fnv/">here</a>. However, I use this library in a slightly different way. Here are the steps: I download the library, look for the file named “__init__.py” in the downloaded package, rename it to <a href="https://hunglvosu.github.io/res/fnv.py">“fnv.py”</a>, put it in the source code folder, and import it into my code. Here is <a href="https://hunglvosu.github.io/res/fnv-example.py">an example</a> of how to import it. You may notice that there are three different hash functions in the example. I use the function <font face="Courier" size="3" color="blue">hash(data, bits=64)</font> in my implementation.
<br />
<br />
<b>Q5:</b> If I don’t use Python, where can I find an implementation of the FNV function in other languages?
<br />
<b>Answer:</b> You can visit <a href="http://isthe.com/chongo/tech/comp/fnv/">this site</a>. It might have what you want.
<br />
<br />
<b>Q6:</b> Do you apply any advanced processing technique to normalize the datasets?
<br />
<b>Answer:</b> I don’t. I want to keep the implementation as simple as possible for learning purposes. I do use <font face="Courier" size="3" color="blue">question.strip()</font> to remove any leading and trailing whitespace from each question. Then I just use the Python3 split function <font face="Courier" size="3" color="blue">question.split()</font> to break a question into words. You may notice that in this implementation, “what” and “What” are regarded as different words because I do not handle capitalization. You are welcome to use any technique that helps you improve the correctness of your algorithm, but keep the running time constraint in mind.
<br /></p>
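<p>The preprocessing described in Q4 and Q6 can be sketched end to end as below. To keep the example self-contained it uses a hand-written 64-bit FNV-1a function instead of the <code>fnv</code> package; whether the package's <code>hash(data, bits=64)</code> call produces the same values is not verified here. The helper names and the example question are assumptions made for illustration.</p>

```python
FNV_OFFSET_64 = 0xcbf29ce484222325  # standard 64-bit FNV offset basis
FNV_PRIME_64 = 0x100000001b3        # standard 64-bit FNV prime


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR in each byte, then multiply by the FNV prime."""
    h = FNV_OFFSET_64
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_64) % (1 << 64)
    return h


def question_to_ids(question):
    """Strip the question, split it on whitespace, and hash each word."""
    words = question.strip().split()
    return {fnv1a_64(w.encode("utf-8")) for w in words}


# "What" and "what?" stay distinct: no case folding or punctuation handling
ids = question_to_ids("  What is what?\n")
print(len(ids))
```

<p>The resulting set of integers is the kind of input that the linear hash function in Q3 expects.</p>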
<p><br />
<br />
<b>Q7:</b> If the outputs of my implementation and another group’s implementation are different, is this a problem?
<br />
<b>Answer:</b> No. Because of the randomness inherent in locality sensitive hashing, I expect differences in the output. The assignment will mainly be graded on speed and on your understanding of the algorithm as reflected in your code. And don’t forget the discussion policy that I specified in class.
<br />
<br />
<br /></p>Hung LeProgramming Assignment 2 Instructions2020-07-16T00:00:00-07:002020-07-16T00:00:00-07:00https://hunglvosu.github.io/posts/2020/07/PA2<p><strong>Due by February 11, 2019 11:55 pm</strong></p>
<p><strong>Please note that <a href="https://hunglvosu.github.io/res/HW2.pdf">written homework 2 is up</a>.</strong></p>
<h1 id="problem-specification">Problem Specification</h1>
<p><strong>Goal:</strong> In this assignment, we will experiment with three different algorithms to train linear regression models: solving the normal equations, batch gradient descent, and stochastic gradient descent.
<br />
<br />
<b>Input Format:</b> The datasets are given in tsv (tab-separated) format. The file format is:</p>
<ul> <li>1st row: the number of data points N.</li>
<li>2nd row: the number of features D.</li>
<li>3rd row: the first column is the label, and following columns are feature names.</li>
<li>N following rows: each has (D+1) columns, where the first column is the label and the following D columns are features.</li>
</ul>
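<p>The file layout listed above can be read with plain Python along these lines. This is a sketch under stated assumptions: the function name <code>read_dataset</code> is hypothetical, and the tiny in-memory sample (N = 2, D = 2) is made up for illustration rather than taken from the assignment data.</p>

```python
import io


def read_dataset(handle):
    """Parse the layout: N, D, a header row, then N rows of label + D features."""
    n = int(handle.readline().split("\t")[0])
    d = int(handle.readline().split("\t")[0])
    header = handle.readline().rstrip("\n").split("\t")
    labels, features = [], []
    for _ in range(n):
        fields = handle.readline().rstrip("\n").split("\t")
        labels.append(float(fields[0]))                        # first column: label
        features.append([float(v) for v in fields[1:d + 1]])   # next D columns
    return header[1:], labels, features


# Made-up sample with N = 2 data points and D = 2 features
sample = "2\n2\ny\tf1\tf2\n1.5\t0.1\t0.2\n-3.0\t4.0\t5.0\n"
names, y, X = read_dataset(io.StringIO(sample))
print(names, y, X)
```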
<p>An example file can be found <a href="https://hunglvosu.github.io/res/pa2-sample.tsv">here</a>. There are two datasets that we will work with in this assignment.</p>
<ol> <li><font face="Verdana,Arial,Helvetica" size="3" color="blue">data_10k_100.tsv</font>: This dataset contains 10,000 points, each with 100 features.</li>
<li><font face="Verdana,Arial,Helvetica" size="3" color="blue">data_100k_300.tsv</font>: This dataset contains 100,000 points, each with 300 features.</li>
</ol>
<p>The dataset can be downloaded from <a href="https://hunglvosu.github.io/res/pa2-data.zip">here</a>.
<br />
<br />
<b>Output Format:</b> output must be given in tsv format, with (D+1) columns and two rows:</p>
<ul> <li>The first row is the coefficient names of the linear regression model. The first D columns contain <font face="Courier" size="3" color="blue">w1</font>, <font face="Courier" size="3" color="blue">w2</font> up to <font face="Courier" size="3" color="blue">wD</font>, where <font face="Courier" size="3" color="blue">wi</font> is the coefficient of the i-th feature. The bias term, named <font face="Courier" size="3" color="blue">w0</font>, is in the last column. </li>
<li>The second row contains the values of the corresponding coefficients of the regression model.</li>
</ul>
<p>The sample output for the sample dataset above can be downloaded <a href="https://hunglvosu.github.io/res/pa2-sample_model.tsv">here</a>.</p>
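<p>The two-row output format described above could be written as follows. This is only a sketch: the helper name <code>write_model</code> and the example coefficients are assumptions, and the number formatting is a choice made here, so check the sample model file for the expected precision.</p>

```python
import io


def write_model(weights, bias, out):
    """Write the names w1..wD, w0 on the first row and their values on the second."""
    d = len(weights)
    names = [f"w{i}" for i in range(1, d + 1)] + ["w0"]  # bias w0 goes last
    values = [repr(float(v)) for v in weights] + [repr(float(bias))]
    out.write("\t".join(names) + "\n")
    out.write("\t".join(values) + "\n")


# Made-up model with D = 2 features
buffer = io.StringIO()
write_model([0.5, -1.25], 2.0, buffer)
print(buffer.getvalue())
```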
<p><br />
There are three questions in this assignment. The first and second questions are worth 10 points each, and the third question is worth 30 points, for a total of 50 points.
<br />
<br />
<b>Question 1 (10 points): </b> Implement the algorithm that solves the normal equation to learn linear regression models. For full score, your algorithm must run in <b>less than 1 minute</b> on the dataset <font face="Verdana,Arial,Helvetica" size="3" color="blue">data_100k_300.tsv</font>, with the loss function value less than 70.
<br />
<br />
<b>Question 2 (10 points):</b> Implement the batch gradient descent algorithm, with T = 200 epochs, learning rate η = 0.000001 (this is 10<sup>-6</sup>). For full score, your algorithm must run in <b>less than 5 minutes</b> on the dataset <font face="Verdana,Arial,Helvetica" size="3" color="blue">data_10k_100.tsv</font> with loss value less than 270,000 (this is 27x10<sup>4</sup>).
<br />
<br />
<b>Question 3 (30 points):</b> Implement the stochastic gradient descent algorithm with:
<ol> <li>T = 20 epochs, learning rate η = 0.000001 (this is 10<sup>-6</sup>) and batch size m = 1 on the dataset <font face="Verdana,Arial,Helvetica" size="3" color="blue">data_10k_100.tsv</font>. For full score, your algorithm must run in <b>less than 1 minute</b> with loss value less than 30.</li>
<li>T = 12 epochs, learning rate η = 0.0000001 (this is 10<sup>-7</sup>) and batch size m = 1 on the dataset <font face="Verdana,Arial,Helvetica" size="3" color="blue">data_100k_300.tsv</font>. For full score, your algorithm must run in <b>less than 10 minutes</b> with loss value less than 70.</li>
</ol>
Each part in question 3 is worth 15 points.
<br />
<br />
<b>Note 1:</b> Submit your code and output data to Connex.</p>
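<p>Questions 2 and 3 differ only in how many data points feed each update, which the sketch below makes explicit with a single <code>batch_size</code> parameter: the full dataset gives batch gradient descent, and 1 gives stochastic gradient descent. Several things here are assumptions: the loss is taken to be the mean squared error (the assignment's actual loss is given as an image in the FAQ and is not reproduced here), the data is shuffled every epoch, and the synthetic data and learning rates are chosen for this toy problem rather than taken from the assignment.</p>

```python
import random


def predict(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def mse(w, b, X, y):
    """Mean squared error; an assumed stand-in for the assignment's loss."""
    return sum((predict(w, b, x) - t) ** 2 for x, t in zip(X, y)) / len(y)


def minibatch_gd(X, y, epochs, lr, batch_size, seed=0):
    rng = random.Random(seed)
    d = len(X[0])
    w = [rng.random() for _ in range(d)]  # random start, components in [0, 1)
    b = rng.random()
    order = list(range(len(y)))
    for _ in range(epochs):
        rng.shuffle(order)  # assumption: reshuffle the data every epoch
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            grad_w, grad_b = [0.0] * d, 0.0
            for i in batch:
                err = predict(w, b, X[i]) - y[i]
                for j in range(d):
                    grad_w[j] += 2.0 * err * X[i][j]
                grad_b += 2.0 * err
            # step along the average gradient of the batch
            w = [wj - lr * g / len(batch) for wj, g in zip(w, grad_w)]
            b -= lr * grad_b / len(batch)
    return w, b


# Toy data generated from y = 2*x1 - x2 + 0.5 (made up for this sketch)
data_rng = random.Random(1)
X = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(200)]
y = [2.0 * x1 - 1.0 * x2 + 0.5 for x1, x2 in X]

w_sgd, b_sgd = minibatch_gd(X, y, epochs=50, lr=0.05, batch_size=1)
w_bgd, b_bgd = minibatch_gd(X, y, epochs=500, lr=0.1, batch_size=len(y))
print(mse(w_sgd, b_sgd, X, y), mse(w_bgd, b_bgd, X, y))
```

<p>On the real datasets, the epochs and learning rates specified in the questions apply, and a vectorized NumPy version would likely be needed to meet the time limits.</p>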
<h1 id="faq">FAQ</h1>
<p><br />
<b>Q1:</b> Can I use a library for computing matrix inversion in Question 1?
<br />
<b>Answer:</b> Yes. You are allowed to use NumPy in Question 1. You can use NumPy for the other questions as well.
<br />
<br />
<b>Q2:</b> How do I initialize the weight vector for gradient descent?
<br />
<b>Answer:</b> I initialize the weight vector randomly, drawing each component from [0,1] using <font face="Courier" size="3" color="blue">numpy.random.random_sample()</font>.
<br />
<br />
<b>Q3:</b> What loss function should I use?
<br />
<b>Answer:</b> For all questions, you should use this loss function: <img src="https://hunglvosu.github.io/res/loss-PA2.png" width="145" height="50" />
<br />
<br /></p>Hung Le
Programming Assignment 3 Instructions2020-07-16T00:00:00-07:002020-07-16T00:00:00-07:00https://hunglvosu.github.io/posts/2020/07/PA3<p><strong>Due by March 04, 2019 11:55 pm</strong></p>
<p><strong>Please note that <a href="https://hunglvosu.github.io/res/HW3.pdf">written homework 3 is up</a>.</strong></p>
<h1 id="problem-specification">Problem Specification</h1>
<p><b>Goal:</b> In this assignment, we will compute PageRank scores for the web dataset released by Google as part of a programming contest in 2002.
<br />
<br />
<b>Input Format:</b> The datasets are given as plain-text (.txt) files. The file format is:</p>
<ul> <li>Rows 1 to 4: Metadata. They describe the dataset and are self-explanatory.</li>
<li>Following rows: each row consists of two values and represents a link from the web page in the 1st column to the web page in the 2nd column. For example, the row <font face="Courier" size="3" color="blue">0 	 11342</font> means that there is a directed link from page id 0 to page id 11342.</li>
</ul>
<p>There are two datasets that we will work with in this assignment.</p>
<ol> <li><font face="Verdana,Arial,Helvetica" size="3" color="blue">web-Google_10k.txt</font>: This dataset contains 10,000 web pages and 78,323 links. The dataset can be downloaded from <a href="https://hunglvosu.github.io/res/web-Google_10k.zip">here</a>. DO NOT assume that page ids are from 0 to 10,000.
</li>
<li><font face="Verdana,Arial,Helvetica" size="3" color="blue">web-Google.txt</font>: This dataset contains 875,713 web pages and 5,105,039 links. The dataset can be downloaded from <a href="https://snap.stanford.edu/data/web-Google.txt.gz">here</a>. DO NOT assume that page ids are from 0 to 875,713.</li>
</ol>
<p>Also, it’s helpful to test your algorithm with this <a href="https://hunglvosu.github.io/res/toy_graph.txt">toy dataset</a>.
<br />
<br />
<br />
<b>Output Format:</b> The output format for each question is specified below.
<br />
<br />
<br />
There are two questions in this assignment, worth 50 points in total.
<br />
<br />
<b>Question 1 (20 points): </b> Find all dead ends. A node is a dead end if it has no outgoing edges or all of its outgoing edges point to dead ends. For example, consider the graph A->B->C->D. All nodes A, B, C, D are dead ends by this definition. D is a dead end because it has no outgoing edge. C is a dead end because its only outgoing neighbor, D, is a dead end. B is a dead end for the same reason, and so is A.
<ol><li>(10 points) Find all dead ends of the dataset <font face="Verdana,Arial,Helvetica" size="3" color="blue">web-Google_10k.txt</font>. For full score, your algorithm must run in <b>less than 15 seconds</b>. The output must be written to a file named <font face="Verdana,Arial,Helvetica" size="3" color="blue">deadends_10k.tsv</font>.</li>
<li>(10 points) Find all dead ends of the dataset <font face="Verdana,Arial,Helvetica" size="3" color="blue">web-Google.txt</font>. For full score, your algorithm must run in <b>less than 1 minute</b>. The output must be written to a file named <font face="Verdana,Arial,Helvetica" size="3" color="blue">deadends_800k.tsv</font>.</li>
</ol>
The output for Question 1 is a single column, where each row is the id of a dead end. See <a href="https://hunglvosu.github.io/res/deadends_toy.tsv">here</a> for a sample output for the toy dataset.
<br />
<br />
<br />
<b>Question 2 (30 points):</b> Implement the PageRank algorithm for both datasets. The taxation parameter for both datasets is β = 0.85 and the number of PageRank iterations is T = 10.
<ol> <li>(15 points) Run your algorithm on the <font face="Verdana,Arial,Helvetica" size="3" color="blue">web-Google_10k.txt</font> dataset. For full score, your algorithm must run in <b>less than 30 seconds</b>. The output must be written to a file named <font face="Verdana,Arial,Helvetica" size="3" color="blue">PR_10k.tsv</font>.</li>
<li>(15 points) Run your algorithm on the <font face="Verdana,Arial,Helvetica" size="3" color="blue">web-Google.txt</font> dataset. For full score, your algorithm must run in <b>less than 2 minutes</b>. The output must be written to a file named <font face="Verdana,Arial,Helvetica" size="3" color="blue">PR_800k.tsv</font>.</li>
</ol>
The output format for Question 2 is two-column:</p>
<ul> <li>The first column is the PageRank score.</li>
<li>The second column is the corresponding web page id.</li>
</ul>
<font color="red">The output must be sorted by descending order of the PageRank scores.</font>
<p><a href="https://hunglvosu.github.io/res/toy_output.tsv">Here</a> is a sample output for the toy dataset above.
<br />
<br />
<br />
<b>Note 1:</b> Submit your code and output data to Connex.</p>
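<p>The dead-end definition in Question 1 suggests a queue-based approach: start from the nodes with no outgoing edges, and every time a dead end is removed, decrease the remaining out-degree of each node that links to it. The following is a minimal sketch under that reading, not the reference solution; the function name <code>find_dead_ends</code> and the edge-list input are hypothetical choices.</p>

```python
from collections import defaultdict, deque

def find_dead_ends(edges):
    """Return the dead ends of a directed graph in removal order.

    edges is an iterable of (source, target) page-id pairs.
    """
    out_degree = defaultdict(int)     # remaining out-degree of each node
    in_neighbors = defaultdict(list)  # in_neighbors[v] = nodes that link to v
    nodes = set()
    for u, v in edges:
        out_degree[u] += 1
        in_neighbors[v].append(u)
        nodes.update((u, v))

    # Start from the nodes that have no outgoing edges at all.
    queue = deque(sorted(n for n in nodes if out_degree[n] == 0))
    removal_order = []
    while queue:
        v = queue.popleft()
        removal_order.append(v)
        # Removing v deletes one outgoing edge from each node linking to it.
        for u in in_neighbors[v]:
            out_degree[u] -= 1
            if out_degree[u] == 0:
                queue.append(u)
    return removal_order

# The chain A->B->C->D from Question 1: every node is a dead end.
print(find_dead_ends([("A", "B"), ("B", "C"), ("C", "D")]))
```

<p>Each edge is examined a constant number of times, so the running time is linear in the number of nodes plus links. Returning the nodes in removal order is convenient because the PageRank scores of dead ends are later updated in the reverse of that order.</p>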
<h1 id="faq">FAQ</h1>
<p><b>Q1:</b> How do I deal with dead ends?
<br />
<b>Answer:</b> I deal with dead ends by recursively removing them from the graph until no dead end remains. Then I calculate PageRank for the remaining nodes. Once I have these PageRank scores, I update the scores of the dead ends in the <font color="red">reverse</font> of the removal order. I stress that the update order is reversed.
<br />
<br />
<b>Q2:</b> How do I initialize the PageRank scores?
<br />
<b>Answer:</b> You should initialize the PageRank score of every page to the same value. Remember that we only run the actual PageRank after removing dead ends. Let’s say the number of pages after removing dead ends is <font face="Verdana,Arial,Helvetica" size="3" color="blue">Np</font>; then each node should be initialized with a PageRank score of <font face="Verdana,Arial,Helvetica" size="3" color="blue">1.0/Np</font>. It does not matter how you initialize the PageRank scores of dead ends because they are not involved in the actual PageRank calculation.
<br />
<br />
<b>Q3:</b> How do I know that my calculation is correct?
<br />
<b>Answer:</b> Run your algorithm on the sample input and make sure that the order of the pages by PageRank score matches that of the sample output. There may be slight differences in the PageRank scores themselves (because of round-off error), but the order of the pages should be unaffected.</p>
<p>Also, check against the following outputs, which list the 10 pages with the highest PageRank scores for each dataset:</p>
<ol> <li><font face="Verdana,Arial,Helvetica" size="3" color="blue">web-Google_10k.txt</font>: here is a <a href="https://hunglvosu.github.io/res/sample_output_10k.tsv">sample output</a>. This data has <font face="Verdana,Arial,Helvetica" size="3" color="blue">1544</font> dead ends total.</li>
<li><font face="Verdana,Arial,Helvetica" size="3" color="blue">web-Google.txt</font>: here is a <a href="https://hunglvosu.github.io/res/sample_output_800k.tsv">sample output</a>. This data has <font face="Verdana,Arial,Helvetica" size="3" color="blue">181057</font> dead ends total.</li>
</ol>
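<p>For reference, one common form of the PageRank iteration with taxation can be sketched as follows. It runs on the graph that remains after dead ends are removed and starts every page at 1.0/Np, as described in Q1 and Q2 above. It works from in-link lists and an out-degree table rather than a transition matrix. The function and argument names are hypothetical, and the update rule is the standard textbook one, which is assumed (not confirmed) to match the algorithm in the lecture note.</p>

```python
def pagerank(in_links, out_degree, beta=0.85, iterations=10):
    """Power iteration with taxation on a graph that has no dead ends.

    in_links[i]   -- list of nodes that link to i (one key per node)
    out_degree[j] -- number of links leaving node j
    """
    nodes = list(in_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}  # uniform initial scores
    for _ in range(iterations):
        new_rank = {}
        for i in nodes:
            # Each in-neighbor j passes on an equal share of its score.
            incoming = sum(rank[j] / out_degree[j] for j in in_links[i])
            new_rank[i] = beta * incoming + (1.0 - beta) / n
        rank = new_rank
    return rank

# Toy graph: 0 -> 1, 0 -> 2, 1 -> 0, 2 -> 0.
scores = pagerank({0: [1, 2], 1: [0], 2: [0]}, {0: 2, 1: 1, 2: 1})
# Sort by descending score, as the output format requires.
for page, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(score, page)
```

<p>Memory use is proportional to the number of pages plus links, which is what makes the larger dataset feasible.</p>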
<p><br />
<b>Q4:</b> What do I do if I get an out-of-memory error on the 800K dataset?
<br />
<b>Answer:</b> It’s probably because you construct a transition matrix to do the PageRank computation. This matrix takes about 5TB (not GB) of memory, so it is natural that you will run out of memory. The way around this is to use an adjacency list, say L, together with the algorithm on page 21 of my note. For node i, L[i] is the set of nodes that link to i. Also, you should use a degree array D, where D[i] is the out-degree of i. That is, D[i] is the number of links from i to other nodes.
<br />
<br />
<b>Q5:</b> How do I find dead ends efficiently?
<br />
<b>Answer:</b> You probably want to check <a href="https://hunglvosu.github.io/res/deadend.pdf">this</a> out.
<br />
<br />
<br />
<br /></p>Hung Le
Programming Assignment 4 Instructions2020-07-16T00:00:00-07:002020-07-16T00:00:00-07:00https://hunglvosu.github.io/posts/2020/07/PA4<p><strong>Due by March 25, 2019 11:55 pm</strong></p>
<h1 id="problem-specification">Problem Specification</h1>
<p><strong>Goal:</strong> In this assignment, we learn how to factorize the utility matrix to build recommender systems. We will use the <a href="https://grouplens.org/datasets/movielens/100k/">MovieLens 100k Dataset</a>. This dataset contains about 100k ratings from <font face="Verdana,Arial,Helvetica" size="3" color="blue">n = 943</font> users and <font face="Verdana,Arial,Helvetica" size="3" color="blue">m = 1682</font> movies. We will factorize the utility matrix into two matrices U, V of dimensions <font face="Verdana,Arial,Helvetica" size="3" color="blue">nxd</font> and <font face="Verdana,Arial,Helvetica" size="3" color="blue">dxm</font>, respectively, where <font face="Verdana,Arial,Helvetica" size="3" color="blue">d = 20</font>.
<br />
<br />
<b>Input File:</b> Download the file <a href="http://files.grouplens.org/datasets/movielens/ml-100k.zip">ml-100k.zip</a> and look for the file named <font face="Verdana,Arial,Helvetica" size="3" color="blue">u.data</font>. We only use data in this file to do factorization. DO NOT assume that users and movies are indexed from 0 to <font face="Verdana,Arial,Helvetica" size="3" color="blue">n</font> and <font face="Verdana,Arial,Helvetica" size="3" color="blue">m</font>, respectively.
<br />
<br />
<b>Input Format:</b> Each row has four tab-separated columns of the form:
<br />
<center><font face="Courier" size="3" color="blue">UserId 	 MovieId 	 Rating 	 Timestamp</font></center>
For example, the first line is:</p>
<center><font face="Courier" size="3" color="blue">196 	 242 	 3 	 881250949</font></center>
<p>which means that user 196 gave a rating of 3 to movie 242 at timestamp 881250949. For the matrix factorization approach, we will ignore the timestamp feature. It may be helpful to look at the <a href="https://hunglvosu.github.io/res/toy_rating.data">toy dataset</a>.
<br />
<br />
<br />
<b>Output Format:</b> Two files, named <font face="Verdana,Arial,Helvetica" size="3" color="blue">UT.tsv</font> and <font face="Verdana,Arial,Helvetica" size="3" color="blue">VT.tsv</font>, corresponding to the two matrices U and V:</p>
<ul> <li><font face="Verdana,Arial,Helvetica" size="3" color="blue">UT.tsv</font>: Each row of the file corresponds to a row of the matrix U, where the first column is the <font face="Courier" size="3" color="blue">UserId</font> and the following <font face="Verdana,Arial,Helvetica" size="3" color="blue">d</font> (20 in this assignment) columns contain the corresponding row of the user in U.
</li>
<li><font face="Verdana,Arial,Helvetica" size="3" color="blue">VT.tsv</font>: Each row of the file corresponds to a <font color="red">column</font> of the matrix V, where the first column is the <font face="Courier" size="3" color="blue">MovieId</font> and the following <font face="Verdana,Arial,Helvetica" size="3" color="blue">d</font> (20 in this assignment) columns contain the corresponding <font color="red">column</font> of the movie in V.
</li>
</ul>
<p>See <a href="https://hunglvosu.github.io/res/UT.tsv">UT.tsv</a> and <a href="https://hunglvosu.github.io/res/VT.tsv">VT.tsv</a> for sample outputs of the toy dataset with <font face="Verdana,Arial,Helvetica" size="3" color="blue">d = 2</font>.
<br />
<br />
<br />
There is only one question worth 50 points.
<br />
<br />
<b>Question (50 points): </b> Factorize the utility matrix into two matrices U and V. You should run your algorithm with <font face="Verdana,Arial,Helvetica" size="3" color="blue">T = 20</font> iterations. For full score, your algorithm must run in <b>less than 5 minutes</b> with RMSE less than <font face="Verdana,Arial,Helvetica" size="3" color="blue">0.62</font>.
<br />
<br />
<br />
<br />
<br />
<br />
<b>Note 1:</b> Submit your code and output data to Connex.</p>
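<p>Two pieces of bookkeeping follow from the specification above: raw user and movie ids must be mapped to matrix indices, and the RMSE is measured over the observed ratings only. The sketch below illustrates both, together with one stochastic-gradient pass as an example of an update rule. All function names are hypothetical, the toy ratings other than the first row are made up, and the assignment does not prescribe this particular optimization method.</p>

```python
import numpy as np

def build_index(ids):
    """Map arbitrary raw ids to contiguous indices 0..k-1."""
    return {raw: pos for pos, raw in enumerate(sorted(set(ids)))}

def rmse(ratings, U, V):
    """Root-mean-square error over the observed ratings only.

    ratings is a list of (user_index, movie_index, rating) triples,
    U has shape (n, d) and V has shape (d, m).
    """
    errors = [r - U[u] @ V[:, j] for u, j, r in ratings]
    return float(np.sqrt(np.mean(np.square(errors))))

def sgd_pass(ratings, U, V, eta=0.01):
    """One stochastic-gradient pass over the squared error (updates in place)."""
    for u, j, r in ratings:
        err = r - U[u] @ V[:, j]
        u_row = U[u].copy()          # keep the old row for V's update
        U[u] += eta * err * V[:, j]
        V[:, j] += eta * err * u_row

# Toy data in the (UserId, MovieId, Rating) layout; the timestamp is dropped.
raw = [(196, 242, 3), (196, 302, 4), (22, 242, 1), (22, 377, 5)]
users = build_index(u for u, _, _ in raw)
movies = build_index(m for _, m, _ in raw)
ratings = [(users[u], movies[m], float(r)) for u, m, r in raw]

d = 2
U = np.random.random_sample((len(users), d))
V = np.random.random_sample((d, len(movies)))
before = rmse(ratings, U, V)
for _ in range(200):
    sgd_pass(ratings, U, V)
print(before, rmse(ratings, U, V))
```

<p>When writing the output files, the same index maps can be inverted to recover the original ids for the first column.</p>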
<h1 id="faq">FAQ</h1>
<p><b>Q1:</b> How do I initialize matrices U and V?
<br />
<b>Answer:</b> I initialize entries of U and V by randomly selecting numbers from [0,1] using <font face="Courier" size="3" color="blue">numpy.random.random_sample()</font>.</p>Hung Le
Data Mining SENG 474/ CSC 578D2020-07-16T00:00:00-07:002020-07-16T00:00:00-07:00https://hunglvosu.github.io/posts/2020/07/Syllabus-DM<p><strong>Course website on Heat</strong> for <a href="https://heat.csc.uvic.ca/coview/outline/2019/Spring/SENG/474">SENG 474</a> and <a href="https://heat.csc.uvic.ca/coview/outline/2019/Spring/CSC/578D">CSC 578D</a>. There will be no further update on these two sites. More up-to-date contents will be on this site.</p>
<p><strong>Tentative topics:</strong></p>
<ul>
<li>Finding similar items</li>
<li>Frequent itemsets</li>
<li>Classification</li>
<li>Regression</li>
<li>Clustering</li>
<li>Recommender Systems</li>
<li>Mining Social-Network Graphs</li>
<li>Link Analysis</li>
<li>Advertising on the Web</li>
<li>A/B Testing</li>
</ul>
<p><strong>Textbook:</strong> Materials in the class will be drawn mostly from the free (and great) book <a href="http://www.mmds.org">Mining of Massive Datasets</a> by Jure Leskovec, Anand Rajaraman, and Jeff Ullman.</p>
<h1 id="teaching-staffs">Teaching Staff</h1>
<ul>
<li>Instructor: Hung Le.
<ul>
<li>Email: hungle@uvic.ca</li>
<li>Office: ECS 621</li>
<li>Weekly Office Hours: 10:30 am - 12:30 pm Friday</li>
</ul>
</li>
<li>TAs:
<ul>
<li>Sajjad Azami (Email: sajjadaazami@gmail.com)</li>
<li>Cole Peterson (Email: colpeterson@gmail.com)</li>
<li>Jasbir Singh (Email: jasbircheema96@gmail.com)</li>
<li>Weekly Office Hours: Monday 11:00 am - 12:30 pm and Tuesday 1:30-3:00 pm, all at ECS 253</li>
</ul>
</li>
</ul>
<h1 id="annoucement">Announcements</h1>
<p><br />
<b>Final Solution</b> is posted. Check it out <a href="https://hunglvosu.github.io/res/final-sol.pdf">here</a>.
<br />
<br />
<b>Practice Problems</b> for the final exam are posted. Check them out <a href="https://hunglvosu.github.io/res/Practice.pdf">here</a>.
<br />
<br />
<b>Written Assignment 4</b> is posted. Check out <a href="https://hunglvosu.github.io/res/HW4.pdf">here</a>. Check out <a href="https://hunglvosu.github.io/res/HW4_solution.pdf">the solution</a> written by Jasbir Singh.
<br />
<br />
<b>Programming Assignment 4:</b> The description of programming assignment 4 is online. Please check it out <a href="/posts/2020/07/PA4/">here</a>. Ask me or the TAs if anything is unclear.
<br />
<br />
<b>Midterm solution</b> is posted. Check out <a href="https://hunglvosu.github.io/res/midterm-sol.pdf">here</a>.
<br />
<br />
<b>Project guideline</b> is posted. Check out <a href="https://hunglvosu.github.io/res/guide-line.pdf">here</a>.
<br />
<br />
<b>Written Assignment 3</b> is released. Check out <a href="https://hunglvosu.github.io/res/HW3.pdf">here</a>. Check out <a href="https://hunglvosu.github.io/res/HW3_solution.pdf">the solution</a> written by Jasbir Singh.
<br />
<br />
<b>Programming Assignment 3:</b> The description of programming assignment 3 is online. Please check it out <a href="/posts/2020/07/PA3/">here</a>. Ask me or the TAs if anything is unclear.
<br />
<br />
<b>Written Assignment 2</b> is released. Check out <a href="https://hunglvosu.github.io/res/HW2.pdf">here</a>. Check out <a href="https://hunglvosu.github.io/res/sol-hw2.pdf">the solution</a>. The write-ups for Q2, Q4b, and Q5b are by Hung Le; the others are by Sajjad Azami.</p>
<p><strong>Programming Assignment 2:</strong> The description of programming assignment 2 is online. Please check it out <a href="/posts/2020/07/PA2/">here</a>. Ask me or the TAs if anything is unclear.</p>
<p><strong>Written Assignment 1</strong> is released. Check out <a href="https://hunglvosu.github.io/res/HW1.pdf">here</a>. Check out <a href="https://hunglvosu.github.io/res/HW1solution.pdf">the solution</a> written by Jasbir Singh.</p>
<p><strong>Programming Assignment 1:</strong> The description of programming assignment 1 is online. Please check it out <a href="/posts/2020/07/PA1/">here</a>. Ask me or the TAs if anything is unclear. Here is a <a href="https://hunglvosu.github.io/res/PA1-sol.py">sample solution</a>.</p>
<h1 id="lectures">Lectures</h1>
<ol>
<li><b>Tue 08/01:</b> Introduction. See <a href="http://infolab.stanford.edu/~ullman/mmds/ch1.pdf">chapter 1</a> of MMDS book and <a href="https://hunglvosu.github.io/res/lect1-intro.pdf">my own note</a>. Wikipedia lists many interesting <a href="https://en.wikipedia.org/wiki/Big_data#Case_studies">case studies</a> on Big Data. </li>
<br />
<li><b>Wed 09/01:</b> Review of hashing. Guest lecture by Dr. <a href="http://web.uvic.ca/~nmehta/">Nishant Mehta</a>. See <a href="http://jeffe.cs.illinois.edu/teaching/algorithms/notes/05-hashing.pdf">this note</a> by Jeff Erickson.</li>
<br />
<li><b>Fri 11/01:</b> Finding Similar Items. See <a href="http://infolab.stanford.edu/~ullman/mmds/ch3.pdf">chapter 3</a> of MMDS book and <a href="https://hunglvosu.github.io/res/lect2-sim.pdf">my own note</a>. See <a href="https://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf">here</a> for the Amazon.com recommendation paper mentioned in class. For similarity measures (or distances) besides the ones we saw in class, check out <a href="https://en.wikipedia.org/wiki/Cosine_similarity">here</a> or <a href="https://en.wikipedia.org/wiki/Jaccard_index">there</a>.</li>
<br />
<li><b>Tue 15/01:</b> Finding Similar Items (Continued). See the references above.</li>
<br />
<li><b>Wed 16/01:</b> Frequent Itemsets. See <a href="http://infolab.stanford.edu/~ullman/mmds/ch6.pdf">chapter 6</a> of MMDS book and <a href="https://hunglvosu.github.io/res/lect3-frequent.pdf">my own note</a>. The <a href="https://tdwi.org/articles/2016/11/15/beer-and-diapers-impossible-correlation.aspx">diapers and beer</a> story.</li>
<br />
<li><b>Fri 18/01:</b> Frequent Itemsets (Continued). See the references above.</li>
<br />
<li><b>Tue 22/01:</b> Frequent Itemsets (Continued). See the references above.</li>
<br />
<li><b>Wed 23/01:</b> Linear Regression. See <a href="http://cs229.stanford.edu/notes/cs229-notes1.pdf">Andrew Ng's note</a> and <a href="https://hunglvosu.github.io/res/lect4-linear.pdf">my own note</a>. Also, for a review of linear algebra, see this <a href="http://cs229.stanford.edu/section/cs229-linalg.pdf">note</a>. See also <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2012/01/tricks-2012.pdf">Stochastic Gradient Descent Tricks</a> by Léon Bottou.</li>
<br />
<li><b>Fri 25/01:</b> Linear Regression (Continued). See the references above.</li>
<br />
<li><b>Tue 29/01:</b> Linear Regression (Continued). See the references above.</li>
<br />
<li><b>Wed 30/01:</b> Support Vector Machine. See <a href="http://infolab.stanford.edu/~ullman/mmds/ch12.pdf">chapter 12</a> of MMDS book and <a href="https://hunglvosu.github.io/res/lect5-SVM.pdf">my own note</a>.</li>
<br />
<li><b>Fri 01/02:</b> Support Vector Machine (Continued). See the references above.</li>
<br />
<li><b>Tue 05/02:</b> Link Analysis. See <a href="http://infolab.stanford.edu/~ullman/mmds/ch5.pdf">chapter 5</a> of MMDS book and <a href="https://hunglvosu.github.io/res/lect6-Link.pdf">my own note</a>. Many good materials on link analysis are available from the <a href="http://people.seas.harvard.edu/~babis/amazing.html">Harvard Amazing project</a>. See also the <a href="http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf">PageRank</a> paper and the <a href="http://www.cs.cornell.edu/home/kleinber/auth.pdf">HITS</a> paper.</li>
<br />
<li><b>Wed 06/02:</b> Link Analysis (Continued). See the references above.</li>
<br />
<li><b>Fri 08/02:</b> Link Analysis (Continued). See the references above.</li>
<br />
<li><b>Tue 12/02:</b> Class canceled due to bad weather. The university was shut down.</li>
<br />
<li><b>Wed 13/02:</b> Midterm.</li>
<br />
<li><b>Fri 15/02:</b> Link Analysis (Continued). See the references above. </li>
<br />
<li><b>Tue 19/02:</b> Reading break. </li>
<br />
<li><b>Wed 20/02:</b> Reading break. </li>
<br />
<li><b>Fri 22/02:</b> Reading break. </li>
<br />
<li><b>Tue 26/02:</b> Clustering. See <a href="http://infolab.stanford.edu/~ullman/mmds/ch7.pdf">chapter 7</a> of MMDS book and <a href="https://hunglvosu.github.io/res/lect7-clustering.pdf">my own note</a>. </li>
<br />
<li><b>Wed 27/02:</b> Clustering (Continued). See the references above. </li>
<br />
<li><b>Fri 01/03:</b> Advertising on the Web. See <a href="http://infolab.stanford.edu/~ullman/mmds/ch8.pdf">chapter 8</a> of MMDS book and <a href="https://hunglvosu.github.io/res/lect8-advertising.pdf">my own note</a>. </li>
<br />
<li><b>Tue 05/03:</b> Advertising on the Web (Continued). See the references above. </li>
<br />
<li><b>Wed 06/03:</b> Advertising on the Web (Continued). See the references above. </li>
<br />
<li><b>Fri 08/03:</b> Recommendation System. See <a href="http://infolab.stanford.edu/~ullman/mmds/ch9.pdf">chapter 9</a> of MMDS book and <a href="https://hunglvosu.github.io/res/lect9-recommendation.pdf">my own note</a>. </li>
<br />
<li><b>Tue 12/03:</b> Recommendation System (Continued). See the references above. </li>
<br />
<li><b>Wed 13/03:</b> Mining Social Network Graphs. See <a href="http://infolab.stanford.edu/~ullman/mmds/ch10.pdf">chapter 10</a> of MMDS book and <a href="https://hunglvosu.github.io/res/lect10-social-net.pdf">my own note</a>. </li>
<br />
<li><b>Fri 15/03:</b> Mining Social Network Graphs (Continued). See the references above.</li>
<br />
<li><b>Tue 19/03:</b> Mining Social Network Graphs (Continued). See the references above.</li>
<br />
<li><b>Wed 20/03:</b> Mining Social Network Graphs (Continued). See the references above.</li>
<br />
<li><b>Fri 22/03:</b> Mining Social Network Graphs (Continued). See the references above.</li>
<br />
<li><b>Tue 26/03:</b> Dimensionality Reduction. See <a href="http://infolab.stanford.edu/~ullman/mmds/ch11.pdf">chapter 11</a> of MMDS book and <a href="https://hunglvosu.github.io/res/lect11-dim-reduction.pdf">my own note</a>. </li>
<br />
</ol>Hung Le