From 25c83bbb5d447cee8b3617193f014106a17699dc Mon Sep 17 00:00:00 2001
From: Victor Shnayder <victor@mitx.mit.edu>
Date: Sun, 26 Aug 2012 21:55:55 -0400
Subject: [PATCH] xml format docs!

---
 doc/xml-format.md | 370 +++++++++++++++++++++++++++++++++-------------
 1 file changed, 264 insertions(+), 106 deletions(-)
diff --git a/doc/xml-format.md b/doc/xml-format.md
index 2a9e379ccc..55bcda4480 100644
--- a/doc/xml-format.md
+++ b/doc/xml-format.md
@@ -1,147 +1,305 @@
-This doc is a rough spec of our xml format
+# edX xml format tutorial
 
-Every content element (within a course) should have a unique id.  This id is formed as {category}/{url_name}.  Categories are the different tag types ('chapter', 'problem', 'html', 'sequential', etc).  Url_name is a string containing a-z, A-Z, dot (.) and _.  This is what appears in urls that point to this object.
+## Goals of this document
 
-File layout:
+*	This was written assuming the reader has no prior programming/CS knowledge and has jumped cold turkey into the edX platform.
+*	To educate the reader on how to build and maintain the back end structure of the course content. This is important for debugging and standardization.
+*	After reading this, you should be able to add content to a course and make sure it shows up in the courseware and does not break the code.
+* __Prerequisites:__ it would be helpful to know a little bit about xml.  Here is a [simple example](http://www.ultraslavonic.info/intro-to-xml/) if you've never seen it before.
 
-- Xml files have content
-- "policy", which is also called metadata in various places, should live in a policy file.
+## Outline
 
-- each module (except customtag and course, which are special, see below) should live in a file, located at {category}/{url_name].xml
-To include this module in another one (e.g. to put a problem in a vertical), put in a "pointer tag":  <{category} url_name="{url_name}"/>.  When we read that, we'll load the actual contents.
+*	First, we will show a sample course structure as a case study/model of how xml and files in a course are organized, to give an introductory understanding.
 
-Customtag is already a pointer, you can just use it in place: <customtag url_name="my_custom_tag" impl="blah" attr1="..."/>
+*	More technical details are below, including discussion of some special cases.
 
-Course tags:
-  - the top level course pointer tag lives in course.xml
-  - have 2 extra required attributes: "org" and "course" -- organization name, and course name.  Note that the course name is referring to the platonic ideal of this course, not to any particular run of this course.  The url_name should be particular run of this course.  E.g.
 
-If course.xml contains:
-<course org="HarvardX" course="cs50" url_name="2012"/>
+## Introduction
 
-we would load the actual course definition from course/2012.xml
+*	The course is organized hierarchically.  We start by describing course-wide parameters, then break the course into chapters, and then go deeper and deeper until we reach a specific pset, video, etc.
 
-To support multiple different runs of the course, you could have a different course.xml, containing
+*	You could make an analogy to finding a green shirt in your house - front door -> bedroom -> closet -> drawer -> shirts -> green shirt
 
-<course org="HarvardX" course="cs50" url_name="2012H"/>
 
-which would load the Harvard-internal version from course/2012H.xml
+## Case Study
 
-If there is only one run of the course for now, just have a single course.xml with the right url_name.
+Let's jump right in by looking at the directory structure of a very simple toy course:
 
-If there is more than one run of the course, the different course root pointer files should live in
-roots/url_name.xml, and course.xml should be a symbolic link to the one you want to run in your dev instance.
+    toy/
+        course
+        course.xml
+        problem
+        policies
+        roots
 
-If you want to run both versions, you need to checkout the repo twice, and have course.xml point to different root/{url_name}.xml files.
+The only top level file is `course.xml`, which should contain one line, looking something like this:
 
-Policies:
- - the policy for a course url_name lives in policies/{url_name}.json
+    <course org="edX" course="toy" url_name="2012_Fall"/>
 
-The format is called "json", and is best shown by example (though also feel free to google :)
+This gives all the information to uniquely identify a particular run of any course--which organization is producing the course, what the course name is, and what "run" this is, specified via the `url_name` attribute.
 
-the file is a dictionary (mapping from keys to values, syntax "{ key : value, key2 : value2, etc}"
+Obviously, this doesn't actually specify any of the course content, so we need to find that next.  To know where to look, you need to know the standard organizational structure of our system: _course elements are uniquely identified by the combination `(category, url_name)`_.  In this case, we are looking for a `course` element with the `url_name` "2012_Fall".  The definition of this element will be in `course/2012_Fall.xml`.  Let's look there next:
 
-Keys are in the form "{category}/{url_name}", which should uniquely id a content element.
-Values are dictionaries of the form {"metadata-key" : "metadata-value"}.
+`course/2012_Fall.xml`
 
-metadata can also live in the xml files, but anything defined in the policy file overrides anything in the xml.  This is primarily for backwards compatibility, and you should probably  not use both.  If you do leave some metadata tags in the xml, please be consistent (e.g. if display_names stay in xml, they should all stay in xml).
-   - note, some xml attributes are not metadata.  e.g. in <video youtube="xyz987293487293847"/>, the youtube attribute specifies what video this is, and is logically part of the content, not the policy, so it should stay in video/{url_name}.xml.
+    <course>
+      <chapter url_name="Overview">
+        <videosequence url_name="Toy_Videos">
+          <problem url_name="warmup"/>
+          <video url_name="Video_Resources" youtube="1.0:1bK-WdDi6Qw"/>
+        </videosequence>
+        <video url_name="Welcome" youtube="1.0:p2Q6BrNhdh8"/>
+      </chapter>
+    </course>
 
-Example policy file:
-{
-    "course/2012": {
-        "graceperiod": "1 day",
-        "start": "2012-10-15T12:00",
-        "display_name": "Introduction to Computer Science I",
-        "xqa_key": "z1y4vdYcy0izkoPeihtPClDxmbY1ogDK"
-    },
-    "chapter/Week_0": {
-        "display_name": "Week 0"
-    },
-    "sequential/Pre-Course_Survey": {
-        "display_name": "Pre-Course Survey",
-        "format": "Survey"
+Aha.  Now we found some content.  We can see that the course is organized hierarchically, in this case with only one chapter, with `url_name` "Overview".   The chapter contains a `videosequence` and a `video`, with the sequence containing a problem and another video.  When viewed in the courseware, chapters are shown at the top level of the navigation accordion on the left, with any elements directly included in the chapter below.
+
+Looking at this file, we can see the course structure, and the youtube urls for the videos, but what about the "warmup" problem?  There is no problem content here!    Where should we look?  This is a good time to pause and try to answer that question based on our organizational structure above.
+
+As you hopefully guessed, the problem would be in `problem/warmup.xml`.  (Note: This tutorial doesn't discuss the xml format for problems--there are chapters of edx4edx that describe it.)  This is an instance of a _pointer tag:_ any xml tag with only the category and a `url_name` attribute will point to the file `{category}/{url_name}.xml`.  For example, this means that our toy `course/2012_Fall.xml` could also have been written as
+
+`course/2012_Fall.xml`
+
+    <course>
+      <chapter url_name="Overview"/>
+    </course>
+
+with `chapter/Overview.xml` containing
+
+    <chapter>
+        <videosequence url_name="Toy_Videos">
+          <problem url_name="warmup"/>
+          <video url_name="Video_Resources" youtube="1.0:1bK-WdDi6Qw"/>
+        </videosequence>
+        <video url_name="Welcome" youtube="1.0:p2Q6BrNhdh8"/>
+    </chapter>
+
+In fact, this is the recommended structure for real courses--putting each chapter into its own file makes it easy to have different people work on each without conflicting or having to merge.  Similarly, as sequences get large, it can be handy to split them out as well (in `{category}/{url_name}.xml`, e.g. `videosequence/Toy_Videos.xml`, of course).
+
+Note that the `url_name` is only specified once per element--either in the inline definition or in the pointer tag.
+
+## Policy files
+
+We still haven't looked at two of the directories in the top-level listing above: `policies` and `roots`.  Let's look at policies next.  The `policies` directory contains one file:
+
+    policies:
+        2012_Fall.json
+
+and that file is named `{course-url_name}.json`.  As you might expect, this file contains a policy for the course.  In our example, it looks like this:
+
+    2012_Fall.json:
+    {
+        "course/2012_Fall": {
+            "graceperiod": "2 days 5 hours 59 minutes 59 seconds",
+            "start": "2015-07-17T12:00",
+            "display_name": "Toy Course"
+        },
+        "chapter/Overview": {
+            "display_name": "Overview"
+        },
+        "videosequence/Toy_Videos": {
+            "display_name": "Toy Videos",
+            "format": "Lecture Sequence"
+        },
+        "problem/warmup": {
+            "display_name": "Getting ready for the semester"
+        },
+        "video/Video_Resources": {
+            "display_name": "Video Resources"
+        },
+        "video/Welcome": {
+            "display_name": "Welcome"
+        }
     }
-}
 
-NOTE: json is picky about commas.  If you have trailing commas before closing braces, it will complain and refuse to parse the file.  This is irritating.
+The policy specifies metadata about the content elements--things which are not inherent to the definition of the content, but which describe how the content is presented to the user and used in the course.  See below for a full list of metadata attributes.  They include `display_name`, which is what is shown when this piece of content is referenced or shown in the courseware; dates and times like `start`, which specifies when the content becomes visible to students; and problem-specific parameters like the allowed number of attempts.  One important point is that some metadata is inherited: for example, specifying the start date on the course makes it the default for every element in the course.  See below for more details.
+
+It is possible to put metadata directly in the xml, as attributes of the appropriate tag, but using a policy file has two benefits: it puts all the policy in one place, making it easier to check that things like due dates are set properly, and it allows the content definitions to be easily used in another run of the same course, with the same or similar content, but different policy.
+
+## Roots
+
+The last directory in the top level listing is `roots`.  In our toy course, it contains a single file:
+
+    roots/
+        2012_Fall.xml
+
+This file is identical to the top-level `course.xml`, containing
+
+    <course org="edX" course="toy" url_name="2012_Fall"/>
+
+In fact, the top level `course.xml` is a symbolic link to this file.  When there is only one run of a course, the roots directory is not really necessary, and the top-level course.xml file can just specify the `url_name` of the course.  However, if we wanted to make a second run of our toy course, we could add another file called, e.g., `roots/2013_Spring.xml`, containing
+
+    <course org="edX" course="toy" url_name="2013_Spring"/>
+
+After creating `course/2013_Spring.xml` with the course structure (possibly as a symbolic link or copy of `course/2012_Fall.xml` if no content was changing), and `policies/2013_Spring.json`, we would have two different runs of the toy course in the course repository.  Our build system understands this roots structure, and will build a course package for each root.  (Dev note: if you're using a local development environment, make the top level `course.xml` point to the desired root, and check out the repo multiple times if you need multiple runs simultaneously).
+
+That's basically all there is to the organizational structure.  Read the next section for details on the tags we support, including some special case tags like `customtag` and `html` invariants, and look at the end for some tips that will make the editing process easier.
+
+----------
+
+# Tag types
+
+* `abtest` -- Support for A/B testing.  TODO: add details.
+* `chapter` -- top level organization unit of a course.   The courseware display code currently expects the top level `course` element to contain only chapters, though there is no philosophical reason why this is required, so we may change it to properly display non-chapters at the top level.
+* `course` -- top level tag.  Contains everything else.
+* `customtag` -- render an html template, filling in some parameters, and return the resulting html.  See below for details.
+* `html` -- a reference to an html file.
+* `error`  -- don't put these in by hand :)   The internal representation of content that has an error, such as malformed xml or some broken invariant.  You may see this in the xml once the CMS is in use...
+* `problem` -- a problem.  See elsewhere in edx4edx for documentation on the format.
+* `problemset` -- logically, a series of related problems.  Currently displayed vertically.  May contain explanatory html, videos, etc.
+* `sequential` -- a sequence of content, currently displayed with a horizontal list of tabs.  If possible, use a more semantically meaningful tag (currently, we only have `videosequence`).
+* `vertical` -- a sequence of content, displayed vertically.  If possible, use a more semantically meaningful tag (currently, we only have `problemset`).
+* `video`  -- a link to a video, currently expected to be hosted on youtube.
+* `videosequence` -- a sequence of videos.  This can contain various non-video content; it just signals to the system that this is logically part of an explanatory sequence of content, as opposed to say an exam sequence.
+
+## Tag details
+
+### Container tags
+
+Container tags include `chapter`, `sequential`, `videosequence`, `vertical`, and `problemset`.  They are all specified in the same way in the xml, as shown in the tutorial above.
+
+### `course`
+
+`course` is also a container, and is similar, with one extra wrinkle: the top level pointer tag _must_ have  `org` and `course` attributes specified--the organization name, and course name.  Note that `course` is referring to the platonic ideal of this course (e.g. "6.002x"), not to any particular run of this course.  The `url_name` should be the particular run of this course.
+
+### `customtag`
+
+When we see `<customtag impl="special" animal="unicorn" hat="blue"/>`, we will:
+
+* look for a file called `custom_tags/special`  in your course dir.
+* render it as a mako template, passing parameters {'animal':'unicorn', 'hat':'blue'}, generating html.  (Google `mako` for template syntax, or look at existing examples).
+
+Since `customtag` is already a pointer, there is generally no need to put it into a separate file--just use it in place: `<customtag url_name="my_custom_tag" impl="blah" attr1="..."/>`
 
 
-Valid tag categories:
+### `html`
 
-abtest
-chapter
-course
-customtag
-html
-error  -- don't put these in by hand :)
-problem
-problemset
-sequential
-vertical
-video
-videosequence
+Most of our content is in xml, but some html content may not be proper xml (all tags matched, single top-level tag, etc), since browsers are fairly lenient in what they'll display.  So, there are two ways to include html content:
 
-Obsolete tags:
-Use customtag instead:
-  videodev
-  book
-  slides
-  image
-  discuss
+* If your html content is in a proper xml format, just put it in `html/{url_name}.xml`.
+* If your html content is not in proper xml format, you can put it in `html/{filename}.html`, and put `<html filename="{filename}"/>` in `html/{filename}.xml`.  This allows another level of indirection, and makes sure that we can read the xml file and then just return the actual html content without trying to parse it.
 
-Ex: instead of <book page="12"/>, use <customtag impl="book" page="12"/>
+### `video`
 
-Use something semantic instead, as makes sense: sequential, vertical, videosequence if it's actually a sequence.  If the section would only contain a single element, just include that element directly.
-  section
+Videos have a `youtube` attribute, which specifies a comma-separated list of `speed:youtube_id` pairs:
 
-In general, prefer the most "semantic" name for containers: e.g. use problemset rather than vertical for a problem set.  That way, if we decide to display problem sets differently, we don't have to change the xml.
+    <video youtube="0.75:1yk1A8-FPbw,1.0:vNMrbPvwhU4,1.25:gBW_wqe7rDc,1.50:7AE_TKgaBwA" url_name="S15V14_Response_to_impulse_limit_case"/>
 
-How customtags work:
- When we see <customtag impl="special" animal="unicorn" hat="blue"/>, we will:
+This video has been encoded at 4 different speeds: 0.75x, 1x, 1.25x, and 1.5x.
 
- - look for a file called custom_tags/special  in your course dir.
- - render it as a mako template, passing parameters {'animal':'unicorn', 'hat':'blue'}, generating html.
+## More on `url_name`s
+
+Every content element (within a course) should have a unique id.  This id is formed as `{category}/{url_name}`, or automatically generated from the content if `url_name` is not specified.  Categories are the different tag types ('chapter', 'problem', 'html', 'sequential', etc).  Url_name is a string containing a-z, A-Z, dot (.) and _.  This is what appears in urls that point to this object.
+
+__IMPORTANT__: A student's state for a particular content element is tied to the element id, so the automatic id generation is only ok for elements that do not need to store any student state (e.g. verticals or customtags).  For problems, sequentials, videos, and any other element where we keep track of what the student has done and where they are, you should specify a unique `url_name`.  Of course, any content element that is split out into a file will need a `url_name` to specify where to find the definition.  When the CMS comes online, it will use these ids to enable content reuse, so if there is a logical name for something, please do specify it.
+
+-----
+
+## Policy files
+
+*	A policy file is useful when running different versions of a course e.g. internal, external, fall, spring, etc. as you can change due dates, etc, by creating multiple policy files.
+*	A policy file provides information on the metadata of the course--things that are not inherent to the definitions of the contents, but that may vary from run to run.
+* Note: We will be expanding our understanding and format for metadata in the not-too-distant future, but for now it is simply a set of key-value pairs.
+
+### Policy file location
+* The policy for a course run `some_url_name` lives in `policies/some_url_name.json`
+
+### Policy file contents
+* The file format is "json", and is best shown by example, as in the tutorial above (though also feel free to google :)
+* The expected contents are a dictionary mapping from keys to values (syntax "{ key : value, key2 : value2, etc}")
+* Keys are in the form "{category}/{url_name}", which should uniquely identify a content element.
+* Values are dictionaries of the form `{"metadata-key" : "metadata-value"}`.
+* The order in which things appear does not matter, though it may be helpful to organize the file in the same order as things appear in the content.
+* NOTE: json is picky about commas.  If you have trailing commas before closing braces, it will complain and refuse to parse the file.  This can be irritating at first.
+
+### Available metadata
+
+__Not inherited:__
+
+* `display_name` - name that will appear when this content is displayed in the courseware.  Useful for all tag types.
+*	`format` - subheading under display name -- currently only displayed for chapter sub-sections.
+* `hide_from_toc` -- If set to true for a chapter or chapter subsection, will hide that element from the courseware navigation accordion.  This is useful if you'd like to link to the content directly instead (e.g. for tutorials)
+* `ispublic` -- specify whether the course is public.  You should be able to use start dates instead (?)
+
+__Inherited:__
+
+* `start` -- when this content should be shown to students.  Note that anyone with staff access to the course will always see everything.
+*	`showanswer` - only for psets, is binary (closed/open).
+*	`graded` - Tutorial vs. grade, again binary (true/false). If true, will be used in calculation of student grade.
+*	`rerandomize` - Provide different numbers/variables for problems to prevent cheating. Provide different answers from a question bank?
+*	`due` - Due date for assignment. Assignment will be closed after that. This is a very important function of a policy file.
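+* `graceperiod` - extra time allowed after the `due` date, given as a duration string (e.g. "1 day", as in the example policy files).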
+* `xqa_key` -- for integration with Ike's content QA server. -- should typically be specified at the course level.
+
+__Inheritance example:__
+
+This is a sketch ("tue" is not a valid start date) that should help illustrate how metadata inheritance works.
+
+    <course start="tue">
+      <chap1> -- start tue
+        <problem>   --- start tue
+      </chap1>
+      <chap2 start="wed">  -- start wed
+       <problem2 start="thu">  -- start thu
+       <problem3>      -- start wed
+      </chap2>
+    </course>
 
 
-METADATA
+## Specifying metadata in the xml file
 
-Metadata that we generally understand:
-Only on course tag in courses/url_name.xml
-  ispublic
-  xqa_key  -- set only on course, inherited to everything else
+Metadata can also live in the xml files, but anything defined in the policy file overrides anything in the xml.  This is primarily for backwards compatibility, and you should probably  not use both.  If you do leave some metadata tags in the xml, you should be consistent (e.g. if `display_name`s stay in xml, they should all stay in xml).
+   - note, some xml attributes are not metadata.  e.g. in `<video youtube="xyz987293487293847"/>`, the `youtube` attribute specifies what video this is, and is logically part of the content, not the policy, so it should stay in the xml.
 
-Everything:
-  display_name
-  format   (maybe only content containers, e.g. "Lecture sequence", "problem set", "lab", etc. )
-  start  -- modules will not show up to non-course-staff users before the start date (in production)
-  hide_from_toc  -- if this is true, don't show in table of contents for the course.  Useful on chapters, and chapter subsections that are linked to from somewhere else.
-
-Used for problems
-graceperiod
-showanswer
-rerandomize
-graded
-due
+Another example policy file:
+    {
+        "course/2012": {
+            "graceperiod": "1 day",
+            "start": "2012-10-15T12:00",
+            "display_name": "Introduction to Computer Science I",
+            "xqa_key": "z1y4vdYcy0izkoPeihtPClDxmbY1ogDK"
+        },
+        "chapter/Week_0": {
+            "display_name": "Week 0"
+        },
+        "sequential/Pre-Course_Survey": {
+            "display_name": "Pre-Course Survey",
+            "format": "Survey"
+        }
+    }
 
 
-These are _inherited_ : if specified on the course, will apply to everything in the course, except for things that explicitly specify them, and their children.
-        'graded', 'start', 'due', 'graceperiod', 'showanswer', 'rerandomize',
-        # TODO (ichuang): used for Fall 2012 xqa server access
-        'xqa_key',
 
-Example sketch:
-<course start="tue">
-  <chap1> -- start tue
-    <problem>   --- start tue
-  </chap1>
-  <chap2 start="wed">  -- start wed
-   <problem2 start="thu">  -- start thu
-   <problem3>      -- start wed
-  </chap2>
-</course>
+## Deprecated formats
+
+If you look at some older xml, you may see some tags or metadata attributes that aren't listed above.  They are deprecated, and should not be used in new content.  We include them here so that you can understand how old-format content works.
+
+### Obsolete tags:
+
+* `section` : this used to be necessary within chapters.  Now, you can just use any standard tag inside a chapter, so use the container tag that makes the most sense for grouping content--e.g. `problemset`, `videosequence`, and just include content directly if it belongs inside a chapter (e.g. `html`, `video`, `problem`)
+
+* There used to be special purpose tags that all basically did the same thing, and have been subsumed by `customtag`.  The list is `videodev, book, slides, image, discuss`.  Use `customtag` in new content.  (e.g. instead of `<book page="12"/>`, use `<customtag impl="book" page="12"/>`)
+
+### Obsolete attributes
+
+* `slug` -- old term for `url_name`.  Use `url_name`
+* `name` -- we didn't originally have a distinction between `url_name` and `display_name` -- this made content element ids fragile, so please use `url_name` as a stable unique identifier for the content, and `display_name` as the particular string you'd like to display for it.
 
 
-STATIC LINKS:
+# Static links
 
-if your content links (e.g. in an html file)  to "static/blah/ponies.jpg", we will look for this in YOUR_COURSE_DIR/blah/ponies.jpg.  Note that this is not looking in a static/ subfolder in your course dir.  This may (should?) change at some point.
+If your content links (e.g. in an html file) to `"static/blah/ponies.jpg"`, we will look for this in `YOUR_COURSE_DIR/blah/ponies.jpg`.  Note that this is not looking in a `static/` subfolder in your course dir.  This may (should?) change at some point.  Links that include `/course` will be rewritten to the root of your course in the courseware (e.g. `courses/{org}/{course}/{url_name}/` in the current url structure).  This is useful for linking to the course wiki, for example.
+
+# Tips for content developers
+
+* We will be making better tools for managing policy files soon.  In the meantime, you can add dummy definitions to make it easier to search and separate the file visually.  For example, you could add:
+
+    "WEEK 1" : "##################################################",
+
+before the week 1 material to make it easy to find in the file.
+
+* Come up with a consistent pattern for url_names, so that it's easy to know where to look for any piece of content.  It will also help to come up with a standard way of splitting your content files.  As a point of departure, we suggest splitting chapters, sequences, html, and problems into separate files.
+
+* A heads up: our content management system will allow you to develop content through a web browser, but will be backed by this same xml at first.  Once that happens, every element will be in its own file to make access and updates faster.
+
+* Prefer the most "semantic" name for containers: e.g., use `problemset` rather than `vertical` for a problem set.  That way, if we decide to display problem sets differently, we don't have to change the xml.