Friday, November 14, 2014

SageMathCloud Notifications are Now Better

I just made live a new notifications systems for SageMathCloud, which I spent all week writing.

These notifications are what you see when you click the bell in the upper right. This new system replaces the one I made live two weeks ago. Whenever somebody actively *edits* (using the web interface) any file in any project you collaborate on, a notification will get created or updated. If a person *comments* on any file in any project you collaborate on (using the chat interface to the right), then not only does the notification get updated, there is also a little red counter on top of the bell and also in the title of the SageMathCloud tab. In particular, people will now be much more likely to see the chats you make on files.

NOTE: I have not yet enabled any sort of daily email notification summary, but that is planned.

Some technical details: Why did this take all week? It's because the technology that makes it work behind the scenes is something that was fairly difficult for me to figure out how to implement. I implemented a way to create an object that can be used simultaneously by many clients and supports realtime synchronization.... but is stored by the distributed Cassandra database instead of a file in a project. Any changes to that object get synchronized around very quickly. It's similar to how synchronized text editing (with several people at once) works, but I rethought differential synchronization carefully, and also figured out how to synchronize using an eventually consistent database. This will be useful for implementing a lot other things in SageMathCloud that operate at a different level than "one single project". For example, I plan to add functions so you can access these same "synchronized databases" from Python processes -- then you'll be able to have sage worksheets (say) running on several different projects, but all saving their data to some common synchronized place (backed by the database). Another application will be a listing of the last 100 (say) files you've opened, with easy ways to store extra info about them. It will also be easy to make account and project settings more realtime, so when you change something, it automatically takes effect and is also synchronized across other browser tabs you may have open. If you're into modern Single Page App web development, this might remind you of Angular or React or Hoodie or Firebase -- what I did this week is probably kind of like some of the sync functionality of those frameworks, but I use Cassandra (instead of MongoDB, say) and differential synchronization.

I BSD-licensed the differential synchronization code that I wrote as part of the above.

Friday, October 17, 2014

A Non-technical Overview of the SageMathCloud Project

What problems is the SageMathCloud project trying to solve? What pain points does it address? Who are the competitors and what is the state of the technology right now?

What problems you’re trying to solve and why are these a problem?

Computational Education: How can I teach a course that involves mathematical computation and programming?
Computational Research: How can I carry out collaborative computational research projects?
Cloud computing: How can I get easy user-friendly collaborative access to a remote Linux server?

What are the pain points of the status quo and who feels the pain?

Student/Teacher pain:
- Getting students to install software needed for a course on their computers is a major pain; sometimes it is just impossible, due to no major math software (not even Sage) supporting all recent versions of Windows/Linux/OS X/iOS/Android.
- Getting university computer labs to install the software you need for a course is frustrating and expensive (time and money).
- Even if computer labs worked, they are often being used by another course, stuffy, and students can't possibly do all their homework there, so computation gets short shrift. Lab keyboards, hardware, etc., all hard to get used to. Crappy monitors.
- Painful confusing problems copying files around between teachers and students.
- Helping a student or collaborator with their specific problem is very hard without physical access to their computer.
Researcher pain:
- Making backups every few minutes of the complete state of everything when doing research often hard and distracting, but important for reproducibility.
- Copying around documents, emailing or pushing/pulling them to revision control is frustrating and confusing.
- Installing obscuring software is frustarting and distracting from the research they really want to do.
Everybody:
- It is frustrating not having LIVE working access to your files wherever you are. (Dropbox/Github doesn't solve this, since files are static.)
- It is difficult to leave computations running remotely.

Why is your technology poised to succeed?

When it works, SageMathCloud solves every pain point listed above.
The timing is right, due to massive improvements in web browsers during the last 3 years.
I am on full sabbatical this year, so at least success isn't totally impossible due to not working on the project.
I have been solving the above problems in less scalable ways for myself, colleagues and students since the 1990s.
SageMathCloud has many users that provide valuable feedback.
We have already solved difficult problems since I started this project in Summer 2012 (and launched first version in April 2013).

Who are your competitors?

There are no competitors with a similar range of functionality. However, there are many webapps that have some overlap in capabilities:

Mathematical overlap: Online Mathematica: "Bring Mathematica to life in the cloud"
Python overlap: Wakari: "Web-based Python Data Analysis"; also PythonAnywhere
Latex overlap: ShareLaTeX, WriteLaTeX
Web-based IDE's/terminals: target writing webapps (not research or math education): c9.io, nitrous.io, codio.com, terminal.com
Homework: WebAssign and WebWork

Right now, SageMathCloud gives away for free far more than any other similar site, due to very substantial temporary financial support from Google, the NSF and others.

What’s the total addressable market?

Though our primary focus is the college mathematics classroom, there is a larger market:

Students: all undergrad/high school students in the world taking a course involving programming or mathematics
Teachers: all teachers of such courses
Researchers: anybody working in areas that involve programming or data analysis

Moreover, the web-based platform for computing that we're building lends itself to many other collaborative applications.

What stage is your technology at?

The site is up and running and has 28,413 monthly active users
There are still many bugs
I have a precise todo list that would take me at least 2 months fulltime to finish.

Is your solution technically feasible at this point?

Yes. It is only a matter of time until the software is very polished.
Morever, we have compute resources to support significantly more users.
But without money (from paying customers or investment), if growth continues at the current rate then we will have to clamp down on free quotas for users.

What technical milestones remain?

Infrastructure for creating automatically-graded homework problems.
Fill in lots of details and polish.

Do you have external credibility with technical/business experts and customers?

Business experts: I don't even know any business experts.
Technical experts: I founded the Sage math software, which is 10 years old and relatively well known by mathematicians.
Customers: We have no customers; we haven't offered anything for sale.

Market research?

I know about math software and its users as a result of founding the Sage open source math software project, NSF-funded projects I've been involved in, etc.

Is the intellectual property around your technology protected?

The IP is software.
The website software is mostly new Javascript code we wrote that is copyright Univ. of Washington and mostly not open source; it depends on various open source libraries and components.
The Sage math software is entirely open source.

Who are the team members to move this technology forward?

I am the only person working on this project fulltime right now.

Everything: William Stein -- UW professor
Browser client code: Jon Lee, Andy Huchala, Nicholas Ruhland -- UW undergrads
Web design, analytics: Harald Schilly -- Austrian grad student
Hardware: Keith Clawson

Why are you the ideal team?

We are not the ideal team.
If I had money maybe I could build the ideal team, leveraging my experience and connections from the Sage project...

Thursday, October 16, 2014

Public Sharing in SageMathCloud, Finally

SageMathCloud (SMC) is a free (NSF, Google and UW supported) website that lets you collaboratively work with Sage worksheets, IPython notebooks, LaTeX documents and much, much more. All work is snapshotted every few minutes, and copied out to several data centers, so if something goes wrong with a project running on one machine (right before your lecture begins or homework assignment is due), it will pop up on another machine. We designed the backend architecture from the ground up to be very horizontally scalable and have no single points of failure.

This post is about an important new feature: You can now mark a folder or file so that all other users can view it, and very easily copy it to their own project.

This solves problems:

Problem: You create a "template" project, e.g., with pre-installed software, worksheets, IPython notebooks, etc., and want other users to easily be able to clone it as a new project. Solution: Mark the home directory of the project public, and share the link widely.

Problem: You create a syllabus for a course, an assignment, a worksheet full of 3d images, etc., that you want to share with a group of students. Solution: Make the syllabus or worksheet public, and share the link with your students via an email and on the course website. (Note: You can also use a course document to share files with all students privately.) For example...

Problem: You run into a problem using SMC and want help. Solution: Make the worksheet or code that isn't working public, and post a link in a forum asking for help.

Problem: You write a blog post explaining how to solve a problem and write related code in an SMC worksheet, which you want your readers to see. Solution: Make that code public and post a link in your blog post.

Here's a screencast.

Each SMC project has its own local "project server", which takes some time to start up, and serves files, coordinates Sage, terminal, and IPython sessions, etc. Public sharing completely avoids having anything to do with the project server -- it works fine even if the project server is not running -- it's always fast and there is no startup time if the project server isn't running. Moreover, public sharing reads the live files from your project, so you can update the files in a public shared directory, add new files, etc., and users will see these changes (when they refresh, since it's not automatic).
As an example, here is the cloud-examples github repo as a share. If you click on it (and have a SageMathCloud account), you'll see this:

What Next?

There is an enormous amount of natural additional functionality to build on top of public sharing.

For example, not all document types can be previewed in read-only mode right now; in particular, IPython notebooks, task lists, LaTeX documents, images, and PDF files must be copied from the public share to another project before people can view them. It is better to release a first usable version of public sharing before systematically going through and implementing the additional features needed to support all of the above. You can make complicated Sage worksheets with embedded images and 3d graphics, and those can be previewed before copying them to a project.
Right now, the only way to visit a public share is to paste the URL into a browser tab and load it. Soon the projects page will be re-organized so you can search for publicly shared paths, see all public shares that you have previously visited, who shared them, how many +1's they've received, comments, etc.

Also, I plan to eventually make it so public shares will be visible to people who have not logged in, and when viewing a publicly shared file or directory, there will be an option to start it running in a limited project, which will vanish from existence after a period of inactivity (say).

There are also dozens of details that are not yet implemented. For example, it would be nice to be able to directly download files (and directories!) to your computer from a public share. And it's also natural to share a folder or file with a specific list of people, rather than sharing it publicly. If somebody is viewing a public file and you change it, they should likely see the update automatically. Right now when viewing a share, you don't even know who shared it, and if you open a worksheet it can automatically execute Javascript, which is potentially unsafe. Once public content is easily found, if somebody posts "evil" content publicly, there needs to be an easy way for users to report it.
Sharing has been thought about a great deal during the last few years in the context of sites such as Github, Facebook, Google+ and Twitter. With SMC, we've developed a foundation for interactive collaborative computing in a browser, and will introduce sharing on top of that in a way that is motivated by your problems. For example, as with Github or Google+, when somebody makes a copy of your publicly shared folder, this copy should be listed (under "copies") and it could start out public by default. There is much to do.

One reason it took so long to release the first version of public sharing is that I kept imagining that sharing would happen at the level of complete projects, just like sharing in Github. However, when thinking through your problems, it makes way more sense in SMC to share individual directories and files. Technically, sharing at this level works works well for read-only access, not for read-write access, since projects are mapped to Linux accounts. Another reason I have been very hesitant to support sharing is that I've had enormous trouble over the years with spammers posting content that gets me in trouble (with my University -- it is illegal for UW to host advertisements). However, by not letting search engines index content, the motivation for spammers to post nasty content is greatly reduced.

Imagine publicly sharing recipes for automatically gradable homework problems, which use the full power of everything installed in SMC, get forked, improved, used, etc.

Wednesday, October 1, 2014

SageMathCloud Course Management

SageMathCloud now has some very rudimentary course management functionality. Though still very basic, it makes it much, much easier to make files available to students, collect homework, etc., entirely using SageMathCloud (without having to use email or any other submissions systems or github to share files). To get started, create a new course by clicking on +New, then typing the name of your course and click "Course":

Course documents allow you to manage a list of students, create projects for each of them, share homework and folders with them, collect homework, and grade and return it to students.

Add Students

To add a student to your course, click on the Students tab, then type a student's name or email address in the "Add student" box to the right and press enter or click the button. Searching for an email address is best, since you can be certain that the person you're adding is really a student in your course (instead of an unknown SageMathCloud user with the same name); moreover, if your student doesn't already have an account, they will receive an invitation via email. (NOTE: There is currently no way to add dozens of students at once.)

Once you add a student, click on Create Project next to your student's name to create their project. You own the project, and they will be added as a collaborator, and invited by email if they do not yet have an account.

Don't worry, student projects are hidden by default from your main project listing.

(To delete a student, click to the right of the student. You can toggle whether deleted students are shown in settings.)

Add Assignments

To create an assignment, first click New in the upper left of your project to create a new folder, and create or add files to it, as usual. Click on the Assignments tab of the course, then search for the folder by typing some part of its name in the box on the far right. Click to select the folder and it will be added to your list of assignments. To make copies of this folder available to all of your students whose projects you have created, click the Assign button. NOTE: You can share arbitrary folders with any contents with your students -- folders don't have to contain "assignments", and may contain anything, Sage worksheets, IPython notebooks, LaTeX documents, etc.

Collecting and Grading Assignments

To collect an assignment from your students, click Collect to the right of an assignment to collect it from all students. (NOTE: There is currently no way to schedule collection to happen at a specific time -- it happens when you click the button. Click it again to update the collected files.)
Once the assignments are collected, click and select a student to jump to the folder that contains the collected version of a student's assignment. Edit the files there, indicating grades on each problem, etc. NOTE: There is no special support yet for recording grades, knowing which homework you have graded already, etc.
When you are done grading, click Return Graded to return the graded homework to the students. If the homework folder is called homework1, then the graded version will appear in the student's project as homework1-graded.

Course Settings

Set the title and description of the course in the Settings tab. When you change these, the new title and description propagates automatically to all student projects for this course.

Other

A Video Tutorial

Technical Remarks

The underlying file format of a .course file is a plain text file with one line in JSON format for each student, shared assignment, and for settings.

Wednesday, August 27, 2014

What is SageMathCloud: let's clear some things up

[PDF version of this blog post]

"You will have to close source and commercialize Sage. It's inevitable." -- Michael Monagan, cofounder of Maple, told me this in 2006.

SageMathCloud (SMC) is a website that I first launched in April 2013, through which you can use Sage and all other open source math software online, edit Latex documents, IPython notebooks, Sage worksheets, track your todo items, and many other types of documents. You can write, compile, and run code in most programming languages, and use a color command line terminal. There is realtime collaboration on everything through shared projects, terminals, etc. Each project comes with a default quota of 5GB disk space and 8GB of RAM.

SMC is fun to use, pretty to look at, frequently backs up your work in many ways, is fault tolerant, encourages collaboration, and provides a web-based way to use standard command-line tools.

The Relationship with the SageMath Software

The goal of the SageMath software project, which I founded in 2005, is to create a viable free open source alternative to Magma, Mathematica, Maple, and Matlab. SMC is not mathematics software -- instead, SMC is best viewed by analogy as a browser-based version of a Linux desktop environment like KDE or Gnome. The vast majority of the code we write for SMC involves text editor issues (problems similar to those confronted by Emacs or Vim), personal information management, support for editing LaTeX documents, terminals, file management, etc. There is almost no mathematics involved at all.

That said, the main software I use is Sage, so of course support for Sage is a primary focus. SMC is a software environment that is being optimized for its users, who are mostly college students and teachers who use Sage (or Python) in conjunction with their courses. A big motivation for the existence of SMC is to make Sage much more accessible, since growth of Sage has stagnated since 2011, with the number one show-stopper obstruction being the difficulty of students installing Sage.

Sage is Failing

Measured by the mission statement, Sage has overall failed. The core goal is to provide similar functionality to Magma (and the other Ma's) across the board, and the Sage development model and community has failed to do this across the board, since after 9 years, based on our current progress, we will never get there. There are numerous core areas of research mathematics that I'm personally familiar with (in arithmetic geometry), where Sage has barely moved in years and Sage does only a few percent of what Magma does. Unless there is a viable plan for the areas to all be systematically addressed in a reasonable timeframe, not just with arithmetic geometry in Magma, but with everything in Mathematica, Maple., etc, we are definitely failing at the main goal I have for the Sage math software project.

I have absolutely no doubt that money combined with good planning and management would make it possible to achieve our mission statement. I've seen this hundreds of times over at a small scale at Sage Days workshops during the last decade. And let's not forget that with very substantial funding, Linux now provides a viable free open source alternative to Microsoft Windows. Just providing Sage developers with travel expenses (and 0 salary) is enough to get a huge amount done, when possible. But all my attempts with foundations and other clients to get any significant funding, at even the level of 1% of the funding that Mathematica gets each year, has failed. For the life of the Sage project, we've never got more than maybe 0.1% of what Mathematica gets in revenue. It's just a fact that the mathematics community provides Mathematica $50+ million a year, enough to fund over 600 fulltime positions, and they won't provide enough to fund one single Sage developer fulltime.

But the Sage mission statement remains, and even if everybody else in the world gives up on it, I HAVE NOT. SMC is my last ditch strategy to provide resources and visibility so we can succeed at this goal and give the world a viable free open source alternative to the Ma's. I wish I were writing interesting mathematical software, but I'm not, because I'm sucking it up and playing the long game.

The Users of SMC

During the last academic year (e.g., April 2014) there were about 20K "monthly active users" (as defined by Google Analytics), 6K weekly active users, and usually around 300 simultaneous connected users. The summer months have been slower, due to less teaching.

Numerically most users are undergraduate students in courses, who are asked to use SMC in conjunction with a course. There's also quite a bit of usage of SMC by people doing research in mathematics, statistics, economics, etc. -- pretty much all computational sciences. Very roughly, people create Sage worksheets, IPython notebooks, and Latex documents in somewhat equal proportions.

What SMC runs on

Technically, SMC is a multi-datacenter web application without specific dependencies on particular cloud provider functionality. In particular, we use the Cassandra database, and custom backend services written in Node.js (about 15,000 lines of backend code). We also use Amazon's Route 53 service for geographically aware DNS. There are two racks containing dedicated computers on opposites sides of campus at University of Washington with 19 total machines, each with about 1TB SSD, 4TB+ HDD, and 96GB RAM. We also have dozens of VM's running at 2 Google data centers to the east.

A substantial fraction of the work in implementing SMC has been in designing and implementing (and reimplementing many times, in response to real usage) a robust replicated backend infrastructure for projects, with regular snapshots and automatic failover across data centers. As I write this, users have created 66677 projects; each project is a self-contained Linux account whose files are replicated across several data centers.

The Source Code of SMC

The underlying source of SMC, both the backend server and frontend client, is mostly written in CoffeeScript. The frontend (which is nearly 20,000 lines of code) is implemented using the "progressive refinement" approach to HTML5/CSS/Javascript web development. We do not use any Javascript single page app frameworks, though we make heavy use of Bootstrap3 and jQuery. All of the library dependencies of SMC, e.g., CodeMirror, Bootstrap, jQuery, etc. for SMC are licensed under very permissive BSD/MIT, etc. libraries. In particular, absolutely nothing in the Javascript software stack is GPL or AGPL licensed. The plan is that any SMC source code that will be open sourced will be released under the BSD license. Some of the SMC source code is not publicly available, and is owned by University of Washington. But other code, e.g., the realtime sync code, is already available.
Some of the functionality of SMC, for example Sage worksheets, communicate with a separate process via a TCP connection. That separate process is in some cases a GPL'd program such as Sage, R, or Octave, so the viral nature of the GPL does not apply to SMC. Also, of course the virtual machines are running the Linux operating system, which is mostly GPL licensed. (There is absolutely no AGPL-licensed code anywhere in the picture.)

Note that since none of the SMC server and client code links (even at an interpreter level) with any GPL'd software, that code can be legally distributed under any license (e.g., from BSD to commercial).
Also we plan to create a fully open source version of the Sage worksheet server part of SMC for inclusion with Sage. This is not our top priority, since there are several absolutely critical tasks that still must be finished first on SMC, e.g., basic course management.

The SMC Business Model

The University of Washington Center for Commercialization (C4C) has been very involved and supportive since the start of the projects. There are no financial investors or separate company; instead, funding comes from UW, some unspent grant funds that were about to expire, and a substantial Google "Academic Education Grant" ($60K). Our first customer is the "US Army Engineer Research and Development Center", which just started a support/license agreement to run their own SMC internally. We don't currently offer a SaaS product for sale yet -- the options for what can be sold by UW are constrained, since UW is a not-for-profit state university. Currently users receive enhancements to their projects (e.g., increased RAM or disk space) in exchange for explaining to me the interesting research or teaching they are doing with SMC.

The longterm plan is to start a separate for-profit company if we build a sufficient customer base. If this company is successful, it would also support fulltime development of Sage (e.g., via teaching buyouts for faculty, support of students, etc.), similar to how Magma (and Mathematica, etc.) development is funded.

In conclusion, in response to Michael Monagan, you are wrong. And you are right.

Tuesday, August 26, 2014

You don't really think that Sage has failed, do you?

I just received an email from a postdoc in Europe, and very longtime contributor to the Sage project. He's asking for a letter of recommendation, since he has to leave the world of mathematical software development (after a decade of training and experience), so that he can take a job at hedge fund. He ends his request with the question:

> P.S. You don't _really_ think that Sage has failed, do you?

After almost exactly 10 years of working on the Sage project, I absolutely do think it has failed to accomplish the stated goal of the mission statement: "Create a viable free open source alternative to Magma, Maple, Mathematica and Matlab.". When it was only a few years into the project, it was really hard to evaluate progress against such a lofty mission statement. However, after 10 years, it's clear to me that not only have we not got there, we are not going to ever get there before I retire. And that's definitely a failure.

Here's a very rough quote I overheard at lunch today at Sage Days 61, from John Voight, who wrote much quaternion algebra code in Magma: "I'm making a list of what is missing from Sage that Magma has for working with quaternion algebras. However, it's so incredibly daunting, that I don't want to put in everything. I've been working on Magma's quaternion algebras for over 10 years, as have several other people. It's truly daunting how much functionality Magma has compared to Sage."

The only possible way Sage will not fail at the stated mission is if I can get several million dollars a year in money to support developers to work fulltime on implementing interesting core mathematical algorithms. This is something that Magma has had for over 20 years, and that Maple, Matlab, and Mathematica also have. That I don't have such funding is probably why you are about to take a job at a hedge fund. If I had the money, I would try to hire a few of the absolute best people (rather than a bunch of amateurs), people like you, Robert Bradshaw, etc. -- we know who is good. (And clearly I mean serious salaries, not grad student wages!)

So yes, I think the current approach to Sage has failed. I am going to try another approach, namely SageMathCloud. If it works, maybe the world will get a free open source alternative to Magma, Mathematica, etc. Otherwise, maybe the world never ever will. If you care like I do about having such a thing, and you're teaching course, or whatever, maybe try using SageMathCloud. If enough people use SageMathCloud for college teaching, then maybe a business model will emerge, and Sage will get proper funding.

Thursday, July 31, 2014

SageMathCloud -- history and status

2005: I made first release the SageMath software project, with the goal to create a viable open source free alternative to Mathematica, Magma, Maple, Matlab.

2006: First web-based notebook interface for using Sage, called "sagenb". It was a cutting edge "AJAX" application at the time, though aimed at a small number of users.

2007-2009: Much work on sagenb. But it's still not scalable. Doesn't matter since we don't have that many users.

2011-: Sage becomes "self sustaining" from my point of view -- I have more time to work on other things, since the community has really stepped up.

2012: I'm inspired by the Simons Foundation's (and especially Jim Simon's) "cluelessness" about open source software to create a new online scalable web application to (1) make it easier for people to get access to Sage, especially on Windows, and (2) generate a more longterm sustainable revenue stream to support Sage development. (I was invited to a day-long meeting in NYC at Simon's headquarters.)

2012-2013: Spent much of 2012 and early 2013 researching options, building prototypes, some time talking with Craig Citro and Robert Bradshaw (both at Google), and launched SageMathCloud in April 2013. SMC got some high-profile use, e.g., by UCLA's 400+ student calculus course.

2014: Much development over the last 1.5 years. Usage has also grown. There is some growth information here. I also have useful google analytics data from the whole time, which shows around 4000 unique users per week, with an average session duration of 97 minutes (see attached). Number of users has actually dropped off during the summer, since there is much less teaching going on.

SMC itself is written mostly in CoffeeScript using Node.js on the backend. There's a small amount of Python as well.

It's a highly distributed multi-data center application. The database is Cassandra. The backend server processes are mostly Node.js processes, and also nginx+haproxy+stunnel.

A copy of user data is stored in every data center, and is snapshotted every few minutes, both via :

ZFS -- for rolling snapshots that vanish after a month -- and via
bup -- for snapshots that remain forever, and are consistent across data centers.

These snapshots are critical for making it possible to trust collaborators on projects to not (accidentally) destroy your work. It is not possible for users to delete the bup snapshots, by design.
Here's what it does: realtime collaborative editing of Latex docs, IPython notebooks, Sage worksheets; use the command line terminal; have several people collaborate on a project (=a Linux account).
The main applications seem to be:

teaching courses with a programming or math software components -- where you want all your students to be able to use something, e.g., IPython, Julia, etc, and don't want to have to deal with trying to get them to install said software themselves. Also, you want to easily be able to share files with students, see their work in realtime, etc. It's a much, much easier for people to get going that with naked VM's they have to configure -- and also I provide cross-data center replication.
collaborative research mathematics -- all co-authors of a paper work together in an SMC project, both writing the paper there and doing computations.

Active development work right now:

course management for homework (etc)
administration functionality (mainly motivated by self-hosting and better moderation)
easy history slider to see all pasts states of a document
switching from bootstrap2 to bootstrap3.

Thursday, June 12, 2014

SageMathCloud Task Lists

I've added task list functionality to SageMathCloud (SMC), so you can keep track of a list of things to do related to a project, paper, etc. Task lists are saved as files on the filesystem, so they can be backed up in the usual way, automatically generated, etc. I doubt anybody is going to use SMC just for the tasks lists, but for people already using SMC in order to use Sage, write LaTeX documents, use IPython notebooks, etc., having a convenient integrated task list should come in handy.
To create a task list, in a project, click "+New", name the task list, then click the "Task List" button. Here's what a task list looks like:

The Design

I used the task list quite a bit when implementing the task list, and significantly modified the interface many, many times. I also tried out numerous other todo list programs for inspiration. I eventually settled on the following key design choices, some of which are different than anything I've seen. In particular, the design targets highly technical users, which is not something I saw with any other todo list programs I tried.

Markdown editor: The task description is formatted using client-side rendered Github flavored markdown (using marked), including [ ] for checkboxes. I also include full MathJax support, and spent quite a while working around various subtleties of mixing mathjax and markdown. I'll be rewriting Sage worksheets to use this code. The editor itself is Codemirror 4 in markdown mode, so it respects whatever theme you choose, has nice features like multiple cursors, paren matching, vim/emacs modes, etc. Multiple people can edit the same task at once and everybody will see the changes as they are made (note: I haven't implemented showing other cursors.)
Relative dates and times: All dates and times are shown relative to right now. If something is due in 20 hours, it says "due about 20 hours from now". I also included a sortable column with the last time when a task was edited, also shown relative to the current time. This display uses the timeago jquery plugin. You can of course click on the due date to see the actual date.
Hashtags: As I was implementing (and removing) features such as priorities, a way to indicate which task you are currently working on, folders, etc, I decided that hashtags can provide every feature. Moreover, they are very text editor/programmer friendly. When editing a task, if you put #foo, then it is interpreted as a hashtag, which you can click on to show only tasks containing that tag. To disambiguate with markdown headings, to make a heading you have to include a whitespace, so # foo. I haven't added autocomplete for hashtags yet, but will. You can easily click on hashtags anywhere to select them, or use the bar at the top.
User-specified task order: The default sort order for tasks is custom. There is a drag handle so you can explicitly move tasks up and down in order to indicate how important they are to you (or whatever else). You can also click an up hand or down hand to knock the currently selected task to the bottom of the list of displayed tasks.

Of course, I still have an enormous list of functionality I would like to add, but that will have to wait. For example, I need to enable a chat associated to each task list, like the chats associated to Sage worksheets and other files. I also want to make it so one can select a range of tasks and copy them, move them, paste them into another list, etc. It would also be nice to be able to export task lists to HTML, PDF, etc., which should be fairly easy using pandoc. I'm also considering making a note list, which is like a task list but without the due date or "done" box. Because of all the infrastructure already there, it would be easy to add code evaluation functionality, thus creating something like worksheets, but from a different point of view (with maybe hashtags determining the Python process).

Databases and Differential Synchronization

One interesting thing I noticed when implementing task lists is that there are many similarities with the original sagenb.org design (and also IPython notebook), in which a worksheet is a list of cells that get evaluated, can be manually reordered, etc. Similarly, a task list is a list of "cells" that you edit, manually reorder, etc. With sagenb we had longstanding issues involving the order of each cell and assigning an integer numerical id (0, 1, 2, ...) to the cells, which resulted in things like cells seemingly getting randomly reordered, etc. Also, having multiple simultaneous viewers with automatic synchronization is very difficult with that model. For task lists, I've introduced some new models to address these issues.

A natural way to store a task list is in a database, and I initially spent some time coming up with a good database schema and implementing basic lists using Cassandra for the backend. However, I couldn't pursue this approach further, since I was concerned about how to implement realtime synchronization, and also about the lack of easily backing up complete task lists via snapshots, in git, etc. So instead I created an "object database" API built on top of a file that is synchronized across clients (and the server) using differential synchronization. The underlying file format for the database is straightforward -- there is one line in JSON format for each record in the database. When objects are changed, the file is suitably modified, synchronized to other clients, and events are triggered.

Since differential synchronization has no trouble with files that have "a few thousand lines", this approach works well for our purposes (since personal or shared todo lists are typically fairly short). Also, having one line per record is at least more git friendly than something like a sqlite database. I'm considering rewriting my implementation of IPython notebook sync on top of this abstraction.
Since I view the task list as a database, each task has a globally unique uuid. Also, instead of viewing the task order as being defined by an integer 0,1,2,3, which leads to all manner of bugs and programming misery, race conditions, etc., instead we view the order as being determined by floating point positions. So to insert a task between tasks with positions 1 and 2, we just give the task position 1.5.

Thursday, June 5, 2014

The Official Go Tutorial as a (41-page) SageMathCloud worksheet

Do you like using interactive SageMathCloud worksheets and want to learn the basics of the Go Programming language? I've added a %go magic to SMC worksheets, and translated the official Go tutorial into a single long worksheet.

Open a SageMathCloud project and restart the project server if necessary (project settings --Restart).
Click +New and paste in this URL (then press enter): https://github.com/sagemath/cloud-examples.git
You'll get a large collection of example worksheets in a directory cloud-examples. Navigate to the "Go" subdirectory and open go.sagews.

You can also directly browse a PDF of the worksheet here: https://github.com/sagemath/cloud-examples/blob/master/go/go.pdf?raw=true

Monday, June 2, 2014

Update to SageMathCloud - Codemirror 4.2, etc.

I've made the following updates to https://cloud.sagemath.com:

User interface changes (refresh your browser to get)

Sage Worksheets

Don't show spinner unless computation is taking more than a second.
Streamlined evaluation a little bit (never get stuck with the grey spinner when you're online)
2d graphics now display using svg by default, since browser bugs have been fixed

Upgrade from Codemirror 3.20 to CodeMirror version 4.2, which is much better, faster, and has new features:

Multiple cursors (use control or command+click to create multiple cursors)
Sublime keyboard bindings
New editor features (you can turn these off in account settings):
- Auto brackets: automatically close brackets, quotes, etc.
- Show trailing whitespace: show spaces at ends of lines

Here's a 2-minute screencast that illustrates some of the above UI features: http://youtu.be/ykb12MGHOuk

Task Lists: There is now a preliminary task list implementation. To use it, make a file that ends in .tasks.

Task editing now uses full screen width
Fixed task deletion bug
Markdown list autocomplete

Backend changes (restart project server to get):

Automatically delete project log entries when the log grows beyond 7500 lines. In some cases of very active projects, the log would get enormous (say 30MB+) and just loading it would slow down everything for a while.
Clear a certain logfile that was getting huge whenever the project server is restarted.

Tuesday, May 6, 2014

Update to Differential Synchronization in SageMathCloud

I've just pushed out a major update to how synchronization works in https://cloud.sagemath.com.

This change is pretty significant behind the scenes, but the only difference you should notice is that everything should be better. In particular:

evaluation of code in Sage worksheet should feel a little snappier and more robust,
various random and hard to reproduce issues with synchronized editing should be fixed, e.g. chat messages out of order, etc.
everything should generally be a bit faster and more scalable overall.

Here's a short technical description of what changed. The basic architecture of SageMathCloud is that there are many web browsers connected to many hubs, which are in turn connected to your project (and to many other projects too):

  [web browser] <- websocket ----\/
  [web browser] <------------> [hub]<------ tcp -------\/
                                                     [project]
  [web browser] <------------> [hub]<------------------/\

Until today, the differential synchronization implementation involved having a copy of the document you're editing on:

each hub pictured above,
in each browser, and
in the project itself.

In particular, there were three slightly different implementations of differential synchronization running all over the place. The underlying core code is the same for all three, but the way it is used in each case is different, due to different constraints. The implementations:

browser: running in a web browser, which mainly has to worry about dealing with the CodeMirror editor and a flakie Internet connection.
hub: running in a node.js server that's also handling a lot of other stuff, including worrying about auth, permissions, proxying, logging, account creation, etc.
project: running in the project, which doesn't have to worry about auth or proxying or much else, but does have to worry about the filesystem.

Because we're using Node.js, all three implementations are written in the same language (CoffeeScript), and run the same underlying core code (which I BSD licensed at https://github.com/sagemath/cloud/blob/master/diffsync.coffee). The project implementation was easiest to write, since it's very simple and straightforward, and has minimal constraints. The browser implementation is mainly difficult, since the Internet comes and goes (as laptops suspend/resume), and it this involves patching and diff'ing a CodeMirror editor instance; CodeMirror is difficult, because it views the document as a line instead of a single long string, and we want things to work even for documents with hundreds of thousands of lines, so converting back and forth to a string is not an option! Implementing the hub part of synchronization is the hardest, for various reasons -- and debugging it is particularly hard. Moreover, computing diffs can be computationally expensive if the document is large, so doing anything involving differential sync on the hub can result in nontrivial locking cpu usage, hence slower handling of other user messages (node.js is single threaded). The hub part of the above was so hard to get right that it had some nasty locking code, which shouldn't be needed, and just looked like a mess.

A lot of issues people ran into with sync involved two browsers connected to different hubs, who then connected to the same document in a project. The two hubs' crappy synchronization would appear to work right in this part of the picture "[web browser] <------------> [hub]", but have problems with this part "[hub]<-------------->[project]", which would lead to pain later on. In many cases, the only fix was to restart the hub (to kill its sync state) or for the user to switch hubs (by clearing their browser cookies).

Change: I completely eliminated the hub from the synchronization picture. Now the only thing the hub does related to sync is forward messages back and forth between the web browser and local hub. Implementing this was harder than one might think, because the the project considered each client to be a single tcp connection, but now many clients can connect a project via the same tcp connection, etc.

With this fix, if there are any bugs left with synchronization, they should be much easier to debug. The backend scalability and robustness of sync have been my top priorities for quite a while now, so I'm happy to get this stuff cleaned up, and move onto the next part of the SMC project, which is better collaboration and course support.

Thursday, May 1, 2014

What can SageMathCloud (SMC) do?

The core functionality of SageMathCloud:

Color command-line Terminal with many color schemes, which several people can interact with at once, with state that survives browser refresh.

Editing of documents: with syntax highlighting, auto-indent, etc., for files with the following extensions:

c, c++, cql, cpp, cc, conf, csharp, c#, coffee, css, diff, dtd, e, ecl, f, f90, f95, h, hs, lhs, html, java, jl, js, lua, m, md, mysql, patch, gp, go, pari, php, py, pyx, pl, r, rst, rb, ru, sage, sagews, scala, sh, spyx, sql, txt, tex, toml, bib, bbl, xml, yaml.

(It's easy for me to add more, as CodeMirror supports them.) There are many color schemes and Emacs and Vim bindings.

Sage Worksheets: a single document interactive way to evaluate Sage code. This is highly extensible, in that you can define % modes by simply making a function that takes a string as input, and use %default_mode to make that mode the default. Also, graphics actually work in the %r automatically, exactly as in normal R (no mucking with devices or png's).
IPython notebooks: via an IPython session that is embedded in an iframe. This is synchronized, so that multiple users can interact with a notebook simultaneously, which was a nontrivial addition on top of IPython.
LaTeX documents: This fully supports sagetex, bibtex, etc. and the LaTeX compile command is customizable. This also has forward and inverse search, i.e., double click on preview to get to point in tex file and alt-enter in tex file to get to point in LaTeX document. In addition, this editor will work fine with 150+ page documents by design. (Editing multiple document files are not properly supported yet.)
Snapshots: the complete state of all files in your project are snapshotted (using bup, which is built on git) every 2 minutes, when you're actively editing a file. All of these snapshots are also regularly backed up to encrypted disks offsite, just in case. I plan for these highly efficient deduplicated compressed snapshots to be saved indefinitely. Browse the snapshots by clicking "Snapshots" to the right when viewing your files or type cd ~/.snapshots/master in the terminal.
Replication: every project is currently stored in three physically separate data centers; if a machine or data center goes down, your project pops up on another machine within about one minute. A computer at every data center would have to fail for your project to be inaccessible. I've had zero reports of projects being unavailable since I rolled out this new system 3 weeks ago (note: there was a project that didn't work, but that was because I had set the quota to 0% cpu by accident).

Sage

The Sage install contains the following extra packages (beyond what is standard in Sage itself). When you use Sage or IPython, this will all be available.

basemap, biopython, biopython, bitarray, brian, cbc, chomp, clawpack, cluster_seed, coxeter3, cryptominisat, cunningham_tables, database_cremona_ellcurve, database_gap, database_jones_numfield, database_kohel, database_odlyzko_zeta, database_pari, database_symbolic_data, dot2tex, fabric, gap_packages, gnuplotpy, greenlet, guppy, h5py, httplib2, kash3, lie, lrs, lxml, mahotas, mercurial, mpld3, munkres, mysql-python, nauty, netcdf4, neuron, normaliz, nose, nose, numexpr, nzmath, oct2py, p_group_cohomology, pandas, paramiko, patsy, patsy, phc, plotly, psutil, psycopg2, pybtex, pycryptoplus, pyface, pymongo, pyproj, pyx, pyzmq, qhull, quantlib, redis, requests, rpy2, scikit_learn, scikits-image, scimath, shapely, simpy, snappy, statsmodels, stein-watkins-ecdb, tables, theano, topcom, tornado, traits, xlrd, xlwt, zeromq

R

Also, I install the following extra packages into the R that is in Sage:

KernSmooth, Matrix, Rcpp, cairo, car, circular, cluster, codetools, e1071, fields, ggplot2, glmnet, lattice, mgcv, mvtnorm, plyr, reshape2, rpart, stringr, survival, zoo

It's Linux

SMC can do pretty much anything that doesn't require X11 that can be done with an Ubuntu-14.04 Linux can be done. I've pre-installed the following packages, and if people want others, just let me know (and they will be available to all projects henceforth):

vim git wget iperf dpkg-dev make m4 g++ gfortran liblzo2-dev libssl-dev libreadline-dev  libsqlite3-dev libncurses5-dev git zlib1g-dev openjdk-7-jdk libbz2-dev libfuse-dev pkg-config libattr1-dev libacl1-dev par2 ntp pandoc ssh python-lxml  calibre  ipython python-pyxattr python-pylibacl software-properties-common  libevent-dev xfsprogs lsof  tk-dev  dstat emacs vim texlive texlive-* gv imagemagick octave mercurial flex bison unzip libzmq-dev uuid-dev scilab axiom yacas octave-symbolic quota quotatool dot2tex python-numpy python-scipy python-pandas python-tables libglpk-dev python-h5py zsh python3 python3-zmq python3-setuptools cython htop ccache python-virtualenv clang libgeos-dev libgeos++-dev sloccount racket libxml2-dev libxslt-dev irssi libevent-dev tmux sysstat sbcl gawk noweb libgmp3-dev ghc  ghc-doc ghc-haddock ghc-mod ghc-prof haskell-mode haskell-doc subversion cvs bzr rcs subversion-tools git-svn markdown lua5.2 lua5.2-*  encfs auctex vim-latexsuite yatex spell cmake libpango1.0-dev xorg-dev gdb valgrind doxygen haskell-platform haskell-platform-doc haskell-platform-prof  mono-devel mono-tools-devel ocaml ocaml-doc tuareg-mode ocaml-mode libgdbm-dev mlton sshfs sparkleshare fig2ps epstool libav-tools python-software-properties software-properties-common h5utils libnetcdf-dev netcdf-doc netcdf-bin tig libtool iotop asciidoc autoconf bsdtar attr  libicu-dev iceweasel xvfb tree bindfs liblz4-tool tinc  python-scikits-learn python-scikits.statsmodels python-skimage python-skimage-doc  python-skimage-lib python-sklearn  python-sklearn-doc  python-sklearn-lib python-fuse cgroup-lite cgmanager-utils cgroup-bin libpam-cgroup cgmanager cgmanager-utils cgroup-lite  cgroup-bin r-recommended libquantlib0 libquantlib0-dev quantlib-examples quantlib-python quantlib-refman-html quantlib-ruby r-cran-rquantlib  libf2c2-dev libpng++-dev libcairomm-1.0-dev r-cran-cairodevice x11-apps mesa-utils libpangox-1.0-dev

I've also put extra effort (beyond just apt-get) to install the following:

polymake, dropbox, aldor/"AXIOM", Macaulay2, Julia, 4ti2

Functionality that is currently under development

We're working hard on improving SageMathCloud right now.

Streamlining document sync: will make code evaluation much faster, eliminate some serious bugs when the network is bad, etc.
Geographic load balancing and adding data centers, so that, e.g., if you're in Europe or Asia you can use SMC with everything happening there. This will involve DNS load balancing via Amazon Route 53, and additionally moving projects to run on the DC that is nearest you on startup, rather than random. Right now all computers are in North America.
Mounting a folder of one project in another project, in a way that automatically fixes itself in case a machine goes down, etc. Imagine mounting the projects of all 50 students in your class, so you can easily assign and collect homework, etc.
Homework assignment and grading functionality with crowdsourcing of problem creation, and support for peer and manual grading.
BSD-licensed open source single-project version of SMC.
Commercial software support and instructions for how to install your own into SMC (e.g., Mathematica, Matlab, Magma, etc.)
ssh access into projects easily

Friday, April 25, 2014

The SageMathCloud Roadmap

Everything below is subject to change.

Implementation Goals

(by April 27) Major upgrades -- update everything to Ubuntu 14.04 and Sage-6.2. Also upgrade all packages in SMC, including Haproxy, nginx, stunnel, etc.
(by May 4) Streamline doc sync: top priority right now is to clean up sync, and eliminate bugs that show up when network is bad, many users, etc.
(by May 30) Snapshots:
- more efficient way to browse snapshot history (timeline view)
- browse snapshots of a single file (or directory) only
(by May 30) User-owned backups
- way to download complete history of a project, i.e,. the underlying bup/git repository with snapshot history.
- way to update offline backup, getting only the changes since last download.
- easy way to download all current files as a zip or tarball (without the snapshots).
(by June 30) Public content
- Ability to make a read-only view of the project visible publicly on the internet. Only works after the account is at least n days old.
- By default, users will have a "report spammer" button on each page. Proven 'good users' will have button removed. Any valid reported users will be permanently banned.
- Only users with validated .edu accounts will be allowed to publish for starters. Maybe allow gmail.com at some point.
(by June 30) Fix all major issues (none major) that are listed on the github page: https://github.com/sagemath/cloud/issues
(by July 31) Group/social features:
- Support for mounting directories between projects
- Group management: combine a group of users and projects into a bigger entity:
  - a University course -- see feature list below
  - a research group: a collection of projects with a theme that connects them, where everybody has access to everybody else's projects
- A feed that shows activity on all projects that users care about, with some ranking. Better notifications about chat messages and activity

Commercial Products

We plan four distinct products of the SMC project: increased quotas, enhanced university course support, license and support to run a private SMC cloud, supported open source BSD-licensed single-user version of SMC (and offline mode).

PRODUCT: Increase the quota for a specific project (launch by Aug 2014)
- cpu cores
- RAM
- timeout
- disk space
- number of share mounts
Remarks:
- There will be an option in the UI to change each of the above parameters that some project collabs see (maybe only owners initially).
- Within moments of making a change it goes live and billing starts.
- When the change is turned off, billing stops. When a project is not running it is not billed. (Obviously, we need to add a stop button for projects.)
- There is a maximum amount that the user can pre-spend (e.g., $500 initially).
- At the end of the month, the user is given a link to a Univ of Washington website and asked to pay a certain amount, and register there under the same email as they use with SMC.
- When they pay, SMC receives an email and credits their account for the amount they pay.
- There will also be a limit soon on the number of projects that can be associated with an account (e.g., 10); pay a small monthly to raise this.
PRODUCT: University course support (launch by Aug 2014 in time for Fall 2014 semester)
- Free for the instructor and TA's
- Each student pays $20 in exchange for:
  - one standard project (they can upgrade quotas as above), which TA and instructor are automatically collabs on
  - student is added as collaborator to a big shared project
  - in student's private project they get homework assignments (assigned, collected)
- Instructor's project has all student projects as mounted shares
- Instructor has a student data spreadsheet with student grades, project ids (links), etc.
- Powerful modern tool for designing homework problems that can be automatically graded, with problems shared in a common pool, with ratings, and data about their usage.
- A peer grading system for more advanced courses.
- Tools to make manual grading more fun.
PRODUCT: License to run a private SMC cloud in a research lab, company, supercomputer, etc. (launch a BETA version by July 2014, with caveats about bugs).
- base fee (based on organization size)
- technical support fee
- site visit support: install, run workshop/class
PRODUCT: Free BSD-licensed single-user account version of SMC (launch by December 2014)
- a different way to do LaTeX editing, manage a group of IPython notebooks, use Sage worksheets, etc.
- be included with Sage.
- be included in many Linux distros
- the doc synchronization code, local_hub, CoffeeScript client, terminal.
- mostly Node.js application (with a little Python for Sage/IPython).
- ability to sync with a cloud-hosted SMC project.
- sell pre-configured (or just support) for a user to install standalone-SMC on some cloud host such as EC2 or Digital Ocean

Tuesday, April 15, 2014

SageMathCloud's new storage architecture

Keywords: ZFS, bup, rsync, Sage

SageMathCloud (SMC) is a browser-based hosted cloud computing environment for easily collaborating on Python programs, IPython notebooks, Sage worksheets and LaTeX documents. I spent the last four months wishing very much that less people would use SMC. Today that has changed, and this post explains some of the reasons why.

Consistency Versus Availability

Consistency and availability are competing requirements. It is trivial to keep the files in a SageMathCloud project consistent if we store it in exactly one place; however, when the machine that project is on goes down for any reason, the project stops working, and the users of the project are very unhappy. By making many copies of the files in a project, it's fairly easy to ensure that the project is always available, even if network switches in multiple data centers completely fail, etc. Unfortunately, if there are too many users and the synchronization itself puts too heavy of a load on the overall system, then machines will fail more frequently, and though projects are available, files do not stay consistent and data is lost to the user (though still "out there" somewhere for me to find).

Horizontal scalability of file storage and availability of files are also competing requirements. If there are a few compute machines in one place, then they can all mount user files from one central file server. Unfortunately, this approach leads to horrible performance if instead the network is slow and has high latency; it also doesn't scale up to potentially millions of users. A benchmark I care about is downloading a Sage binary (630MB) and extracting it (creating over 70,000 files); I want this to take at most 3 minutes total, which is hard using a networked filesystem served over the general Internet between data centers. Instead, in SMC, we store the files for user projects on the compute machines themselves, which provides optimal speed. Moreover, we use a compressed filesystem, so in many cases read and write speeds are nearly twice as fast as they might be otherwise.

New Architecture of SageMathCloud

An SMC project with id project_id consists of two directories of files, replicated across several machines using rsync:

The HOME directory: /projects/project_id
A bup repository: /bup/bups/project_id

Users can also create files they don't care too much about in /scratch, which is a compressed and deduplicated ZFS filesystem. It is not backed up in any way, and is local to that compute.

The /projects directory is one single big ZFS filesystem, which is both lz4 compressed and deduplicated. ZFS compression is just plain awesome. ZFS deduplication is much more subtle, as deduplication is tricky to do right. Since data can be deleted at any time, one can't just use a bloom filter to very efficiently tell whether data is already known to the filesystem, and instead ZFS uses a much less memory efficient data structure. Nonetheless, deduplication works well in our situation, since the compute machines all have sufficient RAM (around 30-60GB), and the total data stored in /projects is well under 1TB. In fact, right now most compute machines have about 100GB stored in /projects.
The /bup/bups directory is also one single big ZFS filesystem; however, it is neither compressed nor deduplicated. It contains bup repositories, where bup is an awesome git-based backup tool written in Python that is designed for storing snapshots of potentially large collections of arbitrary files in a compressed and highly deduplicated way. Since the git pack format is already compressed and deduplicated, and bup itself is highly efficient at deduplication, we would gain almost nothing by using compression or deduplication directly on this ZFS filesystem. When bup deduplicates data, it does so using a sliding window through the file, unlike ZFS which simply breaks the file up into blocks, so bup does a much better job at deduplication. Right now, most compute machines have about 50GB stored in /bup/bups.

When somebody actively uses a project, the "important" working files are snapshotted about once every two minutes. These snapshots are done using bup and stored in /bup/bups/project_id, as mentioned above. After a snapshot is successfully created, the files in the working directory and in the bup repository are copied via rsync to each replica node. The users of the project do not have direct access to /bup/bups/project_id, since it is of vital importance that these snapshots cannot be corrupted or deleted, e.g., if you are sharing a project with a fat fingered colleague, you want peace of mind that even if they mess up all your files, you can easily get them back. However, all snapshots are mounted at /projects/project_id/.snapshots and browseable by the user; this uses bup's FUSE filesystem support, enhanced with some patches I wrote to support file permissions, sizes, change times, etc. Incidentally, the bup snapshots have no impact on the user's disk quota.

We also backup all of the bup archives (and the database nodes) to a single large bup archive, which we regularly backup offsite on encrypted USB drives. Right now, with nearly 50,000 projects, the total size of this large bup archive is under 250GB (!), and we can use it efficiently recover any particular version of any file in any project. The size is relatively small due to the excellent deduplication and compression that bup provides.

In addition to the bup snapshots, we also create periodic snapshots of the two ZFS filesystems mentioned above... just in case. Old snapshots are regularly deleted. These are accessible to users if they search around enough with the command line, but are not consistent between different hosts of the project, hence using them is not encouraged. This ensures that even if the whole replication/bup system were to somehow mess up a project, I can still recover everything exactly as it was before the problem happened; so far there haven't been any reports of problems.

Capacity

Right now there are about 6000 unique weekly users of SageMathCloud and often about 300-400 simultaneous users, and there are nearly 50,000 distinct projects. Our machines are at about 20% disk space capacity, and most of them can easily be expanded by a factor of 10 (from 1TB to 12TB). Similarly, disk space for our Google compute engine nodes is $0.04 GB / month. So space-wise we could scale up by a factor of 100 without too much trouble. The CPU load is at about 10% as I write this, during a busy afternoon with 363 clients connected very actively modifying 89 projects. The architecture that we have built could scale up to a million users, if only they would come our way...

Wednesday, February 12, 2014

What is SageMathCloud?

The two main reasons for existence of SageMathCloud (SMC) are...

Goal 1. Increase resource for Sage: Generate a different longterm revenue stream to support development of Sage, i.e., open source mathematical software. By "different", I mean different than government and foundation grants and donations, which are relatively limited for primarily pure mathematics software development, which is what Sage specializes in. Even in my wildest dreams, it is very unlikely Sage will get more than a million dollars a year in funding (and in practice it gets a lot less); however, a successful commercial product with wide adoption has the potential to generate significantly more than a million dollars a year in revenue -- of course most would go back into the product... but when the product is partly Sage, that's fine. The National Science Foundation (and other donors) have played a major part during the last 8 years in funding Sage, but I think everybody would benefit from another funding source.

Goal 2. Increase the usage of Sage: The number of unique visitors per month to http://sagemath.org grew nicely from 2005 (when I started Sage) until Summer 2011, after which point it has remained fairly constant at 70,000 unique visitors. There is no growth at all: it was 70,332 in Jan 2011, and it was 70,449 last month (Jan 2014), both with a bounce rate of about 50%. A significant obstruction to growth is accessible, which SMC helps to address for certain users (last month the SMC website has 17,700 unique visitors with a bounce rate of about 30%).

Here's an actual email I received from somebody literally as I was writing this, which I think illustrates how SMC addresses the second goal:

    Hey William,

    Today I stopped by cloud.sagemath.com because 
    I wanted to do some computation with sage, and 
    cloud is announced in a big way on sagemath.org

    This is after a lengthy hiatus from computing
    with sage ( maybe a year ).

    Using cloud.sagemath.com completely blew my 
    mind.  At first I did not really understand 
    why sagenb was ditched after all the work that 
    went into it.  But man, cloud is really a 
    pleasure to use !

    I just wanted to share the joy :)

    Thanks for all that you do !

Licensing and Reuse of the SageMathCloud Codebase

The design and coding of SageMathCloud (SMC) has been mostly supported by University of Washington (UW). Due to goal 1 above, I have been working from the start (before a line of code was written) with the commercialization/tech transfer office of UW, who (because of 1) are not enthusiastic about simply open source the whole SMC codebase, as a condition for their help with commercialization. Some of SMC is open sourced, mainly the code that runs on the VM's and some of the HTML5 client that runs on the browser. We also plan to make the HTML5 client and a mini server BSD licensed, and include them with Sage (say) as a new local graphical interface. Of course SMC builds on top of many standard open source libraries and tools (e.g., CodeMirror, Cassandra, ZFS, Node.js, etc.).

There is, however, a large amount of interesting backend code, which is really the "cloud" part of SMC, and which we do not intend to release as open source. We do intend to sell licenses (with support) for the complete package, when it is sufficiently polished, since many organizations want to run their own private SMC servers, mainly for confidentiality reasons.

Goal 2 above mainly impacts how we market SMC. However, it's easy to completely ignore Sage and still get a lot of value out of SMC. I just glanced at what people are doing as I write this, and the result seems pretty typical: latex'ing documents, some Sage worksheets, some IPython notebooks, editing a perl script.

It's important to understand how SMC is different than other approaches to cloud computing. It's designed to make certain things very easy, but they are quite different things than what "traditional" cloud stacks like OpenStack are designed to make easy. SMC is supposed to make the following easy:

using Sage and IPython, both command line and notebook interfaces.
writing a paper using LaTeX (possibly with a specific private list of collaborators),
editing source code, e.g., developing Python/C/etc., libraries., again possibly with realtime collaboration.
creating collaborative "projects", which are really a Linux account on a machine, and provide isolation from other projects.
backups: all data is automatically snapshotted frequently
high availability: failure of a machine (or even whole data center) results in at most a few minutes of lost time/work.
speed: files are stored on a compressed local filesystem, which is snapshotted and replicated out regularly; thus the filesystem feels fast and is scalable, as compared to a networked filesystem.

The above design goals are useful for certain target audiences, e.g., people doing Sage/Python/etc. development, teachers and students in courses that make use of Sage/Python/etc., collaborative math research projects. SMC is designed so that a large number of people can make simultaneous small use of ever-expanding resources. SMC should also fully support the "social networks" that form in this context. At the same time, it's critical that SMC have excellent uptime and availability (and offsite backups, just in case), so that people can trust it. By trust, I don't mean so much in the sense of "trust it with proprietary info", but in the sense of "trust it to not just loose all my data and to be there when I'm giving a talk/teaching a class/need to do homework/etc.".

However, exactly the above design goals are at odds with some of goals of large-scale scientific/supercomputing. The following are not design goals of SMC:

supercomputing -- have large data that many distributed processes operate on: exactly what people often do on supercomputers (or with Hadoop, etc.)
traditional "cloud computing" -- dynamically spin up many VM's, run computations on them; then destroy them. With SMC, things tend to get created but not destroyed (e.g., projects and files in them), and a full VM is much too heavy given the number of users and type of usage that we have already (and plan to have).

What happens in practice with SMC is that people run smaller-scale computations on SMC (say things that just take a few cores), and when they want to run something bigger, they ssh from SMC to other resources they have (e.g., a supercomputer account) and launch computations there. All project collaborators can see what anybody types in a terminal, which can be helpful when working with remote compute clusters.

Anyway, I hope this helps to clarify what exactly SMC actually is.