LSST Science Advisory Committee meeting, September 25, 2017

Attending: Michael Strauss, Mario Juric, Anze Slosar, Beth Willman, Gordon Richards (acting as proxy for Niel Brandt), Zeljko Ivezic, David Kirkby, Jason Kalirai, Josh Simon, Timo Anguita, Charles Liu, Rachel Bean

This one-hour phone conference focused on the LSST plans for the Science Platform, the set of connected interfaces by which LSST science users will interact with the data. Mario Juric gave a presentation, juric_science_platform.pdf, in which he described these plans in some detail. They are also described in a design document at http://ls.st/lse-319 .

The principal questions and recommendations that came out of this discussion are as follows:

- We are eager to see a plan describing how Level 3 resources will be allocated in practice. The model whereby users can draw on substantial LSST resources to carry out their scientific analyses is an attractive one, but it will lead to frustration if people often find themselves limited by the resources available.

- We would also like to see a description of how Level 3 codes and results can be incorporated into Level 2. Making a code robust enough to be run in Level 2 will take real resources from the Project; how will the decision to do so be made?

- There will be a real need to match LSST data with external datasets at every waveband. It would be good to have a coherent plan for which datasets will be directly accessible through the LSST Science Platform, and for the extent to which it will be possible, through the platform, to access external databases via VO protocols.

- The LSST limits access to those with data rights (scientists in the US and Chile, and named external contributors from other countries). How will this work, and how will it be enforced, if the LSST data are accessible from other databases via VO protocols?

- We are concerned about the long-term future of the JupyterLab platform, and would like reassurance that LSST would be able to adapt if JupyterLab were to disappear (or if the needs of the LSST science community were to change).

- We are also concerned about the long-term persistence of user results on the Science Platform, given that the current plan has data releases no longer being available after several years.

- These tools are designed for professional scientists, and are distinct from those meant for the general public in the EPO effort. We do want to make sure that they will be accessible to scientists who are not at R1 universities; we want to avoid barriers for those at smaller colleges, for example.

The interface described here is relevant for the Level 2 data releases. These will happen roughly once per year (twice in the first year of operations), and will include catalogs of measured properties of detected objects as well as calibrated images. One will also be able to access the data from the alerts (Level 1) after they are loaded into the database at the end of each night. Access to the alerts as they happen (they will be issued within 60 seconds of the end of each visit) will be through the web API, or the event brokers, designed to handle them in real time.

There are three approaches to accessing the data; think of them as three views of the same underlying dataset, designed to allow the user to move seamlessly between them:

(1) A data portal, similar to those familiar to users of IRSA and MAST, in which SQL queries can be entered (a sketch of such a query appears after this list), LSST images can be viewed, and simple plots can be made. This is designed to be an exploratory tool, for first looks at the data. A prototype, "Firefly", is already in operation, with SDSS Stripe 82 and NEOWISE data loaded (HSC data coming soon).

(2) An interactive JupyterLab environment (what Jupyter notebooks are in the process of evolving into), in which more sophisticated queries and analyses of the data can be run (in Python or other languages), using LSST computational resources. Most publishable scientific analyses will happen at this level. The user will have access to the software stack of the LSST image processing pipelines.

(3) A series of web APIs whereby larger-scale and more CPU-intensive analyses can be run on external supercomputer systems (including those at NCSA and other resources to which one can apply via the NSF). Note that there are no plans at the moment to have external supercomputer time explicitly allocated to LSST applications.
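To make the first two access modes concrete, the sketch below shows the kind of catalog query a user might issue, whether typed into the portal or sent from a notebook or script via the standard VO TAP protocol (the same protocol raised above in connection with external databases). This is only an illustration: the service URL, table name, and column names are placeholders, not the actual LSST schema.

    import pyvo

    # Placeholder TAP endpoint; the real LSST service URL is not yet defined
    service = pyvo.dal.TAPService("https://data.lsst.example/tap")

    # Hypothetical cone search on a Level 2 Object table:
    # bright objects within 3 arcmin of (ra, dec) = (150.1, 2.2)
    adql = """
        SELECT objectId, ra, decl, psfMag_r
        FROM dr1.Object
        WHERE CONTAINS(POINT('ICRS', ra, decl),
                       CIRCLE('ICRS', 150.1, 2.2, 0.05)) = 1
          AND psfMag_r < 22.0
    """

    results = service.run_sync(adql)
    table = results.to_table()   # an astropy Table, ready for plotting
    print(len(table), "objects returned")

The same pattern, pointed at an external VO-compliant archive, is one way the cross-matching with other-waveband datasets discussed above could be carried out.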
The system is designed to allow users to work in groups and to share code and results within them. It remains a policy question, yet to be decided, whether people will be able to make their code world-readable.

The LSST allots 10% of its compute resources during operations to the facilities running these tools, to allow users to carry out science and create auxiliary data products of their own. This is of order 18 teraflops, with 4 petabytes of storage. The Science Platform is scaled for ~7500 users, of whom roughly 100 would be active at any given time. The system has the ability to temporarily reassign resources from the data-processing cluster (which is ten times larger than the community resources) at times of peak community demand.

Much of our discussion centered on how this 10% would be allocated in practice. The details of this have not been decided. Each user will be given a disk and CPU quota by default, with a mechanism to request additional allocations when needed. There will also be tools allowing the user to estimate the resources (disk space and CPU) that a query or processing job will require before running it; we want to make sure it is difficult to make mistakes! A question: will (unpersisted) scratch space, separate from and presumably larger than the individual quotas, be available as well for people to experiment? The system will have the ability to link directly with cloud drives (e.g., Dropbox, Google Drive).

The "analysis cores" on which all this will run will be exposed as a batch system to which jobs can be submitted using Pegasus and HTCondor. There are efforts under way to understand whether jobs can be run within Docker containers, to ease deployment of user applications, but no promises can be made at this point. AWS/OpenStack-type functionality (the ability of users to request one or more VMs with a particular configuration) is not being planned for.
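As an illustration of what submission to that batch system might look like from a user's session, here is a minimal sketch using the HTCondor Python bindings; the script name, arguments, and resource requests are placeholders, and the actual submission interface the platform exposes may well differ.

    import htcondor

    # Describe the job; the executable and resource requests are hypothetical
    sub = htcondor.Submit({
        "executable": "run_forced_photometry.sh",   # a user-supplied analysis script
        "arguments": "--patch 42",
        "request_cpus": "4",
        "request_memory": "8GB",
        "output": "photometry.out",
        "error": "photometry.err",
        "log": "photometry.log",
    })

    # Hand the job to the platform's scheduler
    schedd = htcondor.Schedd()
    with schedd.transaction() as txn:
        cluster_id = sub.queue(txn)

    print("submitted as cluster", cluster_id)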
Note that the Firefly portal is separate from the tools that will be used in outreach to the general public, since the LSST Project does not have the resources to support the very large number of users that public access would entail. There will of course be sharing of software tools between those working on the portal and the EPO team, where appropriate. We are concerned that the boundary between academics and the public may not always be clearly defined, and we want to make sure that we do not put up barriers for scientists at smaller colleges and others not at R1 universities.

LSST users would have access to a quota of disk space in which their queries and results would persist for as long as they are needed. One question that comes to mind: the current plan has old data releases remaining available for only a few years, as new releases make them obsolete; does that mean that an individual user's results will disappear when the corresponding data release disappears?

The JupyterLab approach is similar to that developed by the database team at JHU (SciServer); similar approaches are being used by the DES and HSC teams, among others. JupyterLab is what Jupyter notebooks are evolving into, and this technology should be stable at least through the commissioning of LSST. We are somewhat concerned, however, about what will happen on longer timescales: what if the organization that supports JupyterLab folds, or goes in a quite different direction, five years from now?
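For concreteness, work in the notebook environment would look something like the minimal sketch below, which assumes the LSST stack's current Butler data-access interface; the repository path and data identifiers are placeholders, not actual LSST data products.

    from lsst.daf.persistence import Butler

    # Placeholder repository path; the Science Platform would supply the real one
    butler = Butler("/datasets/example_repo/rerun/coadd")

    # Hypothetical data identifier for a single visit/CCD
    data_id = {"visit": 1228, "ccd": 49}
    calexp = butler.get("calexp", dataId=data_id)    # calibrated exposure (image, mask, variance)
    sources = butler.get("src", dataId=data_id)      # source catalog measured on that exposure

    print(calexp.getDimensions())
    print(len(sources), "sources detected")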