Tuesday, February 7, 2012


Understanding the Book Genome Project

How the Book Genome Project Works:

"The Book Genome Project is an objective, computer-based analysis of the written word, applied evenly across tens of thousands of published books."

The Book Genome Project is so fundamentally different from what most readers are used to that it's easy to be confused about how BookLamp and the Book Genome Project work.  I'm hoping to clarify that a little bit here.

To start, BookLamp does not categorize or label books the way you would expect with genres or BISAC codes, nor does it rely on human or community tagging.  Instead, we do the exact opposite: we ignore genre and super-classifications and pay attention only to the page-by-page components that the author combined to make up the book.  We don't look at what category the book is in, but at the DNA elements that are in the book, and at how those elements make one book similar to another regardless of what shelf it sits on in the library or bookstore.

 

The Librarian With Perfect Recall:

"… even the best of us [humans] will have a hard time perfectly recalling what happened on page 37, paragraph 4 of the book we read 3,749 books ago."


The Book Genome Project uses computers that have been trained to guess what a human would say if faced with a similar problem.  Describing it as "computer modeling" and "machine learning" makes something fundamentally straightforward sound complicated and scary.  Simply put, we trained the computers to read and look for elements of writing style and theme – though differently than a person would – and to translate that into an opinion that is consistent across thousands and thousands of books.  In other words, each time the computer looks at a scene, it asks itself, "If I were human, how would I rate this particular scene for Density (among other measurements)?"
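
To make that a little more concrete, here's a toy sketch in Python of the general idea – train a model on passages that humans have already rated, then let it give a consistent opinion on any new scene.  The passages, ratings, features, and model choice are purely illustrative; this is not our actual system.

```python
# Toy illustration only – not the Book Genome Project's real model or features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# A couple of passages with hypothetical human-assigned Density ratings (0-100).
passages = [
    "The recombinant DNA sequencing protocol required careful calibration of enzymes.",
    "He ran. The raptor was faster. He ran anyway.",
]
human_density_ratings = [85.0, 15.0]

# Turn text into simple features and fit a model to the human opinions.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(passages)
model = Ridge().fit(X, human_density_ratings)

# The trained model now answers, "If I were human, how Dense would I rate this scene?"
new_scene = ["The park's control systems monitored every enclosure in real time."]
print(f"Predicted Density: {model.predict(vectorizer.transform(new_scene))[0]:.1f}")
```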

In a perfect world, a skilled and trained human would be able to do this.  That becomes an issue, though, because we're not interested in rating just a single book, but every word in every scene in every chapter of every book we can get our hands on.

This quickly becomes a problem for a human; even the best of us will have a hard time perfectly recalling what happened on page 37, paragraph 4 of the book we read 3,749 books ago.  And when humans try to do this collectively, the simple truth is that you need many, many people working together to get an accurate picture of all the books available for discovery.  And by many, many people, I'm not talking about 5 or 6 million people over a few years; I'm talking about hundreds of millions of people month in and month out – a number that simply isn't feasible.  Without that many people, what happens is that some books – like Harry Potter – get lots of ratings, while most books get virtually none and disappear into the Social Void, which is what we call that space in social networks where invisible books go to be lonely.


Consequently, when we say that a book has 65% Vampires, it's because the computer is telling us, with a great deal of Apples-to-Apples information, that the book has more Vampires in it than 65% of the other books in our corpus that also have Vampires in them.  And if you start layering that information – such as knowing that one book has 65% Vampires and 15% Forests, compared to another with 63% Vampires and 15% Urban Environments – you get a sense of why this information is valuable when comparing titles.
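
If you like to think in code, here's a simplified illustration of that percentile idea, and of layering theme dimensions to compare two titles.  The scores are made up for illustration; this is not our production scoring code.

```python
import numpy as np

def theme_percentile(book_score: float, corpus_scores: np.ndarray) -> float:
    """Percent of corpus books (sharing the theme) that this book out-scores."""
    return float(np.mean(corpus_scores < book_score)) * 100

# Hypothetical raw Vampire-theme scores for corpus books that contain Vampires.
corpus_vampire_scores = np.array([0.12, 0.30, 0.45, 0.51, 0.66, 0.72, 0.90])
print(theme_percentile(0.70, corpus_vampire_scores))  # ~71% Vampires

# Layering dimensions: percentile profiles over [Vampires, Forests, Urban Environments].
book_a = np.array([65.0, 15.0, 0.0])
book_b = np.array([63.0, 0.0, 15.0])
print(f"Profile distance: {np.linalg.norm(book_a - book_b):.1f}")
```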

 

Jurassic Park as an Example:

Even if you had hundreds of millions of trained human readers, there are some things that are simply impossible for humans to do accurately at scale.  As an example, let's look at a writing style graph of the book Jurassic Park, by Michael Crichton.

Both the movie and the book of Jurassic Park focus on technology, DNA, and the security systems of the island.  In other words, a good portion of the book is spent talking about the science behind cloning and the wonders of the park itself.  Then, about 43% of the way into the book, the power gets turned off to the fences, the dinosaurs get out, and people get eaten.  The book shifts into an action-adventure novel.

The graph above maps the Pacing and Density writing style variables from beginning to end of Jurassic Park.  It's sort of like a writing style time-line for the book.  What you see is that at the start of the novel, as Michael Crichton spends much of his time focusing on these technologies, the Density and Pacing scores stay near each other.  Then, about the point the dinosaurs escape, the Pacing goes up, the Density falls, and the book becomes easier to read.
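
For readers who prefer code to pictures, here's a rough sketch of how a writing-style time-line like that could be drawn.  The numbers are invented to mimic the shape just described; they are not the actual Jurassic Park scores.

```python
import matplotlib.pyplot as plt

# Invented scores that mimic the described shape: Density falls and Pacing
# rises once the power fails about 43% of the way into the book.
position = [i / 20 for i in range(21)]              # 0% .. 100% through the book
density = [70 - 35 * (p > 0.43) for p in position]
pacing = [40 + 35 * (p > 0.43) for p in position]

plt.plot(position, density, label="Density")
plt.plot(position, pacing, label="Pacing")
plt.axvline(0.43, linestyle="--", color="gray", label="Power fails (~43%)")
plt.xlabel("Position in book")
plt.ylabel("Style score")
plt.legend()
plt.show()
```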


This is an example of the author choosing to change their style of writing to match the contents of the story.  So, if you do start reading Jurassic Park, don't stop reading until at least about 45% of the way through the book, because the action really picks up.

It is possible for a human to create this level of detail for a single book, but doing so for 1,000,000 books – and doing so in a way that compares Apples to Apples in all of them – is literally an impossible task.  Considering that more than 300,000 books are published each year, this is a big problem for future book discovery.

 

Discovery in the Next Generation:

This level of granularity is important.  The Book Genome Project is not really concerned with whether a book is highly Dense in absolute terms, but with whether it is more or less Dense than the other books around it.  Knowing that allows you to say, "Book A is similar in Density to Book B."  You can't create an objective map of where one book fits compared to the others unless you have a perfect understanding of every page in every book.
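
As a simple illustration of what that relative comparison buys you, here's a toy nearest-neighbour lookup over hypothetical Density scores – not our actual similarity engine:

```python
def density_neighbours(books: dict[str, float], title: str, k: int = 2) -> list[str]:
    """Return the k books whose Density score sits closest to the given title's."""
    target = books[title]
    others = [(abs(score - target), name) for name, score in books.items() if name != title]
    return [name for _, name in sorted(others)[:k]]

# Hypothetical Density scores on a 0-100 scale.
corpus = {"Book A": 72.0, "Book B": 70.5, "Book C": 31.0, "Book D": 68.0}
print(density_neighbours(corpus, "Book A"))  # -> ['Book B', 'Book D']
```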

Because much of our work is in the publishing industry, where the discussion tends to revolve around metadata, we refer to what we do as Multidimensional Metadata.  Let me define that for you:

Multidimensional Metadata is:

"Any metadata that is a generational leap in DEPTH and SCOPE beyond the capabilities of a publisher to assign manually, or the crowd to describe effectively."


In practice, by "depth" we mean that you have to pay attention to information beyond the surface of the book – data found equally on pages 2, 3, 4, and 250.  We look DEEP into the book, from beginning to end.  By "scope" we mean that we have equal data across the entire corpus.  Social networks tend to have lots of data on the really popular books and insufficient data on the vast majority of books on the market.  We collect the same data on every book in our database, regardless of how popular or well known it is.  And because it's a computer-based analysis, the site is just as effective with a single user as it is with millions.  We like to say, "Our books introduce each other."  This is true: the content of the books themselves is the connecting thread between them.

In future articles on our blog, we'll talk more about what we call the Social Void and the Glass Castle, ways of describing the content hole most people are not even aware exists.  We'll talk about the reasons that the future of book discovery lies in the combination of both human-powered information AND the content-based approach used by the Book Genome Project.  But for now, if you have additional questions or comments, please check out our FAQ, or feel free to contact us.

In the meantime, best of luck,

Aaron Stanton – Founder and CEO


