Hacker News | thesz's comments

  > You recognize that you haven't really needed strong mathematical (or coding) skills to create models for some time.
And then something like this [1] comes along, where researchers failed to control for multiple comparisons: "In this particular setting, emergent abilities claims are possibly infected by a failure to control for multiple comparisons. In BIG-Bench alone, there are ≥220 tasks, ∼40 metrics per task, ∼10 model families, for a total of ∼10^6 task-metric-model family triplets, meaning probability that no task-metric-model family triplet exhibits an emergent ability by random chance might be small."

[1] https://arxiv.org/abs/2304.15004


  > No one has ever made a purchasing decision based on how good your code is.
People make purchasing decisions based on the availability of source code all the time, preferring products whose source code is available and usable. It is safe to assume that, all else being equal, they can also make purchasing decisions based on the quality of that source code.

  > How can the same model predict egg prices in Italy, and global inflation in a reliable way?
For one, there's Benford's law: https://en.wikipedia.org/wiki/Benford%27s_law

So, predict the sign (branch predictors in modern CPUs also use neural networks of sorts), then the exponent (most probably it changes slowly), and then predict the mantissa using Benford's law.
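To make the mantissa prior concrete, here is a small sketch (mine, not from any branch-predictor literature) of the leading-digit distribution Benford's law gives:

```python
import math

def benford_pmf():
    # Benford's law: P(leading digit = d) = log10(1 + 1/d)
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

pmf = benford_pmf()
# The digit 1 leads roughly 30.1% of the time, the digit 9 only about 4.6%,
# so a mantissa predictor starting from this prior is already far from uniform.
```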


  > The Delta Cycle logic is actually quite similar to functional reactive programming. It separates how a value changes from when a process responds to that change.
This is what I use when I play with hardware simulation in Haskell:

  type S a = [a]

  register :: a -> S a -> S a
  register a0 as = a0:as

  -- combinational logic can be represented as typical pure
  -- functions and then glued into "circuits" with register
  -- and map/zip/unzip functions.
This also separates externally visible events, recorded in the (infinite) list of values, from externally unobservable pure (combinational) logic. The combinational logic can then be tested separately, with property-based testing, etc.

Automated Mathematician was what led to Eurisko: https://en.wikipedia.org/wiki/Eurisko

Eurisko demonstrated superhuman ability at strategy games in the early 1980s, and even reused strategies from the VLSI place-and-route task when planning fleet placement in games. That is knowledge transfer between tasks.



The blind-spot-exploiting strategy you link to was found by an adversarial ML model...

There is the HIGGS dataset [1]. As the name suggests, it is designed for applying machine learning to recognize the Higgs boson.

[1] https://archive.ics.uci.edu/ml/datasets/HIGGS

In my experiments, linear regression with extended attributes (squared values added) is very much competitive, in accuracy terms, with the reported MLP accuracy.
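A minimal sketch of what I mean by attribute extension (synthetic data stands in for HIGGS here; the quadratic class boundary is my illustrative choice, plain least squares, no ML library):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic stand-in for HIGGS (28 features there, 4 here);
# the class boundary is deliberately quadratic in the raw attributes.
X = rng.normal(size=(4000, 4))
y = (np.sum(X**2, axis=1) > 4.0).astype(float)

def extend(X):
    # Attribute extension: append the squared value of every feature
    return np.hstack([X, X**2])

def fit_linear(X, y):
    Xb = np.hstack([X, np.ones((len(X), 1))])  # bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def accuracy(X, y, w):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return float(np.mean((Xb @ w > 0.5) == y))

acc_plain = accuracy(X, y, fit_linear(X, y))
acc_sq = accuracy(extend(X), y, fit_linear(extend(X), y))
```

On this toy task the plain linear fit is near chance while the squared-attribute fit classifies well, which is the effect I observed on HIGGS.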


The LHC has moved on a bit since then. Here's an open dataset that one collaboration used to train a transformer:

https://opendata-qa.cern.ch/record/93940

If you can beat it with linear regression, we'd be happy to know.


Thanks.

The paper [1] referenced in your link follows the legacy of the paper on the HIGGS dataset and does not report quantities like accuracy and/or perplexity. The HIGGS dataset paper provided area under the ROC curve, from which one had to approximate accuracy. I used the accuracy from the ADMM paper [2] to compare my results against. As I checked later, the area under the ROC curve in [1] mostly agrees with the SGD training results on HIGGS in [2].

  [1] https://arxiv.org/pdf/2505.19689
  [2] https://proceedings.mlr.press/v48/taylor16.pdf
I think the perplexity measure is appropriate there in [1] because we need to discern between three outcomes. This calls for softmax, and for perplexity as a standard measure.
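To be concrete about the measure (my sketch; the per-event probabilities are hypothetical):

```python
import math

def perplexity(probs_per_event, labels):
    # probs_per_event[i] is the softmax distribution over the 3 outcomes
    # for event i; labels[i] is the index of the true class.
    n = len(labels)
    nll = -sum(math.log(p[y]) for p, y in zip(probs_per_event, labels)) / n
    return math.exp(nll)

# A model that is always uniformly unsure over 3 classes has perplexity 3;
# a perfect, fully confident model has perplexity 1.
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 6
```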

So, my questions are: 1) what perplexity should I target when dealing with the "mc-flavtag-ttbar-small" dataset? And 2) what is the train/validate/test split ratio there?


For better or worse the people working on this don't really use perplexity or accuracy to evaluate models. The target is whatever you'd get for those metrics if you used the discriminants that were provided in the dataset (i.e. the GN2v01 values).

As for why accuracy and perplexity aren't reported: the experiments generally choose a threshold to consider something a "b-hadron" (basically picking a point along the ROC curve) and quantify the TPR and FPR at that point. There are reasons for this, mostly that picking a standard point lets them verify that the simulation actually reflects data. See, for example, the FPR [1] and TPR [2] "calibrations".
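A toy sketch of what evaluating at a fixed working point looks like (illustrative scores and labels, not GN2v01 outputs):

```python
def tpr_fpr(scores, labels, threshold):
    # Tag an event as a "b-hadron" when its discriminant exceeds the
    # threshold, then count true/false positive rates at that working point.
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
    pos = sum(labels)
    neg = len(labels) - pos
    return tp / pos, fp / neg

# Hypothetical discriminant values and truth labels
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
tpr, fpr = tpr_fpr(scores, labels, threshold=0.5)
```

Sweeping the threshold traces out the ROC curve; fixing it picks the single point that gets calibrated against data.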

It's a good point, though, the physicists should probably try harder to report standard metrics that the rest of the ML community uses.

[1]: https://arxiv.org/pdf/2301.06319

[2]: https://arxiv.org/abs/1907.05120


Perplexity, aka measuring how sure a network is about its answer. Which might be wrong. It would not pass peer review in any particle physics journal. (Real) science is about being right, not about being sure of itself.

And this problem is a joke compared to the real problem. We are talking about going from a 40 MHz incoming data stream down to 100 kHz, after which a second layer of real-time selection reduces the data to 1 kHz, which is then processed, cleaned, and elaborated into the high-level features you have in that dataset. But if you think you can do better, apply for a CERN job, come here, and enlighten us!

  > The model cares about what you're saying, not what language you're saying it in.
How many languages is the model trained on? And how many sentences are in the training set? I believe these numbers are vastly different, and the cosine similarity is overwhelmingly biased by the number of sentences.

What if we equalize the number of languages and the number of sentences in the training set? A galaxy-scale LLM, so to say.

Also, the model can't help but care about language, because your work shows divergence of cosine similarity at the decoding (output) stage(s).


And then we discover that DNA in cells (not only brain cells) is an ideal quantum computer, that DNA reactions generate coherent light (as in lasers) used to communicate between cells, and that a single dendrite of a cerebral cortex neuron can compute at the very least an XOR function, which requires at least 9 coefficients and one hidden layer. Neurons have from one or two up to dozens of thousands of dendrites.

Even skin cells exchange information in a neuron-like manner, including using light, albeit thousands of times slower.

This changes the complexity of the human brain to "86 billion quantum computers operating thousands of small neural networks, exchanging information over laser-based optical channels."


  > ...generate answers near the center of existing thought.
This is stated right in Wikipedia's article on the universal approximation theorem [1].

[1] https://en.wikipedia.org/wiki/Universal_approximation_theore...

"n the field of machine learning, the universal approximation theorems (UATs) state that neural networks with a certain structure can, in principle, approximate any continuous function to any desired degree of accuracy. These theorems provide a mathematical justification for using neural networks, assuring researchers that a sufficiently large or deep network can model the complex, non-linear relationships often found in real-world data."

And then: "Notice also that the neural network is only required to approximate within a compact set K {\displaystyle K}. The proof does not describe how the function would be extrapolated outside of the region."

NNs, LLMs included, are interpolators, not extrapolators.

And the region an NN approximates within can be quite complex and not easily described as "X in R^N drawn from N(c,s)^N", as SolidGoldMagikarp [2] clearly shows.

[2] https://github.com/NiluK/SolidGoldMagikarp
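The compact-set caveat is easy to demonstrate with any least-squares approximator; here a polynomial stands in for the network (my toy example, not from the article):

```python
import numpy as np

# Fit a degree-9 polynomial (a stand-in "universal approximator")
# to sin(x) on the compact set K = [-pi, pi].
x_in = np.linspace(-np.pi, np.pi, 200)
coef = np.polyfit(x_in, np.sin(x_in), deg=9)

# Inside K the approximation is excellent (interpolation)...
err_in = float(np.max(np.abs(np.polyval(coef, x_in) - np.sin(x_in))))

# ...but nothing constrains it outside K: at x = 3*pi the fit,
# evaluated as an extrapolation, diverges from sin by a huge margin.
err_out = float(abs(np.polyval(coef, 3 * np.pi) - np.sin(3 * np.pi)))
```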


It has been proven that recurrent neural networks are Turing complete [0]. So for every computable function, there is a neural network that computes it. That doesn't say anything about size or efficiency, but in principle this allows neural networks to simulate a wide range of intelligent and creative behavior, including the kind of extrapolation you're talking about.

[0] https://www.sciencedirect.com/science/article/pii/S002200008...


I think you cannot take the step from "any Turing machine is representable as a neural network" to saying anything about the prowess of learned neural networks, as opposed to specifically crafted ones.

I think a good example is calculation, or counting letters: it's trivial to write Turing machines that do these correctly, so you could craft neural networks that do just that. From LLMs we know that they are bad at those tasks.


  > So for every computable function, there is a neural network that computes it. That doesn't say anything about size or efficiency

It also doesn't say anything about finding the desired function, rather than a different function which approximates it closely on some compact set but diverges from it outside that set. That's the trouble with extrapolation: you don't know how to compute the function you're looking for because you don't know anything about its behaviour outside of your sample.


Turing completeness is not associated with creativity or intelligence in any straightforward manner. One cannot unconditionally imply the other.


No, but unless you find evidence to suggest we exceed the Turing computable, Turing completeness is sufficient to show that such systems are not precluded from creativity or intelligence.

I believe that quantum oracles are more powerful than Turing oracles, because quantum oracles can be constructed, from what I understand, while Turing oracles need an infinite tape.

Our brains use quantum computation within each neuron [1].

[1] https://www.nature.com/articles/s41598-024-62539-5


There's no evidence to suggest a quantum computer exceeds the Turing computable.

The difference is that quantum oracles can be constructed [1] while a Turing oracle can't be [2]: "An oracle machine or o-machine is a Turing a-machine that pauses its computation at state "o" while, to complete its calculation, it "awaits the decision" of "the oracle"—an entity unspecified by Turing "apart from saying that it cannot be a machine" (Turing (1939))."

  [1] https://arxiv.org/abs/2303.14959
  [2] https://en.wikipedia.org/wiki/Turing_machine

This is meaningless. A Turing machine is defined in terms of state transitions. Between those state transitions, there is a pause in computation at any point where the operation takes time. Those pauses are just not part of the definition because they are irrelevant to the computational outcome.

And given we have no evidence that quantum oracles exceed the Turing computable, all the evidence we have suggests that they are Turing machines.


  > This is meaningless.
Turing machines grew out of constructive mathematics [1], where proofs are constructions of the objects or, in other words, algorithms to compute them.

  [1] https://en.wikipedia.org/wiki/Constructivism_(philosophy_of_mathematics)#Constructive_mathematics
Saying that there is no difference between things that can be constructed (quantum oracles) and things that are merely given and cannot be constructed (Turing oracles, which are not even machines of any sort) is a direct refutation of the very foundation of Turing machine theory.

That's an irrelevant strawman. It tells us nothing about how to create such a system ... how to pluck it out of the infinity of TMs. It's like saying that bridges are necessarily built from atoms and adhere to the laws of physics; that's of no help to engineers trying to build a bridge.

And there's also the other side of the GP's point: Turing completeness is not necessary for creativity, not by a long shot. (In fact, humans are not Turing complete.)


No, twisting it to be about how to create such a system is the strawman.

> Turing completeness not necessary for creativity--not by a long shot.

This is by far a more extreme claim than the others in this thread. A system that is not even Turing complete is extremely limited. It's nearly impossible to construct a system with the ability to loop and branch that isn't Turing complete, for example.

>(In fact, humans are not Turing complete.)

Humans are at least trivially Turing complete: to be Turing complete, all we need to be able to do is read and write a tape (or a simulation of one) and use a lookup table with 6 entries (for the proven-minimal (2,3) Turing machine) to choose which steps to follow.

Maybe you mean to suggest we exceed it. There is no evidence we can.


  > A system that is not even Turing complete is extremely limited.
Agda is not Turing-complete, yet it is very useful.

P.S. Everything in the response is wrong ... this person has no idea what it means to be Turing complete.

> all we need to be able to do is to read and write a tape or simulation of one

An infinite tape. And to be Turing complete we must "simulate" that tape--the tape head is not Turing complete, the whole UTM is.

> A system that is not even Turing complete is extremely limited.

PDAs are not "extremely limited", and we are more limited than PDAs because of our very finite nature.


> P.S. everything in the response is wrong ... this person has no idea what it means to be Turing complete.

I know very well what it means to be Turing complete. All the evidence so far, on the other hand suggests you don't.

> An infinite tape. And to be Turing complete we must "simulate" that tape--the tape head is not Turing complete, the whole UTM is.

An IO port is logically equivalent to an infinite tape.

> PDAs are not "extremely limited", and we are more limited than PDAs because of our very finite nature.

You can trivially execute every step in a Turing machine, hence you are Turing equivalent. It is clear you do not understand the subject at even a basic level.


> You can trivially execute every step in a Turing machine, hence you are Turing equivalent. It is clear you do not understand the subject at even a basic level.

LOL. Such projection. Humans are provably not Turing Complete because they are guaranteed to halt.


Judging from what I read, their work is subject to regular hardware constraints, such as limited stack size, because the paper describes a mapping from regular hardware circuits to continuous circuits.

As an example, I would like to ask how to parse the balanced-brackets grammar (S ::= B <EOS>; B ::= | BB | (B) | [B] | {B};) with that Turing-complete recurrent network, and how it will deal with precision loss even for relatively short inputs.

The paper also does not address training (i.e., the automatic search for the processors' equations given inputs and outputs).
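For reference, the grammar above is easy to recognize with one explicit stack (it is a PDA language); the open question is how a fixed-precision RNN would emulate this stack. A classical recognizer sketch:

```python
def balanced(s):
    # Stack-based recognizer for B ::= | BB | (B) | [B] | {B}
    pairs = {')': '(', ']': '[', '}': '{'}
    stack = []
    for c in s:
        if c in '([{':
            stack.append(c)
        elif c in pairs:
            # A closer must match the most recent unmatched opener
            if not stack or stack.pop() != pairs[c]:
                return False
        else:
            return False  # reject characters outside the bracket alphabet
    return not stack  # accept only if every bracket was closed
```

The stack depth is unbounded, which is exactly where a finite-precision continuous state runs into trouble.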


No, the sizes of the networks that would be capable of that are infeasible. That's a common fallacy. You hint at this but then dismiss it.

Mathematically possible != actually possible.


> approximate any continuous function

It wouldn't surprise me if many interesting functions we'd like to approximate aren't continuous at all.


This is one of the reasons current AI tech is so poor at learning physical world dynamics.

Relationships in the physical world are sparse, metastable graphs with non-linear dynamics at every resolution. And then we measure these dynamics using sparse, irregular sampling with a high noise floor. It is just about the worst possible data model for conventional AI stacks at a theoretical level.


This is what softmax [1] is for.

[1] https://en.wikipedia.org/wiki/Softmax_function
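For reference, softmax just normalizes arbitrary scores into a probability distribution (a standard sketch):

```python
import math

def softmax(zs):
    # Subtract the max score for numerical stability before exponentiating
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([1.0, 2.0, 3.0])
# The outputs sum to 1 and preserve the ordering of the input scores
```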

